Durable Functions in Production: Orchestrations, Fan-out/Fan-in, and Entity State

Durable Functions is the part of Azure Functions that lets you write stateful, long-running workflows as plain code instead of stitching together queues, tables, and state machines by hand. The catch is that the programming model is not what it looks like. An orchestrator function reads top to bottom like normal C# or TypeScript, but underneath it is a replay engine that re-executes your code from the start every time it makes progress. If you do not internalize that, you will ship orchestrations that work in the demo and corrupt their own state under load. This guide builds the core patterns the right way — chaining, fan-out/fan-in, human interaction, eternal orchestrations, and durable entities — and ends with how to debug them when they get stuck at 2 a.m.

The whole field reduces to one sentence: the orchestrator is the brain and must be pure; activities are the hands and may touch the outside world; the Durable Task backend (Azure Storage, Netherite, or MSSQL) is the memory that survives crashes. Everything that bites you in production — NonDeterministicOrchestrationException, a settlement run that wedges at 95,000 merchants, double-applied payments, a history table that grows to tens of GB — is a violation of one of those three roles. Because this is a reference you will keep open mid-incident, every pattern, setting, error and limit here is laid out as a scannable table alongside the prose and the code: read the prose once, then keep the tables open.

All examples use the .NET isolated worker model, which is the supported path going forward; the concepts map directly to the JavaScript, Python, and PowerShell SDKs. By the end you will stop guessing — when an orchestration hangs you will know within ninety seconds whether you face a non-deterministic body, an unbounded fan-out starving the control queue, a non-idempotent activity double-applying a side effect, a WaitForExternalEvent with no timeout, or simply history bloat from a missing ContinueAsNew.

What problem this solves

Long-running, stateful workflows are the swamp of cloud engineering. You need to call five services in order, fan out ten thousand parallel jobs and wait for all of them, pause for a human approval that might take three days, or run a per-device aggregator forever — and you need all of it to survive a worker crash, a deployment, a scale-in event, and a transient API failure halfway through. The naive answer is to hand-roll it: a queue per step, a table to hold state, a poller to advance the state machine, a dead-letter queue for failures, and a pile of correlation IDs to tie it together. That code is mostly plumbing, it is where the bugs live, and every team rewrites it.

Durable Functions collapses that plumbing into code you can read. The state is the event-sourced history; you do not manage it. But the abstraction has a sharp edge: because the orchestrator body replays, anything non-deterministic in it silently diverges history and corrupts the workflow — or, if the SDK catches it, throws NonDeterministicOrchestrationException and wedges the instance. What breaks without this knowledge is specific and expensive: a settlement job that scales fine to 40,000 items and falls over at 95,000; a reconcile activity that double-posts to a partner ledger when its retry fires; a “stuck Running” instance nobody can explain; a history table that grows until queries time out.

Who hits this: any team using Durable Functions for orchestration (order processing, ETL, batch media work, approvals, sagas), anyone who fanned out without bounding the width, anyone whose activities have side effects but aren’t idempotent, and anyone running an eternal orchestration without ContinueAsNew. To frame the whole field before the deep dive, here is every failure class this guide covers, what it looks like, and the one place to look first.

| Failure class | What you observe | First question | First place to look | Most common single cause |
|---|---|---|---|---|
| Non-determinism | NonDeterministicOrchestrationException on replay | Did the orchestrator schedule different work than history? | The exception + showHistory=true | DateTime.UtcNow/Guid.NewGuid/I/O in the orchestrator |
| Stuck “Running” forever | Instance never reaches a terminal state | Is it waiting on an event, or retrying a poison item? | Status API; KQL for non-terminal instances | WaitForExternalEvent with no timeout |
| Double-applied side effect | Duplicate charges/adjustments | Did an activity retry after the original succeeded? | dependencies failures + duplicate rows | Non-idempotent activity + retry policy |
| Slow / wedged fan-out | Used to finish in 40 min, now 6 h | Did fan-out width outgrow the backend? | Control-queue latency; instance duration | Unbounded Task.WhenAll over 10k+ activities |
| History bloat | Queries time out; storage in tens of GB | Large payloads or missing ContinueAsNew? | History table size; payload sizes | Returning big blobs by value; eternal loop without reset |
| Wrong app ran it | “My orchestration ran on the other app” | Do two apps share a storage account + hub name? | host.json hubName | Two apps sharing a task hub |

Learning objectives

By the end of this article you can:

Prerequisites & where this fits

You should already be comfortable with Azure Functions fundamentals — triggers and bindings, the consumption/premium/dedicated hosting model, app settings, and deploying with func/az. You should be able to run az in Cloud Shell, read JSON output, and write enough C# to follow async/await and Task.WhenAll/Task.WhenAny. Familiarity with event sourcing helps but isn’t required — this article teaches the model from first principles. If you’re new to plain (non-durable) serverless patterns, read Azure Functions: Serverless Patterns & Best Practices and Build a Simple Serverless API on Azure first.

This sits in the Serverless / application-architecture track, one layer above plain function triggers. It assumes the hosting and scaling mechanics covered in Azure Functions Flex Consumption: VNet, Scaling & Cold Start, and it pairs tightly with the messaging primitives — when you outgrow Durable’s built-in queues you reach for Azure Service Bus: Sessions, Dedup & Dead-Letter Patterns and Azure Event Grid: MQTT, Event-Driven Routing & Dead-Letter. The diagnostic half leans on Azure Monitor & Application Insights for Observability and KQL for Azure Monitor & Log Analytics, because Application Insights is the single most useful tool for triaging a stuck instance.

A quick map of who owns what during an incident, so you escalate to the right place fast:

| Layer | What lives here | Who usually owns it | Failure classes it can cause |
|---|---|---|---|
| Trigger / client | HTTP/event start, raiseEvent, status | App / dev team | Wrong instance ID; lost external event |
| Orchestrator body | Determinism, control flow, WhenAll | App / dev team | Non-determinism; unbounded fan-out; no timeout |
| Activities / entities | All I/O, side effects, shared state | App / dev team | Double-apply; poison item; large payloads |
| Durable backend | History, queues, partitions | App + platform | Throughput ceiling; control-queue latency |
| Storage account | Tables/blobs/queues, or Event Hubs/SQL | Platform team | Hub-name collisions; storage throttling (429) |
| Observability | Traces, status API, purge | App / SRE | “Stuck Running” invisible without queries |

Core concepts

Six mental models make every later diagnosis obvious.

The orchestrator replays; it does not run once. An orchestrator runs, awaits an activity, and unloads from memory. When that activity completes, the Durable Task Framework replays the orchestrator from line one, feeding already-completed results from a history table instead of calling the activities again. Replay stops at the first await whose result is not yet in history, and real execution resumes there. This is how an orchestration survives a worker crash, a deployment, or a scale-in: its state is the event-sourced history, not the process memory.

Determinism is non-negotiable. Because the body replays repeatedly, it must make the same decisions and schedule the same activities in the same order given the same history. That forbids ambient clocks, randomness, direct I/O, and non-deterministic collection ordering inside the orchestrator. The replacements live on the context (context.CurrentUtcDateTime, context.NewGuid()). The SDK detects divergence and throws NonDeterministicOrchestrationException rather than silently corrupting state — treat that as a code defect, never a transient error to retry.

Activities are the hands. All I/O — HTTP, database, blob, reading config — happens in activity functions, which run once per logical call (with retries) and whose inputs/outputs are serialized to JSON and recorded in history. Anything non-deterministic belongs here or comes from the context.

Entities hold state; orchestrations coordinate. A durable entity is an addressable, persistent object (a tiny actor) identified by entityName@key, with single-threaded access per entity so updates serialize without locks. Use an orchestration for a workflow with a start and end; use an entity for long-lived mutable state many callers update.

The task hub is the namespace. hubName in host.json namespaces all the queues and tables. Two function apps sharing one storage account must use different hub names or they fight over each other’s work items — the classic “my orchestration ran on the wrong app” incident.

The backend is finite and shared. Whatever provider you choose, its queues and partitions have throughput limits. On Azure Storage, the control queues (4 partitions by default, configurable up to 16 through partitionCount) and the single work-item queue can become the binding constraint under heavy fan-out; saturating them spikes latency and slows every replay.
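As a concrete reference for the task hub and partition settings discussed above, a minimal host.json fragment might look like the following. The hub name MyTaskHub is a placeholder, and the partitionCount value of 4 is shown on the assumption that it matches the Azure Storage provider's default; treat this as a sketch to check against your extension version, not a recommended configuration.

```json
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "hubName": "MyTaskHub",
      "storageProvider": {
        "partitionCount": 4
      }
    }
  }
}
```

Giving each function app its own hubName is what keeps two apps on one storage account from picking up each other's work items.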

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary repeats these for lookup; this table is the model side by side.

| Concept | One-line definition | Where it lives | Why it matters |
|---|---|---|---|
| Orchestrator function | The deterministic “brain” that schedules work | Your code ([OrchestrationTrigger]) | Replays; must be pure |
| Activity function | A unit of real work / I/O | Your code ([ActivityTrigger]) | Runs once per call; can do I/O |
| Durable entity | Addressable single-threaded state object | Your code ([EntityTrigger]) | Race-free shared state, no locks |
| Client (binding) | Starts/queries/signals orchestrations | [DurableClient] | The only way in from outside |
| History table | Event-sourced record of an instance | Backend (Table/SQL/Event Hubs) | Source of replay and of bloat |
| Task hub | Namespace for all queues/tables | host.json hubName | Collisions = cross-app interference |
| Instance ID | Unique key for one orchestration run | Generated or supplied | Address for status/event/terminate |
| Replay | Re-executing the body from the start | Framework behaviour | Why determinism is required |
| ContinueAsNew | Restart with fresh state + clean history | Orchestrator API | Bounds eternal-orchestration history |
| External event | A named signal delivered to an instance | raiseEvent API | Human/async-in pattern |
| Durable timer | A persisted, replay-safe deadline | context.CreateTimer | Survives host restart; never Task.Delay |
| Storage provider | Backend that persists all state | Azure Storage / Netherite / MSSQL | Throughput + cost + ops profile |

The five built-in application patterns, side by side — this is the map of the deep sections that follow.

| Pattern | Shape | Use it for | Key API | Main pitfall |
|---|---|---|---|---|
| Function chaining | A → B → C, output feeds next | Ordered pipelines (ingest → parse → store) | CallActivityAsync in sequence | Passing large payloads by value |
| Fan-out / fan-in | Parallel N, then aggregate | Batch jobs, per-item processing | Task.WhenAll over many activities | Unbounded width starves the queue |
| Async HTTP / human-in | Pause, wait for a signal/timeout | Approvals, callbacks, 2FA | WaitForExternalEvent + CreateTimer | No timeout → stuck forever |
| Eternal orchestration | Loop forever, bounded | Monitors, recurring cleanup, aggregators | ContinueAsNew | while(true) → history grows unbounded |
| Durable entities | Addressable stateful actor | Counters, carts, per-tenant budgets | SignalEntityAsync / CallEntityAsync | Treating an entity like an orchestration |

The replay execution model and why determinism is non-negotiable

An orchestration survives a worker crash, a deployment, or a scale-in because its state is the event-sourced history, not the process memory. That same mechanism is the source of every Durable Functions bug. Because the orchestrator body is replayed repeatedly, it must be deterministic — given the same history, it must make the same decisions and schedule the same activities in the same order.

The replacements for non-deterministic constructs live on the orchestration context:

[Function(nameof(ProcessOrder))]
public async Task<OrderResult> ProcessOrder(
    [OrchestrationTrigger] TaskOrchestrationContext context,
    OrderInput input)
{
    // Deterministic, replay-safe equivalents:
    DateTime now = context.CurrentUtcDateTime;        // NOT DateTime.UtcNow
    Guid id = context.NewGuid();                       // NOT Guid.NewGuid()
    ILogger logger = context.CreateReplaySafeLogger<OrderProcessor>();

    // The replay-safe logger already drops lines during replay; the explicit
    // IsReplaying guard is the pattern to use for other one-shot effects:
    if (!context.IsReplaying)
        logger.LogInformation("Starting order {OrderId}", input.OrderId);

    // All real work happens in activities, which CAN do I/O:
    var validated = await context.CallActivityAsync<bool>(nameof(ValidateOrder), input);
    return new OrderResult(input.OrderId, validated);
}

The mental model that sticks: the orchestrator is the brain and must be pure; activities are the hands and may touch the outside world. Anything non-deterministic belongs in an activity or comes from the context.

The Durable Task SDK detects non-deterministic orchestration when the replayed code schedules different work than the history records, and throws rather than silently corrupting state. Treat any NonDeterministicOrchestrationException as a code defect.
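To make the divergence check concrete, here is a toy model of replay. It is not the Durable Task SDK and does not use its types. The Execute helper, the three-step body, and the activity names are all hypothetical; the only point is that a body which derives scheduled work from Guid.NewGuid() cannot match its own recorded history on the second pass.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Toy replay engine (an illustration, not the SDK): "history" is the ordered list
// of activity names the body scheduled on its first execution.
static List<string> Execute(Func<int, string> body, List<string>? history = null)
{
    var scheduled = new List<string>();
    for (int step = 0; step < 3; step++)
    {
        string name = body(step);
        // On replay, the work scheduled at each step must match what history recorded.
        if (history is not null && history[step] != name)
            throw new InvalidOperationException(
                $"Non-deterministic: history has '{history[step]}', replay scheduled '{name}'");
        scheduled.Add(name);
    }
    return scheduled;
}

// Same input, same names: replay matches history.
Func<int, string> deterministic = step => $"Charge-{step}";
// A fresh GUID on every execution: replay can never match history.
Func<int, string> nonDeterministic = step => $"Charge-{Guid.NewGuid()}";

List<string> firstRun = Execute(deterministic);
Execute(deterministic, firstRun);
Console.WriteLine("deterministic body replayed cleanly");

List<string> badRun = Execute(nonDeterministic);
try
{
    Execute(nonDeterministic, badRun);
}
catch (InvalidOperationException ex)
{
    Console.WriteLine(ex.Message);
}
```

In real orchestrations the divergence is usually subtler, for example a branch taken on DateTime.UtcNow that schedules a different activity on replay, but the failure has the same shape: replayed work no longer lines up with recorded work.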

What is forbidden in an orchestrator — and the fix

Every forbidden construct, why it breaks replay, and the deterministic substitute. Memorize this table; it is the single highest-leverage thing in the article.

| Forbidden in orchestrator | Why it breaks replay | Replay-safe replacement | Where the real work goes |
|---|---|---|---|
| DateTime.UtcNow / DateTime.Now | Different value each replay → divergent decisions | context.CurrentUtcDateTime | - |
| Guid.NewGuid() | New ID each replay → divergent history | context.NewGuid() | - |
| Random / crypto RNG | Non-reproducible | Seed from context.NewGuid() or compute in an activity | Activity |
| HttpClient / DB / file I/O | Side effects re-fire on every replay | - | Activity |
| Reading env vars / config | Value may change between replays | Pass as input, or read in an activity | Activity |
| Task.Delay / Thread.Sleep | Wall-clock; lost on restart | context.CreateTimer(deadline, ct) | - |
| Task.Run / arbitrary threads | Non-deterministic scheduling | Schedule durable tasks only | Activity |
| lock / Monitor / mutex | Threading assumptions don’t hold | Use a durable entity for serialization | Entity |
| await on non-Durable tasks | Completes outside the replay model | Only await Durable APIs | - |
| Iterating an unordered Dictionary | Ordering differs per replay | Sort to a stable order first | - |
| Environment.MachineName, static mutable state | Host-specific / shared mutable | Pass via input or entity | Entity / input |
| static counters incremented in body | Replays increment repeatedly | Move to an entity | Entity |
| Console.WriteLine / unguarded logging | Logs duplicate on every replay | IsReplaying-guarded replay-safe logger | - |
| ConfigureAwait / custom SynchronizationContext | Breaks the framework’s scheduler | Just await durable tasks plainly | - |
| Throwing to “retry” the orchestrator | Faults the orchestration, not a retry | Put retry policy on the activity | Activity |

The context APIs you reach for instead, and exactly what each returns:

| Context member | Replaces | Returns / does | Note |
|---|---|---|---|
| context.CurrentUtcDateTime | DateTime.UtcNow | Deterministic “now” frozen per replay | Advances only as history advances |
| context.NewGuid() | Guid.NewGuid() | Deterministic GUID seeded from instance + counter | Use as idempotency-key seed |
| context.IsReplaying | - | true while re-executing history | Guard logging / one-shot effects |
| context.CreateReplaySafeLogger<T>() | ILogger | Logger that suppresses replayed lines | Avoids double logs |
| context.GetInput<T>() | constructor args | The serialized input payload | Must be a serializable POCO |
| context.InstanceId | - | This orchestration’s ID | For correlation / child IDs |
| context.CallActivityAsync<T>(...) | direct method call | Schedules an activity, awaits result | Recorded in history |
| context.CreateTimer(deadline, ct) | Task.Delay | A persisted durable timer | Survives restart |
| context.WaitForExternalEvent<T>(name) | a callback | Awaits a named external event | Pair with a timeout |
| context.ContinueAsNew(state) | a while(true) loop | Restarts with clean history | Last statement on the branch |
| context.CallSubOrchestratorAsync<T>(...) | a giant WhenAll | Schedules a child orchestration | Bounds fan-out width |
| context.Entities.CallEntityAsync<T>(...) | a lock / shared field | Read-modify-write an entity | Single-threaded per key |
| context.WaitForExternalEvent<T>(name, timeout) | a callback + manual timer | Awaits an event with a built-in timeout | Throws on expiry; the isolated SDK surfaces this as a cancellation (TaskCanceledException), where the in-process model threw TimeoutException |

Function chaining and passing state safely

The simplest pattern is a sequence: A then B then C, where each step’s output feeds the next. Because state flows through return values held in history, you do not need external storage to pass data between steps.

[Function(nameof(IngestPipeline))]
public async Task<string> IngestPipeline(
    [OrchestrationTrigger] TaskOrchestrationContext context)
{
    var input = context.GetInput<IngestRequest>()!;

    string downloaded = await context.CallActivityAsync<string>(nameof(Download), input.Url);
    string parsed     = await context.CallActivityAsync<string>(nameof(Parse), downloaded);
    string stored     = await context.CallActivityAsync<string>(nameof(Persist), parsed);
    return stored;
}

Two rules keep this safe. First, everything crossing an activity boundary is serialized to JSON — inputs and outputs must be serializable POCOs, not live handles, streams, or HttpClient instances. Keep payloads small: if a step produces a 200 MB blob, return the blob URI, not the bytes, because large payloads bloat the history table and slow every replay. Second, add retries where failure is expected, not a blanket retry on everything.

var retry = TaskOptions.FromRetryPolicy(new RetryPolicy(
    maxNumberOfAttempts: 5,
    firstRetryInterval: TimeSpan.FromSeconds(5),
    backoffCoefficient: 2.0,
    maxRetryInterval: TimeSpan.FromMinutes(2)));

string downloaded = await context.CallActivityAsync<string>(
    nameof(Download), input.Url, retry);

The retry timing is itself recorded as durable timers, so a 5-attempt exponential backoff survives a worker restart mid-backoff.
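To see what that policy means in wall-clock terms, the following sketch computes the wait before each retry from the same four parameters. The RetryDelays helper is hypothetical, and the formula (first interval times coefficient to the power of the retry index, capped at the maximum interval) is the conventional exponential-backoff arithmetic, assumed here rather than taken from the SDK source.

```csharp
using System;
using System.Collections.Generic;

// Wait before retry n (0-based): first * coefficient^n, capped at the maximum interval.
// Illustrative arithmetic only; the SDK's exact formula is an assumption.
static List<TimeSpan> RetryDelays(int maxNumberOfAttempts, TimeSpan first, double coefficient, TimeSpan cap)
{
    var result = new List<TimeSpan>();
    // N attempts means at most N-1 waits between them.
    for (int retry = 0; retry < maxNumberOfAttempts - 1; retry++)
    {
        double seconds = first.TotalSeconds * Math.Pow(coefficient, retry);
        result.Add(TimeSpan.FromSeconds(Math.Min(seconds, cap.TotalSeconds)));
    }
    return result;
}

// Same values as the policy above: 5 attempts, 5 s first interval, 2.0 coefficient, 2 min cap.
List<TimeSpan> schedule = RetryDelays(5, TimeSpan.FromSeconds(5), 2.0, TimeSpan.FromMinutes(2));
Console.WriteLine(string.Join(", ", schedule.ConvertAll(d => $"{d.TotalSeconds}s")));
// prints: 5s, 10s, 20s, 40s
```

Under these assumptions the five attempts spend 75 seconds in total backoff, and the two-minute cap only starts to bind from the sixth retry onward.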

Retry policy options, end to end

Every field of RetryPolicy, its default behaviour, and how to reason about it. Tuning these badly is a top cause of “stuck retrying forever.”

| Setting | Type / values | Typical value | When to change | Trade-off / gotcha |
|---|---|---|---|---|
| maxNumberOfAttempts | int ≥ 1 | 3–5 | Raise for flaky upstreams; keep low for fast-fail | Too high + non-idempotent activity = repeated side effects |
| firstRetryInterval | TimeSpan | 5 s | Lower for chatty internal calls | Too low hammers a struggling dependency |
| backoffCoefficient | double ≥ 1 | 2.0 | 1.0 for fixed delay; >1 for exponential | Exponential can stretch total time to hours |
| maxRetryInterval | TimeSpan | 1–5 min | Cap the exponential growth | Without a cap, each interval keeps multiplying by the coefficient |
| retryTimeout | TimeSpan | (unset) | Bound total retry wall-clock | Unset = retries until attempts exhausted |
| handle predicate | Func<exc,bool> | retry all | Retry only transient exceptions | Retrying a ValidationException is pointless |

Where to put a retry — not every failure deserves one:

| Failure kind | Retry? | Why |
|---|---|---|
| Transient network / 5xx / throttling (429) | Yes, with backoff | Likely to succeed on retry |
| Timeout to a healthy-but-busy dependency | Yes, bounded | Backoff lets it recover |
| ValidationException / 400 / bad input | No | Deterministic failure; retry wastes time |
| NonDeterministicOrchestrationException | No | Code defect: fix it, never retry |
| Poison message (always throws) | No (cap attempts) | Dead-letter / partition the result instead |
| Idempotent write that may have partially succeeded | Yes, if idempotent | Safe only when the activity is idempotent |

Fan-out/fan-in for parallel processing

Chaining is sequential. When steps are independent, fan them out, run them in parallel across the entire scaled-out function app, then fan in to aggregate. This is the pattern that makes Durable Functions worth using over a logic-light queue trigger.

[Function(nameof(BatchResize))]
public async Task<int> BatchResize(
    [OrchestrationTrigger] TaskOrchestrationContext context)
{
    var batch = context.GetInput<ImageBatch>()!;

    // List the work in an activity (I/O), not in the orchestrator:
    string[] files = await context.CallActivityAsync<string[]>(
        nameof(ListSourceFiles), batch.Prefix);

    // FAN OUT: schedule all activities without awaiting individually.
    var tasks = new List<Task<long>>(files.Length);
    foreach (string file in files)
        tasks.Add(context.CallActivityAsync<long>(nameof(ResizeImage), file));

    // FAN IN: await them all; this is replay-safe and durable.
    long[] sizes = await Task.WhenAll(tasks);

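    // Sum as long: casting each size to int and adding into an int can overflow
    // on large batches. (Manifest is assumed to accept a long total.)
    long totalBytes = sizes.Sum();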
    await context.CallActivityAsync(nameof(WriteManifest),
        new Manifest(batch.Prefix, files.Length, totalBytes));
    return files.Length;
}

Task.WhenAll over Durable tasks is the canonical fan-in. The orchestrator suspends until every activity reports back, and the framework records each completion in history independently, so a crash after 900 of 1,000 completions resumes with only the outstanding 100 left to run.

Two production guardrails matter. First, bound the fan-out width: fanning out 100,000 activities at once floods the work-item queue and can starve other orchestrations, so chunk the list and process N at a time, or use sub-orchestrations. Second, decide your failure policy explicitly: awaiting Task.WhenAll rethrows the first failure once any task faults after exhausting its retries (the remaining failures stay on the combined task’s AggregateException), so if you want “best effort, collect successes and failures,” await each task in a try/catch and partition the results yourself rather than letting one poison item fail the whole batch.
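The best-effort policy can be sketched as a small helper that awaits each task on its own. PartitionAsync is a hypothetical name, and the Task.FromResult / Task.FromException values stand in for activity tasks an orchestrator would already have scheduled; in a real orchestrator the caught exception would typically be the SDK's TaskFailedException.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Await each task individually so one fault cannot hide the other results.
// Pass an already-materialized list: the tasks must be scheduled before this runs.
static async Task<(List<T> Succeeded, List<Exception> Failed)> PartitionAsync<T>(IReadOnlyList<Task<T>> pending)
{
    var succeeded = new List<T>();
    var failed = new List<Exception>();
    foreach (Task<T> task in pending)
    {
        try { succeeded.Add(await task); }
        catch (Exception ex) { failed.Add(ex); }
    }
    return (succeeded, failed);
}

// Stand-ins for scheduled activity tasks (hypothetical values).
var scheduled = new List<Task<long>>
{
    Task.FromResult(100L),
    Task.FromException<long>(new InvalidOperationException("poison item")),
    Task.FromResult(250L),
};

var (ok, errors) = await PartitionAsync(scheduled);
Console.WriteLine($"{ok.Count} succeeded, {errors.Count} failed, {ok.Sum()} bytes");
// prints: 2 succeeded, 1 failed, 350 bytes
```

Because the loop awaits the tasks in list order, the sequence of awaits is the same on every replay, which keeps the helper safe to call from an orchestrator body.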

Bounding the fan-out with sub-orchestrations

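A sub-orchestration per chunk keeps each instance’s history and fan-in bounded and isolates failures to one chunk. It does not by itself cap total concurrency, because every chunk is still scheduled at once; await the chunks in batches if you also need to limit in-flight work. This is the single most important scaling fix in the article.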

[Function(nameof(BatchParent))]
public async Task<int[]> BatchParent(
    [OrchestrationTrigger] TaskOrchestrationContext context)
{
    string[] all = context.GetInput<string[]>()!;
    const int chunkSize = 500;

    var chunkTasks = new List<Task<int>>();
    for (int i = 0; i < all.Length; i += chunkSize)
    {
        string[] chunk = all.Skip(i).Take(chunkSize).ToArray();
        // Each sub-orchestration owns the history and fan-in for its own chunk,
        // so no single instance tracks more than chunkSize tasks:
        chunkTasks.Add(context.CallSubOrchestratorAsync<int>(nameof(ProcessChunk), chunk));
    }
    // All chunks still run in parallel; await them in batches to cap total in-flight work.
    return await Task.WhenAll(chunkTasks);
}
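For a sense of scale, the chunk arithmetic can be checked outside the orchestrator. This sketch uses Enumerable.Chunk (available from .NET 6) as an alternative to the Skip/Take loop; the 95,000 merchant IDs are made-up data sized to match the settlement example mentioned earlier.

```csharp
using System;
using System.Linq;

// Hypothetical workload: 95,000 merchant IDs split into chunks of 500.
string[] all = Enumerable.Range(0, 95_000).Select(i => $"merchant-{i}").ToArray();
const int chunkSize = 500;

// Enumerable.Chunk yields arrays of up to chunkSize items in source order,
// so the split is stable across replays as long as the input array is.
string[][] chunks = all.Chunk(chunkSize).ToArray();

Console.WriteLine($"{chunks.Length} chunks, last has {chunks[^1].Length} items");
// prints: 190 chunks, last has 500 items
```

So a 95,000-item run becomes a parent that fans in 190 sub-orchestrations, each of which fans in at most 500 activities, instead of one instance tracking 95,000 tasks in a single history.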

The fan-in failure policies, side by side — pick before you ship, not during the incident:

| Policy | How you write it | On a single failure | Use when |
|---|---|---|---|
| All-or-nothing | await Task.WhenAll(tasks) | Rethrows the first failure; orchestration faults if unhandled | Every item must succeed (financial postings) |
| Best-effort partition | try/await each, collect ok/err lists | One bad item doesn’t sink the batch | Independent items; you report failures |
| First-success | await Task.WhenAny(...) then cancel | Returns on first winner | Racing redundant sources |
| Bounded width | Sub-orchestration per N items | Failure isolated to a chunk | Very large batches (10k+) |
| Throttled | Sliding window over pending tasks (Task.WhenAny) | Caps concurrent in-flight work | Protecting a rate-limited downstream |

Fan-out sizing — what each width does to the Azure Storage backend:

| Fan-out width | Behaviour on Azure Storage backend | Recommendation |
|---|---|---|
| 1–100 | Comfortable; negligible queue pressure | Just Task.WhenAll |
| 100–1,000 | Fine; watch control-queue latency under bursts | Task.WhenAll; monitor |
| 1,000–10,000 | Work-item queue pressure begins | Chunk into sub-orchestrations |
| 10,000–100,000 | Control-queue latency spikes; replays slow | Mandatory chunking (~500/chunk) |
| > 100,000 | Starves other orchestrations; risk of wedge | Chunk and consider Netherite |

Human interaction with external events and durable timers

Some workflows must pause and wait for a human — an approval, a signature, a second factor — possibly for hours or days. You do this with an external event and a durable timer racing each other so you get a timeout instead of a workflow that hangs forever.

[Function(nameof(ApprovalWorkflow))]
public async Task<string> ApprovalWorkflow(
    [OrchestrationTrigger] TaskOrchestrationContext context)
{
    var request = context.GetInput<PurchaseRequest>()!;
    await context.CallActivityAsync(nameof(RequestApproval), request);

    // Durable timer: a replay-safe deadline. Always pair with a CTS so the
    // timer is cleaned up when the event arrives first.
    using var cts = new CancellationTokenSource();
    DateTime deadline = context.CurrentUtcDateTime.AddHours(72);
    Task timeout = context.CreateTimer(deadline, cts.Token);

    // External event: resumes when someone POSTs to the raise-event API.
    Task<bool> approved = context.WaitForExternalEvent<bool>("ApprovalResponse");

    Task winner = await Task.WhenAny(approved, timeout);
    if (winner == approved)
    {
        cts.Cancel();   // tear down the pending timer
        return approved.Result ? "Approved" : "Rejected";
    }
    return "TimedOut";   // escalate
}

Two things people get wrong. Use context.CreateTimer, never Task.Delay — a durable timer is persisted, so if the host restarts during the 72-hour wait the timer is restored and still fires, whereas Task.Delay is wall-clock and evaporates on restart. (Durable timers were historically capped at ~6 days on the Azure Storage backend; for longer waits, loop shorter timers.) And always cancel the loser — if you don’t cancel the timer when the event wins, the orchestration is held open until the timer fires, inflating instance counts and history.

The external event is delivered from outside by instance ID:

# Raise the "ApprovalResponse" event with payload `true` to a running instance
curl -X POST \
  "https://myapp.azurewebsites.net/runtime/webhooks/durabletask/instances/${INSTANCE_ID}/raiseEvent/ApprovalResponse?taskHub=MyTaskHub&code=${SYSTEM_KEY}" \
  -H "Content-Type: application/json" \
  -d 'true'

External-event vs durable-timer mechanics

The two primitives that make human-in-the-loop safe, contrasted:

| Aspect | External event (WaitForExternalEvent) | Durable timer (CreateTimer) |
|---|---|---|
| What it waits for | A named signal from outside | A wall-clock deadline |
| Delivered by | raiseEvent REST API / client | The framework |
| Survives host restart | Yes (buffered if it arrives early) | Yes (persisted) |
| If it never happens | Hangs forever; needs a timer | Always fires |
| Cancellation | n/a | Cancel via CancellationToken when event wins |
| Max duration | Unbounded | Historically ~6 days (Azure Storage); loop for longer |
| Common bug | No timeout → stuck “Running” | Not cancelling the loser |
| Replay-safety | Yes (recorded as an event) | Yes (recorded as a timer-fired event) |

Task.WhenAny race outcomes — read this to reason about the branches:

| Winner | What it means | What you must do |
|---|---|---|
| approved (event) | Human responded in time | Cancel the timer (cts.Cancel()), return result |
| timeout (timer) | Deadline passed, no response | Escalate / mark TimedOut (event may still arrive; handle or ignore) |
| Both effectively simultaneous | Rare boundary | First-completed wins deterministically on replay |
| Neither (still pending) | Orchestration suspends | Nothing; it resumes when one completes |

Eternal orchestrations and ContinueAsNew

Some processes never really end: a per-device aggregator, a recurring cleanup, a monitor that polls forever. You cannot just wrap the body in while (true) — the history table would grow without bound and eventually every replay would crawl. The answer is ContinueAsNew, which restarts the orchestration with fresh state and a clean history, carrying forward only the input you choose.

[Function(nameof(PeriodicMonitor))]
public async Task PeriodicMonitor(
    [OrchestrationTrigger] TaskOrchestrationContext context)
{
    var state = context.GetInput<MonitorState>()!;

    bool stillOpen = await context.CallActivityAsync<bool>(nameof(CheckHealth), state.Target);
    if (!stillOpen)
        return;   // condition met -> orchestration completes for good

    // Wait one polling interval with a durable timer:
    DateTime next = context.CurrentUtcDateTime.AddMinutes(5);
    await context.CreateTimer(next, CancellationToken.None);

    // Reset history and loop with updated state. Do NOT recurse or while(true).
    context.ContinueAsNew(state with { Iterations = state.Iterations + 1 });
}

Key constraints: drain pending work before ContinueAsNew. In the .NET isolated SDK, ContinueAsNew carries unprocessed external events across the boundary by default (preserveUnprocessedEvents: true), but passing false drops them, so await everything you care about first rather than relying on the default. ContinueAsNew does not “return”; it schedules a restart, and any code after it still executes, so structure the method so the call is the last statement on that branch. And remember this is what bounds history growth: an eternal orchestration without ContinueAsNew is a slow-motion outage.

Eternal-orchestration rules

The boundary semantics that trip people up:

| Rule | Why | What happens if you ignore it |
|---|---|---|
| Call ContinueAsNew as the last statement on the branch | It schedules a restart, doesn’t return | Code after it still executes before the restart takes effect |
| Drain (await) pending external events first | Unawaited events can be dropped at the boundary (always, if preserveUnprocessedEvents is false) | Lost signals; missed approvals |
| Never use while(true) to loop | History grows unbounded | Replays crawl; queries time out |
| Don’t recurse via CallSubOrchestrator to loop | Builds a deep instance chain | Resource and history sprawl |
| Carry forward only the state you need | Large carried state bloats the new instance | Slow restarts |
| Terminate the loop on a real exit condition | Otherwise it truly is eternal | Orphan instances accumulate |

Looping mechanism comparison:

| Mechanism | History growth | Correct for | Notes |
|---|---|---|---|
| ContinueAsNew | Reset each iteration (flat) | Monitors, recurring jobs, aggregators | The right tool |
| while(true) in body | Unbounded growth | Nothing | Slow-motion outage |
| Timer-triggered function restarting an orchestration | Flat (new instance each time) | Cron-like schedules | Singleton-ID to avoid overlap |
| Recursion via sub-orchestration | Grows a chain | Bounded depth only | Not for “forever” |

Durable entities for stateful, single-threaded actor logic

Orchestrations coordinate; entities hold state. A durable entity is an addressable, persistent object (think a tiny actor) identified by entityName@key. The framework guarantees single-threaded access per entity, so you get serialized, race-free updates without locks — ideal for counters, shopping carts, per-tenant aggregates, or rate-limit budgets.

public class Counter : TaskEntity<int>
{
    public void Add(int amount) => State += amount;
    public void Reset() => State = 0;
    public int Get() => State;

    [Function(nameof(Counter))]
    public static Task Run([EntityTrigger] TaskEntityDispatcher dispatcher)
        => dispatcher.DispatchAsync<Counter>();
}

Call entities two ways. From a client you fire signals (one-way, fire-and-forget):

[Function("AddToCounter")]
public async Task<HttpResponseData> AddToCounter(
    [HttpTrigger(AuthorizationLevel.Function, "post", Route = "counter/{key}/add")]
        HttpRequestData req,
    [DurableClient] DurableTaskClient client,
    string key)
{
    var entityId = new EntityInstanceId(nameof(Counter), key);
    await client.Entities.SignalEntityAsync(entityId, "Add", 1);
    return req.CreateResponse(HttpStatusCode.Accepted);
}

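From an orchestrator you can signal, or call and await a return value. Each entity operation runs atomically, but a Get followed by an Add is two separate operations, so another caller can slip in between them. To read-modify-write shared state safely, take the entity lock around the sequence:

var entityId = new EntityInstanceId(nameof(Counter), key);
// Hold the entity lock so no other caller can interleave between Get and Add.
await using (await context.Entities.LockEntitiesAsync(entityId))
{
    int current = await context.Entities.CallEntityAsync<int>(entityId, "Get");
    if (current < limit)
        await context.Entities.CallEntityAsync(entityId, "Add", 1);
}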

When to reach for entities over an orchestration: use an orchestration for a workflow with a defined start and end; use an entity for long-lived, mutable state that many callers update concurrently. They compose — an orchestration that needs a global counter or lock should delegate to an entity rather than trying to serialize access itself.
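
One way to avoid a multi-step read-modify-write altogether is to fold the check and the update into a single entity operation, since each operation runs to completion before the next one starts. The sketch below shows only the state logic as a plain C# class. BoundedCounter, TryAdd, and the AddWithinLimit input record are hypothetical names, and in a real app this method would presumably sit on an entity class such as Counter and mutate its State.

```csharp
using System;

// Hypothetical input for the single check-and-add operation.
public sealed record AddWithinLimit(int Amount, int Limit);

// Plain class standing in for the entity body (assumption: a real entity
// would derive from the SDK's entity base class and use its State property).
public sealed class BoundedCounter
{
    public int State { get; private set; }

    // Checks the limit and applies the update in one operation, so there is
    // no gap between the read and the write for another caller to use.
    public bool TryAdd(AddWithinLimit request)
    {
        if (request.Amount < 0)
            throw new ArgumentOutOfRangeException(nameof(request), "Amount must be non-negative.");
        if (State + request.Amount > request.Limit)
            return false;            // would exceed the limit: leave state untouched
        State += request.Amount;
        return true;
    }
}
```

A caller would then issue one call carrying the amount and the limit, rather than a Get followed by an Add.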

Signal vs call, and entity vs orchestration

The two ways to invoke an entity differ in a way that matters for correctness:

Aspect | SignalEntityAsync (signal) | CallEntityAsync (call)
Direction | One-way, fire-and-forget | Two-way, awaits a return
Return value | None | Typed result
Callable from client | Yes | No (orchestrator only; entities can signal but not call)
Callable from orchestrator | Yes | Yes
Ordering guarantee | Delivered, eventually | Completes before next line
Use for | Increment, append, notify | Read-modify-write, read state
Blocking the caller | No | Yes (until entity responds)

Choosing the right primitive for a job:

Need | Orchestration | Entity | Plain activity
Multi-step workflow with start/end | ✅ | |
Long-lived mutable state, many writers | | ✅ |
Race-free counter / budget / cart | | ✅ |
One-off I/O with no shared state | | | ✅
Distributed lock | | ✅ (LockAsync) |
Fan-out of independent work | ✅ (orchestrator) | | ✅ (the work)
Per-tenant aggregate updated by events | | ✅ |

Choosing a storage backend

Durable Functions persists all state through a storage provider. The default is fine until it isn’t, and the choice has real throughput and cost consequences.

Provider | Backing store | Best for | Watch out for
Azure Storage (default) | Blobs, queues, tables | Default; low ops; most apps | Throughput ceiling under heavy fan-out; per-transaction cost adds up; history in Table Storage
Netherite | Azure Event Hubs + Page Blobs | High-throughput, high fan-out workloads needing low latency | Operationally heavier; partitions fixed at provisioning; Event Hubs cost
MSSQL | Azure SQL / SQL Server | Portability, on-prem/hybrid, single store you already operate and back up | You own SQL throughput and DTU/vCore sizing

The provider is selected in host.json:

{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "hubName": "MyTaskHub",
      "storageProvider": {
        "type": "Netherite",
        "partitionCount": 12
      }
    }
  }
}

Practical guidance: stay on Azure Storage until you have measured a throughput problem — most orchestrations never hit its limits, and it is the cheapest to operate. Move to Netherite when you are processing tens of thousands of work items per second and feeling queue latency. Choose MSSQL when portability, a single backed-up store, or running outside Azure dominates the decision. Switching providers is a state migration, so decide before you have millions of live instances, not after.

A note on task hubs: the hubName namespaces all the queues and tables. Two function apps sharing a storage account must use different hub names, or they will fight over each other’s work items — a classic “my orchestration ran on the wrong app” incident.

Backend comparison in depth

The three providers across the dimensions that actually drive the decision:

Dimension | Azure Storage | Netherite | MSSQL
Throughput ceiling | Moderate (queue/table bound) | Very high (Event Hubs partitions) | Bound by SQL tier (DTU/vCore)
Latency under fan-out | Rises with width | Low and stable | Depends on SQL sizing
Operational effort | Lowest | Higher (Event Hubs, partitions) | Medium (you run SQL)
Partition model | ~Auto, control-queue partitions | Fixed at provisioning (e.g. 12) | SQL-managed
Cost model | Per-transaction (cheap at low scale) | Event Hubs TU + Page Blobs | SQL compute + storage
Portability / hybrid | Azure-only | Azure-only | On-prem/hybrid friendly
Backup / single store | 3 stores (blob/queue/table) | Event Hubs + blobs | One database to back up
Best fit | Most apps; default | 10k+ work-items/sec, low latency | Portability, existing SQL estate
Migration cost from here | | State migration required | State migration required

Task-hub configuration rules — collisions here cause “wrong app ran my orchestration”:

Rule | Value / setting | Why
Unique hubName per app on shared storage | host.json: durableTask.hubName | Apps share queues/tables otherwise
Default hub name | derived from app name | Fine if each app has its own storage
Allowed characters | alphanumeric, start with a letter | Invalid names fail silently/confusingly
Change hub name = new task hub | new queues/tables created | In-flight instances on the old hub are orphaned
Don’t share a hub across environments | dev/test/prod separate hubs | Cross-environment interference
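
Because an invalid hub name tends to surface as confusing runtime behavior rather than a clear error, a pre-deploy check can be worth having. A minimal sketch, assuming only the rule stated in the table (letters and digits, starting with a letter). TaskHubNames and IsValid are hypothetical helpers, and the storage provider may impose further limits, such as length, that are not checked here.

```csharp
using System;
using System.Linq;

public static class TaskHubNames
{
    private static bool IsAsciiLetter(char c) =>
        (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');

    private static bool IsAsciiDigit(char c) => c >= '0' && c <= '9';

    // True when the name is non-empty, starts with a letter, and contains
    // only ASCII letters and digits.
    public static bool IsValid(string hubName) =>
        !string.IsNullOrEmpty(hubName)
        && IsAsciiLetter(hubName[0])
        && hubName.All(c => IsAsciiLetter(c) || IsAsciiDigit(c));
}
```

Running a check like this in CI against each app's host.json catches a bad name before it creates a task hub nobody expected.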

Approximate Azure Storage backend limits worth knowing (use as mechanism, validate exact numbers against current docs):

Resource | Approximate limit | Effect when hit
Control-queue partitions | 4 by default, 16 at most (partitionCount) | Caps orchestration parallelism per hub
Durable timer max duration | ~6 days | Longer waits must loop shorter timers
Activity payload (input/output) | Large payloads spill to blob | Bloats history; slows replay
External-event buffering | Held until awaited | Early events are not lost
Storage throttling | HTTP 429 from the account | Backend latency spikes; retries
Instance ID length / characters | Reasonable string; avoid /, \, #, ? | Bad IDs break status/raiseEvent URLs
Concurrent activities per instance (host) | Tunable via host.json concurrency | Caps per-instance parallelism
Status webhook lifetime | Bounded; expires/purged | 410 Gone when querying old URLs

Architecture at a glance

The diagram below is the request-and-state path of a fan-out/fan-in orchestration, left to right. A client/trigger (an HTTP call or an event) starts an orchestration through the [DurableClient] binding and can later raiseEvent to it. The orchestrator, the deterministic brain, schedules activities with Task.WhenAll and uses ContinueAsNew to keep eternal loops’ history flat. The work lands in the activities/entities zone: an activity fanned out (chunked to ~500 per sub-orchestration), a single-threaded entity@key holding shared state, and a partner API that must be hit with an idempotency key. All of that state (history, control queues, work-item queue) lives in the Durable backend: Azure Storage by default, with a small fixed set of control-queue partitions (4 by default, 16 at most), or Netherite for high throughput. Finally the observe/groom zone is where you live during an incident: App Insights for KQL traces, and the status/purge APIs to inspect history and reclaim space.

Follow the numbered badges to read the failure map onto the path. The brain is where non-determinism (1) bites; the activity zone is where unbounded fan-out (2) saturates the queue and a non-idempotent side effect (3) double-applies; the backend is where history bloat or a poison item (4) stalls a partition; and the whole instance can sit “Running” forever (5) when a WaitForExternalEvent has no timeout. The legend narrates each as symptom → confirm → fix.

[Figure: fan-out/fan-in Durable Functions architecture, from the DurableClient trigger through the orchestrator, the activities/entities zone, and the task-hub backend to the observe/groom zone, with five numbered failure badges: (1) non-determinism, (2) unbounded fan-out, (3) non-idempotent side effects, (4) history bloat/poison items, (5) stuck-Running instances.]

Real-world scenario

A payments platform team at a fictional fintech, LedgerLink, ran a nightly settlement orchestration that fanned out one activity per merchant — roughly 40,000 of them — to reconcile transactions against a partner ledger. It worked for months. Then onboarding pushed merchant count past ~95,000 and settlement, which used to finish in 40 minutes, started running for six-plus hours and occasionally wedged in “Running” until someone terminated it manually. Worse, a few runs produced double-applied adjustments, and the partner started raising disputes.

Two root causes surfaced under investigation. First, the fan-out was unbounded: scheduling 95,000 activities in one Task.WhenAll saturated the Azure Storage work-item queue, and control-queue latency spiked so badly that replays slowed to a crawl. Second, the reconcile activity called the partner’s ledger API non-idempotently — when an activity timed out and the retry policy fired, the original call had sometimes already posted, so the adjustment landed twice. The history table had also grown to tens of GB because each activity returned the full reconciliation record instead of a reference, so every replay dragged that payload through Table Storage.

The fix had three parts. They chunked the fan-out into sub-batches of 500 with a durable sub-orchestration per chunk, capping concurrent work items. They made the activity idempotent by generating one idempotency key per merchant in the orchestrator with context.NewGuid() (deterministic on replay), passing it to the activity as input so it is persisted in history before the call and reused on every retry, and having the partner API treat a repeated key as a no-op. And because throughput was now the binding constraint, they migrated the task hub to the Netherite backend.

// Sub-orchestration per chunk bounds the fan-out width and isolates failures.
[Function(nameof(SettleChunk))]
public async Task<ChunkResult> SettleChunk(
    [OrchestrationTrigger] TaskOrchestrationContext context)
{
    var merchants = context.GetInput<string[]>()!;   // <= 500 per chunk
    var retry = TaskOptions.FromRetryPolicy(new RetryPolicy(
        maxNumberOfAttempts: 4,
        firstRetryInterval: TimeSpan.FromSeconds(10),
        backoffCoefficient: 2.0));

    var tasks = merchants
        .Select(m => context.CallActivityAsync<bool>(nameof(Reconcile), m, retry))
        .ToList();

    bool[] results = await Task.WhenAll(tasks);
    return new ChunkResult(merchants.Length, results.Count(ok => ok));
}
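
SettleChunk assumes its input is already at most 500 merchants, so something upstream has to do the slicing. A minimal sketch of that step, assuming the parent orchestrator holds the full merchant list in memory. FanOutPlanner and PlanChunks are hypothetical names, and Enumerable.Chunk requires .NET 6 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FanOutPlanner
{
    // Splits the work list into chunks of at most chunkSize items, preserving
    // input order so the same input always yields the same chunks on replay.
    public static IReadOnlyList<string[]> PlanChunks(
        IEnumerable<string> merchants, int chunkSize = 500)
    {
        if (merchants is null) throw new ArgumentNullException(nameof(merchants));
        if (chunkSize <= 0) throw new ArgumentOutOfRangeException(nameof(chunkSize));
        return merchants.Chunk(chunkSize).ToList();
    }
}
```

The parent orchestrator would then schedule one context.CallSubOrchestratorAsync<ChunkResult>(nameof(SettleChunk), chunk) per chunk. Because chunking preserves input order, the same merchant list always produces the same chunks, which keeps the parent replay-safe.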

Settlement dropped back to ~35 minutes and stopped wedging; duplicate adjustments went to zero. The incident timeline and what each step actually changed:

Time | Status | Action | Result | Verdict
Month 0 | Healthy | 40k merchants, single WhenAll | ~40 min nightly | Fine at this scale
Month 6, T+0 | Degraded | Merchants hit 95k; same code | 6 h+, occasional wedge | Unbounded fan-out
T+1 h | Investigating | Checked control-queue latency | Latency spiked, replays crawling | Queue saturation confirmed
T+2 h | Investigating | KQL for non-terminal instances | Found stuck “Running” runs | Wedge confirmed
T+1 day | Mitigated | Chunk to 500 via sub-orchestrations | Duration ~70 min | Width fixed; dupes remain
T+3 days | Mitigated | Idempotency key persisted pre-call | Dupes → 0 | Side effect fixed
T+1 week | Fixed | Migrate task hub to Netherite | ~35 min, stable | Throughput headroom

The lesson the team wrote into their runbook: fan-out width and activity idempotency are not optional at scale. Durable Functions will happily let you schedule a hundred thousand activities and retry a non-idempotent side effect — and both will bite you in production, not in the demo.

Advantages and disadvantages

The event-sourced, replay-based model both enables code-as-workflow and imposes the determinism constraint. Weigh it honestly:

Advantages (why this model helps you) | Disadvantages (why it bites)
Workflows are plain code: no hand-rolled queues, tables, or state machines | The orchestrator body replays, so non-deterministic code silently corrupts or throws
State is durable for free: survives crashes, deploys, scale-in via history | History is a real store you must groom (bloat, purge) and size
Fan-out/fan-in across the whole scaled-out app with one Task.WhenAll | Unbounded fan-out starves the control queue, so you must chunk
Built-in durable timers + external events make human-in-the-loop trivial | No timeout on WaitForExternalEvent → stuck “Running” forever
Retries, backoff, and sub-orchestration isolation are first-class | Retries re-fire non-idempotent side effects → double-apply
Entities give race-free shared state without locks | Misusing an entity like an orchestration (or vice versa) hurts
Pluggable backends (Storage / Netherite / MSSQL) for different scale points | Switching backends is a state migration, not a config flip
Strong observability via App Insights traces + status/purge APIs | “Stuck” instances are invisible unless you actively query for them

The model is right when you have genuine multi-step or long-running workflows that must survive failure and you want to ship code, not operate infrastructure. It bites hardest on very wide fan-outs (unbounded width), side-effecting activities that aren’t idempotent, eternal loops without ContinueAsNew, and teams that don’t internalize replay. Every disadvantage is manageable — but only if you know it exists, which is the point of this article.

Hands-on lab

Deploy a tiny fan-out/fan-in orchestration, watch it run, exercise the status API, then groom it with purge — free-tier-friendly on the Consumption plan; delete at the end. Run in Cloud Shell (Bash). (This lab uses the .NET isolated worker; substitute the JS/Python templates if you prefer.)

Step 1 — Variables and resource group.

RG=rg-durable-lab
LOC=centralindia
STG=stdurable$RANDOM        # 3–24 lowercase alphanumerics, globally unique
APP=func-durable-$RANDOM    # globally-unique function app name
az group create -n $RG -l $LOC -o table

Step 2 — Storage account (the default Durable backend) and the function app.

az storage account create -n $STG -g $RG -l $LOC --sku Standard_LRS -o table
az functionapp create -n $APP -g $RG --storage-account $STG \
  --consumption-plan-location $LOC --runtime dotnet-isolated \
  --functions-version 4 -o table

Expected: a function app on the Consumption plan, runtime dotnet-isolated.

Step 3 — Scaffold a Durable project locally and add a fan-out orchestration.

func init DurableLab --worker-runtime dotnet-isolated
cd DurableLab
func new --name FanOut --template "Durable Functions Orchestrator"
# Edit FanOut.cs to fan out a CallActivityAsync over a small array and Task.WhenAll the results.

Step 4 — Publish and capture the system key for the Durable HTTP APIs.

func azure functionapp publish $APP
SYS_KEY=$(az functionapp keys list -n $APP -g $RG \
  --query "systemKeys.durabletask_extension" -o tsv)

Step 5 — Start an orchestration and capture the instance ID. The HTTP-start trigger returns a status-query payload:

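BASE="https://$APP.azurewebsites.net"
# Assumes the generated HTTP starter is anonymous; if it uses AuthorizationLevel.Function,
# pass a function or host key here instead of the Durable system key.
RESP=$(curl -s -X POST "$BASE/api/FanOut_HttpStart?code=$SYS_KEY")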
echo "$RESP"
INSTANCE_ID=$(echo "$RESP" | python3 -c "import sys,json;print(json.load(sys.stdin)['id'])")

Step 6 — Query status and history.

curl -s "$BASE/runtime/webhooks/durabletask/instances/${INSTANCE_ID}?showHistory=true&code=$SYS_KEY" | head -40
# Expected: runtimeStatus transitions Pending → Running → Completed, with activity events in history.

Step 7 — Groom: purge the completed instance.

curl -s -X DELETE \
  "$BASE/runtime/webhooks/durabletask/instances/${INSTANCE_ID}?code=$SYS_KEY"
# Expected: an instancesDeleted count of 1; the history for that instance is gone.

Validation checklist. You created the Storage-backed task hub, ran a fan-out/fan-in orchestration, watched it reach Completed, inspected its event-sourced history, and purged it. The lab steps mapped to what each proves:

Step | What you did | What it proves | Real-world analogue
2 | Storage + function app | The default backend is just a storage account | Every first Durable deploy
3 | Fan-out orchestrator | Task.WhenAll is the canonical fan-in | Batch/parallel processing
5 | HTTP-start → instance ID | The instance ID is the address for everything | Starting work from an API
6 | showHistory=true | History is real, inspectable, event-sourced | 02:14 triage of a stuck run
7 | Purge API | History must be groomed or it bloats | Scheduled cleanup

Cleanup (avoid lingering storage charges).

az group delete -n $RG --yes --no-wait

Cost note. Consumption plan + a small LRS storage account for an hour of this lab is well under ₹20; deleting the resource group stops everything. Durable’s cost on Consumption is dominated by storage transactions (every history write is a transaction), which is why grooming and small payloads matter.

Common mistakes & troubleshooting

This is the playbook, the part you bookmark. It comes first as a scannable table you can read mid-incident, followed by expanded entries for the failures that bite hardest, with the exact confirm commands.

# | Symptom | Root cause | Confirm (exact cmd / path) | Fix
1 | NonDeterministicOrchestrationException on replay | DateTime.UtcNow/Guid.NewGuid/I/O in orchestrator | Exception message; diff ?showHistory=true across replays | Use context.CurrentUtcDateTime/NewGuid; move I/O to an activity
2 | Instance stuck “Running” forever | WaitForExternalEvent with no timeout | Status API runtimeStatus=Running for hours; KQL non-terminal query | Add the timer-race timeout; terminate the wedged instance
3 | Duplicate charges/adjustments | Non-idempotent activity + retry fired | dependencies failures + duplicate rows downstream | Deterministic idempotency key persisted before the call
4 | Settlement went from 40 min to 6 h, occasionally wedges | Unbounded fan-out saturating control queue | Control-queue latency; instance duration trend | Chunk to ~500 via sub-orchestrations; consider Netherite
5 | Queries time out; history in tens of GB | Large payloads returned by value; missing purge | History table size; payload sizes in history | Return blob URIs; scheduled purge of terminal instances
6 | Eternal monitor’s history grows every cycle | while(true) loop instead of ContinueAsNew | History length grows per iteration | Replace loop with ContinueAsNew
7 | A poison work item stalls a partition | Activity throws deterministically; redelivered forever | Repeating failure in logs; control/work-item queue backlog | Fix the activity; cap attempts; partition the result
8 | “My orchestration ran on the wrong app” | Two apps share a storage account + hub name | Compare host.json hubName across apps | Give each app a unique hubName
9 | External event “lost”: instance never resumed | Event raised to wrong instance ID / hub, or before await with ContinueAsNew | raiseEvent 202 but no state change; check ID/hub | Use exact instance ID + taskHub; await events before ContinueAsNew
10 | Terminating an instance didn’t stop the work | terminate doesn’t cancel in-flight activities | Activity still logging after terminate | Make activities cancellation-aware; design for at-least-once
11 | Backend latency spikes; HTTP 429 in logs | Storage account throttling under load | Storage metrics 429; backend trace latency | Scale the account / move to Netherite; reduce transactions
12 | Fan-in throws aggregate, whole batch fails on one bad item | Task.WhenAll with no per-item handling | Aggregate exception naming one activity | Switch to collect-and-partition try/catch per task

The expanded form for the entries that bite hardest:

1. NonDeterministicOrchestrationException on replay. Root cause: the orchestrator body did something non-deterministic — read DateTime.UtcNow, called Guid.NewGuid(), did direct I/O, or iterated an unordered collection — so the replay scheduled different work than history records. Confirm: the exception message names the divergence; pull the instance with ?showHistory=true and compare the scheduled events against the body. Grep the orchestrator for the forbidden constructs in the table above. Fix: replace with context.CurrentUtcDateTime / context.NewGuid(), move all I/O into activities, and sort collections to a stable order. Never retry this — it’s a code defect.
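
For the unordered-collection case, the usual remedy is to impose an explicit, culture-independent order before scheduling any work. A small sketch, assuming the work items are string IDs; ReplaySafeOrdering and ToStableOrder are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ReplaySafeOrdering
{
    // Ordinal comparison ignores culture settings, so every worker that
    // replays the orchestration sees the IDs in the same sequence.
    public static string[] ToStableOrder(IEnumerable<string> ids)
    {
        if (ids is null) throw new ArgumentNullException(nameof(ids));
        return ids.OrderBy(id => id, StringComparer.Ordinal).ToArray();
    }
}
```

Sorting once, at the point where the collection enters the orchestrator, is usually enough; everything scheduled from the sorted array then follows the same order on every replay.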

2. Instance stuck “Running” forever. Root cause: almost always an unresolved WaitForExternalEvent with no timeout, or a fan-in where one activity throws on every retry and the host keeps redelivering it. Confirm: the status API shows runtimeStatus: Running for far longer than expected; the fleet-wide KQL below surfaces every non-terminal instance. Fix: add the timer-race from the human-interaction section; put bounded retry policies on activities; terminate the genuinely wedged instance.

# Inspect a single instance: status, input, output, and execution history
curl "https://myapp.azurewebsites.net/runtime/webhooks/durabletask/instances/${INSTANCE_ID}?showHistory=true&code=${SYSTEM_KEY}"

# Terminate a wedged instance (does NOT cancel in-flight activities)
curl -X POST \
  "https://myapp.azurewebsites.net/runtime/webhooks/durabletask/instances/${INSTANCE_ID}/terminate?reason=stuck&code=${SYSTEM_KEY}"

// Orchestrations that started but never reached a terminal state in 24h
traces
| where timestamp > ago(24h)
| where customDimensions.prop__functionType == "Orchestrator"
| extend instanceId = tostring(customDimensions.prop__instanceId),
         state      = tostring(customDimensions.prop__state)
| summarize states = make_set(state), last = max(timestamp) by instanceId
| where not (states has "Completed" or states has "Failed" or states has "Terminated")
| order by last asc

3. Duplicate charges/adjustments. Root cause: a side-effecting activity isn’t idempotent, so when an attempt times out and the retry policy fires, the original call may have already succeeded and the effect lands twice. Confirm: App Insights dependencies shows the call failing/timing out under load, and you see duplicate rows downstream. Correlate the retry timestamps with the duplicates. Fix: generate a deterministic idempotency key in the orchestrator (one context.NewGuid() per logical unit, passed to the activity as input so it is persisted before the call and reused by every retry), and have the downstream treat a repeated key as a no-op. See Transactional Outbox/Inbox & Exactly-Once Event Publishing for the broader pattern.
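
On the receiving side, treating a repeated key as a no-op can be sketched as follows. This is an in-memory stand-in, not the partner's API: IdempotentLedger and Apply are hypothetical, and a real implementation would presumably enforce the key with a unique constraint in durable storage rather than a HashSet.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory stand-in for a downstream ledger.
public sealed class IdempotentLedger
{
    private readonly object _gate = new();
    private readonly HashSet<Guid> _seenKeys = new();

    public decimal Balance { get; private set; }

    // Applies the adjustment once per key. Returns false, and changes
    // nothing, when the same key is presented again (e.g. by a retry).
    public bool Apply(Guid idempotencyKey, decimal amount)
    {
        lock (_gate)
        {
            if (!_seenKeys.Add(idempotencyKey))
                return false;        // repeated key: no-op
            Balance += amount;
            return true;
        }
    }
}
```

The key check and the balance update sit under the same lock on purpose: recording the key and applying the effect have to succeed or fail together, otherwise a retry can still slip through.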

5. History bloat: queries time out, history in tens of GB. Root cause: large activity payloads returned by value, and/or no purge of terminal instances. Confirm: the history table in Table Storage is huge; individual history rows carry large payloads. Fix: return references (blob URIs, row keys) instead of big blobs, and schedule a purge so history is groomed continuously instead of growing until queries time out.

# Purge completed/failed/terminated instances older than a cutoff
curl -X DELETE \
  "https://myapp.azurewebsites.net/runtime/webhooks/durabletask/instances?createdTimeTo=2026-03-01T00:00:00Z&runtimeStatus=Completed,Failed,Terminated&code=${SYSTEM_KEY}"

Schedule that purge (a timer-triggered function calling client.PurgeAllInstancesAsync with a PurgeInstancesFilter) so history is groomed continuously.
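
Entry 12 in the table recommends collect-and-partition instead of a bare Task.WhenAll. One way to sketch it with plain tasks, assuming every task has already been started (as fanned-out activity tasks would be); FanIn and CollectAsync are hypothetical names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class FanIn
{
    // Awaits every task in list order and records each outcome, so one
    // failing item no longer aborts the whole batch.
    public static async Task<(List<T> Succeeded, List<Exception> Failed)> CollectAsync<T>(
        IEnumerable<Task<T>> tasks)
    {
        var succeeded = new List<T>();
        var failed = new List<Exception>();
        foreach (var task in tasks.ToList())
        {
            try { succeeded.Add(await task); }
            catch (Exception ex) { failed.Add(ex); }
        }
        return (succeeded, failed);
    }
}
```

The caller can then decide what to do with the failures (retry, dead-letter, report) while still returning the successes.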

The error/exception reference you scan first — every error you realistically see, what it means, and the fix:

Error / status | Meaning | Likely cause | How to confirm | First fix
NonDeterministicOrchestrationException | Replay scheduled different work than history | Clock/GUID/I/O in orchestrator | Exception text; showHistory diff | Use context APIs; move I/O to activities
OrchestrationFailureException | Orchestrator threw and faulted | Unhandled exception in body or activity aggregate | Instance output / failure details | Fix the throwing path; handle aggregates
TaskFailedException | An activity exhausted its retries | Persistent activity failure | Activity logs; dependencies | Fix the activity; tune retry/idempotency
runtimeStatus: Running (stuck) | Never reached terminal state | Unbounded wait / poison retry | Status API; KQL non-terminal | Timer-race; terminate; fix poison item
runtimeStatus: Failed | Terminal failure | Faulted orchestrator/activity | Instance output | Read output; fix root cause
runtimeStatus: Terminated | Manually stopped | terminate was called | Status API reason | Was the in-flight work cancelled?
HTTP 404 on raiseEvent/status | Instance not found | Wrong instance ID / wrong hub | Verify ID + taskHub query | Use exact ID and hub name
HTTP 429 (backend) | Storage throttling | Heavy transaction volume | Storage account metrics | Scale account / Netherite; cut transactions
HTTP 410 Gone (status URL) | Status webhook expired/purged | Instance purged | | Re-query by ID if still present

Decision table for the on-call engineer — if you see…:

If you see… | It’s probably… | Do this
NonDeterministicOrchestrationException | Clock/GUID/I/O in the orchestrator | Fix the body; never retry
One instance “Running” for hours | A wait with no timeout, or poison retry | KQL to confirm; add timeout; terminate
Many instances slow at once | Backend saturation / unbounded fan-out | Check control-queue latency; chunk fan-out
Duplicate downstream effects | Non-idempotent activity + retry | Add idempotency key
Queries timing out, huge history | Bloat | Smaller payloads; scheduled purge
Work running on the “wrong app” | Shared task hub | Unique hubName per app
Event raised but nothing resumed | Wrong ID/hub, or dropped at ContinueAsNew | Verify ID/hub; await before ContinueAsNew
Terminate didn’t stop the work | terminate ignores in-flight activities | Make activities cancellation-aware
Backend 429s under load | Storage account throttling | Scale account / Netherite; cut transactions
Same exception every replay, no retry helps | Code defect in the body | Fix it; never retry a non-determinism error

Best practices

The signals worth alerting on before the next incident — leading indicators, not “the orchestration failed”:

Alert on | Signal / source | Threshold (starting point) | Why it’s leading
Non-terminal instance age | KQL non-terminal query | Any instance Running > expected SLA | Catches stuck “Running” before users notice
Control-queue latency | Backend traces / metrics | Rising trend under load | Predicts fan-out saturation
Storage throttling | Storage account 429 count | > 0 sustained | Backend is the bottleneck
History table size | Storage metrics | Growth without purge | Predicts query timeouts
Activity failure rate | dependencies success=false | > 1% sustained | Poison items / retries firing
Orchestration duration | App Insights custom metric | p95 > baseline | Width or backend regression

Security notes

The security controls that also prevent these incidents — secure and resilient pull the same way:

Control | Mechanism | Secures against | Also prevents
Managed identity to storage | identity + RBAC on the account | Connection strings in config | Secret-rotation breaking the backend connection
System-key protection on mgmt APIs | durabletask_extension key + APIM | Anonymous terminate/purge/raiseEvent | Malicious instance manipulation
Authorize raiseEvent callers | HTTP auth before the signal | Unauthorized approvals | Spoofed external events corrupting flow
Private Endpoints for storage/SQL/API | VNet + private DNS | Data exfiltration over public net | SNAT/egress surprises in activities
Vault firewall + trusted services | Key Vault networking | Secret exfiltration | KV-reference boot failures (when allow-listed)
Least-privilege RBAC on the account | Scoped data-plane roles | Over-broad access to history | Accidental cross-hub interference

Cost & sizing

A rough monthly picture for a moderate workload (a few hundred thousand activity executions/day, small payloads, groomed history) on Consumption: storage transactions plus execution charges typically land in the low thousands of INR; the same workload on Premium EP1 adds a floor of roughly ₹12,000–18,000/month for the always-warm instance. The bill drivers, what each buys you, and how they interact with the patterns:

Cost driver | What you pay for | Rough INR / month | What it fixes | Watch-out
Storage transactions (history/queues) | Per-transaction on the account | ~₹500–3,000 (workload-dependent) | (it’s the backend itself) | Large payloads + polling inflate it
Consumption executions | Per-execution + GB-seconds | Pennies per 10k executions | Cheapest entry; scales to zero | Cold start; fan-out multiplies count
Premium plan (EP1+) | Always-warm instance floor | ~₹12,000–18,000+ | Cold start, VNet, predictable latency | Pay even when idle
Netherite (Event Hubs TU + blobs) | Throughput units + Page Blobs | ~₹8,000+ | Throughput ceiling under heavy fan-out | Over-provisioned at low scale
MSSQL backend | SQL DTU/vCore + storage | depends on SQL tier | Portability, single backed-up store | You operate the SQL
App Insights ingestion | Per-GB telemetry | ~₹1,000–3,000 | Triage (KQL, traces) | Sample high-volume apps

Free-tier note: the Consumption plan includes a monthly grant of free executions and GB-seconds, so small Durable workloads cost mostly the (cheap) storage transactions — keep payloads small and purge terminal instances and the bill stays tiny.

Interview & exam questions

1. Why must an orchestrator function be deterministic, and name three things you can’t do in one? Because the orchestrator replays from history every time it makes progress, it must schedule the same work in the same order given the same history — non-determinism diverges history and corrupts state (the SDK throws NonDeterministicOrchestrationException). You can’t use DateTime.UtcNow, Guid.NewGuid(), or direct I/O (HttpClient, DB) in the body — use context.CurrentUtcDateTime, context.NewGuid(), and activities instead.

2. What is the fan-out/fan-in pattern and what’s the canonical fan-in? Fan-out schedules many independent activities in parallel (build a list of CallActivityAsync tasks without awaiting each); fan-in waits for them all. The canonical fan-in is await Task.WhenAll(tasks) over the Durable tasks — replay-safe and durable, so a crash after 900 of 1,000 completions resumes with only the outstanding 100.

3. How do you bound a very large fan-out and why must you? Scheduling, say, 100,000 activities in one Task.WhenAll saturates the work-item/control queues and starves other orchestrations, spiking latency. Bound it by chunking — a sub-orchestration per ~500 items via CallSubOrchestratorAsync — which caps in-flight work and isolates failures to a chunk.

4. How do you implement a human-approval step that won’t hang forever? Race a WaitForExternalEvent against a durable timer with Task.WhenAny: if the event wins, cancel the timer and return; if the timer wins, escalate/time out. Use context.CreateTimer (persisted, survives restart), never Task.Delay, and always cancel the loser so the instance doesn’t stay open.

5. What does ContinueAsNew do and when do you need it? It restarts the orchestration with a clean history and fresh input, which is how you run an eternal orchestration (monitor, recurring job) without the history table growing unbounded. Drain pending events first, and make the ContinueAsNew call the last statement on the branch — it schedules a restart, it doesn’t return.

6. When do you use a durable entity instead of an orchestration? Use an orchestration for a workflow with a defined start and end; use an entity for long-lived, mutable state that many callers update concurrently (counters, carts, per-tenant budgets, rate limits). Entities guarantee single-threaded access per entityName@key, giving race-free updates without locks.

7. An activity that posts to a partner API double-applied during retries. Why, and how do you fix it? The activity isn’t idempotent: an attempt timed out and the retry fired after the original call had already posted. Fix it by generating a deterministic idempotency key in the orchestrator (one context.NewGuid() per unit, passed to the activity as input so it is persisted before the call) and having the downstream treat a repeated key as a no-op, so retries and redeliveries can’t double-apply.

8. Compare the three storage backends. Azure Storage (default) is lowest-ops and cheapest at low scale but has a throughput ceiling under heavy fan-out; Netherite (Event Hubs + Page Blobs) gives very high throughput and low latency at the cost of operational complexity; MSSQL gives portability and a single backed-up store for hybrid/on-prem at the cost of running SQL. Switching is a state migration, so choose before millions of instances exist.

9. Two function apps’ orchestrations are interfering. Most likely cause? They share a storage account and the same hubName, so they’re reading each other’s queues and tables (the “ran on the wrong app” incident). Give each app a unique hubName in host.json (or separate storage accounts).

10. How do you find and recover a stuck “Running” instance? Query the status API (?showHistory=true) for the instance, or run a fleet-wide KQL over traces for instances with no terminal state. The usual cause is a WaitForExternalEvent with no timeout or a poison-item retry loop — fix the code (timer-race, bounded retries) and terminate the wedged instance (knowing terminate doesn’t cancel in-flight activities).

11. What causes history-table bloat and how do you control it? Large activity payloads returned by value, and missing purge of terminal instances. Return references (blob URIs/row keys) instead of big blobs, and schedule a purge (PurgeInstancesAsync) of completed/failed/terminated instances so history is groomed continuously instead of growing until queries time out.

12. Does terminating an instance stop its in-flight activities? No — terminate marks the orchestration terminated but does not cancel activities already running. Design activities to be cancellation-aware and assume at-least-once execution so a terminated-but-still-running activity can’t corrupt downstream state.
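
The idempotency fix from question 7 can be sketched without any Azure dependency. This is a minimal illustration, not the partner API's real contract: `keyStore`, `ledger`, `KeyFor`, and `PostPayment` are hypothetical in-memory stand-ins, and `Guid.NewGuid()` stands in for `context.NewGuid()`, which is what an orchestrator would use so the value survives replay.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical in-memory stand-ins: keyStore plays the persisted orchestration
// state, ledger plays the partner API's own record of applied payments.
var keyStore = new Dictionary<string, string>();
var ledger = new Dictionary<string, long>();

// Create the key once per unit of work and persist it before any call is made.
// Inside an orchestrator, context.NewGuid() would replace Guid.NewGuid() here.
string KeyFor(string unitId)
{
    if (!keyStore.TryGetValue(unitId, out var key))
    {
        key = Guid.NewGuid().ToString("N");
        keyStore[unitId] = key;
    }
    return key;
}

// Assumed downstream contract: a repeated key is a no-op.
// Returns true only when the payment was actually applied.
bool PostPayment(string key, long amountCents) => ledger.TryAdd(key, amountCents);

// The original call lands, but suppose the caller only sees a timeout.
bool first = PostPayment(KeyFor("merchant-42"), 12_500);
// The retry reuses the persisted key, so the downstream ignores it.
bool retry = PostPayment(KeyFor("merchant-42"), 12_500);

Console.WriteLine($"first={first}, retry={retry}, total={ledger.Values.Sum()}");
// prints: first=True, retry=False, total=12500
```

The property that matters is that the key is fixed before the first attempt. A key generated inside the activity on each attempt would differ on retry, and the downstream could not recognise the duplicate.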

The questions above map primarily to AZ-204 (Developer Associate), specifically implement Azure Functions and develop event-based and message-based solutions, and the durable-orchestration patterns appear in solution-architecture scenarios on AZ-305. A compact cert-mapping for revision:

| Question theme | Primary cert | Objective area |
| --- | --- | --- |
| Replay model & determinism | AZ-204 | Implement Azure Functions |
| Fan-out/fan-in, sub-orchestration | AZ-204 | Develop message/event solutions |
| Human-in-the-loop (events/timers) | AZ-204 | Durable Functions patterns |
| Entities vs orchestrations | AZ-204 / AZ-305 | Stateful serverless design |
| Backend choice & scaling | AZ-305 | Design for throughput/cost |
| Idempotency & exactly-once | AZ-204 / AZ-305 | Reliable messaging design |

Quick check

  1. You add DateTime.UtcNow to an orchestrator and it throws on replay. What exception, and what’s the deterministic replacement?
  2. An approval orchestration sits in “Running” for three days and never finishes. What’s the most likely cause and the fix?
  3. A nightly job that fanned out 95,000 activities in one Task.WhenAll went from 40 minutes to 6 hours. Name the root cause and the fix.
  4. Your activity posts to a payment API and you see duplicate charges after a deploy. Why, and what makes it safe?
  5. Two function apps share a storage account and one app’s orchestration “runs on the other app.” What single setting fixes it?

Answers

  1. NonDeterministicOrchestrationException. The orchestrator replays, so an ambient clock returns a different value on each replay and the code diverges from the recorded history. Replace it with context.CurrentUtcDateTime (and use context.NewGuid() for IDs); move any I/O into an activity.
  2. A WaitForExternalEvent with no timeout — nothing ever raised the event, so the instance waits forever. Fix by racing the wait against a durable timer with Task.WhenAny, cancelling the loser; terminate the already-wedged instance.
  3. Unbounded fan-out saturated the work-item/control queues and starved replays. Fix by chunking into sub-orchestrations (~500/chunk) to bound in-flight width, and consider migrating the task hub to Netherite for throughput headroom.
  4. The activity isn’t idempotent: a timed-out attempt’s retry posted again after the original had already succeeded. Make it safe with a deterministic idempotency key (generated with context.NewGuid() once per unit of work and persisted before the call) that the downstream treats as a no-op on repeat.
  5. Give each app a unique hubName in host.json — they were sharing one task hub (the same queues and tables) on the shared storage account.
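
The chunking in answer 3 comes down to simple arithmetic, sketched here as plain C# with no Durable SDK involved. `ChunkWork`, the merchant IDs, the `settlement-run` parent ID, and the child-ID naming scheme are all assumptions made for illustration. In a real orchestrator each chunk would be handed to a sub-orchestration (for example through `CallSubOrchestratorAsync`) instead of being printed.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: split the work list into fixed-size chunks so each chunk
// can go to one sub-orchestration instead of one 95,000-wide Task.WhenAll.
static List<string[]> ChunkWork(IEnumerable<string> ids, int chunkSize)
{
    if (chunkSize <= 0) throw new ArgumentOutOfRangeException(nameof(chunkSize));
    return ids.Chunk(chunkSize).ToList(); // Enumerable.Chunk needs .NET 6 or later
}

var merchantIds = Enumerable.Range(1, 95_000).Select(i => $"merchant-{i}");
List<string[]> chunks = ChunkWork(merchantIds, 500);

// Assumed naming scheme: child instance IDs derived from the parent ID and the
// chunk index, so every replay computes the same IDs for the same chunks.
string parentId = "settlement-run";
List<string> childIds = chunks.Select((_, i) => $"{parentId}:chunk-{i:D3}").ToList();

Console.WriteLine($"{chunks.Count} chunks, widest = {chunks.Max(c => c.Length)}");
// prints: 190 chunks, widest = 500
Console.WriteLine($"{childIds[0]} .. {childIds[^1]}");
// prints: settlement-run:chunk-000 .. settlement-run:chunk-189
```

With this split the parent orchestration tracks 190 child tasks instead of 95,000 activity tasks, and each child holds at most 500 in flight, which is the bounded width the answer describes.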

Glossary

Next steps

You can now build the five Durable patterns correctly and triage a stuck orchestration. Build outward:


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
