Durable Functions in Production: Orchestrations, Fan-out/Fan-in, and Entity State

Durable Functions is the part of Azure Functions that lets you write stateful, long-running workflows as plain code instead of stitching together queues, tables, and state machines by hand. The catch is that the programming model is not what it looks like. An orchestrator function reads top to bottom like normal C# or TypeScript, but underneath it is a replay engine that re-executes your code from the start every time it makes progress. If you do not internalize that, you will ship orchestrations that work in the demo and corrupt their own state under load. This guide builds the core patterns the right way and ends with how to debug them when they get stuck at 2 a.m.

All examples use the .NET isolated worker model, which is the supported path going forward; the concepts map directly to the JavaScript, Python, and PowerShell SDKs.

1. The replay execution model and why determinism is non-negotiable

An orchestrator does not run once. It runs, awaits an activity, and unloads from memory. When that activity completes, the Durable Task Framework replays the orchestrator from line one, feeding the results of already-completed work from a history table instead of calling the activities again. Replay stops at the first await whose result is not yet in history, and real execution resumes there.

This is how an orchestration survives a worker crash, a deployment, or a scale-in: its state is the event-sourced history, not the process memory. It is also the source of every Durable Functions bug. Because the orchestrator body is replayed repeatedly, it must be deterministic — given the same history, it must make the same decisions and schedule the same activities in the same order.

That rules out, inside orchestrator code:

DateTime.UtcNow, DateTime.Now, Stopwatch, or any ambient clock.
Guid.NewGuid() or any random source.
Direct I/O: HTTP calls, database queries, reading config, environment variables.
Non-deterministic collection ordering (e.g. iterating an unordered Dictionary and scheduling work in that order).
Blocking or async calls that are not Durable APIs (Task.Delay, HttpClient, Thread.Sleep).

The replacements live on the orchestration context:

[Function(nameof(ProcessOrder))]
public async Task<OrderResult> ProcessOrder(
    [OrchestrationTrigger] TaskOrchestrationContext context,
    OrderInput input)
{
    // Deterministic, replay-safe equivalents:
    DateTime now = context.CurrentUtcDateTime;        // NOT DateTime.UtcNow
    Guid id = context.NewGuid();                       // NOT Guid.NewGuid()
    ILogger logger = context.CreateReplaySafeLogger<OrderProcessor>();

    // The replay-safe logger already suppresses output during replay,
    // so no extra context.IsReplaying check is needed:
    logger.LogInformation("Starting order {OrderId}", input.OrderId);

    // All real work happens in activities, which CAN do I/O:
    var validated = await context.CallActivityAsync<bool>(nameof(ValidateOrder), input);
    return new OrderResult(input.OrderId, validated);
}

The mental model that sticks: the orchestrator is the brain and must be pure; activities are the hands and may touch the outside world. Anything non-deterministic belongs in an activity or comes from the context.

The framework helps you catch violations. The Durable Task SDK detects non-deterministic orchestration when the replayed code schedules different work than the history records, and throws rather than silently corrupting state. Treat any NonDeterministicOrchestrationException as a code defect, never a transient error to retry.
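To make the replay idea concrete, here is a toy model in plain C#. It is a sketch for intuition only, not the Durable Task Framework's implementation: ToyReplayContext and ToyOrchestrator are hypothetical names, the "history" is an in-memory list, and the orchestrator runs to completion in one pass instead of unloading at each await.

```csharp
using System;
using System.Collections.Generic;

// Toy model of replay. NOT the real framework; all names are hypothetical.
public sealed class ToyReplayContext
{
    private readonly List<(string Name, string Result)> _history;
    private int _cursor;

    public ToyReplayContext(List<(string Name, string Result)> history)
        => _history = history;

    // During replay, return the recorded result without running the activity.
    // Past the end of history, run it for real and record the outcome.
    public string Call(string name, Func<string> activity)
    {
        if (_cursor < _history.Count)
        {
            var recorded = _history[_cursor++];
            if (recorded.Name != name)
                throw new InvalidOperationException(
                    $"Non-deterministic: history recorded '{recorded.Name}' " +
                    $"but code scheduled '{name}'.");
            return recorded.Result;
        }

        string result = activity();
        _history.Add((name, result));
        _cursor++;
        return result;
    }
}

public static class ToyOrchestrator
{
    // Counts real executions so a caller can see that replay skips finished work.
    public static int RealExecutions;

    public static string Run(ToyReplayContext context)
    {
        string raw = context.Call("Download", () => { RealExecutions++; return "raw"; });
        return context.Call("Parse", () => { RealExecutions++; return raw + "-parsed"; });
    }
}
```

Running ToyOrchestrator.Run twice against the same history list executes each activity once; the second run is served entirely from history. Scheduling a name that disagrees with history throws, which is the same class of failure the real framework reports as non-determinism.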

2. Function chaining and passing state safely

The simplest pattern is a sequence: A then B then C, where each step’s output feeds the next. Because state flows through return values held in history, you do not need external storage to pass data between steps.

[Function(nameof(IngestPipeline))]
public async Task<string> IngestPipeline(
    [OrchestrationTrigger] TaskOrchestrationContext context)
{
    var input = context.GetInput<IngestRequest>()!;

    string downloaded = await context.CallActivityAsync<string>(nameof(Download), input.Url);
    string parsed     = await context.CallActivityAsync<string>(nameof(Parse), downloaded);
    string stored     = await context.CallActivityAsync<string>(nameof(Persist), parsed);
    return stored;
}

Two rules keep this safe:

Everything crossing an activity boundary is serialized to JSON. Inputs and outputs must be serializable POCOs, not live handles, streams, or HttpClient instances. Keep payloads small. If a step produces a 200 MB blob, return the blob URI, not the bytes — large payloads bloat the history table and slow every replay.
Add retries where failure is expected, not a blanket retry on everything. Use TaskOptions with a retry policy for transient activity failures:

var retry = TaskOptions.FromRetryPolicy(new RetryPolicy(
    maxNumberOfAttempts: 5,
    firstRetryInterval: TimeSpan.FromSeconds(5),
    backoffCoefficient: 2.0,
    maxRetryInterval: TimeSpan.FromMinutes(2)));

string downloaded = await context.CallActivityAsync<string>(
    nameof(Download), input.Url, retry);

The retry timing is itself recorded as durable timers, so a 5-attempt exponential backoff survives a worker restart mid-backoff.
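It helps to see what a policy actually costs in wall-clock time before you ship it. The sketch below assumes the usual exponential formula (first interval times coefficient to the power of the retry index, capped at the maximum); BackoffSchedule is a hypothetical helper, and the SDK's exact timing may differ.

```csharp
using System;
using System.Collections.Generic;

public static class BackoffSchedule
{
    // Delay before each retry, assuming delay = first * coefficient^retryIndex,
    // capped at maxRetryInterval. The first attempt is not a retry, so a policy
    // with N attempts yields N - 1 delays.
    public static IReadOnlyList<TimeSpan> Delays(
        int maxNumberOfAttempts,
        TimeSpan firstRetryInterval,
        double backoffCoefficient,
        TimeSpan maxRetryInterval)
    {
        var delays = new List<TimeSpan>();
        for (int retry = 0; retry < maxNumberOfAttempts - 1; retry++)
        {
            double ms = firstRetryInterval.TotalMilliseconds
                        * Math.Pow(backoffCoefficient, retry);
            delays.Add(TimeSpan.FromMilliseconds(
                Math.Min(ms, maxRetryInterval.TotalMilliseconds)));
        }
        return delays;
    }
}
```

Under that assumption, the policy above (5 attempts, 5 s first interval, coefficient 2.0, 2 min cap) waits 5 s, 10 s, 20 s, and 40 s, so a permanently failing activity holds the orchestration for roughly 75 seconds plus execution time before the failure surfaces.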

3. Fan-out/fan-in for parallel processing

Chaining is sequential. When steps are independent, fan them out, run them in parallel across the entire scaled-out function app, then fan in to aggregate. This is the pattern that makes Durable Functions worth using over a logic-light queue trigger.

[Function(nameof(BatchResize))]
public async Task<int> BatchResize(
    [OrchestrationTrigger] TaskOrchestrationContext context)
{
    var batch = context.GetInput<ImageBatch>()!;

    // List the work in an activity (I/O), not in the orchestrator:
    string[] files = await context.CallActivityAsync<string[]>(
        nameof(ListSourceFiles), batch.Prefix);

    // FAN OUT: schedule all activities without awaiting individually.
    var tasks = new List<Task<long>>(files.Length);
    foreach (string file in files)
        tasks.Add(context.CallActivityAsync<long>(nameof(ResizeImage), file));

    // FAN IN: await them all; this is replay-safe and durable.
    long[] sizes = await Task.WhenAll(tasks);

    // Sum as long: casting each size to int overflows silently past ~2 GB.
    long totalBytes = sizes.Sum();
    await context.CallActivityAsync(nameof(WriteManifest),
        new Manifest(batch.Prefix, files.Length, totalBytes));
    return files.Length;
}

Task.WhenAll over Durable tasks is the canonical fan-in. The orchestrator suspends until every activity reports back, and the framework records each completion in history independently, so a crash after 900 of 1,000 completions resumes with only the outstanding 100 left to run.

Production guardrails:

Bound the fan-out width. Fanning out 100,000 activities at once floods the work-item queue and can starve other orchestrations. Chunk the list and process N at a time, or use a semaphore-style throttle by splitting into sub-batches.
Decide your failure policy explicitly. Awaiting Task.WhenAll rethrows only the first failure, and only after every task has finished and retries are exhausted; the remaining exceptions are available on the returned task's Exception property. If you want “best effort, collect successes and failures,” await each task in a try/catch and partition the results yourself rather than letting one poison item fail the whole batch.
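The best-effort policy can be sketched as a small helper. This is one possible shape, not an SDK API: BestEffort and SettleAsync are hypothetical names. It awaits the tasks one by one in their original order, which keeps the code deterministic when the tasks come from CallActivityAsync.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class BestEffort
{
    // Awaits every task in order and partitions outcomes instead of letting
    // the first failure abort the batch. The tasks are already running, so
    // awaiting them sequentially does not serialize the work itself.
    public static async Task<(List<T> Succeeded, List<Exception> Failed)>
        SettleAsync<T>(IEnumerable<Task<T>> tasks)
    {
        var succeeded = new List<T>();
        var failed = new List<Exception>();

        foreach (Task<T> task in tasks)
        {
            try
            {
                succeeded.Add(await task);
            }
            catch (Exception ex)
            {
                failed.Add(ex);
            }
        }

        return (succeeded, failed);
    }
}
```

In the BatchResize orchestrator you would replace the Task.WhenAll line with a call to SettleAsync(tasks), then decide what a non-empty Failed list means for the batch: write it to the manifest, raise an alert, or fail the run only above some threshold.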

4. Human interaction with external events and durable timers

Some workflows must pause and wait for a human — an approval, a signature, a second factor — possibly for hours or days. You do this with an external event and a durable timer racing each other so you get a timeout instead of a workflow that hangs forever.

[Function(nameof(ApprovalWorkflow))]
public async Task<string> ApprovalWorkflow(
    [OrchestrationTrigger] TaskOrchestrationContext context)
{
    var request = context.GetInput<PurchaseRequest>()!;
    await context.CallActivityAsync(nameof(RequestApproval), request);

    // Durable timer: a replay-safe deadline. Always pair with a CTS so the
    // timer is cleaned up when the event arrives first.
    using var cts = new CancellationTokenSource();
    DateTime deadline = context.CurrentUtcDateTime.AddHours(72);
    Task timeout = context.CreateTimer(deadline, cts.Token);

    // External event: resumes when someone POSTs to the raise-event API.
    Task<bool> approved = context.WaitForExternalEvent<bool>("ApprovalResponse");

    Task winner = await Task.WhenAny(approved, timeout);
    if (winner == approved)
    {
        cts.Cancel();   // tear down the pending timer
        return approved.Result ? "Approved" : "Rejected";
    }
    return "TimedOut";   // escalate
}

Two things people get wrong here:

Use context.CreateTimer, never Task.Delay. A durable timer is persisted; if the host restarts during the 72-hour wait, the timer is restored and still fires. Task.Delay is wall-clock and evaporates on restart. (Durable timers were historically capped at ~6 days on the Azure Storage backend; for longer waits, loop shorter timers.)
Always cancel the loser. If you do not cancel the timer when the event wins, the orchestration is held open until the timer fires, inflating instance counts and history.
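If you are on a backend or SDK version that still caps timer length, "loop shorter timers" comes down to splitting one long wait into segments. A minimal sketch, assuming a caller-chosen maximum segment; TimerSegments is a hypothetical helper, not part of the SDK.

```csharp
using System;
using System.Collections.Generic;

public static class TimerSegments
{
    // Returns the successive fire times needed to reach 'deadline' from 'start'
    // without any single timer running longer than 'maxSegment'.
    public static List<DateTime> Split(DateTime start, DateTime deadline, TimeSpan maxSegment)
    {
        if (maxSegment <= TimeSpan.Zero)
            throw new ArgumentOutOfRangeException(nameof(maxSegment));

        var fireTimes = new List<DateTime>();
        DateTime cursor = start;
        while (cursor < deadline)
        {
            // Take a full segment unless the deadline is closer than that.
            DateTime next = deadline - cursor > maxSegment
                ? cursor + maxSegment
                : deadline;
            fireTimes.Add(next);
            cursor = next;
        }
        return fireTimes;
    }
}
```

Inside an orchestrator, pass context.CurrentUtcDateTime as the start (never DateTime.UtcNow) and await context.CreateTimer for each returned fire time in turn. The inputs are deterministic, so the list is identical on every replay.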

The external event is delivered from outside by instance ID:

# Raise the "ApprovalResponse" event with payload `true` to a running instance
curl -X POST \
  "https://myapp.azurewebsites.net/runtime/webhooks/durabletask/instances/${INSTANCE_ID}/raiseEvent/ApprovalResponse?taskHub=MyTaskHub&code=${SYSTEM_KEY}" \
  -H "Content-Type: application/json" \
  -d 'true'

5. Eternal orchestrations and ContinueAsNew

Some processes never really end: a per-device aggregator, a recurring cleanup, a monitor that polls forever. You cannot just wrap the body in while (true) — the history table would grow without bound and eventually every replay would crawl. The answer is ContinueAsNew, which restarts the orchestration with fresh state and a clean history, carrying forward only the input you choose.

[Function(nameof(PeriodicMonitor))]
public async Task PeriodicMonitor(
    [OrchestrationTrigger] TaskOrchestrationContext context)
{
    var state = context.GetInput<MonitorState>()!;

    bool stillOpen = await context.CallActivityAsync<bool>(nameof(CheckHealth), state.Target);
    if (!stillOpen)
        return;   // condition met -> orchestration completes for good

    // Wait one polling interval with a durable timer:
    DateTime next = context.CurrentUtcDateTime.AddMinutes(5);
    await context.CreateTimer(next, CancellationToken.None);

    // Reset history and loop with updated state. Do NOT recurse or while(true).
    context.ContinueAsNew(state with { Iterations = state.Iterations + 1 });
}

Key constraints:

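Drain pending work before ContinueAsNew. Outstanding tasks (activities, sub-orchestrations, timers) do not carry over, so await everything you care about first. In the .NET isolated SDK, external events that arrived but were not yet consumed are carried over by default (preserveUnprocessedEvents defaults to true); pass false and they are dropped.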
ContinueAsNew does not “return” — it schedules a restart. Code after it in the same path should not run; structure the method so the call is the last statement on that branch.
This is what bounds history growth. An eternal orchestration without ContinueAsNew is a slow-motion outage.

6. Durable entities for stateful, single-threaded actor logic

Orchestrations coordinate; entities hold state. A durable entity is an addressable, persistent object (think a tiny actor) identified by an entity name plus a key, written @entityName@key in instance-ID form. The framework guarantees single-threaded access per entity, processing one operation at a time, so you get serialized, race-free updates without locks. That makes entities a good fit for counters, shopping carts, per-tenant aggregates, or rate-limit budgets.

public class Counter : TaskEntity<int>
{
    public void Add(int amount) => State += amount;
    public void Reset() => State = 0;
    public int Get() => State;

    [Function(nameof(Counter))]
    public static Task Run([EntityTrigger] TaskEntityDispatcher dispatcher)
        => dispatcher.DispatchAsync<Counter>();
}

Call entities two ways. From a client you fire signals (one-way, fire-and-forget):

[Function("AddToCounter")]
public async Task<HttpResponseData> AddToCounter(
    [HttpTrigger(AuthorizationLevel.Function, "post", Route = "counter/{key}/add")]
        HttpRequestData req,
    [DurableClient] DurableTaskClient client,
    string key)
{
    var entityId = new EntityInstanceId(nameof(Counter), key);
    await client.Entities.SignalEntityAsync(entityId, "Add", 1);
    return req.CreateResponse(HttpStatusCode.Accepted);
}

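From an orchestrator you can signal, or call and await a return value. Each operation is serialized, but two separate calls are not atomic as a pair: another caller can slip in between Get and Add. Wrap a read-modify-write in a critical section with LockEntitiesAsync so the check and the update happen under one lock:

var entityId = new EntityInstanceId(nameof(Counter), key);
// Hold the entity lock so no other caller interleaves between Get and Add:
await using (await context.Entities.LockEntitiesAsync(entityId))
{
    int current = await context.Entities.CallEntityAsync<int>(entityId, "Get");
    if (current < limit)
        await context.Entities.CallEntityAsync(entityId, "Add", 1);
}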

When to reach for entities over an orchestration: use an orchestration for a workflow with a defined start and end; use an entity for long-lived, mutable state that many callers update concurrently. They compose — an orchestration that needs a global counter or lock should delegate to an entity rather than trying to serialize access itself.
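The one-operation-at-a-time guarantee is easier to trust once you have seen the shape of it. The sketch below is an in-memory analogy, not how Durable Functions implements entities: ToyCounterEntity is a hypothetical class that funnels concurrent signals through a mailbox drained by a single reader, so the state is only ever touched by one task.

```csharp
using System.Threading.Channels;
using System.Threading.Tasks;

// In-memory analogy for an entity: many writers, one reader, no locks.
public sealed class ToyCounterEntity
{
    private readonly Channel<int> _mailbox = Channel.CreateUnbounded<int>();
    private readonly Task _pump;

    public int State { get; private set; }

    public ToyCounterEntity() => _pump = Task.Run(PumpAsync);

    // Fire-and-forget, like SignalEntityAsync: enqueue and return immediately.
    public void SignalAdd(int amount) => _mailbox.Writer.TryWrite(amount);

    // Stops accepting signals, waits for the mailbox to empty, returns the state.
    public async Task<int> DrainAsync()
    {
        _mailbox.Writer.Complete();
        await _pump;
        return State;
    }

    // The only code that mutates State, and it handles one message at a time.
    private async Task PumpAsync()
    {
        await foreach (int amount in _mailbox.Reader.ReadAllAsync())
            State += amount;
    }
}
```

Firing a thousand SignalAdd(1) calls from parallel threads and then draining yields exactly 1000, because the increments never overlap. The real entity adds durability and addressing on top of the same idea.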

7. Choosing a storage backend

Durable Functions persists all state through a storage provider. The default is fine until it isn’t, and the choice has real throughput and cost consequences.

Provider	Backing store	Best for	Watch out for
Azure Storage (default)	Blobs, queues, tables	Default; low ops; most apps	Throughput ceiling under heavy fan-out; per-transaction cost adds up; history in Table Storage
Netherite	Azure Event Hubs + Page Blobs	High-throughput, high fan-out workloads needing low latency	Operationally heavier; partitions fixed at provisioning; Event Hubs cost
MSSQL	Azure SQL / SQL Server	Portability, on-prem/hybrid, single store you already operate and back up	You own SQL throughput and DTU/vCore sizing

The provider is selected in host.json:

{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "hubName": "MyTaskHub",
      "storageProvider": {
        "type": "Netherite",
        "partitionCount": 12
      }
    }
  }
}

Practical guidance: stay on Azure Storage until you have measured a throughput problem — most orchestrations never hit its limits, and it is the cheapest to operate. Move to Netherite when you are processing tens of thousands of work items per second and feeling queue latency. Choose MSSQL when portability, a single backed-up store, or running outside Azure dominates the decision. Switching providers is a state migration, so decide before you have millions of live instances, not after.

A note on task hubs: the hubName namespaces all the queues and tables. Two function apps sharing a storage account must use different hub names, or they will fight over each other’s work items — a classic “my orchestration ran on the wrong app” incident.

8. Diagnosing stuck instances, poison messages, and history bloat

When an orchestration misbehaves, query it rather than guessing. The HTTP management API (and the equivalent DurableTaskClient methods) exposes runtime status, input, output, and the full history.

# Inspect a single instance: status, input, output, and execution history
curl "https://myapp.azurewebsites.net/runtime/webhooks/durabletask/instances/${INSTANCE_ID}?showHistory=true&code=${SYSTEM_KEY}"

# Terminate a wedged instance (does NOT cancel in-flight activities)
curl -X POST \
  "https://myapp.azurewebsites.net/runtime/webhooks/durabletask/instances/${INSTANCE_ID}/terminate?reason=stuck&code=${SYSTEM_KEY}"

For fleet-wide triage, query the Durable Functions tracking telemetry in Application Insights with KQL. The extension emits structured traces; this query surfaces instances that are running far longer than expected:

// Orchestrations that started but never reached a terminal state in 24h
traces
| where timestamp > ago(24h)
| where customDimensions.prop__functionType == "Orchestrator"
| extend instanceId = tostring(customDimensions.prop__instanceId),
         state      = tostring(customDimensions.prop__state)
| summarize states = make_set(state), last = max(timestamp) by instanceId
| where not(set_has_element(states, "Completed") or set_has_element(states, "Failed") or set_has_element(states, "Terminated"))
| order by last asc

Map symptoms to root causes:

Stuck “Running” forever. Almost always an unresolved WaitForExternalEvent with no timeout, or a fan-in where one activity throws on every retry and the host keeps redelivering it. Add the timer-race from section 4, and put bounded retry policies on activities.
Poison messages. A work item that fails deterministically gets retried, dead-lettered, and can stall a control queue partition. On the Azure Storage backend, inspect the control/work-item queues; ensure activity inputs are valid and idempotent so a redelivery cannot corrupt downstream state. Fix the activity; do not just bump retry counts.
History table bloat. Large activity payloads and missing ContinueAsNew are the two causes. Return references (blob URIs, row keys) instead of big blobs, and make every long-lived orchestration call ContinueAsNew. Use the purge API to reclaim space from old terminal instances:

# Purge completed/failed/terminated instances older than a cutoff
curl -X DELETE \
  "https://myapp.azurewebsites.net/runtime/webhooks/durabletask/instances?createdTimeTo=2026-03-01T00:00:00Z&runtimeStatus=Completed,Failed,Terminated&code=${SYSTEM_KEY}"

Schedule that purge (a timer-triggered function calling client.PurgeAllInstancesAsync with a PurgeInstancesFilter) so history is groomed continuously instead of growing until queries time out.
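If you drive the purge over HTTP from a scheduled job instead, a rolling cutoff beats a hard-coded date. A minimal sketch that only builds the URL shown above and sends nothing; PurgeRequest and BuildUri are hypothetical names, and the base URL and key are whatever your environment supplies.

```csharp
using System;
using System.Globalization;

public static class PurgeRequest
{
    // Builds the purge URL for terminal instances created before 'cutoffUtc'.
    // Issue it with HTTP DELETE; this helper does not make the request.
    public static string BuildUri(string baseUrl, DateTime cutoffUtc, string systemKey)
    {
        string cutoff = cutoffUtc.ToUniversalTime()
            .ToString("yyyy-MM-dd'T'HH:mm:ss'Z'", CultureInfo.InvariantCulture);

        return $"{baseUrl.TrimEnd('/')}/runtime/webhooks/durabletask/instances" +
               $"?createdTimeTo={cutoff}" +
               "&runtimeStatus=Completed,Failed,Terminated" +
               $"&code={Uri.EscapeDataString(systemKey)}";
    }
}
```

A nightly job might call BuildUri(baseUrl, DateTime.UtcNow.AddDays(-30), key) to keep a 30-day window. The retention period is a policy decision for your team, not something the platform dictates.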

Enterprise scenario

A payments platform team ran a nightly settlement orchestration that fanned out one activity per merchant — roughly 40,000 of them — to reconcile transactions against a partner ledger. It worked for months. Then onboarding pushed merchant count past ~95,000 and settlement, which used to finish in 40 minutes, started running for six-plus hours and occasionally wedged in “Running” until someone terminated it manually. Worse, a few runs produced double-applied adjustments.

Two root causes surfaced under investigation. First, the fan-out was unbounded: scheduling 95,000 activities in one Task.WhenAll saturated the Azure Storage work-item queue, and control-queue latency spiked so badly that replays slowed to a crawl. Second, the reconcile activity called the partner’s ledger API non-idempotently — when an activity timed out and the retry policy fired, the original call had sometimes already posted, so the adjustment landed twice. The history table had also grown to tens of GB because each activity returned the full reconciliation record instead of a reference.

The fix had three parts. They chunked the fan-out into sub-batches of 500 with a durable sub-orchestration per chunk, capping concurrent work items. They made the activity idempotent by generating one idempotency key per merchant in the orchestrator with context.NewGuid(), which returns the same value on every replay, passing it in the activity input so every retry carries the same key, and having the partner API treat a repeated key as a no-op. And because throughput was now the binding constraint, they migrated the task hub to the Netherite backend. Settlement dropped back to ~35 minutes and stopped wedging.

// Sub-orchestration per chunk bounds the fan-out width and isolates failures.
[Function(nameof(SettleChunk))]
public async Task<ChunkResult> SettleChunk(
    [OrchestrationTrigger] TaskOrchestrationContext context)
{
    var merchants = context.GetInput<string[]>()!;   // <= 500 per chunk
    var retry = TaskOptions.FromRetryPolicy(new RetryPolicy(
        maxNumberOfAttempts: 4,
        firstRetryInterval: TimeSpan.FromSeconds(10),
        backoffCoefficient: 2.0));

    var tasks = merchants
        .Select(m => context.CallActivityAsync<bool>(nameof(Reconcile), m, retry))
        .ToList();

    bool[] results = await Task.WhenAll(tasks);
    return new ChunkResult(merchants.Length, results.Count(ok => ok));
}

The lesson the team wrote into their runbook: fan-out width and activity idempotency are not optional at scale. Durable Functions will happily let you schedule a hundred thousand activities and retry a non-idempotent side effect — and both will bite you in production, not in the demo.
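The chunking step on the parent side is small enough to sketch. This assumes .NET 6 or later for Enumerable.Chunk; FanOutChunks is a hypothetical helper, and 500 is simply the chunk size from the scenario.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class FanOutChunks
{
    // Splits the merchant list into arrays of at most 'chunkSize' items,
    // preserving order so the result is the same on every replay.
    public static List<string[]> Split(IEnumerable<string> merchantIds, int chunkSize = 500)
        => merchantIds.Chunk(chunkSize).ToList();
}
```

With 95,000 merchants this yields 190 chunks, each of which becomes the input to one SettleChunk sub-orchestration. Chunking alone caps the width inside each chunk; if the parent starts all 190 sub-orchestrations at once, the total in flight is still large, so consider awaiting them in waves.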

Verify

Confirm the patterns behave before trusting them in production:

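Determinism. Reading DateTime.UtcNow on its own will not trip the check, because the framework only compares the work your code schedules against history. Temporarily make the orchestrator schedule different work on replay (for example, call a different activity when context.IsReplaying is true), run it, and confirm the instance fails with a non-determinism error so you know the guardrail is active. Remove it.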
Fan-in resilience. Start a fan-out batch, then restart the host (or scale in) mid-run; query the instance with showHistory=true and confirm it resumes and completes the remaining activities only, not all of them.
Human-interaction timeout. Start an approval orchestration and let it hit the deadline without raising the event; confirm it returns TimedOut and the instance reaches a terminal state instead of hanging.
Eternal bounding. Run the periodic monitor for several iterations and check the history length stays flat across ContinueAsNew rather than growing each cycle.
Entity serialization. Fire concurrent Add signals at one entity key and confirm the final count is exact — no lost updates.
Purge. Run the purge API against old terminal instances and confirm the history/instances tables shrink.


Written by Vinod
