
Cosmos DB Multi-Region Writes: Consistency Levels and Conflict Resolution

Multi-region writes are the feature that makes Azure Cosmos DB look like magic in a demo and like a distributed-systems trap in production. The moment two regions can both accept writes for the same logical partition, you have surrendered the comfortable single-writer world and signed up for conflict resolution, weaker consistency, and a much harder mental model. None of that is a reason to avoid it: for globally distributed, write-heavy, low-latency workloads it is the right tool. But you have to configure it deliberately. This guide walks the full path: enabling multi-region writes, picking a consistency level you can actually defend, and building both last-writer-wins and custom conflict resolution that behaves correctly when a region drops.

Everything here assumes the Cosmos DB for NoSQL API. The consistency model is API-agnostic, but conflict-resolution policies and the conflicts feed are specific to the NoSQL API; Cassandra, MongoDB, and Gremlin handle conflicts differently (typically LWW only).

1. Add regions and enable multi-region writes

Multi-region writes (formerly “multi-master”) is an account-level toggle. You first need at least two regions associated with the account, then you flip enableMultipleWriteLocations. Adding regions is an online operation; enabling multi-write is not always online and can briefly affect availability, so do it in a maintenance window the first time.

With Azure CLI:

# Add a second (and third) read region first
az cosmosdb update \
  --name kv-cosmos-prod \
  --resource-group rg-data-prod \
  --locations regionName="East US 2" failoverPriority=0 isZoneRedundant=true \
  --locations regionName="West Europe" failoverPriority=1 isZoneRedundant=true \
  --locations regionName="Southeast Asia" failoverPriority=2 isZoneRedundant=true

# Then enable multi-region writes
az cosmosdb update \
  --name kv-cosmos-prod \
  --resource-group rg-data-prod \
  --enable-multiple-write-locations true

A few things that bite people:

Declaratively in Bicep, which is how this should live in your repo:

resource account 'Microsoft.DocumentDB/databaseAccounts@2024-11-15' = {
  name: 'kv-cosmos-prod'
  location: 'East US 2'
  kind: 'GlobalDocumentDB'
  properties: {
    databaseAccountOfferType: 'Standard'
    enableMultipleWriteLocations: true
    enableAutomaticFailover: true
    consistencyPolicy: {
      defaultConsistencyLevel: 'BoundedStaleness'
      maxStalenessPrefix: 100000
      maxIntervalInSeconds: 300
    }
    locations: [
      { locationName: 'East US 2',     failoverPriority: 0, isZoneRedundant: true }
      { locationName: 'West Europe',    failoverPriority: 1, isZoneRedundant: true }
      { locationName: 'Southeast Asia', failoverPriority: 2, isZoneRedundant: true }
    ]
  }
}

Cost note: provisioned RU/s are billed in every region the account spans, and a multi-region-write account is billed at a higher per-RU/s rate than a single-write account. Three write regions therefore cost at least three times the throughput of a single-region account. Decide whether you genuinely need write locality in all three or whether one or two write regions plus read replicas is enough.

2. The five consistency levels and their tradeoffs

Cosmos DB exposes a tunable, linear consistency spectrum. Stronger is to the left, more available and lower latency to the right:

Strong  >  Bounded Staleness  >  Session  >  Consistent Prefix  >  Eventual

| Level | What it guarantees | Read latency | Write availability on partition | Multi-region writes? |
|---|---|---|---|---|
| Strong | Linearizable; reads see the latest committed write | Highest (cross-region quorum) | Lowest | Not allowed |
| Bounded staleness | Lag bounded by K versions or T seconds; consistent-prefix within the bound | Higher | High | Allowed |
| Session | Read-your-writes, monotonic reads/writes within a session token | Low | High | Allowed |
| Consistent prefix | Never see out-of-order writes; no recency bound | Low | High | Allowed |
| Eventual | Replicas converge eventually; reads may be out of order | Lowest | Highest | Allowed |

The hard constraint: Strong consistency is incompatible with multi-region writes. Linearizability requires a single global ordering of writes, which you cannot have when multiple regions accept writes independently. If you try to enable multi-region writes on a Strong account, the operation is rejected. So the real choice for multi-write accounts is among Bounded Staleness, Session, Consistent Prefix, and Eventual.

The default consistency level is set on the account, but a client can relax (never tighten) it per request. A Session-default account can issue an Eventual read for a cheap, fast lookup; it cannot request Strong.

// Relax to Eventual for a non-critical read (lower RU, lower latency)
var options = new ItemRequestOptions { ConsistencyLevel = ConsistencyLevel.Eventual };
var resp = await container.ReadItemAsync<Product>(
    id, new PartitionKey(tenantId), options);
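
The relax-only rule can be sketched as a comparison against the ordering of the spectrum. This is a minimal illustration in plain JavaScript, not SDK code; `LEVELS` and `canUseLevel` are hypothetical names introduced here.

```javascript
// Consistency levels ordered from strongest to weakest, as in the spectrum above.
const LEVELS = ["Strong", "BoundedStaleness", "Session", "ConsistentPrefix", "Eventual"];

// Hypothetical helper: a request may use the account default or anything weaker.
function canUseLevel(accountDefault, requested) {
  const a = LEVELS.indexOf(accountDefault);
  const r = LEVELS.indexOf(requested);
  if (a === -1 || r === -1) throw new Error("unknown consistency level");
  return r >= a; // same position or further right (weaker) is allowed
}

console.log(canUseLevel("Session", "Eventual")); // true
console.log(canUseLevel("Session", "Strong"));   // false
```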

3. Bounded staleness vs session: choosing per workload

For multi-region writes, the two levels worth most of your attention are Bounded Staleness and Session, because they cover the majority of real requirements without paying full latency cost.

Bounded Staleness gives you a quantified staleness budget. You configure a maximum lag as both a version count (maxStalenessPrefix) and a time window (maxIntervalInSeconds); reads in any region are guaranteed to be no more stale than the tighter of the two. This is the level you want when you need a contractual freshness bound you can put in an SLA: “replicas are never more than 5 minutes behind.” For a multi-region-write account spanning two-plus regions, the minimums are maxStalenessPrefix >= 100000 and maxIntervalInSeconds >= 300. Inside a single region it still behaves like strong consistency, which is a useful property: clients pinned to one region get read-your-writes for free.
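
As a rough pre-deployment guard, the constraints quoted so far (no Strong with multi-region writes, and the Bounded Staleness minimums) could be checked before a template ships. The sketch below is an assumption about how you might do that; `checkMultiWritePolicy` is a hypothetical helper, and the object shape simply mirrors the `consistencyPolicy` block in the Bicep example.

```javascript
// Minimums quoted above for multi-region accounts.
const MIN_PREFIX = 100000;
const MIN_INTERVAL_SECONDS = 300;

// Hypothetical check: returns a list of problems (an empty list means the policy passes).
function checkMultiWritePolicy(policy) {
  const problems = [];
  if (policy.defaultConsistencyLevel === "Strong") {
    problems.push("Strong is not allowed with multi-region writes");
  }
  if (policy.defaultConsistencyLevel === "BoundedStaleness") {
    if (!(policy.maxStalenessPrefix >= MIN_PREFIX)) {
      problems.push("maxStalenessPrefix must be >= " + MIN_PREFIX);
    }
    if (!(policy.maxIntervalInSeconds >= MIN_INTERVAL_SECONDS)) {
      problems.push("maxIntervalInSeconds must be >= " + MIN_INTERVAL_SECONDS);
    }
  }
  return problems;
}

const ok = checkMultiWritePolicy({
  defaultConsistencyLevel: "BoundedStaleness",
  maxStalenessPrefix: 100000,
  maxIntervalInSeconds: 300,
});
console.log(ok.length); // 0
```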

Session is the pragmatic default for most applications, and it is the actual Cosmos DB default. It guarantees consistency within a session — typically one user’s connection — via a session token (x-ms-session-token). The same client that wrote a document will read it back; it gets monotonic reads and writes. The catch is that the guarantee is scoped to the session token. If request A writes in East US 2 and request B (a different client, different token) reads in West Europe a few milliseconds later, B can miss the write. To preserve read-your-writes across tiers, you must flow the session token between services.

// Write returns a session token; capture and propagate it
var write = await container.CreateItemAsync(order, new PartitionKey(order.TenantId));
string sessionToken = write.Headers.Session; // pass to downstream via header/cookie

// A later read in another tier honors that token -> read-your-writes preserved
var read = await container.ReadItemAsync<Order>(
    order.Id, new PartitionKey(order.TenantId),
    new ItemRequestOptions { SessionToken = sessionToken });
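
To see why the token matters, here is a deliberately simplified toy model in JavaScript. It is not the real token format or SDK behavior (the real token is an opaque string, and the SDK handles a lagging replica internally rather than returning a flag). The `Replica` class and the use of a bare sequence number as the token are assumptions made for illustration.

```javascript
// Toy replica: a key-value store plus the highest sequence number it has applied.
class Replica {
  constructor(name) { this.name = name; this.lsn = 0; this.data = new Map(); }
  apply(lsn, id, doc) { this.data.set(id, doc); this.lsn = Math.max(this.lsn, lsn); }
  // Serve a read only if this replica has caught up to the caller's token.
  read(id, sessionLsn = 0) {
    if (this.lsn < sessionLsn) return { ok: false, reason: "replica behind session token" };
    return { ok: true, doc: this.data.get(id) };
  }
}

const eastUs2 = new Replica("East US 2");
const westEurope = new Replica("West Europe");

// Request A writes in East US 2 and receives a token (here just the sequence number).
eastUs2.apply(1, "order-1", { status: "created" });
const sessionToken = 1;

// Request B reads in West Europe before replication has arrived.
const withoutToken = westEurope.read("order-1");            // served, but the doc is missing
const withToken = westEurope.read("order-1", sessionToken); // refused: replica is behind

// After replication, the token-carrying read succeeds.
westEurope.apply(1, "order-1", { status: "created" });
const afterReplication = westEurope.read("order-1", sessionToken);
```

The read without a token is the failure mode described above: it succeeds but silently misses the write, which is why the token has to travel with the request.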

Rule of thumb I apply:

4. Conflict types under multi-region writes

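With multiple write regions, two clients can mutate the same document (same id + partition key) concurrently in different regions. When replication brings those versions together, Cosmos DB detects a conflict. There are three kinds: insert conflicts (two regions create items with the same id or unique-index value), replace conflicts (two regions update the same item), and delete conflicts (one region deletes an item while another updates it).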

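How a conflict surfaces depends entirely on the conflict-resolution policy you set on the container: with Last Writer Wins, Cosmos DB picks a winner automatically and the losing version is discarded; with Custom and a registered stored procedure, the procedure decides the outcome; with Custom and no procedure, the conflicting versions are written to the conflicts feed for your application to resolve.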

You set the policy at container creation. It cannot be changed after creation through most SDKs/portal, so choose deliberately — switching strategy generally means a new container and a migration.

5. Last-writer-wins with a custom path property

The default LWW policy resolves on the system property _ts (last-modified timestamp, second granularity). Second granularity is coarse: two writes in the same second tie, and Cosmos picks deterministically but not in a way you control. For correctness you often want LWW over a property you own — a monotonic version number, an epoch-millis timestamp, or a sequence assigned by your write path.

# Create a container with LWW resolving on a custom numeric path.
# The policy is passed as a single JSON value (inline string or @policy-file.json).
az cosmosdb sql container create \
  --account-name kv-cosmos-prod \
  --resource-group rg-data-prod \
  --database-name shop \
  --name orders \
  --partition-key-path "/tenantId" \
  --conflict-resolution-policy '{"mode": "lastWriterWins", "conflictResolutionPath": "/version"}'

The path must point to a numeric field; the document with the higher value wins. Keep these invariants or LWW will silently lose data:

Equivalent in Bicep, which is where this belongs for reproducibility:

resource ordersContainer 'Microsoft.DocumentDB/databaseAccounts/sqlDatabases/containers@2024-11-15' = {
  parent: shopDatabase
  name: 'orders'
  properties: {
    resource: {
      id: 'orders'
      partitionKey: { paths: [ '/tenantId' ], kind: 'Hash' }
      conflictResolutionPolicy: {
        mode: 'LastWriterWins'
        conflictResolutionPath: '/version'
      }
    }
  }
}
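
The LWW rule on a custom path can be sketched in a few lines of plain JavaScript. This is an illustration of the comparison only, not how Cosmos DB implements it; `lwwWinner` is a hypothetical helper, and the path is simplified to a top-level property name.

```javascript
// Pick the winner the way the LWW rule describes: the higher numeric value wins.
// `path` is a top-level property name here (e.g. "version" for "/version").
function lwwWinner(a, b, path) {
  const va = a[path], vb = b[path];
  if (typeof va !== "number" || typeof vb !== "number") {
    throw new Error("conflict resolution path must hold a number");
  }
  if (va === vb) return null; // tie: this sketch cannot decide the outcome
  return va > vb ? a : b;
}

// Two writes that land in the same second: _ts ties, an app-owned version does not.
const authorize = { id: "p1", status: 1, _ts: 1700000000, version: 41 };
const capture   = { id: "p1", status: 2, _ts: 1700000000, version: 42 };

const byTs = lwwWinner(authorize, capture, "_ts");          // null (tie)
const byVersion = lwwWinner(authorize, capture, "version"); // capture
```

An app-owned counter only helps if the write path keeps it comparable across regions; two regions that each increment 41 to 42 independently would tie just like `_ts`.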

6. Custom conflict resolution via stored procedure and the conflicts feed

When LWW is too blunt — you need to merge concurrent edits, or apply business rules about which write wins — switch to custom resolution. There are two flavors.

(a) Stored-procedure resolution. You register a JavaScript sproc as the resolver. On every conflict Cosmos invokes it with the incoming document (null when the incoming operation is a delete), the existing committed document, a tombstone flag, and any other committed documents that conflict with the incoming one on id or a unique index. Your sproc decides the final state and writes it. The sproc signature is fixed:

// resolver sproc: merges line items, keeps the max status rank
function resolver(incomingItem, existingItem, isTombstone, conflictingItems) {
  var collection = getContext().getCollection();
  var response = getContext().getResponse();

  // incomingItem is null when the incoming operation is a delete.
  // isTombstone === true means the incoming item conflicts with an item that
  // was already deleted; existingItem is null in that case.
  if (!incomingItem) {
    // Incoming delete wins: remove the committed item if one exists
    if (existingItem) {
      collection.deleteDocument(existingItem._self, {},
        function (e) { if (e) throw e; });
    }
    return;
  }

  // Start from the committed item, or from the incoming one on an insert
  // conflict, so the partition key and other fields are preserved.
  var resolved = existingItem || incomingItem;
  resolved.lineItems = mergeById(
    (existingItem && existingItem.lineItems) || [],
    incomingItem.lineItems || []);
  resolved.status = Math.max(
    (existingItem && existingItem.status) || 0,
    incomingItem.status || 0);
  resolved.id = incomingItem.id;

  // Other committed items that conflict with the incoming one are folded in too
  (conflictingItems || []).forEach(function (c) {
    resolved.lineItems = mergeById(resolved.lineItems, c.lineItems || []);
    resolved.status = Math.max(resolved.status, c.status || 0);
  });

  collection.upsertDocument(collection.getSelfLink(), resolved,
    function (e) { if (e) throw e; });
  response.setBody(resolved);

  function mergeById(a, b) { /* union by line id, prefer higher qty */
    var m = {};
    a.concat(b).forEach(function (x) {
      if (!m[x.id] || x.qty > m[x.id].qty) m[x.id] = x;
    });
    return Object.keys(m).map(function (k) { return m[k]; });
  }
}
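
Because the merge helper is plain JavaScript, it can be exercised locally in Node before it ever runs inside Cosmos DB. The sketch below uses a standalone copy of `mergeById`; the sample line items are hypothetical.

```javascript
// Standalone copy of the resolver's merge helper: union by line id, prefer higher qty.
function mergeById(a, b) {
  const byId = {};
  a.concat(b).forEach(function (x) {
    if (!byId[x.id] || x.qty > byId[x.id].qty) byId[x.id] = x;
  });
  return Object.keys(byId).map(function (k) { return byId[k]; });
}

// Hypothetical line items written concurrently in two regions.
const east = [{ id: "sku-1", qty: 1 }, { id: "sku-2", qty: 3 }];
const west = [{ id: "sku-1", qty: 2 }, { id: "sku-3", qty: 1 }];

const merged = mergeById(east, west);
const mergedReversed = mergeById(west, east);

// Sort by id so the two results can be compared regardless of insertion order.
const norm = (items) => items.slice().sort((p, q) => p.id.localeCompare(q.id));
const sameEitherWay =
  JSON.stringify(norm(merged)) === JSON.stringify(norm(mergedReversed));
```

For this sample the result is the same whichever side is merged first. That property is worth checking for any resolver, since regions may encounter the conflicting versions in different orders.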

Create the container with its policy pointing at the sproc, then register the sproc inside it:

# 1) Create the container with a Custom policy that names the resolver sproc.
#    The policy is passed as a single JSON value.
az cosmosdb sql container create \
  --account-name kv-cosmos-prod --resource-group rg-data-prod \
  --database-name shop --name orders \
  --partition-key-path "/tenantId" \
  --conflict-resolution-policy '{"mode": "custom", "conflictResolutionProcedure": "dbs/shop/colls/orders/sprocs/resolver"}'

# 2) Register the sproc in that container under the same name
az cosmosdb sql stored-procedure create \
  --account-name kv-cosmos-prod \
  --resource-group rg-data-prod \
  --database-name shop \
  --container-name orders \
  --name resolver \
  --body @resolver.js

Key constraints on the resolver sproc:

(b) Manual resolution via the conflicts feed. Set the policy to Custom with no resolver procedure. Now Cosmos writes every conflicting version to the per-container conflicts feed and stops. Your application drains it and resolves on its own terms.

// Drain the conflicts feed and resolve in application code
using var iterator = container.Conflicts.GetConflictQueryIterator<ConflictProperties>();
while (iterator.HasMoreResults)
{
    foreach (var conflict in await iterator.ReadNextAsync())
    {
        // The losing version that landed in the feed
        Order conflicting = container.Conflicts.ReadConflictContent<Order>(conflict);
        // The currently committed version
        Order committed = await container.ReadItemAsync<Order>(
            conflicting.Id, new PartitionKey(conflicting.TenantId));

        Order winner = Merge(committed, conflicting); // your business rule
        await container.ReplaceItemAsync(winner, winner.Id, new PartitionKey(winner.TenantId));

        // Delete the entry from the feed once handled
        await container.Conflicts.DeleteAsync(conflict, new PartitionKey(conflicting.TenantId));
    }
}

Manual mode is the most flexible and the most operationally demanding: if nobody drains the feed, conflicts accumulate and your data quietly diverges from what users expect. Run the drainer as a continuously scheduled job and alert if the feed depth grows.
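
One way to turn "alert if the feed depth grows" into a rule is sketched below. It assumes your drainer records the number of entries it sees on each run; `shouldAlert` and the window size are hypothetical choices, not anything Cosmos DB provides.

```javascript
// Hypothetical check: given feed-depth samples (oldest first), decide whether to alert.
// Alert when the feed was never empty across the window and is not shrinking.
function shouldAlert(depthSamples, windowSize) {
  if (depthSamples.length < windowSize) return false; // not enough data yet
  const recent = depthSamples.slice(-windowSize);
  const neverEmpty = recent.every((d) => d > 0);
  const notShrinking = recent[recent.length - 1] >= recent[0];
  return neverEmpty && notShrinking;
}

console.log(shouldAlert([0, 1, 3, 5], 3)); // true: backlog is building
console.log(shouldAlert([5, 4, 2], 3));    // false: drainer is catching up
```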

7. Automatic vs manual failover and testing outages

Two independent settings govern regional failover:

Trigger a controlled failover to rehearse an outage. This actually reprioritizes regions; run it in a test account or a planned window:

# Promote West Europe to priority 0 (simulate losing East US 2 as primary)
az cosmosdb failover-priority-change \
  --name kv-cosmos-prod \
  --resource-group rg-data-prod \
  --failover-policies "West Europe=0" "East US 2=1" "Southeast Asia=2"

On the client side, your CosmosClient should be configured with an explicit preferred-regions list so it fails over locally without a config change:

var client = new CosmosClient(connectionString, new CosmosClientOptions
{
    ApplicationPreferredRegions = new List<string>
    {
        "East US 2", "West Europe", "Southeast Asia"  // ordered preference
    },
    ConnectionMode = ConnectionMode.Direct
});

With ApplicationPreferredRegions set, the SDK automatically retries the next region on a regional failure — you do not redeploy to fail over. Test this for real: block egress to the primary region’s Cosmos endpoint (NSG rule or local firewall) and confirm your service keeps serving from the next region within the SDK’s retry window.
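
The ordered-preference idea can be modeled in a few lines. This is a toy illustration of "take the first reachable region in the list", not the SDK's actual retry logic; `firstHealthyRegion` and the `isReachable` callback are hypothetical.

```javascript
// Toy model of ordered-region preference: pick the first region that is reachable.
function firstHealthyRegion(preferredRegions, isReachable) {
  for (const region of preferredRegions) {
    if (isReachable(region)) return region;
  }
  return null; // nothing reachable
}

const preferred = ["East US 2", "West Europe", "Southeast Asia"];

// Simulate East US 2 being blocked, as in the egress test described above.
const blocked = new Set(["East US 2"]);
const chosen = firstHealthyRegion(preferred, (r) => !blocked.has(r));
console.log(chosen); // West Europe
```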

8. Validating RPO/RTO and monitoring replication latency

Numbers you should be able to quote for a multi-region-write account:

Monitor replication latency continuously. The relevant metric is Replication Latency (P50/P99 by source/target region) in Azure Monitor:

// P99 cross-region replication latency, by region pair, last 6h
AzureMetrics
| where ResourceProvider == "MICROSOFT.DOCUMENTDB"
| where MetricName == "ReplicationLatency"
| where TimeGenerated > ago(6h)
| summarize p99 = percentile(Average, 99) by bin(TimeGenerated, 5m), Resource
| order by TimeGenerated desc
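
If you export latency samples and want to sanity-check a chart or an alert threshold offline, a nearest-rank percentile is enough. This sketch is an assumption about how you might do that locally; it is not what Azure Monitor computes internally, and `percentileNearestRank` is a hypothetical helper.

```javascript
// Nearest-rank percentile over a list of latency samples (milliseconds).
function percentileNearestRank(samples, p) {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = samples.slice().sort((a, b) => a - b);
  // Integer arithmetic first to avoid floating-point surprises in the rank.
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1];
}

// Hypothetical samples: 1 ms through 100 ms.
const samples = Array.from({ length: 100 }, (_, i) => i + 1);
console.log(percentileNearestRank(samples, 50)); // 50
console.log(percentileNearestRank(samples, 99)); // 99
```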

Also alert on the conflict path so silent divergence cannot hide:

// Surfacing custom/manual conflict activity
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.DOCUMENTDB"
| where Category == "DataPlaneRequests"
| where OperationName has "Conflict"
| summarize count() by bin(TimeGenerated, 15m), requestResourceType_s

Verify

Concrete checks before you call this production-ready:

  1. Multi-write is actually on. az cosmosdb show -n kv-cosmos-prod -g rg-data-prod --query "enableMultipleWriteLocations" returns true, and writeLocations lists every region.
  2. Consistency is what you intended. az cosmosdb show ... --query "consistencyPolicy" shows the level plus, for Bounded Staleness, maxStalenessPrefix >= 100000 and maxIntervalInSeconds >= 300.
  3. The conflict policy is bound. az cosmosdb sql container show --account-name kv-cosmos-prod -g rg-data-prod -d shop -n orders --query "resource.conflictResolutionPolicy" shows your mode and either conflictResolutionPath or conflictResolutionProcedure.
  4. Conflicts resolve as designed. Disable replication briefly (or use the emulator’s multi-region mode), write the same document in two regions with diverging values, reconnect, and confirm the winner matches your LWW path or sproc output.
  5. Failover is transparent. Run az cosmosdb failover-priority-change (or block the primary endpoint) and confirm the client keeps serving via ApplicationPreferredRegions with no redeploy.
  6. Latency and conflicts are observed. The ReplicationLatency chart is populated and you have an alert on P99 plus on conflict-feed activity.

Enterprise scenario

A global payments platform ran a payments-ledger container with three write regions (East US 2, West Europe, Southeast Asia) at Session consistency. Their idempotency layer keyed on a client-supplied paymentId, and the write path did a read-modify-write to advance a status field (0=pending, 1=authorized, 2=captured, 3=refunded). During a partial network partition between East US 2 and West Europe, a retrying client authorized a payment in West Europe while a parallel capture landed in East US 2. The container used default LWW on _ts. Because both writes fell in the same second, _ts tied, Cosmos picked the authorize as the winner, and the capture was silently discarded — money moved, ledger said “authorized.” They found it only because daily reconciliation against the processor disagreed.

The constraint: they could not tolerate any state-machine regression, and they could not afford to drop to single-region writes (latency SLOs in APAC). The fix was a custom resolver sproc that resolves on the business state machine instead of a timestamp — the higher status rank always wins, and a refund (3) is terminal:

function resolveLedger(incoming, existing, isTombstone, conflicts) {
  var ctx = getContext(), coll = ctx.getCollection(), res = ctx.getResponse();
  var all = [existing, incoming].concat(conflicts || []).filter(Boolean);
  // Terminal states win; otherwise the highest status rank wins.
  var winner = all.reduce(function (best, c) {
    if (best === null) return c;
    if (c.status === 3) return c;            // refund is absorbing
    return (c.status > best.status) ? c : best;
  }, null);
  coll.upsertDocument(coll.getSelfLink(), winner, function (e) { if (e) throw e; });
  res.setBody(winner);
}

They also moved the LWW-style fields they could safely auto-merge (audit tags, last-touched-by) into the same sproc so nothing fell back to _ts. Post-change, a six-month reconciliation run showed zero ledger regressions, and the conflicts feed alert (depth > 0 for more than five minutes) gave them an early-warning signal they had been missing entirely. The lesson the team wrote into their design guide: on a multi-region-write account, the conflict-resolution policy is part of your data model, not an afterthought — and default LWW on _ts is almost never correct for stateful, ordered domains.
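
The winner-selection rule from that resolver is easy to check outside Cosmos DB. The sketch below extracts it as a pure function and replays the authorize/capture race; `pickWinner` and the sample documents are hypothetical, and the status codes follow the scenario's numbering.

```javascript
// The winner-selection rule from the ledger resolver, extracted as a pure function.
function pickWinner(versions) {
  return versions.filter(Boolean).reduce(function (best, c) {
    if (best === null) return c;
    if (c.status === 3) return c;            // refund is absorbing
    return (c.status > best.status) ? c : best;
  }, null);
}

// Hypothetical versions of one payment: 1 = authorized, 2 = captured.
const authorized = { id: "pay-1", status: 1, _ts: 1700000000 };
const captured   = { id: "pay-1", status: 2, _ts: 1700000000 };

// Same result whichever region's write is treated as "existing".
const w1 = pickWinner([authorized, captured]);
const w2 = pickWinner([captured, authorized]);
console.log(w1.status, w2.status); // 2 2
```

Unlike LWW on `_ts`, the tied timestamps play no part in the outcome here: the capture wins in both orderings.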

Checklist
