Multi-region writes are the feature that makes Azure Cosmos DB look like magic in a demo and like a distributed-systems trap in production. The moment two regions can both accept writes for the same logical partition, you have surrendered the comfortable single-writer world and signed up for conflict resolution, weaker consistency, and a much harder mental model. None of that is a reason to avoid it: for globally distributed, write-heavy, low-latency workloads it is the right tool. But you have to configure it deliberately. This guide walks the full path: enabling multi-region writes, picking a consistency level you can actually defend, and building both last-writer-wins and custom conflict resolution that behaves correctly when a region drops.
Everything here assumes the Cosmos DB for NoSQL API. The consistency model is API-agnostic, but conflict-resolution policies and the conflicts feed are specific to the NoSQL API; Cassandra, MongoDB, and Gremlin handle conflicts differently (typically LWW only).
1. Add regions and enable multi-region writes
Multi-region writes (formerly “multi-master”) is an account-level toggle. You first need at least two regions associated with the account, then you flip enableMultipleWriteLocations. Adding regions is an online operation; enabling multi-write is not always online and can briefly affect availability, so do it in a maintenance window the first time.
With Azure CLI:
# Add a second (and third) read region first
az cosmosdb update \
--name kv-cosmos-prod \
--resource-group rg-data-prod \
--locations regionName="East US 2" failoverPriority=0 isZoneRedundant=true \
--locations regionName="West Europe" failoverPriority=1 isZoneRedundant=true \
--locations regionName="Southeast Asia" failoverPriority=2 isZoneRedundant=true
# Then enable multi-region writes
az cosmosdb update \
--name kv-cosmos-prod \
--resource-group rg-data-prod \
--enable-multiple-write-locations true
A few things that bite people:
- failoverPriority=0 is the write region under single-write; on automatic failover, the region with the next priority is promoted. Priorities must be unique and contiguous, starting at 0.
- Once multi-region writes are on, every region is a write region; failoverPriority then only governs the order in which regions are reprioritized during automatic failover, not who can write.
- Zone redundancy (isZoneRedundant) is per region and can only be set when the region is added. You cannot toggle it in place later without removing and re-adding the region.
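The priority rule is easy to check before a deployment ever reaches Azure. The following is a minimal sketch, assuming a locations array shaped like the Bicep `locations` property; `validateFailoverPriorities` is a hypothetical helper, not part of any Azure SDK or CLI.

```javascript
// Hypothetical pre-deployment check: failoverPriority values must be unique
// and contiguous starting at 0. Returns a list of human-readable errors.
function validateFailoverPriorities(locations) {
  const errors = [];
  const priorities = locations.map((l) => l.failoverPriority);
  const unique = new Set(priorities);
  if (unique.size !== priorities.length) {
    errors.push('failoverPriority values must be unique');
  }
  // After sorting, the value at index i must equal i (0, 1, 2, ...).
  const sorted = [...unique].sort((a, b) => a - b);
  sorted.forEach((p, i) => {
    if (p !== i) errors.push(`expected priority ${i}, found ${p}`);
  });
  return errors;
}

const ok = validateFailoverPriorities([
  { locationName: 'East US 2', failoverPriority: 0 },
  { locationName: 'West Europe', failoverPriority: 1 },
  { locationName: 'Southeast Asia', failoverPriority: 2 },
]);
const gap = validateFailoverPriorities([
  { locationName: 'East US 2', failoverPriority: 0 },
  { locationName: 'West Europe', failoverPriority: 2 },
]);
console.log(ok.length);  // 0
console.log(gap[0]);     // expected priority 1, found 2
```

Running a check like this in CI catches a skipped or duplicated priority before the update call has a chance to reject it.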
Declaratively in Bicep, which is how this should live in your repo:
resource account 'Microsoft.DocumentDB/databaseAccounts@2024-11-15' = {
name: 'kv-cosmos-prod'
location: 'East US 2'
kind: 'GlobalDocumentDB'
properties: {
databaseAccountOfferType: 'Standard'
enableMultipleWriteLocations: true
enableAutomaticFailover: true
consistencyPolicy: {
defaultConsistencyLevel: 'BoundedStaleness'
maxStalenessPrefix: 100000
maxIntervalInSeconds: 300
}
locations: [
{ locationName: 'East US 2', failoverPriority: 0, isZoneRedundant: true }
{ locationName: 'West Europe', failoverPriority: 1, isZoneRedundant: true }
{ locationName: 'Southeast Asia', failoverPriority: 2, isZoneRedundant: true }
]
}
}
Cost note: enabling multi-region writes roughly multiplies your provisioned RU/s cost by the number of write regions for replication, because writes replicate everywhere. Three write regions is three times the write throughput cost. Decide whether you genuinely need write locality in all three or whether one or two write regions plus read replicas is enough.
2. The five consistency levels and their tradeoffs
Cosmos DB exposes a tunable, linear consistency spectrum. Stronger is to the left, more available and lower latency to the right:
Strong > Bounded Staleness > Session > Consistent Prefix > Eventual
| Level | What it guarantees | Read latency | Write availability on partition | Multi-region writes? |
|---|---|---|---|---|
| Strong | Linearizable; reads see the latest committed write | Highest (cross-region quorum) | Lowest | Not allowed |
| Bounded staleness | Lag bounded by K versions or T seconds; consistent-prefix within the bound | Higher | High | Allowed |
| Session | Read-your-writes, monotonic reads/writes within a session token | Low | High | Allowed |
| Consistent prefix | Never see out-of-order writes; no recency bound | Low | High | Allowed |
| Eventual | Replicas converge eventually; reads may be out of order | Lowest | Highest | Allowed |
The hard constraint: Strong consistency is incompatible with multi-region writes. Linearizability requires a single global ordering of writes, which you cannot have when multiple regions accept writes independently. If you try to enable multi-region writes on a Strong account, the operation is rejected. So the real choice for multi-write accounts is among Bounded Staleness, Session, Consistent Prefix, and Eventual.
The default consistency level is set on the account, but a client can relax (never tighten) it per request. A Session-default account can issue an Eventual read for a cheap, fast lookup; it cannot request Strong.
// Relax to Eventual for a non-critical read (lower RU, lower latency)
var options = new ItemRequestOptions { ConsistencyLevel = ConsistencyLevel.Eventual };
var resp = await container.ReadItemAsync<Product>(
id, new PartitionKey(tenantId), options);
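The relax-only rule amounts to a comparison on the ordered spectrum. A small sketch of that rule follows; `LEVELS` and `canOverride` are illustrative names of my own, not SDK identifiers, and the real service performs this check for you.

```javascript
// Ordered strongest to weakest, matching the spectrum in this section.
const LEVELS = ['Strong', 'BoundedStaleness', 'Session', 'ConsistentPrefix', 'Eventual'];

// Hypothetical helper (not an SDK call): true when `requested` is the same as,
// or weaker than, the account default, i.e. a legal per-request override.
function canOverride(accountDefault, requested) {
  const d = LEVELS.indexOf(accountDefault);
  const r = LEVELS.indexOf(requested);
  if (d === -1 || r === -1) throw new Error('unknown consistency level');
  return r >= d;
}

console.log(canOverride('Session', 'Eventual')); // true
console.log(canOverride('Session', 'Strong'));   // false
```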
3. Bounded staleness vs session: choosing per workload
For multi-region writes, the two levels worth most of your attention are Bounded Staleness and Session, because they cover the majority of real requirements without paying full latency cost.
Bounded Staleness gives you a quantified staleness budget. You configure a maximum lag as both a version count (maxStalenessPrefix) and a time window (maxIntervalInSeconds); reads in any region are guaranteed to be no more stale than the tighter of the two. This is the level you want when you need a contractual freshness bound you can put in an SLA: “replicas are never more than 5 minutes behind.” For a multi-region-write account spanning two-plus regions, the minimums are maxStalenessPrefix >= 100000 and maxIntervalInSeconds >= 300. Inside a single region it still behaves like strong consistency, which is a useful property: clients pinned to one region get read-your-writes for free.
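To make the two-part bound concrete, here is a toy model of it, assuming a policy object shaped like the Bicep `consistencyPolicy` above. `validateBoundedStaleness` and `isWithinBound` are hypothetical helpers for reasoning about the numbers; they do not query Cosmos DB.

```javascript
// Minimums quoted above for accounts spanning more than one region.
const MIN_PREFIX = 100000;
const MIN_INTERVAL_SECONDS = 300;

// Hypothetical validator for a consistencyPolicy-shaped object.
function validateBoundedStaleness(policy) {
  return policy.maxStalenessPrefix >= MIN_PREFIX &&
    policy.maxIntervalInSeconds >= MIN_INTERVAL_SECONDS;
}

// Toy model of the guarantee: a read is inside the bound only while it trails
// the latest write by at most K versions AND at most T seconds.
function isWithinBound(lagVersions, lagSeconds, policy) {
  return lagVersions <= policy.maxStalenessPrefix &&
    lagSeconds <= policy.maxIntervalInSeconds;
}

const policy = { maxStalenessPrefix: 100000, maxIntervalInSeconds: 300 };
console.log(validateBoundedStaleness(policy));   // true
console.log(isWithinBound(5000, 120, policy));   // true
console.log(isWithinBound(5000, 301, policy));   // false: time bound exceeded
console.log(isWithinBound(100001, 10, policy));  // false: version bound exceeded
```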
Session is the pragmatic default for most applications, and it is the actual Cosmos DB default. It guarantees consistency within a session — typically one user’s connection — via a session token (x-ms-session-token). The same client that wrote a document will read it back; it gets monotonic reads and writes. The catch is that the guarantee is scoped to the session token. If request A writes in East US 2 and request B (a different client, different token) reads in West Europe a few milliseconds later, B can miss the write. To preserve read-your-writes across tiers, you must flow the session token between services.
// Write returns a session token; capture and propagate it
var write = await container.CreateItemAsync(order, new PartitionKey(order.TenantId));
string sessionToken = write.Headers.Session; // pass to downstream via header/cookie
// A later read in another tier honors that token -> read-your-writes preserved
var read = await container.ReadItemAsync<Order>(
order.Id, new PartitionKey(order.TenantId),
new ItemRequestOptions { SessionToken = sessionToken });
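Why the token matters is easier to see in a toy model. The sketch below is a deliberate simplification: real session tokens are opaque strings and the SDK retries rather than refusing, whereas here a bare sequence number stands in for the token and every function name is hypothetical.

```javascript
// Toy model: each region has applied writes up to some log sequence number.
function makeRegion(name) {
  return { name, lsn: 0, docs: new Map() };
}

function write(region, id, value) {
  region.lsn += 1;
  region.docs.set(id, value);
  return region.lsn; // stand-in for the session token returned by a write
}

function replicate(from, to) {
  for (const [id, value] of from.docs) to.docs.set(id, value);
  to.lsn = Math.max(to.lsn, from.lsn);
}

// A read that carries a token is only served once the region has caught up.
function read(region, id, sessionToken = 0) {
  if (region.lsn < sessionToken) return { status: 'not-caught-up' };
  return { status: 'ok', value: region.docs.get(id) };
}

const east = makeRegion('East US 2');
const west = makeRegion('West Europe');
const token = write(east, 'order-1', { status: 'placed' });

const withoutToken = read(west, 'order-1');        // served, but the write is missing
const withToken = read(west, 'order-1', token);    // stale answer is refused
replicate(east, west);
const afterReplication = read(west, 'order-1', token);

console.log(withoutToken.value);             // undefined
console.log(withToken.status);               // not-caught-up
console.log(afterReplication.value.status);  // placed
```

The read without a token is the failure mode described above: it succeeds and silently returns a view that predates the write.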
Rule of thumb I apply:
- Session when the workload is per-user and you control token propagation. Cheapest correct option.
- Bounded Staleness when multiple independent readers need a bounded global freshness guarantee, or when a downstream consumer (analytics, a cache warmer) cannot carry session tokens.
- Consistent Prefix / Eventual only for genuinely tolerant data (counters you reconcile, telemetry, feeds) where you want the lowest latency and highest availability and you have an out-of-band reconciliation story.
4. Conflict types under multi-region writes
With multiple write regions, two clients can mutate the same document (same id + partition key) concurrently in different regions. When replication brings those versions together, Cosmos DB detects a conflict. There are three kinds:
- Insert conflict: two regions create a document with the same id/partition key.
- Replace/update conflict: two regions update the same existing document concurrently.
- Delete conflict: one region deletes a document another region is updating.
How a conflict surfaces depends entirely on the conflict-resolution policy you set on the container:
- Last-Writer-Wins (LWW): the default. Cosmos resolves conflicts automatically and silently using a numeric path (default _ts). The winner is committed; losers are discarded and never appear in the conflicts feed.
- Custom (stored procedure): your registered sproc resolves each conflict.
- Custom (manual / no sproc): Cosmos does not auto-resolve. Conflicting versions are written to a conflicts feed and your application must read it and resolve them.
You set the policy at container creation. It cannot be changed after creation through most SDKs/portal, so choose deliberately — switching strategy generally means a new container and a migration.
5. Last-writer-wins with a custom path property
The default LWW policy resolves on the system property _ts (last-modified timestamp, second granularity). Second granularity is coarse: two writes in the same second tie, and Cosmos picks deterministically but not in a way you control. For correctness you often want LWW over a property you own — a monotonic version number, an epoch-millis timestamp, or a sequence assigned by your write path.
# Create a container with LWW resolving on a custom numeric path
az cosmosdb sql container create \
--account-name kv-cosmos-prod \
--resource-group rg-data-prod \
--database-name shop \
--name orders \
--partition-key-path "/tenantId" \
--conflict-resolution-policy '{"mode": "lastWriterWins", "conflictResolutionPath": "/version"}'
The path must point to a numeric field; the document with the higher value wins. Keep these invariants or LWW will silently lose data:
- The path is always present and numeric on every write. A missing path is treated as 0.
- The value is monotonically increasing per logical document. If you use client clocks, skew between regions becomes data loss — prefer a value you can guarantee increases (a version counter incremented on read-modify-write, or a hybrid logical clock).
- Ties resolve deterministically but arbitrarily. Make the value unique enough to avoid ties on writes you care about.
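The clock-skew invariant is the one that costs people data, so it is worth seeing in miniature. This is a sketch of the "higher value wins" rule only; `lwwWinner` is a hypothetical function, and its tie handling (keep the first argument) is an arbitrary choice for illustration, not Cosmos DB's rule.

```javascript
// Hypothetical resolver mirroring the LWW rule: the version with the higher
// numeric value at `path` wins. On a tie this sketch keeps `a`.
function lwwWinner(a, b, path = 'version') {
  return b[path] > a[path] ? b : a;
}

// Version counter: the later logical write carries the higher number and wins.
const v1 = { id: 'o1', status: 'authorized', version: 7 };
const v2 = { id: 'o1', status: 'captured', version: 8 };
console.log(lwwWinner(v1, v2).status); // captured

// Client clocks: the region whose clock runs fast wins even though its write
// happened first, so the capture that really came later is discarded.
const fastClock = { id: 'o1', status: 'authorized', version: 1700000002000 };
const trueLater = { id: 'o1', status: 'captured', version: 1700000001000 };
console.log(lwwWinner(fastClock, trueLater).status); // authorized
```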
Equivalent in Bicep, which is where this belongs for reproducibility:
resource ordersContainer 'Microsoft.DocumentDB/databaseAccounts/sqlDatabases/containers@2024-11-15' = {
parent: shopDatabase
name: 'orders'
properties: {
resource: {
id: 'orders'
partitionKey: { paths: [ '/tenantId' ], kind: 'Hash' }
conflictResolutionPolicy: {
mode: 'LastWriterWins'
conflictResolutionPath: '/version'
}
}
}
}
6. Custom conflict resolution via stored procedure and the conflicts feed
When LWW is too blunt — you need to merge concurrent edits, or apply business rules about which write wins — switch to custom resolution. There are two flavors.
(a) Stored-procedure resolution. You register a JavaScript sproc as the resolver. On every conflict Cosmos invokes it with the incoming document, the existing committed document, and any documents already in the conflicts feed. Your sproc decides the final state and writes it. The sproc signature is fixed:
// resolver sproc: merges line items, keeps the max status rank
function resolver(incomingItem, existingItem, isTombstone, conflictingItems) {
var collection = getContext().getCollection();
var response = getContext().getResponse();
// incomingItem is null when the incoming operation is a delete.
// isTombstone === true means the incoming write conflicts with an item that was already deleted.
// Start from a full document so the partition key and other fields survive the merge.
var resolved = existingItem || incomingItem || {};
if (incomingItem) {
resolved.lineItems = mergeById(
(existingItem && existingItem.lineItems) || [],
incomingItem.lineItems || []);
resolved.status = Math.max(
(existingItem && existingItem.status) || 0,
incomingItem.status || 0);
resolved.id = incomingItem.id;
}
// Conflicting versions sitting in the feed must be folded in too
(conflictingItems || []).forEach(function (c) {
resolved.lineItems = mergeById(resolved.lineItems || [], c.lineItems || []);
resolved.status = Math.max(resolved.status || 0, c.status || 0);
});
if (!incomingItem) {
// Incoming delete: remove the committed document, addressed by its _self link
if (existingItem) {
collection.deleteDocument(existingItem._self, {}, function (e) { if (e) throw e; });
}
} else if (isTombstone) {
// The item was already deleted elsewhere; this resolver lets the delete win
} else {
collection.upsertDocument(collection.getSelfLink(), resolved,
function (e) { if (e) throw e; });
}
response.setBody(resolved);
function mergeById(a, b) { /* union by line id, prefer higher qty */
var m = {};
a.concat(b).forEach(function (x) {
if (!m[x.id] || x.qty > m[x.id].qty) m[x.id] = x;
});
return Object.keys(m).map(function (k) { return m[k]; });
}
}
Create the container with its policy pointing at the sproc, then register the sproc. The container has to exist before anything can be registered in it, so the policy names the procedure first:
# 1) Create the container with a Custom policy that names the resolver sproc
az cosmosdb sql container create \
--account-name kv-cosmos-prod --resource-group rg-data-prod \
--database-name shop --name orders \
--partition-key-path "/tenantId" \
--conflict-resolution-policy '{"mode": "custom", "conflictResolutionProcedure": "dbs/shop/colls/orders/sprocs/resolver"}'
# 2) Register the sproc in the container under the name the policy references
az cosmosdb sql stored-procedure create \
--account-name kv-cosmos-prod \
--resource-group rg-data-prod \
--database-name shop \
--container-name orders \
--name resolver \
--body @resolver.js
Key constraints on the resolver sproc:
- It is scoped to a single partition key per invocation; it cannot resolve across partitions.
- If the sproc throws or is missing, the conflict is routed to the conflicts feed instead of being lost — a safety net, not a happy path. Monitor for it.
- It must be idempotent and deterministic. Cosmos may invoke it more than once for the same conflict; non-deterministic logic produces divergent regional state.
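Idempotence and determinism are properties you can test off-box before the sproc ever runs in Cosmos DB. The sketch below reuses the merge rule from the resolver (union by line id, keep the higher qty) as a standalone function and sorts the output by id, which is my own addition so that the result does not depend on input order.

```javascript
// Same merge rule as the resolver: union by line id, keep the higher qty.
// Sorting by id makes the output independent of the order regions are seen.
function mergeById(a, b) {
  const m = {};
  a.concat(b).forEach((x) => {
    if (!m[x.id] || x.qty > m[x.id].qty) m[x.id] = x;
  });
  return Object.keys(m).sort().map((k) => m[k]);
}

const eastItems = [{ id: 'sku-1', qty: 2 }, { id: 'sku-2', qty: 1 }];
const westItems = [{ id: 'sku-1', qty: 3 }, { id: 'sku-3', qty: 5 }];

const once = mergeById(eastItems, westItems);
const twice = mergeById(once, westItems);        // same conflict applied again
const swapped = mergeById(westItems, eastItems); // regions seen in opposite order

const same = (x, y) => JSON.stringify(x) === JSON.stringify(y);
console.log(same(once, twice));   // true: idempotent
console.log(same(once, swapped)); // true: order-independent
```

One caveat the sketch does not cover: if two versions of a line carry the same qty but differ in other fields, the winner depends on which one is seen first, so a real resolver needs an explicit tie-break.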
(b) Manual resolution via the conflicts feed. Set the policy to Custom with no resolver procedure. Now Cosmos writes every conflicting version to the per-container conflicts feed and stops. Your application drains it and resolves on its own terms.
// Drain the conflicts feed and resolve in application code
using var iterator = container.Conflicts.GetConflictQueryIterator<ConflictProperties>();
while (iterator.HasMoreResults)
{
foreach (var conflict in await iterator.ReadNextAsync())
{
// The losing version that landed in the feed
Order conflicting = container.Conflicts.ReadConflictContent<Order>(conflict);
// The currently committed version
Order committed = await container.ReadItemAsync<Order>(
conflicting.Id, new PartitionKey(conflicting.TenantId));
Order winner = Merge(committed, conflicting); // your business rule
await container.ReplaceItemAsync(winner, winner.Id, new PartitionKey(winner.TenantId));
// Delete the entry from the feed once handled
await container.Conflicts.DeleteAsync(conflict, new PartitionKey(conflicting.TenantId));
}
}
Manual mode is the most flexible and the most operationally demanding: if nobody drains the feed, conflicts accumulate and your data quietly diverges from what users expect. Run the drainer as a continuously scheduled job and alert if the feed depth grows.
7. Automatic vs manual failover and testing outages
Two independent settings govern regional failover:
- enableAutomaticFailover: if the write region (under single-write) becomes unavailable, Cosmos promotes the next region by failoverPriority. With multi-region writes on, this is largely moot for writes because every region already writes; the SDK simply stops routing to the down region. Keep it on regardless.
- Service-managed vs manual failover for reads/priority: you can trigger a manual failover to validate behavior or to drain a region for maintenance.
Trigger a controlled failover to rehearse an outage. This actually reprioritizes regions; run it in a test account or a planned window:
# Promote West Europe to priority 0 (simulate losing East US 2 as primary)
az cosmosdb failover-priority-change \
--name kv-cosmos-prod \
--resource-group rg-data-prod \
--failover-policies "West Europe=0" "East US 2=1" "Southeast Asia=2"
On the client side, your CosmosClient should be configured with an explicit preferred-regions list so it fails over locally without a config change:
var client = new CosmosClient(connectionString, new CosmosClientOptions
{
ApplicationPreferredRegions = new List<string>
{
"East US 2", "West Europe", "Southeast Asia" // ordered preference
},
ConnectionMode = ConnectionMode.Direct
});
With ApplicationPreferredRegions set, the SDK automatically retries the next region on a regional failure — you do not redeploy to fail over. Test this for real: block egress to the primary region’s Cosmos endpoint (NSG rule or local firewall) and confirm your service keeps serving from the next region within the SDK’s retry window.
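The effect of the ordered list can be pictured as "first preferred region that is still reachable". The following is a rough model of that idea only; `pickRegion` is a hypothetical function, and the real SDK's endpoint discovery and retry logic is considerably more involved.

```javascript
// Hypothetical model of ordered region preference: return the first region
// in the preferred list that is not currently marked unavailable.
function pickRegion(preferredRegions, unavailable) {
  const down = new Set(unavailable);
  const region = preferredRegions.find((r) => !down.has(r));
  if (region === undefined) throw new Error('no preferred region is available');
  return region;
}

const preferred = ['East US 2', 'West Europe', 'Southeast Asia'];
console.log(pickRegion(preferred, []));                           // East US 2
console.log(pickRegion(preferred, ['East US 2']));                // West Europe
console.log(pickRegion(preferred, ['East US 2', 'West Europe'])); // Southeast Asia
```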
8. Validating RPO/RTO and monitoring replication latency
Numbers you should be able to quote for a multi-region-write account:
- RTO is effectively near-zero for writes under multi-region writes, because every region is already a write region; there is no promotion step on the write path.
- RPO depends on consistency level. Under Strong RPO is 0 — but Strong is unavailable with multi-region writes. Under Bounded Staleness, RPO is bounded by your configured staleness window (the data within the bound that had not yet replicated when the region was lost). Under Session/Consistent Prefix/Eventual, RPO is non-zero and unbounded in the worst case. This is the core tradeoff: multi-region writes buy you RTO at the cost of a non-zero RPO.
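If you want these figures in a runbook or a design-review checklist, they fit in a small lookup. The sketch below encodes only the statements made in this section; `rpoBound` is a hypothetical helper, not an SLA calculator, and the policy object is assumed to be shaped like the Bicep `consistencyPolicy`.

```javascript
// Encodes the RPO statements from this section, keyed by consistency level.
function rpoBound(level, policy = {}) {
  switch (level) {
    case 'Strong':
      // RPO of zero, but not available with multi-region writes.
      return { bounded: true, seconds: 0, multiWrite: false };
    case 'BoundedStaleness':
      // Bounded by the configured staleness window.
      return {
        bounded: true,
        seconds: policy.maxIntervalInSeconds,
        versions: policy.maxStalenessPrefix,
        multiWrite: true,
      };
    case 'Session':
    case 'ConsistentPrefix':
    case 'Eventual':
      // Non-zero, with no configured bound.
      return { bounded: false, multiWrite: true };
    default:
      throw new Error(`unknown consistency level: ${level}`);
  }
}

const bounded = rpoBound('BoundedStaleness', {
  maxStalenessPrefix: 100000,
  maxIntervalInSeconds: 300,
});
console.log(bounded.seconds);              // 300
console.log(rpoBound('Session').bounded);  // false
```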
Monitor replication latency continuously. The relevant metric is Replication Latency (P50/P99 by source/target region) in Azure Monitor:
// P99 cross-region replication latency, by region pair, last 6h
AzureMetrics
| where ResourceProvider == "MICROSOFT.DOCUMENTDB"
| where MetricName == "ReplicationLatency"
| where TimeGenerated > ago(6h)
| summarize p99 = percentile(Average, 99) by bin(TimeGenerated, 5m), Resource
| order by TimeGenerated desc
Also alert on the conflict path so silent divergence cannot hide:
// Surfacing custom/manual conflict activity
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.DOCUMENTDB"
| where Category == "DataPlaneRequests"
| where OperationName has "Conflict"
| summarize count() by bin(TimeGenerated, 15m), requestResourceType_s
Verify
Concrete checks before you call this production-ready:
- Multi-write is actually on. az cosmosdb show -n kv-cosmos-prod -g rg-data-prod --query "enableMultipleWriteLocations" returns true, and writeLocations lists every region.
- Consistency is what you intended. az cosmosdb show ... --query "consistencyPolicy" shows the level plus, for Bounded Staleness, maxStalenessPrefix >= 100000 and maxIntervalInSeconds >= 300.
- The conflict policy is bound. az cosmosdb sql container show --account-name kv-cosmos-prod -g rg-data-prod -d shop -n orders --query "resource.conflictResolutionPolicy" shows your mode and either conflictResolutionPath or conflictResolutionProcedure.
- Conflicts resolve as designed. Disable replication briefly (or use the emulator’s multi-region mode), write the same document in two regions with diverging values, reconnect, and confirm the winner matches your LWW path or sproc output.
- Failover is transparent. Run az cosmosdb failover-priority-change (or block the primary endpoint) and confirm the client keeps serving via ApplicationPreferredRegions with no redeploy.
- Latency and conflicts are observed. The ReplicationLatency chart is populated and you have an alert on P99 plus on conflict-feed activity.
Enterprise scenario
A global payments platform ran a payments-ledger container with three write regions (East US 2, West Europe, Southeast Asia) at Session consistency. Their idempotency layer keyed on a client-supplied paymentId, and the write path did a read-modify-write to advance a status field (0=pending, 1=authorized, 2=captured, 3=refunded). During a partial network partition between East US 2 and West Europe, a retrying client authorized a payment in West Europe while a parallel capture landed in East US 2. The container used default LWW on _ts. Because both writes fell in the same second, _ts tied, Cosmos picked the authorize as the winner, and the capture was silently discarded — money moved, ledger said “authorized.” They found it only because daily reconciliation against the processor disagreed.
The constraint: they could not tolerate any state-machine regression, and they could not afford to drop to single-region writes (latency SLOs in APAC). The fix was a custom resolver sproc that resolves on the business state machine instead of a timestamp — the higher status rank always wins, and a refund (3) is terminal:
function resolveLedger(incoming, existing, isTombstone, conflicts) {
var ctx = getContext(), coll = ctx.getCollection(), res = ctx.getResponse();
var all = [existing, incoming].concat(conflicts || []).filter(Boolean);
// Terminal states win; otherwise the highest status rank wins.
var winner = all.reduce(function (best, c) {
if (best === null) return c;
if (c.status === 3) return c; // refund is absorbing
return (c.status > best.status) ? c : best;
}, null);
coll.upsertDocument(coll.getSelfLink(), winner, function (e) { if (e) throw e; });
res.setBody(winner);
}
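The selection rule in that sproc can be exercised under Node without the server-side context. The sketch below lifts the reducer into a pure function; `pickWinner` is my own name for it, and the sample documents are invented for the test rather than taken from the team's data.

```javascript
// The selection rule from resolveLedger as a pure function, so it can run
// without getContext() or a Cosmos DB container.
function pickWinner(existing, incoming, conflicts) {
  const all = [existing, incoming].concat(conflicts || []).filter(Boolean);
  return all.reduce((best, c) => {
    if (best === null) return c;
    if (c.status === 3) return c; // refund is absorbing
    return c.status > best.status ? c : best;
  }, null);
}

const authorized = { id: 'p1', status: 1 };
const captured = { id: 'p1', status: 2 };
const refunded = { id: 'p1', status: 3 };

console.log(pickWinner(captured, authorized, []).status);         // 2
console.log(pickWinner(authorized, captured, []).status);         // 2
console.log(pickWinner(refunded, captured, [authorized]).status); // 3
console.log(pickWinner(null, null, []));                          // null
```

The last case is worth noting: with no surviving version the reducer yields null, so a production resolver would want a guard before calling upsertDocument.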
They also moved the LWW-style fields they could safely auto-merge (audit tags, last-touched-by) into the same sproc so nothing fell back to _ts. Post-change, a six-month reconciliation run showed zero ledger regressions, and the conflicts feed alert (depth > 0 for more than five minutes) gave them an early-warning signal they had been missing entirely. The lesson the team wrote into their design guide: on a multi-region-write account, the conflict-resolution policy is part of your data model, not an afterthought — and default LWW on _ts is almost never correct for stateful, ordered domains.