Cosmos DB for NoSQL: Partition Key Design, RU Optimization, and Hot Partition Repair

Most Cosmos DB cost and latency incidents trace back to one decision made early and never revisited: the partition key. Get it right and the container scales horizontally and predictably to any throughput you can pay for. Get it wrong and you hit a wall no amount of RU/s can buy past, because a single physical partition tops out at 10,000 RU/s regardless of what you provision on the container. This is a working guide to choosing the key, measuring and shrinking RU consumption, tuning the indexing policy, and repairing a container that is already skewed in production.

1. Logical vs physical partitions, and the 20 GB ceiling

Cosmos DB has two layers of partitioning, and conflating them is the root of most design mistakes.

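A logical partition is the set of all items sharing one partition key value. If your key is /tenantId, every document for tenant-42 lives in one logical partition. Its hard constraints: it can hold at most 20 GB of data, and it lives entirely on one physical partition, so it can never be served more throughput than that partition’s 10,000 RU/s.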

A physical partition is the actual compute-and-storage unit Cosmos provisions behind the scenes. Cosmos hashes the partition key value and maps each logical partition onto exactly one physical partition. Its constraints: at most 10,000 RU/s of throughput and roughly 50 GB of storage, the two limits that drive the formula below.

The number of physical partitions is derived, not chosen - the maximum of two requirements:

physical partitions = ceil( max(
    provisioned_RU / 10000,
    total_storage_GB / 50
))
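As a quick worked example of the formula above, here is a minimal Python sketch. The function name is illustrative, the 10,000 RU/s and 50 GB limits are the ones quoted in this section, and the result should be read as an estimate rather than a count Cosmos guarantees.

```python
import math


def estimate_physical_partitions(provisioned_ru: float, storage_gb: float) -> int:
    """Estimate the physical partition count from throughput and storage needs."""
    by_throughput = provisioned_ru / 10_000  # 10,000 RU/s per physical partition
    by_storage = storage_gb / 50             # 50 GB per physical partition
    # Whichever requirement is larger decides the count; never fewer than one.
    return max(1, math.ceil(max(by_throughput, by_storage)))


# Throughput-bound: 60,000 RU/s needs 6 partitions, 120 GB alone would need 3.
print(estimate_physical_partitions(60_000, 120))
# Storage-bound: 10,000 RU/s needs 1 partition, 400 GB needs 8.
print(estimate_physical_partitions(10_000, 400))
```

The second call is the case people tend to miss: a container with modest throughput can still be spread over many physical partitions purely because of its size, which dilutes the RU/s available to each one.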

Two consequences explain most “I gave it 50,000 RU/s and it’s still throttling” tickets:

  1. A single hot logical partition cannot exceed 10,000 RU/s, because it cannot be split across physical partitions. Provisioning 100,000 RU/s on the container does nothing for one key value receiving all the traffic.
  2. Throughput is distributed evenly across physical partitions. If you provision 60,000 RU/s and Cosmos created 6 physical partitions, each gets 10,000 RU/s - even if 5 are idle and 1 is on fire.

The single most important number to internalize: 10,000 RU/s per physical partition, and a logical partition never spans more than one physical partition. Every hot-partition incident is some violation of this rule.
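To make the rule concrete, the sketch below computes the most throughput a single partition key value can ever draw. It assumes the even distribution described above; the function and constant names are illustrative, not part of any SDK.

```python
PER_PARTITION_LIMIT_RU = 10_000  # per physical partition, as quoted above


def single_key_ceiling(provisioned_ru: float, physical_partitions: int) -> float:
    """Max RU/s one partition key value can draw under even distribution."""
    if physical_partitions < 1:
        raise ValueError("physical_partitions must be >= 1")
    # Each physical partition receives an equal share of the container's RU/s,
    # and one logical partition lives on exactly one physical partition.
    share = provisioned_ru / physical_partitions
    return min(PER_PARTITION_LIMIT_RU, share)


# Doubling the container from 50,000 to 100,000 RU/s (5 -> 10 partitions)
# leaves the ceiling for one hot key unchanged.
print(single_key_ceiling(50_000, 5))
print(single_key_ceiling(100_000, 10))
# If the partition count is higher than throughput alone requires
# (for example because of storage), the per-key ceiling is lower still.
print(single_key_ceiling(30_000, 6))
```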

2. Choosing a partition key

The partition key is effectively permanent - you can only migrate to a new container, never change it in place - so this is the decision to over-invest in. Evaluate every candidate against three properties.

Cardinality. You want many distinct values so Cosmos can spread data across many logical (and therefore physical) partitions. /userId in a system with millions of users is excellent. /country is terrible: a few hundred values, wildly skewed toward your largest markets, each capped at 20 GB and 10,000 RU/s.

Access pattern alignment. The key should match how you read. If 90% of queries filter by customerId, partitioning on /customerId turns those into single-partition queries that touch one physical partition for a fraction of a fan-out’s cost. A query that omits the partition key becomes a cross-partition query, which fans out to every physical partition and bills you for the sum.

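Write distribution. Hot logical partitions are usually write problems. Avoid keys that funnel writes into one value at a time, such as a date or timestamp key (every new write lands on the current value) or a low-cardinality field like /status or /region.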

The heuristic I apply, in order of preference:

Candidate key      | Cardinality | Read alignment     | Write spread       | Verdict
/id (item id)      | Very high   | Point reads only   | Excellent          | Great if you only do point reads
/userId, /deviceId | High        | Per-entity queries | Even               | Usually the right answer
/tenantId          | Medium      | Per-tenant queries | Skewed             | Good only if tenants are balanced
/date, /createdOn  | High        | Range queries      | Monotonic hot spot | Avoid as sole key
/status, /region   | Low         | Filtered scans     | Skewed             | Avoid

When no single field is both high-cardinality and read-aligned, build a synthetic key by concatenating fields, or reach for hierarchical partition keys (Section 4).
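One way to check a candidate key against the cardinality and write-spread criteria is to profile a sample of real values before committing. The sketch below is a minimal, offline illustration; key_profile and the sample data are hypothetical, and what counts as "too skewed" is a judgment call for your workload.

```python
from collections import Counter


def key_profile(values):
    """Summarize cardinality and skew for a sample of candidate key values."""
    counts = Counter(values)
    if not counts:
        raise ValueError("need at least one sample value")
    total = sum(counts.values())
    top_value, top_count = counts.most_common(1)[0]
    return {
        "distinct": len(counts),          # cardinality of the sample
        "top_value": top_value,           # the busiest key value
        "top_share": top_count / total,   # fraction of items on that value
    }


# A /country-style key: few values, heavily skewed toward one market.
country_sample = ["US"] * 70 + ["DE"] * 20 + ["FR"] * 10
# A /userId-style key: many values, evenly spread.
user_sample = [f"user-{i}" for i in range(100)]

print(key_profile(country_sample))
print(key_profile(user_sample))
```

A top_share close to 1 divided by the number of distinct values suggests an even spread; a top_share that dwarfs it is the early warning for a future hot partition.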

3. Estimating and measuring RU/s

A Request Unit is Cosmos DB’s normalized currency for throughput: a 1 KB point read by id costs roughly 1 RU. Writes, queries, and larger documents cost more. Two activities matter - estimating up front, and measuring in production.

Measure, do not guess. Every response carries the real cost in the x-ms-request-charge header. Stop estimating the moment you can issue a real query against real data.

# Reading the request charge through the raw REST surface (for example via az rest)
# is awkward; in practice you read it from your SDK.
# With the .NET SDK:
#   response.RequestCharge  ->  double, RUs consumed
# With the Python SDK, the charge is in the last response headers after the call:
#   client.client_connection.last_response_headers['x-ms-request-charge']

In the Data Explorer Query Stats tab, every query shows its Request Charge and Retrieved document count. A query reporting 2.8 RU is fine; one reporting 850 RU on a small container is doing a cross-partition scan or fighting the indexing policy.

For sizing before you have data, the official Cosmos DB capacity calculator translates item size, read/write rates, and consistency level into a baseline RU/s. Rules of thumb worth carrying:

When throttled, Cosmos returns HTTP 429 with an x-ms-retry-after-ms header. The SDKs retry automatically up to a configurable limit, but sustained 429s mean you are either under-provisioned overall or - far more often - hammering one physical partition.
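The SDKs already implement this retry loop, so the following is only a sketch of the logic for cases where you call the service directly or want to reason about it. It assumes a hypothetical operation callable that returns (status_code, headers, body); the function name, the default of 5 retries, and the 100 ms fallback wait are illustrative choices, not service defaults.

```python
import time


def call_with_throttle_retry(operation, max_retries=5, sleep=time.sleep):
    """Retry an operation on HTTP 429, waiting as long as the service asks.

    `operation` is a zero-argument callable returning (status, headers, body).
    Returns (status, body, retries_used) for the first non-429 response.
    """
    for attempt in range(max_retries + 1):
        status, headers, body = operation()
        if status != 429:
            return status, body, attempt
        if attempt == max_retries:
            break
        # Honor the server's hint; fall back to 100 ms if the header is missing.
        wait_ms = float(headers.get("x-ms-retry-after-ms", 100))
        sleep(wait_ms / 1000.0)
    raise RuntimeError(f"still throttled after {max_retries} retries")


# Simulated service: throttled once, then succeeds.
responses = iter([
    (429, {"x-ms-retry-after-ms": "20"}, None),
    (200, {}, {"id": "1"}),
])
waits = []
status, body, retries = call_with_throttle_retry(lambda: next(responses), sleep=waits.append)
print(status, retries, waits)
```

Retrying hides short bursts, but it does nothing for a partition that is permanently over its limit: the retries simply queue behind the same ceiling.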

4. Hierarchical partition keys for skewed tenants

Multi-tenant systems almost always want to partition by /tenantId for query locality, but real tenant distributions are power-law: a handful of tenants generate most of the data and traffic. A single big tenant blows past 20 GB or saturates its 10,000 RU/s, and /tenantId traps you.

Hierarchical partition keys (also called subpartitioning) solve this by letting you define up to three levels. Cosmos uses the full path to place items, but can still route a query that supplies only a prefix to the right physical partitions.

Define the hierarchy at container creation:

az cosmosdb sql container create \
  --account-name cosmos-platform-prod \
  --resource-group rg-data-platform \
  --database-name events \
  --name telemetry \
  --partition-key-path "/tenantId" "/deviceId" "/sessionId" \
  --partition-key-version 2 \
  --throughput 10000

Now the effective partitioning is tenantId -> deviceId -> sessionId. A whale tenant’s data is spread across many deviceId sub-partitions and is no longer confined to a single logical partition or its 20 GB / 10,000 RU/s ceiling. Crucially, queries keep their efficiency depending on how much of the prefix they supply:

-- Single physical partition: full key supplied
SELECT * FROM c WHERE c.tenantId = 'acme' AND c.deviceId = 'dev-9' AND c.sessionId = 's-1'

-- Targeted subset: prefix supplied, Cosmos routes to the relevant physical partitions
SELECT * FROM c WHERE c.tenantId = 'acme'

-- Full cross-partition fan-out: prefix NOT supplied
SELECT * FROM c WHERE c.deviceId = 'dev-9'

The middle query is the payoff: you get tenant-scoped reads without ever creating a 20 GB-capped, throughput-capped logical partition for acme. Note that hierarchical keys must be enabled at creation time with partition key version 2; you cannot retrofit them onto an existing single-key container without migrating.
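The routing behavior of the three queries above can be summarized as a prefix rule. The sketch below is a simplified model for reasoning about query shapes, not how the service is implemented; classify_query and its return labels are hypothetical, and it only considers equality filters.

```python
def classify_query(hierarchy, filtered_fields):
    """Classify routing by how much of the key hierarchy the filters cover.

    hierarchy: ordered partition key fields, e.g. ["tenantId", "deviceId", "sessionId"]
    filtered_fields: set of fields the query filters on with equality
    """
    prefix_len = 0
    for field in hierarchy:
        if field not in filtered_fields:
            break  # a gap ends the usable prefix, even if later levels are supplied
        prefix_len += 1
    if prefix_len == len(hierarchy):
        return "single-partition"
    if prefix_len > 0:
        return "prefix-targeted"
    return "cross-partition fan-out"


hierarchy = ["tenantId", "deviceId", "sessionId"]
print(classify_query(hierarchy, {"tenantId", "deviceId", "sessionId"}))
print(classify_query(hierarchy, {"tenantId"}))
print(classify_query(hierarchy, {"deviceId"}))
```

The practical consequence is that level order matters: put the field your queries always supply first, and the fields that only add spread last.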

5. Indexing policy tuning

By default Cosmos DB indexes every property of every document: ad hoc queries are fast on day one, but writes are needlessly expensive forever. On write-heavy containers this is the single biggest RU lever after the partition key.

The strategy: index only what you filter, sort, or join on; exclude the rest. Path precedence is resolved by longest match, so the robust pattern is exclude everything, then include the specific paths you query.

{
  "indexingMode": "consistent",
  "automatic": true,
  "includedPaths": [
    { "path": "/customerId/?" },
    { "path": "/status/?" },
    { "path": "/createdOn/?" }
  ],
  "excludedPaths": [
    { "path": "/*" },
    { "path": "/_etag/?" }
  ],
  "compositeIndexes": [
    [
      { "path": "/customerId", "order": "ascending" },
      { "path": "/createdOn", "order": "descending" }
    ]
  ]
}

Two things to understand precisely:

Apply a policy update with the CLI; index transformation runs online in the background:

az cosmosdb sql container update \
  --account-name cosmos-platform-prod \
  --resource-group rg-data-platform \
  --database-name orders \
  --name orders \
  --idx @indexing-policy.json

Trimming a wide-open policy down to a handful of indexed paths routinely cuts create/upsert cost by 30-50% on documents with many properties, because the write no longer maintains dozens of index entries it will never serve a query from.
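To sanity-check a policy before applying it, it can help to model the longest-match rule offline. The sketch below is a deliberately simplified model that handles only the /? and /* suffixes used in the policy above and ignores array notation; is_indexed and _match_length are hypothetical helpers, not part of any SDK.

```python
def _match_length(pattern: str, path: str) -> int:
    """Return the matched prefix length, or -1 if the pattern does not cover the path."""
    if pattern.endswith("/*"):       # wildcard: this path and everything beneath it
        prefix = pattern[:-2]
        covered = path == prefix or path.startswith(prefix + "/")
    elif pattern.endswith("/?"):     # scalar value at exactly this path
        prefix = pattern[:-2]
        covered = path == prefix
    else:
        prefix = pattern
        covered = path == prefix
    return len(prefix) if covered else -1


def is_indexed(path, included, excluded) -> bool:
    """Apply the longest-match rule: the most specific covering pattern wins."""
    best_len, best_included = -1, False
    for patterns, flag in ((excluded, False), (included, True)):
        for pattern in patterns:
            length = _match_length(pattern, path)
            if length > best_len:
                best_len, best_included = length, flag
    return best_included


included = ["/customerId/?", "/status/?", "/createdOn/?"]
excluded = ["/*", "/_etag/?"]
for p in ["/customerId", "/lineItems/sku", "/_etag"]:
    print(p, is_indexed(p, included, excluded))
```

Running every property path your queries filter or sort on through a check like this is a cheap way to catch a path that the exclude-everything rule silently dropped.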

6. Detecting hot partitions

A hot partition is invisible at the container level - average RU consumption looks healthy while one physical partition sits at 100% throwing 429s. You detect it with partition-scoped metrics, not aggregates.

The key metric is Normalized RU Consumption: the percentage of provisioned RU/s used by the hottest partition in each window. A value pinned near 100% while container-level utilization sits at 30% indicates a hot partition.

In Azure Monitor / Metrics, chart it like this:

// Azure Monitor, Metrics blade: chart the metric per physical partition.
//   Metric        = Normalized RU Consumption (NormalizedRUConsumption)
//   Aggregation   = Max
//   Apply splitting on dimension "PhysicalPartitionId"

For log-based analysis, query the throttled requests in Log Analytics if diagnostic settings are routing DataPlaneRequests:

CDBDataPlaneRequests
| where TimeGenerated > ago(1h)
| where StatusCode == 429
| summarize Throttled = count() by PartitionKeyRangeId, bin(TimeGenerated, 5m)
| order by Throttled desc

A single PartitionKeyRangeId dominating the 429 count is the signature of a hot partition. Cross-reference it with the PartitionKeyStatistics diagnostic logs (the CDBPartitionKeyStatistics table, if that category is routed to Log Analytics) and the storage metrics to see which key values carry the most data. The triad to confirm a hot partition:

  1. Normalized RU Consumption (Max) near 100% on one PhysicalPartitionId.
  2. 429s concentrated on one PartitionKeyRangeId.
  3. Container-level RU utilization comfortably below provisioned.
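The triad above can be turned into a simple automated check once the metrics are exported. The sketch below assumes you have already pulled the numbers into plain dictionaries; the function name and the three thresholds (95%, 80%, 70%) are illustrative assumptions to tune, not values defined by Azure.

```python
def looks_like_hot_partition(
    normalized_ru_by_partition,   # {PhysicalPartitionId: max normalized RU %}
    throttles_by_range,           # {PartitionKeyRangeId: count of 429s}
    container_utilization_pct,    # container-level RU utilization %
    hot_pct=95.0,
    dominance=0.8,
    healthy_pct=70.0,
):
    """Return True only when all three signals of the triad are present."""
    if not normalized_ru_by_partition:
        return False
    hottest = max(normalized_ru_by_partition.values())
    total_429 = sum(throttles_by_range.values())
    top_429 = max(throttles_by_range.values(), default=0)

    one_partition_pinned = hottest >= hot_pct
    throttles_concentrated = total_429 > 0 and top_429 / total_429 >= dominance
    container_has_headroom = container_utilization_pct < healthy_pct
    return one_partition_pinned and throttles_concentrated and container_has_headroom


# Skewed: one partition pinned, nearly all 429s on one range, container at 30%.
print(looks_like_hot_partition({"0": 22, "1": 99, "2": 18}, {"0": 3, "1": 940, "2": 5}, 30))
# Uniformly saturated: this is under-provisioning, not a hot partition.
print(looks_like_hot_partition({"0": 99, "1": 98, "2": 99}, {"0": 300, "1": 310, "2": 290}, 98))
```

Separating the two cases matters because the fixes differ: uniform saturation is solved by adding RU/s, while a hot partition is not.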

7. Remediation: re-partitioning, synthetic keys, migration

You cannot change a partition key in place. Every real fix moves data to a better-keyed container, but the right approach depends on the failure mode.

Synthetic / composite keys address low cardinality. If you were forced onto /status or /region, redefine the key as a computed field on each document that combines a high-cardinality value with the natural one:

# Stamp a synthetic partition key on write to spread load.
# Combine a meaningful prefix with a bucketed suffix for high cardinality.
import hashlib

def synthetic_pk(tenant_id: str, entity_id: str, buckets: int = 100) -> str:
    suffix = int(hashlib.sha256(entity_id.encode()).hexdigest(), 16) % buckets
    return f"{tenant_id}-{suffix:03d}"

doc["pk"] = synthetic_pk(doc["tenantId"], doc["id"])
# Container partition key path is "/pk".
# Reads for a tenant must now fan across the 100 buckets, so prefer this only
# when point reads dominate, or use hierarchical keys instead for query locality.

The trade-off is explicit: synthetic suffixes spread writes well but turn tenant-scoped reads into a fan-out across the buckets. When you need both write spread and read locality, hierarchical partition keys (Section 4) are the better tool - the default for the multi-tenant case.
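To see what that read fan-out looks like, the sketch below enumerates every partition key value a tenant-scoped read has to cover under the bucketing scheme above. It repeats synthetic_pk so the block runs on its own; tenant_bucket_keys is a hypothetical helper name.

```python
import hashlib
from typing import List


def synthetic_pk(tenant_id: str, entity_id: str, buckets: int = 100) -> str:
    """Same scheme as above: tenant prefix plus a hashed, zero-padded bucket."""
    suffix = int(hashlib.sha256(entity_id.encode()).hexdigest(), 16) % buckets
    return f"{tenant_id}-{suffix:03d}"


def tenant_bucket_keys(tenant_id: str, buckets: int = 100) -> List[str]:
    """All partition key values that may hold data for one tenant."""
    return [f"{tenant_id}-{i:03d}" for i in range(buckets)]


keys = tenant_bucket_keys("acme")
print(len(keys), keys[0], keys[-1])

# Any document written for the tenant lands in exactly one of those buckets.
print(synthetic_pk("acme", "order-123") in keys)
```

A tenant-wide read therefore means one query per bucket (or one query filtering on the whole list), which is why the bucket count is a direct trade between write spread and read cost. The bucket count also has to stay fixed once data is written, since changing it changes where existing entities hash to.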

Container migration is the path when the key itself is wrong. There is no in-place repartition; you create a new container with the correct key (or hierarchy and indexing policy) and copy the data:

Always provision the destination with high RU/s during the backfill (bulk ingestion is throughput-bound) and dial it back once steady-state.

8. Autoscale vs manual throughput

The throughput mode shapes both your bill and your resilience to spikes.

Manual throughput pins a fixed RU/s. You pay for that ceiling 24/7 whether you use it or not - correct only for steady, predictable workloads you can size tightly.

Autoscale sets a maximum and instantly scales between 10% and 100% of it based on load, billing per hour for the highest RU/s reached that hour. Autoscale costs 1.5x the manual rate per RU, so the break-even is roughly 66% average utilization: below that, autoscale is cheaper because you avoid paying for idle headroom; above it, a well-sized manual setting wins.

# Create a container with autoscale: max 40,000 RU/s, floor is automatically 4,000 (10%)
az cosmosdb sql container create \
  --account-name cosmos-platform-prod \
  --resource-group rg-data-platform \
  --database-name orders \
  --name orders \
  --partition-key-path "/customerId" \
  --max-throughput 40000

# Convert an existing manual container to autoscale
az cosmosdb sql container throughput migrate \
  --account-name cosmos-platform-prod \
  --resource-group rg-data-platform \
  --database-name orders \
  --name orders \
  --throughput-type autoscale
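The break-even reasoning above is easy to check with a few lines of arithmetic. The sketch below compares the two modes in relative units (one unit is one RU/s billed for one hour at the manual rate); it uses the 1.5x multiplier and 10% floor from this section and deliberately ignores storage, regions, and actual prices. The function names are illustrative.

```python
def manual_cost_units(provisioned_ru, hours):
    """Manual throughput: pay for the full provisioned RU/s every hour."""
    return provisioned_ru * hours


def autoscale_cost_units(max_ru, hourly_peak_ru):
    """Autoscale: pay 1.5x for each hour's peak, clamped to [10% of max, max]."""
    floor = max_ru / 10
    billed = [min(max_ru, max(floor, peak)) for peak in hourly_peak_ru]
    return 1.5 * sum(billed)


# Spiky day: idle for 20 hours, at the 40,000 RU/s maximum for 4 hours.
spiky = [4_000] * 20 + [40_000] * 4
print(autoscale_cost_units(40_000, spiky), manual_cost_units(40_000, 24))

# Steady day: 36,000 RU/s around the clock (90% utilization).
steady = [36_000] * 24
print(autoscale_cost_units(40_000, steady), manual_cost_units(40_000, 24))
```

On the spiky profile autoscale costs well under half of manual; on the steady profile it costs more, which matches the roughly 66% utilization break-even.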

Two operational nuances:

Verify

Confirm the design holds before declaring victory.

# 1. Confirm the partition key path (and hierarchy) actually configured on the container
az cosmosdb sql container show \
  --account-name cosmos-platform-prod \
  --resource-group rg-data-platform \
  --database-name orders \
  --name orders \
  --query "resource.partitionKey" -o json

# 2. Confirm current throughput mode and value
az cosmosdb sql container throughput show \
  --account-name cosmos-platform-prod \
  --resource-group rg-data-platform \
  --database-name orders \
  --name orders \
  --query "resource.{manual:throughput, autoscaleMax:autoscaleSettings.maxThroughput}" -o json

Then validate behavior, not just config:

Enterprise scenario

A SaaS platform team running an order-management service partitioned a transactions container on /merchantId - reasonable, since nearly every query was merchant-scoped. It held up through hundreds of mid-size merchants. Then they signed a marketplace customer whose Black Friday traffic was 40x their next-largest merchant. That merchant’s logical partition slammed into the 10,000 RU/s physical-partition ceiling: the container ran at 50,000 RU/s across five physical partitions and overall utilization sat near 22%, yet checkout writes for that one merchant returned HTTP 429 for hours. More RU/s changed nothing - all of that merchant’s traffic hashed to a single physical partition.

The constraint was unmovable: you cannot change a partition key in place, a single logical partition cannot be split, and they could not afford a maintenance window during the holiday peak.

The fix was a migration to hierarchical partition keys, /merchantId then /orderId. They created a new container with partition key version 2, set a tight indexing policy (excluding the large lineItems blob they never filtered on), provisioned 80,000 RU/s for the backfill, and drained the source’s change feed into it with an Azure Function so the copy was live and resumable. They cut writes over behind a feature flag once the processor caught up, then dropped to autoscale max 50,000.

az cosmosdb sql container create \
  --account-name cosmos-orders-prod \
  --resource-group rg-orders \
  --database-name commerce \
  --name transactions_v2 \
  --partition-key-path "/merchantId" "/orderId" \
  --partition-key-version 2 \
  --idx @lean-indexing.json \
  --max-throughput 80000

The whale merchant’s orders now spread across thousands of orderId sub-partitions instead of one logical partition, the per-partition ceiling stopped binding, and merchant-scoped reads stayed targeted because queries still supplied the /merchantId prefix, so Cosmos routed them only to the physical partitions holding that merchant’s data. Steady-state RU spend dropped because the lean index cut write cost on a container doing millions of order writes a day.

Checklist
