Most Cosmos DB cost and latency incidents trace back to one decision made early and never revisited: the partition key. Get it right and the container scales horizontally and predictably to any throughput you can pay for. Get it wrong and you hit a wall no amount of RU/s can buy past, because a single physical partition tops out at 10,000 RU/s regardless of what you provision on the container. This is a working guide to choosing the key, measuring and shrinking RU consumption, tuning the indexing policy, and repairing a container that is already skewed in production.
1. Logical vs physical partitions, and the 20GB ceiling
Cosmos DB has two layers of partitioning, and conflating them is the root of most design mistakes.
A logical partition is the set of all items sharing one partition key value. If your key is /tenantId, every document for tenant-42 lives in one logical partition. Its hard constraints:
- 20 GB of storage (raw data plus index). This is a ceiling you cannot raise.
- All items with that key value are co-located - which is what makes single-partition queries and transactional batch operations cheap.
A physical partition is the actual compute-and-storage unit Cosmos provisions behind the scenes. Cosmos hashes the partition key value and maps each logical partition onto exactly one physical partition. Its constraints:
- Up to 50 GB of storage per physical partition (newer accounts support larger; treat 50 GB as the planning number).
- Up to 10,000 RU/s of throughput per physical partition.
The number of physical partitions is derived, not chosen - the maximum of two requirements:
physical partitions = ceil( max(
provisioned_RU / 10000,
total_storage_GB / 50
))
Two consequences explain most “I gave it 50,000 RU/s and it’s still throttling” tickets:
- A single hot logical partition cannot exceed 10,000 RU/s, because it cannot be split across physical partitions. Provisioning 100,000 RU/s on the container does nothing for one key value receiving all the traffic.
- Throughput is distributed evenly across physical partitions. If you provision 60,000 RU/s and Cosmos created 6 physical partitions, each gets 10,000 RU/s - even if 5 are idle and 1 is on fire.
The single most important number to internalize: 10,000 RU/s per physical partition, and a logical partition never spans more than one physical partition. Every hot-partition incident is some violation of this rule.
2. Choosing a partition key
The partition key is effectively permanent - you can only migrate to a new container, never change it in place - so this is the decision to over-invest in. Evaluate every candidate against three properties.
Cardinality. You want many distinct values so Cosmos can spread data across many logical (and therefore physical) partitions. /userId in a system with millions of users is excellent. /country is terrible: a few hundred values, wildly skewed toward your largest markets, each capped at 20 GB and 10,000 RU/s.
Access pattern alignment. The key should match how you read. If 90% of queries filter by customerId, partitioning on /customerId turns those into single-partition queries that touch one physical partition for a fraction of a fan-out’s cost. A query that omits the partition key becomes a cross-partition query, which fans out to every physical partition and bills you for the sum.
Write distribution. Hot logical partitions are usually write problems. Avoid keys that funnel writes:
- Monotonic keys like
/dateor an incrementing ID concentrate every new write into the “current” partition - the classic append hot spot. - Status-like keys (
/statuswith valuesactive/closed) skew because most live traffic hits one value.
The heuristic I apply, in order of preference:
| Candidate key | Cardinality | Read alignment | Write spread | Verdict |
|---|---|---|---|---|
/id (item id) |
Very high | Point reads only | Excellent | Great if you only do point reads |
/userId, /deviceId |
High | Per-entity queries | Even | Usually the right answer |
/tenantId |
Medium | Per-tenant queries | Skewed | Good only if tenants are balanced |
/date, /createdOn |
High | Range queries | Monotonic hot spot | Avoid as sole key |
/status, /region |
Low | Filtered scans | Skewed | Avoid |
When no single field is both high-cardinality and read-aligned, build a synthetic key by concatenating fields, or reach for hierarchical partition keys (Section 4).
3. Estimating and measuring RU/s
A Request Unit is Cosmos DB’s normalized currency for throughput: a 1 KB point read by id costs roughly 1 RU. Writes, queries, and larger documents cost more. Two activities matter - estimating up front, and measuring in production.
Measure, do not guess. Every response carries the real cost in the x-ms-request-charge header. Stop estimating the moment you can issue a real query against real data.
# Read the request charge for a query using the REST surface via az rest is awkward;
# in practice you read the header from your SDK. With the .NET SDK:
# response.RequestCharge -> double, RUs consumed
# With the Python SDK, the charge is on the client after the call:
# client.client_connection.last_response_headers['x-ms-request-charge']
In the Data Explorer Query Stats tab, every query shows its Request Charge and Retrieved document count. A query reporting 2.8 RU is fine; one reporting 850 RU on a small container is doing a cross-partition scan or fighting the indexing policy.
For sizing before you have data, the official Cosmos DB capacity calculator translates item size, read/write rates, and consistency level into a baseline RU/s. Rules of thumb worth carrying:
- A 1 KB point read is ~1 RU; a 1 KB create is ~5 RU at default indexing.
- Stronger consistency costs more on reads: Strong and Bounded Staleness reads cost roughly 2x the equivalent Session/Eventual read.
- Indexing every property inflates write cost. Writes pay to maintain the index; trimming it (Section 5) is the highest-leverage write optimization.
When throttled, Cosmos returns HTTP 429 with an x-ms-retry-after-ms header. The SDKs retry automatically up to a configurable limit, but sustained 429s mean you are either under-provisioned overall or - far more often - hammering one physical partition.
4. Hierarchical partition keys for skewed tenants
Multi-tenant systems almost always want to partition by /tenantId for query locality, but real tenant distributions are power-law: a handful of tenants generate most of the data and traffic. A single big tenant blows past 20 GB or saturates its 10,000 RU/s, and /tenantId traps you.
Hierarchical partition keys (also called subpartitioning) solve this by letting you define up to three levels. Cosmos uses the full path to place items, but can still route a query that supplies only a prefix to the right physical partitions.
Define the hierarchy at container creation:
az cosmosdb sql container create \
--account-name cosmos-platform-prod \
--resource-group rg-data-platform \
--database-name events \
--name telemetry \
--partition-key-path "/tenantId" "/deviceId" "/sessionId" \
--partition-key-version 2 \
--throughput 10000
Now the effective partitioning is tenantId -> deviceId -> sessionId. A whale tenant’s data is spread across many deviceId sub-partitions and is no longer confined to a single logical partition or its 20 GB / 10,000 RU/s ceiling. Crucially, queries keep their efficiency depending on how much of the prefix they supply:
-- Single physical partition: full key supplied
SELECT * FROM c WHERE c.tenantId = 'acme' AND c.deviceId = 'dev-9' AND c.sessionId = 's-1'
-- Targeted subset: prefix supplied, Cosmos routes to the relevant physical partitions
SELECT * FROM c WHERE c.tenantId = 'acme'
-- Full cross-partition fan-out: prefix NOT supplied
SELECT * FROM c WHERE c.deviceId = 'dev-9'
The middle query is the payoff: you get tenant-scoped reads without ever creating a 20 GB-capped, throughput-capped logical partition for acme. Note that hierarchical keys must be enabled at creation time with partition key version 2; you cannot retrofit them onto an existing single-key container without migrating.
5. Indexing policy tuning
By default Cosmos DB indexes every property of every document - ad hoc queries are fast on day one, writes are needlessly expensive forever. On write-heavy containers this is the single biggest RU lever after the partition key.
The strategy: index only what you filter, sort, or join on; exclude the rest. Path precedence is resolved by longest match, so the robust pattern is exclude everything, then include the specific paths you query.
{
"indexingMode": "consistent",
"automatic": true,
"includedPaths": [
{ "path": "/customerId/?" },
{ "path": "/status/?" },
{ "path": "/createdOn/?" }
],
"excludedPaths": [
{ "path": "/*" },
{ "path": "/_etag/?" }
],
"compositeIndexes": [
[
{ "path": "/customerId", "order": "ascending" },
{ "path": "/createdOn", "order": "descending" }
]
]
}
Two things to understand precisely:
- The
/?suffix means “index the scalar value at this path.” The/*wildcard underexcludedPathsexcludes everything beneath root, which combined with the explicitincludedPathsgives a tight allowlist. - Composite indexes are required for efficient queries that filter on one property and
ORDER BYanother, orORDER BYtwo properties.WHERE c.customerId = @id ORDER BY c.createdOn DESCis far cheaper - or only possible without a full scan - with the composite index above. Property order and sort direction must match the query (or be its exact reverse).
Apply a policy update with the CLI; index transformation runs online in the background:
az cosmosdb sql container update \
--account-name cosmos-platform-prod \
--resource-group rg-data-platform \
--database-name orders \
--name orders \
--idx @indexing-policy.json
Trimming a wide-open policy down to a handful of indexed paths routinely cuts create/upsert cost by 30-50% on documents with many properties, because the write no longer maintains dozens of index entries it will never serve a query from.
6. Detecting hot partitions
A hot partition is invisible at the container level - average RU consumption looks healthy while one physical partition sits at 100% throwing 429s. You detect it with partition-scoped metrics, not aggregates.
The key metric is Normalized RU Consumption: the percentage of provisioned RU/s used by the hottest partition in each window. Pinned near 100% while container-level utilization sits at 30% means a hot partition by definition.
In Azure Monitor / Metrics, chart it like this:
// Azure Monitor metric, split by physical partition.
// Metric: NormalizedRUConsumption
// Aggregation: Max
// Split (filter) by: PhysicalPartitionId
//
// In the Metrics blade:
// Metric = Normalized RU Consumption
// Aggregation = Max
// Apply splitting on dimension "PhysicalPartitionId"
For log-based analysis, query the throttled requests in Log Analytics if diagnostic settings are routing DataPlaneRequests:
CDBDataPlaneRequests
| where TimeGenerated > ago(1h)
| where StatusCode == 429
| summarize Throttled = count() by PartitionKeyRangeId, bin(TimeGenerated, 5m)
| order by Throttled desc
A single PartitionKeyRangeId dominating the 429 count is the signature of a hot partition. Cross-reference it with PartitionKeyStatistics (available via the SDK’s GetPartitionKeyRangesAsync and storage metrics) to see which key values carry the most data. The triad to confirm a hot partition:
- Normalized RU Consumption (Max) near 100% on one
PhysicalPartitionId. - 429s concentrated on one
PartitionKeyRangeId. - Container-level RU utilization comfortably below provisioned.
7. Remediation: re-partitioning, synthetic keys, migration
You cannot change a partition key in place. Every real fix moves data to a better-keyed container, but the right approach depends on the failure mode.
Synthetic / composite keys address low cardinality. If you were forced onto /status or /region, redefine the key as a computed field on each document that combines a high-cardinality value with the natural one:
# Stamp a synthetic partition key on write to spread load.
# Combine a meaningful prefix with a bucketed suffix for high cardinality.
import hashlib
def synthetic_pk(tenant_id: str, entity_id: str, buckets: int = 100) -> str:
suffix = int(hashlib.sha256(entity_id.encode()).hexdigest(), 16) % buckets
return f"{tenant_id}-{suffix:03d}"
doc["pk"] = synthetic_pk(doc["tenantId"], doc["id"])
# Container partition key path is "/pk".
# Reads for a tenant must now fan across the 100 buckets, so prefer this only
# when point reads dominate, or use hierarchical keys instead for query locality.
The trade-off is explicit: synthetic suffixes spread writes well but turn tenant-scoped reads into a fan-out across the buckets. When you need both write spread and read locality, hierarchical partition keys (Section 4) are the better tool - the default for the multi-tenant case.
Container migration is the path when the key itself is wrong. There is no in-place repartition; you create a new container with the correct key (or hierarchy and indexing policy) and copy the data:
- Change feed is the production-safe mechanism. Stand up the new container, run an Azure Function or self-hosted change-feed processor to drain the source’s change feed into the destination, then cut writes over once it has caught up - a live, resumable backfill with no maintenance window.
- For one-shot bulk copies, the Azure Cosmos DB Spark connector or desktop Data Migration tool moves data quickly, but you still need the change feed to capture writes that land during the copy.
Always provision the destination with high RU/s during the backfill (bulk ingestion is throughput-bound) and dial it back once steady-state.
8. Autoscale vs manual throughput
The throughput mode shapes both your bill and your resilience to spikes.
Manual throughput pins a fixed RU/s. You pay for that ceiling 24/7 whether you use it or not - correct only for steady, predictable workloads you can size tightly.
Autoscale sets a maximum and instantly scales between 10% and 100% of it based on load, billing per hour for the highest RU/s reached that hour. Autoscale costs 1.5x the manual rate per RU, so the break-even is roughly 66% average utilization: below that, autoscale is cheaper because you avoid paying for idle headroom; above it, a well-sized manual setting wins.
# Create a container with autoscale: max 40,000 RU/s, floor is automatically 4,000 (10%)
az cosmosdb sql container create \
--account-name cosmos-platform-prod \
--resource-group rg-data-platform \
--database-name orders \
--name orders \
--partition-key-path "/customerId" \
--max-throughput 40000
# Convert an existing manual container to autoscale
az cosmosdb sql container throughput migrate \
--account-name cosmos-platform-prod \
--resource-group rg-data-platform \
--database-name orders \
--name orders \
--throughput-type autoscale
Two operational nuances:
- Autoscale does not save you from a hot partition. The 10,000 RU/s per-physical-partition cap applies to autoscale exactly as to manual. Autoscale absorbs aggregate spikes; it does nothing for a single saturated key.
- Burst capacity lets a physical partition temporarily exceed its provisioned share by drawing on idle RU/s accumulated over the prior 5 minutes (up to 3000 RU/s). It smooths short bursts on otherwise-cool partitions, but it is a buffer, not a fix for sustained skew.
Verify
Confirm the design holds before declaring victory.
# 1. Confirm the partition key path (and hierarchy) actually configured on the container
az cosmosdb sql container show \
--account-name cosmos-platform-prod \
--resource-group rg-data-platform \
--database-name orders \
--name orders \
--query "resource.partitionKey" -o json
# 2. Confirm current throughput mode and value
az cosmosdb sql container throughput show \
--account-name cosmos-platform-prod \
--resource-group rg-data-platform \
--database-name orders \
--name orders \
--query "resource.{manual:throughput, autoscaleMax:autoscaleSettings.maxThroughput}" -o json
Then validate behavior, not just config:
- Run your top three queries in Data Explorer and read Request Charge on each. Single-partition reads should be single-digit to low-tens of RU; investigate anything in the hundreds.
- In Metrics, chart Normalized RU Consumption with Max aggregation split by
PhysicalPartitionIdover the last 24 hours. No partition should sit near 100% while others idle. - Check the 429 rate: under correct provisioning and keying, the Total Request Units chart should show throttling near zero outside of brief autoscale ramp windows.
Enterprise scenario
A SaaS platform team running an order-management service partitioned a transactions container on /merchantId - reasonable, since nearly every query was merchant-scoped. It held up through hundreds of mid-size merchants. Then they signed a marketplace customer whose Black Friday traffic was 40x their next-largest merchant. That merchant’s logical partition slammed into the 10,000 RU/s physical-partition ceiling: the container ran at 50,000 RU/s across five physical partitions and overall utilization sat near 22%, yet checkout writes for that one merchant returned HTTP 429 for hours. More RU/s changed nothing - all of that merchant’s traffic hashed to a single physical partition.
The constraint was unmovable: you cannot change a partition key in place, a single logical partition cannot be split, and they could not afford a maintenance window during the holiday peak.
The fix was a migration to hierarchical partition keys, /merchantId then /orderId. They created a new container with partition key version 2, set a tight indexing policy (excluding the large lineItems blob they never filtered on), provisioned 80,000 RU/s for the backfill, and drained the source’s change feed into it with an Azure Function so the copy was live and resumable. They cut writes over behind a feature flag once the processor caught up, then dropped to autoscale max 50,000.
az cosmosdb sql container create \
--account-name cosmos-orders-prod \
--resource-group rg-orders \
--database-name commerce \
--name transactions_v2 \
--partition-key-path "/merchantId" "/orderId" \
--partition-key-version 2 \
--idx @lean-indexing.json \
--max-throughput 80000
The whale merchant’s orders now spread across thousands of orderId sub-partitions instead of one logical partition, the per-partition ceiling stopped binding, and merchant-scoped reads stayed single-partition because queries still supplied the /merchantId prefix. Steady-state RU spend dropped because the lean index cut write cost on a container doing millions of order writes a day.