Every large-scale outage I have reviewed had the same root cause hiding behind the proximate one: a single shared component – one database, one config push, one poisoned cache key – whose failure was visible to every customer at once. The system had no internal walls. When the thing broke, it broke for everybody, and the size of the incident was the size of the entire fleet.
Cell-based architecture is the discipline of building those walls in on purpose. Instead of one large system serving all traffic, you run many small, identical, independent copies – cells – each owning a fixed slice of the workload. A failure is contained to the cell it happens in. The blast radius stops being “100% of customers” and becomes “1/N of customers,” and N is a number you choose. This article walks the concrete mechanics: defining a cell, sizing and placing cells, routing tenants deterministically, using shuffle sharding to break correlated impact, separating control plane from data plane, and rolling changes out one cell at a time. The running assumption is that you already operate a working multi-tenant service and now need it to fail in pieces instead of all at once.
Why a shared fleet fails all-or-nothing
Start with the failure mode, because it drives every design decision that follows. A conventional horizontally-scaled service – a stateless tier behind a load balancer, fronting a shared data store – looks resilient. You can lose any single node and the load balancer routes around it. But the shared dependencies underneath are singletons from the fleet’s point of view, and they fail as singletons:
- A bad config or feature flag is pushed to every node simultaneously. There is no node that did not get it.
- One tenant sends a query that locks a hot partition or saturates a connection pool. Every other tenant on that pool now waits.
- A poison message, a corrupt cache entry, or a memory leak triggered by one input class spreads through shared state.
In each case the load balancer is useless, because the fault is not in any one node – it is in something all the nodes share. Horizontal scaling buys you capacity and instance-level fault tolerance; it buys you nothing against correlated failure of a shared component.
The core insight: availability at scale is governed by your largest shared fault domain, not your instance count. Adding more identical nodes behind one database does not reduce blast radius – it increases the number of customers who notice when that database has a bad minute. To shrink blast radius you must shrink what is shared, and the only way to shrink what is shared is to run multiple independent copies of the whole stack.
That is the entire thesis. A cell is a complete, self-contained copy of the stack – compute, data, caches, queues – that serves a bounded subset of traffic and shares nothing with its siblings. Get the boundary right and a catastrophic failure inside one cell is, to everyone outside it, a non-event.
Step 1 – Define a cell: independent capacity, data, and failure boundary
A cell is not a shard and it is not an Availability Zone. It is a full deployment of your service, scoped to handle a fixed maximum load, with its own everything below the router. The defining test is the failure boundary: a fault originating in cell A – a deadlocked database, an OOMing pod, a corrupt cache – must have no path to degrade cell B. If a single resource is shared across cells, that resource is a blast-radius leak and the cells are cells in name only.
A useful checklist for “is this actually independent”:
| Resource | Per-cell? | Why it matters |
|---|---|---|
| Compute / pods | Yes | A crash loop or resource exhaustion stays local |
| Primary datastore | Yes | The most common shared fault domain; the one that matters most |
| Cache (Redis/Memcached) | Yes | A poisoned key or eviction storm cannot cross cells |
| Message queue / stream | Yes | A poison-message backlog is contained |
| Connection pools, rate limiters | Yes | Noisy-neighbor saturation is bounded to one cell |
| DNS, TLS certs, identity provider | Often shared | Acceptable only if highly available and change-controlled separately |
| Cell router | Shared (thin) | Must be far simpler and more stable than a cell – see Step 3 |
The hard rule: the data plane of a cell touches nothing outside the cell. Everything above the router (DNS, the routing layer itself, identity) is allowed to be shared, but only because you are going to make it deliberately thin and boring – so simple it almost never changes and almost never fails.
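The checklist can be made mechanical by auditing a declared inventory of each cell's dependencies and flagging any endpoint that shows up in more than one cell. The following is a minimal sketch under assumptions: the inventory shape, the `shared_dependencies` helper, and the endpoint names are all hypothetical, not part of any real tool.

```python
from collections import defaultdict


def shared_dependencies(inventory: dict[str, dict[str, str]]) -> dict[str, list[str]]:
    """Return {endpoint: [cell ids]} for every endpoint used by more than one cell."""
    users: dict[str, set[str]] = defaultdict(set)
    for cell_id, deps in inventory.items():
        for endpoint in deps.values():
            users[endpoint].add(cell_id)
    return {ep: sorted(cells) for ep, cells in users.items() if len(cells) > 1}


# Hypothetical inventory: cell id -> {resource kind: endpoint}.
inventory = {
    "001": {"db": "pg-cell-001.internal", "cache": "redis-cell-001.internal"},
    "002": {"db": "pg-cell-002.internal", "cache": "redis-cell-001.internal"},  # leak
    "003": {"db": "pg-cell-003.internal", "cache": "redis-cell-003.internal"},
}

leaks = shared_dependencies(inventory)
print(leaks)  # {'redis-cell-001.internal': ['001', '002']}
```

An empty result does not prove isolation, since it only sees dependencies that were declared, but a non-empty one is a blast-radius leak worth fixing before anything else.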
A practical cell footprint on Kubernetes is one namespace (or one cluster, if you want hard isolation) per cell, with cell identity stamped on every resource so nothing is ambiguous about which cell it belongs to:
# Each cell is a self-contained deployment. The cell-id label is the
# single source of truth that ties compute, data binding, and routing together.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api
  namespace: cell-003
  labels:
    app: orders-api
    cell-id: "003"
spec:
  replicas: 6
  selector:
    matchLabels: { app: orders-api, cell-id: "003" }
  template:
    metadata:
      labels: { app: orders-api, cell-id: "003" }
    spec:
      containers:
        - name: orders-api
          image: registry.internal/orders-api:1.42.0
          env:
            - name: CELL_ID
              value: "003"
            - name: DB_DSN            # points only at this cell's database
              valueFrom: { secretKeyRef: { name: cell-003-db, key: dsn } }
            - name: CACHE_ENDPOINT    # this cell's Redis, nobody else's
              valueFrom: { secretKeyRef: { name: cell-003-cache, key: host } }
The cell-id value is load-bearing. It flows into metrics labels, log fields, and the routing table, so that at 3 a.m. you can answer “which cell?” before you answer “what broke?” – and scope every dashboard, alert, and drain to that one cell.
Step 2 – Size, place, and choose how many cells
Two numbers define a cell-based system: cell size (max load per cell) and cell count (how many). They trade off against each other and against blast radius.
Cell size is the maximum throughput a single cell can serve before you stop adding tenants to it. Set it from a real load test, not a guess – the point at which p99 latency starts to degrade or a backend hits a hard limit (max connections, IOPS, partition count). This becomes a hard cap. A cell that is “full” gets no new tenants; new tenants go to a new cell.
Cell count is determined by total load divided by cell size, plus headroom. The crucial property is that blast radius is 1 / cell_count. Ten cells means a single-cell failure affects at most ~10% of tenants; fifty cells means ~2%. Smaller cells give a smaller blast radius but more operational surface (more databases to patch, more cells to deploy to) and more per-cell fixed-cost overhead.
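The arithmetic is simple enough to sketch. The example below assumes a total load of 20,000 RPS (an illustrative figure, not from a real fleet) and reuses the 4,000 RPS cell size and 0.7 headroom factor from the registry example in this step; the function names are hypothetical.

```python
import math


def cells_needed(total_rps: float, cell_size_max_rps: float, headroom_factor: float) -> int:
    """Cells required when each cell is only filled to its headroom cap."""
    usable_rps_per_cell = cell_size_max_rps * headroom_factor
    return math.ceil(total_rps / usable_rps_per_cell)


def blast_radius(cell_count: int) -> float:
    """Worst-case share of tenants hit by a single-cell failure."""
    return 1.0 / cell_count


count = cells_needed(total_rps=20_000, cell_size_max_rps=4_000, headroom_factor=0.7)
print(count, blast_radius(count))  # 8 0.125
```

Halving the cell size roughly doubles the cell count and halves the blast radius, at the price of twice as many databases to patch.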
The sizing trap to avoid: do not let cells grow unbounded “to save money.” A cell that quietly doubled in size has quietly doubled its blast radius. The whole value proposition is that a cell’s worst-case impact is known and capped, and that guarantee only holds if the size cap is enforced – a full cell must reject new tenants rather than stretch.
Placement. Cells should be spread across Availability Zones and, for the highest tier, across regions, so that an AZ or region event takes out only a subset of cells. A common pattern: each cell is multi-AZ internally (its database is zone-redundant), and cells are distributed across regions so a regional outage degrades a known fraction of cells rather than all of them. Resist the urge to make a cell span regions for its data plane – cross-region synchronous data access reintroduces a shared, latency-coupled dependency and defeats the isolation.
Capture the sizing decision as a single source of truth – a registry every other component reads:
{
  "cell_size_max_rps": 4000,
  "headroom_factor": 0.7,
  "cells": [
    { "id": "001", "region": "eastus2",    "azs": [1,2,3], "status": "active",   "assigned_rps": 2600 },
    { "id": "002", "region": "eastus2",    "azs": [1,2,3], "status": "active",   "assigned_rps": 2750 },
    { "id": "003", "region": "westus3",    "azs": [1,2,3], "status": "active",   "assigned_rps": 1900 },
    { "id": "004", "region": "westeurope", "azs": [1,2,3], "status": "draining", "assigned_rps": 0 }
  ]
}
headroom_factor is the fraction of cell_size_max_rps you will fill before a cell is considered full. Running cells at 70% leaves room to absorb a sibling cell’s traffic during a failover and to absorb organic growth between provisioning cycles. A cell allowed to run at 100% has no margin to take on load when its neighbor dies – which is exactly when you need that margin.
Step 3 – The cell router: the one component that must never fail
Every request must be deterministically mapped to exactly one cell. That mapping is the job of the cell router – and it is the single most important and most dangerous component in the design, because it is shared across all cells. If the router fails, every cell becomes unreachable regardless of how healthy the cells themselves are.
The discipline that keeps the router safe: it must be radically simpler than a cell, change far less often, and do as little work as possible. A router that parses request bodies, calls a database, or runs business logic has reintroduced a complex shared dependency – the exact thing cells exist to eliminate. The ideal router is a pure function of a stable partition key (tenant ID, account ID) to a cell ID, backed by an in-memory mapping that is itself versioned and slow-changing.
# A cell router is a pure, fast function. No I/O on the hot path.
# The mapping table is loaded at startup and refreshed out-of-band,
# never fetched per request.
import hashlib


class CellRouter:
    def __init__(self, mapping: dict[str, str], active_cells: list[str]):
        self._mapping = mapping        # tenant_id -> cell_id (explicit pins)
        self._active = active_cells    # fallback ring for unmapped tenants

    def cell_for(self, tenant_id: str) -> str:
        # 1. Explicit assignment wins (lets us pin and migrate tenants).
        pinned = self._mapping.get(tenant_id)
        if pinned is not None:
            return pinned
        # 2. Fallback: deterministic hash onto the active set.
        #    Stable for a given tenant + active-cell set.
        h = int(hashlib.sha256(tenant_id.encode()).hexdigest(), 16)
        return self._active[h % len(self._active)]
Two design choices are worth defending. First, prefer an explicit assignment table over pure consistent hashing. Pure hashing is elegant but it means you cannot move a single tenant without changing the hash function or the ring, and you have no record of where a given tenant is. An explicit tenant -> cell table makes migration a row update and makes “where is tenant X” a lookup. Use hashing only as the bootstrap for never-before-seen tenants. Second, keep the routing data in memory and refresh it asynchronously; a router that does a database read per request has put a shared database back on the critical path of every cell.
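The second choice, refreshing routing data out-of-band, might look like the sketch below. It assumes the control plane pushes whole, versioned snapshots; `RoutingSnapshot` and `RoutingTable` are hypothetical names, and the rule of rejecting stale or empty snapshots is an illustrative safety check rather than something the design mandates.

```python
import threading
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RoutingSnapshot:
    version: int
    mapping: dict[str, str]  # tenant_id -> cell_id


class RoutingTable:
    """Holds the current snapshot; request-path reads never wait on a refresh."""

    def __init__(self, initial: RoutingSnapshot):
        self._current = initial
        self._lock = threading.Lock()  # serializes writers only

    def lookup(self, tenant_id: str) -> Optional[str]:
        # One reference read: a request sees the old snapshot or the new one,
        # never a half-updated table.
        return self._current.mapping.get(tenant_id)

    def apply(self, candidate: RoutingSnapshot) -> bool:
        """Install a pushed snapshot; reject stale or empty ones."""
        with self._lock:
            if candidate.version <= self._current.version or not candidate.mapping:
                return False
            self._current = candidate
            return True


table = RoutingTable(RoutingSnapshot(184, {"tenant-9f12": "002"}))
print(table.apply(RoutingSnapshot(185, {"tenant-9f12": "003"})))  # True
print(table.apply(RoutingSnapshot(184, {"tenant-9f12": "002"})))  # False (stale)
print(table.lookup("tenant-9f12"))                                # 003
```

Swapping a whole immutable snapshot keeps the hot path free of locks and I/O, and a bad push is undone by pushing the previous mapping under a higher version number.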
In practice the router is often a thin layer at the edge – an L7 load balancer with header/host-based rules, an Envoy or NGINX deployment driven by a generated config, or a lightweight service. Whatever the implementation, treat its config change as a far higher-risk event than a cell deployment, because its blast radius is the whole fleet.
Step 4 – Map tenants to cells and rebalance without downtime
Assigning a brand-new tenant is the easy half: pick the least-loaded active cell that is below its headroom cap, write the assignment, done. The hard half is moving an existing tenant – to drain a cell for decommissioning, to relieve a hot cell, or to rebalance after the fleet has grown.
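The easy half fits in a few lines. This sketch assumes the registry shape from Step 2 and a per-tenant load estimate (`expected_rps`, a hypothetical input you would get from the sales tier or a trial period); `place_new_tenant` is an illustrative name.

```python
from typing import Optional


def place_new_tenant(cells: list[dict], expected_rps: float,
                     cell_size_max_rps: float, headroom_factor: float) -> Optional[str]:
    """Pick the least-loaded active cell that stays under its headroom cap."""
    cap = cell_size_max_rps * headroom_factor
    candidates = [
        c for c in cells
        if c["status"] == "active" and c["assigned_rps"] + expected_rps <= cap
    ]
    if not candidates:
        return None  # every cell is full: provision a new cell, do not stretch one
    return min(candidates, key=lambda c: c["assigned_rps"])["id"]


cells = [
    {"id": "001", "status": "active", "assigned_rps": 2600},
    {"id": "002", "status": "active", "assigned_rps": 2750},
    {"id": "003", "status": "active", "assigned_rps": 1900},
    {"id": "004", "status": "draining", "assigned_rps": 0},
]

print(place_new_tenant(cells, 100, 4000, 0.7))   # 003
print(place_new_tenant(cells, 1000, 4000, 0.7))  # None
```

Returning `None` instead of the "least bad" cell is deliberate: it turns the size cap into something the placement code cannot quietly violate.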
The technique that avoids downtime is a staged migration that keeps both cells consistent during the cutover. The shape depends on whether the tenant has state to move:
For stateless or externally-stateful tenants, migration is just a routing change with a brief dual-read window. For tenants whose state lives inside the cell, the data has to move first, and the sequence is:
1. Provision tenant resources in the target cell (warm caches, run schema, seed config).
2. Begin dual-write: writes go to both old and new cell's data stores.
3. Backfill: copy historical data old -> new; reconcile until lag is ~0.
4. Flip routing: update the tenant -> cell mapping atomically. New requests hit the new cell.
5. Drain in-flight: let outstanding requests on the old cell complete (honor a grace period).
6. Stop dual-write; verify new cell is authoritative; decommission old-cell tenant data.
The atomic routing flip in step 4 is the only moment that matters for correctness. Before it, the old cell is authoritative; after it, the new cell is. The dual-write window in steps 2–3 exists purely so that the new cell already has current data the instant the flip happens, so no request lands on a cell that is missing the tenant’s recent writes.
The migration invariant to hold onto: at every instant, exactly one cell is authoritative for a tenant’s writes, and the routing table reflects it. Dual-write is a convergence mechanism to make the new cell ready, not a steady state – two cells both believing they own a tenant is a split-brain, and split-brain is how you get silent data divergence. Keep the dual-write window as short as backfill allows, and never leave it on.
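One way to keep that invariant from eroding is to encode the migration as a small state machine in which authority is derived from a single phase field. The sketch below is illustrative only: the `Migration` class, its phase names, and the lag and in-flight inputs are assumptions, not a prescribed implementation.

```python
from dataclasses import dataclass


@dataclass
class Migration:
    source: str
    target: str
    phase: str = "dual_write"  # dual_write -> flipped -> done

    def authoritative(self) -> str:
        # Exactly one answer in every phase: that is the invariant.
        return self.source if self.phase == "dual_write" else self.target

    def write_targets(self) -> list[str]:
        # Both cells receive writes until the migration is finished.
        return [self.target] if self.phase == "done" else [self.source, self.target]

    def flip(self, backfill_lag_rows: int) -> None:
        if self.phase != "dual_write":
            raise RuntimeError("flip is only valid during dual-write")
        if backfill_lag_rows > 0:
            raise RuntimeError("target not caught up; refusing to flip")
        self.phase = "flipped"

    def finish(self, inflight_on_source: int) -> None:
        if self.phase != "flipped":
            raise RuntimeError("finish is only valid after the flip")
        if inflight_on_source > 0:
            raise RuntimeError("source still draining; keep dual-write on")
        self.phase = "done"


m = Migration(source="001", target="003")
print(m.authoritative(), m.write_targets())  # 001 ['001', '003']
m.flip(backfill_lag_rows=0)
print(m.authoritative())                     # 003
m.finish(inflight_on_source=0)
print(m.write_targets())                     # ['003']
```

Because `authoritative()` is computed from one field, there is no reachable state in which both cells claim the tenant.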
Make the assignment table the authority and keep it versioned, so a bad rebalance is a revert rather than an archaeology project:
{
  "version": 184,
  "updated": "2026-06-08T11:20:00Z",
  "assignments": {
    "tenant-9f12": { "cell": "002", "since": "2026-01-04", "pinned": true },
    "tenant-7a3e": { "cell": "001", "since": "2026-05-30", "migrating_to": "003" }
  }
}
Step 5 – Shuffle sharding to break correlated tenant impact
Plain cells already cap blast radius to 1/N of tenants. Shuffle sharding does dramatically better for the common case where each “cell” is a small set of workers and a single tenant can poison the workers it touches.
The idea, popularized by AWS, is this: instead of assigning each tenant to one cell, assign each tenant to a small random subset of workers drawn from a larger pool. Two tenants might overlap on some workers but are very unlikely to share their entire set. So when one tenant turns toxic – a poison request that crashes every worker it hits – it takes down only the handful of workers in its shard. Any other tenant whose shard does not fully overlap that set still has healthy workers to serve it.
The combinatorics are the whole point. With a pool of M workers and a shard size of N, the number of distinct shards is “M choose N.” With 8 workers and shards of 2, that is 28 combinations; with 100 workers and shards of 5, over 75 million. Even a handful of workers per tenant yields an astronomically large number of distinct shards, so the probability that any two given tenants share every worker in their shard is vanishingly small. The blast radius of one bad tenant collapses from “all tenants on its cell” to “only tenants whose entire shard overlaps it” – which, with sensible parameters, is effectively just that one tenant.
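The numbers above are easy to reproduce with the standard library. In this sketch the helper names are hypothetical, and the overlap probability assumes two tenants' shards are chosen independently and uniformly.

```python
import math


def distinct_shards(pool_size: int, shard_size: int) -> int:
    """Number of distinct shards: pool_size choose shard_size."""
    return math.comb(pool_size, shard_size)


def full_overlap_probability(pool_size: int, shard_size: int) -> float:
    """Chance that two independently, uniformly chosen shards are identical."""
    return 1 / distinct_shards(pool_size, shard_size)


print(distinct_shards(8, 2))     # 28
print(distinct_shards(100, 5))   # 75287520
print(distinct_shards(16, 2))    # 120
print(full_overlap_probability(100, 5))
```

Small pools deserve a sanity check before you rely on them: with 8 workers and shards of 2, two tenants share their whole shard about 3.6% of the time, which is small but far from negligible.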
A deterministic shuffle-shard assignment, seeded by tenant ID so it is stable and reproducible:
import hashlib


def shuffle_shard(tenant_id: str, workers: list[str], shard_size: int) -> list[str]:
    """Deterministically pick `shard_size` workers for a tenant.

    Stable for a given tenant and worker pool; no overlap of the FULL set
    between tenants is overwhelmingly likely once M choose N is large.
    """
    # Seeded shuffle: hash (tenant, worker) and take the lowest-ranked N.
    ranked = sorted(
        workers,
        key=lambda w: hashlib.sha256(f"{tenant_id}:{w}".encode()).hexdigest(),
    )
    return ranked[:shard_size]


# Two different tenants get different, mostly-disjoint subsets:
pool = [f"w{i:02d}" for i in range(16)]
print(shuffle_shard("tenant-9f12", pool, 2))   # e.g. ['w03', 'w11']
print(shuffle_shard("tenant-7a3e", pool, 2))   # e.g. ['w08', 'w14']
Shuffle sharding composes with cells; it does not replace them. Cells give you hard data isolation (separate databases), while shuffle sharding gives you fine-grained fault isolation across the request-handling workers inside the routing or stateless tier. The two together are why a single abusive caller can be contained to a near-zero fraction of the fleet rather than a full 1/N.
Step 6 – Separate the control plane from the data plane
The fastest way to undo all of this is to let a fleet-wide control plane sit on the critical path of every cell’s data plane. The two have opposite requirements and must be split:
- Data plane – serves customer requests inside a cell. Must be available even if every other part of the system is down. High volume, low latency, simple.
- Control plane – provisions cells, updates the routing table, runs deployments, rebalances tenants. Lower volume, can tolerate brief unavailability, necessarily more complex.
The rule that keeps this honest: a data-plane request must never make a synchronous call to the control plane. If serving a customer request requires the control plane to be up, then a control-plane outage is a customer outage in every cell at once – you have built a fleet-wide shared dependency and called it a feature. The control plane pushes configuration out to cells; cells cache it and keep serving from the cached copy even when the control plane is unreachable. This is sometimes called “static stability”: a cell continues to operate correctly on its last-known-good config without any control-plane help.
                 +------------------+
                 |  Control plane   |   provisioning, routing-table authority,
                 |   (low volume)   |   deploy orchestration, tenant placement
                 +---------+--------+
                           |   pushes config OUT (async), never called on hot path
        +------------------+------------------+
        v                  v                  v
   +---------+        +---------+        +---------+
   | Cell 01 |        | Cell 02 |        | Cell 03 |   each serves traffic from its
   | (data   |        | (data   |        | (data   |   own cached config; survives a
   | plane)  |        | plane)  |        | plane)  |   control-plane outage intact
   +---------+        +---------+        +---------+
A concrete consequence: the router’s mapping table is owned by the control plane but served from a copy embedded in (or pushed to) the routing layer. If the control plane dies, routing keeps working on the last table it received. New tenants cannot be onboarded until the control plane returns – an acceptable degradation – but every existing tenant keeps being served, which is the only thing that matters during an incident.
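Static stability can be sketched as a config holder whose refresh is allowed to fail without anyone on the request path noticing. Everything named here (`StaticallyStableConfig`, the `fetch` callable standing in for a control-plane client) is hypothetical, and a real implementation would also validate the fetched payload before installing it.

```python
from typing import Callable, Optional


class StaticallyStableConfig:
    """Serves the last-known-good config; refresh failures never reach callers."""

    def __init__(self, fetch: Callable[[], dict], initial: dict):
        self._fetch = fetch      # stand-in for a control-plane client call
        self._config = initial   # last-known-good copy
        self.last_error: Optional[Exception] = None

    def refresh(self) -> bool:
        """Run from a background timer, never from a request handler."""
        try:
            fresh = self._fetch()
        except Exception as exc:  # control plane down, timeout, bad response
            self.last_error = exc
            return False          # keep serving the cached copy
        self._config = fresh
        self.last_error = None
        return True

    def get(self) -> dict:
        # Hot path: a memory read, with no call to the control plane.
        return self._config


def unreachable_control_plane() -> dict:
    raise ConnectionError("control plane unreachable")


config = StaticallyStableConfig(unreachable_control_plane, {"version": 184})
refreshed = config.refresh()
print(refreshed, config.get())  # False {'version': 184}
```

The failed refresh is recorded for alerting but changes nothing for requests, which is exactly the degradation described above.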
Step 7 – Deploy changes cell-by-cell with wave-based rollout
Cells give you something a flat fleet cannot: a deployment unit smaller than “everything.” A code change, config change, or schema migration is rolled out one cell at a time, so a bad change is caught in an early cell while the rest of the fleet runs the old, known-good version. This is the operational payoff that justifies the architecture.
Organize cells into waves of increasing exposure, and bake between waves:
# Wave-based rollout: each wave widens exposure only after the prior wave
# has soaked clean. A regression is contained to the cells already updated.
rollout:
  version: "1.43.0"
  bake_time_minutes: 30          # observe each wave before proceeding
  abort_on:
    - error_rate_pct: 1.0        # any updated cell exceeding -> halt rollout
    - p99_latency_ms: 800
  waves:
    - name: canary
      cells: ["009"]             # one low-traffic / internal-only cell
    - name: wave-1
      cells: ["003", "007"]      # 2 of 9 cells, ~22% of fleet
    - name: wave-2
      cells: ["001", "002", "005"]
    - name: wave-3
      cells: ["004", "006", "008"]   # remainder
The mechanics that make this safe:
- Start with a canary cell carrying internal or low-value traffic. If the change is going to crash on contact, it crashes where almost nobody is watching.
- Bake between waves. A regression that only manifests under sustained load or at a daily peak will not show up in five minutes. Hold each wave long enough to cross a realistic load pattern before widening.
- Halt, do not barrel through. If any updated cell breaches the abort thresholds, stop the rollout immediately. The non-updated cells are your safety margin – they are still serving customers on the good version. Roll the bad cells back; do not “fix forward” across a half-deployed fleet.
- Schema and migration changes follow the same waves, which forces every migration to be backward-compatible (expand/contract): the new schema must work with both old and new code, because for the duration of the rollout both versions are live across different cells.
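The gating logic behind those rules can be sketched as a loop that widens exposure only while every updated cell stays inside the abort thresholds. This is a simplified illustration: `run_rollout`, the `deploy` and `cell_metrics` callables, and the metric values are hypothetical, and the bake wait is reduced to a comment.

```python
from typing import Callable


def run_rollout(waves: list[dict],
                deploy: Callable[[str], None],
                cell_metrics: Callable[[str], dict],
                max_error_rate_pct: float = 1.0,
                max_p99_ms: float = 800.0) -> dict:
    """Deploy wave by wave; halt as soon as any updated cell breaches a threshold."""
    updated: list[str] = []
    for wave in waves:
        for cell in wave["cells"]:
            deploy(cell)
            updated.append(cell)
        # A real orchestrator would wait out the bake time here.
        for cell in updated:
            m = cell_metrics(cell)
            if m["error_rate_pct"] > max_error_rate_pct or m["p99_latency_ms"] > max_p99_ms:
                return {"status": "halted", "wave": wave["name"],
                        "breached": cell, "updated_cells": list(updated)}
    return {"status": "complete", "updated_cells": list(updated)}


waves = [
    {"name": "canary", "cells": ["009"]},
    {"name": "wave-1", "cells": ["003", "007"]},
    {"name": "wave-2", "cells": ["001", "002", "005"]},
]
deployed: list[str] = []


def fake_metrics(cell: str) -> dict:
    # Pretend the new build regresses only on cell 007.
    if cell == "007":
        return {"error_rate_pct": 4.2, "p99_latency_ms": 310}
    return {"error_rate_pct": 0.1, "p99_latency_ms": 240}


result = run_rollout(waves, deployed.append, fake_metrics)
print(result["status"], result["wave"], result["breached"])  # halted wave-1 007
print(deployed)                                              # ['009', '003', '007']
```

Cells in wave-2 are never touched, so they remain the safety margin on the known-good version while the updated cells are rolled back.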
The cultural shift this enables: a deploy is no longer a fleet-wide coin flip. With cell-by-cell rollout the worst realistic outcome of a bad change is “one or two cells degraded and rolled back,” not “everyone is down.” That is the difference between a deploy being a routine, low-adrenaline event and being the leading cause of your outages.
Step 8 – Operate cells: per-cell health, drains, and headroom
Day-two operations are where cells earn their keep. Three practices matter most.
Per-cell observability. Every metric, log, and alert is labeled with cell-id. Fleet-wide dashboards hide exactly the failures cells are meant to expose – a single cell at 30% error rate is invisible in a fleet average. You want a per-cell view and alerts that fire on a single cell’s health, so the on-call can immediately scope an incident to one cell and reach for that cell’s drain runbook. In KQL against per-cell logs:
// Surface any single cell whose error rate is anomalous, rather than
// hiding it in a fleet-wide average. One bad cell must stand out.
AppRequests
| where TimeGenerated > ago(15m)
| extend cell = tostring(Properties["cell_id"])
| summarize total = count(),
            errors = countif(toint(ResultCode) >= 500)   // ResultCode is stored as a string
    by cell, bin(TimeGenerated, 1m)
| extend error_rate_pct = round(100.0 * errors / total, 2)
| where error_rate_pct > 1.0
| order by error_rate_pct desc
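A few lines of arithmetic show why the per-cell grouping matters. The request counts below are invented for illustration (a fifty-cell fleet with one cell failing 30% of requests), and `error_rates` is a hypothetical helper.

```python
def error_rates(per_cell: dict[str, tuple[int, int]]) -> tuple[float, dict[str, float]]:
    """per_cell maps cell id -> (total requests, failed requests)."""
    total = sum(t for t, _ in per_cell.values())
    errors = sum(e for _, e in per_cell.values())
    fleet_pct = 100.0 * errors / total
    by_cell = {cell: 100.0 * e / t for cell, (t, e) in per_cell.items()}
    return fleet_pct, by_cell


# Illustrative numbers: 49 healthy cells at 0.1% errors, one cell at 30%.
counts = {f"{i:03d}": (10_000, 10) for i in range(1, 50)}
counts["050"] = (10_000, 3_000)

fleet_pct, by_cell = error_rates(counts)
print(round(fleet_pct, 3))   # 0.698
print(by_cell["050"])        # 30.0
```

The fleet-wide figure stays under a 1% alert threshold while one cell is failing nearly a third of its requests, which is the failure a per-cell alert exists to catch.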
Cell drains. When a cell is unhealthy, the strongest tool you have is to evacuate it: stop routing new traffic to it and shift its tenants to healthy cells with headroom. Because cells are independent, draining one is a contained, reversible action – you are not touching any other cell. A drain flips the cell to draining in the registry; the router stops selecting it for new requests; in-flight work finishes under a grace period; the tenants reassign to siblings. This turns “a cell is melting down” from an emergency into a procedure.
# Drain a cell: take it out of routing, let in-flight requests finish,
# and reassign its tenants to cells with headroom. Reversible.
cellctl drain --cell 003 --grace 120s --reason "db-replica-lag-incident"
# Watch in-flight drop to zero before considering the cell idle.
cellctl status --cell 003 --watch
# CELL STATUS INFLIGHT ASSIGNED_RPS TENANTS
# 003 draining 412 1900 57
# 003 draining 38 1900 57
# 003 drained 0 0 0
Capacity headroom. A drain only works if the other cells can absorb the displaced load. This is why the headroom_factor from Step 2 is not optional accounting – it is the reason an evacuation succeeds instead of cascading. If every cell runs at 95%, draining one cell pushes its load onto neighbors that have nowhere to put it, and you have turned a one-cell incident into a two- or three-cell incident. Run cells with enough headroom that the fleet can lose its largest cell and redistribute without any surviving cell exceeding its safe load. That is the static-stability property applied to capacity: the system tolerates the loss of a cell with no scramble for new resources.
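The headroom check can run before a drain is allowed to proceed. The sketch below treats load as freely divisible RPS, which is a simplification (real drains move whole tenants), and assumes the safe ceiling during a failover is the cell's full `cell_size_max_rps`; `plan_drain` is a hypothetical helper.

```python
def plan_drain(cells: list[dict], drain_id: str, safe_cap_rps: float) -> dict[str, float]:
    """Return {cell id: extra rps} for moving the drained cell's load to siblings.

    Raises if the surviving cells cannot absorb it within their safe cap.
    """
    displaced = next(c["assigned_rps"] for c in cells if c["id"] == drain_id)
    survivors = [c for c in cells if c["id"] != drain_id and c["status"] == "active"]
    spare = {c["id"]: max(0.0, safe_cap_rps - c["assigned_rps"]) for c in survivors}
    if displaced > sum(spare.values()):
        raise RuntimeError("insufficient headroom: drain would overload survivors")

    plan: dict[str, float] = {}
    remaining = displaced
    # Fill the cells with the most spare capacity first.
    for cell_id in sorted(spare, key=spare.get, reverse=True):
        take = min(spare[cell_id], remaining)
        if take > 0:
            plan[cell_id] = take
            remaining -= take
    return plan


cells = [
    {"id": "001", "status": "active", "assigned_rps": 2600},
    {"id": "002", "status": "active", "assigned_rps": 2750},
    {"id": "003", "status": "active", "assigned_rps": 1900},
    {"id": "004", "status": "draining", "assigned_rps": 0},
]

plan = plan_drain(cells, "003", safe_cap_rps=4000)
print(plan)  # {'001': 1400, '002': 500}
```

Run against a fleet where every cell sits at 95%, the same function raises instead of returning a plan, which is the cascading case described above made explicit.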
Verify
Treat the isolation as a claim to be falsified, not a property you assume. The following confirm it actually holds:
- Determinism of routing. Pin a tenant, then resolve its cell repeatedly from multiple router instances. Every instance returns the same cell for the same tenant; a brand-new tenant resolves consistently across instances.

      for i in $(seq 1 20); do
        curl -s "https://router.internal/resolve?tenant=tenant-9f12" | jq -r .cell
      done | sort -u   # must print exactly ONE cell id

- Blast-radius containment (game day). Inject a hard fault into one cell’s database in a staging fleet – block its port, or fail its primary. Confirm that the affected cell’s error rate spikes while every other cell stays green. If a sibling cell degrades, you have an undeclared shared dependency; find it.

      # In a staging fleet only: sever cell-003's data plane and watch the rest.
      kubectl -n cell-003 scale deploy/orders-api --replicas=0   # or block the DB NSG
      cellctl status --all   # 003 should be the ONLY cell showing impact

- No control-plane dependency on the hot path. Take the control plane offline and confirm every cell keeps serving existing tenants from cached config. Data-plane success rate must not move; only new-tenant onboarding should fail.
- Shuffle-shard disjointness. For a sample of tenant pairs, compute their shards and confirm they do not fully overlap; the fraction of pairs sharing their entire shard should match the M choose N math (effectively zero for sensible parameters).
- Rollout halts on regression. In staging, deploy a deliberately broken build and confirm the rollout stops at the canary/first wave and does not proceed – the abort thresholds must actually fire and gate the next wave.
- Drain redistributes within headroom. Drain a cell under load and confirm its tenants reassign to siblings and that no surviving cell exceeds its safe load ceiling.
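The shuffle-shard disjointness check in the list above can be scripted. This sketch repeats the hash-ranked `shuffle_shard` from Step 5 so it runs on its own, and compares the observed fraction of fully overlapping tenant pairs with the 1 / (M choose N) expectation. The 400 synthetic tenant IDs and the 16-worker, shard-of-2 parameters are assumptions chosen so the overlap rate is large enough to measure.

```python
import hashlib
import math
from itertools import combinations


def shuffle_shard(tenant_id: str, workers: list[str], shard_size: int) -> list[str]:
    """Same hash-ranked selection as in Step 5."""
    ranked = sorted(
        workers,
        key=lambda w: hashlib.sha256(f"{tenant_id}:{w}".encode()).hexdigest(),
    )
    return ranked[:shard_size]


pool = [f"w{i:02d}" for i in range(16)]
tenants = [f"tenant-{i:04d}" for i in range(400)]  # synthetic tenant ids
shards = {t: frozenset(shuffle_shard(t, pool, 2)) for t in tenants}

pairs = list(combinations(tenants, 2))
full_overlap = sum(1 for a, b in pairs if shards[a] == shards[b])

observed = full_overlap / len(pairs)
expected = 1 / math.comb(len(pool), 2)  # 1 / 120 for a pool of 16, shards of 2
print(f"observed={observed:.5f} expected={expected:.5f}")
```

A large gap between the two figures would suggest the assignment is not spreading tenants uniformly across shards, for example because worker names or tenant IDs collide after normalization.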
Enterprise scenario
A B2B SaaS platform team I worked with ran a multi-tenant API on a classic flat design: a stateless tier across three AZs in front of one large, vertically-scaled regional Postgres. It had run clean for two years. Then a single enterprise tenant rolled out a new integration that issued an unindexed query pattern at high concurrency. The query took a heavy lock on a hot table; the connection pool filled with waiters; within ninety seconds every tenant on the platform was timing out. One customer’s bad code had become a total platform outage, and the incident’s blast radius was 100% of the customer base. The constraint was non-negotiable after that review: no single tenant – and no single component – may be able to take down more than a small, bounded fraction of customers, and the team could not move tenants off shared infrastructure overnight.
They re-platformed onto cells. They sized a cell from a load test at 4,000 RPS sustained, set a 70% headroom cap, and stood up an initial eight cells, each a full stack – pods, its own Postgres Flexible Server, its own Redis – one Kubernetes namespace per cell. A thin Envoy-based router mapped tenant ID to cell from an in-memory table pushed by a small control plane; the router did no I/O on the hot path and the table was served from cache, so a control-plane outage could not stop request routing. Inside the stateless tier they added shuffle sharding so even within a cell, one abusive caller hit only a 2-of-16 worker subset. Crucially, they pinned their handful of largest enterprise tenants – the ones most able to generate pathological load – onto dedicated single-tenant cells, so a repeat of the original incident would be contained to that one customer’s own cell and visible to no one else.
{
  "cells": [
    { "id": "ent-acme", "tier": "single-tenant", "tenants": ["tenant-acme"], "size_max_rps": 4000 },
    { "id": "001", "tier": "shared", "headroom_factor": 0.7, "size_max_rps": 4000 },
    { "id": "002", "tier": "shared", "headroom_factor": 0.7, "size_max_rps": 4000 }
  ]
}
The validation came eight months later, from a near-identical trigger: a tenant on shared cell 005 shipped a query that locked a table and exhausted that cell’s pool. Cell 005’s error rate went to 40%; the on-call drained it – reassigning its tenants to cells with headroom – in under four minutes. The other seven cells never moved off green. The blast radius was roughly one-eighth of the fleet for four minutes, versus the entire customer base the first time. Same class of bug; the architecture turned a company-defining outage into a contained, routine drain.