Spanner gives you the one thing every other relational database makes you choose against: horizontal scale with external consistency and real SQL. The catch is that the schema is no longer a logical convenience — it is the physical sharding strategy. Your primary key choice decides how rows map to splits, which splits land on which servers, and therefore whether a write-heavy table tops out at a few thousand QPS or scales linearly to millions. Get the key wrong and no amount of node-adding will save you, because every write is hammering the same split.
This guide works from the storage engine up: how splits and ranges actually behave, how to choose keys that spread load, when interleaving pays off and when it traps you, how secondary indexes create their own hotspots, and how to read Key Visualizer to confirm any of it in production. Examples use GoogleSQL dialect DDL; the physical reasoning is identical for the PostgreSQL dialect.
1. Splits, nodes, and how keys map to ranges
Spanner stores rows in splits — contiguous ranges of the primary key space, sorted lexicographically by key. A split is the unit of load distribution: each split is served by a single replica leader at a time, and Spanner moves splits between servers to balance CPU and storage. A node (or its fractional equivalent, processing units, where 1000 PU = 1 node) serves many splits.
The mechanics you must internalize:
- Rows are physically ordered by primary key. Adjacent key values live in the same split until the split grows large or hot, at which point Spanner splits the range at a key boundary.
- A single split has a throughput ceiling because one leader handles its writes. The documented practical guidance is to keep a split under roughly a few MB/s of writes; beyond that you need the data spread across more splits.
- Spanner adds split boundaries based on load and size, but it can only split between distinct key values. If every write targets the same key prefix, there is no boundary to introduce — the load cannot be spread.
That last point is the whole game. A “hotspot” is not a Spanner bug; it is a schema that forces all current writes into a key range that cannot be subdivided fast enough.
Spanner can pre-split a table at load time via
gcloud spanner databases ddl updatewith split points, and it learns split boundaries over minutes as load arrives. But pre-splitting a monotonically increasing key buys you nothing, because tomorrow’s writes all land past the highest boundary anyway. The fix is always in the key shape, not in pre-splitting.
2. Choosing primary keys to avoid monotonic-key hotspots
The classic mistake is a primary key that increases (or decreases) with time: an auto-increment integer, a TIMESTAMP, a UUIDv1/ULID, or anything sequence-backed. Every new row sorts to the end of the key space, so every write hits the last split. You have built a distributed database that writes to exactly one machine.
-- ANTI-PATTERN: monotonic key. All inserts hit the highest split.
CREATE TABLE Events (
EventId INT64 NOT NULL, -- from a sequence -> monotonic
Payload STRING(MAX),
CreatedAt TIMESTAMP NOT NULL OPTIONS (allow_commit_timestamp = true),
) PRIMARY KEY (EventId);
Fixes, in rough order of preference:
-
Use a random UUID as the leading key column.
GENERATE_UUID()(UUIDv4) is uniformly distributed, so inserts scatter across the entire key space and Spanner can split freely.CREATE TABLE Events ( EventId STRING(36) NOT NULL DEFAULT (GENERATE_UUID()), Payload STRING(MAX), CreatedAt TIMESTAMP NOT NULL OPTIONS (allow_commit_timestamp = true), ) PRIMARY KEY (EventId); -
Swap the order of a composite key so the high-cardinality, well-distributed column leads.
PRIMARY KEY (UserId, CreatedAt)spreads writes across users;PRIMARY KEY (CreatedAt, UserId)funnels them by time. Same columns, opposite physical behavior. -
Hash a natural key into the leading column when you need a deterministic but spread key (covered in step 5).
The cost of a random key is range scans: you can no longer scan “the last hour of events” by reading a contiguous key range, because time is not the leading column. That is the central trade — distribution versus locality. Resolve it deliberately: if you query by entity (WHERE UserId = ?), lead with the entity; if you truly need time-ordered scans, accept a bit-reversed or sharded approach rather than a raw timestamp.
3. Interleaving child tables for co-located, low-latency joins
Interleaving physically co-locates a child table’s rows with their parent row, in the same split, ordered by the shared key prefix. Orders and OrderLines for the same OrderId sit next to each other on disk. The payoff:
- A parent + children read is served from one split — no cross-split fan-out, no distributed read.
ON DELETE CASCADEbecomes a cheap local operation.- Joins on the interleaving key are local; Spanner does not shuffle.
CREATE TABLE Customers (
CustomerId STRING(36) NOT NULL DEFAULT (GENERATE_UUID()),
Name STRING(256),
) PRIMARY KEY (CustomerId);
CREATE TABLE Orders (
CustomerId STRING(36) NOT NULL,
OrderId STRING(36) NOT NULL DEFAULT (GENERATE_UUID()),
Total NUMERIC,
PlacedAt TIMESTAMP NOT NULL OPTIONS (allow_commit_timestamp = true),
) PRIMARY KEY (CustomerId, OrderId),
INTERLEAVE IN PARENT Customers ON DELETE CASCADE;
CREATE TABLE OrderLines (
CustomerId STRING(36) NOT NULL,
OrderId STRING(36) NOT NULL,
LineNo INT64 NOT NULL,
Sku STRING(64),
Qty INT64,
) PRIMARY KEY (CustomerId, OrderId, LineNo),
INTERLEAVE IN PARENT Orders ON DELETE CASCADE;
The child’s primary key must be prefixed by the parent’s full primary key — that prefix is what guarantees co-location. The trap is the interleaving (co-location) row size limit: a parent row plus all of its interleaved descendant rows must stay under ~8 GiB. A single customer with a runaway number of orders and lines will blow that ceiling and you cannot split that hierarchy across machines. Interleave one-to-few relationships (a customer’s orders, an order’s lines), never one-to-unbounded fan-outs (every event for an account that lives forever).
Decision rule: interleave when the child is small per parent and you almost always read them together. Keep tables independent when the child set is unbounded, when children are queried without the parent, or when the parent’s hierarchy could approach the size limit. A normal foreign key (step 6) gives referential integrity without forcing co-location.
4. Secondary indexes: STORING, index hotspots, and null filtering
A secondary index in Spanner is a separate, sorted table keyed by the indexed columns (with the base table’s primary key appended to make each index entry unique). That has three consequences experts plan for.
Index keys hotspot exactly like table keys. An index on a monotonic column funnels all index writes into one split, even if the base table is perfectly distributed. An index on CreatedAt recreates the timestamp hotspot in the index.
-- Same monotonic-write problem, now in the index.
CREATE INDEX OrdersByPlacedAt ON Orders (PlacedAt);
Use STORING to make an index covering. Without it, a query that selects columns not in the index does an extra back-join to the base table (one lookup per row). STORING copies those columns into the index so the query is served entirely from the index — at the cost of extra storage and write amplification.
CREATE INDEX OrdersByCustomerStatus
ON Orders (CustomerId, Status)
STORING (Total, PlacedAt);
-- SELECT Total, PlacedAt WHERE CustomerId=? AND Status=? is index-only.
NULL_FILTERED indexes skip rows where an indexed column is null, which shrinks sparse indexes dramatically and is the right default for “flag” columns where only a few rows qualify (e.g., an index over open tickets only).
CREATE NULL_FILTERED INDEX OrdersAwaitingFulfilment
ON Orders (FulfilmentDueAt)
STORING (Total) WHERE FulfilmentDueAt IS NOT NULL;
You can also interleave an index in a table (... INTERLEAVE IN <ParentTable>) so index entries co-locate with the parent, which keeps per-entity index lookups local. Use it when the index is always queried within one parent.
5. Bit-reversed sequences and key hashing for high-write tables
Sometimes you genuinely need a numeric, mostly-ordered key (foreign-key friendliness, integer joins, smaller storage than a UUID) but cannot accept the monotonic hotspot. Spanner ships a purpose-built tool: the bit-reversed positive sequence. It generates unique positive integers whose bit-reversed values are what get stored, so consecutive logical values land in completely different parts of the key space.
CREATE SEQUENCE EventSeq OPTIONS (sequence_kind = 'bit_reversed_positive');
CREATE TABLE Events (
EventId INT64 NOT NULL DEFAULT
(GET_NEXT_SEQUENCE_VALUE(SEQUENCE EventSeq)),
Payload STRING(MAX),
CreatedAt TIMESTAMP NOT NULL OPTIONS (allow_commit_timestamp = true),
) PRIMARY KEY (EventId);
This gives you integer keys with UUID-like write distribution and no application-side ID generation. It is the recommended replacement for auto-increment on a hot table.
The manual alternative is explicit key hashing / sharding: prepend a computed shard column derived from a hash of the rest of the key, modulo N.
CREATE TABLE EventsSharded (
ShardId INT64 NOT NULL, -- MOD(FARM_FINGERPRINT(NaturalKey), 32) at write time
NaturalKey STRING(128) NOT NULL,
Payload STRING(MAX),
) PRIMARY KEY (ShardId, NaturalKey);
With ShardId leading, writes spread across N=32 logical shards, each free to split independently. The cost: a query for all rows matching a NaturalKey range must fan out across all 32 shards (WHERE ShardId BETWEEN 0 AND 31 AND ...). Pick N small enough that the fan-out is cheap but large enough to clear your write ceiling — for most workloads a low double-digit number is plenty, not hundreds. Bit-reversed sequences are usually cleaner; reach for manual sharding only when you need scan locality within a shard.
6. Foreign keys, check constraints, and generated columns
Beyond keys and indexes, three schema features encode invariants so the application cannot violate them.
Foreign keys enforce referential integrity without the co-location of interleaving. Use them when you want the guarantee but the relationship is not a tight parent/child you always read together.
ALTER TABLE Orders ADD CONSTRAINT FK_Orders_Customer
FOREIGN KEY (CustomerId) REFERENCES Customers (CustomerId);
Spanner auto-creates a backing index on the referencing columns if a usable one does not already exist; enforcement adds a read on every insert/update of the child, so foreign keys are not free on the write path.
Check constraints validate column values at write time:
ALTER TABLE Orders ADD CONSTRAINT CK_Orders_Total
CHECK (Total >= 0);
Generated columns compute a value from other columns in the same row, and — critically — a STORED generated column can be indexed. This is the clean way to make a derived value (a normalized email, a hash shard, a status bucket) queryable without trusting the application to keep it in sync.
ALTER TABLE Customers ADD COLUMN
EmailLower STRING(256) AS (LOWER(Email)) STORED;
CREATE INDEX CustomersByEmailLower ON Customers (EmailLower);
This pattern — a STORED generated shard column plus a leading position in the key or an index — is how you bolt hash-sharding onto an existing table without rewriting application writes.
7. Reading query plans and using Key Visualizer to find hot splits
Schema reasoning is a hypothesis; the query plan and Key Visualizer are how you confirm it.
Query plans. Use EXPLAIN / EXPLAIN ANALYZE (or the console’s plan view) and look for the failure signatures:
-
A full table scan where you expected an index seek means the index is missing, not covering, or the optimizer rejected it. Force it to validate with a hint and decide whether the forced plan is actually better:
SELECT Total, PlacedAt FROM Orders@{FORCE_INDEX=OrdersByCustomerStatus} WHERE CustomerId = @cid AND Status = @status; -
A back-join to the base table after an index seek means you are missing a
STORINGcolumn. Add it if the query is hot. -
Distributed union over many splits for a point-style query suggests your key or shard fan-out is wider than the access pattern needs.
Key Visualizer is the heatmap that proves hotspots in production. It plots key ranges on the vertical axis against time on the horizontal axis, with brightness = activity. Bright horizontal stripes that persist are the tell: a narrow key range absorbing disproportionate load — the visual signature of a monotonic key or a hot index.
# Open Key Visualizer for a database (console deep-link form).
gcloud spanner databases describe orders-db --instance=prod-instance
# Then: Console -> Spanner -> <instance> -> <database> -> Key Visualizer.
Read it like this: a single bright band fixed at the top of the key range over time is classic append-hotspot (monotonic key writing to the highest split). Diagonal bright bands indicate scans walking the key space. A healthy high-write table looks like uniform low-level noise across the whole key range — load spread everywhere, no persistent stripe.
8. Schema change rollouts and backfilling indexes safely
Spanner applies schema changes online — no downtime, no table lock — but large changes run a background process you must operate carefully.
- Adding an index backfills it. For a large table this is a long-running operation that consumes CPU and can compete with serving traffic. Create indexes during lower-traffic windows and watch instance CPU.
- Schema updates are versioned and applied as a sequence. Spanner validates each statement; an invalid one (e.g., a
NOT NULLadd on a column with existing nulls, or aCHECKthat existing rows violate) is rejected up front rather than half-applied. CREATE INDEXis non-blocking but not instant. Track it as a long-running operation and only rely on the index once it reports complete.
# Apply DDL; for big indexes prefer one statement per call so you can track each.
gcloud spanner databases ddl update orders-db \
--instance=prod-instance \
--ddl='CREATE INDEX OrdersByCustomerStatus ON Orders (CustomerId, Status) STORING (Total, PlacedAt)'
# List in-flight schema/backfill operations and watch progress.
gcloud spanner operations list \
--instance=prod-instance \
--database=orders-db \
--type=DATABASE_UPDATE_DDL
gcloud spanner operations describe <OPERATION_ID> \
--instance=prod-instance \
--database=orders-db
For schema changes that cannot be expressed as a single safe DDL (changing a primary key, splitting one table into two, re-sharding), there is no in-place mutation — primary keys are immutable. You create a new table with the corrected key, dual-write or backfill via Dataflow, cut reads over, then drop the old table.
Enterprise scenario
A fintech platform team ran a ledger-style Transactions table keyed on PRIMARY KEY (TxnId), where TxnId was a ULID — time-ordered by design, so reconciliation jobs could scan “today’s transactions” as a contiguous range. It worked beautifully in staging. In production, as ingestion crossed ~12k writes/sec at month-end settlement, p99 commit latency spiked from 15 ms to over 400 ms and throughput flatlined no matter how many nodes they added.
Key Visualizer told the whole story in one screenshot: a single bright horizontal band pinned to the top of the key range, persistent across the entire window. Because ULIDs are monotonic, every insert landed in the highest split — one leader was absorbing all 12k writes/sec while the rest of a 10-node instance sat idle. Adding nodes did nothing because there was no second split to move load to.
The constraint was real: they could not abandon time-ordered reconciliation scans, and TxnId was a foreign key referenced across three other tables and external systems, so they could not simply randomize it. The fix kept TxnId as a stable business identifier but changed the physical key to a two-part shard-then-ULID design, with reconciliation rewritten to a bounded fan-out across the small shard set.
CREATE TABLE Transactions (
ShardId INT64 NOT NULL, -- MOD(FARM_FINGERPRINT(TxnId), 16) at write
TxnId STRING(26) NOT NULL, -- ULID, still the business key
Amount NUMERIC NOT NULL,
BookedAt TIMESTAMP NOT NULL OPTIONS (allow_commit_timestamp = true),
) PRIMARY KEY (ShardId, TxnId);
-- Reconciliation scans 16 shards in parallel instead of one hot tail.
-- WHERE ShardId BETWEEN 0 AND 15 AND TxnId >= @day_start AND TxnId < @day_end
With 16 shards leading the key, writes spread across 16 independently-splittable ranges. Commit p99 dropped back under 20 ms, the 10 nodes finally shared load evenly, and the band in Key Visualizer dissolved into uniform noise. Reconciliation cost 16 small range scans instead of one — negligible at their volume. The lesson the team wrote into their schema review checklist: a monotonic key is a single-machine bottleneck wearing a distributed database’s clothes, and it will pass every test that does not push real concurrent write volume.
Verify
Confirm the schema behaves before declaring victory.
-
Inspect the live schema and indexes from the information schema:
SELECT table_name, column_name, ordinal_position, is_nullable FROM information_schema.columns WHERE table_schema = '' ORDER BY table_name, ordinal_position; SELECT index_name, table_name, is_unique, is_null_filtered, index_state FROM information_schema.indexes WHERE table_schema = ''; -
Confirm an index is used for your hot query with
EXPLAIN ANALYZE— verify an index scan/seek, no full table scan, and no surprise back-join. -
Load-test at production concurrency, not staging volume. Drive concurrent writers at your real peak QPS and watch commit latency stay flat as you scale writers.
-
Read Key Visualizer during the load test. Healthy = uniform low-level activity across the key range. Failing = a persistent bright stripe.
-
Watch instance CPU during index backfills via Cloud Monitoring (
spanner.googleapis.com/instance/cpu/utilization) and confirm serving traffic is not starved.