CQRS gets adopted for the wrong reason about half the time. Teams reach for it because someone said “it scales,” then discover the hard part was never splitting the code paths. The hard part is the read model: keeping a fleet of denormalized projections correct and current off an append-only event log, and being honest with users about the window where a write has happened but the query side does not know yet. This article walks the pipeline end to end, on the assumption you have already decided CQRS earns its keep for at least one bounded context.
The mental model: your write side owns one job, which is to validate a command and emit facts. Everything a user reads is a cache. Once you internalize that read models are disposable, rebuildable caches over an authoritative event log, most of the operational decisions fall out naturally.
1. Separate the write model from purpose-built read models
The write model exists to enforce invariants. An aggregate loads its prior events, decides whether a command is legal, and appends new events. It should be normalized to the point of being almost unqueryable - a single aggregate stream, addressed by ID, optimized for consistency, not for the seventeen ways a UI wants to slice it.
Read models go the other direction. Each one is shaped for exactly one query pattern and is allowed to be redundant, denormalized, and lossy. The OrderSummary projection that powers a customer’s order history is a different table than the WarehousePickList projection that powers fulfillment, even though both derive from the same OrderPlaced and ItemShipped events. Do not try to make one read model serve both; the entire point is that you no longer have to compromise schema design between writers and readers.
The contract between the two sides is the event. Keep events as immutable, named facts in the past tense (PaymentCaptured, not CapturePayment), versioned, and free of any read-model concerns. A projection failing or being added must never require touching the write side.
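To make the write-side loop concrete, here is a minimal sketch of an aggregate that folds its prior events into just enough state to enforce an invariant, then emits a past-tense, versioned fact. The event shapes and the `evolve` and `shipOrder` names are hypothetical, chosen for illustration rather than taken from any framework.

```typescript
// Hypothetical event shapes: immutable, past tense, versioned.
type OrderEvent =
  | { type: "OrderPlaced"; version: 1; orderId: string; totalCents: number }
  | { type: "OrderShipped"; version: 1; orderId: string };

// Only the state needed to enforce invariants, nothing a UI would query.
type OrderState = { exists: boolean; shipped: boolean };

const initialState: OrderState = { exists: false, shipped: false };

// Fold one prior event into the aggregate state.
function evolve(state: OrderState, event: OrderEvent): OrderState {
  switch (event.type) {
    case "OrderPlaced":
      return { ...state, exists: true };
    case "OrderShipped":
      return { ...state, shipped: true };
  }
}

// Decide whether the command is legal; return new facts or throw.
function shipOrder(history: OrderEvent[], orderId: string): OrderEvent[] {
  const state = history.reduce((s, e) => evolve(s, e), initialState);
  if (!state.exists) throw new Error("order does not exist");
  if (state.shipped) throw new Error("order already shipped");
  return [{ type: "OrderShipped", version: 1, orderId }];
}
```

Nothing here knows about `order_summary` or any other projection, which is the property that lets read models come and go without touching the write side.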
2. Choose read stores: relational, document, search, and materialized views
There is no single correct read store. Match the store to the query.
| Query shape | Store | Why |
|---|---|---|
| Keyed lookups, joins, ad-hoc reporting | PostgreSQL / Azure SQL | Transactional projection updates, mature indexing |
| Document-per-aggregate, denormalized blobs | Cosmos DB / MongoDB | Single-read render of a whole screen, no joins |
| Full-text, faceted, relevance ranking | Elasticsearch / OpenSearch | Inverted index, scoring the RDBMS cannot do well |
| Pre-aggregated counters, dashboards | Materialized view / ClickHouse | Roll-ups computed once, read cheaply |
A useful pattern for relational read models is to lean on native materialized views for the aggregation-heavy ones rather than hand-maintaining counters in projection code:
CREATE MATERIALIZED VIEW order_daily_totals AS
SELECT
date_trunc('day', placed_at) AS day,
region,
count(*) AS order_count,
sum(total_cents) AS gross_cents
FROM order_summary
GROUP BY 1, 2
WITH NO DATA;
CREATE UNIQUE INDEX ON order_daily_totals (day, region);
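-- The first refresh must be a plain one: CONCURRENTLY is rejected on a view
-- that was created WITH NO DATA and has never been populated.
REFRESH MATERIALIZED VIEW order_daily_totals;
-- Later refreshes run concurrently so reads are not blocked during recompute.
REFRESH MATERIALIZED VIEW CONCURRENTLY order_daily_totals;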
REFRESH ... CONCURRENTLY requires a unique index on the view and a view that has already been populated, so the first refresh after WITH NO DATA has to be a plain REFRESH. The concurrent form does not lock out readers, which matters when a dashboard hits the view continuously. The tradeoff is that the refresh itself is heavier than an incremental update, so reserve it for roll-ups, not row-level projections.
3. Build projection workers that consume the event log reliably
A projection worker is a long-running consumer that reads events in order and applies them to a read store. The non-negotiable properties are: it processes events in order per stream, it is idempotent, and it records its position so it can resume after a crash.
Here is the shape of a worker consuming from EventStoreDB and writing to Postgres. Note the position is committed in the same transaction as the read-model write - this is what makes the update effectively exactly-once even though delivery is at-least-once.
const sub = client.subscribeToAll({
fromPosition: await loadCheckpoint(), // resume point
filter: streamNameFilter({ prefixes: ["order-"] }),
});
for await (const resolved of sub) {
const event = resolved.event;
if (!event) continue;
await db.tx(async (t) => {
switch (event.type) {
case "OrderPlaced":
await t.none(
`INSERT INTO order_summary (order_id, customer_id, total_cents, status, placed_at)
VALUES ($1,$2,$3,'placed',$4)
ON CONFLICT (order_id) DO NOTHING`,
[event.data.orderId, event.data.customerId, event.data.totalCents, event.data.placedAt]
);
break;
case "OrderShipped":
await t.none(
`UPDATE order_summary SET status='shipped', shipped_at=$2
WHERE order_id=$1`,
[event.data.orderId, event.data.shippedAt]
);
break;
}
// Checkpoint moves only if the read-model write committed.
await t.none(
`INSERT INTO projection_checkpoint (name, commit_pos, prepare_pos)
VALUES ('order_summary', $1, $2)
ON CONFLICT (name) DO UPDATE
SET commit_pos = EXCLUDED.commit_pos,
prepare_pos = EXCLUDED.prepare_pos`,
// Both halves of the position live on the recorded event itself.
[event.position.commit.toString(), event.position.prepare.toString()]
);
});
}
The ON CONFLICT DO NOTHING on insert and the position-in-the-same-transaction are the two load-bearing details. Drop either and you get duplicate rows or skipped events after the first crash. One projection equals one worker process (or one consumer-group partition) so that ordering is never violated by concurrent applies to the same stream.
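The crash behaviour is easier to reason about in a stripped-down model. The sketch below replaces Postgres with an in-memory store and treats "build a new store value, return it only on success" as a stand-in for a transaction. `StoredEvent`, `ReadStore`, `applyAtomically`, and `project` are hypothetical names, not part of any client library.

```typescript
// Hypothetical in-memory stand-ins for the event log and the read store.
type StoredEvent = { position: number; type: "OrderPlaced" | "OrderShipped"; orderId: string };
type ReadStore = { rows: Map<string, string>; checkpoint: number };

// Apply one event and advance the checkpoint as a single atomic step.
// A database would use a transaction; here we build a new store value and
// only return it if nothing failed, so a crash leaves the old store intact.
function applyAtomically(store: ReadStore, event: StoredEvent, crashBeforeCommit = false): ReadStore {
  const rows = new Map(store.rows);
  if (event.type === "OrderPlaced") {
    if (!rows.has(event.orderId)) rows.set(event.orderId, "placed"); // like ON CONFLICT DO NOTHING
  } else if (rows.has(event.orderId)) {
    rows.set(event.orderId, "shipped");
  }
  if (crashBeforeCommit) throw new Error("simulated crash before commit");
  return { rows, checkpoint: event.position };
}

// Resume strictly after the stored checkpoint and apply the rest in order.
function project(store: ReadStore, log: StoredEvent[]): ReadStore {
  return log
    .filter((e) => e.position > store.checkpoint)
    .reduce((s, e) => applyAtomically(s, e), store);
}
```

Because the row change and the checkpoint move together, a worker that dies mid-apply restarts from the last committed position and converges on the same state as a clean rebuild.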
4. Catch-up subscriptions, checkpointing, and exactly-once projection updates
“Exactly-once delivery” does not exist over a network. What you build instead is at-least-once delivery plus idempotent application, which yields exactly-once effects. Three techniques get you there:
- Catch-up subscriptions. On startup the worker first reads historically from its last checkpoint to the live head, then transitions to a live subscription with no gap. EventStoreDB’s `subscribeToAll(fromPosition)` and Kafka’s committed-offset resume both implement this. A brand-new projection starts from position zero and replays the entire log.
- Transactional checkpoints. Store the position in the read store itself, written atomically with the projection mutation as shown above. Never checkpoint to a separate system on a separate timeline - that reintroduces the dual-write problem you adopted CQRS to escape.
- Idempotency keys. Where the read store cannot do a natural upsert, carry the event’s global position or a deterministic key and dedupe on it. A `processed_events(event_id)` guard table works when upserts do not.
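The guard-table technique matters most for handlers that are not naturally idempotent, such as incrementing a counter. A minimal sketch, assuming each event carries a unique `eventId`; the `OrderCountProjection` class is hypothetical, and its `Set` stands in for a `processed_events(event_id)` table.

```typescript
type CountedEvent = { eventId: string; customerId: string };

class OrderCountProjection {
  // Stands in for a processed_events(event_id) guard table.
  private processed = new Set<string>();
  private counts = new Map<string, number>();

  // Returns true if the event was applied, false if it was a duplicate.
  apply(event: CountedEvent): boolean {
    if (this.processed.has(event.eventId)) return false; // redelivery: skip
    // In a real store the guard insert and the increment share one transaction.
    this.processed.add(event.eventId);
    this.counts.set(event.customerId, (this.counts.get(event.customerId) ?? 0) + 1);
    return true;
  }

  countFor(customerId: string): number {
    return this.counts.get(customerId) ?? 0;
  }
}
```

Redelivering the same event leaves the counter unchanged, which is the "exactly-once effect" the section describes.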
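For Kafka-based pipelines, the closest equivalent is to disable auto-commit and commit offsets only after the read-model transaction has succeeded. The database write and the offset commit are still two separate commits, so a crash between them redelivers the batch and the apply step must be idempotent: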
props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");
props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
// Process the batch, write to the read DB, then commit offsets
// only after the DB transaction succeeds.
consumer.commitSync(currentOffsets);
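enable.auto.commit=false plus a manual commitSync after the database write is the minimum bar. Auto-commit on a timer can acknowledge events you have not yet projected, and a crash in that window silently drops them. isolation.level=read_committed only matters when the producers write transactionally. For the stronger guarantee, store the offsets in the read store inside the projection transaction and seek to them on startup.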
5. Close the consistency gap: read-your-writes and client-side versioning
This is the part users feel. A command returns 202 Accepted, the UI navigates to a list, and the new item is not there yet because the projection has not caught up. You cannot eliminate the gap, but you can hide it from the one user who just made the change - which is the only user who notices.
Three mechanisms, from cheapest to strongest:
- Optimistic UI. The client already has the data it just submitted; render it locally and reconcile when the server view catches up. Costs nothing on the backend and covers the common case.
- Version tokens / read-your-writes. Have the command return the position of the event it produced. The client passes that token on its next query; the read API blocks (briefly) or returns a `stale` flag until the projection’s checkpoint has reached that position.
POST /orders -> 202, body: { "orderId": "...", "version": 481523 }
GET /orders?after=481523
-> 200 if projection checkpoint >= 481523
-> 200 with { "consistencyPending": true } otherwise
-- The read API gates on the same checkpoint the worker writes:
SELECT commit_pos >= $1 AS is_current
FROM projection_checkpoint
WHERE name = 'order_summary';
- Sticky strong-read fallback. For the rare screen that genuinely cannot tolerate staleness, route that one query straight to the write model (load the aggregate by ID) instead of the projection. Use sparingly; it defeats the read/write split, so reserve it for “show me the thing I just created” rather than list or search views.
The principle: pay for consistency only where a human will perceive its absence, and only for that human.
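The optimistic-UI mechanism can be sketched as a small client-side merge. This assumes the read API returns the projection checkpoint alongside its rows, which is an addition to the response shape shown above; `OrderRow`, `PendingWrite`, and `reconcile` are hypothetical names.

```typescript
type OrderRow = { orderId: string; status: string };
// version is the token the command returned for this write.
type PendingWrite = { row: OrderRow; version: number };

function reconcile(
  serverRows: OrderRow[],
  serverCheckpoint: number,
  pending: PendingWrite[],
): { rows: OrderRow[]; pending: PendingWrite[] } {
  // A local write is still needed only while the projection is behind its version.
  const stillPending = pending.filter((p) => p.version > serverCheckpoint);
  const serverIds = new Set(serverRows.map((r) => r.orderId));
  // Show local rows the server does not know yet; never duplicate a server row.
  const localOnly = stillPending.map((p) => p.row).filter((r) => !serverIds.has(r.orderId));
  return { rows: [...localOnly, ...serverRows], pending: stillPending };
}
```

Once the checkpoint passes a write's version, the local copy is dropped and the server row becomes the only source, so the user never sees the item vanish and reappear.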
6. Reproject and run zero-downtime read-model migrations with blue-green tables
Because read models are derived, schema changes are not migrations in the destructive sense - they are rebuilds. The blue-green pattern makes them safe with no read downtime:
- Create `order_summary_v2` alongside the live `order_summary` (blue).
- Start a new projection worker that replays the entire event log from position zero into v2 (green), using a separate checkpoint row.
- When green’s checkpoint catches up to and stays at the live head, atomically swap reads via a view or a feature flag.
- Keep blue for a rollback window, then drop it.
BEGIN;
ALTER TABLE order_summary RENAME TO order_summary_blue;
ALTER TABLE order_summary_v2 RENAME TO order_summary;
COMMIT;
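-- Reads cut over in one transaction; rollback is the inverse rename.
-- Pause or repoint the projection workers around the swap, since they address these tables by name.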
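"Catches up to and stays at the live head" is worth turning into an explicit check before anyone runs the swap. One possible sketch, assuming you sample green's checkpoint and the log head periodically; the `LagSample` shape, the `readyToSwap` name, and the default of three consecutive samples are illustrative choices, not a recommendation.

```typescript
type LagSample = { greenCheckpoint: number; headPosition: number };

// Green is ready when it has been within `tolerance` of the head
// for the last `requiredSamples` consecutive observations.
function readyToSwap(samples: LagSample[], requiredSamples = 3, tolerance = 0): boolean {
  if (samples.length < requiredSamples) return false;
  return samples
    .slice(-requiredSamples)
    .every((s) => s.headPosition - s.greenCheckpoint <= tolerance);
}
```

A single caught-up sample is not enough: a projection that briefly touches the head during a quiet period can still fall behind under normal write load.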
This is why the discipline of “events carry no read-model concerns” pays off: a new projector can interpret the same historical events into a totally different shape. The only true constraint is that every historical event version stays readable by current code (upcasters for old event versions), because you cannot rewrite history.
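An upcaster is just a pure function from an old event version to the current one, applied as events are read. A minimal sketch under invented assumptions: a v1 `OrderPlaced` that stored `total` in whole currency units, a v2 that stores `totalCents` plus `currency`, and a default of "USD" for old events. None of these shapes come from the system described here.

```typescript
// Hypothetical old and new versions of the same event.
type OrderPlacedV1 = { type: "OrderPlaced"; version: 1; orderId: string; total: number };
type OrderPlacedV2 = {
  type: "OrderPlaced";
  version: 2;
  orderId: string;
  totalCents: number;
  currency: string;
};

// Lift any stored version to the current shape; the stored event is never modified.
function upcastOrderPlaced(e: OrderPlacedV1 | OrderPlacedV2): OrderPlacedV2 {
  if (e.version === 2) return e;
  return {
    type: "OrderPlaced",
    version: 2,
    orderId: e.orderId,
    totalCents: Math.round(e.total * 100),
    currency: "USD", // assumed default for events written before currency existed
  };
}
```

Projectors then only ever handle the current version, and a full replay works across every event the log has ever held.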
7. Monitor projection lag and reconcile drift against the source of truth
Two metrics decide whether your CQRS system is healthy.
Projection lag is the distance between the live event-log head and each projection’s checkpoint. Track it as both event-count and wall-clock time; alert on time, because that is what the SLA on read-your-writes is denominated in.
ProjectionLag_CL
| where TimeGenerated > ago(1h)
| summarize max_lag_seconds = max(lag_seconds_d) by projection_name_s, bin(TimeGenerated, 1m)
| where max_lag_seconds > 10
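The value feeding a query like that has to be computed somewhere. One possible definition is sketched below: event lag is the position gap, and time lag is the age of the oldest event the projection has not yet applied. It assumes positions are dense sequence numbers (byte-offset positions would need a different event count), and the `LagInput` shape and function names are hypothetical.

```typescript
type LagInput = {
  headPosition: number;
  checkpointPosition: number;
  // Append time of the oldest event not yet applied; null when caught up.
  oldestUnappliedAtMs: number | null;
  nowMs: number;
};

function projectionLag(input: LagInput): { lagEvents: number; lagSeconds: number } {
  const lagEvents = Math.max(0, input.headPosition - input.checkpointPosition);
  const lagSeconds =
    input.oldestUnappliedAtMs === null
      ? 0
      : Math.max(0, (input.nowMs - input.oldestUnappliedAtMs) / 1000);
  return { lagEvents, lagSeconds };
}

// Alert on wall-clock lag; the default mirrors the 10-second threshold above.
function shouldAlert(lagSeconds: number, thresholdSeconds = 10): boolean {
  return lagSeconds > thresholdSeconds;
}
```

Measuring from the oldest unapplied event means a caught-up projection on a quiet stream reports zero rather than the age of its last event.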
Drift is the silent killer: a projection that is caught up but wrong because of a bug in an apply handler that has since been fixed. Lag will read zero while the data is corrupt. The defense is a periodic reconciliation job that recomputes a checksum or count from the read model and compares it to an independent recompute from the event log (or the write model):
-- Reconcile: per-customer order counts in the read model
-- must equal the count derived from the authoritative source.
SELECT rm.customer_id, rm.cnt AS read_model_cnt, src.cnt AS source_cnt
FROM (SELECT customer_id, count(*) cnt FROM order_summary GROUP BY 1) rm
FULL JOIN (SELECT customer_id, count(*) cnt FROM order_events_placed GROUP BY 1) src
ON rm.customer_id = src.customer_id
WHERE COALESCE(rm.cnt,0) <> COALESCE(src.cnt,0);
Any non-empty result is drift. The remediation is almost always a targeted reprojection (Step 6) rather than a manual patch: a patch hides the root cause, the next replay discards it, and if the handler is still wrong that replay reproduces the bad data.
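When the read model and the event log live in different stores, the same comparison has to run in application code rather than as one SQL join. A minimal sketch, assuming both sides have already been reduced to per-customer counts; `DriftRow` and `findDrift` are hypothetical names.

```typescript
type DriftRow = { customerId: string; readModelCnt: number; sourceCnt: number };

// Equivalent of the FULL JOIN: every customer on either side is compared,
// and a missing side counts as zero.
function findDrift(readModel: Map<string, number>, source: Map<string, number>): DriftRow[] {
  const ids = new Set<string>([...Array.from(readModel.keys()), ...Array.from(source.keys())]);
  const drift: DriftRow[] = [];
  ids.forEach((customerId) => {
    const readModelCnt = readModel.get(customerId) ?? 0;
    const sourceCnt = source.get(customerId) ?? 0;
    if (readModelCnt !== sourceCnt) drift.push({ customerId, readModelCnt, sourceCnt });
  });
  return drift;
}
```

Counts catch missing and duplicated rows but not wrong field values; a per-row checksum would be the next step if that class of bug matters.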
8. Anti-patterns: leaking CQRS into the UI and over-splitting models
Two failure modes account for most CQRS regret:
Leaking eventual consistency into the UI as the default experience. If every screen shows spinners and “your change may take a moment,” you have pushed an internal architecture decision onto users. Read-your-writes (Step 5) exists precisely so the common path feels synchronous. Treat visible staleness as a bug to be designed around, not a property to be tolerated.
Over-splitting into too many models. CQRS is a per-context decision, not a global mandate. A simple CRUD context that no one queries in interesting ways should stay a plain table with a shared model - splitting it adds a projection pipeline, a consistency gap, and operational surface for zero benefit. Reserve CQRS for contexts where read and write demands genuinely conflict: high read fan-out, complex query shapes, or independent scaling needs. “We use CQRS everywhere” is a smell, not an achievement.
A third, quieter one: building full event sourcing when you only needed CQRS. You can run CQRS by emitting events from a conventional state-storing write side (for example via the transactional outbox pattern) without making the event log your system of record. Coupling the two is a choice, and event sourcing carries real costs - versioning, upcasting, replay time - that you should opt into deliberately.
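The outbox route can be sketched without any event store at all. In the model below the state table stays the system of record, an outbox row is written in the same step as the state change, and a relay publishes unpublished rows. `OrderStore`, `OutboxRow`, and `relay` are hypothetical, and the in-memory writes stand in for a single database transaction.

```typescript
type OutboxRow = { id: number; type: string; payload: { orderId: string }; published: boolean };

class OrderStore {
  orders = new Map<string, { status: string }>();
  outbox: OutboxRow[] = [];
  private nextId = 1;

  // State change and outbox row are written together; a real implementation
  // would wrap both writes in one database transaction.
  placeOrder(orderId: string): void {
    if (this.orders.has(orderId)) throw new Error("duplicate order");
    this.orders.set(orderId, { status: "placed" });
    this.outbox.push({ id: this.nextId++, type: "OrderPlaced", payload: { orderId }, published: false });
  }
}

// Publish unpublished rows in id order, marking each only after publish succeeds.
// If publish throws, the row stays unpublished and is retried on the next run.
function relay(store: OrderStore, publish: (row: OutboxRow) => void): number {
  let sent = 0;
  for (const row of store.outbox) {
    if (row.published) continue;
    publish(row);
    row.published = true;
    sent++;
  }
  return sent;
}
```

Delivery from the relay is at-least-once, so everything said above about idempotent projections still applies downstream.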
Enterprise scenario
A logistics platform team ran order tracking on a single Azure SQL database. Reads had overwhelmed writes: a public “where is my parcel” page, an internal ops console, and a partner API all hammered the same tables, and a heavy reporting query could stall order ingestion at peak. They moved to CQRS with the write side emitting events through a transactional outbox into Azure Service Bus, and three projections: a Cosmos DB document per parcel for the public page, a Postgres read model for the ops console, and Elasticsearch for partner search.
The constraint surfaced in week one. The public tracking page is the screen a customer opens immediately after a status changes - exactly the read-your-writes case - and the Cosmos projection ran a few hundred milliseconds behind the event. Customers refreshed and saw the old status, then filed “tracking is broken” tickets. The fix was version tokens. The status-update API returned the event’s sequence number, the page passed it back on poll, and the read API gated on the Cosmos projection’s checkpoint:
// Read API: only serve the projection once it has caught up
// to the version the client was told about.
var checkpoint = await store.GetCheckpointAsync("parcel_tracking");
if (checkpoint < request.AfterVersion)
return Results.Ok(new { consistencyPending = true, retryAfterMs = 400 });
var parcel = await container.ReadItemAsync<ParcelView>(
request.ParcelId, new PartitionKey(request.ParcelId));
return Results.Ok(parcel.Resource);
The “tracking is broken” tickets stopped. The separate reporting load no longer touched the write database at all, ingestion throughput became predictable, and a later schema change to the ops read model went out as a blue-green reprojection over a weekend with zero downtime. The lesson the team wrote into their architecture guidance: adopt CQRS per context, and budget for the consistency gap on day one rather than discovering it in production tickets.
Verify
Before calling a CQRS pipeline production-ready, confirm each property explicitly:
- Idempotency: Replay the same event twice into a fresh read store; the result must be byte-identical to applying it once.
- Crash recovery: Kill a projection worker mid-batch and restart it; assert no rows are duplicated and no events are skipped by diffing against a clean rebuild.
- Checkpoint atomicity: Confirm the checkpoint and the read-model write share one transaction (force a failure between them and verify the checkpoint did not advance).
- Read-your-writes: Issue a command, immediately query with the returned version token, and assert the API either returns the new state or an honest `consistencyPending` rather than stale data presented as current.
- Reprojection: Rebuild a read model from position zero into a green table and reconcile it against blue - they must match exactly before any swap.
- Drift detection: Run the reconciliation query against the live read model and confirm it returns zero rows.
# Smoke test: post a synthetic order, then watch the checkpoint advance.
curl -s "$API/orders" -d @order.json -H 'content-type: application/json'
watch -n2 'psql -tAc "SELECT commit_pos FROM projection_checkpoint WHERE name='\''order_summary'\''"'