Most “event-driven on GCP” diagrams I review collapse the moment you ask one question: what happens when the consumer is down for an hour? The honest answer separates a real event-driven platform from a pile of Cloud Functions that happen to fire on a topic. On Google Cloud the building blocks are unusually good — Pub/Sub is a globally-distributed, at-least-once broker that holds messages for up to seven days and never makes you provision a partition; Eventarc turns any Google service’s activity (a file landing in a bucket, a Firestore document changing, an Audit Log entry) into a routed CloudEvent without you writing glue; Cloud Run gives you request-and-event-driven containers that scale to zero and back to thousands; Workflows orchestrates the long, multi-step business transactions that you must never bury inside a function; and Firestore is both a low-latency operational store and a real-time push channel to clients. The trap is treating these as interchangeable. The discipline is matching each one to a semantic — command vs. event, point-to-point vs. fan-out, accept-fast vs. orchestrate-reliably — and that matching is what this article is about.
The running domain here is deliberately not e-commerce checkout (the canonical example everyone reaches for). It is a connected-logistics and last-mile delivery platform, because logistics is where event-driven architecture stops being a style choice and becomes the only thing that works: tens of thousands of vehicles emitting GPS pings every few seconds, dispatch decisions that must react in real time, SLA clocks ticking on every parcel, and a fleet of independent reactors — ETA recalculation, geofence alerts, proof-of-delivery, billing, customer notifications — all keying off the same stream of facts. Commands, events, telemetry ingestion, sagas, idempotency, and real-time read models all show up at once, and you cannot hand-wave any of them.
The business scenario
Meridian Freight is a fictional but representative company: a regional logistics and same-day-delivery operator running roughly 8,200 vehicles (a mix of owned vans and contracted gig drivers) across a metro region, moving about 210,000 parcels a day for retail, grocery, and pharmacy clients. Revenue is around ₹620 crore (~USD 74M) a year. Today they run a three-year-old Java monolith on a fleet of Compute Engine VMs behind a load balancer, backed by a single Cloud SQL (PostgreSQL) primary, with a Redis cache bolted on for the live-tracking map. On a quiet weekday it holds together. The days that matter break it.
The pain that triggered the rebuild:
- Telemetry overwhelms the database. Every vehicle posts a GPS-and-status ping every 5 seconds. At 8,200 vehicles that is ~1,600 writes/second of pure location data, and during the morning dispatch wave it spikes past 5,000/second. Those writes land in the same Cloud SQL instance that serves dispatch, billing, and the customer app. When telemetry surges, the connection pool saturates and dispatch itself slows down — the system is least responsive exactly when drivers are hitting the road.
- One slow downstream stalls everything. When a parcel is delivered, the monolith synchronously, inside one request: updates the parcel, recomputes the route ETA for the remaining stops, scores the driver, fires the customer SMS via a third-party gateway, posts a billing event to the client’s system, and writes the proof-of-delivery image to storage. When the SMS gateway degrades from 200 ms to 6 seconds, delivery-confirmation requests pile up and drivers’ apps hang on the “mark delivered” button — so they stop using it, and the live map goes stale.
- Every new client integration is a quarter-long project. Onboarding a grocery chain that wants real-time webhook updates, custom SLA rules, and its own billing format means threading new logic through the monolith’s delivery path. The blast radius is the whole application; releases are monthly and tense.
- The live map is best-effort. The customer-facing “track your parcel” map is driven by polling the monolith every few seconds, which adds yet more read load and still shows positions that are 10–30 seconds stale.
- Spiky, predictable-yet-brutal load. Near-idle at 3 a.m.; a near-vertical ramp at 6–9 a.m. as the dispatch wave launches and grocery delivery windows open. They pay for a fleet sized for the peak and still tip over at the peak.
The business goals are not “go serverless.” They are: ingest the full telemetry firehose without it ever touching the transactional path; confirm a delivery in well under 100 ms regardless of how slow the SMS gateway is; recompute ETAs and fire SLA-breach alerts in seconds, not minutes; onboard a new client integration in under two weeks; give customers a genuinely live map; and stop paying for a peak-sized VM fleet that sits idle two-thirds of the day. Each goal maps directly onto an event-driven primitive — that is what makes this event-driven and not a lift-and-shift of the monolith onto Cloud Run.
Critically, the architecture scales down as cleanly as it scales up. A 12-person startup running 300 drivers and 9,000 parcels a day deploys the identical shape — Pub/Sub on the default quota, Cloud Run scaling to zero, a single-region Firestore in Native mode, one Workflows definition — for a few thousand rupees a month, and grows into the multi-region, BigQuery-backed, provisioned-min-instances version without redrawing the diagram. That is what makes it a reference architecture rather than a hyperscaler special.
Architecture overview
The system separates three planes that the monolith smears into one request thread: the telemetry ingestion plane (a relentless high-volume firehose that must never block anything), the command-and-event plane (business facts that many independent services react to), and the orchestration plane (long, money-and-goods transactions that need durable, multi-step coordination). Each plane gets the GCP primitive whose semantics fit, instead of forcing everything through one broker.
The end-to-end flow, traced through a parcel’s life:
- Ingress and edge. Clients (driver apps, the customer “track” app, client back-office systems, and warehouse scanners) hit a Global External Application Load Balancer with Cloud Armor (WAF, OWASP rules, geo and rate-based blocking) in front of API Gateway (or Apigee for the partner-facing, monetized APIs). The gateway terminates the API, validates the OIDC token (driver/customer identities from Identity Platform; client systems via API keys or service-account JWTs), and enforces per-key quotas. This is the only public door.
- Telemetry: accept the firehose, never touch the database. Driver-app GPS pings do not hit a function that writes to the operational store. They are published straight to a dedicated Pub/Sub topic (vehicle.telemetry). Pub/Sub absorbs the 5,000-writes/second spike because throughput is elastic and there are no partitions to provision. From that one topic, three independent subscriptions fan the stream three ways: (a) a Dataflow streaming job (or a Pub/Sub BigQuery subscription, which writes directly with zero code) lands every raw ping in BigQuery for analytics and replay; (b) a Cloud Run consumer maintains the current position of each vehicle as a single Firestore document per vehicle, which is what powers the live map; (c) a windowed Dataflow job detects geofence entry/exit and emits higher-level GeofenceCrossed events back onto the event plane. The transactional database never sees a raw GPS ping.
- Commands: accept fast, decouple immediately. For write operations that carry business weight (“create shipment”, “assign driver”, “mark delivered”) the gateway does not invoke a do-everything service. The driver app calls a thin Cloud Run command service that validates the request, performs a single idempotent write to Firestore keyed on a client-supplied request ID, and returns 202 Accepted in well under 100 ms. The fact is now durably captured; everything else happens out of band. “Mark delivered” returns instantly even when the SMS gateway is on fire.
- Firestore changes become events, natively, via Eventarc. This is the GCP-native transactional outbox: you do not run a separate outbox table and poller. Eventarc has a first-class Firestore trigger. When a parcels/{id} document is updated, Eventarc emits a CloudEvent and delivers it (through a managed Pub/Sub channel) to a publisher, which checks that the status moved to DELIVERED and puts a well-typed ParcelDelivered event onto the central logistics.events Pub/Sub topic. The event is published because and only because the Firestore write committed, so the store and the event plane cannot drift apart the way a dual write can; delivery is at-least-once, so consumers still deduplicate. (For services where you want explicit control, the command service publishes the event itself in the same logical unit; both patterns coexist, and the Firestore-trigger path is the zero-glue default for state-change notifications.)
- Fan-out to independent reactors. On the central event topic, each interested service has its own Pub/Sub subscription with a message filter (server-side, so a subscriber only receives and processes the event types it cares about). Each subscription push-delivers to a Cloud Run service with its own authenticated endpoint, or is pull-consumed by a service that wants flow control. The reactors to ParcelDelivered:
  - ETA service recomputes the route’s remaining-stop ETAs and emits RouteEtaUpdated.
  - Notification service sends the customer “delivered” SMS/push (and is the only thing that ever talks to the SMS gateway).
  - Billing service posts the billable event to the client’s system / ledger.
  - Driver-scoring service updates on-time-delivery stats.
  - POD service finalizes the proof-of-delivery image and signature.
  - Client-webhook service fans the fact out to whichever clients subscribed to it.

  Each reactor is a separate squad’s Cloud Run service, deployed independently, ignorant of the others. Adding a seventh is a new subscription with a filter, with zero changes upstream. This is the property that turns “quarter-long client onboarding” into “two weeks.”
- The multi-step transaction runs as a workflow (saga). Dispatch is not one event; it is a sequence with real consequences: reserve capacity, optimize the route across stops (a call to the Route Optimization API / Cloud Fleet Routing), assign drivers, confirm acceptance, and compensate (re-queue parcels, release the driver) if a step fails or times out. That orchestration lives in a Workflows definition, triggered when a dispatch window opens. Workflows owns the retries, exponential backoff, timeouts, parallel branches, callbacks that wait on a driver’s acceptance, and, crucially, the compensating steps. None of that belongs in a tangle of Cloud Run services calling each other synchronously.
- Real-time read models for queries. Reactors project events into purpose-built Firestore read models: a LiveMap collection (current vehicle positions), a ParcelStatusView, a RouteBoard for the dispatch desk. Because Firestore push-streams changes to subscribed clients over its real-time listeners, the customer “track” app and the dispatch console get live updates with no polling, and the map moves as the events arrive. This is CQRS: the write model and the read models are different shapes, kept eventually consistent by the event stream, and the read side is a push channel rather than a query load on the database.
Drawn out, the diagram is three horizontal bands. Top band (telemetry): Driver apps → LB/Cloud Armor → gateway → vehicle.telemetry Pub/Sub topic → three subscriptions → {BigQuery (raw), Cloud Run (live-position Firestore docs), Dataflow (geofence events)}. Middle band (command + event): App/clients → gateway → command Cloud Run → Firestore (idempotent write) → Eventarc Firestore trigger → logistics.events Pub/Sub topic → per-service filtered subscriptions → push to independent Cloud Run reactors → each projects a Firestore read model that push-streams to clients. Bottom band (orchestration): a dispatch-window event starts a Workflows saga that calls route-optimization, assignment, and confirmation services with built-in retry/compensation, writing terminal results back as events. Cross-cutting all three: Cloud Logging + Cloud Trace + Cloud Monitoring for traces and metrics, dead-letter topics on every subscription, Pub/Sub message retention + snapshots/seek for replay, and VPC Service Controls wrapping the data services.
Component breakdown
| Component | GCP service | What it does here | Key configuration choices |
|---|---|---|---|
| Edge & WAF | Global External ALB + Cloud Armor | TLS, global anycast ingress, OWASP + rate/geo blocking | Rate-based rules per IP; preconfigured WAF rule sets; adaptive protection for L7 DDoS |
| API / authN | API Gateway (or Apigee for partners) + Identity Platform | The single public door; token validation, quotas, key management | OIDC validation; per-key quotas; Apigee only where monetization/dev-portal is needed |
| Telemetry ingest | Pub/Sub topic (elastic) | Absorbs the GPS firehose; decouples ingest rate from processing rate | No partitions to size; message ordering keys only per-vehicle where needed; 7-day retention |
| Stream processing | Dataflow (streaming) + BigQuery subscription | Windowed geofence detection; raw telemetry land-to-warehouse | Pub/Sub→BigQuery direct subscription for raw (zero-code); Dataflow only where windowing/joins are needed |
| Command compute | Cloud Run (request-driven) | Accept-fast 202 writes; stateless, scale-to-zero | min-instances only on latency-critical services; concurrency 80; CPU-always-allocated off unless streaming |
| Event compute | Cloud Run (Pub/Sub push targets) | Independent reactors to business events | One service per bounded context; authenticated push (OIDC); --no-allow-unauthenticated |
| Central event bus | Pub/Sub topic + filtered subscriptions | Fan-out of business facts to many consumers | Subscription filters so each consumer gets only its event types; per-subscription dead-letter topic + maxDeliveryAttempts |
| Event routing / outbox | Eventarc (Firestore, Storage, Audit Log triggers) | Turns Google-service activity into routed CloudEvents; native outbox | Firestore document triggers for state-change events; Audit Log triggers for control-plane reactions |
| Operational store | Firestore (Native mode) | Source of truth for parcels/shipments; real-time read models | Multi-region (e.g. nam5/eur3) for HA; security rules; composite indexes; transactions for idempotency |
| Orchestration | Workflows | Long-running saga: reserve → optimize → assign → confirm, with compensation | Built-in retries + backoff; try/except for compensation; callbacks for human/driver waits |
| Analytics & replay | BigQuery | Telemetry warehouse, SLA reporting, event archive for reprocessing | Partitioned by ingest day; streaming inserts via BQ subscription; replay back onto Pub/Sub |
| Observability | Cloud Logging + Trace + Monitoring + Error Reporting | Structured logs, distributed traces, SLO-based alerting across the async graph | Trace context propagated through Pub/Sub attributes; log-based metrics; SLO burn-rate alerts |
A few of these choices carry the design and deserve the why, not just the what.
Pub/Sub is the backbone, and its lack of partitions is the point. On other clouds you size a partition count for your broker and then live with that ceiling; a 3x unexpected spike means hot partitions and rebalancing pain. Pub/Sub has no partitions to provision: throughput scales elastically, which is exactly what you want for a telemetry firehose whose peak is roughly 3x its baseline and arrives in a 9-minute window. Two features make it the right integration backbone, not just a queue: subscription filters (a server-side predicate on message attributes, so the billing service literally never receives RouteEtaUpdated, cutting delivery volume and the compute it would trigger) and dead-letter topics with maxDeliveryAttempts (a poison message is sidelined with full delivery metadata instead of crash-looping a consumer forever). Use ordering keys sparingly, and only where per-entity order matters (all events for one parcel): messages that share a key are delivered one after another, so a coarse key shared across many entities forfeits the elastic throughput that makes Pub/Sub special.
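A minimal sketch of how a producer might shape a message so that the subscription filter and per-parcel ordering have something to act on. The attribute name eventType matches the filter used later in this article; `EventMessage`, `build_event_message`, `matches_billing_filter`, and the payload layout are hypothetical, and handing the envelope to a real Pub/Sub publisher client is left out.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict


@dataclass(frozen=True)
class EventMessage:
    data: bytes                 # message body (JSON here for brevity)
    attributes: Dict[str, str]  # Pub/Sub attributes are string-to-string
    ordering_key: str           # per-parcel key: order is kept only within one parcel


def build_event_message(event_type: str, parcel_id: str, payload: Dict[str, Any]) -> EventMessage:
    """Build the envelope a producer would hand to a publisher client."""
    if not event_type or not parcel_id:
        raise ValueError("event_type and parcel_id are required")
    body = json.dumps({"parcelId": parcel_id, **payload}, sort_keys=True).encode("utf-8")
    return EventMessage(
        data=body,
        # Filters match on attributes, not on the body, so the type must live here.
        attributes={"eventType": event_type},
        ordering_key=parcel_id,
    )


def matches_billing_filter(msg: EventMessage) -> bool:
    """Local stand-in for the server-side filter on the billing subscription."""
    return msg.attributes.get("eventType") in {"ParcelDelivered", "ShipmentClosed"}
```

Keeping the event type in an attribute rather than only in the body is the design choice that matters here: the broker can then decide who receives a message without parsing the payload.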
Eventarc is the native outbox; don’t build one. The classic transactional-outbox pattern (write your row and an “outbox” row in one transaction, then run a poller that publishes the outbox and marks it sent) exists to guarantee the store and the broker agree. On GCP you get most of that for free: Eventarc’s Firestore trigger emits a CloudEvent only for a committed document change, so a ParcelDelivered event is never published for a write that did not happen. The trigger fires on every matching update and delivers at least once, so the publisher checks that the status actually moved to DELIVERED and consumers deduplicate. No outbox table, no poller, no dual-write race. Eventarc’s other triggers earn their keep too: a Cloud Storage trigger turns a proof-of-delivery image upload into a PodImageReceived event; an Audit Log trigger lets a governance reactor respond to control-plane actions (e.g. a service account key being created). Eventarc is the routing fabric that makes “any Google event becomes a typed, delivered fact” true without glue code.
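The trigger matches on the document path, so the publisher still has to decide whether a given update is the move into DELIVERED. A minimal sketch of that check, assuming the old and new documents have already been decoded into plain dictionaries with a status field; `status_of`, `delivered_event`, and the eventId format are hypothetical.

```python
from typing import Any, Dict, Optional


def status_of(document: Optional[Dict[str, Any]]) -> Optional[str]:
    """Read the status from a decoded document; None when the document or field is absent."""
    if not document:
        return None
    return document.get("status")


def delivered_event(parcel_id: str,
                    old_doc: Optional[Dict[str, Any]],
                    new_doc: Optional[Dict[str, Any]]) -> Optional[Dict[str, Any]]:
    """Return a ParcelDelivered event only when this update is the transition into DELIVERED."""
    if status_of(new_doc) != "DELIVERED":
        return None   # some other field or status changed
    if status_of(old_doc) == "DELIVERED":
        return None   # already delivered: a later edit to a delivered parcel
    return {
        "eventType": "ParcelDelivered",
        "parcelId": parcel_id,
        # Deterministic ID, so a redelivered trigger yields the same event for deduplication.
        "eventId": f"{parcel_id}:ParcelDelivered",
    }
```

Deriving the event ID from the parcel ID rather than generating a random one means a duplicate trigger delivery produces a duplicate that downstream consumers can recognise.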
Cloud Run for both planes, but configured differently. The same primitive (a stateless container that scales to zero) serves the request plane and the event plane, but the knobs differ. Request-facing command services that users wait on get min-instances ≥ 1 to dodge cold starts on the hot path; background event reactors scale from zero and tolerate a cold start because nobody is watching the clock. Push subscriptions deliver to Cloud Run over authenticated HTTPS (Pub/Sub mints an OIDC token for the subscription’s push service account; the service runs --no-allow-unauthenticated and grants run.invoker only to that account), so there is no open endpoint. Concurrency is set high (80+) for I/O-bound handlers to maximize utilization, low (1–4) only for memory-heavy or non-thread-safe ones. This is the difference between a Cloud Run bill that tracks real work and one that pays for idle.
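The concurrency setting translates fairly directly into instance count. A rough sizing sketch using Little's law (requests in flight equal arrival rate times latency); `instances_needed` is a hypothetical helper, and the 0.5 s handler latency in the usage lines is an illustrative assumption, not a measured figure.

```python
import math


def instances_needed(requests_per_second: float, avg_latency_seconds: float, concurrency: int) -> int:
    """Estimate instances required: in-flight requests divided by per-instance concurrency."""
    if concurrency < 1:
        raise ValueError("concurrency must be at least 1")
    in_flight = requests_per_second * avg_latency_seconds
    return math.ceil(in_flight / concurrency)


# 900 events/s at an assumed 0.5 s per handler call:
high_concurrency = instances_needed(900, 0.5, 80)   # I/O-bound handler
single_threaded = instances_needed(900, 0.5, 1)     # non-thread-safe handler
```

The estimate ignores autoscaler headroom and cold-start lag, so treat it as a lower bound when choosing max instances.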
Workflows, not a chain of services calling services, for the saga. The temptation is to have the route-optimization service call the assignment service call the confirmation service. That recreates the synchronous monolith with worse failure modes (each hop adds latency and a new way to fail mid-transaction, with no built-in compensation and opaque debugging). A Workflows definition instead gives you declarative steps with retry policies and exponential backoff, parallel branches, callback endpoints that pause the execution until a driver accepts (then resume), try/except blocks that run compensating steps (re-queue the parcels, release the reserved driver) when something fails, and a full per-execution history that turns a 5-hour incident into a 5-minute one. Reach for Cloud Tasks instead only when you need a single deferred/throttled job (rate-limit calls to a fragile client webhook), not a multi-step transaction — the two complement each other.
Implementation guidance
Identity and networking wiring (do this first — it prevents the most common incidents).
- Workload Identity Federation and per-service service accounts, never keys. Each Cloud Run service runs as its own least-privilege service account. The command service has Firestore write on its collections; the notification service has only the secret-accessor role for the SMS gateway credential (in Secret Manager) and nothing else; the billing service cannot touch the live-map collection. CI/CD (Cloud Build or GitHub Actions) authenticates via Workload Identity Federation, with no exported service-account JSON keys anywhere, which eliminates the single most common GCP incident class: a leaked key in a repo or a pipeline log. (This codebase has prior history with leaked DB credentials; the rule here is that secrets live only in Secret Manager and are reached via SA identity, never embedded.)
- Authenticated Pub/Sub push. Each push subscription is configured with a push auth service account; Pub/Sub attaches an OIDC token, and the target Cloud Run service grants run.invoker only to that account. No event endpoint is public.
- VPC Service Controls perimeter. Firestore, BigQuery, Pub/Sub, and Secret Manager sit inside a VPC-SC service perimeter so data cannot be exfiltrated to a project outside the boundary even if a credential leaks. Cloud Run egress to the internet (the SMS/maps third parties) goes through a Serverless VPC Access connector with Cloud NAT for stable, allow-listable egress IPs.
- Pub/Sub schemas. Define Pub/Sub schemas (Avro/Protobuf) on the central event topic so producers and consumers share a versioned contract and malformed events are rejected at publish time. Evolve schemas additively (new optional fields), never break a field a consumer reads.
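Additive evolution can be checked mechanically before a schema revision ships. A minimal sketch, assuming each schema has been reduced to a map from field name to a type and a required flag; `FieldSpec` and `is_additive_change` are hypothetical names, and a real pipeline would more likely lean on an Avro or Protobuf compatibility tool.

```python
from typing import Dict, NamedTuple


class FieldSpec(NamedTuple):
    kind: str        # e.g. "string", "long"
    required: bool


def is_additive_change(old: Dict[str, FieldSpec], new: Dict[str, FieldSpec]) -> bool:
    """True when every existing field survives unchanged and new fields are optional."""
    for name, spec in old.items():
        updated = new.get(name)
        if updated is None or updated.kind != spec.kind:
            return False      # removed or retyped: breaks a consumer that reads it
        if updated.required and not spec.required:
            return False      # tightened: producers on the old schema may omit it
    for name, spec in new.items():
        if name not in old and spec.required:
            return False      # new required field: old producers cannot supply it
    return True
```

Run as a CI gate, a check like this lets a producer and a consumer deploy on different days without coordinating.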
Idempotency is non-negotiable and lives in three places. At-least-once delivery is the law of this land: Pub/Sub, Eventarc, and any retry can hand you the same message twice (Pub/Sub even has an exactly-once delivery mode for pull subscriptions, but you should still design idempotently — it does not absolve a push consumer or a redelivery after a crash). (1) The command write path enforces it structurally: the command service uses a Firestore transaction with the client request ID as the document/key guard, so a retried “mark delivered” is a no-op rather than a second delivery. (2) Side-effecting reactors (send an SMS, post a billing charge) record the processed messageId in a small Firestore “processed-events” collection with a TTL, and short-circuit on replay; the SMS gateway and billing API are also called with their own idempotency keys. (3) The Workflows saga makes each step idempotent and passes idempotency keys to external calls so a retry never double-charges or double-assigns.
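The reactor-side guard in (2) can be sketched as follows. This is an in-memory stand-in for the Firestore “processed-events” collection, not a Firestore client: `ProcessedEvents` and `handle_once` are hypothetical names, and a real implementation would make the claim atomic with a Firestore transaction or a create-if-absent write.

```python
import time
from typing import Callable, Dict


class ProcessedEvents:
    """In-memory stand-in for a processed-events collection with a TTL."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._seen: Dict[str, float] = {}   # message ID -> expiry time

    def claim(self, message_id: str) -> bool:
        """Return True the first time an ID is seen within the TTL, False on a replay."""
        now = self._clock()
        expiry = self._seen.get(message_id)
        if expiry is not None and expiry > now:
            return False
        self._seen[message_id] = now + self._ttl
        return True

    def release(self, message_id: str) -> None:
        """Forget an ID so that a redelivery can be processed again."""
        self._seen.pop(message_id, None)


def handle_once(store: ProcessedEvents, message_id: str, side_effect: Callable[[], None]) -> bool:
    """Run the side effect at most once per message ID; report whether it ran."""
    if not store.claim(message_id):
        return False                  # duplicate delivery: acknowledge and skip
    try:
        side_effect()
    except Exception:
        store.release(message_id)     # failed: let the redelivery retry
        raise
    return True
```

Releasing the claim on failure keeps a crashed send retryable; the idempotency key passed to the SMS or billing API covers the remaining window where the side effect succeeded but the process died before acknowledging.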
Telemetry handling specifics. Raw GPS pings go to Pub/Sub and into BigQuery via a direct Pub/Sub→BigQuery subscription (no code, no Dataflow) for the raw-archive path. Only the derived work (geofence windowing, “vehicle idle > N minutes”, route-progress) runs in a Dataflow streaming job with windowing and watermarks. The live-map updater is a Cloud Run pull consumer that debounces: it writes each vehicle’s current position to Firestore at most once per few seconds, so 8,200 vehicles pinging every 5 s (about 1,600 pings/second at baseline, over 5,000 at peak) do not turn into the same rate of Firestore writes and map churn. The transactional Firestore database and the telemetry path are physically different stores/collections so the firehose can never contend with dispatch.
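The debounce itself is a few lines of state per vehicle. A minimal sketch, assuming a 15-second write interval (the article only says “a few seconds”); `PositionDebouncer` and `steady_state_writes_per_second` are hypothetical names, and the state lives in process memory, so each consumer instance debounces only the vehicles it happens to see.

```python
from typing import Dict


class PositionDebouncer:
    """Lets through at most one position write per vehicle per interval."""

    def __init__(self, min_interval_seconds: float) -> None:
        self._min_interval = min_interval_seconds
        self._last_write: Dict[str, float] = {}   # vehicle ID -> time of last accepted ping

    def should_write(self, vehicle_id: str, ping_time: float) -> bool:
        """Return True when this ping should be written to the live-map document."""
        last = self._last_write.get(vehicle_id)
        if last is not None and ping_time - last < self._min_interval:
            return False              # too soon: drop, a fresher ping is on its way
        self._last_write[vehicle_id] = ping_time
        return True


def steady_state_writes_per_second(vehicles: int, interval_seconds: float) -> float:
    """Upper bound on write rate when every vehicle writes once per interval."""
    return vehicles / interval_seconds


raw_rate = steady_state_writes_per_second(8200, 5)         # one write per 5 s ping
debounced_rate = steady_state_writes_per_second(8200, 15)  # one write per 15 s
```

With these assumed numbers the write rate falls to a third of the ping rate, at the cost of the map lagging by up to one interval.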
A few IaC snippets (Terraform) that capture the load-bearing wiring people most often get wrong. First, a filtered subscription with a dead-letter topic and retry policy — the part teams omit, then wonder why a poison event crash-loops a service:
resource "google_pubsub_topic" "events" { name = "logistics-events" }
resource "google_pubsub_topic" "events_dlq" { name = "logistics-events-dlq" }
resource "google_pubsub_subscription" "billing" {
  name  = "billing-parcel-delivered"
  topic = google_pubsub_topic.events.id
  # Server-side filter: billing only ever sees the events it cares about.
  filter = "attributes.eventType = \"ParcelDelivered\" OR attributes.eventType = \"ShipmentClosed\""
  # Authenticated push straight to the billing Cloud Run service.
  push_config {
    push_endpoint = "${google_cloud_run_v2_service.billing.uri}/events"
    oidc_token { service_account_email = google_service_account.pubsub_push.email }
  }
  # Poison messages get sidelined, not retried forever.
  # The Pub/Sub service agent needs publisher rights on the DLQ topic and
  # subscriber rights on this subscription for dead-lettering to take effect.
  dead_letter_policy {
    dead_letter_topic     = google_pubsub_topic.events_dlq.id
    max_delivery_attempts = 5
  }
  retry_policy {
    minimum_backoff = "10s"
    maximum_backoff = "600s"
  }
  ack_deadline_seconds = 60
}
Second, the Eventarc Firestore trigger that turns a committed delivery into a routed CloudEvent — the native outbox, with no poller:
resource "google_eventarc_trigger" "parcel_delivered" {
  name     = "parcel-delivered"
  location = "nam5"
  matching_criteria {
    attribute = "type"
    value     = "google.cloud.firestore.document.v1.updated"
  }
  matching_criteria {
    attribute = "database"
    value     = "(default)"
  }
  # Fire on any update under the parcels collection; the publisher checks for the DELIVERED transition.
  matching_criteria {
    attribute = "document"
    value     = "parcels/{parcelId}"
    operator  = "match-path-pattern"
  }
  # Firestore triggers deliver the document payload as protobuf.
  event_data_content_type = "application/protobuf"
  destination {
    cloud_run_service {
      service = google_cloud_run_v2_service.event_publisher.name
      region  = "us-central1"
      path    = "/firestore"
    }
  }
  service_account = google_service_account.eventarc.email
}
Third, a Cloud Run event reactor locked down to authenticated Pub/Sub push only:
resource "google_cloud_run_v2_service" "billing" {
  name     = "billing-svc"
  location = "us-central1"
  ingress  = "INGRESS_TRAFFIC_INTERNAL_LOAD_BALANCER"
  template {
    service_account = google_service_account.billing.email
    scaling {
      min_instance_count = 0
      max_instance_count = 50
    }
    containers {
      image = "us-docker.pkg.dev/${var.project}/svc/billing:${var.tag}"
      resources { limits = { cpu = "1", memory = "512Mi" } }
    }
    vpc_access {
      connector = google_vpc_access_connector.egress.id
      egress    = "ALL_TRAFFIC" # all egress through the connector → Cloud NAT
    }
  }
}
# Only the Pub/Sub push identity may invoke it.
resource "google_cloud_run_v2_service_iam_member" "billing_invoker" {
  name     = google_cloud_run_v2_service.billing.name
  location = google_cloud_run_v2_service.billing.location
  role     = "roles/run.invoker"
  member   = "serviceAccount:${google_service_account.pubsub_push.email}"
}
And the orchestration skeleton — a Workflows saga with retry and a compensating branch (the part that makes it a saga, not just a script):
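main:
  params: [input]
  steps:
    - init:
        # Assumption: the service URLs arrive in the execution input.
        assign:
          - reserve_url: ${input.reserveUrl}
          - optimize_url: ${input.optimizeUrl}
          - assign_url: ${input.assignUrl}
          - release_url: ${input.releaseUrl}
    - reserve:
        try:
          call: http.post
          args: { url: "${reserve_url}", body: { window: "${input.windowId}" } }
          result: reservation
        retry: ${http.default_retry}   # exponential backoff, bounded
    - optimize_and_assign:
        # Both steps run inside the try, so a failure in either releases the reservation.
        try:
          steps:
            - optimize:
                call: http.post
                args: { url: "${optimize_url}", body: "${reservation.body}" }
                result: plan
            - assign_and_confirm:
                call: http.post
                args: { url: "${assign_url}", body: "${plan.body}" }
                result: assignment
        except:
          as: e
          steps:
            - compensate:   # release the reservation, re-queue parcels
                call: http.post
                args: { url: "${release_url}", body: "${reservation.body}" }
            - fail:
                raise: ${e}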
Deployment and progressive delivery. Each service deploys independently through Cloud Build (or GitHub Actions over Workload Identity Federation) to Artifact Registry, then Cloud Run. Use Cloud Run revision traffic splitting for canaries — send 5% to the new revision, watch the error-rate and latency SLOs, then ramp. Pub/Sub schema evolution is additive so a producer and consumer can deploy on different days without breaking. The whole platform — topics, subscriptions, triggers, services, IAM, VPC-SC — is one Terraform state per environment, promoted through dev → staging → prod with the same code.
Enterprise considerations
Security and Zero Trust. The model is “no implicit trust, identity on every hop.” Public traffic enters only through the ALB + Cloud Armor + gateway; every internal endpoint (Cloud Run reactors, Eventarc targets) requires an authenticated identity (run.invoker granted only to the specific Pub/Sub-push or Eventarc service account), so even inside the project there is no anonymous east-west call. Service-to-data access is per-service least-privilege service accounts; secrets live only in Secret Manager, reached via SA identity — no keys in code, configs, or pipeline logs (the leaked-credential failure mode is designed out, not patched). A VPC Service Controls perimeter around Firestore/BigQuery/Pub/Sub/Secret Manager blocks data exfiltration to projects outside the boundary; Organization Policy constraints (e.g. iam.disableServiceAccountKeyCreation, domain-restricted sharing, allowed regions) are enforced org-wide. Pub/Sub schemas plus message validation stop malformed or spoofed events at the door.
Cost optimization. The architecture is structurally cheap because everything scales to zero or near-zero at 3 a.m. Cloud Run charges per request and per vCPU-second of actual work; only the few latency-critical command services hold min-instances warm. Pub/Sub subscription filters mean a consumer is never invoked for events it does not want, which saves the Cloud Run requests and compute those deliveries would have cost (check current Pub/Sub pricing for how filtered-out messages are metered). That is a real lever when one topic carries a dozen event types and most consumers want one. Raw telemetry to BigQuery via the direct Pub/Sub→BigQuery subscription avoids a standing Dataflow bill for the no-windowing path; Dataflow runs only for the genuinely streaming-analytical work, on autoscaling workers. Firestore is billed per operation, so the debounced live-map writes (at most one position write per vehicle every few seconds) are a deliberate cost control, not just a performance one. BigQuery telemetry tables are day-partitioned with expiration so cold history rolls to cheaper storage or out. Set budgets and alerts, and watch Pub/Sub unacked message age and Cloud Run billable instance time as your two leading cost-and-health indicators.
Scalability and reliability (RTO/RPO). Pub/Sub absorbs spikes by design — the 9-minute dispatch surge becomes elastic broker throughput and a brief, self-draining backlog rather than a database meltdown. Cloud Run scales horizontally to thousands of instances per service and back to zero. For reliability, run Firestore in multi-region mode (e.g. nam5) for automatic synchronous replication across regions — RPO ≈ 0 and RTO in minutes for a regional failure, with no application change. Pub/Sub is a regional-to-global managed service with its own redundancy; message retention (up to 7 days) plus snapshots and seek give you replay — if a consumer ships a bad version and mis-processes an hour of events, you fix it and seek the subscription back to reprocess, with BigQuery holding the raw archive as a second source of truth. Every async hop has a dead-letter topic, so a poison message is quarantined with full metadata instead of stalling the pipeline. Define SLOs (delivery-confirmation p99 latency, event end-to-end lag, telemetry freshness) and alert on burn rate, not raw error counts.
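Burn rate is simple arithmetic: the observed error ratio divided by the error budget the SLO allows. A minimal sketch; `burn_rate` and `should_page` are hypothetical helpers, and the 14.4 default threshold is borrowed from common multi-window alerting practice as an assumption, not taken from Meridian's configuration.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error ratio divided by the error budget (1 - SLO target)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        return 0.0                      # no traffic, no budget consumed
    error_budget = 1.0 - slo_target
    return (bad_events / total_events) / error_budget


def should_page(short_window_rate: float, long_window_rate: float, threshold: float = 14.4) -> bool:
    """Page only when both the short and the long window burn faster than the threshold."""
    return short_window_rate >= threshold and long_window_rate >= threshold
```

A burn rate of 1 means the budget would be spent exactly at the end of the SLO period; requiring both windows to exceed the threshold filters out brief spikes that a raw error count would alert on.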
Observability. This is the price of admission for event-driven, and skipping it leaves you with something harder to debug than the monolith. Propagate a trace/correlation ID as a Pub/Sub message attribute end-to-end so Cloud Trace stitches “parcel delivered → ETA recomputed → SMS sent → billed” into one timeline across services and the broker. Cloud Run emits structured JSON logs to Cloud Logging; build log-based metrics for business events (deliveries/min, SLA breaches/hour) and feed them to Cloud Monitoring dashboards. Error Reporting groups exceptions across services. The two operational vital signs are Pub/Sub subscription unacked-message age and backlog (a rising backlog means a consumer can’t keep up — the leading indicator of every async incident) and dead-letter topic depth (anything landing there is a real bug to investigate).
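Carrying trace context through the broker amounts to writing one attribute on publish and reading it on receipt. A minimal sketch using the W3C traceparent layout (version, trace ID, parent span ID, flags); the attribute key `traceparent` and the helper names are assumptions made for this sketch, and a tracing library would normally do this for you.

```python
import re
import secrets
from typing import Dict, Optional, Tuple

TRACEPARENT_KEY = "traceparent"   # assumption: attribute name chosen for this sketch
_TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_span_id() -> str:
    """Generate a random 16-hex-character span ID."""
    return secrets.token_hex(8)


def inject(attributes: Dict[str, str], trace_id: str, span_id: str, sampled: bool = True) -> Dict[str, str]:
    """Return a copy of the attributes carrying a traceparent value."""
    flags = "01" if sampled else "00"
    out = dict(attributes)
    out[TRACEPARENT_KEY] = f"00-{trace_id}-{span_id}-{flags}"
    return out


def extract(attributes: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Return (trace_id, parent_span_id), or None when the value is missing or malformed."""
    match = _TRACEPARENT_RE.match(attributes.get(TRACEPARENT_KEY, ""))
    if match is None:
        return None
    return match.group(1), match.group(2)
```

A reactor would call `extract` on the incoming message, start its own span under the returned parent, and `inject` the new span ID into any event it publishes in turn.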
Governance. A clear resource hierarchy (org → folders for environments → projects per environment/domain) with Organization Policies enforced top-down. The event catalog — every event type, its Pub/Sub schema, its producer, and its subscribers — is a governed, versioned artifact (in the repo and surfaced via Data Catalog / Dataplex), because in an event-driven system the schemas are the integration contracts and must be managed like APIs. IAM is least-privilege and reviewed; service accounts are inventoried. Audit Logs (Admin Activity always on; Data Access enabled on the sensitive stores) flow to a logs bucket / BigQuery sink for retention and SIEM. Cost and ownership are attributed per service via labels.
Reference enterprise example
Meridian Freight ran the rebuild over three quarters. Concrete decisions and numbers:
Scale and traffic. Steady state ~8,200 active vehicles emitting telemetry every 5 s (~1,600 msg/s baseline), ~210,000 parcels/day. The morning dispatch wave drives telemetry to ~5,100 msg/s and delivery-confirmation events to ~900/s in a 9-minute window. The old monolith’s Cloud SQL primary saturated at ~1,200 telemetry writes/second, which is below even the steady-state rate — telemetry and dispatch were fighting for the same instance.
What they built.
- Global ALB + Cloud Armor; API Gateway for app/clients, Apigee for two monetized client integrations. Identity Platform for driver/customer auth.
- Pub/Sub vehicle.telemetry topic with 3 subscriptions: a direct BigQuery subscription (raw archive), a Cloud Run live-position updater (debounced), and a Dataflow geofence/idle-detection job. Central logistics.events topic with per-service filtered subscriptions and a dead-letter topic each.
- Eventarc Firestore trigger on parcels/{id} emitting ParcelDelivered; a Cloud Storage trigger for proof-of-delivery uploads.
- Cloud Run services per bounded context: command, eta, notification, billing, driver-scoring, pod, client-webhook, live-map (eight squads, eight deploy pipelines). command and eta carry min-instances=2; the rest scale from zero.
- Firestore multi-region (nam5), Native mode: write model for parcels/shipments plus LiveMap, ParcelStatusView, RouteBoard read models pushed live to the apps over real-time listeners.
- Workflows dispatch saga: reserve → route-optimize (Cloud Fleet Routing) → assign → confirm, with a compensating release-and-requeue branch and a driver-acceptance callback.
- BigQuery for telemetry warehouse, SLA reporting, and event replay; VPC-SC perimeter around Firestore/BigQuery/Pub/Sub/Secret Manager; egress via VPC connector + Cloud NAT.
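The dispatch saga in the list above can be sketched in a few lines. This is plain Python that shows only the control flow, not Workflows syntax. Every step function is a hypothetical stand-in passed in by the caller, and the driver-acceptance callback is folded into confirm for brevity.

```python
class StepFailed(Exception):
    """Raised by any saga step that cannot complete."""


def run_dispatch_saga(parcel_id, reserve, optimize_route, assign, confirm,
                      release_and_requeue) -> str:
    """Run reserve -> route-optimize -> assign -> confirm.

    If any step after the reservation fails, run the compensating
    release-and-requeue action so the reservation is not left dangling.
    """
    # A failure here propagates as-is: nothing has been reserved yet,
    # so there is nothing to compensate.
    reservation = reserve(parcel_id)
    try:
        route = optimize_route(reservation)
        assignment = assign(route)
        confirm(assignment)  # e.g. the driver accepts the job
    except StepFailed:
        release_and_requeue(reservation)
        return "REQUEUED"
    return "CONFIRMED"
```

The point of putting this in an orchestrator rather than inside one of the services is that the compensation path is explicit and survives a crash of any single step.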
Decisions that paid off.
- Telemetry off the transactional path. Routing GPS pings to Pub/Sub (not the database) meant the 5,100 msg/s peak became elastic broker throughput, and dispatch latency stopped degrading during the morning wave. The transactional Firestore store never sees a raw ping.
- 202-Accept on delivery confirmation. “Mark delivered” p99 dropped from 4.6 s (under load, waiting on the SMS gateway) to 62 ms, because the driver app no longer waits on SMS/billing/ETA — those are Pub/Sub reactors. Drivers stopped abandoning the button, so the live map stayed fresh.
- SMS gateway degradation became a non-event. A real incident: the SMS provider degraded for ~40 minutes. Because notifications are a Pub/Sub consumer with a dead-letter topic, the affected messages backlogged and then drained (with a few dead-lettered and replayed) while deliveries, billing, and ETAs continued unaffected. In the monolith, that same outage failed delivery confirmations outright.
- Client onboarding of a new grocery chain with custom webhook + SLA rules took 8 days: add a filtered subscription on logistics.events pointing at a new client-webhook config. No producer changed.
- Independent releases. Eight squads now ship 35–45 times a week combined behind Cloud Run canaries; the monthly-release era is over.
- Live map is genuinely live. Customers see vehicle movement update within ~2–3 s instead of polling-stale 10–30 s, because Firestore push-streams the LiveMap projection rather than the app polling the database.
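The accept-fast shape behind the 202 decision can be illustrated with an in-memory sketch. The deque and dict below are stand-ins for the Pub/Sub topic and the Firestore write model, and all names are assumptions for the example. In the design described here the event is emitted by the Eventarc trigger on the Firestore write, not by the handler itself; the sketch collapses the two for brevity.

```python
from collections import deque

events = deque()   # stand-in for the logistics.events topic
parcels = {}       # stand-in for the Firestore write model


def mark_delivered(parcel_id: str, delivered_at: str):
    """Record the fact, emit the event, and return at once with 202."""
    parcels[parcel_id] = {"status": "DELIVERED", "delivered_at": delivered_at}
    events.append({
        "type": "ParcelDelivered",
        "parcel_id": parcel_id,
        "delivered_at": delivered_at,
    })
    # No SMS, billing, or ETA work happens on the request path.
    return 202, {"parcel_id": parcel_id, "status": "confirmed, processing"}


def drain(reactors) -> int:
    """Later, and independently, deliver each event to every reactor."""
    handled = 0
    while events:
        event = events.popleft()
        for reactor in reactors:
            reactor(event)
            handled += 1
    return handled
```

The driver app only ever waits for mark_delivered. However slow the SMS gateway is, that latency lives inside drain, where it cannot hold up the button.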
A failure they handled gracefully. A bad deploy of the eta service started raising on routes with a specific stop pattern. Because ETA recomputation is a Pub/Sub consumer with maxDeliveryAttempts=5 and a dead-letter topic, the affected events dead-lettered with an alert instead of crash-looping; deliveries and notifications continued for everyone else, and after the fix they seek-replayed the dead-letter and the original subscription window. The blast radius was “ETAs were briefly stale for some routes,” not “the platform went down.”
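To make the containment concrete, here is a small simulation of bounded redelivery with a dead-letter list. It is not the Pub/Sub client API, only a model of the behaviour that maxDeliveryAttempts=5 buys you; the function and field names are hypothetical.

```python
MAX_DELIVERY_ATTEMPTS = 5  # mirrors the subscription setting mentioned above


def deliver(messages, handler, max_attempts=MAX_DELIVERY_ATTEMPTS):
    """Try each message up to max_attempts times, then dead-letter it.

    Returns (acked, dead_lettered). A poison message costs a bounded
    number of attempts and never blocks the messages behind it.
    """
    acked, dead_lettered = [], []
    for message in messages:
        for _attempt in range(max_attempts):
            try:
                handler(message)
            except Exception:
                continue          # nack: the broker would redeliver
            acked.append(message)
            break
        else:
            # Every attempt failed: park the message for inspection and replay.
            dead_lettered.append(message)
    return acked, dead_lettered
```

After the fix ships, the dead-lettered list is exactly what gets replayed; everything in acked was never affected.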
Outcome. Steady-state GCP spend ~₹7.8 lakh/month, down from an over-provisioned VM fleet that cost more to keep upright through the morning wave — and most of the new bill is BigQuery analytics they did not previously have. Dispatch-wave survivability proven at 4x baseline with no degradation, telemetry fully decoupled from the transactional path, a live customer map, and a platform eight teams evolve independently. The business signed off on the strength of the first incident-free dispatch wave alone.
When to use it
Use this architecture when:
- You have a high-volume telemetry or event firehose (IoT, fleet, clickstream, sensors) that must be decoupled from your transactional store — this is Pub/Sub’s home turf and the clearest signal you need this shape.
- You have bursty or spiky load where accept-fast / process-reliably genuinely helps (logistics, dispatch, onboarding, claims, command-and-control).
- You have multiple independent consumers of the same business facts (the “many services react to one delivery event” shape) — this is where Pub/Sub fan-out with filters earns its keep.
- You want real-time read models pushed to clients without polling — Firestore’s live listeners make this nearly free once events are flowing.
- You need independent team and deployment autonomy across distinct business capabilities, and never-lose-the-work reliability more than the simplest request/response.
Be honest about the trade-offs:
- Eventual consistency is now your problem. The parcel is DELIVERED in the write model a moment before the read model and the client’s billing system catch up. The UI and the business must tolerate “confirmed, processing.” If your domain truly needs strict read-your-write consistency on every operation, an event-driven split fights you.
- Operational complexity goes up. You are running a broker, dead-letter topics, sagas, schema evolution, and distributed tracing. Without the observability and idempotency discipline above, you have built something harder to debug than the monolith. Commit to tracing, schemas, and dead-letter alerting from day one or do not adopt this.
- It is more moving parts than a small problem needs. A few hundred deliveries a day, one team, no firehose, no spikes? A well-structured modular monolith on Cloud Run + Cloud SQL is the right answer; carve out events later when a real seam (the telemetry firehose, a second consumer of a fact) actually appears.
Anti-patterns to avoid (each is a real incident waiting to happen):
- Telemetry through the database. Writing the raw GPS firehose to your transactional store is the original sin this architecture exists to prevent. The firehose goes to Pub/Sub; the store sees only derived facts.
- One broker forced into every semantic. Pushing high-fanout notifications, ordered commands, and a long saga all through one undifferentiated topic fights the tools. Match the primitive to the semantic — that is what the component table is for. Workflows for sagas, Cloud Tasks for single throttled jobs, Pub/Sub for fan-out.
- The distributed monolith. Cloud Run services that call each other synchronously and block on the reply have all the cost of distribution with none of the decoupling. If a “command” is just a blocking RPC with extra hops, it should be a direct call or shouldn’t exist.
- Events that carry no state but force a callback. A fact event that only says “something changed, call me back to find out what” reintroduces the coupling you paid to remove. Carry enough state in the event (and version the schema) for common consumers to act without a synchronous round-trip.
- No idempotency / no dead-letter strategy. At-least-once delivery will redeliver. Non-idempotent handlers double-charge and double-notify; subscriptions without a dead-letter topic crash-loop on the first poison message. These are guaranteed in week one, not edge cases.
- Shared database across services. If two services write the same Firestore collection, you do not have microservices; you have a monolith with a network in the middle. Each service owns its data and integrates through events.
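- Exported service-account keys. Leaked service-account JSON keys are among the most common root causes of GCP breaches. Run workloads under attached service accounts, use Workload Identity Federation for anything outside Google Cloud, keep third-party credentials in Secret Manager, and set the org policy that disables key creation (iam.disableServiceAccountKeyCreation).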
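The idempotency anti-pattern in the list above has a simple shape of fix: deduplicate on a stable event ID before performing the side effect. The sketch below is a minimal in-memory version. The event_id field and the class name are assumptions, and a real handler would need a durable, atomic check (for example a transactional create keyed on the event ID) rather than a Python set.

```python
class IdempotentHandler:
    """Wrap a side effect so a redelivered event is applied at most once."""

    def __init__(self, side_effect):
        self._side_effect = side_effect
        # In-memory only for illustration; production needs a durable store.
        self._seen = set()

    def handle(self, event: dict) -> bool:
        """Return True if the side effect ran, False if this was a duplicate."""
        event_id = event["event_id"]
        if event_id in self._seen:
            return False  # duplicate: acknowledge and do nothing
        self._side_effect(event)
        # Marked only after success, so a failed attempt can still be retried.
        self._seen.add(event_id)
        return True
```

With this wrapper around the billing reactor, an at-least-once redelivery of the same ParcelDelivered event acknowledges cleanly instead of charging twice.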
Alternatives worth weighing: a modular monolith (simplest, right until you have proven a firehose or an independent-scaling/independent-deploy need); GKE with Knative/Eventing or a KEDA-scaled deployment if you need full Kubernetes control and want the same event-driven shape on a cluster you operate; Cloud Functions (2nd gen, since renamed Cloud Run functions) instead of Cloud Run services for the smaller, single-purpose reactors where you want the lightest possible deploy unit (they share the same Eventarc/Pub/Sub plumbing); Dataflow-centric streaming if your problem is overwhelmingly continuous analytics rather than discrete business events; and pure Pub/Sub + Cloud Run + Firestore with no Workflows at all for the genuinely small or spiky-but-stateless workload that has no multi-step saga yet. The shape (three planes, telemetry decoupled, commands vs. events, native Eventarc outbox, idempotent handlers, saga) stays the same across all of them. Only the compute host and the orchestration richness change.