
GCP Enterprise Architecture: Event-Driven Architecture

Most “event-driven on GCP” diagrams I review collapse the moment you ask one question: what happens when the consumer is down for an hour? The honest answer separates a real event-driven platform from a pile of Cloud Functions that happen to fire on a topic. On Google Cloud the building blocks are unusually good — Pub/Sub is a globally-distributed, at-least-once broker that holds messages for up to seven days and never makes you provision a partition; Eventarc turns any Google service’s activity (a file landing in a bucket, a Firestore document changing, an Audit Log entry) into a routed CloudEvent without you writing glue; Cloud Run gives you request-and-event-driven containers that scale to zero and back to thousands; Workflows orchestrates the long, multi-step business transactions that you must never bury inside a function; and Firestore is both a low-latency operational store and a real-time push channel to clients. The trap is treating these as interchangeable. The discipline is matching each one to a semantic — command vs. event, point-to-point vs. fan-out, accept-fast vs. orchestrate-reliably — and that matching is what this article is about.

The running domain here is deliberately not e-commerce checkout (the canonical example everyone reaches for). It is a connected-logistics and last-mile delivery platform, because logistics is where event-driven architecture stops being a style choice and becomes the only thing that works: tens of thousands of vehicles emitting GPS pings every few seconds, dispatch decisions that must react in real time, SLA clocks ticking on every parcel, and a fleet of independent reactors — ETA recalculation, geofence alerts, proof-of-delivery, billing, customer notifications — all keying off the same stream of facts. Commands, events, telemetry ingestion, sagas, idempotency, and real-time read models all show up at once, and you cannot hand-wave any of them.

The business scenario

Meridian Freight is a fictional but representative company: a regional logistics and same-day-delivery operator running roughly 8,200 vehicles (a mix of owned vans and contracted gig drivers) across a metro region, moving about 210,000 parcels a day for retail, grocery, and pharmacy clients. Revenue is around ₹620 crore (~USD 74M) a year. Today they run a three-year-old Java monolith on a fleet of Compute Engine VMs behind a load balancer, backed by a single Cloud SQL (PostgreSQL) primary, with a Redis cache bolted on for the live-tracking map. On a quiet weekday it holds together. The days that matter break it.

The pain that triggered the rebuild:

The business goals are not “go serverless.” They are: ingest the full telemetry firehose without it ever touching the transactional path; confirm a delivery in well under 100 ms regardless of how slow the SMS gateway is; recompute ETAs and fire SLA-breach alerts in seconds, not minutes; onboard a new client integration in under two weeks; give customers a genuinely live map; and stop paying for a peak-sized VM fleet that sits idle two-thirds of the day. Each goal maps directly onto an event-driven primitive — that is what makes this event-driven and not a lift-and-shift of the monolith onto Cloud Run.

Critically, the architecture scales down as cleanly as it scales up. A 12-person startup running 300 drivers and 9,000 parcels a day deploys the identical shape — Pub/Sub on the default quota, Cloud Run scaling to zero, a single-region Firestore in Native mode, one Workflows definition — for a few thousand rupees a month, and grows into the multi-region, BigQuery-backed, provisioned-min-instances version without redrawing the diagram. That is what makes it a reference architecture rather than a hyperscaler special.

Architecture overview

The system separates three planes that the monolith smears into one request thread: the telemetry ingestion plane (a relentless high-volume firehose that must never block anything), the command-and-event plane (business facts that many independent services react to), and the orchestration plane (long, money-and-goods transactions that need durable, multi-step coordination). Each plane gets the GCP primitive whose semantics fit, instead of forcing everything through one broker.

GCP event-driven reference architecture for connected logistics: a telemetry plane publishing GPS pings to a Pub/Sub topic that fans out to BigQuery, a debounced Cloud Run live-position consumer, and Dataflow; a command-and-event plane where a Cloud Run command service writes idempotently to Firestore, an Eventarc Firestore trigger publishes to the logistics.events topic, and filtered subscriptions push-deliver to independent Cloud Run reactors that project Firestore read models; and an orchestration plane running a Workflows dispatch saga, with a cross-cutting Cloud Logging/Trace/Monitoring, IAM, VPC Service Controls and dead-letter governance strip.

The end-to-end flow, traced through a parcel’s life:

  1. Ingress and edge. Clients — driver apps, the customer “track” app, client back-office systems, and warehouse scanners — hit a Global External Application Load Balancer with Cloud Armor (WAF, OWASP rules, geo and rate-based blocking) in front of API Gateway (or Apigee for the partner-facing, monetized APIs). The gateway terminates the API, validates the OIDC token (driver/customer identities from Identity Platform; client systems via API keys or service-account JWTs), and enforces per-key quotas. This is the only public door.

  2. Telemetry: accept the firehose, never touch the database. Driver-app GPS pings do not hit a function that writes to the operational store. They are published straight to a dedicated Pub/Sub topic (vehicle.telemetry); Pub/Sub absorbs the roughly 5,000-messages/second spike because throughput is elastic and there are no partitions to provision. From that one topic, three independent subscriptions fan the stream three ways: (a) a Dataflow streaming job (or a Pub/Sub BigQuery subscription, which writes directly with zero code) lands every raw ping in BigQuery for analytics and replay; (b) a Cloud Run consumer maintains the current position of each vehicle as a single Firestore document per vehicle, which is what powers the live map; (c) a windowed Dataflow job detects geofence entry/exit and emits higher-level GeofenceCrossed events back onto the event plane. The transactional database never sees a raw GPS ping.

  3. Commands: accept fast, decouple immediately. For write operations that carry business weight — “create shipment”, “assign driver”, “mark delivered” — the gateway does not invoke a do-everything service. The driver app calls a thin Cloud Run command service that validates the request, performs a single idempotent write to Firestore keyed on a client-supplied request ID, and returns 202 Accepted in well under 100 ms. The fact is now durably captured; everything else happens out of band. “Mark delivered” returns instantly even when the SMS gateway is on fire.

  4. Firestore changes become events, natively, via Eventarc. This is the GCP-native transactional outbox: you do not run a separate outbox table and poller. Eventarc has a first-class Firestore trigger: when a parcels/{id} document is updated, Eventarc emits a CloudEvent and delivers it (through a managed Pub/Sub channel) to a publisher, which checks for the transition to DELIVERED and puts a well-typed ParcelDelivered event onto the central logistics.events Pub/Sub topic. The event exists because and only because the Firestore write committed, so there is no dual write for the store and the event plane to disagree over; delivery is still at-least-once, so the publisher and everything downstream must tolerate duplicates. (For services where you want explicit control, the command service publishes the event itself in the same logical unit; both patterns coexist, and the Firestore-trigger path is the zero-glue default for state-change notifications.)

  5. Fan-out to independent reactors. On the central event topic, each interested service has its own Pub/Sub subscription with a message filter (server-side, so a subscriber only pays for and receives the event types it cares about). Each subscription push-delivers to a Cloud Run service with its own authenticated endpoint, or is pull-consumed by a service that wants flow control. The reactors to ParcelDelivered:

    • ETA service recomputes the route’s remaining-stop ETAs and emits RouteEtaUpdated.
    • Notification service sends the customer “delivered” SMS/push (and is the only thing that ever talks to the SMS gateway).
    • Billing service posts the billable event to the client’s system / ledger.
    • Driver-scoring service updates on-time-delivery stats.
    • POD service finalizes the proof-of-delivery image and signature.
    • Client-webhook service fans the fact out to whichever clients subscribed to it.

    Each reactor is a separate squad’s Cloud Run service, deployed independently, ignorant of the others. Adding a seventh is a new subscription with a filter, with zero changes upstream. This is the property that turns “quarter-long client onboarding” into “two weeks.”

  6. The multi-step transaction runs as a workflow (saga). Dispatch is not one event; it is a sequence with real consequences — reserve capacity, optimize the route across stops (a call to the Route Optimization API / Cloud Fleet Routing), assign drivers, confirm acceptance, and compensate (re-queue parcels, release the driver) if a step fails or times out. That orchestration lives in a Workflows definition, triggered when a dispatch window opens. Workflows owns the retries, exponential backoff, timeouts, parallel branches, callbacks that wait on a driver’s acceptance, and — crucially — the compensating steps. None of that belongs in a tangle of Cloud Run services calling each other synchronously.

  7. Real-time read models for queries. Reactors project events into purpose-built Firestore read models — a LiveMap collection (current vehicle positions), a ParcelStatusView, a RouteBoard for the dispatch desk. Because Firestore push-streams changes to subscribed clients over its real-time listeners, the customer “track” app and the dispatch console get live updates with no polling — the map moves as the events arrive. This is CQRS: the write model and the read models are different shapes, kept eventually consistent by the event stream, and the read side is a push channel rather than a query load on the database.
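The projection step in item 7 can be sketched in a few lines. This is a minimal illustration, not the platform's actual reactor: a plain dict stands in for the ParcelStatusView collection, and the field names (`parcelId`, `eventType`, `occurredAt`) and the `ParcelOutForDelivery` event type are assumptions made for the example. Because delivery is at-least-once and unordered unless ordering keys are used, the projector ignores any event that is not newer than what the view already holds.

```python
from typing import Dict

# Hypothetical mapping from event type to the status shown in ParcelStatusView.
STATUS_BY_EVENT = {
    "ParcelOutForDelivery": "OUT_FOR_DELIVERY",
    "ParcelDelivered": "DELIVERED",
}


def project(view: Dict[str, dict], event: dict) -> bool:
    """Apply one event to the read model; return True if the view changed."""
    status = STATUS_BY_EVENT.get(event["eventType"])
    if status is None:
        return False  # not an event this read model cares about
    current = view.get(event["parcelId"])
    # ISO-8601 UTC timestamps in one fixed format compare correctly as strings.
    if current is not None and current["updatedAt"] >= event["occurredAt"]:
        return False  # duplicate or out-of-order delivery: keep the newer state
    view[event["parcelId"]] = {"status": status, "updatedAt": event["occurredAt"]}
    return True
```

In a real reactor the dict write would be a Firestore document write, and the real-time listeners would push the changed document to subscribed clients.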

Drawn out, the diagram is three horizontal bands. Top band (telemetry): Driver apps → LB/Cloud Armor → gateway → vehicle.telemetry Pub/Sub topic → three subscriptions → {BigQuery (raw), Cloud Run (live-position Firestore docs), Dataflow (geofence events)}. Middle band (command + event): App/clients → gateway → command Cloud Run → Firestore (idempotent write) → Eventarc Firestore trigger → logistics.events Pub/Sub topic → per-service filtered subscriptions → push to independent Cloud Run reactors → each projects a Firestore read model that push-streams to clients. Bottom band (orchestration): a dispatch-window event starts a Workflows saga that calls route-optimization, assignment, and confirmation services with built-in retry/compensation, writing terminal results back as events. Cross-cutting all three: Cloud Logging + Cloud Trace + Cloud Monitoring for traces and metrics, dead-letter topics on every subscription, Pub/Sub message retention + snapshots/seek for replay, and VPC Service Controls wrapping the data services.

Component breakdown

| Component | GCP service | What it does here | Key configuration choices |
|---|---|---|---|
| Edge & WAF | Global External ALB + Cloud Armor | TLS, global anycast ingress, OWASP + rate/geo blocking | Rate-based rules per IP; preconfigured WAF rule sets; adaptive protection for L7 DDoS |
| API / authN | API Gateway (or Apigee for partners) + Identity Platform | The single public door; token validation, quotas, key management | OIDC validation; per-key quotas; Apigee only where monetization/dev-portal is needed |
| Telemetry ingest | Pub/Sub topic (elastic) | Absorbs the GPS firehose; decouples ingest rate from processing rate | No partitions to size; message ordering keys only per-vehicle where needed; 7-day retention |
| Stream processing | Dataflow (streaming) + BigQuery subscription | Windowed geofence detection; raw telemetry land-to-warehouse | Pub/Sub→BigQuery direct subscription for raw (zero-code); Dataflow only where windowing/joins are needed |
| Command compute | Cloud Run (request-driven) | Accept-fast 202 writes; stateless, scale-to-zero | min-instances only on latency-critical services; concurrency 80; CPU-always-allocated off unless streaming |
| Event compute | Cloud Run (Pub/Sub push targets) | Independent reactors to business events | One service per bounded context; authenticated push (OIDC); --no-allow-unauthenticated |
| Central event bus | Pub/Sub topic + filtered subscriptions | Fan-out of business facts to many consumers | Subscription filters so each consumer gets only its event types; per-subscription dead-letter topic + maxDeliveryAttempts |
| Event routing / outbox | Eventarc (Firestore, Storage, Audit Log triggers) | Turns Google-service activity into routed CloudEvents; native outbox | Firestore document triggers for state-change events; Audit Log triggers for control-plane reactions |
| Operational store | Firestore (Native mode) | Source of truth for parcels/shipments; real-time read models | Multi-region (e.g. nam5/eur3) for HA; security rules; composite indexes; transactions for idempotency |
| Orchestration | Workflows | Long-running saga: reserve → optimize → assign → confirm, with compensation | Built-in retries + backoff; try/except for compensation; callbacks for human/driver waits |
| Analytics & replay | BigQuery | Telemetry warehouse, SLA reporting, event archive for reprocessing | Partitioned by ingest day; streaming inserts via BQ subscription; replay back onto Pub/Sub |
| Observability | Cloud Logging + Trace + Monitoring + Error Reporting | Structured logs, distributed traces, SLO-based alerting across the async graph | Trace context propagated through Pub/Sub attributes; log-based metrics; SLO burn-rate alerts |

A few of these choices carry the design and deserve the why, not just the what.

Pub/Sub is the backbone — and its lack of partitions is the point. On other clouds you size a partition count for your broker and then live with that ceiling; a 3x unexpected spike means hot partitions and rebalancing pain. Pub/Sub has no partitions to provision — throughput scales elastically, which is exactly what you want for a telemetry firehose whose peak is 3x its baseline and arrives in a 9-minute window. Two features make it the right integration backbone, not just a queue: subscription filters (a server-side predicate on message attributes, so the billing service literally never receives RouteEtaUpdated, cutting both delivery and cost) and dead-letter topics with maxDeliveryAttempts (a poison message is sidelined with full delivery metadata instead of crash-looping a consumer forever). Use ordering keys sparingly — only where per-entity order matters (all events for one parcel), because global ordering kills the elastic throughput that makes Pub/Sub special.

Eventarc is the native outbox; don’t build one. The classic transactional-outbox pattern — write your row and an “outbox” row in one transaction, then run a poller that publishes the outbox and marks it sent — exists to guarantee the store and the broker agree. On GCP you get that for free: Eventarc’s Firestore trigger fires a CloudEvent precisely when the document commits, so a ParcelDelivered event is published if and only if the parcel actually reached DELIVERED. No outbox table, no poller, no dual-write race. Eventarc’s other triggers earn their keep too: a Cloud Storage trigger turns a proof-of-delivery image upload into a PodImageReceived event; an Audit Log trigger lets a governance reactor respond to control-plane actions (e.g. a service account key being created). Eventarc is the routing fabric that makes “any Google event becomes a typed, delivered fact” true without glue code.

Cloud Run for both planes, but configured differently. The same primitive (a stateless container that scales to zero) serves the request plane and the event plane, but the knobs differ. Request-facing command services that users wait on get min-instances ≥ 1 to dodge cold starts on the hot path; background event reactors scale from zero and tolerate a cold start because nobody is watching the clock. Push subscriptions deliver to Cloud Run over authenticated HTTPS (Pub/Sub mints an OIDC token; the service runs --no-allow-unauthenticated and only accepts the Pub/Sub service account), so there is no open endpoint. Concurrency is set high (80+) for I/O-bound handlers to maximize utilization, low (1–4) only for memory-heavy or non-thread-safe ones. This is the difference between a Cloud Run bill that tracks real work and one that pays for idle.

Workflows, not a chain of services calling services, for the saga. The temptation is to have the route-optimization service call the assignment service call the confirmation service. That recreates the synchronous monolith with worse failure modes (each hop adds latency and a new way to fail mid-transaction, with no built-in compensation and opaque debugging). A Workflows definition instead gives you declarative steps with retry policies and exponential backoff, parallel branches, callback endpoints that pause the execution until a driver accepts (then resume), try/except blocks that run compensating steps (re-queue the parcels, release the reserved driver) when something fails, and a full per-execution history that turns a 5-hour incident into a 5-minute one. Reach for Cloud Tasks instead only when you need a single deferred/throttled job (rate-limit calls to a fragile client webhook), not a multi-step transaction — the two complement each other.

Implementation guidance

Identity and networking wiring (do this first — it prevents the most common incidents).

Idempotency is non-negotiable and lives in three places. At-least-once delivery is the law of this land: Pub/Sub, Eventarc, and any retry can hand you the same message twice (Pub/Sub even has an exactly-once delivery mode for pull subscriptions, but you should still design idempotently — it does not absolve a push consumer or a redelivery after a crash). (1) The command write path enforces it structurally: the command service uses a Firestore transaction with the client request ID as the document/key guard, so a retried “mark delivered” is a no-op rather than a second delivery. (2) Side-effecting reactors (send an SMS, post a billing charge) record the processed messageId in a small Firestore “processed-events” collection with a TTL, and short-circuit on replay; the SMS gateway and billing API are also called with their own idempotency keys. (3) The Workflows saga makes each step idempotent and passes idempotency keys to external calls so a retry never double-charges or double-assigns.
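Point (1) above, the request-ID guard on the command write path, can be sketched as follows. This is a minimal illustration under stated assumptions: an in-memory class with a lock stands in for the Firestore transaction, and the names `CommandStore` and `mark_delivered` are hypothetical. The shape is what matters: check the request ID and write the state change inside one atomic unit, and give a retry the same 202 without a second write.

```python
import threading
from typing import Dict, Tuple


class CommandStore:
    """In-memory stand-in for a Firestore transaction guarded by request ID."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._requests: Dict[str, dict] = {}  # requestId -> recorded command
        self.parcels: Dict[str, str] = {}     # parcelId -> status

    def mark_delivered(self, request_id: str, parcel_id: str) -> Tuple[int, bool]:
        """Return (http_status, applied). A retried request ID is a no-op."""
        with self._lock:  # plays the role of the transaction boundary
            if request_id in self._requests:
                return 202, False  # already accepted: same answer, no second write
            self._requests[request_id] = {
                "parcelId": parcel_id,
                "command": "MarkDelivered",
            }
            self.parcels[parcel_id] = "DELIVERED"
            return 202, True
```

With Firestore the same guard would read the request document and write both documents inside one transaction, so two concurrent retries cannot both pass the check.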

Telemetry handling specifics. Raw GPS pings go to Pub/Sub and into BigQuery via a direct Pub/Sub→BigQuery subscription (no code, no Dataflow) for the raw-archive path. Only the derived work (geofence windowing, “vehicle idle > N minutes”, route-progress) runs in a Dataflow streaming job with windowing and watermarks. The live-map updater is a Cloud Run pull consumer that debounces: it writes each vehicle’s current position to Firestore at most once per few seconds, so 8,200 vehicles pinging every 5 s (about 1,640 pings/second at baseline, roughly three times that in the dispatch wave) do not translate one-for-one into Firestore writes and map churn. The transactional Firestore database and the telemetry path are physically different stores/collections so the firehose can never contend with dispatch.
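The debounce in the live-map updater could look roughly like this. It is a sketch, not the production consumer: `PositionDebouncer` is a hypothetical name, the state is per process (so it assumes pings for one vehicle reach the same instance, or that an occasional extra write is acceptable), and the clock is injectable only to make the behaviour easy to check.

```python
import time
from typing import Callable, Dict


class PositionDebouncer:
    """Lets at most one write per vehicle through per `min_interval_s` seconds."""

    def __init__(
        self,
        min_interval_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._min_interval_s = min_interval_s
        self._clock = clock
        self._last_write: Dict[str, float] = {}  # vehicleId -> time of last write

    def should_write(self, vehicle_id: str) -> bool:
        now = self._clock()
        last = self._last_write.get(vehicle_id)
        if last is not None and now - last < self._min_interval_s:
            return False  # too soon: skip this ping for the live map
        self._last_write[vehicle_id] = now
        return True
```

A consumer would call `should_write(vehicle_id)` for each ping and only then update the vehicle's Firestore document. Skipped pings are not lost, since every raw ping still lands in BigQuery on the archive path.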

A few IaC snippets (Terraform) that capture the load-bearing wiring people most often get wrong. First, a filtered subscription with a dead-letter topic and retry policy — the part teams omit, then wonder why a poison event crash-loops a service:

resource "google_pubsub_topic" "events"     { name = "logistics-events" }
resource "google_pubsub_topic" "events_dlq" { name = "logistics-events-dlq" }

resource "google_pubsub_subscription" "billing" {
  name  = "billing-parcel-delivered"
  topic = google_pubsub_topic.events.id

  # Server-side filter: billing only ever sees the events it cares about.
  filter = "attributes.eventType = \"ParcelDelivered\" OR attributes.eventType = \"ShipmentClosed\""

  # Authenticated push straight to the billing Cloud Run service.
  push_config {
    push_endpoint = "${google_cloud_run_v2_service.billing.uri}/events"
    oidc_token { service_account_email = google_service_account.pubsub_push.email }
  }

  # Poison messages get sidelined, not retried forever.
  dead_letter_policy {
    dead_letter_topic     = google_pubsub_topic.events_dlq.id
    max_delivery_attempts = 5
  }
  retry_policy {
    minimum_backoff = "10s"
    maximum_backoff = "600s"
  }
  ack_deadline_seconds = 60
}
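
On the receiving side of that subscription, the push endpoint's HTTP status is what drives the retry and dead-letter policy: a 2xx response acknowledges the message, anything else triggers redelivery with backoff until max_delivery_attempts is reached. A minimal, framework-free sketch of such a handler follows. The function name `handle_push` is hypothetical, and the in-memory `seen` set is an assumption standing in for the Firestore "processed-events" collection described earlier.

```python
import base64
import json
from typing import Callable, Set


def handle_push(envelope: dict, process: Callable[[dict], None], seen: Set[str]) -> int:
    """Return the HTTP status for one Pub/Sub push request.

    2xx acknowledges the message; any other status makes Pub/Sub redeliver it
    under the retry policy until max_delivery_attempts is reached.
    """
    message = envelope.get("message")
    if not isinstance(message, dict) or "messageId" not in message:
        return 400  # not a Pub/Sub push envelope
    message_id = message["messageId"]
    if message_id in seen:
        return 204  # duplicate delivery: ack without repeating the side effect
    try:
        # Push delivers the payload base64-encoded in message.data.
        event = json.loads(base64.b64decode(message.get("data", "")))
        process(event)
    except Exception:
        return 500  # nack: retried with backoff, then dead-lettered
    seen.add(message_id)
    return 204
```

Recording the message ID only after `process` succeeds means a crash mid-processing leads to a retry rather than a silently dropped event, which is why the downstream SMS and billing calls still need their own idempotency keys.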

Second, the Eventarc Firestore trigger that turns a committed delivery into a routed CloudEvent — the native outbox, with no poller:

resource "google_eventarc_trigger" "parcel_delivered" {
  name     = "parcel-delivered"
  location = "nam5"

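  # Firestore triggers deliver protobuf-encoded event data.
  event_data_content_type = "application/protobuf"

  # HCL single-line blocks may hold only one argument, so each filter is spelled out.
  matching_criteria {
    attribute = "type"
    value     = "google.cloud.firestore.document.v1.updated"
  }
  matching_criteria {
    attribute = "database"
    value     = "(default)"
  }
  # Fire only on writes under the parcels collection.
  matching_criteria {
    attribute = "document"
    value     = "parcels/{parcelId}"
    operator  = "match-path-pattern"
  }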

  destination {
    cloud_run_service {
      service = google_cloud_run_v2_service.event_publisher.name
      region  = "us-central1"
      path    = "/firestore"
    }
  }
  service_account = google_service_account.eventarc.email
}
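
Because this trigger matches every update under parcels/{parcelId}, the publisher behind it has to decide which updates are actually deliveries. A minimal sketch of that check, assuming the old and new document states have already been decoded into plain dicts and that the document carries `status` and `deliveredAt` fields (both field names are assumptions for illustration):

```python
from typing import Optional


def delivered_event(
    parcel_id: str, old: Optional[dict], new: Optional[dict]
) -> Optional[dict]:
    """Return a ParcelDelivered event only when status changes to DELIVERED."""
    if new is None or new.get("status") != "DELIVERED":
        return None  # deleted document, or an update that is not a delivery
    if old is not None and old.get("status") == "DELIVERED":
        return None  # already delivered: some other field changed
    return {
        # eventType is the attribute the subscription filters match on.
        "attributes": {"eventType": "ParcelDelivered"},
        "data": {"parcelId": parcel_id, "deliveredAt": new.get("deliveredAt")},
    }
```

The publisher would then publish the returned event to logistics.events and skip the update when the function returns None.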

Third, a Cloud Run event reactor locked down to authenticated Pub/Sub push only:

resource "google_cloud_run_v2_service" "billing" {
  name     = "billing-svc"
  location = "us-central1"
  ingress  = "INGRESS_TRAFFIC_INTERNAL_LOAD_BALANCER"
  template {
    service_account = google_service_account.billing.email
    scaling {
      min_instance_count = 0
      max_instance_count = 50
    }
    containers {
      image = "us-docker.pkg.dev/${var.project}/svc/billing:${var.tag}"
      resources { limits = { cpu = "1", memory = "512Mi" } }
    }
    vpc_access {
      connector = google_vpc_access_connector.egress.id
      egress    = "ALL_TRAFFIC"   # all egress through the connector → Cloud NAT
    }
  }
}

# Only the Pub/Sub push identity may invoke it.
resource "google_cloud_run_v2_service_iam_member" "billing_invoker" {
  name     = google_cloud_run_v2_service.billing.name
  location = google_cloud_run_v2_service.billing.location
  role     = "roles/run.invoker"
  member   = "serviceAccount:${google_service_account.pubsub_push.email}"
}

And the orchestration skeleton — a Workflows saga with retry and a compensating branch (the part that makes it a saga, not just a script):

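main:
  params: [input]                                # execution argument
  steps:
    - init:
        assign:                                  # assumption: endpoint URLs arrive in the execution input
          - reserve_url: ${input.reserveUrl}
          - optimize_url: ${input.optimizeUrl}
          - assign_url: ${input.assignUrl}
          - release_url: ${input.releaseUrl}
    - reserve:
        try:
          call: http.post
          args:
            url: ${reserve_url}
            auth: { type: OIDC }                 # the Cloud Run targets reject unauthenticated calls
            body:
              window: ${input.windowId}
          result: reservation
        retry: ${http.default_retry}            # exponential backoff, bounded
    - optimize:
        call: http.post
        args:
          url: ${optimize_url}
          auth: { type: OIDC }
          body: ${reservation.body}              # result holds the whole HTTP response; pass its body
        result: plan
    - assign_and_confirm:
        try:
          call: http.post
          args:
            url: ${assign_url}
            auth: { type: OIDC }
            body: ${plan.body}
          result: assignment
        except:
          as: e
          steps:
            - compensate:                        # release the reservation, re-queue parcels
                call: http.post
                args:
                  url: ${release_url}
                  auth: { type: OIDC }
                  body: ${reservation.body}
            - fail:
                raise: ${e}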
Deployment and progressive delivery. Each service deploys independently through Cloud Build (or GitHub Actions over Workload Identity Federation) to Artifact Registry, then Cloud Run. Use Cloud Run revision traffic splitting for canaries: send 5% to the new revision, watch the error-rate and latency SLOs, then ramp. Keep Pub/Sub schema evolution additive (new optional fields only, nothing removed or retyped) so a producer and consumer can deploy on different days without breaking. The whole platform (topics, subscriptions, triggers, services, IAM, VPC-SC) is one Terraform state per environment, promoted through dev → staging → prod with the same code.

Enterprise considerations

Security and Zero Trust. The model is “no implicit trust, identity on every hop.” Public traffic enters only through the ALB + Cloud Armor + gateway; every internal endpoint (Cloud Run reactors, Eventarc targets) requires an authenticated identity (run.invoker granted only to the specific Pub/Sub-push or Eventarc service account), so even inside the project there is no anonymous east-west call. Service-to-data access is per-service least-privilege service accounts; secrets live only in Secret Manager, reached via SA identity — no keys in code, configs, or pipeline logs (the leaked-credential failure mode is designed out, not patched). A VPC Service Controls perimeter around Firestore/BigQuery/Pub/Sub/Secret Manager blocks data exfiltration to projects outside the boundary; Organization Policy constraints (e.g. iam.disableServiceAccountKeyCreation, domain-restricted sharing, allowed regions) are enforced org-wide. Pub/Sub schemas plus message validation stop malformed or spoofed events at the door.

Cost optimization. The architecture is structurally cheap because everything scales to zero or near-zero at 3 a.m. Cloud Run charges per request and per vCPU-second of actual work; only the few latency-critical command services hold min-instances warm. Pub/Sub subscription filters mean a consumer is neither delivered nor billed for events it does not want — a real lever when one topic carries a dozen event types and most consumers want one. Raw telemetry to BigQuery via the direct Pub/Sub→BigQuery subscription avoids a standing Dataflow bill for the no-windowing path; Dataflow runs only for the genuinely streaming-analytical work, on autoscaling workers. Firestore is billed per operation, so the debounced live-map writes (cap one position write per vehicle per few seconds) are a deliberate cost control, not just a performance one. BigQuery telemetry tables are day-partitioned with expiration so cold history rolls to cheaper storage or out. Set budgets and alerts, and watch Pub/Sub unacked message age and Cloud Run billable instance time as your two leading cost-and-health indicators.

Scalability and reliability (RTO/RPO). Pub/Sub absorbs spikes by design — the 9-minute dispatch surge becomes elastic broker throughput and a brief, self-draining backlog rather than a database meltdown. Cloud Run scales horizontally to thousands of instances per service and back to zero. For reliability, run Firestore in multi-region mode (e.g. nam5) for automatic synchronous replication across regions — RPO ≈ 0 and RTO in minutes for a regional failure, with no application change. Pub/Sub is a regional-to-global managed service with its own redundancy; message retention (up to 7 days) plus snapshots and seek give you replay — if a consumer ships a bad version and mis-processes an hour of events, you fix it and seek the subscription back to reprocess, with BigQuery holding the raw archive as a second source of truth. Every async hop has a dead-letter topic, so a poison message is quarantined with full metadata instead of stalling the pipeline. Define SLOs (delivery-confirmation p99 latency, event end-to-end lag, telemetry freshness) and alert on burn rate, not raw error counts.
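For readers who have not used burn-rate alerting, the quantity is simply the observed error ratio divided by the error ratio the SLO allows: a value of 1 spends the error budget exactly over the SLO window, and a value of 10 spends it ten times faster. A minimal sketch (the function name and the example numbers are illustrative, not Meridian's actual SLOs):

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Error-budget burn rate: observed error ratio / allowed error ratio."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_events <= 0:
        return 0.0  # no traffic in the window, so no budget was spent
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return (bad_events / total_events) / error_budget
```

For example, 10 slow delivery confirmations out of 1,000 against a 99.9% target is a burn rate of about 10, which is worth paging on even though the raw count of 10 looks small.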

Observability. This is the price of admission for event-driven, and skipping it leaves you with something harder to debug than the monolith. Propagate a trace/correlation ID as a Pub/Sub message attribute end-to-end so Cloud Trace stitches “parcel delivered → ETA recomputed → SMS sent → billed” into one timeline across services and the broker. Cloud Run emits structured JSON logs to Cloud Logging; build log-based metrics for business events (deliveries/min, SLA breaches/hour) and feed them to Cloud Monitoring dashboards. Error Reporting groups exceptions across services. The two operational vital signs are Pub/Sub subscription unacked-message age and backlog (a rising backlog means a consumer can’t keep up — the leading indicator of every async incident) and dead-letter topic depth (anything landing there is a real bug to investigate).
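Carrying trace context through Pub/Sub attributes can be sketched with the W3C `traceparent` format (version, 32-hex trace ID, 16-hex span ID, flags). This is a hand-rolled illustration with hypothetical helper names `inject` and `extract`; in practice an OpenTelemetry propagator would normally do this work, and the attribute key is an assumption rather than anything Pub/Sub mandates.

```python
import re
import secrets
from typing import Dict, Optional, Tuple

_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def inject(attributes: Dict[str, str], trace_id: Optional[str] = None) -> Dict[str, str]:
    """Return a copy of the attributes carrying a W3C traceparent value."""
    trace_id = trace_id or secrets.token_hex(16)  # 32 hex chars, shared by the whole flow
    span_id = secrets.token_hex(8)                # 16 hex chars, new for each hop
    return {**attributes, "traceparent": f"00-{trace_id}-{span_id}-01"}


def extract(attributes: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Return (trace_id, parent_span_id), or None if absent or malformed."""
    match = _TRACEPARENT.match(attributes.get("traceparent", ""))
    return (match.group(1), match.group(2)) if match else None
```

A reactor that receives ParcelDelivered would call `extract` on the incoming attributes and pass the trace ID to `inject` when it publishes RouteEtaUpdated, so both hops share one trace.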

Governance. A clear resource hierarchy (org → folders for environments → projects per environment/domain) with Organization Policies enforced top-down. The event catalog — every event type, its Pub/Sub schema, its producer, and its subscribers — is a governed, versioned artifact (in the repo and surfaced via Data Catalog / Dataplex), because in an event-driven system the schemas are the integration contracts and must be managed like APIs. IAM is least-privilege and reviewed; service accounts are inventoried. Audit Logs (Admin Activity always on; Data Access enabled on the sensitive stores) flow to a logs bucket / BigQuery sink for retention and SIEM. Cost and ownership are attributed per service via labels.

Reference enterprise example

Meridian Freight ran the rebuild over three quarters. Concrete decisions and numbers:

Scale and traffic. Steady state ~8,200 active vehicles emitting telemetry every 5 s (~1,600 msg/s baseline), ~210,000 parcels/day. The morning dispatch wave drives telemetry to ~5,100 msg/s and delivery-confirmation events to ~900/s in a 9-minute window. The old monolith’s Cloud SQL primary saturated at ~1,200 telemetry writes/second, which is below even the steady-state rate — telemetry and dispatch were fighting for the same instance.

What they built.

Decisions that paid off.

A failure they handled gracefully. A bad deploy of the eta service started raising exceptions on routes with a specific stop pattern. Because ETA recomputation is a Pub/Sub consumer with maxDeliveryAttempts=5 and a dead-letter topic, the affected events dead-lettered with an alert instead of crash-looping; deliveries and notifications continued for everyone else, and after the fix they republished the dead-lettered events and used seek to replay the affected window on the original subscription. The blast radius was “ETAs were briefly stale for some routes,” not “the platform went down.”

Outcome. Steady-state GCP spend ~₹7.8 lakh/month, down from an over-provisioned VM fleet that cost more to keep upright through the morning wave — and most of the new bill is BigQuery analytics they did not previously have. Dispatch-wave survivability proven at 4x baseline with no degradation, telemetry fully decoupled from the transactional path, a live customer map, and a platform eight teams evolve independently. The business signed off on the strength of the first incident-free dispatch wave alone.

When to use it

Use this architecture when:

Be honest about the trade-offs:

Anti-patterns to avoid (each is a real incident waiting to happen):

Alternatives worth weighing: a modular monolith (simplest, right until you have proven a firehose or an independent-scaling/independent-deploy need); GKE with Knative/Eventing or a KEDA-scaled deployment if you need full Kubernetes control and want the same event-driven shape on a cluster you operate; Cloud Functions (2nd gen) instead of Cloud Run for the smaller, single-purpose reactors where you want the lightest possible deploy unit (they share the Eventarc/Pub/Sub plumbing exactly); Dataflow-centric streaming if your problem is overwhelmingly continuous analytics rather than discrete business events; and pure Pub/Sub + Cloud Run + Firestore with no Workflows at all for the genuinely small or spiky-but-stateless workload that has no multi-step saga yet. The shape — three planes, telemetry decoupled, commands vs. events, native Eventarc outbox, idempotent handlers, saga — stays the same across all of them. Only the compute host and the orchestration richness change.


Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
