Data GCP

GCP Enterprise Architecture: Real-Time Analytics

Real-time analytics on Google Cloud fails in a way that looks like success for the first three weeks. A team wires the Pub/Sub-to-BigQuery direct subscription, points Looker at the landing table, sees rows appear two seconds after they’re produced, and declares victory. Then the events arrive out of order, a malformed payload poisons the subscription’s delivery and the backlog quietly climbs, a marketing campaign triples volume and dashboards that used to return in 400 ms now take nine seconds because every Looker query scans the full raw table, and finance notices that BigQuery is being billed for a full-table scan on every dashboard refresh. None of that is a Pub/Sub problem or a BigQuery problem. It is the absence of an architecture — a deliberate separation between the durable ingest log, the stream-processing tier that cleans and shapes events, the analytical store that is laid out for the questions you actually ask, and the serving layer that answers those questions in milliseconds without re-scanning history.

The single most important idea in the Google Cloud version of this pattern is that the warehouse is the streaming sink, not a thing you load into. On AWS or Azure you usually keep your hot operational store and your analytical store physically separate because the warehouse can’t ingest fast enough to be live. BigQuery breaks that assumption: the Storage Write API ingests streaming rows that are queryable within seconds, so a single, well-partitioned BigQuery dataset can be both the system of record and the thing Looker queries — provided you put a real Dataflow pipeline in front of it to handle ordering, deduplication, late data, and schema, and put BI Engine behind it so the dashboard layer never pays a full scan. This article is that architecture, built end to end on Pub/Sub, Dataflow, BigQuery, Looker, and BigQuery BI Engine.

The business scenario

Picture an operator who has events but no way to steer the business while those events still matter. The shape is identical whether you are a 40-person company or a 4,000-person one; only the volume and the number of stakeholders change.

A direct-to-consumer retailer is the clean example. Every web and app session emits a stream: page views, add-to-cart, checkout-started, payment-confirmed, search queries, inventory decrements, delivery-status webhooks from carriers. Today those events land in an application database and get ETL’d into a warehouse on a nightly batch. So the merchandising team plans tomorrow’s homepage off yesterday’s data; a stockout on a hero product is invisible until the morning report; a payment-gateway degradation that’s quietly killing conversion is discovered when someone looks at the dashboard at 10 a.m. and asks why revenue is down. The business questions are not exotic — what is selling right now, in which region, on which channel, and is anything broken — but the data arrives too late to answer them in time to act.

The requirements that fall out of this are consistent across enterprise sizes:

The architecture below meets all five with managed, mostly serverless GCP services, so a small team runs it without a streaming-infrastructure group, and a large enterprise scales the same design to millions of events per second.

Architecture overview

The end-to-end data path is a clean left-to-right pipeline with one durable buffer at the front, one processing tier in the middle, one analytical store that doubles as the serving store, and an acceleration plus presentation layer at the right. Read it as four stages.

Stage 1 — Ingest (Pub/Sub). Producers — web/app SDKs via a lightweight collector on Cloud Run, server-side services, third-party webhooks, and Change Data Capture from operational databases via Datastream — publish events to Pub/Sub topics. Pub/Sub is the durable, decoupling log: it absorbs spikes, holds messages (default 7-day retention, up to 31 days), and lets every downstream consumer read independently. Producers never feel consumer pressure. A topic per event domain (for example clickstream, orders, inventory) keeps schemas coherent and access controllable. Pub/Sub schema definitions (Avro/Protobuf) are attached to topics so bad payloads are rejected at publish time rather than poisoning the pipeline downstream.

Stage 2 — Stream processing (Dataflow). A Dataflow streaming job, written with the Apache Beam unified model, subscribes to the topics and does the real work: parse and validate, deduplicate on a business key, resolve event-time ordering and windowing (so late events land in the right time bucket instead of “now”), enrich (join against slowly-changing reference data — product catalog, customer segment — cached in side inputs or looked up in Bigtable/Memorystore), and route. Clean, shaped rows are written to BigQuery via the Storage Write API with exactly-once semantics. Anything that fails validation goes to a dead-letter Pub/Sub topic and a BigQuery errors table so nothing is lost and nothing stalls the main flow. Dataflow autoscaling and Streaming Engine add and remove workers with load.

Stage 3 — Analytical store (BigQuery). BigQuery is the streaming warehouse. Dataflow’s exactly-once writes land in time-partitioned, clustered tables that are queryable within seconds. The same dataset holds today’s live data and years of history, so “right now” and “year over year” are the same query surface. Scheduled queries and materialized views roll raw events into pre-aggregated marts (revenue by minute/region/channel, funnel conversion), and BigQuery ML can score in place (anomaly detection on the conversion stream) without moving data.

Stage 4 — Acceleration and serving (BI Engine + Looker). BigQuery BI Engine is an in-memory analytical layer that sits transparently in front of BigQuery: reserve memory in the dataset’s region, and the hot tables, marts, and materialized views Looker queries are served from RAM in tens of milliseconds with no query rewrite. Looker is the semantic and presentation layer: its LookML model defines metrics, dimensions, and the funnel once, governs row-level access, and renders the operational dashboards merchandisers and on-call engineers watch. Looker queries BigQuery, BI Engine accelerates them, and refreshes feel instant even as concurrency grows.

The flow in one breath: producers → Pub/Sub (durable buffer) → Dataflow (clean, dedupe, window, enrich, exactly-once write) → BigQuery (live + historical store, marts) → BI Engine (in-memory acceleration) → Looker (governed dashboards), with a dead-letter branch off Dataflow for anything malformed. Everything in the path is managed and elastic; there is no cluster you patch and no broker you capacity-plan by hand.

GCP real-time analytics reference architecture: Pub/Sub ingest, Dataflow stream processing with a dead-letter branch, BigQuery as the streaming warehouse with materialized views and ML, then BI Engine acceleration and Looker dashboards, numbered as an eight-step data flow

Component breakdown

Component Role in the path Key configuration choices
Pub/Sub topics Durable ingest buffer; decouples producers from consumers; absorbs spikes Topic per event domain; attach Avro/Protobuf schema; set message retention (7→up to 31 days) for replay; enable exactly-once delivery on subscriptions; ordering keys only where strictly needed
Pub/Sub subscriptions Deliver to Dataflow; isolate consumers Dedicated subscription per consumer; dead-letter topic + max delivery attempts; tune ack deadline; separate subscription for any direct BigQuery export use cases
Cloud Run collector Server-side event endpoint for web/app SDKs Stateless, autoscaling to zero; validates and publishes to Pub/Sub; behind a global HTTPS Load Balancer + Cloud Armor
Datastream Low-latency CDC from operational DBs into the stream Streams MySQL/Postgres/Oracle changes to Pub/Sub or BigQuery; pairs with Dataflow templates for transform
Dataflow (Beam) streaming job Parse, validate, dedupe, window, enrich, route, write Streaming Engine on; horizontal autoscaling; event-time windowing + watermarks + allowed lateness; dedupe on business key; Storage Write API exactly-once to BigQuery; dead-letter branch; side inputs / Bigtable for enrichment
BigQuery dataset/tables Streaming warehouse: live + historical store of record Time-unit partitioning (DAY/HOUR) on event time; clustering on high-cardinality filters (region, channel, sku); partition expiration for raw retention; materialized views + scheduled queries for marts
BigQuery ML In-warehouse scoring without data movement Anomaly/forecast models on the conversion or volume stream; inference via ML.PREDICT in scheduled queries
BigQuery BI Engine In-memory acceleration of the serving tables Reserve memory (GiB) in the dataset region; size to the hot marts + MVs; transparent — no query change; monitor hit ratio
Looker (LookML) Semantic model + governed dashboards Define metrics/funnel once in LookML; row-level access via user attributes; persist derived tables (PDTs) for heavy rollups; dashboards with auto-refresh

A few choices deserve the why, because they are where this architecture earns its keep.

Pub/Sub schemas and a dead-letter topic are non-negotiable, not nice-to-haves. The failure that kills naive pipelines is a single malformed event that the consumer can’t parse, retries forever, and backs up the whole subscription. Attaching a schema rejects garbage at publish time; the dead-letter topic catches whatever slips through so the main flow never stalls and you can replay or inspect failures later.

Dataflow exists specifically to do what the Pub/Sub→BigQuery direct subscription cannot. The direct subscription is genuinely useful for raw, append-only landing with no transformation. But it can’t deduplicate on a business key, can’t put a late event into the correct historical window (it lands it in “now”), can’t enrich against reference data, and can’t route bad records aside. Those four jobs are the difference between a dashboard you trust and one you don’t. Beam’s event-time windowing with watermarks and allowed lateness is the mechanism that makes “what happened in the 14:05 minute” correct even when some 14:05 events show up at 14:09.

Partitioning and clustering are the cost-control mechanism, applied at the store. A query filtered to today and one region should scan a few partitions and a few clusters, not the whole table. Get the partition key (event time) and cluster keys (the columns dashboards filter on) right and BigQuery’s bytes-scanned — what you pay for on-demand — drops by orders of magnitude. This is the layer that makes the next layer affordable.

BI Engine is what makes Looker feel instant without changing a query. BigQuery is fast but is a scan engine; interactive dashboards with hundreds of users hammering filter changes want memory-speed responses. BI Engine reserves RAM in the dataset’s region and transparently serves the hot tables from memory — Looker’s existing SQL just gets answered in tens of milliseconds. You don’t rewrite anything; you size a reservation and watch the hit ratio.

Implementation guidance

Project and dataset layout. Use a dedicated Google Cloud project for the analytics platform (or one per environment: analytics-dev, analytics-prod) under your org/folder hierarchy, with a shared VPC from the landing-zone host project. BigQuery datasets are regional — co-locate the dataset, BI Engine reservation, and Dataflow workers in the same region (for example us-central1 or europe-west1) to avoid cross-region egress and latency, and to keep BI Engine eligible.

Infrastructure as code (Terraform). Provision the whole path declaratively so it’s reproducible and reviewable:

Keep the Beam pipeline code in its own repo with unit tests on the transforms (Beam’s TestStream lets you assert windowing and late-data behavior deterministically) and a CI pipeline that builds the Flex Template image and updates the job.

The Dataflow pipeline, concretely. A streaming Beam pipeline (Java or Python) that: reads from the Pub/Sub subscription with message attributes for event time; applies a fixed or sliding window with an event-time watermark and allowed_lateness; deduplicates with a stateful Deduplicate/keyed dedup on the business key over a time bound; enriches via a side input refreshed periodically (catalog/segment) or a per-key lookup to Bigtable/Memorystore for high-cardinality joins; branches invalid records to a dead-letter PubsubIO/BigQuery errors table with the failure reason; and writes valid rows to BigQuery using BigQueryIO.write() with the Storage Write API at-least-once or exactly-once method. Prefer Storage Write API over legacy streaming inserts — it’s the current path, cheaper, and supports exactly-once.

Networking. The Cloud Run collector sits behind a global external HTTPS Load Balancer with Cloud Armor (WAF, geo/rate rules, bot defense) and Cloud CDN where applicable. Dataflow workers run in your shared VPC subnet with Private Google Access, so they reach Pub/Sub and BigQuery over Google’s private network with no public egress; use --no_use_public_ips. Lock data services behind VPC Service Controls perimeters so BigQuery and Pub/Sub can’t exfiltrate data outside the perimeter even with valid credentials. Looker reaches BigQuery over Google’s network; restrict Looker admin/UI access via IAP or an allowlist.

Identity and access (least privilege). Give each component its own service account: the collector SA gets pubsub.publisher on its topics only; the Dataflow worker SA gets pubsub.subscriber on its subscriptions, bigquery.dataEditor on the target dataset, and read on enrichment sources; the Looker SA gets bigquery.dataViewer + bigquery.jobUser on the serving dataset and nothing else. Humans get access through groups mapped to IAM roles, not individual grants. Use column-level access (policy tags via Data Catalog/Dataplex) to mask PII (email, payment tokens) so analysts query behavior without seeing identifiers, and authorized views / row-level access to scope what each market or team sees. Looker layers its own row-level controls on top via user attributes for defense in depth.

Enterprise considerations

Security and Zero Trust. The perimeter is identity- and context-based, not network-trust-based. No service account has standing broad access; each is scoped to exactly the topics/datasets it needs, and access is brokered through groups. VPC Service Controls create a data perimeter so even a leaked key can’t move BigQuery data out. Dataflow runs without public IPs over Private Google Access. PII is masked with column-level policy tags so most analysts never see raw identifiers; access to unmasked columns is a separate, audited grant. Cloud Armor fronts the only public ingress (the collector). Every access is logged in Cloud Audit Logs, and Pub/Sub schemas plus the dead-letter path mean malformed or hostile payloads are contained, not propagated. This is Zero Trust applied to a data platform: verify identity, grant least privilege, segment the data, and assume any single credential can be compromised.

Cost optimization. Costs split across four meters, and the architecture is shaped to keep each low:

The decisive cost move is that dashboards never hit raw history: BI Engine + materialized views absorb the interactive load, so analyst curiosity doesn’t translate into scan bills.

Scalability. Every stage scales independently. Pub/Sub is effectively unbounded throughput and absorbs spikes as a buffer. Dataflow autoscaling adds workers as the backlog or input rate grows and removes them when it drains. BigQuery’s storage and (with editions autoscaling) compute scale without capacity planning. BI Engine scales by adding reservation memory. Because the stages are decoupled by the Pub/Sub log, a slow downstream never backs up producers — the buffer just grows and drains. The same design runs at thousands of events/sec for a mid-market company and at millions/sec for a large enterprise; only the autoscaling ceilings and reservation sizes change.

Reliability and DR (RTO/RPO). The durable Pub/Sub log is the backbone of recoverability: with retention configured, you can replay from a timestamp to rebuild a derived table after a logic bug or a bad deploy, and the dead-letter topic preserves anything that failed. Dataflow checkpoints state in Streaming Engine, so a worker failure doesn’t lose in-flight data; exactly-once writes mean a retry doesn’t double-count. For RPO: with Pub/Sub retention and replay, effective data loss approaches zero for the window you retain — design for an RPO inside your retention period (e.g., 7 days). For RTO: BigQuery is regional with high availability inside the region; for regional-outage resilience, choose a multi-region BigQuery location (US/EU) or replicate critical datasets to a second region and keep the Dataflow + Pub/Sub deploy reproducible via Terraform so you can stand the pipeline back up in a paired region in well under an hour. Most enterprises target RTO of an hour or two for the dashboard layer and minutes for ingest (Pub/Sub is global-edge resilient).

Observability. Cloud Monitoring + Logging give you the operational picture: Pub/Sub subscription backlog / oldest-unacked-message age (the leading indicator that processing is falling behind), Dataflow system lag / data freshness / watermark and worker autoscaling, BigQuery slot utilization and bytes scanned, BI Engine hit ratio, and dead-letter topic depth (should be near zero; an alert fires if it grows). Looker’s own usage analytics show slow dashboards and heavy queries. Set SLOs on end-to-end freshness (event-produced to queryable) and alert on backlog age and dead-letter growth — those two catch most real incidents early.

Governance. Dataplex / Data Catalog provide the data catalog, lineage, and policy tags that drive column masking; this is where data products are described and discovered. Looker’s LookML is the governed semantic layer — metrics and the funnel are defined once, version-controlled in Git, and access-scoped, so every team computes “conversion rate” the same way. IAM through groups, audit logs, and VPC Service Controls round out a setup that satisfies SOC 2 / ISO controls and data-residency requirements (pin region to the required jurisdiction).

Reference enterprise example

Lumen & Loom is a mid-market omnichannel home-goods retailer: roughly 900 employees, an e-commerce site and mobile app, 40 physical stores feeding point-of-sale events, and about ₹4,200 crore in annual revenue. Their merchandising and growth teams were flying blind intraday — the warehouse refreshed nightly, so promotions were planned on yesterday’s data, stockouts on hero SKUs surfaced a day late, and a payment-gateway slowdown one Black-Friday-eve cost them an estimated ₹1.1 crore in lost conversion before anyone noticed at the morning standup. They set a target: operational dashboards no more than 90 seconds behind reality, sub-second interactivity, and no per-dashboard scan bills, on a platform a five-person data team could run.

They deployed exactly this architecture in asia-south1 (Mumbai) for data residency. Volume at peak is about 35,000 events/sec across clickstream, orders, inventory, and POS. The web/app SDKs post to a Cloud Run collector behind a global HTTPS LB with Cloud Armor; server services and Datastream CDC from their Postgres order DB publish to four Pub/Sub topics, each with an attached Avro schema and a dead-letter topic. A single Dataflow streaming job (Streaming Engine, autoscaling 4→30 workers) deduplicates on event_id, applies 1-minute event-time windows with 4 minutes of allowed lateness, enriches orders with the product catalog via a refreshed side input, routes ~0.3% malformed events to a dead-letter table, and writes via the Storage Write API (exactly-once) into BigQuery tables partitioned by event-hour and clustered on (region, channel, sku).

Materialized views maintain revenue-by-minute-by-region-by-channel and the checkout funnel; a BigQuery ML anomaly model scores the conversion stream every minute. A BI Engine reservation of 40 GiB sits in front of the marts, and Looker (Google Cloud core) serves the operational dashboards with LookML-defined metrics, row-level access by region for store managers, and column masking on customer email/payment tokens.

The outcome after one quarter:

The decision that paid off most was resisting the temptation to point Looker straight at the Pub/Sub→BigQuery direct subscription’s raw table. Putting Dataflow in the middle (for dedup, windowing, enrichment, dead-lettering) and BI Engine at the end (for speed and cost) is the entire difference between the “works in the demo” version and the one merchandising now plans the homepage on every afternoon.

When to use it

Use this architecture when you need genuinely fresh analytics (seconds-to-minutes) over event streams, and you want the same store to answer historical questions, and you have interactive dashboard users who need fast, governed, self-serve access. It is the sweet spot for clickstream and product analytics, real-time operational/revenue monitoring, IoT and telemetry analytics, fraud/anomaly surfacing, and any “live business dashboard” use case. It scales cleanly from a small team to a large enterprise because every tier is managed and elastic.

The trade-offs and anti-patterns:

Alternatives, and when they fit better. If your latency tolerance is hours, not seconds, skip the streaming tier entirely and use scheduled batch loads into BigQuery — cheaper and simpler. If you need true sub-second per-record operational lookups (serving an app feature, not a dashboard) rather than analytical aggregation, pair this with Bigtable as the low-latency serving store, since BigQuery is a scan engine, not a key-value store. If your team lives in open-source streaming and wants Kafka semantics, Pub/Sub Lite or self-managed Kafka on GKE plus Dataproc/Flink is a heavier but more portable substitute for the Pub/Sub + Dataflow pair. And if you’re standardizing on Spark and the lakehouse rather than the warehouse, Dataproc + BigLake/Iceberg with Looker on top is the analogous pattern — but for a managed, serverless, warehouse-centric real-time analytics platform on Google Cloud, Pub/Sub → Dataflow → BigQuery → BI Engine → Looker is the reference to reach for first.

GCPArchitectureEnterpriseReference Architecture
Need this built for real?

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.

Work with me

Comments

Keep Reading