GCP Enterprise Architecture: Big Data Processing

Every enterprise eventually hits the same wall: the data exists, but it is trapped. Sales sits in a CRM, clickstream lands in an event bus, the warehouse holds yesterday’s truth, and a dozen CSV exports rot in shared drives. Analysts wait days for a number, data scientists copy production tables to laptops, and nobody trusts the dashboard because three teams compute “active customer” three different ways. This article is a complete, reusable GCP architecture for big data processing — the kind of platform that takes raw events and files in at one end and produces governed, query-ready, ML-ready data at the other. It is built on the services that actually do the heavy lifting on Google Cloud: Cloud Storage as the lake, Dataproc for Spark/Hadoop workloads, Dataflow for streaming and unified batch pipelines, BigQuery as the serverless warehouse and serving layer, and Cloud Composer (managed Apache Airflow) as the orchestration brain.

The business scenario

Picture a mid-market retailer — call the pattern “RetailCo” for now; we will name a concrete company later. They run e-commerce plus 140 physical stores. Their data problem is not exotic, which is exactly why this architecture is widely useful:

The same shape applies to a fintech (transactions + risk), a media company (impressions + content analytics), a manufacturer (IoT telemetry + ERP), or a SaaS vendor (product usage + billing). Small enterprises run a trimmed version (one region, autoscaling to zero); large enterprises run the full multi-zone, multi-project, governed version. The architecture scales down as gracefully as it scales up, which is the test of a real reference design.

The problem this solves, stated plainly: turn many noisy sources into one governed, query-fast, cost-predictable data platform that serves real-time, analytical, and ML consumers from the same source of truth — without locking you into a single processing engine.

Architecture overview

The platform is a lakehouse: a Cloud Storage data lake for cheap, open-format storage, fronting BigQuery as the warehouse and serving engine, with Dataproc and Dataflow as the two complementary processing engines and Composer conducting the whole orchestra. Data flows through medallion-style zones — raw → curated → consumption (bronze/silver/gold).

Figure: GCP big-data lakehouse reference architecture. Pub/Sub and Cloud Storage ingestion feed Dataflow (streaming) and Dataproc (batch) engines, writing raw and curated Iceberg zones in Cloud Storage, served through BigQuery to BI, BigQuery ML and Vertex AI, all orchestrated by Cloud Composer and governed by Dataplex, DLP, IAM and VPC Service Controls.

End-to-end data path:

  1. Ingestion. Two front doors. Streaming events (clickstream, app telemetry, CDC) publish to Pub/Sub. Batch files (POS drops, third-party feeds, exports) land in a Cloud Storage raw bucket, organized as raw/<source>/<table>/dt=YYYY-MM-DD/. CDC from operational databases uses Datastream, which writes change records into the raw bucket (and can target BigQuery directly).

  2. Stream processing. A Dataflow streaming pipeline (Apache Beam) subscribes to Pub/Sub, validates and enriches events, dead-letters bad records, and writes two ways: low-latency rows into BigQuery via the Storage Write API (for live dashboards and fraud features) and raw-but-structured Parquet into Cloud Storage raw (so the lake remains the complete system of record). This is the speed layer.

  3. Batch transformation. Cloud Composer (Airflow) orchestrates the batch layer. On schedule (or triggered by file arrival), it runs Dataproc Serverless Spark jobs that read the raw zone, clean/conform/deduplicate, apply business logic and SCD (slowly changing dimension) handling, and write curated data as partitioned, Apache Iceberg-managed Parquet in Cloud Storage. Heavy ML feature engineering and any existing PySpark/Hive workloads run here too — Dataproc gives the team their familiar Spark engine without managing clusters.

  4. Warehouse load and modeling. Curated tables are exposed to BigQuery either as BigLake external/Iceberg tables (query in place, no copy) or loaded as native BigQuery tables when query performance demands it. Composer then runs dbt (or BigQuery scheduled SQL) to build the consumption zone: star schemas, conformed dimensions, aggregate/rollup tables, and the metrics layer where “active customer” is defined exactly once.

  5. Serving and consumption. BigQuery is the single query surface. BI (Looker / Looker Studio) reads consumption tables; ad-hoc analysts query with SQL and BI Engine for sub-second dashboards; ML uses BigQuery ML for in-warehouse models and Vertex AI for heavier training, reading curated feature tables; reverse ETL pushes audiences and scores back to operational systems. Real-time consumers (fraud service, live ops) read the freshly-streamed BigQuery tables.

  6. Cross-cutting. Dataplex provides the data-mesh governance fabric — it catalogs every asset across the buckets and datasets, runs data-quality and data-profiling scans, and applies policy. Cloud DLP / Sensitive Data Protection discovers and classifies PII. IAM + VPC Service Controls enforce access and create a perimeter so data cannot be exfiltrated. Cloud Logging/Monitoring observes pipeline health, and billing exports to BigQuery make cost itself a queryable dataset.
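Steps 1 and 2 above can be sketched in plain Python to make the routing logic concrete. This is a minimal illustration, not a Beam pipeline: the required fields, the `route` function, and the default `web`/`clickstream` names are assumptions made for the example. Only the `raw/<source>/<table>/dt=YYYY-MM-DD/` layout comes from the text.

```python
import json
from datetime import datetime, timezone

# Hypothetical minimal event schema; a real pipeline would use the Pub/Sub topic schema.
REQUIRED_FIELDS = ("event_id", "event_type", "ts")


def raw_path(source: str, table: str, event_time: datetime) -> str:
    """Build the raw-zone prefix raw/<source>/<table>/dt=YYYY-MM-DD/."""
    return f"raw/{source}/{table}/dt={event_time:%Y-%m-%d}/"


def route(message: bytes, source: str = "web", table: str = "clickstream"):
    """Return ("ok", path, event) for valid events, else ("dead_letter", reason, message)."""
    try:
        event = json.loads(message)
    except ValueError as exc:  # covers invalid JSON and undecodable bytes
        return ("dead_letter", f"unparseable: {type(exc).__name__}", message)
    if not isinstance(event, dict):
        return ("dead_letter", "not a JSON object", message)
    missing = [name for name in REQUIRED_FIELDS if name not in event]
    if missing:
        return ("dead_letter", f"missing fields: {missing}", message)
    try:
        event_time = datetime.fromtimestamp(float(event["ts"]), tz=timezone.utc)
    except (TypeError, ValueError, OverflowError, OSError):
        return ("dead_letter", "bad timestamp", message)
    # Enrichment: attach the partition date so downstream writers need no re-parsing.
    event["dt"] = event_time.strftime("%Y-%m-%d")
    return ("ok", raw_path(source, table, event_time), event)
```

In a real Dataflow job the same decision would typically be expressed as a transform with a main output and a dead-letter output; the point here is only that bad records are diverted with a reason rather than dropped or allowed to fail the pipeline.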

The mental model: Cloud Storage is the durable floor, BigQuery is the serving ceiling, Dataproc and Dataflow are the two engines (batch-heavy/Spark and streaming/Beam respectively), and Composer is the conductor. Each can be adopted incrementally.
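The deduplication and SCD handling mentioned in step 3 is easier to reason about with a small example. The sketch below shows Type 2 behaviour (close the current row, append a new version) in plain Python. In the architecture this logic would run in Spark against Iceberg tables; the `DimRow` shape and the `tier` attribute are hypothetical.

```python
from dataclasses import dataclass, replace
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class DimRow:
    customer_id: str
    tier: str                        # the tracked attribute (hypothetical)
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def dedupe_latest(changes):
    """Keep only the latest change per customer_id, by effective date."""
    latest = {}
    for customer_id, tier, effective in changes:
        if customer_id not in latest or effective >= latest[customer_id][2]:
            latest[customer_id] = (customer_id, tier, effective)
    return list(latest.values())


def apply_scd2(dimension, changes):
    """Close the current row and append a new version when the tracked attribute changes."""
    current = {row.customer_id: row for row in dimension if row.valid_to is None}
    out = list(dimension)
    for customer_id, tier, effective in dedupe_latest(changes):
        existing = current.get(customer_id)
        if existing is not None and existing.tier == tier:
            continue  # unchanged: leave the current version open
        if existing is not None:
            out[out.index(existing)] = replace(existing, valid_to=effective)
        out.append(DimRow(customer_id, tier, effective))
    return out
```

Re-running the function with the same changes leaves the dimension unchanged, which is the property that makes batch retries and backfills safe.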

Component breakdown

| Component | GCP service | Role in this architecture | Key configuration choices |
|---|---|---|---|
| Data lake | Cloud Storage | System of record for raw + curated open-format data | Dual-region or regional buckets; lifecycle to Nearline/Coldline/Archive; uniform bucket-level access; CMEK; hierarchical namespace for analytics |
| Streaming ingest | Pub/Sub | Decoupled, durable event intake | Topic per domain; dead-letter topic; schema enforcement; 7-day retention for replay; ordering keys where needed |
| CDC ingest | Datastream | Change-data-capture from OLTP databases | Stream to Cloud Storage and/or BigQuery; backfill + ongoing CDC; private connectivity |
| Stream/unified processing | Dataflow (Apache Beam) | Speed layer: validate, enrich, window, dead-letter | Streaming Engine on; autoscaling + Dataflow Prime; exactly-once to BigQuery via Storage Write API; Flex Templates for CI/CD |
| Batch/Spark processing | Dataproc Serverless + Dataproc on GCE | Heavy transforms, Iceberg writes, existing Spark/Hive, ML features | Serverless batches as default (autoscale, pay-per-job); persistent clusters only for interactive/Hive; Spot/preemptible secondary workers; PHS (Persistent History Server) |
| Warehouse + serving | BigQuery | Serverless SQL warehouse, serving + ML layer | Editions (Standard/Enterprise) with autoscaling slots + reservations; partitioning + clustering; BI Engine; materialized views; BigLake/Iceberg external tables |
| Orchestration | Cloud Composer (Airflow) | Schedules, dependencies, retries, SLAs across the pipeline | Composer 2 (GKE Autopilot, env autoscaling); deferrable operators; DAG-as-code in Git; data-aware (dataset) triggering |
| Transformation framework | dbt (on Composer or Dataform) | Declarative SQL modeling of the consumption zone | Tests + docs + lineage; incremental models; one metrics definition |
| Governance fabric | Dataplex | Catalog, data quality, profiling, lakes/zones, lineage | Lakes map to domains; auto-discovery of GCS + BQ; DQ scans gate publishing; built-in lineage |
| Data protection | Sensitive Data Protection (DLP) | Discover, classify, de-identify PII | Inspection templates; de-identification (tokenization/masking) in Dataflow; column-level findings into Dataplex |
| Access control | IAM + VPC Service Controls | Least-privilege + exfiltration perimeter | Custom roles; BigQuery column/row-level security + policy tags; service perimeter around analytics projects |

A few component choices deserve the “why,” because they are the decisions that separate a real architecture from a sketch:

Why both Dataproc and Dataflow, not one? They solve different problems and overlap less than it looks. Dataflow (Beam) is the right tool for streaming and for unified pipelines authored once and run in both modes; it is fully serverless, autoscales aggressively, and gives exactly-once semantics into BigQuery. Dataproc is the right tool when you have existing Spark/Hadoop/Hive assets, need specific Spark libraries (Delta/Iceberg, GeoSpark, Spark MLlib), or run heavy batch transforms where Spark’s ecosystem and the team’s existing skills win. Forcing all batch into Beam would mean rewriting thousands of tested PySpark lines; forcing streaming into Spark Structured Streaming on a standing cluster would mean paying for idle nodes and managing them. Use the engine that fits the job.

Why Dataproc Serverless as the default, not standing clusters? Standing clusters bill for every minute whether or not a job runs, which is the classic way data-platform costs balloon. Dataproc Serverless for Spark spins up per batch, autoscales, and tears down, so you pay only for the compute you actually use. Keep a small persistent cluster only for interactive notebooks or Hive metastore-heavy interactive work; everything scheduled goes serverless. On the clusters you do keep, add Spot/preemptible secondary workers for fault-tolerant batch work to cut compute cost further.
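A back-of-the-envelope comparison shows why utilization drives this choice. The sketch below is illustrative arithmetic only: the flat per-node hourly rate and the job profile are assumptions, and real Dataproc Serverless billing uses its own units and rates, so substitute current pricing before drawing conclusions.

```python
HOURS_PER_MONTH = 730  # average hours in a month


def standing_cluster_cost(nodes: int, node_hour_rate: float) -> float:
    """A standing cluster bills for every hour, busy or idle."""
    return nodes * node_hour_rate * HOURS_PER_MONTH


def per_job_cost(jobs_per_month: int, avg_job_hours: float,
                 nodes: int, node_hour_rate: float) -> float:
    """Per-job compute bills only while a job is running."""
    return jobs_per_month * avg_job_hours * nodes * node_hour_rate


def utilization(jobs_per_month: int, avg_job_hours: float) -> float:
    """Fraction of the month during which the cluster would actually be busy."""
    return jobs_per_month * avg_job_hours / HOURS_PER_MONTH


# Hypothetical workload: one 2-hour nightly batch on 10 nodes at an assumed $0.50/node-hour.
standing = standing_cluster_cost(nodes=10, node_hour_rate=0.50)
serverless = per_job_cost(jobs_per_month=30, avg_job_hours=2, nodes=10, node_hour_rate=0.50)
busy = utilization(jobs_per_month=30, avg_job_hours=2)
```

Under these assumed numbers the cluster is busy for roughly 8% of the month, so most of the standing-cluster bill pays for idle time. As utilization approaches 100%, the gap closes and a persistent cluster becomes reasonable again.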

Why a lakehouse with Iceberg and BigLake instead of “just load everything into BigQuery”? Two reasons. First, open format = no lock-in and one copy: curated data lives as Iceberg/Parquet in Cloud Storage and is queryable by BigQuery (via BigLake), by Spark on Dataproc, and by external engines — without duplicating it per consumer. Second, cost tiering: cold raw data sits on cheap object storage with lifecycle rules, while only the hot serving tables get BigQuery’s premium engine. You load into native BigQuery selectively, where query latency justifies it.
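The cost-tiering half of that argument amounts to an age-based rule, which Cloud Storage lifecycle policies evaluate for you. The sketch below only models the decision; the age thresholds are illustrative assumptions, not recommendations, and the right values depend on how often each zone is actually read.

```python
# (minimum age in days, storage class), checked from coldest to warmest.
# The thresholds are hypothetical; tune them to observed access patterns.
LIFECYCLE_RULES = [
    (365, "ARCHIVE"),
    (90, "COLDLINE"),
    (30, "NEARLINE"),
]


def storage_class_for(age_days: int) -> str:
    """Return the storage class an object of this age would be moved to."""
    for min_age, storage_class in LIFECYCLE_RULES:
        if age_days >= min_age:
            return storage_class
    return "STANDARD"
```

Colder classes trade lower storage cost for retrieval charges and minimum storage durations, so data that batch jobs still re-read regularly usually belongs in a warmer tier.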

Why BigQuery partitioning + clustering matters so much. BigQuery on-demand pricing charges by bytes scanned. A fact table partitioned by event date and clustered by customer_id/product_id lets the engine prune to a single day's partition and, within it, to the few storage blocks that hold the requested keys, turning a multi-TB scan into a few GB. This single configuration choice is often the largest lever on both query cost and speed.
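To see the size of the effect, here is a small model of partition pruning. It is a sketch under stated assumptions: the table has one year of equal 10 GiB daily partitions, and the on-demand rate of $6.25 per TiB is an assumed figure that should be checked against current regional pricing. Clustering is not modeled; it would reduce the scanned bytes further inside each partition.

```python
from datetime import date, timedelta

GIB = 1024 ** 3
TIB = 1024 ** 4
PRICE_PER_TIB_USD = 6.25  # assumed on-demand rate; verify against current pricing


def bytes_scanned(partitions: dict, start: date, end: date) -> int:
    """Sum the bytes of daily partitions inside [start, end]; all others are pruned."""
    return sum(size for day, size in partitions.items() if start <= day <= end)


def query_cost_usd(scanned_bytes: int) -> float:
    """On-demand cost for a query that scans the given number of bytes."""
    return scanned_bytes / TIB * PRICE_PER_TIB_USD


# Hypothetical fact table: 365 daily partitions of 10 GiB each.
partitions = {date(2023, 1, 1) + timedelta(days=i): 10 * GIB for i in range(365)}

full_scan = bytes_scanned(partitions, date(2023, 1, 1), date(2023, 12, 31))
one_day = bytes_scanned(partitions, date(2023, 6, 15), date(2023, 6, 15))
```

With these inputs an unfiltered query costs about $22 and the single-day query about $0.06, a 365x difference for the same SQL shape. The filter has to be on the partitioning column for the pruning to apply.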

Why Composer over Cloud Scheduler + Functions. Real pipelines have dependencies (don’t build gold before silver), retries with backoff, backfills, SLAs, and data-aware triggers (run when a partition lands). Airflow expresses all of this as code in Git, with a UI for operators. Composer 2 runs it on autoscaling GKE Autopilot so the orchestrator itself isn’t a cost or ops burden.
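The two features that a scheduler-plus-functions setup lacks, dependency ordering and retries with backoff, can be shown in a few lines. This is a toy runner built on the standard library, not Airflow code; the `run_pipeline` function and the bronze/silver/gold task names are assumptions for the example.

```python
import time
from graphlib import TopologicalSorter


def run_pipeline(tasks, dependencies, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Run tasks in dependency order, retrying failures with exponential backoff.

    tasks:        mapping of task name -> zero-argument callable
    dependencies: mapping of task name -> set of task names that must finish first
    """
    order = list(TopologicalSorter(dependencies).static_order())
    completed = []
    for name in order:
        for attempt in range(max_retries + 1):
            try:
                tasks[name]()
                completed.append(name)
                break
            except Exception:
                if attempt == max_retries:
                    raise  # out of retries: fail the run so nothing downstream executes
                sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...
    return completed
```

Because a task that exhausts its retries raises, gold is never built on a failed silver. Airflow adds what this sketch leaves out: scheduling, backfills, SLAs, parallel execution of independent branches, and an operator UI.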

Implementation guidance

Project and environment topology. Use a multi-project layout under a folder per environment (dev/stage/prod), provisioned with a landing-zone/Terraform foundation:

This separation lets you scope IAM and VPC Service Controls per blast-radius and bill per domain. A small enterprise can collapse these into one or two projects; the IAM and perimeter patterns still hold.

Infrastructure as Code. Terraform is the natural fit on GCP (Deployment Manager is legacy; Config Connector/Crossplane are options for GitOps-on-GKE shops). Use the Google Cloud Foundation Fabric modules and the resource-specific Terraform modules. Concretely:

Keep pipeline code (Beam pipelines, PySpark jobs, dbt models, Airflow DAGs) in application repos, separate from infrastructure Terraform. CI/CD: Cloud Build (or GitHub Actions) builds Dataflow Flex Template images, packages Spark jobs to GCS, runs dbt build against a CI dataset, and validates/uploads DAGs. Promote artifacts dev → stage → prod; never click-deploy pipelines.

Networking. Run all processing with private connectivity: a Shared VPC with Private Google Access so Dataflow/Dataproc workers have no external IPs. Use Private Service Connect / private endpoints for BigQuery and other APIs. Datastream to on-prem or other-cloud databases uses private connectivity (VPC peering or PSC). Egress is constrained by firewall rules and, crucially, by VPC Service Controls — a service perimeter around the analytics projects means even a leaked credential cannot copy BigQuery data to a bucket outside the perimeter. Place a single egress point with Cloud NAT for the rare cases workers must reach the internet (third-party APIs), and log it.

Identity wiring. Every pipeline runs as a dedicated, least-privilege service account — separate SAs for ingestion Dataflow, Dataproc batches, Composer, and dbt. Grant narrow predefined or custom roles (e.g., Composer’s SA gets roles/dataproc.editor and roles/bigquery.jobUser but not broad storage admin). Human access is via Google Groups mapped to IAM roles, ideally federated from your IdP through Workforce Identity Federation so there are no standalone Google passwords. For cross-cloud or CI systems, use Workload Identity Federation instead of exported SA keys (keys are the most common leak vector — avoid them entirely). In BigQuery, enforce column-level security via policy tags on PII columns and row-level security for tenant/region isolation, so the same table serves many audiences safely.
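The "same table, many audiences" idea combines a row filter with column-level restrictions. The sketch below models that combination in plain Python so the semantics are visible. It is not BigQuery's API: the `POLICY_TAGS` mapping and `read_rows` function are hypothetical, and returning `None` for a restricted column is a simplification of how the warehouse can be configured to deny or mask such columns.

```python
# Hypothetical mapping of column name -> policy tag; untagged columns are unrestricted.
POLICY_TAGS = {"email": "pii", "card_last4": "pii"}


def read_rows(rows, user_tags, user_regions):
    """Apply a row filter by region, then hide columns whose tag the user lacks."""
    visible = []
    for row in rows:
        if row["region"] not in user_regions:
            continue  # row-level security: this caller never sees other regions
        visible.append({
            column: value
            if POLICY_TAGS.get(column) is None or POLICY_TAGS[column] in user_tags
            else None
            for column, value in row.items()
        })
    return visible
```

An EU analyst without the `pii` tag would see only EU rows with `email` hidden, while a privileged support role sees everything, and both query the same table.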

Enterprise considerations

Security and Zero Trust. Zero Trust here means no implicit trust by network location and least privilege everywhere. The pillars: (1) No SA keys — federation only. (2) VPC Service Controls perimeter so identity and context (device, origin) gate access and exfiltration is structurally blocked. (3) Private-only data plane — no public IPs on workers, private API access. (4) CMEK on buckets, BigQuery, and Dataproc/Dataflow temp storage, with keys in Cloud KMS you control and can rotate/revoke. (5) PII handled at ingest — the Dataflow pipeline calls Sensitive Data Protection to tokenize/mask emails and payment data before they reach curated tables; raw sensitive data is access-restricted and short-lived. (6) Column- and row-level controls in BigQuery via policy tags. (7) Audit everything — Cloud Audit Logs (Data Access logs on BigQuery) exported to a locked, separate log project. A hard lesson worth stating: never commit credentials to Git, and if a secret ever lands in history, rotate it immediately rather than just deleting the file — git history is forever.
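Pillar (5) relies on tokenization being deterministic, so that a tokenized email still joins across tables. The sketch below illustrates that property with a keyed hash from the standard library. It is a stand-in for illustration only, not the Sensitive Data Protection API; the function names are hypothetical, and in practice the key would come from Cloud KMS rather than application code.

```python
import hashlib
import hmac


def tokenize(value: str, key: bytes) -> str:
    """Deterministic keyed token: the same input and key always give the same token."""
    normalized = value.strip().lower().encode("utf-8")
    digest = hmac.new(key, normalized, hashlib.sha256).hexdigest()
    return f"tok_{digest[:32]}"


def mask_card(pan: str) -> str:
    """Keep only the last four digits of a card number."""
    digits = [ch for ch in pan if ch.isdigit()]
    if len(digits) <= 4:
        return "*" * len(digits)  # too short to reveal anything safely
    return "*" * (len(digits) - 4) + "".join(digits[-4:])


def protect(event: dict, key: bytes) -> dict:
    """Return a copy of the event with email tokenized and card number masked."""
    out = dict(event)
    if "email" in out:
        out["email"] = tokenize(out["email"], key)
    if "card_number" in out:
        out["card_number"] = mask_card(out["card_number"])
    return out
```

Because the token depends on the key, rotating or revoking the key changes every token, which is one reason key custody matters as much as the transformation itself.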

Cost optimization. This architecture is deliberately cost-defensive:

Scalability. Each tier scales independently. Pub/Sub absorbs traffic spikes with buffering; Dataflow Streaming Engine adds workers under load; Dataproc Serverless sizes per batch; BigQuery’s serverless slots scale to thousands of concurrent queries. The lake (Cloud Storage) is effectively unbounded. Because storage and compute are separate, you scale the expensive part (compute) only when running, and grow the cheap part (storage) freely. The design that serves 3 TB/day serves 30 TB/day by changing autoscaling ceilings and slot reservations — not by re-architecting.

Reliability and DR (RTO/RPO). Set targets by data tier:

Observability. Three layers. (1) Pipeline health — Cloud Monitoring dashboards for Dataflow (system lag, data freshness, watermark), Dataproc (batch success/duration), Composer (DAG/task duration, SLA misses), and Pub/Sub (unacked messages, oldest-unacked-age) with alerting. (2) Data quality — Dataplex DQ scans (null/uniqueness/range/freshness rules) run as a gate; failures alert and can block publish to the consumption zone. (3) Lineage and catalog — Dataplex auto-captures lineage from BigQuery/Dataproc/Composer so you can answer “where did this number come from and what breaks if I change this column.” Cost and audit are themselves queryable in BigQuery.
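The data-quality gate in layer (2) is a set of rules whose failures block publishing. The sketch below shows the four rule types named above (null, uniqueness, range, freshness) against a hypothetical orders batch. The field names, the amount bounds, and the one-hour freshness window are assumptions; in the architecture these rules would be declared as Dataplex scans rather than hand-written.

```python
from datetime import datetime, timedelta, timezone


def run_quality_gate(rows, now, max_age=timedelta(hours=1)):
    """Return the names of failed rules; an empty list means the batch may publish."""
    failures = []
    # Null rule: every row needs a key.
    if any(row.get("order_id") is None for row in rows):
        failures.append("order_id_not_null")
    # Uniqueness rule: keys must not repeat.
    ids = [row["order_id"] for row in rows if row.get("order_id") is not None]
    if len(ids) != len(set(ids)):
        failures.append("order_id_unique")
    # Range rule: amounts must fall inside plausible bounds (assumed here).
    if any(not (0 <= row["amount"] <= 100_000) for row in rows):
        failures.append("amount_in_range")
    # Freshness rule: the newest row must be recent enough.
    if not rows or now - max(row["loaded_at"] for row in rows) > max_age:
        failures.append("freshness")
    return failures
```

An orchestrator task can call a check like this and fail when the list is non-empty, which keeps a bad batch out of the consumption zone while the alert is investigated.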

Governance. Dataplex provides the mesh: lakes per domain (sales, web, inventory), zones per medallion tier, assets pointing at the actual buckets/datasets. DLP classifies sensitive data and surfaces findings into the catalog; policy tags drive BigQuery column security from those classifications. A clear data-contract discipline — schemas in Pub/Sub, expectations in dbt tests and Dataplex DQ — keeps producers honest. The metrics layer (dbt models / BigQuery views) is the single definition of business terms, ending the “three definitions of active customer” problem.

Reference enterprise example

NorthWave Retail is a fictional omnichannel retailer: ~$1.2B revenue, e-commerce plus 140 stores, ~9 million customers, running on Google Cloud. Their old setup was a nightly Informatica-to-on-prem-warehouse batch plus a swamp of GCS exports; dashboards were a day stale, the data-science team copied tables to VMs, and a GDPR audit flagged unmanaged PII. They adopted this architecture over two quarters.

Sources and volumes. ~3.2 TB/day: 1.1 billion clickstream events/day on Pub/Sub (~2 TB), hourly POS Parquet drops from 140 stores (~600 GB/day) to the raw bucket, Datastream CDC from the order-management Postgres and inventory MySQL (~400 GB/day of changes), and third-party ad-spend/weather feeds (a few GB/day).
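Daily totals hide the rates that actually size a streaming pipeline, so it helps to convert them. The arithmetic below uses only the rounded figures quoted above and decimal units (1 TB = 10^12 bytes); the averages say nothing about peaks, which for retail traffic can be several times higher.

```python
SECONDS_PER_DAY = 86_400

EVENTS_PER_DAY = 1.1e9       # clickstream events per day
CLICKSTREAM_BYTES = 2e12     # ~2 TB/day
POS_BYTES = 600e9            # ~600 GB/day
CDC_BYTES = 400e9            # ~400 GB/day


def avg_events_per_second(events_per_day: float) -> float:
    """Average event rate implied by a daily total."""
    return events_per_day / SECONDS_PER_DAY


def avg_event_size_bytes(total_bytes: float, events: float) -> float:
    """Average payload size implied by a daily byte total."""
    return total_bytes / events


def avg_throughput_mb_per_second(total_bytes: float) -> float:
    """Average ingest throughput in MB/s."""
    return total_bytes / SECONDS_PER_DAY / 1e6


rate = avg_events_per_second(EVENTS_PER_DAY)
size = avg_event_size_bytes(CLICKSTREAM_BYTES, EVENTS_PER_DAY)
named_sources_tb = (CLICKSTREAM_BYTES + POS_BYTES + CDC_BYTES) / 1e12
```

That works out to an average of roughly 12,700 events per second at about 1.8 KB each. The three named sources sum to about 3.0 TB/day from rounded inputs, in line with the approximate daily total once the smaller feeds and rounding are included.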

What they built.

Decisions and trade-offs they made.

Outcome. Dashboard freshness went from ~24 hours to seconds (live) / under 1 hour (curated). The fraud team’s real-time features cut chargebacks measurably. Data scientists stopped copying data to VMs — they query governed tables in place. The GDPR re-audit passed: PII is discovered, classified, tokenized, access-controlled, and lineage-traceable. And because compute scales to zero and storage tiers automatically, the platform’s run-rate came in roughly 30% below their previous standing-cluster-plus-warehouse spend, with a billing-export dashboard making every rupee/dollar of it visible and attributable per pipeline.

When to use it

Use this architecture when you have (a) more than one source and more than one consumer pattern, (b) both real-time and batch needs, (c) existing Spark/Hadoop assets you don’t want to throw away, (d) governance/compliance obligations over PII, and (e) a mandate to keep costs predictable. It is the GCP-native answer to “we need a real, governed, multi-engine data platform that won’t lock us in.” It scales from a one-project startup deployment up to a multi-domain data-mesh estate using the same building blocks.

Trade-offs to accept. It is a platform, not a single product — there are more moving parts (Composer, Dataplex, two engines) than a “just BigQuery” setup, so you need genuine data-engineering ownership. The lakehouse indirection (Iceberg + BigLake) adds a small amount of complexity versus loading everything native into BigQuery; you take that on deliberately to get open formats and cost tiering.

Anti-patterns to avoid.

Alternatives and when they fit. If you are all-streaming with simple transforms and zero existing Spark, you might drop Dataproc and run Dataflow + BigQuery only — simpler, fewer parts. If you are a Databricks shop or want a single unified Spark/Delta platform across clouds, Dataproc + open Iceberg still interoperates, but Databricks-on-GCP is a coherent alternative for that team. For pure SQL ELT with no real-time and no Spark, a trimmed Cloud Storage + BigQuery + Dataform/dbt + Composer stack is the lean version of this same design — which is exactly the point: this reference architecture is a superset you can subtract from to fit a small enterprise, and add governance/DR/multi-region to as you grow into a large one.

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
