GCP Enterprise Architecture: Big Data Processing

Every enterprise eventually hits the same wall: the data exists, but it is trapped. Sales sits in a CRM, clickstream lands in an event bus, the warehouse holds yesterday’s truth, and a dozen CSV exports rot in shared drives. Analysts wait days for a number, data scientists copy production tables to laptops, and nobody trusts the dashboard because three teams compute “active customer” three different ways. This article is a complete, reusable GCP architecture for big data processing — the kind of platform that takes raw events and files in at one end and produces governed, query-ready, ML-ready data at the other. It is built on the services that actually do the heavy lifting on Google Cloud: Cloud Storage as the lake, Dataproc for Spark/Hadoop workloads, Dataflow for streaming and unified batch pipelines, BigQuery as the serverless warehouse and serving layer, and Cloud Composer (managed Apache Airflow) as the orchestration brain.

The business scenario

Picture a mid-market retailer — call the pattern “RetailCo” for now; we will name a concrete company later. They run e-commerce plus 140 physical stores. Their data problem is not exotic, which is exactly why this architecture is widely useful:

Volume and variety. ~3 TB/day of new data: web/app clickstream (JSON events on Pub/Sub), point-of-sale transactions (hourly CSV/Parquet drops from stores), product catalog and inventory (relational, change-data-capture), and third-party feeds (weather, ad spend, market basket benchmarks) arriving as files.
Two clocks. The business needs real-time signals (fraud scoring, live inventory, “trending now” merchandising) and batch truth (nightly revenue, cohort retention, finance reconciliation, ML training sets). One clock is not enough.
Heterogeneous compute. The data-engineering team already has thousands of lines of Apache Spark (PySpark and Spark SQL) and some legacy Hive jobs. The analytics team lives in SQL. The ML team wants notebooks and feature tables. A good platform serves all three without forcing a rewrite.
Governance is now mandatory. PII (emails, addresses, payment tokens) flows through the system. Auditors, GDPR/CCPA, and PCI scope mean access must be least-privilege, lineage must be traceable, and sensitive columns must be discoverable and masked.
Cost is watched. This is not a hyperscaler with an infinite budget. Leadership has seen “data platform” projects balloon, so every design choice must justify itself, and idle infrastructure is the enemy.

The same shape applies to a fintech (transactions + risk), a media company (impressions + content analytics), a manufacturer (IoT telemetry + ERP), or a SaaS vendor (product usage + billing). Small enterprises run a trimmed version (one region, autoscaling to zero); large enterprises run the full multi-zone, multi-project, governed version. The architecture scales down as gracefully as it scales up, which is the test of a real reference design.

The problem this solves, stated plainly: turn many noisy sources into one governed, query-fast, cost-predictable data platform that serves real-time, analytical, and ML consumers from the same source of truth — without locking you into a single processing engine.

Architecture overview

The platform is a lakehouse: a Cloud Storage data lake for cheap, open-format storage, fronting BigQuery as the warehouse and serving engine, with Dataproc and Dataflow as the two complementary processing engines and Composer conducting the whole orchestra. Data flows through medallion-style zones — raw → curated → consumption (bronze/silver/gold).

GCP big-data lakehouse reference architecture: Pub/Sub and Cloud Storage ingestion feed Dataflow (streaming) and Dataproc (batch) engines, writing raw and curated Iceberg zones in Cloud Storage, served through BigQuery to BI, BigQuery ML and Vertex AI, all orchestrated by Cloud Composer and governed by Dataplex, DLP, IAM and VPC Service Controls.

End-to-end data path:

Ingestion. Two front doors. Streaming events (clickstream, app telemetry, CDC) publish to Pub/Sub. Batch files (POS drops, third-party feeds, exports) land in a Cloud Storage raw bucket, organized as raw/<source>/<table>/dt=YYYY-MM-DD/. CDC from operational databases uses Datastream, which writes change records into the raw bucket (and can target BigQuery directly).
Stream processing. A Dataflow streaming pipeline (Apache Beam) subscribes to Pub/Sub, validates and enriches events, dead-letters bad records, and writes two ways: low-latency rows into BigQuery via the Storage Write API (for live dashboards and fraud features) and raw-but-structured Parquet into Cloud Storage raw (so the lake remains the complete system of record). This is the speed layer.
Batch transformation. Cloud Composer (Airflow) orchestrates the batch layer. On schedule (or triggered by file arrival), it runs Dataproc Serverless Spark jobs that read the raw zone, clean/conform/deduplicate, apply business logic and SCD (slowly changing dimension) handling, and write curated data as partitioned, Apache Iceberg-managed Parquet in Cloud Storage. Heavy ML feature engineering and any existing PySpark/Hive workloads run here too — Dataproc gives the team their familiar Spark engine without managing clusters.
Warehouse load and modeling. Curated tables are exposed to BigQuery either as BigLake external/Iceberg tables (query in place, no copy) or loaded as native BigQuery tables when query performance demands it. Composer then runs dbt (or BigQuery scheduled SQL) to build the consumption zone: star schemas, conformed dimensions, aggregate/rollup tables, and the metrics layer where “active customer” is defined exactly once.
Serving and consumption. BigQuery is the single query surface. BI (Looker / Looker Studio) reads consumption tables; ad-hoc analysts query with SQL and BI Engine for sub-second dashboards; ML uses BigQuery ML for in-warehouse models and Vertex AI for heavier training, reading curated feature tables; reverse ETL pushes audiences and scores back to operational systems. Real-time consumers (fraud service, live ops) read the freshly-streamed BigQuery tables.
Cross-cutting. Dataplex provides the data-mesh governance fabric — it catalogs every asset across the buckets and datasets, runs data-quality and data-profiling scans, and applies policy. Cloud DLP / Sensitive Data Protection discovers and classifies PII. IAM + VPC Service Controls enforce access and create a perimeter so data cannot be exfiltrated. Cloud Logging/Monitoring observes pipeline health, and billing exports to BigQuery make cost itself a queryable dataset.
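The validate-and-dead-letter step and the raw-zone layout described above can be sketched in plain Python. This is not Beam code: it only shows the routing decision a pipeline step might make per message. The required field names are hypothetical, and the `raw_path` helper simply follows the raw/<source>/<table>/dt=YYYY-MM-DD/ convention from the ingestion step.

```python
import json
from datetime import datetime, timezone

# Hypothetical event schema, for illustration only.
REQUIRED_FIELDS = {"event_id", "customer_id", "event_type", "event_ts"}


def raw_path(source: str, table: str, event_ts: datetime) -> str:
    """Build the raw-zone prefix raw/<source>/<table>/dt=YYYY-MM-DD/ (UTC date)."""
    return f"raw/{source}/{table}/dt={event_ts.astimezone(timezone.utc):%Y-%m-%d}/"


def route(message: bytes):
    """Return ("ok", event) for a valid event, else ("dead_letter", reason)."""
    try:
        event = json.loads(message)
    except ValueError as exc:  # covers malformed JSON and bad encodings
        return "dead_letter", f"unparseable JSON: {exc}"
    if not isinstance(event, dict):
        return "dead_letter", "payload is not a JSON object"
    missing = REQUIRED_FIELDS - event.keys()
    if missing:
        return "dead_letter", f"missing fields: {sorted(missing)}"
    try:
        event["event_ts"] = datetime.fromisoformat(event["event_ts"])
    except (TypeError, ValueError):
        return "dead_letter", "event_ts is not an ISO 8601 timestamp"
    return "ok", event
```

Returning a tag instead of raising keeps one malformed record from failing the whole batch of messages, which is the point of a dead-letter path.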

The mental model: Cloud Storage is the durable floor, BigQuery is the serving ceiling, Dataproc and Dataflow are the two engines (batch-heavy/Spark and streaming/Beam respectively), and Composer is the conductor. Each can be adopted incrementally.

Component breakdown

Component	GCP service	Role in this architecture	Key configuration choices
Data lake	Cloud Storage	System of record for raw + curated open-format data	Dual-region or regional buckets; lifecycle to Nearline/Coldline/Archive; uniform bucket-level access; CMEK; hierarchical namespace for analytics
Streaming ingest	Pub/Sub	Decoupled, durable event intake	Topic per domain; dead-letter topic; schema enforcement; 7-day retention for replay; ordering keys where needed
CDC ingest	Datastream	Change-data-capture from OLTP databases	Stream to Cloud Storage and/or BigQuery; backfill + ongoing CDC; private connectivity
Stream/unified processing	Dataflow (Apache Beam)	Speed layer: validate, enrich, window, dead-letter	Streaming Engine on; autoscaling + Dataflow Prime; exactly-once to BigQuery via Storage Write API; Flex Templates for CI/CD
Batch/Spark processing	Dataproc Serverless + Dataproc on GCE	Heavy transforms, Iceberg writes, existing Spark/Hive, ML features	Serverless batches as default (autoscale, pay-per-job); persistent clusters only for interactive/Hive; Spot/preemptible secondary workers; PHS (Persistent History Server)
Warehouse + serving	BigQuery	Serverless SQL warehouse, serving + ML layer	Editions (Standard/Enterprise) with autoscaling slots + reservations; partitioning + clustering; BI Engine; materialized views; BigLake/Iceberg external tables
Orchestration	Cloud Composer (Airflow)	Schedules, dependencies, retries, SLAs across the pipeline	Composer 2 (GKE Autopilot, env autoscaling); deferrable operators; DAG-as-code in Git; data-aware (dataset) triggering
Transformation framework	dbt (on Composer or Dataform)	Declarative SQL modeling of the consumption zone	Tests + docs + lineage; incremental models; one metrics definition
Governance fabric	Dataplex	Catalog, data quality, profiling, lakes/zones, lineage	Lakes map to domains; auto-discovery of GCS + BQ; DQ scans gate publishing; built-in lineage
Data protection	Sensitive Data Protection (DLP)	Discover, classify, de-identify PII	Inspection templates; de-identification (tokenization/masking) in Dataflow; column-level findings into Dataplex
Access control	IAM + VPC Service Controls	Least-privilege + exfiltration perimeter	Custom roles; BigQuery column/row-level security + policy tags; service perimeter around analytics projects

A few component choices deserve the “why,” because they are the decisions that separate a real architecture from a sketch:

Why both Dataproc and Dataflow, not one? They solve different problems and overlap less than it appears. Dataflow (Beam) is the right tool for streaming and for unified pipelines authored once and run in both batch and streaming modes; it is fully serverless, autoscales aggressively, and gives exactly-once semantics into BigQuery. Dataproc is the right tool when you have existing Spark/Hadoop/Hive assets, need specific Spark libraries (Delta/Iceberg, Apache Sedona (formerly GeoSpark), Spark MLlib), or run heavy batch transforms where Spark’s ecosystem and the team’s existing skills win. Forcing all batch into Beam would mean rewriting thousands of tested PySpark lines; forcing streaming into Spark Structured Streaming on a standing cluster would mean paying for idle nodes and managing them. Use the engine that fits the job.

Why Dataproc Serverless as the default, not standing clusters? Standing clusters bill for every minute whether or not a job runs — the classic way data-platform costs balloon. Dataproc Serverless for Spark spins up per-batch, autoscales, and tears down, so you pay for compute you actually use. Keep a small persistent cluster only for interactive notebooks or Hive metastore-heavy interactive work; everything scheduled goes serverless. Add Spot/preemptible secondary workers for fault-tolerant batch to cut compute cost dramatically.
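A rough way to see the standing-versus-per-job argument is a back-of-the-envelope cost model. The sketch below is deliberately simplified: it prices both options per node-hour, whereas real Dataproc Serverless billing uses its own units, and every rate and the 30% per-job premium are hypothetical inputs, not published prices.

```python
HOURS_PER_MONTH = 730.0  # average hours in a month


def standing_cluster_cost(nodes: int, node_hourly_rate: float) -> float:
    """A standing cluster bills every node for every hour of the month."""
    return nodes * node_hourly_rate * HOURS_PER_MONTH


def per_job_cost(nodes: int, node_hourly_rate: float, job_hours_per_day: float,
                 days: int = 30, premium: float = 1.0) -> float:
    """Per-job compute bills only while jobs run; `premium` models a higher unit rate."""
    return nodes * node_hourly_rate * premium * job_hours_per_day * days


def break_even_utilization(premium: float) -> float:
    """Fraction of the month a cluster must be busy before standing capacity is cheaper."""
    return 1.0 / premium


# Hypothetical inputs: 10 nodes at $0.50/hour, 4 busy hours a day, 30% per-job premium.
standing = standing_cluster_cost(10, 0.50)
serverless = per_job_cost(10, 0.50, job_hours_per_day=4, premium=1.3)
print(f"standing=${standing:,.0f} per-job=${serverless:,.0f} "
      f"break-even utilization={break_even_utilization(1.3):.0%}")
```

Under these assumptions a cluster would need to be busy for most of the month before standing capacity wins, which is rarely true of scheduled batch work.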

Why a lakehouse with Iceberg and BigLake instead of “just load everything into BigQuery”? Two reasons. First, open format = no lock-in and one copy: curated data lives as Iceberg/Parquet in Cloud Storage and is queryable by BigQuery (via BigLake), by Spark on Dataproc, and by external engines — without duplicating it per consumer. Second, cost tiering: cold raw data sits on cheap object storage with lifecycle rules, while only the hot serving tables get BigQuery’s premium engine. You load into native BigQuery selectively, where query latency justifies it.

Why BigQuery partitioning + clustering matters so much. BigQuery on-demand pricing charges by bytes scanned. A fact table partitioned by event date and clustered by customer_id/product_id lets the engine prune to a single day's partition and, within it, to the few storage blocks that hold the requested keys, turning a multi-TB scan into a few GB. This single configuration choice is often the largest lever on both query cost and speed.
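The pruning claim is easy to check with arithmetic. The sketch below assumes a table receiving 2 TB/day and kept for a year; the 1% cluster-pruning fraction and the price per TB are hypothetical inputs (real cluster pruning varies by query and data layout, and current on-demand pricing should be taken from the pricing page).

```python
TB = 10**12
GB = 10**9


def bytes_scanned(daily_bytes: int, days_in_table: int, days_queried: int,
                  cluster_fraction: float = 1.0) -> float:
    """Bytes read when a query filters to `days_queried` daily partitions.

    `cluster_fraction` is the share of each partition left after cluster
    pruning; 1.0 means clustering did not help at all.
    """
    days = min(days_queried, days_in_table)
    return daily_bytes * days * cluster_fraction


def on_demand_cost(scanned_bytes: float, price_per_tb: float) -> float:
    """Cost of one query at a given (assumed) price per TB scanned."""
    return scanned_bytes / TB * price_per_tb


DAILY = 2 * TB  # assumed daily volume
full = bytes_scanned(DAILY, days_in_table=365, days_queried=365)
pruned = bytes_scanned(DAILY, days_in_table=365, days_queried=1, cluster_fraction=0.01)

PRICE = 5.0  # hypothetical price per TB, for illustration only
print(f"full scan: {full / TB:.0f} TB, cost {on_demand_cost(full, PRICE):,.2f}")
print(f"pruned:    {pruned / GB:.0f} GB, cost {on_demand_cost(pruned, PRICE):,.2f}")
```

With these inputs an unfiltered query reads 730 TB while the partition-filtered, cluster-pruned query reads about 20 GB.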

Why Composer over Cloud Scheduler + Functions. Real pipelines have dependencies (don’t build gold before silver), retries with backoff, backfills, SLAs, and data-aware triggers (run when a partition lands). Airflow expresses all of this as code in Git, with a UI for operators. Composer 2 runs it on autoscaling GKE Autopilot so the orchestrator itself isn’t a cost or ops burden.
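The two properties that cron lacks, dependency ordering and retries with backoff, can be illustrated without Airflow at all. The sketch below uses Python's standard `graphlib` module; the task names are hypothetical, and a real Composer DAG would express the same graph with Airflow operators rather than this hand-rolled runner.

```python
import time
from graphlib import TopologicalSorter

# Hypothetical task names; each task maps to the set of tasks it depends on.
DEPENDENCIES = {
    "build_silver": set(),
    "dq_scan_silver": {"build_silver"},
    "build_gold": {"dq_scan_silver"},
    "refresh_dashboards": {"build_gold"},
}


def run_with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn, retrying failures with exponential backoff (1x, 2x, 4x base_delay)."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the failure
            sleep(base_delay * 2 ** (attempt - 1))


def run_pipeline(tasks, dependencies=DEPENDENCIES, **retry_kwargs):
    """Run tasks in dependency order and return the order used."""
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        run_with_retries(tasks[name], **retry_kwargs)
    return order
```

Because the order is derived from the graph, gold can never be built before silver, and a transient failure in one task is retried without restarting the tasks that already succeeded.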

Implementation guidance

Project and environment topology. Use a multi-project layout under a folder per environment (dev/stage/prod), provisioned with a landing-zone/Terraform foundation:

prj-data-ingest — Pub/Sub topics, Datastream, ingestion Dataflow jobs, raw bucket.
prj-data-lake — Cloud Storage raw and curated buckets, Dataproc, BigLake connections, Dataplex lakes.
prj-data-warehouse — BigQuery datasets (curated, consumption), BI Engine, reservations.
prj-data-orchestration — Cloud Composer environment, dbt, CI/CD service accounts.
prj-data-governance (often the host project) — Dataplex, DLP templates, central log sink, billing export dataset.

This separation lets you scope IAM and VPC Service Controls per blast-radius and bill per domain. A small enterprise can collapse these into one or two projects; the IAM and perimeter patterns still hold.

Infrastructure as Code. Terraform is the natural fit on GCP (Deployment Manager is legacy; Config Connector/Crossplane are options for GitOps-on-GKE shops). Use the Google Cloud Foundation Fabric modules and the resource-specific Terraform modules. Concretely:

google_storage_bucket with uniform_bucket_level_access, lifecycle_rule (transition raw to Nearline at 30 days, Coldline at 90, Archive at 365), encryption (CMEK from Cloud KMS), and soft_delete_policy.
google_pubsub_topic + google_pubsub_subscription with dead_letter_policy and a schema (google_pubsub_schema).
google_dataproc_batch for serverless Spark jobs (or google_dataproc_cluster for the small persistent cluster) referencing an image version and a Persistent History Server.
google_dataflow_flex_template_job (or the Flex Template launcher) so streaming pipelines deploy from a container image via CI/CD, with enable_streaming_engine = true.
google_bigquery_dataset and google_bigquery_table with time_partitioning, clustering, require_partition_filter = true, and column-level policy_tags.
google_bigquery_reservation + google_bigquery_capacity_commitment for predictable warehouse capacity, plus on-demand for spiky workloads.
google_composer_environment (Composer 2) sized with environment autoscaling; DAGs deployed from Git via Cloud Build syncing to the Composer DAGs bucket.
google_dataplex_lake / _zone / _asset mapping domains to buckets and datasets; google_data_loss_prevention_inspect_template for DLP.
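As an illustration of the lifecycle rule in the first item, the sketch below builds the same 30/90/365-day transitions in the JSON shape that Cloud Storage lifecycle configuration files use, plus a small helper that answers which class an object of a given age should be in. It is an alternative view of the rule rather than a replacement for the Terraform resource; the helper name is hypothetical.

```python
import json

# (age in days, target storage class), as described for the raw bucket.
TRANSITIONS = [(30, "NEARLINE"), (90, "COLDLINE"), (365, "ARCHIVE")]


def lifecycle_config(transitions=TRANSITIONS) -> dict:
    """Build a lifecycle configuration with one SetStorageClass rule per transition."""
    return {
        "rule": [
            {
                "action": {"type": "SetStorageClass", "storageClass": storage_class},
                "condition": {"age": age_days},
            }
            for age_days, storage_class in transitions
        ]
    }


def storage_class_for_age(age_days: int, transitions=TRANSITIONS,
                          default: str = "STANDARD") -> str:
    """Return the class an object should have reached at the given age."""
    current = default
    for threshold, storage_class in sorted(transitions):
        if age_days >= threshold:
            current = storage_class
    return current


print(json.dumps(lifecycle_config(), indent=2))
```

Keeping the thresholds in one list means the generated configuration and any reporting built on `storage_class_for_age` cannot drift apart.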

Keep pipeline code (Beam pipelines, PySpark jobs, dbt models, Airflow DAGs) in application repos, separate from infrastructure Terraform. CI/CD: Cloud Build (or GitHub Actions) builds Dataflow Flex Template images, packages Spark jobs to GCS, runs dbt build against a CI dataset, and validates/uploads DAGs. Promote artifacts dev → stage → prod; never click-deploy pipelines.

Networking. Run all processing with private connectivity: a Shared VPC with Private Google Access so Dataflow/Dataproc workers have no external IPs. Use Private Service Connect / private endpoints for BigQuery and other APIs. Datastream to on-prem or other-cloud databases uses private connectivity (VPC peering or PSC). Egress is constrained by firewall rules and, crucially, by VPC Service Controls — a service perimeter around the analytics projects means even a leaked credential cannot copy BigQuery data to a bucket outside the perimeter. Place a single egress point with Cloud NAT for the rare cases workers must reach the internet (third-party APIs), and log it.

Identity wiring. Every pipeline runs as a dedicated, least-privilege service account — separate SAs for ingestion Dataflow, Dataproc batches, Composer, and dbt. Grant narrow predefined or custom roles (e.g., Composer’s SA gets roles/dataproc.editor and roles/bigquery.jobUser but not broad storage admin). Human access is via Google Groups mapped to IAM roles, ideally federated from your IdP through Workforce Identity Federation so there are no standalone Google passwords. For cross-cloud or CI systems, use Workload Identity Federation instead of exported SA keys (keys are the most common leak vector — avoid them entirely). In BigQuery, enforce column-level security via policy tags on PII columns and row-level security for tenant/region isolation, so the same table serves many audiences safely.

Enterprise considerations

Security and Zero Trust. Zero Trust here means no implicit trust by network location and least privilege everywhere. The pillars: (1) No SA keys — federation only. (2) VPC Service Controls perimeter so identity and context (device, origin) gate access and exfiltration is structurally blocked. (3) Private-only data plane — no public IPs on workers, private API access. (4) CMEK on buckets, BigQuery, and Dataproc/Dataflow temp storage, with keys in Cloud KMS you control and can rotate/revoke. (5) PII handled at ingest — the Dataflow pipeline calls Sensitive Data Protection to tokenize/mask emails and payment data before they reach curated tables; raw sensitive data is access-restricted and short-lived. (6) Column- and row-level controls in BigQuery via policy tags. (7) Audit everything — Cloud Audit Logs (Data Access logs on BigQuery) exported to a locked, separate log project. A hard lesson worth stating: never commit credentials to Git, and if a secret ever lands in history, rotate it immediately rather than just deleting the file — git history is forever.
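To make pillar (5) concrete, here is a minimal sketch of deterministic tokenization and masking. It is a stand-in built on the standard library's HMAC, not the Sensitive Data Protection API; the function names, the `tok_` prefix, and the choice to normalize case are all assumptions. In practice the key would come from a secret manager or KMS, never from source code.

```python
import hashlib
import hmac


def tokenize(value: str, key: bytes, prefix: str = "tok_") -> str:
    """Replace a sensitive value with a keyed, deterministic token.

    The same input and key always give the same token, so joins on the
    tokenized column still work, but the token cannot be reversed without
    a lookup table built with the key.
    """
    normalized = value.strip().lower()  # assumption: values are case-insensitive (emails)
    digest = hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()
    return prefix + digest[:32]


def mask_email(email: str) -> str:
    """Keep the first character and the domain; hide the rest of the local part."""
    local, _, domain = email.partition("@")
    return f"{local[:1]}***@{domain}"


key = b"demo-only-key"  # hypothetical key for the example
print(tokenize("ana@example.com", key))
print(mask_email("ana@example.com"))
```

Tokenization suits columns that must stay joinable (customer email as a key), while masking suits values that only need to be partially visible to humans.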

Cost optimization. This architecture is deliberately cost-defensive:

Compute that scales to zero. Dataproc Serverless and Dataflow autoscaling mean no idle cluster bills. Standing infrastructure is minimized to a small interactive cluster and the (autoscaling) Composer environment.
Spot/preemptible secondary workers for fault-tolerant Spark batch — large savings on the heaviest jobs.
Storage tiering via lifecycle rules: most raw data is read once, within about a week of landing, and rarely afterwards, so it drops to Nearline/Coldline/Archive automatically.
BigQuery: the two biggest levers. Partition + cluster every large table and set require_partition_filter so a runaway query can’t scan a year. Choose pricing deliberately — on-demand (per-TB scanned) for spiky/unpredictable workloads, Editions with autoscaling slots + capacity commitments for steady high-volume workloads (commitments cut the rate). Use materialized views and BI Engine to avoid re-scanning for dashboards.
FinOps as data. Export billing to BigQuery and build a cost dashboard with per-pipeline labels; set budgets and alerts; review the most expensive queries weekly. Cost is a dataset you query like any other.
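The per-pipeline cost view boils down to grouping billing rows by a label. The sketch below does that in plain Python over rows shaped like the billing export (a `cost` value and a repeated key/value `labels` field); the `pipeline` label key and the sample rows are hypothetical, and in the platform itself this would be a SQL query over the export dataset.

```python
from collections import defaultdict


def cost_by_label(rows, label_key="pipeline"):
    """Sum cost per value of `label_key`; rows without the label go to 'unlabeled'."""
    totals = defaultdict(float)
    for row in rows:
        value = next(
            (label["value"] for label in row.get("labels", []) if label["key"] == label_key),
            "unlabeled",
        )
        totals[value] += row["cost"]
    # Most expensive first, so the weekly review starts at the top.
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))


# Hypothetical sample rows.
rows = [
    {"cost": 10.0, "labels": [{"key": "pipeline", "value": "clickstream"}]},
    {"cost": 5.0, "labels": [{"key": "pipeline", "value": "clickstream"}]},
    {"cost": 4.0, "labels": [{"key": "pipeline", "value": "pos_batch"}]},
    {"cost": 1.0, "labels": []},
]
print(cost_by_label(rows))
```

The "unlabeled" bucket is worth tracking on its own: spend that cannot be attributed to a pipeline is usually the first thing to fix.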

Scalability. Each tier scales independently. Pub/Sub absorbs traffic spikes with buffering; Dataflow Streaming Engine adds workers under load; Dataproc Serverless sizes per batch; BigQuery’s serverless slots scale to thousands of concurrent queries. The lake (Cloud Storage) is effectively unbounded. Because storage and compute are separate, you scale the expensive part (compute) only when running, and grow the cheap part (storage) freely. The design that serves 3 TB/day serves 30 TB/day by changing autoscaling ceilings and slot reservations — not by re-architecting.

Reliability and DR (RTO/RPO). Set targets by data tier:

Streaming front door (RPO near zero, RTO minutes): Pub/Sub retains messages (7-day retention + dead-letter) so a downstream outage loses nothing — you replay. Dataflow checkpoints; jobs restart from the last commit. Run regional services; for the most critical streams, deploy the pipeline in a second region behind a multi-region Pub/Sub posture.
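Lake (RPO near zero once objects have replicated): Cloud Storage dual-region or multi-region buckets give geo-redundancy automatically; failover on a regional failure is transparent. Cross-region replication is asynchronous, so where a tighter bound matters, enable turbo replication on dual-region buckets. Object versioning + soft delete protect against bad writes and deletes.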
Warehouse (RTO hours, RPO ~minutes-hours): BigQuery offers time-travel (7 days) and table snapshots for point-in-time recovery; use cross-region dataset replication for the consumption datasets that the business cannot live without. Because curated data also lives as Iceberg in dual-region GCS, BigQuery serving tables are rebuildable from the lake — your real DR posture is “re-derive the warehouse from durable open-format data.”
Orchestration: Composer DAGs are code in Git; the environment is reproducible by Terraform, so recovery is redeploy + resync. Idempotent, partition-scoped jobs make re-runs safe — backfill a missed day without corrupting others.
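The phrase "idempotent, partition-scoped jobs" carries most of the recovery story, so a small illustration may help. The sketch below models a table as a dictionary of daily partitions and contrasts overwrite with append; the class and function names are hypothetical, and a real job would get the same effect from a partition-overwrite write in Spark or BigQuery.

```python
class PartitionedTable:
    """Toy table keyed by a dt partition, for illustration only."""

    def __init__(self):
        self._partitions = {}

    def overwrite_partition(self, dt, rows):
        """Replace one partition atomically; other partitions are untouched."""
        self._partitions[dt] = list(rows)

    def append(self, dt, rows):
        """Non-idempotent alternative: re-running duplicates rows."""
        self._partitions.setdefault(dt, []).extend(rows)

    def count(self, dt=None):
        if dt is not None:
            return len(self._partitions.get(dt, []))
        return sum(len(rows) for rows in self._partitions.values())


def run_daily_job(table, source_rows, dt):
    """Rebuild exactly one day from source; safe to re-run or backfill."""
    table.overwrite_partition(dt, [row for row in source_rows if row["dt"] == dt])


source = [
    {"dt": "2024-05-01", "order_id": 1},
    {"dt": "2024-05-01", "order_id": 2},
    {"dt": "2024-05-02", "order_id": 3},
]
table = PartitionedTable()
run_daily_job(table, source, "2024-05-01")
run_daily_job(table, source, "2024-05-01")  # retry after a failure: no duplicates
run_daily_job(table, source, "2024-05-02")  # backfill of another day
```

Had the job used `append`, the retry would have doubled the first day's rows, and the operator would need to clean up before re-running.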

Observability. Three layers. (1) Pipeline health — Cloud Monitoring dashboards for Dataflow (system lag, data freshness, watermark), Dataproc (batch success/duration), Composer (DAG/task duration, SLA misses), and Pub/Sub (unacked messages, oldest-unacked-age) with alerting. (2) Data quality — Dataplex DQ scans (null/uniqueness/range/freshness rules) run as a gate; failures alert and can block publish to the consumption zone. (3) Lineage and catalog — Dataplex auto-captures lineage from BigQuery/Dataproc/Composer so you can answer “where did this number come from and what breaks if I change this column.” Cost and audit are themselves queryable in BigQuery.
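The data-quality gate in layer (2) can be pictured as a handful of rule checks that must all pass before publishing. The sketch below implements null-rate, uniqueness, range, and freshness rules in plain Python; it is not the Dataplex API, and the column names and thresholds are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def null_rate(rows, column):
    """Share of rows where the column is missing or None."""
    if not rows:
        return 0.0
    return sum(1 for row in rows if row.get(column) is None) / len(rows)


def has_duplicates(rows, column):
    values = [row[column] for row in rows if row.get(column) is not None]
    return len(values) != len(set(values))


def is_stale(rows, ts_column, max_age, now):
    """True when the newest timestamp is older than max_age (or there is none)."""
    newest = max((row[ts_column] for row in rows if row.get(ts_column) is not None),
                 default=None)
    return newest is None or now - newest > max_age


def dq_failures(rows, now, max_null_rate=0.01, max_age=timedelta(hours=2)):
    """Return a list of failed rules; an empty list means the gate is open."""
    failures = []
    if null_rate(rows, "customer_id") > max_null_rate:
        failures.append("customer_id null rate too high")
    if has_duplicates(rows, "order_id"):
        failures.append("order_id not unique")
    if any(row.get("amount") is not None and row["amount"] < 0 for row in rows):
        failures.append("amount out of range")
    if is_stale(rows, "loaded_at", max_age, now):
        failures.append("data is stale")
    return failures


def publish_if_clean(rows, now, publish):
    """Publish only when every rule passes; otherwise return the failures."""
    failures = dq_failures(rows, now)
    if not failures:
        publish(rows)
    return failures
```

Returning the list of failed rules, rather than a bare boolean, gives the alert enough detail for the on-call engineer to act on.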

Governance. Dataplex provides the mesh: lakes per domain (sales, web, inventory), zones per medallion tier, assets pointing at the actual buckets/datasets. DLP classifies sensitive data and surfaces findings into the catalog; policy tags drive BigQuery column security from those classifications. A clear data-contract discipline — schemas in Pub/Sub, expectations in dbt tests and Dataplex DQ — keeps producers honest. The metrics layer (dbt models / BigQuery views) is the single definition of business terms, ending the “three definitions of active customer” problem.

Reference enterprise example

NorthWave Retail is a fictional omnichannel retailer: ~$1.2B revenue, e-commerce plus 140 stores, ~9 million customers, running on Google Cloud. Their old setup was a nightly Informatica-to-on-prem-warehouse batch plus a swamp of GCS exports; dashboards were a day stale, the data-science team copied tables to VMs, and a GDPR audit flagged unmanaged PII. They adopted this architecture over two quarters.

Sources and volumes. ~3.2 TB/day: 1.1 billion clickstream events/day on Pub/Sub (~2 TB), hourly POS Parquet drops from 140 stores (~600 GB/day) to the raw bucket, Datastream CDC from the order-management Postgres and inventory MySQL (~400 GB/day of changes), and third-party ad-spend/weather feeds (a few GB/day).

What they built.

Streaming: A Dataflow (Beam, Streaming Engine) pipeline consumes clickstream, validates against the Pub/Sub schema, dead-letters ~0.2% malformed events, calls Sensitive Data Protection to tokenize email/IP, and writes via the Storage Write API into a BigQuery events_live table (partitioned by hour, clustered by customer_id). End-to-end latency: under 8 seconds from click to queryable row. A fraud service reads events_live to score checkout sessions in real time.
Batch on Dataproc Serverless: Composer triggers ~40 Spark batches nightly (and hourly for POS). These reuse NorthWave’s existing PySpark — only the cluster-management code was deleted. Jobs conform POS + CDC + clickstream into curated Iceberg tables in dual-region GCS, handling dedup, SCD2 on the customer/product dimensions, and sessionization. Heaviest job (market-basket feature build) runs on Spot secondary workers.
Warehouse + modeling: Curated Iceberg is exposed to BigQuery via BigLake; the most-queried facts are loaded native and partitioned/clustered. dbt (run by Composer) builds the consumption star schema and a metrics layer — active_customer, gross_margin, inventory_days defined exactly once, with dbt tests. Looker and Looker Studio read only consumption tables; BI Engine keeps executive dashboards sub-second.
ML: Propensity and churn models start as BigQuery ML on curated feature tables; the recommendation model graduated to Vertex AI, reading the same features. Scores are pushed back (reverse ETL) to the marketing platform and the storefront.
Governance: Dataplex lakes map to web, sales, inventory domains; DQ scans gate the gold build (a freshness or null-rate failure pages the on-call and blocks publish). DLP findings drive policy tags; BigQuery column-level security masks tokenized PII from analysts, while the fraud SA sees what it needs. Everything sits inside a VPC Service Controls perimeter; all workers are private; all SAs are keyless via Workload Identity Federation.
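The SCD2 handling on the customer and product dimensions is the least obvious step in the batch layer, so here is a minimal sketch of the logic in plain Python. NorthWave's jobs run this in PySpark against Iceberg tables; the function name, the `valid_from`/`valid_to` column names, and the half-open validity intervals are assumptions made for the example.

```python
from datetime import date


def apply_scd2(dimension, updates, key, tracked, as_of):
    """Apply Type 2 changes and return a new list of dimension rows.

    A row with valid_to=None is the current version. When a tracked column
    changes, the current row is closed at `as_of` and a new row is opened.
    Unchanged keys are left alone, so re-applying the same updates is a no-op.
    """
    result = [dict(row) for row in dimension]  # copy: do not mutate the input
    current = {row[key]: row for row in result if row["valid_to"] is None}
    for update in updates:
        existing = current.get(update[key])
        if existing is not None and all(existing[col] == update[col] for col in tracked):
            continue  # nothing changed for this key
        if existing is not None:
            existing["valid_to"] = as_of  # close the old version
        new_row = {key: update[key], **{col: update[col] for col in tracked},
                   "valid_from": as_of, "valid_to": None}
        result.append(new_row)
        current[update[key]] = new_row
    return result


dim = [{"customer_id": 1, "city": "Pune", "valid_from": date(2024, 1, 1), "valid_to": None}]
updates = [{"customer_id": 1, "city": "Mumbai"}, {"customer_id": 2, "city": "Delhi"}]
out = apply_scd2(dim, updates, key="customer_id", tracked=["city"], as_of=date(2024, 6, 1))
```

After this run, customer 1 has a closed Pune row and an open Mumbai row, and customer 2 has a single open row, which lets a fact table join to the version of the customer that was valid at order time.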

Decisions and trade-offs they made.

Kept Spark on Dataproc Serverless rather than rewriting to Beam — saved an estimated 6+ engineer-months and let the team move immediately.
Chose BigQuery Editions with autoscaling slots + a baseline capacity commitment for the steady analytics load, but left a smaller on-demand project for data-science exploration so curious queries don’t consume committed capacity.
Stored curated data as Iceberg in dual-region GCS specifically so the warehouse is rebuildable — their DR plan is “re-derive BigQuery from the lake,” tested quarterly.

Outcome. Dashboard freshness went from ~24 hours to seconds (live) / under 1 hour (curated). The fraud team’s real-time features cut chargebacks measurably. Data scientists stopped copying data to VMs — they query governed tables in place. The GDPR re-audit passed: PII is discovered, classified, tokenized, access-controlled, and lineage-traceable. And because compute scales to zero and storage tiers automatically, the platform’s run-rate came in roughly 30% below their previous standing-cluster-plus-warehouse spend, with a billing-export dashboard making every rupee/dollar of it visible and attributable per pipeline.

When to use it

Use this architecture when you have (a) more than one source and more than one consumer pattern, (b) both real-time and batch needs, (c) existing Spark/Hadoop assets you don’t want to throw away, (d) governance/compliance obligations over PII, and (e) a mandate to keep costs predictable. It is the GCP-native answer to “we need a real, governed, multi-engine data platform that won’t lock us in.” It scales from a one-project startup deployment up to a multi-domain data-mesh estate using the same building blocks.

Trade-offs to accept. It is a platform, not a single product — there are more moving parts (Composer, Dataplex, two engines) than a “just BigQuery” setup, so you need genuine data-engineering ownership. The lakehouse indirection (Iceberg + BigLake) adds a small amount of complexity versus loading everything native into BigQuery; you take that on deliberately to get open formats and cost tiering.

Anti-patterns to avoid.

Standing Dataproc/Spark clusters running 24/7 for scheduled jobs — the number-one cause of runaway data-platform cost. Go serverless and scale to zero.
Loading every raw byte into native BigQuery “to keep it simple” — you pay premium storage and lose open-format portability. Keep raw and cold data on tiered Cloud Storage; load native selectively.
Unpartitioned, unclustered BigQuery tables with no require_partition_filter — one analyst’s SELECT * scans terabytes and the bill arrives next month. Partition and cluster from day one.
Cron + scripts as “orchestration” — no dependency graph, no retries, no lineage, no SLAs. Use Composer/Airflow.
Exported service-account keys for CI or cross-cloud — the classic leak. Use Workload/Workforce Identity Federation; never commit keys; rotate immediately if one leaks.
Streaming straight into a single mega-table with no dead-letter and no schema — one bad producer poisons everything. Enforce schemas at Pub/Sub and dead-letter aggressively.
One engine for everything — forcing streaming into Spark on a standing cluster, or rewriting tested PySpark into Beam for no reason. Match engine to job.

Alternatives and when they fit. If you are all-streaming with simple transforms and zero existing Spark, you might drop Dataproc and run Dataflow + BigQuery only — simpler, fewer parts. If you are a Databricks shop or want a single unified Spark/Delta platform across clouds, Dataproc + open Iceberg still interoperates, but Databricks-on-GCP is a coherent alternative for that team. For pure SQL ELT with no real-time and no Spark, a trimmed Cloud Storage + BigQuery + Dataform/dbt + Composer stack is the lean version of this same design — which is exactly the point: this reference architecture is a superset you can subtract from to fit a small enterprise, and add governance/DR/multi-region to as you grow into a large one.

Written by Vinod
