GCP Enterprise Architecture: Data Warehouse & Lakehouse

Nearly every enterprise data platform fails in the same way, and it is not a technology failure — it is a copies failure. The warehouse has a dim_customer. The data-science team exported a parquet snapshot of it into a bucket six months ago. The marketing stack has a third version, deduplicated by different rules. Finance runs a fourth in a spreadsheet. Three of the four disagree about how many customers there are, and the quarterly business review burns its first twenty minutes arguing about which number is real instead of what to do about it. Underneath, the same physical fact — a customer — has been copied, transformed, and diverged across four systems because the warehouse and the lake were two different worlds with two different access models, and the only way to get warehouse data into a notebook was to extract it. The lakehouse on Google Cloud is the architecture that ends the copying: it makes one physical copy of the data in Cloud Storage and BigQuery’s native storage queryable, with one access model, by the SQL analyst, the Spark data scientist, and the BI tool alike — so there is exactly one dim_customer, governed once, and every persona reads that. The pattern scales down to one analytics team on a few terabytes and up to a regulated multinational on petabytes across dozens of domain teams; what changes is the number of datasets, the strictness of the policy tags, and the reservation size, not the shape.

This article is a reference architecture in the spirit of the Google Cloud Architecture Center: it picks real services — BigQuery, Dataflow, Dataplex, Cloud Storage, BigLake, and Looker — wires them end to end, and makes the trade-off calls explicit. The running example is a mid-market company, but the design genuinely spans from a single team to a large enterprise without changing its bones.

The business scenario

Picture a company anywhere on the small-to-large spectrum — the symptoms are remarkably consistent regardless of size, because they come from the structure of the old split between warehouse and lake, not from scale:

The warehouse and the lake are two separate kingdoms. The warehouse (Teradata, on-prem SQL Server, an early BigQuery silo, or a Redshift cluster) holds the curated, governed, SQL-queryable data. The “data lake” is a pile of CSV and parquet in object storage that the data scientists use. There is no shared access model, so getting warehouse data to a notebook means an extract — and every extract is a new, ungoverned, immediately-stale copy. The two kingdoms drift, and the drift is the problem.
Data is duplicated and divergent. Because every consumer copies, the same logical entity exists in five places with five subtly different cleansing rules. Headcounts, revenue-per-customer, and churn all disagree depending on which copy you started from, so nobody trusts the number that contradicts theirs, and “whose number is right?” has no authoritative answer.
Batch is too slow; streaming is bolted on the side. The warehouse loads at 02:00. By the time a stockout, a fraud pattern, or a SLA breach reaches a dashboard, the event is hours stale. Someone wired a separate Kafka→bucket→notebook path for the one use case that had to be fresh, and now there are two pipelines computing what should be the same metric, and they diverge.
Governance is a spreadsheet and a prayer. Nobody can state with confidence which tables contain PII, who can read them, or what feeds the board’s revenue figure. A regulator or an acquirer’s due-diligence team asks for data lineage and a data catalog, and the honest answer is tribal knowledge in someone’s head and a stale wiki page.
Cost is opaque, lumpy, and welded. The legacy warehouse is sized for the month-end peak and idles the other 28 days; storage and compute are bought as one bundle, so you pay for compute just to hold cold data. When the bill spikes, no one can attribute it to a team, a query, or a dataset.
The skills split is real and forcing one engine pleases no one. SQL analysts want standard SQL and a governed BI tool. Data scientists want Python, Spark, and notebooks. Data engineers want declarative pipelines. ML teams want feature tables. Marching all of them through one engine on one copy-via-export model satisfies none of them and multiplies the copies.

The outcome the lakehouse is chartered to deliver is precise: one governed copy of the data — open formats on cheap object storage plus BigQuery’s columnar store — queryable by every persona in their preferred tool, fresh enough for operational decisions, with a real catalog, automated data-quality checks, column-level access control, end-to-end lineage, and a cost model that attributes spend to the team that caused it. Concretely: a single certified revenue figure; a fraud or stockout signal visible in minutes not hours; month-end close that does not need a six-hour batch window; a self-service catalog where an analyst can discover and request access to a trustworthy dataset; and a lineage graph an auditor can follow from a raw file to the cell in the board deck. The non-goals matter too: this is not an OLTP system (BigQuery is an analytical engine, not your application database), and it is not a real-time sub-second serving store for an app (that is Bigtable/Spanner territory). It is the analytical plane of the enterprise — the place all the data converges, is governed once, and is served to everyone.

Architecture overview

The organizing idea is one storage substrate, many engines, one governance plane. Data lands and lives in Cloud Storage (open formats — Parquet, ORC, Iceberg, Delta) and in BigQuery’s managed storage; BigQuery via BigLake presents both as ordinary SQL tables with one security model; Dataflow handles all movement and transformation for both streaming and batch with the same code; Dataplex wraps the whole estate with catalog, data quality, and governance; and Looker serves a single governed semantic layer to the business. Governance is not a final step bolted on the right — it is a plane that wraps the entire flow.

The data path moves left-to-right through five planes — ingest, store, transform, serve, govern — and the deliberate design decision is that no consumer ever copies data out to query it; they all point at the same lake-and-warehouse storage through BigLake.

Reference architecture for a governed GCP lakehouse: ingest via Pub/Sub, Dataflow, Datastream and Cloud Storage; one governed copy in Cloud Storage and BigQuery unified by BigLake; transform with Dataflow, BigQuery SQL/dbt and Dataproc; serve through BigQuery, Looker, Vertex AI and BigQuery ML; with Dataplex catalog, policy tags, VPC Service Controls and CMEK governing the whole estate.

1. Ingest. Source data arrives through three doors, and none of them transform data on the way in:

Streaming events (clickstream, IoT, application telemetry, CDC from operational databases) publish to Pub/Sub, and a Dataflow streaming job (Apache Beam) reads the topic and writes to BigQuery via the Storage Write API (exactly-once) and/or to Cloud Storage. Datastream handles change-data-capture from MySQL/PostgreSQL/Oracle into the lake with low latency.
Batch and bulk loads (nightly extracts, SaaS exports, partner files, historical backfills) land first in a Cloud Storage landing bucket; a Dataflow batch job or a BigQuery load picks them up. BigQuery Data Transfer Service schedules recurring SaaS and cross-cloud imports.
Federated, no-copy sources stay where they are: BigLake external tables and BigQuery Omni let BigQuery query data already sitting in Cloud Storage (or even in S3/Azure Blob) in place, so some data never moves at all.

2. Store — the source of truth. Everything converges on two tiers of storage that BigLake unifies into one queryable surface:

Cloud Storage holds open-format files (Parquet/ORC, plus Apache Iceberg or Delta tables) — the raw and the open-table-format data, organized as Dataplex zones (a raw zone for faithful-to-source landing, a curated zone for cleaned/conformed/typed data). This is the cheap, open, engine-agnostic layer that Spark, Trino, or anything else can also read.
BigQuery managed storage holds the high-value, heavily-queried curated and business tables in BigQuery’s optimized columnar format with clustering, partitioning, and the fastest query path.
BigLake is the keystone: it puts a single fine-grained security model (row-, column-, and table-level access, enforced regardless of engine) over both the Cloud Storage files and BigQuery tables, so an analyst’s SELECT and a Spark job’s read see the same table, the same columns they are permitted to see, and the same governed copy. There is no second copy and no second access model. Object tables even let BigQuery query unstructured data (images, PDFs) in Cloud Storage as rows, which is what feeds multimodal ML and the increasingly common “join your documents to your warehouse” use cases.

This storage substrate is the centre of gravity; every other component points at it.

3. Transform. Dataflow (managed Apache Beam) is the unified processing engine, and the single most important property is that one Beam pipeline expresses both the streaming and the batch path, so the two physically cannot drift. Streaming jobs read Pub/Sub, apply windowing, enrichment, and quality rules, and MERGE/append into the curated tier with exactly-once semantics; batch jobs do the heavier historical transforms and backfills. Where teams prefer SQL-first transformation, BigQuery itself is the transform engine — scheduled queries, MERGE statements, materialized views, and dbt building curated and business tables in-warehouse. Where teams have existing Spark, Dataproc (or BigQuery’s Spark integration) reads the same BigLake tables. The point is that all of these write to the same governed storage; the engine is a preference, not a fork in the source of truth. Orchestration is Cloud Composer (managed Airflow) or Cloud Workflows for lighter DAGs.

4. Serve. Consumption fans out from the one copy:

SQL analysts and applications query BigQuery directly (standard SQL, BI Engine acceleration, materialized views).
Business users consume Looker, whose LookML semantic layer defines metrics once (revenue, churn, margin) on top of BigQuery, so every dashboard, every embedded report, and every Explore computes the certified figure the same way — this is what kills the “two dashboards, two revenue numbers” problem at the modeling layer.
Data scientists read the same BigLake tables from notebooks (Vertex AI Workbench), train with BigQuery ML (SQL-native models) or Vertex AI, and never export a copy.
Reverse-ETL and apps pull governed Gold/business tables back into operational systems.

5. Govern (wrapping everything). Dataplex is the governance plane across the entire estate: it provides the unified data catalog (auto-discovered metadata for BigQuery datasets and Cloud Storage assets, searchable by every persona), automated data-quality scans (rule-based and recommended), data profiling, data lineage (column-level, captured automatically from BigQuery and Dataflow), and business-domain organization (lakes → zones → assets that map to your business domains). Policy is enforced by BigQuery policy tags (taxonomies driving column-level access — e.g., a pii.email tag that only a named group can read), VPC Service Controls (a service perimeter that stops data from being exfiltrated to projects outside the boundary), and CMEK encryption. Lineage plus the catalog plus policy tags are exactly the three artifacts the auditor and the acquirer ask for.

The mental model: Cloud Storage and BigQuery hold one physical copy; BigLake makes both queryable under one security model so no one ever copies to query; Dataflow moves and transforms everything with shared streaming/batch code; Dataplex catalogs, quality-checks, and lineage-tracks the whole thing; and Looker serves one certified semantic layer to the business. One copy, one access model, one set of metrics — read by everyone.

Component breakdown

Component	GCP service	What it does	Why it’s here / key configuration choices
Streaming ingest	Pub/Sub + Dataflow + Datastream	Durable event bus; managed Beam streaming; CDC from databases	Pub/Sub as the decoupling buffer; Dataflow streaming with the BigQuery Storage Write API for exactly-once; Datastream for low-latency CDC into the lake; schema enforced at the edge, not transformed on ingest
Batch / bulk ingest	Cloud Storage + Dataflow + BQ Data Transfer Service	Landing zone for files; managed batch ETL; scheduled SaaS/cross-cloud imports	Land raw first (immutable), then transform; Data Transfer Service for recurring SaaS feeds; dual-region or multi-region buckets for the landing/raw tier; Autoclass for tiering
Lake storage (open)	Cloud Storage	Cheap, open-format, engine-agnostic data — raw + open tables	Parquet/ORC + Apache Iceberg/Delta; organized as Dataplex raw/curated zones; uniform bucket-level access; lifecycle rules; CMEK; this is the layer non-BigQuery engines (Spark/Trino) also read
Warehouse storage (managed)	BigQuery (managed storage)	Optimized columnar store for high-value, hot, heavily-queried tables	Partition (by ingestion date or an event-time column) + cluster (on high-cardinality filter keys) every large table; time-travel (7-day default, configurable); the fastest query path for serving
Unified table & security layer	BigLake	One queryable surface + one fine-grained security model over CS and BigQuery	BigLake tables over Cloud Storage with row/column/table-level access enforced by any engine; object tables for unstructured data; BigQuery Omni to query S3/Azure in place — the keystone that means no copy to query
Transform / processing	Dataflow (Apache Beam) — and BigQuery SQL / dbt / Dataproc	Streaming + batch transformation with shared code; in-warehouse SQL transforms	One Beam pipeline for stream + batch so they cannot drift; Flexible Resource Scheduling / Streaming Engine; dbt or scheduled `MERGE` for SQL-first teams; Dataproc for existing Spark on the same BigLake tables
Orchestration	Cloud Composer (Airflow) / Cloud Workflows	Schedules and sequences pipelines and quality gates	Composer for complex DAGs and dependencies; Workflows for lightweight serverless orchestration; quality gates between zones (raw→curated only if checks pass)
Governance plane	Dataplex	Catalog, data quality, profiling, lineage, business-domain org	Dataplex lakes → zones → assets mapped to domains; auto data discovery; AutoDQ data-quality scans; column-level lineage (auto from BigQuery + Dataflow); the catalog every persona searches
Access policy & encryption	BigQuery policy tags + VPC Service Controls + Cloud KMS	Column-level access by taxonomy; exfiltration perimeter; CMEK	Policy-tag taxonomies (e.g., `pii.`) bound to IAM groups for column-level* control; VPC-SC perimeter around the data projects; CMEK on BigQuery, Cloud Storage, Dataflow
BI / semantic layer	Looker (LookML)	One governed metric definitions layer served to the business	LookML defines metrics once on top of BigQuery; governed `Explores`, embedded analytics, and scheduled delivery; BI Engine to accelerate dashboards; kills “two numbers” at the modeling layer
ML / data science	BigQuery ML + Vertex AI Workbench	SQL-native and notebook ML on the same governed copy	BigQuery ML (`CREATE MODEL` in SQL) for analysts; Vertex AI notebooks read BigLake tables directly — no export, models trained on the certified data
Identity & org	IAM + Resource Manager + Service Accounts	Least-privilege access; project/folder hierarchy	Folders per domain/environment; service accounts per pipeline (no user keys); IAM Conditions + groups, not individual grants; reads governed by BigLake/policy tags, not raw bucket ACLs

A few choices deserve the “why,” because they are exactly where a naive “warehouse plus a lake” quietly betrays you.

Why BigLake is the keystone, not an optional add-on. The entire premise — one copy, queried by everyone, governed once — collapses without a layer that puts a single fine-grained security model over both the object-store files and the warehouse tables. The naive alternative is to grant data scientists raw storage.objects.get on the bucket so their Spark jobs can read the parquet directly. The moment you do that, you have two access models (BigQuery’s row/column policies for analysts, and coarse bucket IAM for everyone else), the column-level PII controls you built in BigQuery are trivially bypassed by reading the underlying files, and your audit story has a hole you can drive a truck through. BigLake closes the hole: it enforces table-, row-, and column-level access on the Cloud Storage data itself, for whatever engine reads it, using delegated service-account credentials so consumers never get raw object access. That is what makes “let the data scientists read the same governed copy” safe, and it is the difference between a lakehouse and a lake with a warehouse next to it.

Why metrics live in LookML, not in dashboards. The “two dashboards show different revenue” problem is almost never a data problem — it is a definition problem. Two analysts wrote two slightly different SUM(...) with two slightly different filters in two BI tools. If you certify the storage but let every dashboard re-derive metrics ad hoc, the divergence simply moves from the data layer to the presentation layer and you have solved nothing. Defining revenue, churn, margin, and active-customer once in LookML — a version-controlled, reviewed semantic model on top of BigQuery — means every Explore, every dashboard, and every embedded report computes the same number the same way. The certified figure is enforced at the modeling layer, which is the only place it can actually be enforced.

Why partition and cluster every large BigQuery table — it is a cost control, not just performance. BigQuery’s on-demand pricing bills by bytes scanned. An unpartitioned, unclustered table means every query that filters on a date or a customer ID still scans the entire table — you pay full price for data you immediately discard, and a single careless SELECT * on a 50 TB table is a five-figure surprise. Partitioning (by ingestion time or an event-time column) prunes whole date ranges before scanning; clustering (on the high-cardinality keys you filter and join on) prunes within partitions. The combination routinely cuts bytes-scanned — and therefore cost and latency — by an order of magnitude. This is the single highest-leverage configuration choice in the warehouse tier, and skipping it is the most common reason a BigQuery bill spirals.

Why Dataplex zones, not just folders in a bucket. A raw/, curated/, gold/ folder convention in a bucket is documentation — nothing enforces it, nothing scans it, and nothing tells an analyst what is in it. Dataplex zones turn that convention into a managed construct: assets in a zone are auto-discovered and cataloged, data-quality and profiling scans run against them, lineage is captured, and the raw→curated promotion can be gated on quality checks passing. The zone is where governance actually attaches to the physical data, which is what turns a data lake into something other than a data swamp.

Why exactly-once on the streaming write path. A fraud or revenue signal computed from a stream that double-counts events is worse than no signal — it is a wrong signal that people trust. Dataflow writing to BigQuery via the Storage Write API with exactly-once semantics (rather than legacy streaming inserts or at-least-once file drops) means the curated table is correct even across worker restarts and retries, so the same metric computed on the stream and on the batch backfill agree. Without it, the streaming path silently diverges from the batch path and re-creates the exact “two pipelines, two numbers” problem the architecture exists to eliminate.

Implementation guidance

Provision in layers, each with its own Terraform stack and state, so the long-lived foundation (org hierarchy, perimeter, KMS) and the faster-moving data products (datasets, pipelines, dashboards) evolve independently. Terraform is the natural choice on Google Cloud (Config Connector / KCC and the Cloud Foundation Toolkit blueprints are equally valid; on multi-cloud teams Terraform wins). The layering matters more than the tool:

Layer 0 — Foundation & security (platform). Resource-Manager folders (per environment and per domain), the analytics projects, VPC Service Controls perimeter and access levels, Cloud KMS keyrings/keys for CMEK, org policies (deny public buckets, restrict service-account key creation, restrict locations), and the policy-tag taxonomies (pii.*, confidential.*) bound to IAM groups. Long-lived; changes rarely.
Layer 1 — Storage substrate (platform/data-platform). Cloud Storage buckets (landing/raw/curated, multi- or dual-region, uniform bucket-level access, lifecycle + Autoclass), the Dataplex lake, zones, and asset attachments, BigQuery datasets, and the BigLake connection + tables. Stateful; treat with care (prevent_destroy on data buckets and datasets).
Layer 2 — Pipelines & transform (data engineering). Pub/Sub topics/subscriptions, Datastream streams, Dataflow jobs (deployed as Flex Templates), Cloud Composer environment and DAGs, dbt project, scheduled queries, Dataplex data-quality scans. Deploys often.
Layer 3 — Serve (analytics). Looker instance/connection and the LookML project (version-controlled in its own repo), BigQuery authorized views and BI Engine reservation, Vertex AI Workbench instances.

A representative Terraform skeleton for the storage substrate and the BigLake-over-Cloud-Storage governance keystone (Layer 1) — note the policy-tag taxonomy, the BigLake connection that delegates access, and the partitioned/clustered managed table:

# --- Policy-tag taxonomy: column-level access by classification ---
resource "google_data_catalog_taxonomy" "pii" {
  display_name           = "pii"
  region                 = "europe-west2"
  activated_policy_types = ["FINE_GRAINED_ACCESS_CONTROL"]
}

resource "google_data_catalog_policy_tag" "email" {
  taxonomy     = google_data_catalog_taxonomy.pii.id
  display_name = "pii.email"
}

# Only the named group may read columns tagged pii.email
resource "google_data_catalog_policy_tag_iam_member" "email_readers" {
  policy_tag = google_data_catalog_policy_tag.email.name
  role       = "roles/datacatalog.categoryFineGrainedReader"
  member     = "group:pii-approved@kloudvin.com"
}

# --- Curated-zone bucket: open formats, governed, CMEK, no public access ---
resource "google_storage_bucket" "curated" {
  name                        = "kv-lakehouse-curated"
  location                    = "EU"            # multi-region for the lake tier
  uniform_bucket_level_access = true            # IAM only, no legacy ACLs
  encryption { default_kms_key_name = google_kms_crypto_key.lake.id }
  autoclass { enabled = true }                  # auto tiering of cold data
  lifecycle { prevent_destroy = true }
}

# --- BigLake connection: consumers read via THIS SA, never raw bucket access ---
resource "google_bigquery_connection" "biglake" {
  connection_id = "biglake-curated"
  location      = "EU"
  cloud_resource {}                             # creates a delegated service account
}

# Grant the connection's delegated SA read on the bucket (NOT the end users)
resource "google_storage_bucket_iam_member" "biglake_reads" {
  bucket = google_storage_bucket.curated.name
  role   = "roles/storage.objectViewer"
  member = "serviceAccount:${google_bigquery_connection.biglake.cloud_resource[0].service_account_id}"
}

# --- BigLake table over Cloud Storage: queried as SQL, governed by BigLake ---
resource "google_bigquery_table" "events_external" {
  dataset_id          = google_bigquery_dataset.curated.dataset_id
  table_id            = "events"
  deletion_protection = true
  external_data_configuration {
    autodetect    = false
    connection_id = google_bigquery_connection.biglake.name   # makes it a BigLake table
    source_format = "PARQUET"
    source_uris   = ["gs://${google_storage_bucket.curated.name}/events/*"]
    # Apache Iceberg / Hive partitioning configured here for open-table-format data
  }
}

# --- Managed high-value table: PARTITION + CLUSTER (the cost/perf lever) + CMEK ---
resource "google_bigquery_table" "fact_orders" {
  dataset_id          = google_bigquery_dataset.gold.dataset_id
  table_id            = "fact_orders"
  deletion_protection = true
  time_partitioning { type = "DAY"  field = "order_ts" }   # prune by date
  clustering          = ["customer_id", "store_id"]        # prune within partitions
  encryption_configuration { kms_key_name = google_kms_crypto_key.bq.id }
  schema = file("${path.module}/schemas/fact_orders.json") # email column carries policy tag
}

A representative Dataflow streaming pipeline + Dataplex data-quality wiring (Layer 2) — the Flex Template job and the AutoDQ scan that gates the curated zone:

# --- Dataflow streaming job (Flex Template): Pub/Sub -> BigQuery, exactly-once ---
resource "google_dataflow_flex_template_job" "stream_to_bq" {
  provider                = google-beta
  name                    = "events-stream-to-bq"
  container_spec_gcs_path = "gs://kv-templates/events-stream.json"
  parameters = {
    inputSubscription = google_pubsub_subscription.events.id
    outputTable       = "${var.project}:curated.events"   # Storage Write API, exactly-once
    # streaming engine + autoscaling configured via additional parameters
  }
  on_delete = "drain"   # drain in-flight data on teardown, never lose events
}

# --- Dataplex data-quality scan: raw -> curated promotion gate ---
resource "google_dataplex_datascan" "events_dq" {
  location     = "europe-west2"
  data_scan_id = "events-quality"
  data { resource = google_bigquery_table.events_external.id }
  data_quality_spec {
    rules {
      column      = "customer_id"
      dimension   = "COMPLETENESS"
      non_null_expectation {}
    }
    rules {
      column            = "order_amount"
      dimension         = "VALIDITY"
      range_expectation { min_value = "0" }
    }
  }
  execution_spec { trigger { on_demand {} } }   # invoked by Composer between zones
}

Networking and identity wiring, the load-bearing rules:

VPC Service Controls is the data-exfiltration boundary. Put the analytics projects (BigQuery, Cloud Storage, Dataflow, Dataplex) inside a service perimeter. Now even a leaked, over-privileged credential cannot copy a BigQuery table to a bucket in a project outside the perimeter, and bq extract to an external project is denied. Define access levels (corporate IP ranges, device trust via Access Context Manager) for the humans and ingress/egress rules for the few legitimate cross-perimeter flows (e.g., reads from a partner project). This is the control that turns “we trust our IAM” into “exfiltration is structurally blocked.”
Private connectivity, no public data paths. Reach BigQuery, Cloud Storage, and Dataflow over Private Google Access / Private Service Connect from your VPC; Dataflow workers run with no external IPs in your subnetwork. Datastream connects to source databases over private connectivity (VPC peering / PSC), never the public internet. The data plane never traverses a public path.
Identity is service-account-scoped and key-less. Every pipeline runs under its own least-privilege service account (the Dataflow job’s SA can write one dataset and read one subscription; the Composer SA can trigger jobs; the BigLake connection’s delegated SA reads one bucket). Humans get access through IAM groups (analysts@, data-eng@, pii-approved@), never individual grants, and never via downloaded service-account keys (block key creation with an org policy; use Workload Identity Federation for any external CI). Reads of governed data go through BigLake and policy tags, not raw bucket ACLs — uniform bucket-level access removes legacy ACLs entirely so there is one place access is decided.
Column-level access is enforced by policy tags, end to end. The email/ssn/dob columns carry policy tags from the pii.* taxonomy; only pii-approved@ holds the fine-grained-reader role on those tags. An analyst in analysts@ querying SELECT * simply does not get those columns back — across BigQuery and BigLake-governed Cloud Storage data — and you did not have to build separate masked views per audience.

Deployment: ship Dataflow pipelines as Flex Templates (containerized, versioned, parameterized) promoted dev→staging→prod through CI; manage LookML in its own Git repo with the standard Looker dev→production validation and deploy flow; manage dbt models in CI with tests; and keep all infrastructure in the Terraform layers above so every production change is a reviewed, attributed, revertable commit.

Enterprise considerations

Security and Zero Trust. Apply Zero Trust at the data plane, not just the network. Perimeter: VPC Service Controls wraps the analytics projects so data cannot be exfiltrated to projects outside the boundary — the single most important control in a data platform, because the threat is rarely “break in” and usually “copy out.” Identity: least-privilege service accounts per pipeline, human access via groups with IAM Conditions, no downloaded keys, Workload Identity Federation for external systems. Data-level access: policy tags drive column-level masking/denial for PII, row-level security scopes tenants and regions, and BigLake enforces all of it across engines so the data scientist’s Spark job is bound by the same rules as the analyst’s SQL — there is no back door through raw object access. Encryption: CMEK on BigQuery, Cloud Storage, and Dataflow (with Cloud EKM / external key manager if you must hold keys outside Google), TLS in transit everywhere. Posture: Security Command Center for misconfiguration and threat findings, Cloud Audit Logs (Data Access logs on BigQuery and Cloud Storage) feeding a SIEM, and Sensitive Data Protection (Cloud DLP) to discover and classify PII automatically and auto-apply policy tags — so classification is continuous, not a one-time spreadsheet. The blast radius of a compromised analyst credential is bounded by their groups, the columns their policy tags allow, and a perimeter that forbids copying the data out.

Cost optimization (FinOps). The lakehouse’s core economic win is decoupled storage and compute — you store petabytes cheaply in Cloud Storage and BigQuery storage and pay for compute only when you query. Beyond that: (1) partition + cluster every large table — the highest-leverage lever, cutting bytes-scanned (and on-demand cost) by an order of magnitude; (2) choose the right compute pricing model — on-demand (per-TB-scanned) for unpredictable/low volume, BigQuery editions with slot reservations + autoscaling for steady high volume (committed slots cut the rate, autoscaling absorbs spikes, and idle-slot sharing across reservations recovers waste); (3) storage tiering — Autoclass and lifecycle rules move cold lake data to Nearline/Coldline/Archive automatically, and BigQuery’s long-term storage pricing halves the rate on partitions untouched for 90 days; (4) materialized views and BI Engine so repeated dashboard queries hit a cheap cached layer instead of re-scanning; (5) maximum-bytes-billed guards and custom quotas per project to make a runaway SELECT * fail instead of bankrupt; (6) per-team cost attribution via labels on datasets/jobs and BigQuery’s INFORMATION_SCHEMA job stats, exposed as a chargeback dashboard in Looker so each domain sees the cost of its own queries. Dataflow costs drop with Flexible Resource Scheduling (batch on cheaper preemptible-backed capacity) and right-sized Streaming Engine autoscaling. The catalog itself saves money: discoverable, certified datasets stop teams rebuilding (and re-paying for) data that already exists.

Scalability. Independent axes, which is the whole point of decoupling. Storage scales effectively without limit (Cloud Storage and BigQuery storage are managed and elastic — petabytes are routine). Query compute scales by slots: on-demand absorbs bursts automatically, and reservations with autoscaling add slots within a ceiling for predictable heavy workloads — no cluster to resize. Streaming throughput scales by Dataflow autoscaling (the Streaming Engine adds workers as Pub/Sub backlog grows) and Pub/Sub scales horizontally with no provisioning. Concurrency scales because BigQuery’s serving is multi-tenant and reservations isolate noisy workloads from interactive analysts. The architectural discipline that keeps it scaling: partition/cluster so big tables stay cheap to query, separate reservations for ETL vs interactive vs BI so a heavy batch job cannot starve a dashboard, and materialized views / BI Engine to keep the hot serving path off raw scans.

Reliability and DR (RTO/RPO). Most of this stack is regional managed service with built-in redundancy — BigQuery, Cloud Storage (regional/dual/multi-region), Pub/Sub, and Dataflow are highly available within their region by design, so single-zone failures are non-events you do not operate. The deliberate DR decisions are about location and recovery of data and pipelines: (1) Storage durability and locality — choose dual-region or multi-region buckets for the raw/curated lake so the data survives a regional event with synchronous-ish replication (your data-layer RPO is near-zero for multi-region storage); for BigQuery, use cross-region dataset replication / managed disaster recovery or scheduled cross-region copies of critical datasets (RPO = replication lag or copy cadence). (2) Pipeline recoverability — because everything is Terraform + Flex Templates + LookML in Git, standing the platform back up in a second region is terraform apply + redeploy templates, and your RTO is “re-point ingestion + apply IaC in the DR region,” realistically tens of minutes to a few hours depending on how much you pre-stage. (3) Reprocessing as recovery — the raw zone is immutable and faithful to source, so a corrupted curated table is recovered by re-running the (idempotent) Dataflow/dbt transforms from raw, and BigQuery time travel (7 days) plus table snapshots recover from accidental deletes/bad merges within the window. The reliability discipline that makes this real is rehearsal: a DR game day that restores critical datasets to the second region and reprocesses from raw, and confirms the certified numbers reconcile.

Observability. Watch the platform on three fronts. Pipeline health: Cloud Monitoring on Dataflow (system lag, data freshness, watermark, backlog bytes, worker autoscaling), Pub/Sub (oldest-unacked-message age, subscription backlog), and Datastream freshness — alert on data freshness and backlog, the leading indicators that the stream is falling behind. Query and cost health: BigQuery INFORMATION_SCHEMA.JOBS for bytes-scanned, slot-ms, and the most expensive queries; alarms on reservation slot utilization and on any single job exceeding a bytes-billed threshold. Data health: Dataplex data-quality scan results (completeness, validity, freshness, uniqueness) and profiling trends — a quality-rule failure should block the promotion from raw to curated, not silently publish bad data, and should page the owning team. Lineage in Dataplex is the debugging tool: when a number looks wrong, follow the column-level lineage from the dashboard back through the transforms to the raw file. Alert on symptoms that matter to consumers — stale data, failed quality gates, a dashboard’s source table not refreshed — not raw infrastructure metrics.

Governance. Enforce, do not document. Dataplex is the governance system of record: lakes/zones/assets mapped to business domains, automated discovery and cataloging so every dataset is findable, data-quality scans as gates, and column-level lineage captured automatically. Policy-tag taxonomies make data classification operational (a column tagged pii.email is access-controlled everywhere, automatically), and Cloud DLP auto-discovers and tags new PII so classification keeps up with new data. Data contracts — schema + quality expectations declared per dataset and enforced by Dataplex scans — make a producing team accountable for what they publish. Land the platform in a governed resource hierarchy with org policies asserting the invariants (no public buckets, no SA keys, location restrictions, CMEK required), and treat the Terraform / LookML / dbt PR history as the change-management record — every production change to a pipeline, a metric, or an access policy is a reviewed, attributed, revertable commit, which is exactly what an auditor and an acquirer’s due-diligence team ask for and exactly what “export a copy and tweak it in a notebook” never provides. This is also the substrate for data-mesh operating models: each domain owns its zone, datasets, and contracts, while Dataplex provides the federated catalog and the central platform team provides the paved road.

Reference enterprise example

Tideford Grocers is a (fictional) mid-market grocery and e-commerce chain — roughly 420 stores, a growing online business, ~30 TB of curated data growing several TB a quarter, and a data team of about fourteen split across one platform squad and three domain squads (retail/stores, e-commerce, supply chain). They ran a legacy on-prem warehouse that loaded overnight, plus a Cloud Storage “lake” the data-science team filled with parquet exports. The breaking point was a board meeting where three slides — store revenue, e-commerce revenue, and total revenue — did not add up because each was built from a different copy of the orders data with different return-handling logic. In parallel, a supplier-data-sharing agreement required them to prove data lineage and PII access controls, which the spreadsheet-and-wiki status quo could not.

What they built. One governed lakehouse in europe-west2 (London), organized as a Dataplex lake with raw and curated zones and a gold/business layer. Ingest came through three doors: Datastream CDC from the operational MySQL order database into the raw zone; Pub/Sub + Dataflow streaming for clickstream and store point-of-sale events (writing to BigQuery via the Storage Write API, exactly-once); and Cloud Storage landing for nightly supplier and ERP extracts. Raw lived as Parquet in Cloud Storage; the high-value, heavily-queried order, inventory, and customer tables lived in BigQuery managed storage, every large table partitioned by event date and clustered on customer_id/store_id. The data-science team’s parquet — previously read with raw bucket access — was converted to BigLake tables, so their Vertex AI notebooks and Spark on Dataproc now read the same governed copy as the analysts, with PII columns (email, loyalty_id) carrying policy tags that only the pii-approved@ group could read. Transformation was Dataflow for the streaming/CDC path and dbt-on-BigQuery for the SQL-first batch models, orchestrated by Cloud Composer, with Dataplex data-quality scans gating every raw→curated promotion. Serving consolidated on Looker: revenue, returns-adjusted net revenue, basket size, and active-customer were each defined once in LookML, so the three board slides finally drew from one definition. The whole estate sat inside a VPC Service Controls perimeter with CMEK on BigQuery/Cloud Storage/Dataflow, Dataflow workers on no external IPs, and Datastream over private connectivity.

The numbers and decisions. Roughly $11,500/month all-in: ~$2,100 BigQuery storage (active + long-term, the bulk of the 30 TB) and Cloud Storage (with Autoclass tiering cold raw data); ~$4,200 BigQuery compute on a standard-edition reservation with autoscaling (they moved off pure on-demand once query volume became predictable, committing a slot baseline and autoscaling for the morning analyst rush and month-end); ~$2,400 Dataflow (streaming Engine for the always-on POS/clickstream jobs plus batch with Flexible Resource Scheduling); ~$1,300 Looker; ~$900 Dataplex scans, Datastream, Pub/Sub, Composer, and the rest; with maximum-bytes-billed guards and per-project quotas capping runaway queries. They debated pure on-demand vs reservations and chose reservations once INFORMATION_SCHEMA showed steady weekday slot demand — it cut the effective compute rate roughly a third while autoscaling kept month-end from queuing. They debated giving data scientists raw bucket access “to keep it simple” and rejected it outright — BigLake was the keystone that let them share one copy and keep the PII controls, and it was the control that satisfied the supplier agreement’s audit clause. They debated a second BI tool for the data team and standardized on Looker’s semantic layer specifically to stop metric divergence at the source.

The outcome. The three-slides-don’t-add-up problem became structurally impossible: revenue is one LookML definition, and store + e-commerce now reconcile to total by construction — the next board review spent its time on why e-commerce growth slowed, not on which number was right. The fraud-and-stockout latency collapsed from “tomorrow’s batch” to under three minutes on the streaming path (POS events to a Looker tile), and because the streaming and batch metrics share definitions and the stream is exactly-once, the operational dashboard and the next-day reconciliation agree. The supplier data-sharing audit, previously a multi-week scramble, was satisfied by exporting Dataplex column-level lineage (raw file → dbt model → certified table → dashboard) and the policy-tag access report showing exactly who could read email/loyalty_id — a half-day exercise. A SELECT * that previously would have scanned a 12 TB orders table now prunes to a single day’s partition and a clustered customer range, and the maximum-bytes-billed guard caught two genuinely runaway exploratory queries before they cost anything. For DR, the raw zone is multi-region and immutable and the platform is entirely Terraform + Flex Templates + LookML-in-Git; a rehearsed recovery — replicate critical BigQuery datasets to a second region, terraform apply, redeploy templates, and reprocess curated from raw — measured an RTO of ~2 hours with a data-layer RPO near zero for the multi-region storage and minutes for the replicated warehouse tables. Net: one governed copy of the data, served fresh to every persona in their own tool, with the lineage and access controls an auditor accepts — operated by a team of fourteen for a bit over $11k/month.

When to use it

Use this architecture when you need a single analytical plane for the enterprise — one governed copy of the data, queryable by SQL analysts, data scientists, and BI tools alike, fresh enough for operational decisions, with a real catalog, lineage, and column-level governance. It is the correct default when you are tired of the warehouse-versus-lake split and the copy-to-query tax it imposes: it scales down to one analytics team on a few terabytes who want BigQuery + Looker + a little governance, and up to a regulated multinational running a data mesh of dozens of domains behind one Dataplex catalog and one VPC-SC perimeter — the diagram is the same; only the dataset count, policy-tag strictness, and reservation size change. The prerequisites are modest: a willingness to land raw and transform forward (so reprocessing is always possible), to govern through BigLake and policy tags rather than raw bucket ACLs, and to define metrics once in a semantic layer.

Trade-offs to accept going in. This is an analytical platform, not an OLTP database — BigQuery is built for scans and aggregations, not single-row lookups at thousands of QPS, and trying to back an application’s transactional reads with it is the classic mismatch (use Spanner/Cloud SQL/Bigtable for that and let it feed the lakehouse). You are accepting a bytes-scanned cost model that punishes un-partitioned tables and careless SELECT * — a discipline you must enforce with guards and culture, or the bill surprises you. And BigLake/policy-tag governance is non-optional for the safety story; if your organization cannot resist handing out raw bucket access, you do not have a lakehouse, you have a lake with a warehouse next to it and the same divergence problem you started with.

Anti-patterns that quietly defeat the design:

Raw bucket access for “the data scientists.” The single most common failure. Granting storage.objects.get so Spark can read the underlying parquet bypasses every BigQuery column policy you built and re-creates two access models. Govern through BigLake; consumers read via the delegated connection, never raw objects.
No partitioning or clustering “we’ll add it later.” Every query then scans the whole table, latency and cost explode, and one SELECT * is a five-figure surprise. Partition + cluster every large table from creation — it is the highest-leverage choice you make.
Metrics defined in dashboards instead of a semantic layer. Certifying the storage but letting each dashboard re-derive revenue just moves divergence to the presentation layer. Define metrics once in LookML (or an equivalent semantic model).
A separate streaming pipeline that re-implements the batch logic. Two pipelines for the same metric will drift and re-create the “two numbers” problem. Share transformation logic (one Beam pipeline, or the same dbt models fed by both paths) and use exactly-once writes so stream and batch reconcile.
A gold/curated/raw folder convention with nothing enforcing it. Folders are documentation; a swamp forms within a quarter. Use Dataplex zones so discovery, quality scans, lineage, and promotion gates actually attach to the data.
No exfiltration boundary. Trusting IAM alone means one over-privileged credential can copy the warehouse to an external project. Wrap the data projects in VPC Service Controls so copying out is structurally blocked.
Transforming on ingest with no immutable raw. If you clean data as it lands and keep no faithful raw copy, a bad transform is unrecoverable and reprocessing is impossible. Land raw immutably; transform forward.

Alternatives, in increasing capability and operational cost: (1) BigQuery alone — for a team that lives entirely in SQL on structured data, BigQuery + scheduled queries + a BI tool is a lakehouse-lite and may be all you need; add the rest only when you have open-format data, multiple engines, or governance pressure. (2) This article — BigQuery + BigLake + Cloud Storage + Dataflow + Dataplex + Looker — the governed lakehouse: one copy, many engines, one governance plane, the workhorse default for an enterprise analytical platform. (3) The same shape with open table formats front-and-center — BigQuery’s Iceberg/BigLake tables as the system of record so non-Google engines (Spark, Trino, Snowflake) share the exact same tables, when true multi-engine/open-format interoperability is a hard requirement. (4) Databricks or Snowflake on GCP — a third-party lakehouse/warehouse if you are standardizing a multi-cloud data platform on one vendor; you trade the deep native integration (BigLake, Dataplex, policy tags, BigQuery ML) for cross-cloud portability. (5) A full data mesh — this same architecture, federated, with each domain owning its Dataplex zone, datasets, and data contracts and the platform team providing the paved road — when organizational scale (many autonomous domain teams) outgrows a single central data team. Pick the lowest tier that meets your governance, freshness, and multi-engine requirements; most teams reach for a third-party platform or a full mesh when native BigQuery + BigLake + Dataplex with disciplined partitioning and one semantic layer would have done — and the platform you can actually govern and rehearse beats the one you merely drew.

GCP Enterprise Architecture: Data Warehouse & Lakehouse

The business scenario

Architecture overview

Component breakdown

Implementation guidance

Enterprise considerations

Reference enterprise example

When to use it

Written by Vinod

Comments

Keep Reading

Data Contracts and Schema Registry for Reliable Pipelines

Data Quality and Observability Architecture

Enterprise Data Catalog, Lineage and Governance