Data GCP

GCP Enterprise Architecture: Data Warehouse & Lakehouse

Nearly every enterprise data platform fails in the same way, and it is not a technology failure — it is a copies failure. The warehouse has a dim_customer. The data-science team exported a parquet snapshot of it into a bucket six months ago. The marketing stack has a third version, deduplicated by different rules. Finance runs a fourth in a spreadsheet. Three of the four disagree about how many customers there are, and the quarterly business review burns its first twenty minutes arguing about which number is real instead of what to do about it. Underneath, the same physical fact — a customer — has been copied, transformed, and diverged across four systems because the warehouse and the lake were two different worlds with two different access models, and the only way to get warehouse data into a notebook was to extract it. The lakehouse on Google Cloud is the architecture that ends the copying: it makes one physical copy of the data in Cloud Storage and BigQuery’s native storage queryable, with one access model, by the SQL analyst, the Spark data scientist, and the BI tool alike — so there is exactly one dim_customer, governed once, and every persona reads that. The pattern scales down to one analytics team on a few terabytes and up to a regulated multinational on petabytes across dozens of domain teams; what changes is the number of datasets, the strictness of the policy tags, and the reservation size, not the shape.

This article is a reference architecture in the spirit of the Google Cloud Architecture Center: it picks real services — BigQuery, Dataflow, Dataplex, Cloud Storage, BigLake, and Looker — wires them end to end, and makes the trade-off calls explicit. The running example is a mid-market company, but the design genuinely spans from a single team to a large enterprise without changing its bones.

The business scenario

Picture a company anywhere on the small-to-large spectrum — the symptoms are remarkably consistent regardless of size, because they come from the structure of the old split between warehouse and lake, not from scale:

The outcome the lakehouse is chartered to deliver is precise: one governed copy of the data — open formats on cheap object storage plus BigQuery’s columnar store — queryable by every persona in their preferred tool, fresh enough for operational decisions, with a real catalog, automated data-quality checks, column-level access control, end-to-end lineage, and a cost model that attributes spend to the team that caused it. Concretely: a single certified revenue figure; a fraud or stockout signal visible in minutes not hours; month-end close that does not need a six-hour batch window; a self-service catalog where an analyst can discover and request access to a trustworthy dataset; and a lineage graph an auditor can follow from a raw file to the cell in the board deck. The non-goals matter too: this is not an OLTP system (BigQuery is an analytical engine, not your application database), and it is not a real-time sub-second serving store for an app (that is Bigtable/Spanner territory). It is the analytical plane of the enterprise — the place all the data converges, is governed once, and is served to everyone.

Architecture overview

The organizing idea is one storage substrate, many engines, one governance plane. Data lands and lives in Cloud Storage (open formats — Parquet, ORC, Iceberg, Delta) and in BigQuery’s managed storage; BigQuery via BigLake presents both as ordinary SQL tables with one security model; Dataflow handles all movement and transformation for both streaming and batch with the same code; Dataplex wraps the whole estate with catalog, data quality, and governance; and Looker serves a single governed semantic layer to the business. Governance is not a final step bolted on the right — it is a plane that wraps the entire flow.

The data path moves left-to-right through five planes — ingest, store, transform, serve, govern — and the deliberate design decision is that no consumer ever copies data out to query it; they all point at the same lake-and-warehouse storage through BigLake.

Reference architecture for a governed GCP lakehouse: ingest via Pub/Sub, Dataflow, Datastream and Cloud Storage; one governed copy in Cloud Storage and BigQuery unified by BigLake; transform with Dataflow, BigQuery SQL/dbt and Dataproc; serve through BigQuery, Looker, Vertex AI and BigQuery ML; with Dataplex catalog, policy tags, VPC Service Controls and CMEK governing the whole estate.

1. Ingest. Source data arrives through three doors, and none of them transform data on the way in:

2. Store — the source of truth. Everything converges on two tiers of storage that BigLake unifies into one queryable surface:

This storage substrate is the centre of gravity; every other component points at it.

3. Transform. Dataflow (managed Apache Beam) is the unified processing engine, and the single most important property is that one Beam pipeline expresses both the streaming and the batch path, so the two physically cannot drift. Streaming jobs read Pub/Sub, apply windowing, enrichment, and quality rules, and MERGE/append into the curated tier with exactly-once semantics; batch jobs do the heavier historical transforms and backfills. Where teams prefer SQL-first transformation, BigQuery itself is the transform engine — scheduled queries, MERGE statements, materialized views, and dbt building curated and business tables in-warehouse. Where teams have existing Spark, Dataproc (or BigQuery’s Spark integration) reads the same BigLake tables. The point is that all of these write to the same governed storage; the engine is a preference, not a fork in the source of truth. Orchestration is Cloud Composer (managed Airflow) or Cloud Workflows for lighter DAGs.

4. Serve. Consumption fans out from the one copy:

5. Govern (wrapping everything). Dataplex is the governance plane across the entire estate: it provides the unified data catalog (auto-discovered metadata for BigQuery datasets and Cloud Storage assets, searchable by every persona), automated data-quality scans (rule-based and recommended), data profiling, data lineage (column-level, captured automatically from BigQuery and Dataflow), and business-domain organization (lakes → zones → assets that map to your business domains). Policy is enforced by BigQuery policy tags (taxonomies driving column-level access — e.g., a pii.email tag that only a named group can read), VPC Service Controls (a service perimeter that stops data from being exfiltrated to projects outside the boundary), and CMEK encryption. Lineage plus the catalog plus policy tags are exactly the three artifacts the auditor and the acquirer ask for.

The mental model: Cloud Storage and BigQuery hold one physical copy; BigLake makes both queryable under one security model so no one ever copies to query; Dataflow moves and transforms everything with shared streaming/batch code; Dataplex catalogs, quality-checks, and lineage-tracks the whole thing; and Looker serves one certified semantic layer to the business. One copy, one access model, one set of metrics — read by everyone.

Component breakdown

Component GCP service What it does Why it’s here / key configuration choices
Streaming ingest Pub/Sub + Dataflow + Datastream Durable event bus; managed Beam streaming; CDC from databases Pub/Sub as the decoupling buffer; Dataflow streaming with the BigQuery Storage Write API for exactly-once; Datastream for low-latency CDC into the lake; schema enforced at the edge, not transformed on ingest
Batch / bulk ingest Cloud Storage + Dataflow + BQ Data Transfer Service Landing zone for files; managed batch ETL; scheduled SaaS/cross-cloud imports Land raw first (immutable), then transform; Data Transfer Service for recurring SaaS feeds; dual-region or multi-region buckets for the landing/raw tier; Autoclass for tiering
Lake storage (open) Cloud Storage Cheap, open-format, engine-agnostic data — raw + open tables Parquet/ORC + Apache Iceberg/Delta; organized as Dataplex raw/curated zones; uniform bucket-level access; lifecycle rules; CMEK; this is the layer non-BigQuery engines (Spark/Trino) also read
Warehouse storage (managed) BigQuery (managed storage) Optimized columnar store for high-value, hot, heavily-queried tables Partition (by ingestion date or an event-time column) + cluster (on high-cardinality filter keys) every large table; time-travel (7-day default, configurable); the fastest query path for serving
Unified table & security layer BigLake One queryable surface + one fine-grained security model over CS and BigQuery BigLake tables over Cloud Storage with row/column/table-level access enforced by any engine; object tables for unstructured data; BigQuery Omni to query S3/Azure in place — the keystone that means no copy to query
Transform / processing Dataflow (Apache Beam) — and BigQuery SQL / dbt / Dataproc Streaming + batch transformation with shared code; in-warehouse SQL transforms One Beam pipeline for stream + batch so they cannot drift; Flexible Resource Scheduling / Streaming Engine; dbt or scheduled MERGE for SQL-first teams; Dataproc for existing Spark on the same BigLake tables
Orchestration Cloud Composer (Airflow) / Cloud Workflows Schedules and sequences pipelines and quality gates Composer for complex DAGs and dependencies; Workflows for lightweight serverless orchestration; quality gates between zones (raw→curated only if checks pass)
Governance plane Dataplex Catalog, data quality, profiling, lineage, business-domain org Dataplex lakes → zones → assets mapped to domains; auto data discovery; AutoDQ data-quality scans; column-level lineage (auto from BigQuery + Dataflow); the catalog every persona searches
Access policy & encryption BigQuery policy tags + VPC Service Controls + Cloud KMS Column-level access by taxonomy; exfiltration perimeter; CMEK Policy-tag taxonomies (e.g., pii.*) bound to IAM groups for column-level control; VPC-SC perimeter around the data projects; CMEK on BigQuery, Cloud Storage, Dataflow
BI / semantic layer Looker (LookML) One governed metric definitions layer served to the business LookML defines metrics once on top of BigQuery; governed Explores, embedded analytics, and scheduled delivery; BI Engine to accelerate dashboards; kills “two numbers” at the modeling layer
ML / data science BigQuery ML + Vertex AI Workbench SQL-native and notebook ML on the same governed copy BigQuery ML (CREATE MODEL in SQL) for analysts; Vertex AI notebooks read BigLake tables directly — no export, models trained on the certified data
Identity & org IAM + Resource Manager + Service Accounts Least-privilege access; project/folder hierarchy Folders per domain/environment; service accounts per pipeline (no user keys); IAM Conditions + groups, not individual grants; reads governed by BigLake/policy tags, not raw bucket ACLs

A few choices deserve the “why,” because they are exactly where a naive “warehouse plus a lake” quietly betrays you.

Why BigLake is the keystone, not an optional add-on. The entire premise — one copy, queried by everyone, governed once — collapses without a layer that puts a single fine-grained security model over both the object-store files and the warehouse tables. The naive alternative is to grant data scientists raw storage.objects.get on the bucket so their Spark jobs can read the parquet directly. The moment you do that, you have two access models (BigQuery’s row/column policies for analysts, and coarse bucket IAM for everyone else), the column-level PII controls you built in BigQuery are trivially bypassed by reading the underlying files, and your audit story has a hole you can drive a truck through. BigLake closes the hole: it enforces table-, row-, and column-level access on the Cloud Storage data itself, for whatever engine reads it, using delegated service-account credentials so consumers never get raw object access. That is what makes “let the data scientists read the same governed copy” safe, and it is the difference between a lakehouse and a lake with a warehouse next to it.

Why metrics live in LookML, not in dashboards. The “two dashboards show different revenue” problem is almost never a data problem — it is a definition problem. Two analysts wrote two slightly different SUM(...) with two slightly different filters in two BI tools. If you certify the storage but let every dashboard re-derive metrics ad hoc, the divergence simply moves from the data layer to the presentation layer and you have solved nothing. Defining revenue, churn, margin, and active-customer once in LookML — a version-controlled, reviewed semantic model on top of BigQuery — means every Explore, every dashboard, and every embedded report computes the same number the same way. The certified figure is enforced at the modeling layer, which is the only place it can actually be enforced.

Why partition and cluster every large BigQuery table — it is a cost control, not just performance. BigQuery’s on-demand pricing bills by bytes scanned. An unpartitioned, unclustered table means every query that filters on a date or a customer ID still scans the entire table — you pay full price for data you immediately discard, and a single careless SELECT * on a 50 TB table is a five-figure surprise. Partitioning (by ingestion time or an event-time column) prunes whole date ranges before scanning; clustering (on the high-cardinality keys you filter and join on) prunes within partitions. The combination routinely cuts bytes-scanned — and therefore cost and latency — by an order of magnitude. This is the single highest-leverage configuration choice in the warehouse tier, and skipping it is the most common reason a BigQuery bill spirals.

Why Dataplex zones, not just folders in a bucket. A raw/, curated/, gold/ folder convention in a bucket is documentation — nothing enforces it, nothing scans it, and nothing tells an analyst what is in it. Dataplex zones turn that convention into a managed construct: assets in a zone are auto-discovered and cataloged, data-quality and profiling scans run against them, lineage is captured, and the raw→curated promotion can be gated on quality checks passing. The zone is where governance actually attaches to the physical data, which is what turns a data lake into something other than a data swamp.

Why exactly-once on the streaming write path. A fraud or revenue signal computed from a stream that double-counts events is worse than no signal — it is a wrong signal that people trust. Dataflow writing to BigQuery via the Storage Write API with exactly-once semantics (rather than legacy streaming inserts or at-least-once file drops) means the curated table is correct even across worker restarts and retries, so the same metric computed on the stream and on the batch backfill agree. Without it, the streaming path silently diverges from the batch path and re-creates the exact “two pipelines, two numbers” problem the architecture exists to eliminate.

Implementation guidance

Provision in layers, each with its own Terraform stack and state, so the long-lived foundation (org hierarchy, perimeter, KMS) and the faster-moving data products (datasets, pipelines, dashboards) evolve independently. Terraform is the natural choice on Google Cloud (Config Connector / KCC and the Cloud Foundation Toolkit blueprints are equally valid; on multi-cloud teams Terraform wins). The layering matters more than the tool:

A representative Terraform skeleton for the storage substrate and the BigLake-over-Cloud-Storage governance keystone (Layer 1) — note the policy-tag taxonomy, the BigLake connection that delegates access, and the partitioned/clustered managed table:

# --- Policy-tag taxonomy: column-level access by classification ---
resource "google_data_catalog_taxonomy" "pii" {
  display_name           = "pii"
  region                 = "europe-west2"
  activated_policy_types = ["FINE_GRAINED_ACCESS_CONTROL"]
}

resource "google_data_catalog_policy_tag" "email" {
  taxonomy     = google_data_catalog_taxonomy.pii.id
  display_name = "pii.email"
}

# Only the named group may read columns tagged pii.email
resource "google_data_catalog_policy_tag_iam_member" "email_readers" {
  policy_tag = google_data_catalog_policy_tag.email.name
  role       = "roles/datacatalog.categoryFineGrainedReader"
  member     = "group:pii-approved@kloudvin.com"
}

# --- Curated-zone bucket: open formats, governed, CMEK, no public access ---
resource "google_storage_bucket" "curated" {
  name                        = "kv-lakehouse-curated"
  location                    = "EU"            # multi-region for the lake tier
  uniform_bucket_level_access = true            # IAM only, no legacy ACLs
  encryption { default_kms_key_name = google_kms_crypto_key.lake.id }
  autoclass { enabled = true }                  # auto tiering of cold data
  lifecycle { prevent_destroy = true }
}

# --- BigLake connection: consumers read via THIS SA, never raw bucket access ---
resource "google_bigquery_connection" "biglake" {
  connection_id = "biglake-curated"
  location      = "EU"
  cloud_resource {}                             # creates a delegated service account
}

# Grant the connection's delegated SA read on the bucket (NOT the end users)
resource "google_storage_bucket_iam_member" "biglake_reads" {
  bucket = google_storage_bucket.curated.name
  role   = "roles/storage.objectViewer"
  member = "serviceAccount:${google_bigquery_connection.biglake.cloud_resource[0].service_account_id}"
}

# --- BigLake table over Cloud Storage: queried as SQL, governed by BigLake ---
resource "google_bigquery_table" "events_external" {
  dataset_id          = google_bigquery_dataset.curated.dataset_id
  table_id            = "events"
  deletion_protection = true
  external_data_configuration {
    autodetect    = false
    connection_id = google_bigquery_connection.biglake.name   # makes it a BigLake table
    source_format = "PARQUET"
    source_uris   = ["gs://${google_storage_bucket.curated.name}/events/*"]
    # Apache Iceberg / Hive partitioning configured here for open-table-format data
  }
}

# --- Managed high-value table: PARTITION + CLUSTER (the cost/perf lever) + CMEK ---
resource "google_bigquery_table" "fact_orders" {
  dataset_id          = google_bigquery_dataset.gold.dataset_id
  table_id            = "fact_orders"
  deletion_protection = true
  time_partitioning { type = "DAY"  field = "order_ts" }   # prune by date
  clustering          = ["customer_id", "store_id"]        # prune within partitions
  encryption_configuration { kms_key_name = google_kms_crypto_key.bq.id }
  schema = file("${path.module}/schemas/fact_orders.json") # email column carries policy tag
}

A representative Dataflow streaming pipeline + Dataplex data-quality wiring (Layer 2) — the Flex Template job and the AutoDQ scan that gates the curated zone:

# --- Dataflow streaming job (Flex Template): Pub/Sub -> BigQuery, exactly-once ---
resource "google_dataflow_flex_template_job" "stream_to_bq" {
  provider                = google-beta
  name                    = "events-stream-to-bq"
  container_spec_gcs_path = "gs://kv-templates/events-stream.json"
  parameters = {
    inputSubscription = google_pubsub_subscription.events.id
    outputTable       = "${var.project}:curated.events"   # Storage Write API, exactly-once
    # streaming engine + autoscaling configured via additional parameters
  }
  on_delete = "drain"   # drain in-flight data on teardown, never lose events
}

# --- Dataplex data-quality scan: raw -> curated promotion gate ---
resource "google_dataplex_datascan" "events_dq" {
  location     = "europe-west2"
  data_scan_id = "events-quality"
  data { resource = google_bigquery_table.events_external.id }
  data_quality_spec {
    rules {
      column      = "customer_id"
      dimension   = "COMPLETENESS"
      non_null_expectation {}
    }
    rules {
      column            = "order_amount"
      dimension         = "VALIDITY"
      range_expectation { min_value = "0" }
    }
  }
  execution_spec { trigger { on_demand {} } }   # invoked by Composer between zones
}

Networking and identity wiring, the load-bearing rules:

Deployment: ship Dataflow pipelines as Flex Templates (containerized, versioned, parameterized) promoted dev→staging→prod through CI; manage LookML in its own Git repo with the standard Looker dev→production validation and deploy flow; manage dbt models in CI with tests; and keep all infrastructure in the Terraform layers above so every production change is a reviewed, attributed, revertable commit.

Enterprise considerations

Security and Zero Trust. Apply Zero Trust at the data plane, not just the network. Perimeter: VPC Service Controls wraps the analytics projects so data cannot be exfiltrated to projects outside the boundary — the single most important control in a data platform, because the threat is rarely “break in” and usually “copy out.” Identity: least-privilege service accounts per pipeline, human access via groups with IAM Conditions, no downloaded keys, Workload Identity Federation for external systems. Data-level access: policy tags drive column-level masking/denial for PII, row-level security scopes tenants and regions, and BigLake enforces all of it across engines so the data scientist’s Spark job is bound by the same rules as the analyst’s SQL — there is no back door through raw object access. Encryption: CMEK on BigQuery, Cloud Storage, and Dataflow (with Cloud EKM / external key manager if you must hold keys outside Google), TLS in transit everywhere. Posture: Security Command Center for misconfiguration and threat findings, Cloud Audit Logs (Data Access logs on BigQuery and Cloud Storage) feeding a SIEM, and Sensitive Data Protection (Cloud DLP) to discover and classify PII automatically and auto-apply policy tags — so classification is continuous, not a one-time spreadsheet. The blast radius of a compromised analyst credential is bounded by their groups, the columns their policy tags allow, and a perimeter that forbids copying the data out.

Cost optimization (FinOps). The lakehouse’s core economic win is decoupled storage and compute — you store petabytes cheaply in Cloud Storage and BigQuery storage and pay for compute only when you query. Beyond that: (1) partition + cluster every large table — the highest-leverage lever, cutting bytes-scanned (and on-demand cost) by an order of magnitude; (2) choose the right compute pricing modelon-demand (per-TB-scanned) for unpredictable/low volume, BigQuery editions with slot reservations + autoscaling for steady high volume (committed slots cut the rate, autoscaling absorbs spikes, and idle-slot sharing across reservations recovers waste); (3) storage tieringAutoclass and lifecycle rules move cold lake data to Nearline/Coldline/Archive automatically, and BigQuery’s long-term storage pricing halves the rate on partitions untouched for 90 days; (4) materialized views and BI Engine so repeated dashboard queries hit a cheap cached layer instead of re-scanning; (5) maximum-bytes-billed guards and custom quotas per project to make a runaway SELECT * fail instead of bankrupt; (6) per-team cost attribution via labels on datasets/jobs and BigQuery’s INFORMATION_SCHEMA job stats, exposed as a chargeback dashboard in Looker so each domain sees the cost of its own queries. Dataflow costs drop with Flexible Resource Scheduling (batch on cheaper preemptible-backed capacity) and right-sized Streaming Engine autoscaling. The catalog itself saves money: discoverable, certified datasets stop teams rebuilding (and re-paying for) data that already exists.

Scalability. Independent axes, which is the whole point of decoupling. Storage scales effectively without limit (Cloud Storage and BigQuery storage are managed and elastic — petabytes are routine). Query compute scales by slots: on-demand absorbs bursts automatically, and reservations with autoscaling add slots within a ceiling for predictable heavy workloads — no cluster to resize. Streaming throughput scales by Dataflow autoscaling (the Streaming Engine adds workers as Pub/Sub backlog grows) and Pub/Sub scales horizontally with no provisioning. Concurrency scales because BigQuery’s serving is multi-tenant and reservations isolate noisy workloads from interactive analysts. The architectural discipline that keeps it scaling: partition/cluster so big tables stay cheap to query, separate reservations for ETL vs interactive vs BI so a heavy batch job cannot starve a dashboard, and materialized views / BI Engine to keep the hot serving path off raw scans.

Reliability and DR (RTO/RPO). Most of this stack is regional managed service with built-in redundancy — BigQuery, Cloud Storage (regional/dual/multi-region), Pub/Sub, and Dataflow are highly available within their region by design, so single-zone failures are non-events you do not operate. The deliberate DR decisions are about location and recovery of data and pipelines: (1) Storage durability and locality — choose dual-region or multi-region buckets for the raw/curated lake so the data survives a regional event with synchronous-ish replication (your data-layer RPO is near-zero for multi-region storage); for BigQuery, use cross-region dataset replication / managed disaster recovery or scheduled cross-region copies of critical datasets (RPO = replication lag or copy cadence). (2) Pipeline recoverability — because everything is Terraform + Flex Templates + LookML in Git, standing the platform back up in a second region is terraform apply + redeploy templates, and your RTO is “re-point ingestion + apply IaC in the DR region,” realistically tens of minutes to a few hours depending on how much you pre-stage. (3) Reprocessing as recovery — the raw zone is immutable and faithful to source, so a corrupted curated table is recovered by re-running the (idempotent) Dataflow/dbt transforms from raw, and BigQuery time travel (7 days) plus table snapshots recover from accidental deletes/bad merges within the window. The reliability discipline that makes this real is rehearsal: a DR game day that restores critical datasets to the second region and reprocesses from raw, and confirms the certified numbers reconcile.

Observability. Watch the platform on three fronts. Pipeline health: Cloud Monitoring on Dataflow (system lag, data freshness, watermark, backlog bytes, worker autoscaling), Pub/Sub (oldest-unacked-message age, subscription backlog), and Datastream freshness — alert on data freshness and backlog, the leading indicators that the stream is falling behind. Query and cost health: BigQuery INFORMATION_SCHEMA.JOBS for bytes-scanned, slot-ms, and the most expensive queries; alarms on reservation slot utilization and on any single job exceeding a bytes-billed threshold. Data health: Dataplex data-quality scan results (completeness, validity, freshness, uniqueness) and profiling trends — a quality-rule failure should block the promotion from raw to curated, not silently publish bad data, and should page the owning team. Lineage in Dataplex is the debugging tool: when a number looks wrong, follow the column-level lineage from the dashboard back through the transforms to the raw file. Alert on symptoms that matter to consumers — stale data, failed quality gates, a dashboard’s source table not refreshed — not raw infrastructure metrics.

Governance. Enforce, do not document. Dataplex is the governance system of record: lakes/zones/assets mapped to business domains, automated discovery and cataloging so every dataset is findable, data-quality scans as gates, and column-level lineage captured automatically. Policy-tag taxonomies make data classification operational (a column tagged pii.email is access-controlled everywhere, automatically), and Cloud DLP auto-discovers and tags new PII so classification keeps up with new data. Data contracts — schema + quality expectations declared per dataset and enforced by Dataplex scans — make a producing team accountable for what they publish. Land the platform in a governed resource hierarchy with org policies asserting the invariants (no public buckets, no SA keys, location restrictions, CMEK required), and treat the Terraform / LookML / dbt PR history as the change-management record — every production change to a pipeline, a metric, or an access policy is a reviewed, attributed, revertable commit, which is exactly what an auditor and an acquirer’s due-diligence team ask for and exactly what “export a copy and tweak it in a notebook” never provides. This is also the substrate for data-mesh operating models: each domain owns its zone, datasets, and contracts, while Dataplex provides the federated catalog and the central platform team provides the paved road.

Reference enterprise example

Tideford Grocers is a (fictional) mid-market grocery and e-commerce chain — roughly 420 stores, a growing online business, ~30 TB of curated data growing several TB a quarter, and a data team of about fourteen split across one platform squad and three domain squads (retail/stores, e-commerce, supply chain). They ran a legacy on-prem warehouse that loaded overnight, plus a Cloud Storage “lake” the data-science team filled with parquet exports. The breaking point was a board meeting where three slides — store revenue, e-commerce revenue, and total revenue — did not add up because each was built from a different copy of the orders data with different return-handling logic. In parallel, a supplier-data-sharing agreement required them to prove data lineage and PII access controls, which the spreadsheet-and-wiki status quo could not.

What they built. One governed lakehouse in europe-west2 (London), organized as a Dataplex lake with raw and curated zones and a gold/business layer. Ingest came through three doors: Datastream CDC from the operational MySQL order database into the raw zone; Pub/Sub + Dataflow streaming for clickstream and store point-of-sale events (writing to BigQuery via the Storage Write API, exactly-once); and Cloud Storage landing for nightly supplier and ERP extracts. Raw lived as Parquet in Cloud Storage; the high-value, heavily-queried order, inventory, and customer tables lived in BigQuery managed storage, every large table partitioned by event date and clustered on customer_id/store_id. The data-science team’s parquet — previously read with raw bucket access — was converted to BigLake tables, so their Vertex AI notebooks and Spark on Dataproc now read the same governed copy as the analysts, with PII columns (email, loyalty_id) carrying policy tags that only the pii-approved@ group could read. Transformation was Dataflow for the streaming/CDC path and dbt-on-BigQuery for the SQL-first batch models, orchestrated by Cloud Composer, with Dataplex data-quality scans gating every raw→curated promotion. Serving consolidated on Looker: revenue, returns-adjusted net revenue, basket size, and active-customer were each defined once in LookML, so the three board slides finally drew from one definition. The whole estate sat inside a VPC Service Controls perimeter with CMEK on BigQuery/Cloud Storage/Dataflow, Dataflow workers on no external IPs, and Datastream over private connectivity.

The numbers and decisions. Roughly $11,500/month all-in: ~$2,100 BigQuery storage (active + long-term, the bulk of the 30 TB) and Cloud Storage (with Autoclass tiering cold raw data); ~$4,200 BigQuery compute on a standard-edition reservation with autoscaling (they moved off pure on-demand once query volume became predictable, committing a slot baseline and autoscaling for the morning analyst rush and month-end); ~$2,400 Dataflow (streaming Engine for the always-on POS/clickstream jobs plus batch with Flexible Resource Scheduling); ~$1,300 Looker; ~$900 Dataplex scans, Datastream, Pub/Sub, Composer, and the rest; with maximum-bytes-billed guards and per-project quotas capping runaway queries. They debated pure on-demand vs reservations and chose reservations once INFORMATION_SCHEMA showed steady weekday slot demand — it cut the effective compute rate roughly a third while autoscaling kept month-end from queuing. They debated giving data scientists raw bucket access “to keep it simple” and rejected it outright — BigLake was the keystone that let them share one copy and keep the PII controls, and it was the control that satisfied the supplier agreement’s audit clause. They debated a second BI tool for the data team and standardized on Looker’s semantic layer specifically to stop metric divergence at the source.

The outcome. The three-slides-don’t-add-up problem became structurally impossible: revenue is one LookML definition, and store + e-commerce now reconcile to total by construction — the next board review spent its time on why e-commerce growth slowed, not on which number was right. The fraud-and-stockout latency collapsed from “tomorrow’s batch” to under three minutes on the streaming path (POS events to a Looker tile), and because the streaming and batch metrics share definitions and the stream is exactly-once, the operational dashboard and the next-day reconciliation agree. The supplier data-sharing audit, previously a multi-week scramble, was satisfied by exporting Dataplex column-level lineage (raw file → dbt model → certified table → dashboard) and the policy-tag access report showing exactly who could read email/loyalty_id — a half-day exercise. A SELECT * that previously would have scanned a 12 TB orders table now prunes to a single day’s partition and a clustered customer range, and the maximum-bytes-billed guard caught two genuinely runaway exploratory queries before they cost anything. For DR, the raw zone is multi-region and immutable and the platform is entirely Terraform + Flex Templates + LookML-in-Git; a rehearsed recovery — replicate critical BigQuery datasets to a second region, terraform apply, redeploy templates, and reprocess curated from raw — measured an RTO of ~2 hours with a data-layer RPO near zero for the multi-region storage and minutes for the replicated warehouse tables. Net: one governed copy of the data, served fresh to every persona in their own tool, with the lineage and access controls an auditor accepts — operated by a team of fourteen for a bit over $11k/month.

When to use it

Use this architecture when you need a single analytical plane for the enterprise — one governed copy of the data, queryable by SQL analysts, data scientists, and BI tools alike, fresh enough for operational decisions, with a real catalog, lineage, and column-level governance. It is the correct default when you are tired of the warehouse-versus-lake split and the copy-to-query tax it imposes: it scales down to one analytics team on a few terabytes who want BigQuery + Looker + a little governance, and up to a regulated multinational running a data mesh of dozens of domains behind one Dataplex catalog and one VPC-SC perimeter — the diagram is the same; only the dataset count, policy-tag strictness, and reservation size change. The prerequisites are modest: a willingness to land raw and transform forward (so reprocessing is always possible), to govern through BigLake and policy tags rather than raw bucket ACLs, and to define metrics once in a semantic layer.

Trade-offs to accept going in. This is an analytical platform, not an OLTP database — BigQuery is built for scans and aggregations, not single-row lookups at thousands of QPS, and trying to back an application’s transactional reads with it is the classic mismatch (use Spanner/Cloud SQL/Bigtable for that and let it feed the lakehouse). You are accepting a bytes-scanned cost model that punishes un-partitioned tables and careless SELECT * — a discipline you must enforce with guards and culture, or the bill surprises you. And BigLake/policy-tag governance is non-optional for the safety story; if your organization cannot resist handing out raw bucket access, you do not have a lakehouse, you have a lake with a warehouse next to it and the same divergence problem you started with.

Anti-patterns that quietly defeat the design:

Alternatives, in increasing capability and operational cost: (1) BigQuery alone — for a team that lives entirely in SQL on structured data, BigQuery + scheduled queries + a BI tool is a lakehouse-lite and may be all you need; add the rest only when you have open-format data, multiple engines, or governance pressure. (2) This article — BigQuery + BigLake + Cloud Storage + Dataflow + Dataplex + Looker — the governed lakehouse: one copy, many engines, one governance plane, the workhorse default for an enterprise analytical platform. (3) The same shape with open table formats front-and-center — BigQuery’s Iceberg/BigLake tables as the system of record so non-Google engines (Spark, Trino, Snowflake) share the exact same tables, when true multi-engine/open-format interoperability is a hard requirement. (4) Databricks or Snowflake on GCP — a third-party lakehouse/warehouse if you are standardizing a multi-cloud data platform on one vendor; you trade the deep native integration (BigLake, Dataplex, policy tags, BigQuery ML) for cross-cloud portability. (5) A full data mesh — this same architecture, federated, with each domain owning its Dataplex zone, datasets, and data contracts and the platform team providing the paved road — when organizational scale (many autonomous domain teams) outgrows a single central data team. Pick the lowest tier that meets your governance, freshness, and multi-engine requirements; most teams reach for a third-party platform or a full mesh when native BigQuery + BigLake + Dataplex with disciplined partitioning and one semantic layer would have done — and the platform you can actually govern and rehearse beats the one you merely drew.

GCPArchitectureEnterpriseReference Architecture
Need this built for real?

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.

Work with me

Comments

Keep Reading