Data AWS

AWS Enterprise Architecture: Data Lakehouse

A lakehouse is not “a data lake with a SQL engine bolted on.” It is a deliberate architecture where open table formats give your S3 data the transactional guarantees of a warehouse, a single governance plane (Lake Formation) controls who sees which rows and columns, and you pick the cheapest engine that fits each workload — Athena for ad-hoc, Redshift Spectrum for BI joins against curated dimensions, EMR for heavy transforms and ML feature engineering. This article builds that architecture on AWS end to end and treats the hard parts — table format choice, governance propagation, small-file economics, and DR for a catalog that lives in three places — as first-class concerns.

The business scenario

Picture a company that has outgrown its warehouse but cannot afford a second one. This is true at 50 people and at 50,000.

The early-stage version: a Series B fintech runs a single Amazon Redshift cluster. It started as the reporting database; now it ingests clickstream, holds 18 months of transaction history, runs the dbt models, and serves the data-science notebooks. Every new use case means a bigger cluster, and the cluster is 24/7 even though batch jobs run for three hours a night. The CFO sees a line item that grows linearly with the company and asks the question every architect dreads: “Why does storing data we rarely read cost the same as the data we read constantly?”

The large-enterprise version: a retailer with 1,200 stores has a Redshift estate, a Hadoop cluster a previous team built, three Glue jobs nobody owns, and an analytics team that exports CSVs to laptops because getting access through the warehouse takes two weeks. Data is duplicated five times. Nobody can answer “who can see PII” with a straight face during the audit.

Both companies have the same underlying problem and it is not a tooling problem — it is a coupling problem:

The lakehouse decouples all four. One copy of data on S3 in open formats; many engines reading the same files; one governance plane (Lake Formation) that every engine honours; and compute you turn on only when a query runs. Storage drops to S3 prices (and tiers down further when cold), each workload runs on the engine sized for it, and “who can see this column” becomes a single Lake Formation policy that Athena, Redshift Spectrum, and EMR all enforce identically.

The target outcome: store once, govern once, query with whatever is cheapest for the job, and pay for compute by the query — not by the calendar.

Architecture overview

The data path runs left to right through five logical tiers. Imagine the diagram as a wide flow with a governance plane spanning the full width underneath and an observability plane spanning the full width on top — everything in the middle passes through both.

AWS data lakehouse reference architecture: ingestion (DMS, Kinesis/Firehose, Glue/AppFlow) lands raw data in the bronze S3 zone; medallion bronze→silver→gold with Apache Iceberg gives ACID tables on S3; Glue ETL and EMR transform; Athena, Redshift Spectrum and EMR/SageMaker query the same gold tables; a Glue Data Catalog plus Lake Formation governance plane vends scoped credentials to every engine, with a CloudTrail / CloudWatch / Glue Data Quality observability plane on top.

1. Ingestion (left edge). Three classes of source land in a raw landing bucket on S3. Batch files (vendor drops, exports) arrive via AWS DMS for relational CDC or direct S3 uploads. Streaming events (clickstream, app telemetry, IoT) flow through Amazon Kinesis Data Streams into Amazon Data Firehose, which buffers and writes partitioned objects to S3. SaaS and operational data come through AWS Glue connectors or AWS AppFlow. Nothing transforms here — raw is immutable and append-only, the system of record for replay.

2. The lake on S3 (center, the spine). S3 holds three zones following the medallion patternbronze (raw, as-landed), silver (cleaned, conformed, deduplicated), and gold (business-level aggregates and dimensional models). Crucially, silver and gold tables are written as open table formatsApache Iceberg is the default on AWS in 2026 — which means each table directory contains data files plus a metadata layer that gives ACID transactions, snapshot isolation, schema evolution, and time travel directly on S3. This is the line between a “data lake” and a “lakehouse.”

3. Catalog and governance (the plane underneath). The AWS Glue Data Catalog is the single technical metadata store — every table, schema, and partition is registered once and every engine reads it. AWS Lake Formation sits on top of the catalog as the permission layer: it owns the S3 locations, and instead of granting engines raw S3 access, you grant database/table/column/row/cell-level permissions in Lake Formation. Athena, Redshift Spectrum, and EMR all call Lake Formation’s credential-vending API at query time, so a single policy (“analysts cannot see the ssn column; the EU team sees only EU rows”) is enforced identically across every engine.

4. Processing and transformation. AWS Glue (Spark serverless ETL) does the bronze→silver→gold transforms on a schedule or event trigger, writing Iceberg tables and committing atomically. For heavy, long-running, or specialized work — large reprocessing, complex Spark/ML feature pipelines, or jobs needing specific libraries — Amazon EMR (on EC2 with Spot, or EMR Serverless) reads the same S3 + Glue Catalog and writes back the same Iceberg tables. Glue and EMR are not competitors here; they are two compute shapes over one storage and one catalog, chosen per job.

5. Consumption (right edge). Three query engines serve three patterns, all reading the same gold (and silver) tables:

BI tools (Amazon QuickSight, Tableau, Power BI) sit beyond the engines. The key property: the data is never copied into any engine permanently. S3 is the one copy; engines are ephemeral lenses over it, and Lake Formation governs every lens the same way.

Component breakdown

Component Role in the lakehouse Key configuration choices
S3 (3 buckets/zones) Single physical store for bronze/silver/gold; the one copy of all data Separate buckets per zone; partition by event date; S3 Intelligent-Tiering on bronze; lifecycle to Glacier for cold raw; Block Public Access + default SSE-KMS
Apache Iceberg ACID table format over S3 — transactions, time travel, schema evolution, hidden partitioning Default format for silver/gold; copy-on-write for gold (read-heavy), merge-on-read for high-churn silver; schedule OPTIMIZE (compaction) and snapshot expiry
Glue Data Catalog One technical metadata store every engine shares One catalog per environment; databases per domain; Iceberg tables registered natively; crawlers only for bronze schema discovery
Lake Formation The governance plane — fine-grained, engine-agnostic permissions Register S3 locations; hybrid access mode during migration; tag-based access control (LF-Tags) for scale; column masking + row/cell filters; cross-account sharing
AWS Glue ETL Serverless Spark for routine bronze→silver→gold transforms Glue 5.0 (Spark 3.5); job bookmarks for incremental; Auto Scaling workers; Flex execution for non-urgent jobs (~↓ cost)
Amazon EMR Heavy / specialized Spark, big reprocessing, ML feature pipelines EMR on EC2 with Spot for task nodes (60–90% off) or EMR Serverless for spiky jobs; EMRFS + Glue Catalog; runtime role for Lake Formation enforcement
Amazon Athena Serverless ad-hoc SQL, the default exploration engine Athena engine v3 (Trino); workgroups with per-query data-scan limits + cost guardrails; results to a dedicated S3 location; reusable query results cache
Redshift Spectrum Warehouse BI joining hot local dims against S3 gold tables Redshift Serverless or small RA3 cluster; external schema → Glue Catalog; keep only hot dimensions resident; materialized views for hot aggregates
Kinesis + Firehose Streaming ingestion into bronze Firehose dynamic partitioning; buffer to balance freshness vs. small-file count; optional Parquet conversion on write
DMS / AppFlow Relational CDC and SaaS ingestion DMS CDC to bronze, merged into silver Iceberg with MERGE INTO; AppFlow for SaaS sources

Why each is here, briefly:

S3 zones, not one bucket. Three buckets give you blast-radius isolation, distinct lifecycle policies (bronze tiers aggressively, gold rarely), and clean IAM/Lake Formation boundaries. Bronze is immutable and replayable; silver and gold are recomputable from bronze, which is the backbone of your DR story.

Iceberg is the decision that makes it a lakehouse. Plain Parquet on S3 has no transactions: a failed Spark job leaves half-written files and readers see corruption. Iceberg’s metadata layer gives atomic commits (a query sees the table before or after a write, never mid-write), time travel (FOR TIMESTAMP AS OF for audits and “oops” recovery), and schema evolution without rewriting data. On AWS in 2026, Iceberg has first-class support across Glue, Athena (v3), EMR, and Redshift, which is why it’s the default over Hudi/Delta here — one format every engine reads natively.

Glue Catalog is the contract; Lake Formation is the lock. The catalog says what exists (one source of truth for schema). Lake Formation says who may see it — and because every engine vends credentials through Lake Formation rather than reading S3 directly, you cannot accidentally have “Athena can see the SSN column but the BI tool can’t.” The policy lives in one place and travels with the data.

Glue vs. EMR is a workload decision, not a religious one. Glue: serverless, fast to start, ideal for the steady bronze→silver→gold pipeline and teams who don’t want to manage clusters. EMR: when you need Spot economics on long jobs, specific library versions, very large reprocessing, or co-locating ML feature engineering with training. Both write the same Iceberg tables to the same S3 catalog — you can move a job from one to the other without changing the data.

Three query engines because one size never fits. A data scientist running SELECT ... LIMIT 100 ten times an hour should not touch a warehouse cluster — that’s Athena (zero standing cost, pay per scan). A nightly executive dashboard joining a 10-row date dimension against a billion-row fact wants the warehouse’s local-dimension speed — that’s Redshift Spectrum (hot dims resident, facts on S3). A 4-hour model-training run wants EMR/SageMaker reading Parquet directly. Forcing all three onto one engine is exactly the coupling the lakehouse exists to break.

Implementation guidance

Bucket and zone layout. Three buckets with consistent prefixing:

s3://acme-lake-bronze-<acct>/<source>/<table>/ingest_date=YYYY-MM-DD/
s3://acme-lake-silver-<acct>/<domain>/<table>/        # Iceberg-managed layout
s3://acme-lake-gold-<acct>/<domain>/<table>/          # Iceberg-managed layout

Bronze is partitioned by ingest date for cheap pruning and lifecycle. Silver/gold use Iceberg hidden partitioning — you declare PARTITIONED BY (days(event_ts)) and queries filter on event_ts directly without knowing the physical layout, so you can change partitioning later without rewriting queries.

Terraform is the right IaC here (the team standardizes on it). Manage as code:

Iceberg table creation (via Athena/Glue DDL) — note the explicit format and compaction intent:

CREATE TABLE silver.transactions ( ... )
PARTITIONED BY (days(event_ts))
LOCATION 's3://acme-lake-silver-<acct>/payments/transactions/'
TBLPROPERTIES (
  'table_type'='ICEBERG',
  'format'='parquet',
  'write_compression'='zstd',
  'optimize_rewrite_delete_file_threshold'='10'
);

Then schedule OPTIMIZE silver.transactions REWRITE DATA USING BIN_PACK (compaction) and snapshot expiry as a Glue job — this is the small-file fix, covered below.

Incremental upserts from CDC use Iceberg MERGE INTO, which is the lakehouse’s answer to the old “you can’t update a data lake” problem:

MERGE INTO silver.customers t USING bronze_cdc.customers s
ON t.id = s.id
WHEN MATCHED AND s.op='D' THEN DELETE
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...;

Networking. Keep all data-plane traffic on the AWS backbone via VPC Gateway Endpoint for S3 (free) and Interface (PrivateLink) endpoints for Glue, Lake Formation, Athena, KMS, and STS. EMR and Redshift run in private subnets, no public IPs; NAT only for OS/package egress. QuickSight reaches Redshift/Athena through a VPC connection. The result: queryable data never traverses the public internet.

Identity wiring. Human access flows from your IdP (the org standardizes on Entra ID) through IAM Identity Center with SAML/SCIM, mapping IdP groups to permission sets that assume data-access roles. Those roles are then granted Lake Formation permissions (not S3 permissions). So “analysts” in Entra → an Identity Center group → a permission set/role → a Lake Formation grant of SELECT on gold with the ssn column masked. Pipeline identities (Glue/EMR job roles) get their own Lake Formation grants. No human or job role gets direct s3:GetObject on the lake buckets — Lake Formation vends scoped, temporary credentials per query. This is the single most important wiring decision in the build.

Enterprise considerations

Security and Zero Trust. The model is “no implicit S3 access; every read is brokered.” Concretely: (1) Block Public Access on all three buckets, account-wide; (2) SSE-KMS with separate CMKs per zone so you can revoke gold independently and get per-zone audit in CloudTrail; (3) Lake Formation as the only path to data — pipeline and human roles hold Lake Formation grants, not bucket policies; (4) column masking on PII (ssn, card_pan) and row/cell filters for tenancy/region (EU analysts see only region='EU'); (5) all access on PrivateLink; (6) CloudTrail data events on the buckets plus Lake Formation’s own audit log give you a complete “who read which column when” trail. Tag-based access control (LF-Tags) keeps this manageable: tag tables confidentiality=pii once and grant on the tag, so new tables inherit policy automatically instead of needing per-table grants.

Cost optimization — this is where the lakehouse earns its keep, and it’s multi-layered:

Scalability. S3 scales to effectively unlimited objects and throughput; partition-prefix design avoids hotspots. Athena and Glue are serverless and absorb concurrency automatically. EMR and Redshift scale horizontally (and Redshift concurrency-scaling adds transient clusters for BI spikes). The catalog scales to millions of partitions. The real scaling discipline is file hygiene, not capacity: streaming ingestion creates millions of tiny files that wreck query performance, so scheduled Iceberg compaction (OPTIMIZE) and snapshot/orphan-file expiry are mandatory operational jobs, not nice-to-haves.

Reliability and DR (RTO/RPO).

Observability. Glue/EMR push Spark metrics and logs to CloudWatch; Athena and Spectrum query history expose bytes scanned and runtime per query (the cost signal). Critically, add data-quality gatesAWS Glue Data Quality (DQDL rules) on the silver→gold boundary so freshness, null-rate, and referential checks fail the pipeline before bad data reaches dashboards. CloudTrail + Lake Formation audit logs cover access observability. Track three SLOs: pipeline freshness (gold lag), query cost (bytes scanned/day), and DQ pass rate.

Governance. Lake Formation LF-Tags are the scalable model — a taxonomy like domain, confidentiality, retention applied to databases/tables, with grants written against tags so policy is inherited, not hand-maintained per object. This is also the foundation for a data-mesh evolution: each domain owns its silver/gold databases and grants access to others via Lake Formation cross-account sharing, with a central team owning only the tag taxonomy and platform. Pair with a business catalog (e.g. Amazon DataZone) for discovery and access requests on top of the technical Glue catalog.

Reference enterprise example

NorthPeak Retail is a fictional 1,400-store home-goods chain, ~$4.2B revenue, with the classic mess: a Redshift cluster at the limit of its node count, an aging self-managed Hadoop cluster, eight unowned Glue jobs, and an analytics team that exports CSVs because warehouse access takes two weeks. Storage and compute are fused; the annual data-platform bill is ~$2.1M and rising, and the last audit flagged that nobody could prove who could read customer PII.

What they built. Over two quarters they moved to the architecture above:

Decisions worth noting. They chose Iceberg over Delta for native Redshift Spectrum + Athena v3 support across every engine. They kept Redshift (Serverless) only for hot dimensions rather than loading the lake into it — Spectrum does the heavy fact joins on S3. They set Athena workgroup per-query scan caps at 2 TB after one analyst’s SELECT * scanned 40 TB in the pilot. They made nightly OPTIMIZE + snapshot expiry a first-class pipeline after streaming ingestion produced 9M tiny files in week one and Athena latency tripled.

The outcome (12 months).

When to use it

Use this lakehouse when:

Trade-offs and anti-patterns:

Alternatives and how to choose:

Situation Better fit
Single team, pure SQL BI, modest/predictable volume, no ML Redshift-only warehouse — simpler ops, fewer moving parts
Want managed lakehouse platform, less AWS plumbing, multi-cloud Databricks on AWS (Unity Catalog + Delta)
Mostly ad-hoc SQL on S3, governance not yet a requirement Athena + Glue, defer Lake Formation until access control matters
Sub-second operational analytics / serving DynamoDB / Aurora / OpenSearch, not a lakehouse
Streaming-first, real-time materialized views the core need Kinesis + Flink / Managed Service for Apache Flink as the spine

The lakehouse on AWS is the right default when one governed copy of data on S3, queried by the cheapest engine that fits each job, beats the cost and access friction of a single all-purpose warehouse — which, for any organization running ad-hoc, BI, and ML over a growing dataset, is most of the time.

AWSArchitectureEnterpriseReference Architecture
Need this built for real?

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.

Work with me

Comments

Keep Reading