A lakehouse is not “a data lake with a SQL engine bolted on.” It is a deliberate architecture where open table formats give your S3 data the transactional guarantees of a warehouse, a single governance plane (Lake Formation) controls who sees which rows and columns, and you pick the cheapest engine that fits each workload — Athena for ad-hoc, Redshift Spectrum for BI joins against curated dimensions, EMR for heavy transforms and ML feature engineering. This article builds that architecture on AWS end to end and treats the hard parts — table format choice, governance propagation, small-file economics, and DR for a catalog that lives in three places — as first-class concerns.
The business scenario
Picture a company that has outgrown its warehouse but cannot afford a second one. This is true at 50 people and at 50,000.
The early-stage version: a Series B fintech runs a single Amazon Redshift cluster. It started as the reporting database; now it ingests clickstream, holds 18 months of transaction history, runs the dbt models, and serves the data-science notebooks. Every new use case means a bigger cluster, and the cluster is 24/7 even though batch jobs run for three hours a night. The CFO sees a line item that grows linearly with the company and asks the question every architect dreads: “Why does storing data we rarely read cost the same as the data we read constantly?”
The large-enterprise version: a retailer with 1,200 stores has a Redshift estate, a Hadoop cluster a previous team built, three Glue jobs nobody owns, and an analytics team that exports CSVs to laptops because getting access through the warehouse takes two weeks. Data is duplicated five times. Nobody can answer “who can see PII” with a straight face during the audit.
Both companies have the same underlying problem and it is not a tooling problem — it is a coupling problem:
- Storage is coupled to compute. You pay for warehouse storage at warehouse prices, and you scale the two together even though they have nothing to do with each other.
- One engine is forced to do every job. Ad-hoc exploration, scheduled BI, and 4-hour ML transforms have wildly different cost and latency profiles, but they all hit the same cluster.
- Governance is per-engine. Redshift grants, Glue IAM, and laptop CSVs each have their own access model, so the real answer to “who can read this column” is unknowable.
- Data is copied to be used. Every team that needs the data makes a copy, because the original is locked inside an engine.
The lakehouse decouples all four. One copy of data on S3 in open formats; many engines reading the same files; one governance plane (Lake Formation) that every engine honours; and compute you turn on only when a query runs. Storage drops to S3 prices (and tiers down further when cold), each workload runs on the engine sized for it, and “who can see this column” becomes a single Lake Formation policy that Athena, Redshift Spectrum, and EMR all enforce identically.
The target outcome: store once, govern once, query with whatever is cheapest for the job, and pay for compute by the query — not by the calendar.
Architecture overview
The data path runs left to right through five logical tiers. Imagine the diagram as a wide flow with a governance plane spanning the full width underneath and an observability plane spanning the full width on top — everything in the middle passes through both.
1. Ingestion (left edge). Three classes of source land in a raw landing bucket on S3. Batch files (vendor drops, exports) arrive via AWS DMS for relational CDC or direct S3 uploads. Streaming events (clickstream, app telemetry, IoT) flow through Amazon Kinesis Data Streams into Amazon Data Firehose, which buffers and writes partitioned objects to S3. SaaS and operational data come through AWS Glue connectors or AWS AppFlow. Nothing transforms here — raw is immutable and append-only, the system of record for replay.
2. The lake on S3 (center, the spine). S3 holds three zones following the medallion pattern — bronze (raw, as-landed), silver (cleaned, conformed, deduplicated), and gold (business-level aggregates and dimensional models). Crucially, silver and gold tables are written as open table formats — Apache Iceberg is the default on AWS in 2026 — which means each table directory contains data files plus a metadata layer that gives ACID transactions, snapshot isolation, schema evolution, and time travel directly on S3. This is the line between a “data lake” and a “lakehouse.”
3. Catalog and governance (the plane underneath). The AWS Glue Data Catalog is the single technical metadata store — every table, schema, and partition is registered once and every engine reads it. AWS Lake Formation sits on top of the catalog as the permission layer: it owns the S3 locations, and instead of granting engines raw S3 access, you grant database/table/column/row/cell-level permissions in Lake Formation. Athena, Redshift Spectrum, and EMR all call Lake Formation’s credential-vending API at query time, so a single policy (“analysts cannot see the ssn column; the EU team sees only EU rows”) is enforced identically across every engine.
4. Processing and transformation. AWS Glue (Spark serverless ETL) does the bronze→silver→gold transforms on a schedule or event trigger, writing Iceberg tables and committing atomically. For heavy, long-running, or specialized work — large reprocessing, complex Spark/ML feature pipelines, or jobs needing specific libraries — Amazon EMR (on EC2 with Spot, or EMR Serverless) reads the same S3 + Glue Catalog and writes back the same Iceberg tables. Glue and EMR are not competitors here; they are two compute shapes over one storage and one catalog, chosen per job.
5. Consumption (right edge). Three query engines serve three patterns, all reading the same gold (and silver) tables:
- Amazon Athena — serverless, pay-per-TB-scanned SQL for ad-hoc exploration, data-science queries, and anything bursty. Zero standing cost.
- Amazon Redshift Spectrum — the warehouse’s external-table feature: BI tools connect to a small Redshift cluster (or Serverless) that holds hot dimensions locally and joins them against the gold Iceberg tables on S3 via Spectrum, so you get warehouse-grade BI performance without loading the whole lake into the warehouse.
- Amazon EMR / SageMaker — read the lake directly for ML training and feature engineering.
BI tools (Amazon QuickSight, Tableau, Power BI) sit beyond the engines. The key property: the data is never copied into any engine permanently. S3 is the one copy; engines are ephemeral lenses over it, and Lake Formation governs every lens the same way.
Component breakdown
| Component | Role in the lakehouse | Key configuration choices |
|---|---|---|
| S3 (3 buckets/zones) | Single physical store for bronze/silver/gold; the one copy of all data | Separate buckets per zone; partition by event date; S3 Intelligent-Tiering on bronze; lifecycle to Glacier for cold raw; Block Public Access + default SSE-KMS |
| Apache Iceberg | ACID table format over S3 — transactions, time travel, schema evolution, hidden partitioning | Default format for silver/gold; copy-on-write for gold (read-heavy), merge-on-read for high-churn silver; schedule OPTIMIZE (compaction) and snapshot expiry |
| Glue Data Catalog | One technical metadata store every engine shares | One catalog per environment; databases per domain; Iceberg tables registered natively; crawlers only for bronze schema discovery |
| Lake Formation | The governance plane — fine-grained, engine-agnostic permissions | Register S3 locations; hybrid access mode during migration; tag-based access control (LF-Tags) for scale; column masking + row/cell filters; cross-account sharing |
| AWS Glue ETL | Serverless Spark for routine bronze→silver→gold transforms | Glue 5.0 (Spark 3.5); job bookmarks for incremental; Auto Scaling workers; Flex execution for non-urgent jobs (~↓ cost) |
| Amazon EMR | Heavy / specialized Spark, big reprocessing, ML feature pipelines | EMR on EC2 with Spot for task nodes (60–90% off) or EMR Serverless for spiky jobs; EMRFS + Glue Catalog; runtime role for Lake Formation enforcement |
| Amazon Athena | Serverless ad-hoc SQL, the default exploration engine | Athena engine v3 (Trino); workgroups with per-query data-scan limits + cost guardrails; results to a dedicated S3 location; reusable query results cache |
| Redshift Spectrum | Warehouse BI joining hot local dims against S3 gold tables | Redshift Serverless or small RA3 cluster; external schema → Glue Catalog; keep only hot dimensions resident; materialized views for hot aggregates |
| Kinesis + Firehose | Streaming ingestion into bronze | Firehose dynamic partitioning; buffer to balance freshness vs. small-file count; optional Parquet conversion on write |
| DMS / AppFlow | Relational CDC and SaaS ingestion | DMS CDC to bronze, merged into silver Iceberg with MERGE INTO; AppFlow for SaaS sources |
Why each is here, briefly:
S3 zones, not one bucket. Three buckets give you blast-radius isolation, distinct lifecycle policies (bronze tiers aggressively, gold rarely), and clean IAM/Lake Formation boundaries. Bronze is immutable and replayable; silver and gold are recomputable from bronze, which is the backbone of your DR story.
Iceberg is the decision that makes it a lakehouse. Plain Parquet on S3 has no transactions: a failed Spark job leaves half-written files and readers see corruption. Iceberg’s metadata layer gives atomic commits (a query sees the table before or after a write, never mid-write), time travel (FOR TIMESTAMP AS OF for audits and “oops” recovery), and schema evolution without rewriting data. On AWS in 2026, Iceberg has first-class support across Glue, Athena (v3), EMR, and Redshift, which is why it’s the default over Hudi/Delta here — one format every engine reads natively.
Glue Catalog is the contract; Lake Formation is the lock. The catalog says what exists (one source of truth for schema). Lake Formation says who may see it — and because every engine vends credentials through Lake Formation rather than reading S3 directly, you cannot accidentally have “Athena can see the SSN column but the BI tool can’t.” The policy lives in one place and travels with the data.
Glue vs. EMR is a workload decision, not a religious one. Glue: serverless, fast to start, ideal for the steady bronze→silver→gold pipeline and teams who don’t want to manage clusters. EMR: when you need Spot economics on long jobs, specific library versions, very large reprocessing, or co-locating ML feature engineering with training. Both write the same Iceberg tables to the same S3 catalog — you can move a job from one to the other without changing the data.
Three query engines because one size never fits. A data scientist running SELECT ... LIMIT 100 ten times an hour should not touch a warehouse cluster — that’s Athena (zero standing cost, pay per scan). A nightly executive dashboard joining a 10-row date dimension against a billion-row fact wants the warehouse’s local-dimension speed — that’s Redshift Spectrum (hot dims resident, facts on S3). A 4-hour model-training run wants EMR/SageMaker reading Parquet directly. Forcing all three onto one engine is exactly the coupling the lakehouse exists to break.
Implementation guidance
Bucket and zone layout. Three buckets with consistent prefixing:
s3://acme-lake-bronze-<acct>/<source>/<table>/ingest_date=YYYY-MM-DD/
s3://acme-lake-silver-<acct>/<domain>/<table>/ # Iceberg-managed layout
s3://acme-lake-gold-<acct>/<domain>/<table>/ # Iceberg-managed layout
Bronze is partitioned by ingest date for cheap pruning and lifecycle. Silver/gold use Iceberg hidden partitioning — you declare PARTITIONED BY (days(event_ts)) and queries filter on event_ts directly without knowing the physical layout, so you can change partitioning later without rewriting queries.
Terraform is the right IaC here (the team standardizes on it). Manage as code:
- S3 buckets + Block Public Access + SSE-KMS + lifecycle/Intelligent-Tiering (
aws_s3_bucket,aws_s3_bucket_lifecycle_configuration). - Glue databases and the catalog (
aws_glue_catalog_database); register Iceberg tables via Glue jobs or Athena DDL rather than hand-defining columns. - Lake Formation:
aws_lakeformation_resourceto register each S3 location,aws_lakeformation_lf_tagfor the tag taxonomy, andaws_lakeformation_permissions/aws_lakeformation_lf_tag_policyfor grants. Critically, remove the IAM-basedIAMAllowedPrincipalsdefault so Lake Formation actually governs (otherwise IAM still grants broad access and your fine-grained policy is decorative). - Glue jobs (
aws_glue_job), triggers/workflows, EMR (aws_emr_clusteroraws_emrserverless_application), Athena workgroups (aws_athena_workgroup) with per-query scan caps, and Redshift Serverless namespace/workgroup.
Iceberg table creation (via Athena/Glue DDL) — note the explicit format and compaction intent:
CREATE TABLE silver.transactions ( ... )
PARTITIONED BY (days(event_ts))
LOCATION 's3://acme-lake-silver-<acct>/payments/transactions/'
TBLPROPERTIES (
'table_type'='ICEBERG',
'format'='parquet',
'write_compression'='zstd',
'optimize_rewrite_delete_file_threshold'='10'
);
Then schedule OPTIMIZE silver.transactions REWRITE DATA USING BIN_PACK (compaction) and snapshot expiry as a Glue job — this is the small-file fix, covered below.
Incremental upserts from CDC use Iceberg MERGE INTO, which is the lakehouse’s answer to the old “you can’t update a data lake” problem:
MERGE INTO silver.customers t USING bronze_cdc.customers s
ON t.id = s.id
WHEN MATCHED AND s.op='D' THEN DELETE
WHEN MATCHED THEN UPDATE SET ...
WHEN NOT MATCHED THEN INSERT ...;
Networking. Keep all data-plane traffic on the AWS backbone via VPC Gateway Endpoint for S3 (free) and Interface (PrivateLink) endpoints for Glue, Lake Formation, Athena, KMS, and STS. EMR and Redshift run in private subnets, no public IPs; NAT only for OS/package egress. QuickSight reaches Redshift/Athena through a VPC connection. The result: queryable data never traverses the public internet.
Identity wiring. Human access flows from your IdP (the org standardizes on Entra ID) through IAM Identity Center with SAML/SCIM, mapping IdP groups to permission sets that assume data-access roles. Those roles are then granted Lake Formation permissions (not S3 permissions). So “analysts” in Entra → an Identity Center group → a permission set/role → a Lake Formation grant of SELECT on gold with the ssn column masked. Pipeline identities (Glue/EMR job roles) get their own Lake Formation grants. No human or job role gets direct s3:GetObject on the lake buckets — Lake Formation vends scoped, temporary credentials per query. This is the single most important wiring decision in the build.
Enterprise considerations
Security and Zero Trust. The model is “no implicit S3 access; every read is brokered.” Concretely: (1) Block Public Access on all three buckets, account-wide; (2) SSE-KMS with separate CMKs per zone so you can revoke gold independently and get per-zone audit in CloudTrail; (3) Lake Formation as the only path to data — pipeline and human roles hold Lake Formation grants, not bucket policies; (4) column masking on PII (ssn, card_pan) and row/cell filters for tenancy/region (EU analysts see only region='EU'); (5) all access on PrivateLink; (6) CloudTrail data events on the buckets plus Lake Formation’s own audit log give you a complete “who read which column when” trail. Tag-based access control (LF-Tags) keeps this manageable: tag tables confidentiality=pii once and grant on the tag, so new tables inherit policy automatically instead of needing per-table grants.
Cost optimization — this is where the lakehouse earns its keep, and it’s multi-layered:
- Storage tiering: Intelligent-Tiering on bronze auto-moves cold raw to cheaper tiers; lifecycle pushes ancient raw to Glacier. Gold stays in Standard (it’s read constantly). You stop paying warehouse prices for cold data entirely.
- Engine right-sizing: Athena’s pay-per-TB-scanned means ad-hoc cost tracks usage, not uptime — and Parquet + zstd + Iceberg partition pruning cut bytes scanned by 90%+ vs. raw, directly cutting the Athena bill. Workgroup per-query scan limits stop a runaway
SELECT *from costing thousands. - Compute economics: EMR Spot for task nodes saves 60–90% on reprocessing; Glue Flex discounts non-urgent jobs; Redshift runs Serverless or a small cluster holding only hot dimensions rather than the whole lake.
- The structural win: decoupling means you scale storage and compute independently and turn compute off between queries — the line item the CFO hated stops growing with the company.
Scalability. S3 scales to effectively unlimited objects and throughput; partition-prefix design avoids hotspots. Athena and Glue are serverless and absorb concurrency automatically. EMR and Redshift scale horizontally (and Redshift concurrency-scaling adds transient clusters for BI spikes). The catalog scales to millions of partitions. The real scaling discipline is file hygiene, not capacity: streaming ingestion creates millions of tiny files that wreck query performance, so scheduled Iceberg compaction (OPTIMIZE) and snapshot/orphan-file expiry are mandatory operational jobs, not nice-to-haves.
Reliability and DR (RTO/RPO).
- S3 durability is 11 nines; Cross-Region Replication on bronze (the system of record) gives geographic protection. Because silver/gold are recomputable from bronze, you replicate the irreplaceable layer and rebuild the rest.
- Catalog is the subtle DR risk — it lives in Glue/Lake Formation, not S3, so it needs its own protection: export catalog definitions and Lake Formation grants via IaC (Terraform state + scripted exports) so the metadata layer is reproducible in the DR region. A lake with no catalog is just opaque files.
- Targets: with CRR + IaC-reproducible catalog, RPO ≈ 15 min for bronze (replication lag) and RTO of a few hours in a region failure — re-point the catalog, re-run pipelines to rebuild silver/gold from replicated bronze. Iceberg time travel additionally gives near-zero-RTO recovery from logical corruption (bad pipeline run): roll a table back to the prior snapshot in seconds.
Observability. Glue/EMR push Spark metrics and logs to CloudWatch; Athena and Spectrum query history expose bytes scanned and runtime per query (the cost signal). Critically, add data-quality gates — AWS Glue Data Quality (DQDL rules) on the silver→gold boundary so freshness, null-rate, and referential checks fail the pipeline before bad data reaches dashboards. CloudTrail + Lake Formation audit logs cover access observability. Track three SLOs: pipeline freshness (gold lag), query cost (bytes scanned/day), and DQ pass rate.
Governance. Lake Formation LF-Tags are the scalable model — a taxonomy like domain, confidentiality, retention applied to databases/tables, with grants written against tags so policy is inherited, not hand-maintained per object. This is also the foundation for a data-mesh evolution: each domain owns its silver/gold databases and grants access to others via Lake Formation cross-account sharing, with a central team owning only the tag taxonomy and platform. Pair with a business catalog (e.g. Amazon DataZone) for discovery and access requests on top of the technical Glue catalog.
Reference enterprise example
NorthPeak Retail is a fictional 1,400-store home-goods chain, ~$4.2B revenue, with the classic mess: a Redshift cluster at the limit of its node count, an aging self-managed Hadoop cluster, eight unowned Glue jobs, and an analytics team that exports CSVs because warehouse access takes two weeks. Storage and compute are fused; the annual data-platform bill is ~$2.1M and rising, and the last audit flagged that nobody could prove who could read customer PII.
What they built. Over two quarters they moved to the architecture above:
- Ingestion: DMS CDC from the Oracle order system and the Postgres loyalty DB into
bronze; Kinesis + Firehose for store-POS and web clickstream (~2.5B events/day); AppFlow for the Salesforce CRM. Bronze landed ~18 TB/day raw. - Lake: ~3.5 PB total on S3. Bronze on Intelligent-Tiering (≈70% of volume, mostly cold) with Glacier lifecycle past 13 months; silver/gold as Iceberg (≈900 TB), zstd-compressed, partitioned by day.
- Processing: Glue 5.0 ran the steady bronze→silver→gold pipeline (
MERGE INTOfor CDC upserts); EMR Serverless handled the monthly 14-month basket-affinity reprocessing and the ML feature pipeline for demand forecasting. They retired the Hadoop cluster. - Governance: Lake Formation with LF-Tags (
domain,confidentiality,retention).customer.emailandpayment.card_pancolumn-masked for analysts; a row filter restricted the EU e-commerce subsidiary’s team to EU customers. All eight legacy Glue jobs’ direct S3 access was removed and re-granted through Lake Formation. - Consumption: Athena for the 140-person analytics/DS org (ad-hoc, no more CSV exports); Redshift Serverless + Spectrum for the ~40 executive QuickSight dashboards (hot date/store/product dimensions resident, billion-row sales facts joined from S3 gold); SageMaker on the gold feature tables for forecasting.
Decisions worth noting. They chose Iceberg over Delta for native Redshift Spectrum + Athena v3 support across every engine. They kept Redshift (Serverless) only for hot dimensions rather than loading the lake into it — Spectrum does the heavy fact joins on S3. They set Athena workgroup per-query scan caps at 2 TB after one analyst’s SELECT * scanned 40 TB in the pilot. They made nightly OPTIMIZE + snapshot expiry a first-class pipeline after streaming ingestion produced 9M tiny files in week one and Athena latency tripled.
The outcome (12 months).
- Platform spend fell from ~$2.1M to ~$1.15M/yr (~45%) — almost entirely from decoupling cold storage from warehouse pricing and turning compute off between jobs.
- Executive dashboard latency dropped materially: Spectrum on compacted, partition-pruned Iceberg gold beat the old over-subscribed cluster.
- The PII audit finding closed: one Lake Formation report now answers “who can read
card_pan” across every engine. - Analyst time-to-data went from ~2 weeks to same-day via Identity Center + LF-Tag grants — and the CSV-on-laptop shadow IT disappeared.
- DR: bronze CRR to a second region + IaC-reproducible catalog gave a tested RPO ≈ 15 min / RTO ≈ 4 h; a corrupted forecasting run was recovered in minutes via Iceberg time travel instead of a half-day rebuild.
When to use it
Use this lakehouse when:
- You have multiple consumption patterns (ad-hoc + BI + ML) over the same data and one engine is being stretched to serve all of them.
- Storage is growing faster than query demand and you’re paying warehouse prices to keep cold data online.
- You need fine-grained, engine-agnostic governance (column/row/cell) provable in an audit — Lake Formation is the headline reason to choose this over a plain lake.
- You want open formats (Iceberg/Parquet) to avoid engine lock-in and let teams pick the right compute per job.
Trade-offs and anti-patterns:
- Operational surface is larger than a single warehouse. You now own file hygiene (compaction, snapshot/orphan expiry), catalog DR, and a multi-engine cost story. Anti-pattern: standing up the lake and skipping compaction — streaming ingestion will bury you in small files and query performance collapses. These jobs are mandatory.
- Lake Formation is easy to leave decorative. If you forget to remove
IAMAllowedPrincipals, IAM still grants broad S3 access and your fine-grained policy does nothing. Anti-pattern: configuring column masks while a job role still has direct bucket access — close every S3 back door or the governance is theatre. - Don’t put sub-second, high-concurrency operational lookups on it. Athena/Spectrum are analytical (seconds-to-minutes). Point-read, low-latency app traffic belongs on DynamoDB/Aurora; the lakehouse is for analytics, not your application’s hot path.
- Don’t force everything into Iceberg on day one. Bronze stays raw; only curate silver/gold. Converting throwaway intermediate data to managed tables adds compaction cost for no benefit.
Alternatives and how to choose:
| Situation | Better fit |
|---|---|
| Single team, pure SQL BI, modest/predictable volume, no ML | Redshift-only warehouse — simpler ops, fewer moving parts |
| Want managed lakehouse platform, less AWS plumbing, multi-cloud | Databricks on AWS (Unity Catalog + Delta) |
| Mostly ad-hoc SQL on S3, governance not yet a requirement | Athena + Glue, defer Lake Formation until access control matters |
| Sub-second operational analytics / serving | DynamoDB / Aurora / OpenSearch, not a lakehouse |
| Streaming-first, real-time materialized views the core need | Kinesis + Flink / Managed Service for Apache Flink as the spine |
The lakehouse on AWS is the right default when one governed copy of data on S3, queried by the cheapest engine that fits each job, beats the cost and access friction of a single all-purpose warehouse — which, for any organization running ad-hoc, BI, and ML over a growing dataset, is most of the time.