Big data processing fails in a way that has almost nothing to do with the data and everything to do with the compute. A team stands up a long-running EMR cluster sized for the year-end reprocessing job, runs the nightly 40-minute pipeline on it, and then pays for 240 idle CPU-hours a day. Or the opposite: they pick a cluster that was right for the demo, the data grows 3x, one Spark stage spills to disk because a single skewed key landed on one executor, and a job that should take 20 minutes runs for six hours and then OOM-kills itself at 95%. Neither problem is a Spark problem. They are the absence of a processing architecture — a deliberate separation between durable storage that never moves, ephemeral compute that you turn on for the duration of a job and tear down after, a serverless tier for the spiky and the small, and a catalog-plus-governance plane so every engine sees the same tables and the same access rules. This article builds that platform end to end on Amazon EMR, S3, AWS Glue, Amazon Athena, AWS Lake Formation, and Apache Spark, and treats the hard parts — cluster topology, Spot economics, data skew, job orchestration, and “which of the three EMRs do I actually run” — as first-class concerns.
The single most important idea is this: storage is permanent and compute is disposable. Your data lives on S3 forever; the Spark cluster that transforms it should exist only while a job runs and cost nothing the rest of the day. Get that decoupling right and big-data processing becomes an exercise in picking the cheapest, correctly-sized, transient compute for each job — and the multi-million-dollar Hadoop cluster that ran 24/7 whether or not anyone had work for it becomes a line item you no longer pay.
This is the processing-and-compute companion to the AWS Data Lakehouse reference (which centres on storage, open table formats, and store-once-query-many) and the Real-Time Streaming reference (which centres on sub-second event handling). Here the spine is the batch Spark pipeline and the question is how the heavy transformation actually runs.
The business scenario
Picture an organisation that has more data than its current tools can chew through in the time it has. The shape is identical at 40 engineers and at 4,000.
The early-stage version. A 60-person ad-tech startup lands ~4 TB/day of impression and click logs into S3. They began with a single r5.4xlarge EC2 box running a Python script and a cron job. It worked at 200 GB/day. At 4 TB/day the nightly aggregation now takes 9 hours, finishes after the morning standup, and falls over roughly twice a week when a partner sends a malformed batch. Every attempt to “just make the box bigger” buys a few weeks before the data outgrows it again. The founding engineer knows the answer is “distributed processing” but has watched friends drown in a self-managed Hadoop cluster and does not want to hire a platform team to babysit YARN.
The large-enterprise version. A telco with 60M subscribers runs a 400-node on-premises Hadoop cluster that a team built in 2017. It processes call-detail records, network telemetry, and billing events — roughly 600 TB/month landing, 12 PB at rest. The cluster runs at 30% average utilisation because it is sized for the monthly fraud-detection reprocessing peak; the other 29 days it is mostly idle but fully powered and fully staffed. Hardware refresh is a capital project with an 18-month lead time. Three different teams want more capacity and there is none to give because adding nodes means buying racks. The CFO sees a fixed, enormous, growing cost and the data team sees a queue of jobs they can’t schedule.
Both organisations have the same structural problem, and — exactly as with the lakehouse — it is a coupling problem, but on a different axis:
- Compute is provisioned for the peak and paid for the average. The cluster is sized for the heaviest job and runs 24/7, so you pay for the year-end reprocessing capacity 365 days a year.
- One cluster runs every workload. A 90-second ad-hoc query, a 40-minute nightly ETL, and a 6-hour annual reprocessing share the same fixed resources and contend for them.
- Scaling is a procurement event, not an API call. On-prem you buy racks; even a fixed cloud cluster means resizing and re-balancing instead of “spin up exactly what this job needs.”
- Failure of one job degrades the shared platform. A runaway job starves everyone else on the cluster, because they’re all on the same YARN scheduler.
The processing platform in this article decouples all four. One durable copy of data on S3. Compute that is born for a job and dies when it finishes — a transient EMR cluster, an EMR Serverless application that scales to zero, or a Glue serverless job. Each workload on the compute shape sized for it. And per-job isolation so a poison batch in the fraud pipeline can’t touch the billing pipeline. You stop paying for peak capacity at idle, scaling becomes an API call, and the heaviest job in the year no longer dictates what you pay in July.
The promise to the business: transform any volume of data, on compute you turn on for exactly as long as the job runs, sized per job, and pay by the second of actual work — not by the calendar.
Architecture overview
The platform is a durable S3 data lake in the centre, surrounded by disposable compute engines that all read and write the same data through one shared catalog and one governance plane. Read it as five tiers, with two cross-cutting planes (catalog/governance underneath, orchestration/observability on top) spanning the whole width.
Tier 1 — Ingestion into the raw lake (S3). Source data lands, untransformed, in a raw zone on S3. Bulk and relational sources arrive via AWS DMS (CDC from operational databases) or direct S3 uploads/partner drops; high-volume event data arrives through Kinesis Data Firehose writing partitioned objects; SaaS sources come through AWS Glue connectors or AppFlow. Nothing is processed here. Raw is immutable, append-only, partitioned by ingest date, and is the system of record for replay — every derived dataset can be rebuilt from it.
Tier 2 — The lake on S3 (the spine). S3 holds the data across curated zones — raw (as-landed), processed/cleaned (conformed, deduplicated, columnar), and curated/aggregated (business-level marts and feature tables). Curated data is written as Parquet (columnar, compressed with zstd/Snappy) and, where transactional semantics matter, as Apache Iceberg tables. This is the one physical copy of all data; every compute engine is a temporary lens over it. (The storage design — zone layout, table formats, small-file economics — is the subject of the lakehouse reference; here we focus on the engines that read and write it.)
Tier 3 — The processing fleet: where the heavy lifting actually happens. This is the centre of gravity of this architecture and where it differs from every other data reference. Three Spark-capable compute shapes read the same S3 + catalog and write back the same tables, each chosen per workload:
- Amazon EMR on EC2 — transient clusters with instance fleets and Spot. The workhorse for large, scheduled, or long-running batch: the nightly multi-terabyte ETL, the monthly multi-month reprocessing, big Spark/ML feature pipelines. A cluster is launched for the job, runs Spark on a mix of On-Demand core nodes and Spot task nodes (60–90% cheaper), auto-scales to the data, and terminates on completion. You pay for minutes of actual compute, not a standing cluster.
- Amazon EMR Serverless — Spark with no cluster to manage. For spiky, intermittent, or unpredictable jobs where you don’t want to think about instance types at all. It provisions workers on demand, scales to zero between runs, and you pay per vCPU-second and GB-second of the job. Ideal for the ad-tech startup and for enterprise teams who want Spark without capacity planning.
- AWS Glue (serverless Spark ETL) — the managed pipeline engine. For the routine, declarative
raw → cleaned → curatedtransforms on a schedule or event. Glue is Spark under the hood but adds job bookmarks (incremental processing), built-in connectors, and tight orchestration. It’s the default for the steady-state pipeline; EMR is what you reach for when you need Spot economics, specific runtimes, or very large reprocessing.
A fourth shape, EMR on EKS, runs Spark as pods on a shared Amazon EKS cluster — the right answer when an org has standardised on Kubernetes and wants Spark to share capacity, tooling, and Karpenter-driven autoscaling with its other containerised workloads. All four write the same Parquet/Iceberg tables to the same Glue catalog; the engine is a per-workload choice, not a platform-wide religion.
Tier 4 — Catalog and governance (the plane underneath everything). The AWS Glue Data Catalog is the single technical metadata store — every table, schema, and partition registered once, read by EMR, Glue, Athena, and Redshift alike. AWS Lake Formation sits on top as the permission layer: it owns the S3 locations and grants database/table/column/row-level access rather than raw S3 access. Every engine — including an EMR cluster running Spark — vends credentials through Lake Formation at runtime, so “analysts cannot see the msisdn column; the EU team sees only EU rows” is enforced identically across Spark, Glue, and Athena.
Tier 5 — Serving and consumption (the right edge). Processed and curated tables are served to three patterns: Amazon Athena (serverless, pay-per-TB-scanned SQL for ad-hoc exploration and data-science queries — zero standing cost), Amazon Redshift / Spectrum (warehouse-grade BI over curated marts joined to hot dimensions), and Amazon SageMaker / EMR (reading curated feature tables directly for ML training). BI tools — QuickSight, Tableau, Power BI — sit beyond the engines. The data is never permanently copied into any engine; S3 is the one copy.
The two cross-cutting planes. Orchestration sits on top: AWS Step Functions and Amazon Managed Workflows for Apache Airflow (MWAA) sequence the pipeline — land raw, run Glue/EMR transforms, validate data quality, publish curated, then signal downstream — with retries, dependencies, and backfills. Observability spans everything: EMR/Glue push Spark metrics and logs to CloudWatch, the Spark History Server and EMR’s managed UIs expose stage-level execution, and Glue Data Quality gates the pipeline before bad data reaches consumers.
The diagram, in words. Picture a wide canvas. On the left, a column of source systems (databases, partner feeds, event streams, SaaS) with arrows through DMS / Firehose / Glue connectors into a single large S3 “raw” bucket. In the centre, S3 is drawn as the spine — three stacked buckets labelled raw → processed → curated, the unmistakable middle of the picture. Sitting above and around the S3 spine, four compute boxes all draw bidirectional arrows to S3: EMR on EC2 (drawn with a small On-Demand core block and a larger dashed “Spot task nodes” block to signal transience), EMR Serverless (a cloud with no fixed shape), AWS Glue (a gear), and EMR on EKS (Spark pods inside an EKS hexagon). Crucially, each compute box is drawn with a dashed border to convey ephemeral — they appear for a job and vanish. Underneath the entire spine runs a full-width band: Glue Data Catalog (one source of schema) with Lake Formation layered on it (the lock), and every compute box’s arrow to S3 passes through this band. On the right, Athena / Redshift / SageMaker read curated S3, feeding QuickSight / BI at the far edge. Across the top, an orchestration ribbon — Step Functions / MWAA (Airflow) — with control arrows reaching down into each compute box, and a parallel CloudWatch / Spark History / Glue Data Quality observability ribbon. Cross-cutting boxes at the very bottom — IAM Identity Center, KMS, VPC, CloudTrail — touch every tier. The defining visual: a permanent storage spine, ringed by disposable compute that is born and dies per job, all governed through one plane.
Component breakdown
| Component | Role in the platform | Key configuration choices |
|---|---|---|
| Amazon S3 (zoned) | The permanent spine — one physical copy of raw/processed/curated data; every engine’s read/write target | Separate buckets/prefixes per zone; partition by date; S3 Intelligent-Tiering on raw; lifecycle to Glacier for cold; Block Public Access + SSE-KMS; columnar Parquet for processed/curated |
| EMR on EC2 (transient) | Heavy/scheduled Spark batch — big ETL, reprocessing, ML feature jobs; launched per job, terminated after | Instance fleets with diversified Spot for task nodes; On-Demand for primary/core; managed scaling; --auto-terminate; EMRFS + Glue Catalog; runtime roles for Lake Formation |
| EMR Serverless | Spark with no cluster to manage — spiky, intermittent, small-to-medium jobs; scales to zero | Pre-initialised capacity for latency-sensitive jobs vs. cold-start for cheap; per-job application; vCPU/GB-second billing; auto-stop idle apps |
| AWS Glue ETL | Serverless Spark for routine raw→processed→curated transforms and orchestration glue |
Glue 5.0 (Spark 3.5); job bookmarks for incremental; Auto Scaling workers; Flex execution for non-urgent jobs (cheaper); Glue connectors for sources |
| EMR on EKS | Spark-as-pods on a shared EKS cluster — for K8s-standardised orgs | Virtual clusters per team; Karpenter + Spot node pools; pod templates; shares tooling/observability with other containerised workloads |
| Apache Spark | The processing engine inside EMR/Glue/EKS — the distributed compute itself | Tune partitions/shuffle; Adaptive Query Execution (AQE) for skew + dynamic coalescing; broadcast joins for small dims; cache hot intermediates; EMR’s optimised Spark runtime |
| Glue Data Catalog | One technical metadata store every engine shares | One catalog per environment; databases per domain; crawlers for schema discovery on raw; tables registered natively for curated |
| Lake Formation | Governance plane — fine-grained, engine-agnostic permissions enforced even inside Spark | Register S3 locations; remove IAMAllowedPrincipals; LF-Tags for scale; column masking + row filters; EMR runtime role enforcement |
| Amazon Athena | Serverless ad-hoc SQL over processed/curated — the default exploration engine | Engine v3 (Trino); workgroups with per-query scan caps; results to dedicated S3; query-result reuse cache |
| Step Functions / MWAA | Orchestration — sequence, retry, branch, backfill the pipeline | Step Functions for AWS-native, event-driven flows; MWAA (Airflow) for complex DAGs/cross-system dependencies; both trigger EMR/Glue and gate on data quality |
| Glue Data Quality | Data-quality gates between zones, before bad data reaches consumers | DQDL rules on processed→curated; fail-fast on freshness/null-rate/referential checks; quality results to CloudWatch/EventBridge |
Why each is here, and the choices that matter:
S3 is the spine, and it never moves. Every other component in the diagram is disposable; S3 is the one thing that persists. Because raw is immutable and replayable, and processed/curated are recomputable from raw, S3 is simultaneously your storage, your backup substrate, and your DR foundation. The single most consequential storage choice for processing performance is columnar Parquet with compression and sensible partitioning: a Spark job that reads partition-pruned, predicate-pushed Parquet touches a fraction of the bytes of one reading raw JSON, which is the difference between a 20-minute job and a 3-hour one.
The three-plus-one compute shapes are the heart of the architecture — and choosing between them is the skill. This table is the decision most teams get wrong:
| If the workload is… | Run it on… | Because… |
|---|---|---|
| Large, scheduled, predictable (nightly multi-TB ETL); long reprocessing | EMR on EC2, transient + Spot | Best $/compute via Spot task nodes and instance-fleet diversification; cluster lives only for the job |
| Spiky, intermittent, unpredictable, or you don’t want to size instances | EMR Serverless | Scales to zero between runs; no capacity planning; pay per vCPU/GB-second of actual work |
| Routine declarative ETL on a schedule/event; incremental loads | AWS Glue | Job bookmarks, connectors, fastest to author; serverless; Flex for cheap non-urgent runs |
| Ad-hoc SQL exploration; “what’s in this table?” | Athena | Zero standing cost; pay per TB scanned; no Spark to manage at all |
| Your org runs everything on Kubernetes already | EMR on EKS | Spark shares capacity, Karpenter autoscaling, and tooling with other K8s workloads |
The anti-pattern is choosing one shape for everything: running ad-hoc exploration on a giant standing EMR cluster (you’ll pay for idle), or forcing a steady nightly ETL onto EMR Serverless when a transient Spot cluster would be 70% cheaper, or hand-managing EMR for a job Glue would author in an afternoon.
Transient EMR + instance fleets + Spot is where the money is saved. A transient cluster launches, runs, and self-terminates, so there is no idle cost. Instance fleets let you specify a target capacity and a menu of acceptable instance types across multiple AZs; EMR fills it from whatever Spot capacity is cheapest and available, dramatically reducing the chance that a Spot shortage stalls your job. The rule that keeps this reliable: primary and core nodes on On-Demand (they hold HDFS/shuffle data and the application master — losing them kills the job); task nodes on Spot (they only run compute — losing one just reruns its tasks). Get that split right and you capture 60–90% Spot savings on the bulk of the cluster without risking job completion.
Spark is the engine, and AQE is the feature that earns its keep. The classic big-data failure — one stage runs for hours while 199 of 200 tasks finished in seconds — is data skew: a single hot key (one giant advertiser, one heavy subscriber) lands all its rows on one executor. Spark’s Adaptive Query Execution detects skewed partitions at runtime and splits them, dynamically coalesces tiny shuffle partitions, and switches join strategies based on actual sizes — turning a class of six-hour-then-OOM jobs into ones that just finish. Combined with broadcast joins for small dimensions and right-sized shuffle partitions, AQE is the single highest-leverage Spark setting on this platform.
Glue Catalog is the contract; Lake Formation is the lock — and the subtlety is that it works inside Spark. On a self-managed Hadoop cluster, once a job has the cluster’s IAM role it can read any S3 it can reach — governance is coarse and per-cluster. Here, an EMR cluster runs with a runtime role, and Lake Formation vends scoped, temporary credentials per query even to Spark, so the same column mask that hides msisdn in Athena also hides it from a Spark SELECT on EMR. Policy lives in one place and is enforced everywhere, including the heavy-compute tier.
Orchestration is not optional at enterprise scale. A real pipeline is a DAG: land raw, dedupe, conform, join reference data, aggregate, validate, publish, notify. Step Functions expresses this as an AWS-native state machine (great for event-driven, EMR/Glue-centric flows); MWAA (managed Airflow) is the choice when you have complex cross-system dependencies, hundreds of DAGs, or a team already fluent in Airflow. Both give you retries, backfills, and the ability to rerun one failed step rather than the whole pipeline — and both gate on Glue Data Quality so a failed freshness or null-rate check stops the pipeline before bad data reaches a dashboard.
Implementation guidance
Bucket and zone layout. Consistent prefixing across zones, partitioned for pruning:
s3://acme-bd-raw-<acct>/<source>/<table>/ingest_date=YYYY-MM-DD/
s3://acme-bd-processed-<acct>/<domain>/<table>/dt=YYYY-MM-DD/ # Parquet
s3://acme-bd-curated-<acct>/<domain>/<table>/ # Parquet/Iceberg
Raw partitions by ingest date for cheap lifecycle and replay; processed/curated partition by the business date queries actually filter on. The processing rule of thumb: aim for 128 MB–1 GB Parquet files — too small and Spark wastes time on task overhead and S3 listing; too large and you lose parallelism.
Terraform is the right IaC here (the team standardises on it). Manage as code:
- S3 buckets + Block Public Access + SSE-KMS + lifecycle/Intelligent-Tiering (
aws_s3_bucket,aws_s3_bucket_lifecycle_configuration). - Glue databases, catalog, and crawlers (
aws_glue_catalog_database,aws_glue_crawler); Glue jobs and triggers (aws_glue_job,aws_glue_trigger) with bookmarks enabled. - EMR: for persistent infra,
aws_emr_clusteror — better for transient jobs — define the cluster spec in the Step Functions / MWAA task so it’s created and torn down per run rather than living in Terraform state. For EMR Serverless,aws_emrserverless_application. For EMR on EKS,aws_emrcontainers_virtual_clusterover an existing EKS cluster. - Lake Formation:
aws_lakeformation_resourceto register each S3 location,aws_lakeformation_lf_tagfor the taxonomy,aws_lakeformation_permissions/aws_lakeformation_lf_tag_policyfor grants. Critically, remove theIAMAllowedPrincipalsdefault or IAM still grants broad access and your fine-grained policy is decorative. - Orchestration:
aws_sfn_state_machinefor Step Functions oraws_mwaa_environmentfor managed Airflow; Athena workgroups (aws_athena_workgroup) with per-query scan caps.
Transient EMR cluster spec — the choices that matter (expressed conceptually; this lives in the orchestration step, not standing infra):
{
"ReleaseLabel": "emr-7.5.0", // Spark 3.5, optimised runtime
"Applications": ["Spark"],
"AutoTerminationPolicy": { "IdleTimeout": 600 }, // die if idle 10 min
"ManagedScalingPolicy": { "MinCapacity": 2, "MaxCapacity": 64 },
"Instances": {
"InstanceFleets": [
{ "InstanceFleetType": "MASTER", "TargetOnDemandCapacity": 1,
"InstanceTypeConfigs": [ { "InstanceType": "m6g.xlarge" } ] },
{ "InstanceFleetType": "CORE", "TargetOnDemandCapacity": 2, // On-Demand: holds data
"InstanceTypeConfigs": [ { "InstanceType": "r6g.2xlarge" } ] },
{ "InstanceFleetType": "TASK", "TargetSpotCapacity": 40, // Spot: pure compute
"InstanceTypeConfigs": [ // diversified menu
{ "InstanceType": "r6g.4xlarge" }, { "InstanceType": "r5.4xlarge" },
{ "InstanceType": "r6i.4xlarge" }, { "InstanceType": "m6g.8xlarge" }
],
"LaunchSpecifications": { "SpotSpecification": {
"AllocationStrategy": "price-capacity-optimized" } } } // best Spot strategy
]
},
"Configurations": [ { "Classification": "spark-defaults", "Properties": {
"spark.sql.adaptive.enabled": "true",
"spark.sql.adaptive.skewJoin.enabled": "true", // the skew killer
"spark.sql.adaptive.coalescePartitions.enabled": "true",
"spark.dynamicAllocation.enabled": "true"
} } ]
}
The headline decisions: price-capacity-optimized Spot allocation (balances cheapest with least-likely-to-be-interrupted), On-Demand core / Spot task split, managed scaling so the cluster grows to the data and shrinks after, auto-termination so it never idles, and AQE on so skew and small-partition problems resolve at runtime.
EMR Serverless removes all of the above for spiky jobs — you submit a Spark job to an application and AWS handles workers:
aws emr-serverless start-job-run \
--application-id <app> --execution-role-arn <runtime-role> \
--job-driver '{ "sparkSubmit": {
"entryPoint": "s3://acme-bd-code/jobs/daily_rollup.py",
"sparkSubmitParameters": "--conf spark.sql.adaptive.enabled=true" } }'
Use pre-initialised capacity when a job is latency-sensitive (workers stay warm); accept cold-start when it’s a cheap background job.
Networking. Keep all data-plane traffic on the AWS backbone via a VPC Gateway Endpoint for S3 (free) and Interface (PrivateLink) endpoints for Glue, Lake Formation, Athena, KMS, STS, and the EMR APIs. EMR clusters, EMR Serverless, and EMR-on-EKS nodes run in private subnets with no public IPs; a NAT gateway handles only OS/package egress. The result: terabytes of shuffle and S3 I/O never traverse the public internet, and you avoid per-GB internet data charges on the heaviest traffic in the system.
Identity wiring. Human access flows from the IdP (the org standardises on Entra ID) through IAM Identity Center (SAML/SCIM) → permission sets → data-access roles, which are then granted Lake Formation permissions (not S3 permissions). Pipeline identities are scoped per engine: each Glue job, each EMR cluster (via its runtime role), and each EMR Serverless application gets its own role with only the Lake Formation grants it needs. No human or job role gets direct s3:GetObject on the lake buckets — Lake Formation vends scoped, temporary credentials per query, even into Spark. This is the wiring decision that makes governance real rather than theatrical, and it’s the one that most distinguishes this from a legacy Hadoop cluster where the cluster role could read everything.
Enterprise considerations
Security and Zero Trust. The model is “no implicit S3 access; every read — including from Spark — is brokered.” Concretely: (1) Block Public Access on every bucket, account-wide; (2) SSE-KMS with separate CMKs per zone for independent revocation and per-zone CloudTrail audit; (3) Lake Formation as the only path to data, so even an EMR cluster’s runtime role holds Lake Formation grants, not bucket policies; (4) column masking on PII (msisdn, imsi, card_pan) and row filters for tenancy/region, enforced in Spark, Glue, and Athena alike; (5) all traffic on PrivateLink; (6) EMR security configurations add in-transit and at-rest encryption for the cluster’s own intermediate data (local disks, shuffle); (7) CloudTrail data events + Lake Formation’s audit log give a complete “who processed/read which column when” trail. LF-Tags keep this scalable — tag a table confidentiality=pii once and grant on the tag, so new tables inherit policy automatically.
Cost optimization — this is the platform’s reason to exist, and it’s layered.
- The structural win (transience): transient EMR and scale-to-zero EMR Serverless mean you pay for compute only while a job runs. The telco that ran a 400-node cluster at 30% utilisation 24/7 now pays for ~30% of that capacity, ~30% of the day — a step-change, not a tuning gain.
- Spot on the bulk of the cluster: diversified Spot task fleets save 60–90% on the compute majority of every transient cluster, with On-Demand core nodes protecting completion.
- Right-sized compute per job: the spiky job goes to Serverless (no idle), the routine ETL to Glue Flex (cheaper non-urgent execution), the ad-hoc query to Athena (no cluster at all). Each lands on its cheapest correct shape.
- Storage tiering and scan reduction: Intelligent-Tiering on raw, Glacier lifecycle on cold; Parquet + compression + partition pruning cut both Spark read time and the Athena pay-per-TB bill by 90%+ versus raw formats. Athena workgroup per-query scan caps stop a runaway
SELECT *from costing thousands. - Graviton everywhere: ARM-based Graviton instance types (
m6g/r6g) for EMR deliver materially better price-performance than x86 for Spark — often 20%+ — and Spark runs on them unchanged.
Scalability. S3 absorbs effectively unlimited objects and throughput; partition-prefix design avoids hotspots. EMR managed scaling grows a cluster mid-job as a stage demands more executors and shrinks it after; EMR Serverless and Glue absorb concurrency automatically; Athena is serverless. The real scaling discipline is not capacity — it’s data engineering: keeping Parquet file sizes in the sweet spot (compaction jobs to fix small files), partitioning on the columns queries filter, and handling skew with AQE so adding nodes actually helps instead of one hot key bottlenecking on one executor regardless of cluster size. Throwing hardware at a skewed job is the most expensive way to not fix it.
Reliability and DR (RTO/RPO).
- Job-level resilience: transient clusters tolerate Spot interruptions because task nodes are disposable (their tasks rerun) and core/AM nodes are On-Demand. Orchestration retries individual failed steps, not whole pipelines, and idempotent, partition-overwrite writes mean a rerun produces the same result rather than duplicates.
- Data durability and DR: S3 is 11 nines; Cross-Region Replication on the raw zone (the system of record) gives geographic protection. Because processed/curated are recomputable from raw, you replicate the irreplaceable layer and rebuild the rest by re-running pipelines in the DR region.
- The catalog is the subtle DR risk: it lives in Glue/Lake Formation, not S3, so export catalog definitions and Lake Formation grants via IaC so the metadata layer is reproducible in DR — a lake with no catalog is opaque files.
- Targets: with CRR + IaC-reproducible catalog + code-as-pipelines, a realistic posture is RPO ≈ 15 min for raw (replication lag) and RTO of a few hours in a region failure — re-point the catalog, relaunch transient clusters, re-run pipelines to rebuild curated from replicated raw. There is no standing cluster to fail over because there is no standing cluster.
Observability. EMR and Glue emit Spark metrics and driver/executor logs to CloudWatch; the Spark History Server (and EMR’s managed Spark UI / Persistent UIs) expose stage-level execution — the single most important diagnostic for “why is this job slow,” surfacing spills, skew, and shuffle volume. Glue Data Quality (DQDL) gates the processed→curated boundary so freshness/null-rate/referential failures stop the pipeline and fire EventBridge alerts before bad data reaches dashboards. Track four SLOs: pipeline freshness (curated lag), job cost (compute-hours and bytes scanned/day), data-quality pass rate, and Spot interruption rate (rising interruptions are a signal to broaden the instance-fleet menu).
Governance. Lake Formation LF-Tags are the scalable model — a taxonomy (domain, confidentiality, retention) applied to databases/tables, grants written against tags so policy is inherited not hand-maintained. This is also the foundation for a data-mesh evolution: each domain owns its processed/curated databases and the jobs that produce them, sharing across domains via Lake Formation cross-account grants, with a central platform team owning only the tag taxonomy, the orchestration substrate, and the EMR/Glue golden patterns. Pair with Amazon DataZone for business-level discovery and access requests on top of the technical Glue catalog.
Reference enterprise example
Meridian Telecom is a fictional mobile carrier with 62M subscribers, running the legacy mess described at the top: a 400-node on-prem Hadoop cluster built in 2017, ~600 TB/month landing (call-detail records, RAN telemetry, billing events, app analytics), 12 PB at rest, ~30% average utilisation because it’s sized for the monthly fraud-reprocessing peak. The data-platform run-rate (hardware amortisation, datacentre, ops staff, Hadoop support) is ~$6.4M/year, three teams are queued for capacity that doesn’t exist, and the last audit flagged that the cluster role could read subscriber imsi/msisdn with no fine-grained control.
What they built. Over three quarters they migrated to the architecture above:
- Ingestion: DMS CDC from the Oracle billing system and the Postgres CRM into
raw; Firehose for the ~9B/day network-event and app-analytics records; partner mediation files dropped directly to S3. Raw landed ~20 TB/day, partitioned by ingest date, on Intelligent-Tiering with Glacier lifecycle past 18 months. - Lake: ~12 PB migrated to S3 over the period; processed/curated written as Parquet (zstd), ~1.4 PB, partitioned by date and region.
- Processing — the heart of it:
- The nightly CDR conformance + dedupe + enrichment pipeline (the old 7-hour job) ran on a transient EMR-on-EC2 cluster:
m6gprimary,r6g.2xlargeOn-Demand core, and a diversified Spot task fleet (~50r6g/r6i/r5nodes,price-capacity-optimized) that managed-scaled to the night’s volume and self-terminated by 04:30. Spark AQE eliminated the skew from a handful of high-volume corporate accounts that used to OOM the job. - The monthly 14-month fraud-detection reprocessing ran on a larger transient cluster for ~5 hours once a month — capacity that previously dictated the entire cluster’s 24/7 size, now summoned and released on demand.
- The routine reference-data and aggregate jobs moved to Glue 5.0 with job bookmarks; ad-hoc and unpredictable analyst jobs went to EMR Serverless (scale-to-zero).
- Orchestration: MWAA (Airflow) ran the ~120-DAG pipeline with per-step retries, backfills, and Glue Data Quality gates on the
processed→curatedboundary.
- The nightly CDR conformance + dedupe + enrichment pipeline (the old 7-hour job) ran on a transient EMR-on-EC2 cluster:
- Governance: Lake Formation with LF-Tags (
domain,confidentiality,retention);subscriber.imsi,subscriber.msisdn, andbilling.card_pancolumn-masked for analysts; a row filter restricted the EU MVNO team to EU subscribers. All legacy direct-S3 access was removed and re-granted through Lake Formation runtime roles — including for the EMR clusters. - Consumption: Athena for the ~180-person analytics/DS org (no more cluster contention); Redshift Serverless for the ~60 executive QuickSight dashboards over curated marts; SageMaker on curated feature tables for the churn and fraud models.
Decisions worth noting. They put core nodes on On-Demand and task nodes on Spot after an early pilot where an all-Spot cluster lost its core nodes mid-job and had to restart from scratch. They standardised on Graviton (m6g/r6g) for ~22% better Spark price-performance. They chose transient EMR over EMR Serverless for the predictable nightly job because Spot on a right-sized fleet was ~65% cheaper than Serverless for that steady, large workload — but used Serverless for the unpredictable analyst jobs where no-idle beat Spot pricing. They turned AQE skew-join on after the corporate-account skew turned a 40-minute job into a 5-hour one in week two. They set Athena workgroup scan caps at 3 TB after an analyst’s SELECT * scanned 50 TB.
The outcome (12 months).
- Platform run-rate fell from ~$6.4M to ~$2.7M/year (~58%) — overwhelmingly from killing 24/7 peak-sized capacity (transient + scale-to-zero), Spot on the compute bulk, and Graviton.
- The nightly pipeline finished by 04:30 instead of mid-morning, and stopped failing twice a week — orchestration retries plus AQE skew handling made it boring.
- Scaling became an API call: the three queued teams got capacity the same week by launching their own transient clusters against the same lake, rather than waiting on an 18-month hardware refresh.
- The PII audit finding closed: one Lake Formation report now answers “who can read
imsi” across Spark, Glue, and Athena — and the masks are enforced inside the Spark jobs, not just in the BI layer. - DR: raw CRR to a second region + IaC-reproducible catalog + code-as-pipelines gave a tested RPO ≈ 15 min / RTO ≈ 3 h, with no standing cluster to fail over.
When to use it
Use this big-data processing platform when:
- You have large-volume batch transformation (terabytes-to-petabytes) that a single machine or a fixed cluster can no longer process in the time available.
- Your compute is provisioned for the peak and paid at the average — a standing cluster sized for the heaviest job, idle most of the time.
- You have heterogeneous workloads (scheduled ETL, spiky ad-hoc, periodic reprocessing, ML feature engineering) that one fixed cluster forces to contend.
- You want Spark’s power without operating Hadoop — managed EMR/Glue instead of a platform team babysitting YARN — and governance that reaches into the compute tier, not just the query layer.
Trade-offs and anti-patterns:
- Operational surface is data engineering, not server management. You trade YARN babysitting for file hygiene (Parquet sizing/compaction), partition design, skew handling, and an orchestration DAG. Anti-pattern: ignoring small files — millions of tiny Parquet objects from frequent writes will make every Spark job and Athena query slow regardless of cluster size; scheduled compaction is mandatory.
- All-Spot is a trap on the wrong nodes. Anti-pattern: putting core/primary nodes on Spot — losing them kills the job and you restart from zero. Core/AM on On-Demand, task on Spot.
- Throwing hardware at skew doesn’t work. Anti-pattern: scaling up a cluster to “fix” a slow job that’s actually skewed — one hot key bottlenecks on one executor no matter how big the cluster. Fix it with AQE/salting, not nodes.
- Don’t run interactive, sub-second workloads here. EMR/Glue/Athena are batch-to-interactive-SQL (seconds-to-hours). Sub-second operational lookups belong on DynamoDB/Aurora; sub-second event reaction belongs on the streaming architecture.
- Don’t choose one engine for everything. Forcing ad-hoc onto a standing cluster, or steady ETL onto Serverless, or hand-managing EMR for a Glue-shaped job, all leave money or simplicity on the table.
Alternatives and how to choose:
| Situation | Better fit |
|---|---|
| Sub-second event reaction is the core need (fraud-in-flight, live dashboards) | Kinesis + Flink / Lambda — the real-time streaming architecture |
| The need is storage design — open tables, store-once-query-many governance | Data Lakehouse on AWS (Iceberg + Lake Formation + multi-engine) |
| Pure SQL BI on modest, predictable volume; no Spark/ML | Redshift-only warehouse — fewer moving parts |
| Mostly ad-hoc SQL on S3, no heavy distributed transforms yet | Athena + Glue, defer EMR until jobs need Spark muscle |
| Want a managed lakehouse/Spark platform with less AWS plumbing, multi-cloud | Databricks on AWS (Unity Catalog + Delta + managed Spark) |
| Org runs everything on Kubernetes and wants Spark to share it | EMR on EKS as the processing tier of this same architecture |
The big-data processing platform on AWS is the right default when heavy, varied batch transformation over a growing dataset would otherwise force you to buy and run peak-sized compute around the clock. By keeping storage permanent on S3 and making compute disposable — transient EMR with Spot for the heavy lifting, EMR Serverless and Glue for the spiky and the routine, Athena for exploration, all governed through one Lake Formation plane — you process any volume on compute sized per job and billed by the second, and the cluster that used to cost the same in July as in December simply ceases to exist.