Data AWS

AWS Enterprise Architecture: Big Data Processing

Big data processing fails in a way that has almost nothing to do with the data and everything to do with the compute. A team stands up a long-running EMR cluster sized for the year-end reprocessing job, runs the nightly 40-minute pipeline on it, and then pays for 240 idle CPU-hours a day. Or the opposite: they pick a cluster that was right for the demo, the data grows 3x, one Spark stage spills to disk because a single skewed key landed on one executor, and a job that should take 20 minutes runs for six hours and then OOM-kills itself at 95%. Neither problem is a Spark problem. They are the absence of a processing architecture — a deliberate separation between durable storage that never moves, ephemeral compute that you turn on for the duration of a job and tear down after, a serverless tier for the spiky and the small, and a catalog-plus-governance plane so every engine sees the same tables and the same access rules. This article builds that platform end to end on Amazon EMR, S3, AWS Glue, Amazon Athena, AWS Lake Formation, and Apache Spark, and treats the hard parts — cluster topology, Spot economics, data skew, job orchestration, and “which of the three EMRs do I actually run” — as first-class concerns.

The single most important idea is this: storage is permanent and compute is disposable. Your data lives on S3 forever; the Spark cluster that transforms it should exist only while a job runs and cost nothing the rest of the day. Get that decoupling right and big-data processing becomes an exercise in picking the cheapest, correctly-sized, transient compute for each job — and the multi-million-dollar Hadoop cluster that ran 24/7 whether or not anyone had work for it becomes a line item you no longer pay.

This is the processing-and-compute companion to the AWS Data Lakehouse reference (which centres on storage, open table formats, and store-once-query-many) and the Real-Time Streaming reference (which centres on sub-second event handling). Here the spine is the batch Spark pipeline and the question is how the heavy transformation actually runs.

The business scenario

Picture an organisation that has more data than its current tools can chew through in the time it has. The shape is identical at 40 engineers and at 4,000.

The early-stage version. A 60-person ad-tech startup lands ~4 TB/day of impression and click logs into S3. They began with a single r5.4xlarge EC2 box running a Python script and a cron job. It worked at 200 GB/day. At 4 TB/day the nightly aggregation now takes 9 hours, finishes after the morning standup, and falls over roughly twice a week when a partner sends a malformed batch. Every attempt to “just make the box bigger” buys a few weeks before the data outgrows it again. The founding engineer knows the answer is “distributed processing” but has watched friends drown in a self-managed Hadoop cluster and does not want to hire a platform team to babysit YARN.

The large-enterprise version. A telco with 60M subscribers runs a 400-node on-premises Hadoop cluster that a team built in 2017. It processes call-detail records, network telemetry, and billing events — roughly 600 TB/month landing, 12 PB at rest. The cluster runs at 30% average utilisation because it is sized for the monthly fraud-detection reprocessing peak; the other 29 days it is mostly idle but fully powered and fully staffed. Hardware refresh is a capital project with an 18-month lead time. Three different teams want more capacity and there is none to give because adding nodes means buying racks. The CFO sees a fixed, enormous, growing cost and the data team sees a queue of jobs they can’t schedule.

Both organisations have the same structural problem, and — exactly as with the lakehouse — it is a coupling problem, but on a different axis:

The processing platform in this article decouples all four. One durable copy of data on S3. Compute that is born for a job and dies when it finishes — a transient EMR cluster, an EMR Serverless application that scales to zero, or a Glue serverless job. Each workload on the compute shape sized for it. And per-job isolation so a poison batch in the fraud pipeline can’t touch the billing pipeline. You stop paying for peak capacity at idle, scaling becomes an API call, and the heaviest job in the year no longer dictates what you pay in July.

The promise to the business: transform any volume of data, on compute you turn on for exactly as long as the job runs, sized per job, and pay by the second of actual work — not by the calendar.

Architecture overview

The platform is a durable S3 data lake in the centre, surrounded by disposable compute engines that all read and write the same data through one shared catalog and one governance plane. Read it as five tiers, with two cross-cutting planes (catalog/governance underneath, orchestration/observability on top) spanning the whole width.

AWS big-data processing platform: source systems and ingestion (DMS, Firehose, Glue connectors) land data in an S3 raw/processed/curated spine; a disposable Spark compute fleet (EMR on EC2 with Spot, EMR Serverless, AWS Glue, EMR on EKS) reads and writes the same tables through a Glue Data Catalog and Lake Formation governance plane; Athena, Redshift and SageMaker serve the curated edge, with Step Functions/MWAA orchestration and an IAM/KMS/VPC/CloudTrail baseline.

Tier 1 — Ingestion into the raw lake (S3). Source data lands, untransformed, in a raw zone on S3. Bulk and relational sources arrive via AWS DMS (CDC from operational databases) or direct S3 uploads/partner drops; high-volume event data arrives through Kinesis Data Firehose writing partitioned objects; SaaS sources come through AWS Glue connectors or AppFlow. Nothing is processed here. Raw is immutable, append-only, partitioned by ingest date, and is the system of record for replay — every derived dataset can be rebuilt from it.

Tier 2 — The lake on S3 (the spine). S3 holds the data across curated zones — raw (as-landed), processed/cleaned (conformed, deduplicated, columnar), and curated/aggregated (business-level marts and feature tables). Curated data is written as Parquet (columnar, compressed with zstd/Snappy) and, where transactional semantics matter, as Apache Iceberg tables. This is the one physical copy of all data; every compute engine is a temporary lens over it. (The storage design — zone layout, table formats, small-file economics — is the subject of the lakehouse reference; here we focus on the engines that read and write it.)

Tier 3 — The processing fleet: where the heavy lifting actually happens. This is the centre of gravity of this architecture and where it differs from every other data reference. Three Spark-capable compute shapes read the same S3 + catalog and write back the same tables, each chosen per workload:

A fourth shape, EMR on EKS, runs Spark as pods on a shared Amazon EKS cluster — the right answer when an org has standardised on Kubernetes and wants Spark to share capacity, tooling, and Karpenter-driven autoscaling with its other containerised workloads. All four write the same Parquet/Iceberg tables to the same Glue catalog; the engine is a per-workload choice, not a platform-wide religion.

Tier 4 — Catalog and governance (the plane underneath everything). The AWS Glue Data Catalog is the single technical metadata store — every table, schema, and partition registered once, read by EMR, Glue, Athena, and Redshift alike. AWS Lake Formation sits on top as the permission layer: it owns the S3 locations and grants database/table/column/row-level access rather than raw S3 access. Every engine — including an EMR cluster running Spark — vends credentials through Lake Formation at runtime, so “analysts cannot see the msisdn column; the EU team sees only EU rows” is enforced identically across Spark, Glue, and Athena.

Tier 5 — Serving and consumption (the right edge). Processed and curated tables are served to three patterns: Amazon Athena (serverless, pay-per-TB-scanned SQL for ad-hoc exploration and data-science queries — zero standing cost), Amazon Redshift / Spectrum (warehouse-grade BI over curated marts joined to hot dimensions), and Amazon SageMaker / EMR (reading curated feature tables directly for ML training). BI tools — QuickSight, Tableau, Power BI — sit beyond the engines. The data is never permanently copied into any engine; S3 is the one copy.

The two cross-cutting planes. Orchestration sits on top: AWS Step Functions and Amazon Managed Workflows for Apache Airflow (MWAA) sequence the pipeline — land raw, run Glue/EMR transforms, validate data quality, publish curated, then signal downstream — with retries, dependencies, and backfills. Observability spans everything: EMR/Glue push Spark metrics and logs to CloudWatch, the Spark History Server and EMR’s managed UIs expose stage-level execution, and Glue Data Quality gates the pipeline before bad data reaches consumers.

The diagram, in words. Picture a wide canvas. On the left, a column of source systems (databases, partner feeds, event streams, SaaS) with arrows through DMS / Firehose / Glue connectors into a single large S3 “raw” bucket. In the centre, S3 is drawn as the spine — three stacked buckets labelled raw → processed → curated, the unmistakable middle of the picture. Sitting above and around the S3 spine, four compute boxes all draw bidirectional arrows to S3: EMR on EC2 (drawn with a small On-Demand core block and a larger dashed “Spot task nodes” block to signal transience), EMR Serverless (a cloud with no fixed shape), AWS Glue (a gear), and EMR on EKS (Spark pods inside an EKS hexagon). Crucially, each compute box is drawn with a dashed border to convey ephemeral — they appear for a job and vanish. Underneath the entire spine runs a full-width band: Glue Data Catalog (one source of schema) with Lake Formation layered on it (the lock), and every compute box’s arrow to S3 passes through this band. On the right, Athena / Redshift / SageMaker read curated S3, feeding QuickSight / BI at the far edge. Across the top, an orchestration ribbon — Step Functions / MWAA (Airflow) — with control arrows reaching down into each compute box, and a parallel CloudWatch / Spark History / Glue Data Quality observability ribbon. Cross-cutting boxes at the very bottom — IAM Identity Center, KMS, VPC, CloudTrail — touch every tier. The defining visual: a permanent storage spine, ringed by disposable compute that is born and dies per job, all governed through one plane.

Component breakdown

Component Role in the platform Key configuration choices
Amazon S3 (zoned) The permanent spine — one physical copy of raw/processed/curated data; every engine’s read/write target Separate buckets/prefixes per zone; partition by date; S3 Intelligent-Tiering on raw; lifecycle to Glacier for cold; Block Public Access + SSE-KMS; columnar Parquet for processed/curated
EMR on EC2 (transient) Heavy/scheduled Spark batch — big ETL, reprocessing, ML feature jobs; launched per job, terminated after Instance fleets with diversified Spot for task nodes; On-Demand for primary/core; managed scaling; --auto-terminate; EMRFS + Glue Catalog; runtime roles for Lake Formation
EMR Serverless Spark with no cluster to manage — spiky, intermittent, small-to-medium jobs; scales to zero Pre-initialised capacity for latency-sensitive jobs vs. cold-start for cheap; per-job application; vCPU/GB-second billing; auto-stop idle apps
AWS Glue ETL Serverless Spark for routine raw→processed→curated transforms and orchestration glue Glue 5.0 (Spark 3.5); job bookmarks for incremental; Auto Scaling workers; Flex execution for non-urgent jobs (cheaper); Glue connectors for sources
EMR on EKS Spark-as-pods on a shared EKS cluster — for K8s-standardised orgs Virtual clusters per team; Karpenter + Spot node pools; pod templates; shares tooling/observability with other containerised workloads
Apache Spark The processing engine inside EMR/Glue/EKS — the distributed compute itself Tune partitions/shuffle; Adaptive Query Execution (AQE) for skew + dynamic coalescing; broadcast joins for small dims; cache hot intermediates; EMR’s optimised Spark runtime
Glue Data Catalog One technical metadata store every engine shares One catalog per environment; databases per domain; crawlers for schema discovery on raw; tables registered natively for curated
Lake Formation Governance plane — fine-grained, engine-agnostic permissions enforced even inside Spark Register S3 locations; remove IAMAllowedPrincipals; LF-Tags for scale; column masking + row filters; EMR runtime role enforcement
Amazon Athena Serverless ad-hoc SQL over processed/curated — the default exploration engine Engine v3 (Trino); workgroups with per-query scan caps; results to dedicated S3; query-result reuse cache
Step Functions / MWAA Orchestration — sequence, retry, branch, backfill the pipeline Step Functions for AWS-native, event-driven flows; MWAA (Airflow) for complex DAGs/cross-system dependencies; both trigger EMR/Glue and gate on data quality
Glue Data Quality Data-quality gates between zones, before bad data reaches consumers DQDL rules on processed→curated; fail-fast on freshness/null-rate/referential checks; quality results to CloudWatch/EventBridge

Why each is here, and the choices that matter:

S3 is the spine, and it never moves. Every other component in the diagram is disposable; S3 is the one thing that persists. Because raw is immutable and replayable, and processed/curated are recomputable from raw, S3 is simultaneously your storage, your backup substrate, and your DR foundation. The single most consequential storage choice for processing performance is columnar Parquet with compression and sensible partitioning: a Spark job that reads partition-pruned, predicate-pushed Parquet touches a fraction of the bytes of one reading raw JSON, which is the difference between a 20-minute job and a 3-hour one.

The three-plus-one compute shapes are the heart of the architecture — and choosing between them is the skill. This table is the decision most teams get wrong:

If the workload is… Run it on… Because…
Large, scheduled, predictable (nightly multi-TB ETL); long reprocessing EMR on EC2, transient + Spot Best $/compute via Spot task nodes and instance-fleet diversification; cluster lives only for the job
Spiky, intermittent, unpredictable, or you don’t want to size instances EMR Serverless Scales to zero between runs; no capacity planning; pay per vCPU/GB-second of actual work
Routine declarative ETL on a schedule/event; incremental loads AWS Glue Job bookmarks, connectors, fastest to author; serverless; Flex for cheap non-urgent runs
Ad-hoc SQL exploration; “what’s in this table?” Athena Zero standing cost; pay per TB scanned; no Spark to manage at all
Your org runs everything on Kubernetes already EMR on EKS Spark shares capacity, Karpenter autoscaling, and tooling with other K8s workloads

The anti-pattern is choosing one shape for everything: running ad-hoc exploration on a giant standing EMR cluster (you’ll pay for idle), or forcing a steady nightly ETL onto EMR Serverless when a transient Spot cluster would be 70% cheaper, or hand-managing EMR for a job Glue would author in an afternoon.

Transient EMR + instance fleets + Spot is where the money is saved. A transient cluster launches, runs, and self-terminates, so there is no idle cost. Instance fleets let you specify a target capacity and a menu of acceptable instance types across multiple AZs; EMR fills it from whatever Spot capacity is cheapest and available, dramatically reducing the chance that a Spot shortage stalls your job. The rule that keeps this reliable: primary and core nodes on On-Demand (they hold HDFS/shuffle data and the application master — losing them kills the job); task nodes on Spot (they only run compute — losing one just reruns its tasks). Get that split right and you capture 60–90% Spot savings on the bulk of the cluster without risking job completion.

Spark is the engine, and AQE is the feature that earns its keep. The classic big-data failure — one stage runs for hours while 199 of 200 tasks finished in seconds — is data skew: a single hot key (one giant advertiser, one heavy subscriber) lands all its rows on one executor. Spark’s Adaptive Query Execution detects skewed partitions at runtime and splits them, dynamically coalesces tiny shuffle partitions, and switches join strategies based on actual sizes — turning a class of six-hour-then-OOM jobs into ones that just finish. Combined with broadcast joins for small dimensions and right-sized shuffle partitions, AQE is the single highest-leverage Spark setting on this platform.

Glue Catalog is the contract; Lake Formation is the lock — and the subtlety is that it works inside Spark. On a self-managed Hadoop cluster, once a job has the cluster’s IAM role it can read any S3 it can reach — governance is coarse and per-cluster. Here, an EMR cluster runs with a runtime role, and Lake Formation vends scoped, temporary credentials per query even to Spark, so the same column mask that hides msisdn in Athena also hides it from a Spark SELECT on EMR. Policy lives in one place and is enforced everywhere, including the heavy-compute tier.

Orchestration is not optional at enterprise scale. A real pipeline is a DAG: land raw, dedupe, conform, join reference data, aggregate, validate, publish, notify. Step Functions expresses this as an AWS-native state machine (great for event-driven, EMR/Glue-centric flows); MWAA (managed Airflow) is the choice when you have complex cross-system dependencies, hundreds of DAGs, or a team already fluent in Airflow. Both give you retries, backfills, and the ability to rerun one failed step rather than the whole pipeline — and both gate on Glue Data Quality so a failed freshness or null-rate check stops the pipeline before bad data reaches a dashboard.

Implementation guidance

Bucket and zone layout. Consistent prefixing across zones, partitioned for pruning:

s3://acme-bd-raw-<acct>/<source>/<table>/ingest_date=YYYY-MM-DD/
s3://acme-bd-processed-<acct>/<domain>/<table>/dt=YYYY-MM-DD/   # Parquet
s3://acme-bd-curated-<acct>/<domain>/<table>/                  # Parquet/Iceberg

Raw partitions by ingest date for cheap lifecycle and replay; processed/curated partition by the business date queries actually filter on. The processing rule of thumb: aim for 128 MB–1 GB Parquet files — too small and Spark wastes time on task overhead and S3 listing; too large and you lose parallelism.

Terraform is the right IaC here (the team standardises on it). Manage as code:

Transient EMR cluster spec — the choices that matter (expressed conceptually; this lives in the orchestration step, not standing infra):

{
  "ReleaseLabel": "emr-7.5.0",                 // Spark 3.5, optimised runtime
  "Applications": ["Spark"],
  "AutoTerminationPolicy": { "IdleTimeout": 600 },  // die if idle 10 min
  "ManagedScalingPolicy": { "MinCapacity": 2, "MaxCapacity": 64 },
  "Instances": {
    "InstanceFleets": [
      { "InstanceFleetType": "MASTER", "TargetOnDemandCapacity": 1,
        "InstanceTypeConfigs": [ { "InstanceType": "m6g.xlarge" } ] },
      { "InstanceFleetType": "CORE",   "TargetOnDemandCapacity": 2,   // On-Demand: holds data
        "InstanceTypeConfigs": [ { "InstanceType": "r6g.2xlarge" } ] },
      { "InstanceFleetType": "TASK",   "TargetSpotCapacity": 40,      // Spot: pure compute
        "InstanceTypeConfigs": [                                       // diversified menu
          { "InstanceType": "r6g.4xlarge" }, { "InstanceType": "r5.4xlarge" },
          { "InstanceType": "r6i.4xlarge" }, { "InstanceType": "m6g.8xlarge" }
        ],
        "LaunchSpecifications": { "SpotSpecification": {
          "AllocationStrategy": "price-capacity-optimized" } } }   // best Spot strategy
    ]
  },
  "Configurations": [ { "Classification": "spark-defaults", "Properties": {
    "spark.sql.adaptive.enabled": "true",
    "spark.sql.adaptive.skewJoin.enabled": "true",            // the skew killer
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    "spark.dynamicAllocation.enabled": "true"
  } } ]
}

The headline decisions: price-capacity-optimized Spot allocation (balances cheapest with least-likely-to-be-interrupted), On-Demand core / Spot task split, managed scaling so the cluster grows to the data and shrinks after, auto-termination so it never idles, and AQE on so skew and small-partition problems resolve at runtime.

EMR Serverless removes all of the above for spiky jobs — you submit a Spark job to an application and AWS handles workers:

aws emr-serverless start-job-run \
  --application-id <app> --execution-role-arn <runtime-role> \
  --job-driver '{ "sparkSubmit": {
      "entryPoint": "s3://acme-bd-code/jobs/daily_rollup.py",
      "sparkSubmitParameters": "--conf spark.sql.adaptive.enabled=true" } }'

Use pre-initialised capacity when a job is latency-sensitive (workers stay warm); accept cold-start when it’s a cheap background job.

Networking. Keep all data-plane traffic on the AWS backbone via a VPC Gateway Endpoint for S3 (free) and Interface (PrivateLink) endpoints for Glue, Lake Formation, Athena, KMS, STS, and the EMR APIs. EMR clusters, EMR Serverless, and EMR-on-EKS nodes run in private subnets with no public IPs; a NAT gateway handles only OS/package egress. The result: terabytes of shuffle and S3 I/O never traverse the public internet, and you avoid per-GB internet data charges on the heaviest traffic in the system.

Identity wiring. Human access flows from the IdP (the org standardises on Entra ID) through IAM Identity Center (SAML/SCIM) → permission sets → data-access roles, which are then granted Lake Formation permissions (not S3 permissions). Pipeline identities are scoped per engine: each Glue job, each EMR cluster (via its runtime role), and each EMR Serverless application gets its own role with only the Lake Formation grants it needs. No human or job role gets direct s3:GetObject on the lake buckets — Lake Formation vends scoped, temporary credentials per query, even into Spark. This is the wiring decision that makes governance real rather than theatrical, and it’s the one that most distinguishes this from a legacy Hadoop cluster where the cluster role could read everything.

Enterprise considerations

Security and Zero Trust. The model is “no implicit S3 access; every read — including from Spark — is brokered.” Concretely: (1) Block Public Access on every bucket, account-wide; (2) SSE-KMS with separate CMKs per zone for independent revocation and per-zone CloudTrail audit; (3) Lake Formation as the only path to data, so even an EMR cluster’s runtime role holds Lake Formation grants, not bucket policies; (4) column masking on PII (msisdn, imsi, card_pan) and row filters for tenancy/region, enforced in Spark, Glue, and Athena alike; (5) all traffic on PrivateLink; (6) EMR security configurations add in-transit and at-rest encryption for the cluster’s own intermediate data (local disks, shuffle); (7) CloudTrail data events + Lake Formation’s audit log give a complete “who processed/read which column when” trail. LF-Tags keep this scalable — tag a table confidentiality=pii once and grant on the tag, so new tables inherit policy automatically.

Cost optimization — this is the platform’s reason to exist, and it’s layered.

Scalability. S3 absorbs effectively unlimited objects and throughput; partition-prefix design avoids hotspots. EMR managed scaling grows a cluster mid-job as a stage demands more executors and shrinks it after; EMR Serverless and Glue absorb concurrency automatically; Athena is serverless. The real scaling discipline is not capacity — it’s data engineering: keeping Parquet file sizes in the sweet spot (compaction jobs to fix small files), partitioning on the columns queries filter, and handling skew with AQE so adding nodes actually helps instead of one hot key bottlenecking on one executor regardless of cluster size. Throwing hardware at a skewed job is the most expensive way to not fix it.

Reliability and DR (RTO/RPO).

Observability. EMR and Glue emit Spark metrics and driver/executor logs to CloudWatch; the Spark History Server (and EMR’s managed Spark UI / Persistent UIs) expose stage-level execution — the single most important diagnostic for “why is this job slow,” surfacing spills, skew, and shuffle volume. Glue Data Quality (DQDL) gates the processed→curated boundary so freshness/null-rate/referential failures stop the pipeline and fire EventBridge alerts before bad data reaches dashboards. Track four SLOs: pipeline freshness (curated lag), job cost (compute-hours and bytes scanned/day), data-quality pass rate, and Spot interruption rate (rising interruptions are a signal to broaden the instance-fleet menu).

Governance. Lake Formation LF-Tags are the scalable model — a taxonomy (domain, confidentiality, retention) applied to databases/tables, grants written against tags so policy is inherited not hand-maintained. This is also the foundation for a data-mesh evolution: each domain owns its processed/curated databases and the jobs that produce them, sharing across domains via Lake Formation cross-account grants, with a central platform team owning only the tag taxonomy, the orchestration substrate, and the EMR/Glue golden patterns. Pair with Amazon DataZone for business-level discovery and access requests on top of the technical Glue catalog.

Reference enterprise example

Meridian Telecom is a fictional mobile carrier with 62M subscribers, running the legacy mess described at the top: a 400-node on-prem Hadoop cluster built in 2017, ~600 TB/month landing (call-detail records, RAN telemetry, billing events, app analytics), 12 PB at rest, ~30% average utilisation because it’s sized for the monthly fraud-reprocessing peak. The data-platform run-rate (hardware amortisation, datacentre, ops staff, Hadoop support) is ~$6.4M/year, three teams are queued for capacity that doesn’t exist, and the last audit flagged that the cluster role could read subscriber imsi/msisdn with no fine-grained control.

What they built. Over three quarters they migrated to the architecture above:

Decisions worth noting. They put core nodes on On-Demand and task nodes on Spot after an early pilot where an all-Spot cluster lost its core nodes mid-job and had to restart from scratch. They standardised on Graviton (m6g/r6g) for ~22% better Spark price-performance. They chose transient EMR over EMR Serverless for the predictable nightly job because Spot on a right-sized fleet was ~65% cheaper than Serverless for that steady, large workload — but used Serverless for the unpredictable analyst jobs where no-idle beat Spot pricing. They turned AQE skew-join on after the corporate-account skew turned a 40-minute job into a 5-hour one in week two. They set Athena workgroup scan caps at 3 TB after an analyst’s SELECT * scanned 50 TB.

The outcome (12 months).

When to use it

Use this big-data processing platform when:

Trade-offs and anti-patterns:

Alternatives and how to choose:

Situation Better fit
Sub-second event reaction is the core need (fraud-in-flight, live dashboards) Kinesis + Flink / Lambda — the real-time streaming architecture
The need is storage design — open tables, store-once-query-many governance Data Lakehouse on AWS (Iceberg + Lake Formation + multi-engine)
Pure SQL BI on modest, predictable volume; no Spark/ML Redshift-only warehouse — fewer moving parts
Mostly ad-hoc SQL on S3, no heavy distributed transforms yet Athena + Glue, defer EMR until jobs need Spark muscle
Want a managed lakehouse/Spark platform with less AWS plumbing, multi-cloud Databricks on AWS (Unity Catalog + Delta + managed Spark)
Org runs everything on Kubernetes and wants Spark to share it EMR on EKS as the processing tier of this same architecture

The big-data processing platform on AWS is the right default when heavy, varied batch transformation over a growing dataset would otherwise force you to buy and run peak-sized compute around the clock. By keeping storage permanent on S3 and making compute disposable — transient EMR with Spot for the heavy lifting, EMR Serverless and Glue for the spiky and the routine, Athena for exploration, all governed through one Lake Formation plane — you process any volume on compute sized per job and billed by the second, and the cluster that used to cost the same in July as in December simply ceases to exist.

AWSArchitectureEnterpriseReference Architecture
Need this built for real?

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.

Work with me

Comments

Keep Reading