AWS Compute: EC2, Lambda, ECS and EKS — Which One to Choose?

You have a workload — an API, a batch job, a queue consumer, a website — and AWS gives you at least five credible ways to run it: a raw EC2 instance you own end to end, a Lambda function with no server in sight, an ECS task on AWS’s own orchestrator, an EKS pod on managed Kubernetes, or any of the container options backed by Fargate so you never touch a node. Pick wrong and you pay for it twice: once in the monthly bill, and again every week in the operational toil of patching, scaling and debugging a platform that was never the right shape for the job. The wrong default — “spin up an EC2 instance, it’s what we know” — is the single most expensive habit on AWS, because an idle m5.large costs the same whether it serves a million requests or zero, and someone still has to patch its kernel.

This is the decision guide, written the way an architect with 22 years of experience actually reasons about it: not “which is best” (none is best) but “which axis does this workload live on”. That means how much of the stack you must control, how the traffic arrives (steady, bursty, event-driven, scheduled), how long each unit of work runs, and how much undifferentiated heavy lifting you are willing to hand to AWS. We walk every service option by option: EC2’s instance families and purchase models, Lambda’s runtimes and its hard 15-minute / 10 GB ceilings, ECS’s two launch types and task-definition knobs, EKS’s control-plane-plus-data-plane split and its per-cluster hourly charge, and Fargate’s per-second vCPU/GB billing. Every configuration gets an aws CLI snippet and a Terraform snippet, and because you will come back to this mid-design, every comparison (instance families, runtimes, launch types, limits, failure modes, prices) is a table you can scan in ten seconds.

By the end you will stop treating EC2 as the answer to every question. You will look at a workload and know within a minute whether it wants a function (event-driven, sub-15-minute, spiky), a task (a container with no Kubernetes ambitions), a pod (you already run Kubernetes and want the API and ecosystem), or a real instance (GPU, Windows licensing, a kernel module, sustained 24×7 load where reserved capacity is cheapest). And when the workload is genuinely on the fence, you will know the exact knobs — cold start, SNAT, node-group sizing, Savings Plans — that tip the decision.

What problem this solves

The pain is concrete and recurring. A team ships a service on the compute they’re comfortable with, not the compute that fits, and the mismatch shows up as either a bloated bill or a stream of 2 a.m. pages. A cron job that runs for forty seconds once an hour sits on a 24×7 t3.medium: you pay for ~720 hours a month to do ~8 hours of work. A bursty image-thumbnailer runs on a fixed fleet that’s over-provisioned for the median and still falls over at peak. A five-service product adopts EKS because a conference talk said to, then discovers it now owns a Kubernetes control plane’s worth of upgrades, add-on CVEs and IRSA debugging, all for five services that ECS would have run with a fraction of the surface area.
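The utilisation behind the cron-job example is quick to check. The sketch below is illustrative arithmetic only: the function names are hypothetical and a 30-day month is assumed.

```python
SECONDS_PER_HOUR = 3600


def busy_hours_per_month(run_seconds: float, runs_per_day: int, days: int = 30) -> float:
    """Hours of real work a scheduled job performs in a month."""
    return run_seconds * runs_per_day * days / SECONDS_PER_HOUR


def utilisation(busy_hours: float, days: int = 30) -> float:
    """Fraction of the billed instance-hours that did useful work."""
    return busy_hours / (24 * days)


# Forty seconds of work, once an hour, on an always-on instance.
busy = busy_hours_per_month(run_seconds=40, runs_per_day=24)
print(f"busy hours: {busy:.1f}")                # busy hours: 8.0
print(f"utilisation: {utilisation(busy):.2%}")  # utilisation: 1.11%
```

Roughly 99% of the billed hours buy nothing, which is the shape a per-invocation meter is designed for.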

What breaks without a deliberate choice: cost creeps (idle instances, over-provisioned fleets, a $0.10/hour EKS cluster per environment that nobody decommissions); operational load balloons (OS patching, AMI rebuilds, node-group rotations, control-plane version skew); and reliability suffers in the gaps the team didn’t design for (a Lambda that quietly hits its 15-minute timeout on a large input, an ECS service with no health check that ships a crash-looping task, an EKS pod stuck Pending because the cluster autoscaler can’t get capacity in the right AZ). None of these are exotic failures — they’re the default outcome of picking compute by familiarity.

Who hits this: essentially every team that runs more than one kind of workload. It bites hardest on teams migrating a monolith (everything lands on EC2 because that’s the lift-and-shift path, and nothing ever moves off), startups that adopt Kubernetes before they have the headcount to operate it, and cost-sensitive shops that never revisit the first instance they launched. The fix is not a tool — it’s a model: match the lifecycle of the work to the billing and operational model of the service. The rest of this article is that model, enumerated.

To frame the whole field before the deep dive, here is every compute service this article covers, the unit you pay for, what AWS manages versus what you manage, and the workload shape it fits:

Service	What it is	You pay for	AWS manages	You manage	Fits this workload shape
EC2	Virtual machine (instance)	Instance-hour (per-second, 60s min)	Hypervisor, hardware, network	OS, patching, scaling, runtime	Full OS control, GPU, Windows licensing, sustained 24×7
Lambda	Function-as-a-service	GB-second + per-request	Everything below your handler	Just the function code + config	Event-driven, bursty, ≤15 min, glue
ECS on EC2	AWS container orchestrator on your nodes	The EC2 instances (free control plane)	Scheduler/control plane	The EC2 node fleet	Containers, want bin-packing & EC2 pricing
ECS on Fargate	AWS orchestrator, serverless data plane	Per-second vCPU + GB	Scheduler and nodes	Just the task definition	Containers, no nodes, variable load
EKS on EC2	Managed Kubernetes on your nodes	$0.10/hr control plane + nodes	K8s control plane (3-AZ)	Node groups, add-ons, upgrades	Already on K8s, need the API/ecosystem
EKS on Fargate	Managed Kubernetes, serverless pods	$0.10/hr + per-pod vCPU/GB	Control plane and pod hosts	Manifests, profiles	K8s API without node management
App Runner	Fully managed container web service	Per-second + provisioned floor	Build, deploy, scale, LB	Just the container image	Stateless HTTP container, minimal ops
Batch	Managed batch scheduler	The underlying EC2/Fargate	Queueing & provisioning	Job definitions	Large fan-out batch / HPC
Lightsail	Bundled VPS	Flat monthly bundle	VM + simple stack	The app	Simple sites, predictable flat pricing

Learning objectives

By the end of this article you can:

Map any workload to EC2, Lambda, ECS, EKS or Fargate using its lifecycle (event-driven, bursty, steady, scheduled, long-running) and your control requirement, not by habit.
Read the EC2 instance-family taxonomy (the letter/number/suffix scheme) and pick the right family — general purpose, compute, memory, storage, accelerated — and the right purchase model (On-Demand, Reserved, Savings Plan, Spot, Dedicated).
State Lambda’s hard limits from memory — 15-minute timeout, 10 GB memory, 10 GB ephemeral /tmp, 6 MB synchronous payload, 250 MB unzipped package, 1,000 default concurrency — and design around each.
Choose between ECS launch types (EC2 vs Fargate) and configure a task definition’s CPU/memory, networking mode, IAM roles and health checks correctly.
Decide between ECS and EKS honestly (operational surface area vs Kubernetes API/portability) and size an EKS data plane with managed node groups, Fargate profiles or Karpenter.
Use Fargate where it earns its premium and fall back to EC2-backed compute where bin-packing or reserved pricing wins.
Right-size and cost-model each option in INR/USD, exploit the relevant free tier, and name the price levers (Graviton, Spot, Savings Plans, memory tuning) that move the bill most.

Prerequisites & where this fits

You should be comfortable with the AWS basics: an AWS account and IAM (roles, policies — every compute service assumes an execution role or instance profile to call other services), the VPC model (subnets, security groups, public vs private), and how to run the aws CLI with credentials. You should know what a container image is (a packaged filesystem + entrypoint, built from a Dockerfile, stored in a registry like ECR) and roughly what Kubernetes does (declarative orchestration of pods across nodes), even if you’ve never operated it. Familiarity with Terraform (resource, provider, terraform apply) helps because every example pairs a CLI command with IaC.

This sits at the foundation of the AWS compute track. It’s the decision upstream of the deeper guides: once you’ve chosen containers, the ECS vs EKS vs Fargate container path goes one level deeper on orchestration, and once you’ve chosen serverless, Lambda event-driven patterns covers the integration shapes. The network your compute lands in is VPC subnets and security groups; the placement across Regions and Availability Zones decides your blast radius; the front door is usually an ALB, NLB or API Gateway; and the identity every service assumes comes from Organizations and IAM foundations. The state your compute talks to lives in RDS, DynamoDB or Aurora and S3 storage classes.

A quick map of who owns what during design and operations, so the trade-off is concrete:

Layer	EC2	Lambda	ECS/Fargate	EKS
Hardware / hypervisor	AWS	AWS	AWS	AWS
Guest OS & kernel patching	You	AWS	AWS (Fargate) / You (EC2)	AWS (Fargate) / You (EC2)
Runtime / language version	You	AWS-provided or your image	Your image	Your image
Orchestration / scheduling	You (ASG)	AWS	AWS	AWS control plane, you tune
Scaling policy	You (ASG/target tracking)	AWS (automatic)	You (service autoscaling)	You (HPA + node scaler)
Kubernetes version upgrades	n/a	n/a	n/a	You (control + data plane)
Networking (ENI, SG)	You	AWS (+ VPC opt-in)	You (awsvpc)	You (CNI)

Core concepts

Five mental models make every later choice obvious.

Compute is a control-versus-convenience dial, not a ladder. EC2, ECS, EKS and Lambda are not “beginner to advanced” — they’re points on a single axis: how much of the stack do you operate yourself. At one end, EC2 hands you a bare virtual machine and you own everything above the hypervisor (OS, patches, runtime, scaling). At the other, Lambda hands you a function signature and AWS owns everything else. ECS and EKS sit in the middle as container orchestrators — they schedule your containers onto compute and keep the desired count running — with Fargate sliding the same orchestrators toward the Lambda end by removing the servers. You don’t climb this; you pick the point where the control you need meets the convenience you want.

The billing unit encodes the right workload shape. EC2 bills per instance-second (60-second minimum) regardless of utilisation — so it’s cheapest for work that keeps the instance busy, and wasteful for idle. Lambda bills per GB-second of execution plus per request — you pay only while code runs, so it’s cheapest for spiky, intermittent work and ruinous for something that runs flat-out 24×7. Fargate bills per second of provisioned vCPU and GB — between the two. EKS adds a fixed $0.10/hour per cluster for the managed control plane on top of whatever data plane you choose. Match the traffic shape to the meter: steady → instances; spiky/event → functions; in-between containerised → tasks.

A container needs an orchestrator, and the orchestrator needs a data plane. A container image is inert; something must place it on a host, restart it when it dies, scale the count, and wire it to networking. That “something” is the orchestrator — ECS (AWS’s own, simpler) or EKS (managed Kubernetes, richer/portable). Each orchestrator runs your containers on a data plane: either EC2 nodes you own (you patch and scale them, but get bin-packing and EC2/Spot pricing) or Fargate (AWS owns the host; you pay per task/pod, no nodes to manage). “ECS vs EKS” is the control-plane choice; “EC2 vs Fargate” is the data-plane choice — they’re orthogonal, and you make both.

Limits are the design, not the fine print. Lambda’s limits define what it can’t do: a request that takes 16 minutes will never finish (15-minute hard ceiling), a job needing 12 GB RAM can’t run (10 GB max), a sync response over 6 MB will be rejected. These aren’t tunables — they’re walls. EC2 has no such walls (you can run a 24-hour job on a 768 GB instance) but trades that freedom for ownership of the whole machine. Knowing each service’s hard limits up front turns “we’ll figure it out” into “this workload is disqualified from Lambda, so it’s a task or an instance.”

Cold start is the tax on scaling from zero. Anything that can scale to zero — Lambda, Fargate, App Runner, Karpenter-provisioned nodes — pays a startup cost the first time a request lands with no warm capacity: the runtime initialises, the image is pulled, the connection pool primes. Lambda cold starts are tens to hundreds of milliseconds (seconds for large packages or VPC-attached functions historically); a new Fargate task takes tens of seconds to pull and start; a new EC2 node under an autoscaler takes a minute or more. An always-on EC2 instance never pays this — that’s part of what its idle cost buys. Whether cold start matters is a property of your latency budget, and it’s often the knob that decides a borderline case.

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary at the end repeats these for lookup; this table is the mental model side by side:

Concept	One-line definition	Where it lives	Why it matters to the choice
Instance	A virtual machine you rent	EC2	The control end of the dial
Instance family	A class tuned for a resource profile	EC2 (e.g. `m`, `c`, `r`, `g`)	Picks CPU:RAM:GPU ratio + price
AMI	The disk image an instance boots	EC2	You own its patch level
Auto Scaling Group	Keeps N instances running, scales them	EC2	You write the scaling policy
Function	A handler AWS runs per event	Lambda	The convenience end of the dial
Cold start	First-request latency on fresh capacity	Lambda/Fargate/nodes	Tax on scale-from-zero
Concurrency	Simultaneous executions/tasks	Lambda/ECS	The Lambda scaling + limit unit
Task	A running container group (ECS)	ECS	The ECS unit of work
Task definition	The blueprint for a task	ECS	CPU/mem/role/ports declared here
Service	Keeps N tasks running (ECS)	ECS	ECS’s “desired count” + autoscaling
Pod	The smallest deployable unit (K8s)	EKS	The EKS unit of work
Node / node group	The EC2 host(s) running pods	EKS/ECS-EC2	The data plane you may own
Fargate	Serverless data plane for ECS/EKS	ECS/EKS	Removes nodes; per-task billing
Control plane	The orchestrator’s brain (API/scheduler)	ECS (free) / EKS ($0.10/hr)	EKS’s fixed cost + upgrade burden
Execution / task role	The IAM identity the compute assumes	All	Least-privilege access to AWS APIs

The decision table — pick from the symptom

When you’re staring at a workload and not sure where it goes, match its dominant characteristic in the left column and read across. This is the whole article compressed into one lookup:

If the workload is…	It’s probably best on…	Because…
Event-driven, runs seconds–minutes, spiky	Lambda	Scales to zero; pay per invocation; no servers
A long process or anything > 15 minutes	ECS / EC2	Past Lambda’s hard timeout wall
A stateless container, lean team, AWS-only	ECS on Fargate	Simplest orchestration, no nodes
Containers at high steady density / GPU	ECS on EC2	Bin-pack + Reserved/Spot pricing wins
Already on Kubernetes / needs the K8s API	EKS	Full API + ecosystem + portability
Needs a specific OS, kernel module, BYOL licence	EC2	Only a real machine gives that control
Sustained 24×7 CPU at scale, cost-sensitive	EC2 (Reserved/Spot)	Lowest per-compute cost when always busy
A simple website with predictable flat cost	Lightsail / App Runner	Bundled, minimal decisions
Large fan-out batch / HPC	Batch (on EC2/Fargate)	Managed queueing + provisioning
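The lookup above can be read as an ordered set of rules. The sketch below is one possible encoding, not a definitive policy: the `Workload` fields, the `suggest` function and the order in which rules are checked are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    max_run_minutes: float
    event_driven: bool = False
    needs_os_control: bool = False     # specific OS, kernel module, BYOL licence
    needs_k8s_api: bool = False
    containerised: bool = False
    steady_high_density: bool = False  # high steady density or GPU
    always_busy: bool = False          # sustained 24x7 CPU


def suggest(w: Workload) -> str:
    """Return the first row of the decision table that matches the workload."""
    if w.needs_os_control:
        return "EC2"
    if w.needs_k8s_api:
        return "EKS"
    if w.event_driven and w.max_run_minutes <= 15:
        return "Lambda"
    if w.containerised:
        return "ECS on EC2" if w.steady_high_density else "ECS on Fargate"
    if w.always_busy:
        return "EC2 (Reserved/Spot)"
    return "ECS / EC2"


print(suggest(Workload(max_run_minutes=2, event_driven=True)))
print(suggest(Workload(max_run_minutes=45, event_driven=True, containerised=True)))
```

Checking the hard disqualifiers first (OS control, the Kubernetes API, the 15-minute wall) keeps the softer cost questions from masking a requirement that only one service can meet.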

EC2 — the full-control option, option by option

EC2 gives you a virtual machine and gets out of the way. That freedom is the point and the cost: you choose the instance family (the CPU:memory:accelerator ratio), the size within it, the AMI (and thus the patch level you now own), the purchase model (which swings the price by up to ~90%), and the scaling (you write the Auto Scaling Group policy). Reach for EC2 when you need something the managed options can’t give: a specific OS or kernel module, a GPU, BYOL Windows/Oracle licensing, persistent local NVMe, or simply the lowest per-compute cost for a workload that runs flat-out 24×7.

Instance families — read the naming scheme

EC2 instance type names encode everything: m5.large = family m (general purpose), generation 5, size large. Suffixes refine it: g = Graviton (ARM, cheaper per vCPU), n = enhanced networking, d = local NVMe, a = AMD. Pick the family by the workload’s bottleneck (CPU-bound → c, RAM-bound → r/x, GPU → g/p), then the size by how much you need.

Family	Class	vCPU:RAM ratio	Built for	Example type	Typical workload
`t` (t3/t4g)	Burstable general	1:2–1:4, CPU credits	Spiky low-average CPU	`t3.medium` (2 vCPU/4 GB)	Dev boxes, low-traffic web, microservices
`m` (m6i/m7g)	General purpose	1:4	Balanced steady load	`m7g.large` (2/8)	App servers, mid-tier, small DBs
`c` (c7g/c6i)	Compute optimised	1:2	CPU-bound	`c7g.xlarge` (4/8)	Batch, encoding, game servers, HPC
`r` (r6i/r7g)	Memory optimised	1:8	RAM-bound	`r7g.2xlarge` (8/64)	In-memory caches, big DBs, analytics
`x` (x2)	High memory	1:16+	Huge RAM	`x2idn.16xlarge` (64/1024)	SAP HANA, large in-memory stores
`i` (i4i)	Storage optimised	NVMe-heavy	High local IOPS	`i4i.2xlarge` (8/64 + NVMe)	NoSQL, search, warm caches
`g` (g5/g6)	GPU (inference/graphics)	+ NVIDIA GPU	ML inference, rendering	`g5.xlarge`	Inference, transcode, VDI
`p` (p4/p5)	GPU (training)	+ high-end GPU	ML training, HPC	`p5.48xlarge`	LLM training, simulation
`inf`/`trn`	AWS silicon	Inferentia/Trainium	Cost-optimised ML	`inf2.xlarge`	High-volume inference
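The naming scheme can be sketched as a small parser. This is a minimal illustration, assuming the common `family + generation + suffixes + "." + size` layout; the regular expression and helper names are hypothetical, only the four suffixes described above are explained, and unusual names (for example the `u-` high-memory types) are deliberately not handled.

```python
import re
from typing import NamedTuple

# Only the suffixes this article describes.
SUFFIX_MEANING = {
    "g": "Graviton (ARM)",
    "n": "enhanced networking",
    "d": "local NVMe",
    "a": "AMD",
}

_PATTERN = re.compile(r"^([a-z]+?)(\d+)([a-z]*)\.([a-z0-9]+)$")


class InstanceType(NamedTuple):
    family: str
    generation: int
    suffixes: str
    size: str


def parse(name: str) -> InstanceType:
    """Split an instance type name such as 'c7g.xlarge' into its parts."""
    match = _PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognised instance type: {name!r}")
    family, generation, suffixes, size = match.groups()
    return InstanceType(family, int(generation), suffixes, size)


def explain(name: str) -> str:
    """Describe a type name, flagging suffix letters not covered here."""
    t = parse(name)
    notes = [SUFFIX_MEANING.get(s, f"'{s}' (not covered in this article)") for s in t.suffixes]
    extra = f" [{', '.join(notes)}]" if notes else ""
    return f"family {t.family}, generation {t.generation}, size {t.size}{extra}"


print(explain("m5.large"))
print(explain("m7g.large"))
```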

The t family deserves a warning: it runs on CPU credits. You bank credits while below a baseline and spend them to burst above it. In standard mode, running out throttles you to the baseline (e.g. ~20% of each vCPU for t3.medium); in unlimited mode, which is the launch default for T3 and T4g, you keep bursting and pay a surcharge for the extra vCPU time. A t3 instance that’s busy 24×7 is the wrong choice (it’ll throttle or surcharge); move to m.
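The credit mechanics are easier to reason about with numbers. The sketch below models a burstable instance in standard mode, assuming one credit equals one vCPU at 100% for one minute. The t3.medium figures used as defaults (2 vCPUs, 24 credits earned per hour, a 576-credit maximum balance) are taken from AWS's published burstable-instance figures and should be treated as assumptions to verify; the function name is hypothetical.

```python
def hours_until_throttled(
    balance: float,
    utilisation_per_vcpu: float,
    vcpus: int = 2,
    earn_per_hour: float = 24.0,
) -> float:
    """Hours a sustained load can run before the credit balance hits zero.

    Returns infinity when the load is at or below the baseline, because
    credits are then earned at least as fast as they are spent.
    """
    spend_per_hour = vcpus * utilisation_per_vcpu * 60  # credits per hour
    net_drain = spend_per_hour - earn_per_hour
    if net_drain <= 1e-9:
        return float("inf")
    return balance / net_drain


# A full balance, both vCPUs held at 50%.
print(hours_until_throttled(balance=576, utilisation_per_vcpu=0.5))  # 16.0
# Well below the 20% baseline: the balance never drains.
print(hours_until_throttled(balance=576, utilisation_per_vcpu=0.1))  # inf
```

A service that sits at 50% CPU all day exhausts even a full balance before the day is over, which is the signal to move to the m family.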

Purchase models — the 90% price lever

How you buy the same instance changes the price more than which instance you pick. On-Demand is the flexible default; commit to usage and you save up to ~72%; run on spare capacity (Spot) and save up to ~90%, at the risk of an interruption with only a 2-minute notice.

Purchase model	Discount vs On-Demand	Commitment	Interruptible?	Best for
On-Demand	0% (baseline)	None	No	Spiky/unpredictable, short-lived, dev
Reserved Instance (Standard)	up to ~72%	1 or 3 yr, instance family	No	Steady 24×7, known instance type
Reserved Instance (Convertible)	up to ~66%	1 or 3 yr, swappable	No	Steady but family may change
Compute Savings Plan	up to ~66%	1 or 3 yr, $/hr commit	No	Steady spend, flexible across family/region/Fargate/Lambda
EC2 Instance Savings Plan	up to ~72%	1 or 3 yr, family+region	No	Steady within a family
Spot	up to ~90%	None	Yes (2-min notice)	Fault-tolerant batch, stateless, CI
Dedicated Instance	premium	None	No	Compliance: no shared hardware
Dedicated Host	premium	optional	No	BYOL socket/core licensing
Capacity Reservation	On-Demand rate + reserved	None (reserve capacity)	No	Guaranteed capacity in an AZ
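A commitment only pays off if the instance would otherwise have run for enough hours. As a rough sketch: with a discount d, the committed price beats On-Demand once the instance runs more than a fraction 1 - d of the month. The hourly rate and discount below are hypothetical round numbers, and a 730-hour average month is assumed.

```python
HOURS_PER_MONTH = 730  # average month; an assumption for illustration


def breakeven_utilisation(discount: float) -> float:
    """Fraction of the month an instance must run for a commitment to win."""
    return 1.0 - discount


def on_demand_cost(hourly: float, hours_used: float) -> float:
    """On-Demand bills only the hours actually used."""
    return hourly * hours_used


def committed_cost(hourly: float, discount: float) -> float:
    """A commitment bills every hour of the month at the discounted rate."""
    return hourly * (1.0 - discount) * HOURS_PER_MONTH


rate, discount = 0.10, 0.40  # hypothetical: $0.10/hr On-Demand, 40% discount
for hours in (300, 438, 600):
    print(hours, round(on_demand_cost(rate, hours), 2), round(committed_cost(rate, discount), 2))
```

Below the break-even (about 438 hours here) On-Demand is cheaper despite the higher rate. Spot is not a commitment and is not covered by this arithmetic.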

Launch a basic instance and attach an instance profile:

# Placeholder IDs throughout; the AMI must be an arm64 image because m7g is Graviton
aws ec2 run-instances \
  --image-id ami-0abcd1234ef567890 \
  --instance-type m7g.large \
  --iam-instance-profile Name=app-instance-profile \
  --security-group-ids sg-0a1b2c3d \
  --subnet-id subnet-0e1f2a3b \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=app-prod}]'

resource "aws_instance" "app" {
  ami                    = data.aws_ami.al2023.id
  instance_type          = "m7g.large"   # Graviton: ~20% cheaper per vCPU
  iam_instance_profile   = aws_iam_instance_profile.app.name
  vpc_security_group_ids = [aws_security_group.app.id]
  subnet_id              = aws_subnet.private_a.id
  tags                   = { Name = "app-prod" }
}

Auto Scaling — you own the policy

A single instance is a single point of failure. Production EC2 runs in an Auto Scaling Group (ASG) spanning ≥2 AZs, fronted by a load balancer, scaling on a target metric. You write the policy — that’s the cost of control. The scaling-policy choices:

Scaling policy	How it decides	When to use	Gotcha
Target tracking	Holds a metric at a target (e.g. CPU 50%)	Most web/app tiers	Pick the right metric (CPU vs ALB requests/target)
Step scaling	Add/remove N by alarm thresholds	Fine-grained custom response	More tuning; can oscillate
Simple scaling	One adjustment per alarm + cooldown	Legacy; avoid	Cooldown blocks reacting to fast spikes
Scheduled	Scale at known times	Predictable daily/weekly peaks	Doesn’t react to surprises
Predictive	ML forecasts and pre-scales	Recurring cyclical load	Needs history; pairs with dynamic
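Target tracking is easiest to picture as proportional control. The sketch below is an approximation, not the service's actual algorithm (which works through CloudWatch alarms and is deliberately conservative when scaling in); the function name and parameters are hypothetical.

```python
import math


def desired_capacity(current: int, metric: float, target: float,
                     min_size: int, max_size: int) -> int:
    """Scale the fleet so the per-instance metric lands near the target."""
    if target <= 0:
        raise ValueError("target must be positive")
    raw = math.ceil(current * metric / target)
    # The group's min/max bounds always win over the computed value.
    return max(min_size, min(max_size, raw))


# 4 instances at 80% CPU with a 50% target: about 7 are needed.
print(desired_capacity(current=4, metric=80, target=50, min_size=2, max_size=20))  # 7
```

The same arithmetic shows why metric choice matters: CPU on an I/O-bound tier barely moves under load, so the computed capacity never rises.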

Storage choices — durable vs ephemeral

An instance’s disk is a decision, not a given. Get it wrong and you either lose data when the instance stops or pay for IOPS you don’t need.

Storage	Persists across stop/terminate?	Performance	Cost model	Use for
EBS gp3	Yes (network block)	Baseline 3,000 IOPS, tunable	Per GB-month + provisioned IOPS/throughput	General root + data volumes
EBS io2 Block Express	Yes	Up to 256k IOPS, sub-ms	Premium per GB + IOPS	Latency-critical databases
EBS st1 / sc1	Yes	Throughput-optimised HDD	Cheapest per GB	Big sequential / cold data
Instance store (NVMe)	No — wiped on stop	Highest local IOPS, no network hop	Included in instance price	Scratch, caches, shardable temp
EFS	Yes (shared NFS)	Scales with usage	Per GB + throughput mode	Shared filesystem across instances
FSx	Yes	Lustre/Windows/ONTAP/OpenZFS	Per GB + throughput	HPC, Windows shares, specialised FS

The trap: instance store looks free and fast, and it is, until someone stops or terminates the instance (or the underlying host fails) and the data is gone. A plain reboot keeps it; nothing else does. Only put reconstructable data on it.

EC2 limits and gotchas that bite

Limit / gotcha	Reality	Why it matters
Per-region vCPU quota	Soft limit, raise via Service Quotas	Large fleets/GPU need a quota increase first
ENI / IP per instance	Bounded by instance size	Task density (ECS `awsvpc` on EC2) and pod density (EKS) are capped by the ENI/IP limit
Instance store is ephemeral	Local NVMe wiped on stop/terminate	Never put durable data on instance store
Stopping ≠ free	EBS volumes still bill when stopped	“Stopped to save money” still costs storage
Patching is yours	No automatic OS patches	Unpatched AMIs = your CVE exposure
Right-sizing drift	Instances outlive their workload	Quarterly review or pay for idle vCPUs

Lambda — serverless functions, and the walls you design around

Lambda runs your code in response to an event with no server to manage: you upload a handler, pick a runtime and a memory size, and AWS executes it on demand, scaling from zero to thousands of concurrent executions automatically. You pay only while code runs (GB-second) plus per request. It is the right answer for event-driven work (an S3 upload, a queue message, an API request, a schedule) that is short (seconds to a few minutes) and spiky — and the wrong answer for anything long-running, steady-state, or needing a fixed environment, because Lambda’s limits are hard walls, not knobs.

The hard limits — memorise these

These define what Lambda cannot do. There is no setting to exceed them; a workload that needs more is disqualified.

Limit	Default / max	Tunable?	What hitting it looks like
Timeout	3 s default, 900 s (15 min) max	Up to 900 s	`Task timed out after 900.00 seconds`; partial work
Memory	128 MB default, 10,240 MB max	In 1 MB steps	OOM kill; `Runtime exited`
vCPU	Scales with memory (~1 vCPU/1,769 MB)	Indirect (via memory)	CPU-bound work slow until you raise memory
Ephemeral `/tmp`	512 MB default, 10,240 MB max	Yes	`No space left on device`
Sync payload	6 MB request + response	No	`RequestEntityTooLarge`
Async payload	256 KB	No	Event rejected
Deployment package	50 MB zipped (direct), 250 MB unzipped	No (use container image: 10 GB)	Upload rejected
Container image	10 GB	No	Image too large
Default concurrency	1,000 per region	Raise via quota	`TooManyRequestsException` (429), throttling
Layers	5 per function, 250 MB unzipped total	No	Can’t add another layer
Env var size	4 KB total	No	Create/update call rejected
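Because these limits are walls rather than knobs, the disqualification check can run before any design work. The sketch below uses the maxima from the table above; the function and key names are hypothetical.

```python
# Hard maxima from the limits table above.
LAMBDA_WALLS = {
    "timeout_s": 900,
    "memory_mb": 10_240,
    "tmp_mb": 10_240,
    "sync_payload_mb": 6,
    "async_payload_kb": 256,
}


def lambda_blockers(duration_s: float, memory_mb: float, tmp_mb: float = 0,
                    sync_payload_mb: float = 0, async_payload_kb: float = 0) -> list[str]:
    """List every hard limit a workload would exceed (empty list = Lambda is possible)."""
    needs = {
        "timeout_s": duration_s,
        "memory_mb": memory_mb,
        "tmp_mb": tmp_mb,
        "sync_payload_mb": sync_payload_mb,
        "async_payload_kb": async_payload_kb,
    }
    return [
        f"{name}: needs {value}, wall is {LAMBDA_WALLS[name]}"
        for name, value in needs.items()
        if value > LAMBDA_WALLS[name]
    ]


print(lambda_blockers(duration_s=16 * 60, memory_mb=1024))  # timeout wall
print(lambda_blockers(duration_s=30, memory_mb=512))        # []
```

An empty list means only that Lambda is possible; whether it is the cheapest fit is a separate question about traffic shape.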

Runtimes — managed or your own

Lambda gives you a managed runtime or lets you bring a container image / custom runtime.

Runtime option	Examples	When to choose	Note
Managed runtime	Node.js, Python, Java, .NET, Ruby	Standard languages, fastest start	AWS patches the runtime
Container image	Any, up to 10 GB	Large deps, custom binaries, parity with local	Bigger cold start; built like Docker
Custom runtime (OS-only)	Go, Rust, anything via the Runtime API	Compiled or unsupported languages	You own the bootstrap
Graviton (`arm64`)	Node/Python/etc. on ARM	~20% cheaper, often faster	Recompile native deps

Concurrency and cold starts — the two performance knobs

Lambda scales by running more concurrent executions. Two controls shape its behaviour under load and its latency on a cold path.

Control	What it does	When to set	Cost impact
Reserved concurrency	Caps (and guarantees) a function’s share of the account pool	Protect a downstream DB from too many connections; isolate a noisy function	Free; reduces pool for others
Provisioned concurrency	Keeps N execution environments warm	Latency-critical paths that can’t pay cold start	Charged per provisioned GB-hour
SnapStart (Java/.NET)	Snapshots an initialised runtime, restores fast	Java/.NET cold-start pain	Lower cold start, some caveats
Account concurrency limit	1,000 default, raisable	Bursty workloads exceeding 1,000	Quota request

What actually drives cold-start latency, and how to cut it:

Cold-start factor	Magnitude	Reduce it by
Runtime init	tens of ms (Node/Python) to seconds (JVM cold)	SnapStart (Java/.NET); lighter runtime
Package/image size	larger = slower pull/init	Trim deps; smaller image; layers
VPC ENI attach	historically seconds (now much faster)	Use only if you need VPC resources
Handler init code	your code outside the handler	Lazy-init; cache clients across invocations
Provisioned concurrency	eliminates cold start for N	Pre-warm latency-critical functions

Invocation models — how the event reaches the function

How Lambda is invoked changes retry behaviour, error handling and the payload limit you’re bound by. Three models:

Invocation model	Triggers	Retry on error	Payload limit	Note
Synchronous	API Gateway, ALB, direct `Invoke`	Caller handles it	6 MB	Caller waits for the response
Asynchronous	S3, SNS, EventBridge	2 automatic retries → DLQ	256 KB	AWS queues and retries for you
Poll-based (event source mapping)	SQS, Kinesis, DynamoDB Streams, Kafka (MSK)	Per-batch, configurable	Batch-bounded	Lambda polls and batches records

The retry column matters: an async failure silently retries twice and then drops to a dead-letter queue (if you configured one) — no DLQ means lost events. A poll-based SQS failure returns the batch to the queue, so a poison message can loop forever without a redrive policy.
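The poison-message loop and the redrive policy that ends it can be simulated in a few lines. This is a toy model of the behaviour, not the SQS or Lambda API: `deliver`, its parameters and the handlers are hypothetical, and visibility timeouts and batching are ignored.

```python
from typing import Callable, Tuple


def deliver(message: str, handler: Callable[[str], None],
            max_receive_count: int) -> Tuple[str, int]:
    """Redeliver a message until the handler succeeds or the receive budget is spent.

    Returns the outcome and how many times the message was received.
    """
    receives = 0
    while receives < max_receive_count:
        receives += 1
        try:
            handler(message)
            return "processed", receives
        except Exception:
            continue  # the message becomes visible again and is retried
    return "dead-lettered", receives


def poison(_: str) -> None:
    raise ValueError("cannot parse")


def healthy(_: str) -> None:
    return None


print(deliver("bad-record", poison, max_receive_count=3))   # ('dead-lettered', 3)
print(deliver("good-record", healthy, max_receive_count=3)) # ('processed', 1)
```

Without a receive budget the `while` condition would never fail for the poison message, which is the endless loop the paragraph above warns about.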

Create a function and wire a trigger:

aws lambda create-function \
  --function-name thumbnailer \
  --runtime python3.12 --architectures arm64 \
  --handler app.handler --timeout 60 --memory-size 1024 \
  --role arn:aws:iam::111122223333:role/thumbnailer-exec \
  --zip-file fileb://function.zip

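# Let S3 invoke it on object-created (event-driven, the canonical Lambda shape)
aws lambda add-permission --function-name thumbnailer \
  --statement-id s3invoke --action lambda:InvokeFunction \
  --principal s3.amazonaws.com --source-arn arn:aws:s3:::uploads-bucket \
  --source-account 111122223333

# The permission alone fires nothing; the bucket notification is the actual trigger
# (region ap-south-1 is assumed here to match the ECS example)
aws s3api put-bucket-notification-configuration --bucket uploads-bucket \
  --notification-configuration '{"LambdaFunctionConfigurations":[{
    "LambdaFunctionArn":"arn:aws:lambda:ap-south-1:111122223333:function:thumbnailer",
    "Events":["s3:ObjectCreated:*"]}]}'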

resource "aws_lambda_function" "thumbnailer" {
  function_name = "thumbnailer"
  runtime       = "python3.12"
  architectures = ["arm64"]   # Graviton: cheaper GB-second
  handler       = "app.handler"
  timeout       = 60          # seconds; hard ceiling is 900
  memory_size   = 1024        # MB; also scales vCPU
  role          = aws_iam_role.thumbnailer_exec.arn
  filename      = "function.zip"
}

The cost trap: Lambda is not always cheaper

Lambda’s “pay only when it runs” is a gift for spiky work and a trap for steady work. A function pinned at high concurrency 24×7 can cost far more than the equivalent always-on instance. The break-even reasoning:

Workload pattern	Lambda economics	Better on
Spiky / intermittent (idle most of the time)	Pay near-zero when idle — wins big	Lambda
Event-driven (per upload / message)	Pay per event; scales to zero	Lambda
Steady moderate (constant low-mid traffic)	GB-seconds add up	Borderline — model it
Sustained high (flat-out 24×7)	Paying full GB-second every second	EC2/Fargate (reserved)
Long-running (>15 min per unit)	Impossible (timeout wall)	ECS task / EC2
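The break-even can be modelled directly. The sketch below is a rough cost comparison under loud assumptions: the Lambda rates (USD 0.0000166667 per GB-second and USD 0.20 per million requests, x86) and the USD 0.096 hourly instance rate are illustrative list prices that vary by region and change over time, the free tier is ignored, and the function names are hypothetical.

```python
LAMBDA_GB_SECOND = 0.0000166667          # assumed x86 list price, USD
LAMBDA_PER_REQUEST = 0.20 / 1_000_000    # assumed request price, USD
HOURS_PER_MONTH = 730                    # average month


def lambda_monthly(requests: int, avg_ms: float, memory_mb: int) -> float:
    """Monthly Lambda cost: compute (GB-seconds) plus the per-request charge."""
    gb_seconds = requests * (avg_ms / 1000) * (memory_mb / 1024)
    return gb_seconds * LAMBDA_GB_SECOND + requests * LAMBDA_PER_REQUEST


def instance_monthly(hourly: float) -> float:
    """Monthly cost of one always-on instance, regardless of traffic."""
    return hourly * HOURS_PER_MONTH


always_on = instance_monthly(0.096)  # assumed On-Demand rate for a 2 vCPU / 8 GB instance
for requests in (1_000_000, 100_000_000):
    cost = lambda_monthly(requests, avg_ms=100, memory_mb=512)
    print(f"{requests:>11,} req/month: Lambda ${cost:.2f} vs instance ${always_on:.2f}")
```

At a million requests a month the function costs about a dollar; at a hundred million the same function costs more than the instance, before any reserved discount is applied to the instance side.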

ECS — AWS-native containers, two launch types

ECS is AWS’s own container orchestrator: you describe a task (one or more containers, their CPU/memory, ports, IAM role and health check in a task definition), and a service keeps a desired number of those tasks running, replacing failures and integrating with a load balancer. It’s simpler than Kubernetes — fewer concepts, no control-plane upgrades, deep AWS integration — and the control plane is free; you pay only for the data plane. The big fork is the launch type: run tasks on EC2 nodes you own (bin-packing, Spot, GPU, lowest cost at scale) or on Fargate (no nodes; pay per task-second).

EC2 vs Fargate launch type — the data-plane decision

Dimension	ECS on EC2	ECS on Fargate
Who manages the host	You (AMI, patching, scaling the ASG)	AWS (no host access)
Billing	Per EC2 instance-hour	Per task: vCPU-second + GB-second
Bin-packing	Yes — pack many tasks per instance	No — each task is isolated, sized exactly
Spot support	Yes (Spot instances)	Yes (Fargate Spot)
GPU / special hardware	Yes	No GPU support
Right-sizing granularity	Per instance	Per task (fine-grained)
Idle cost	Pay for the whole node even if under-packed	Pay only for running tasks
Best for	High, steady density; cost-optimised at scale	Variable load; minimal ops; small/spiky services
Cold start	Node already warm; task start fast	Task pull+start (tens of seconds)

Rule of thumb: start on Fargate (no node fleet to operate), move specific high-density or GPU workloads to EC2-backed capacity when bin-packing or reserved/Spot pricing demonstrably wins.
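Whether bin-packing "demonstrably wins" is a calculation, not a feeling. The sketch below compares the two data planes for a fleet of identical tasks. The Fargate rates (USD 0.04048 per vCPU-hour, USD 0.004445 per GB-hour) and the node price are assumed list prices for illustration, the helper names are hypothetical, and the model ignores the memory a real node reserves for the OS and the ECS agent.

```python
import math

FARGATE_VCPU_HOUR = 0.04048   # assumed list price, USD
FARGATE_GB_HOUR = 0.004445    # assumed list price, USD


def fargate_hourly(tasks: int, vcpu: float, gb: float) -> float:
    """Fargate bills each task for exactly the vCPU and memory it declares."""
    return tasks * (vcpu * FARGATE_VCPU_HOUR + gb * FARGATE_GB_HOUR)


def ec2_hourly(tasks: int, vcpu: float, gb: float,
               node_vcpu: float, node_gb: float, node_price: float) -> float:
    """EC2 bills whole nodes; tasks are packed until CPU or memory runs out."""
    per_node = min(int(node_vcpu // vcpu), int(node_gb // gb))
    if per_node < 1:
        raise ValueError("task does not fit on this node size")
    return math.ceil(tasks / per_node) * node_price


# Tasks of 0.5 vCPU / 1 GB on hypothetical 4 vCPU / 16 GB nodes at $0.192/hr.
for tasks in (3, 40):
    f = fargate_hourly(tasks, 0.5, 1)
    e = ec2_hourly(tasks, 0.5, 1, node_vcpu=4, node_gb=16, node_price=0.192)
    print(f"{tasks:>3} tasks: Fargate ${f:.3f}/hr vs EC2 ${e:.3f}/hr")
```

With three tasks the under-packed node costs more than twice the Fargate bill; at forty tasks the two are close at On-Demand rates, so the EC2 side wins only once Reserved or Spot pricing is applied.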

Task definition — the knobs that matter

The task definition is the blueprint. Get these right or the task won’t place, won’t reach its dependencies, or won’t be replaced when it crashes.

Setting	What it controls	Values / default	Gotcha
`cpu` / `memory` (task-level)	Total task size	Fargate: fixed valid combos (256 CPU/0.5–2 GB … up to 16 vCPU/120 GB)	Fargate rejects invalid CPU:mem pairs
`networkMode`	How containers get networking	`awsvpc` (own ENI), `bridge`, `host`, `none`	Fargate requires `awsvpc`
`executionRoleArn`	Role to pull image / write logs	IAM role	Missing → `CannotPullContainerError`
`taskRoleArn`	Role the app assumes for AWS APIs	IAM role	Don’t overscope; this is your app’s identity
`containerDefinitions[].healthCheck`	Per-container liveness	command + interval/timeout/retries	No health check → a hung or crash-looping container goes undetected
`essential`	Whether a container’s death kills the task	true/false	Sidecars usually `essential:false`
`logConfiguration`	Where stdout/stderr goes	`awslogs` → CloudWatch	Forgetting it = no logs to debug
`portMappings`	Ports exposed	container/host port	With `awsvpc`, host port = container port
`secrets`	Inject from SSM/Secrets Manager	valueFrom ARN	Don’t bake secrets into the image

aws ecs register-task-definition \
  --family api --network-mode awsvpc \
  --requires-compatibilities FARGATE --cpu 512 --memory 1024 \
  --execution-role-arn arn:aws:iam::111122223333:role/ecsTaskExecutionRole \
  --task-role-arn arn:aws:iam::111122223333:role/api-task-role \
  --container-definitions '[{
    "name":"api","image":"111122223333.dkr.ecr.ap-south-1.amazonaws.com/api:1.4.2",
    "portMappings":[{"containerPort":8080}],
    "healthCheck":{"command":["CMD-SHELL","curl -f http://localhost:8080/healthz || exit 1"],
      "interval":30,"timeout":5,"retries":3},
    "logConfiguration":{"logDriver":"awslogs","options":{
      "awslogs-group":"/ecs/api","awslogs-region":"ap-south-1","awslogs-stream-prefix":"api"}}
  }]'

resource "aws_ecs_service" "api" {
  name            = "api"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.api.arn
  desired_count   = 3
  launch_type     = "FARGATE"

  network_configuration {
    subnets          = aws_subnet.private[*].id
    security_groups  = [aws_security_group.api.id]
    assign_public_ip = false
  }
  load_balancer {
    target_group_arn = aws_lb_target_group.api.arn
    container_name   = "api"
    container_port   = 8080
  }
}

Networking modes — and why Fargate forces `awsvpc`

The networkMode decides how containers get IPs and how they’re reachable. Fargate only supports one of them, which removes the choice but also the foot-guns.

`networkMode`	How it works	Pros	Cons	Available on
`awsvpc`	Each task gets its own ENI + private IP	Per-task SG, clean isolation, ALB IP targets	Consumes ENIs/IPs (subnet sizing matters)	EC2 and Fargate (required)
`bridge`	Docker bridge, port mapping on the host	Many tasks per host, fewer IPs	Shared host SG, dynamic ports	EC2 only
`host`	Container shares the host network stack	Lowest overhead	No two tasks on the same port	EC2 only
`none`	No external networking	Maximum isolation	Task can’t reach the network	EC2 only

awsvpc is the modern default: a per-task ENI gives each task its own security group and lets the ALB target task IPs directly. The cost is IP consumption: a service running 200 tasks needs 200 free IPs across its subnets, so size the CIDRs accordingly or tasks will stick in `PROVISIONING`, the subnet-exhaustion failure listed in the ECS error reference.
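
To make the subnet-sizing point concrete, here is a rough capacity check using Python's standard `ipaddress` module. It relies on the fact that AWS reserves five addresses in every subnet; the 20% headroom and the function names are assumptions for illustration, and the check ignores other ENI consumers in the same subnets (load balancer nodes, VPC endpoints, other services).

```python
import ipaddress
import math

AWS_RESERVED_IPS_PER_SUBNET = 5  # AWS keeps the first four addresses and the last one


def usable_ips(cidr: str) -> int:
    """Addresses a subnet can actually hand out to ENIs."""
    network = ipaddress.ip_network(cidr)
    return max(network.num_addresses - AWS_RESERVED_IPS_PER_SUBNET, 0)


def subnets_fit(task_count: int, subnet_cidrs: list[str], headroom_pct: int = 20) -> bool:
    """True if the subnets hold one IP per task plus headroom for rollouts."""
    capacity = sum(usable_ips(cidr) for cidr in subnet_cidrs)
    needed = task_count + math.ceil(task_count * headroom_pct / 100)
    return needed <= capacity


small = ["10.0.1.0/26", "10.0.2.0/26", "10.0.3.0/26"]   # 59 usable each
large = ["10.0.16.0/24", "10.0.17.0/24", "10.0.18.0/24"]  # 251 usable each

print(subnets_fit(200, small))
print(subnets_fit(200, large))
```

The headroom matters because a rolling deployment briefly runs old and new tasks side by side, so the peak IP demand is higher than the steady-state task count.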

Service autoscaling and capacity providers

ECS scales the service (task count) via Application Auto Scaling, and on EC2 you also scale the node fleet via a capacity provider. The two layers:

Scaling layer	Mechanism	Target	Note
Service (task count)	Target tracking / step	ECS service desired count	Scale on CPU, memory, or ALB requests/target
Capacity provider (EC2)	Managed scaling of the ASG	Cluster instance count	Keeps headroom for new tasks
Fargate	None to manage	—	AWS provisions per task automatically

ECS error reference — what blocks a task

Error / state	Meaning	Likely cause	Fix
`CannotPullContainerError`	Image pull failed	Bad image tag, no ECR perms on execution role, no route to ECR	Fix tag; grant `AmazonECSTaskExecutionRolePolicy`; VPC endpoint/NAT
`ResourceInitializationError`	Couldn’t init networking/secrets	No route to fetch secrets/logs	NAT or VPC endpoints for SSM/Secrets/ECR/logs
Task stuck `PROVISIONING`	Can’t get an ENI	Subnet IP exhaustion, SG/subnet misconfig	Free IPs; check subnet/SG
Task `PENDING` forever (EC2)	No capacity to place	Cluster full, wrong instance attributes	Scale ASG; check CPU/mem reservation
Service flapping (tasks cycle)	Health check failing	Bad health path, slow start	Fix `/healthz`; raise start period
`OutOfMemory` (137)	Container exceeded memory	Under-sized task memory	Raise `memory`; fix leak
`essential container exited`	A required container died	App crash on boot	Read CloudWatch logs; fix startup

EKS — managed Kubernetes when you actually need it

EKS is managed Kubernetes: AWS runs the control plane (the API server, etcd, scheduler — across 3 AZs, patched and highly available) for a flat $0.10/hour per cluster, and you run the data plane (the nodes or Fargate that host your pods). You get the full Kubernetes API, the CNCF ecosystem (Helm, operators, service mesh, the whole tooling universe) and cluster portability across clouds. You also get Kubernetes’ operational surface: version upgrades on both planes, add-on management (CNI, CoreDNS, kube-proxy, CSI drivers), IRSA/Pod Identity for AWS access, and a genuinely steeper learning curve. Choose EKS when you already run Kubernetes, need its API/ecosystem, or require multi-cloud portability — not because it’s fashionable.

ECS vs EKS — the honest comparison

This is the decision that costs teams the most when they get it wrong. EKS is more powerful and more work; ECS is simpler and AWS-only.

Dimension	ECS	EKS
Orchestrator	AWS proprietary	Kubernetes (CNCF)
Control-plane cost	Free	$0.10/hr (~$73/mo) per cluster
Learning curve	Gentle (few concepts)	Steep (pods, deployments, services, RBAC, CRDs…)
Operational surface	Small	Large (upgrades, add-ons, CNI, CVEs)
Ecosystem	AWS-native integrations	Vast (Helm, operators, mesh, ArgoCD)
Portability	Locked to AWS	Portable across K8s anywhere
Networking	Simple (`awsvpc`)	CNI (IP-per-pod, more powerful, more complex)
Best fit	Few-to-many services, AWS-committed, lean team	K8s shops, complex platforms, portability needs
Version upgrades	None (AWS-managed)	You drive control + data plane

EKS data-plane options — how you run pods

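EKS pods run on one of three data planes (often mixed in one cluster). Karpenter, the fourth row, is not a data plane of its own but a controller that provisions EC2 nodes on demand: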

Data plane	What it is	You manage	Best for	Trade-off
Managed node groups	EC2 nodes EKS provisions/rotates for you	Instance type, size, upgrades (assisted)	General workloads, GPU, Spot	You still own AMI/upgrade cadence
Self-managed nodes	Your own ASG of nodes	Everything	Maximum customisation	Most toil
Fargate	Serverless pods (one pod per micro-VM)	Just manifests + Fargate profiles	Spiky/isolated workloads, no node ops	No DaemonSets, per-pod overhead, limited sizes
Karpenter	Just-in-time node provisioning (controller)	NodePool / EC2NodeClass config	Fast, cost-optimised, right-sized scaling	A controller to operate

EKS scaling — two independent loops

Kubernetes scales pods and nodes separately; you wire both.

Scaler	Scales	Reacts to	Note
HPA (Horizontal Pod Autoscaler)	Pod replica count	CPU/memory/custom metrics	Needs metrics-server
VPA (Vertical Pod Autoscaler)	Pod CPU/mem requests	Usage over time	Restarts pods to resize
Cluster Autoscaler	Node count (node groups)	Unschedulable pods	Per-node-group; slower
Karpenter	Nodes (any shape)	Unschedulable pods	Picks instance types live; faster, cheaper
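
The pod-level loop is easier to reason about with its arithmetic written out. The Kubernetes documentation gives the HPA rule as desired = ceil(current × currentMetric / targetMetric), skipped when the ratio is within a tolerance (0.1 by default). The sketch below implements only that core rule; it leaves out stabilisation windows, unready pods and multiple metrics, and the function name is hypothetical.

```python
import math


def hpa_desired_replicas(current_replicas: int, current_metric: float,
                         target_metric: float, min_replicas: int,
                         max_replicas: int, tolerance: float = 0.1) -> int:
    """Core HPA scaling rule, clamped to the min/max replica bounds."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: do nothing
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# 3 pods at 90% CPU against a 60% target: scale out
print(hpa_desired_replicas(3, 90, 60, min_replicas=2, max_replicas=10))
# 4 pods at 63% against 60%: inside the tolerance band, no change
print(hpa_desired_replicas(4, 63, 60, min_replicas=2, max_replicas=10))
# 10 pods at 20% against 60%: scale in
print(hpa_desired_replicas(10, 20, 60, min_replicas=2, max_replicas=10))
```

If the new replicas do not fit on existing nodes they go `Pending`, and that is the signal the node-level loop (Cluster Autoscaler or Karpenter) reacts to.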

Create a cluster and a managed node group (AWS CLI first, then the Terraform equivalent):

# Control plane (AWS manages it across 3 AZs)
aws eks create-cluster --name prod \
  --role-arn arn:aws:iam::111122223333:role/eksClusterRole \
  --resources-vpc-config subnetIds=subnet-a,subnet-b,subnet-c \
  --kubernetes-version 1.30

# A managed node group for the data plane
aws eks create-nodegroup --cluster-name prod --nodegroup-name general \
  --node-role arn:aws:iam::111122223333:role/eksNodeRole \
  --subnets subnet-a subnet-b subnet-c \
  --instance-types m7g.large --ami-type AL2023_ARM_64_STANDARD \
  --scaling-config minSize=2,maxSize=10,desiredSize=3

resource "aws_eks_cluster" "prod" {
  name     = "prod"
  role_arn = aws_iam_role.eks_cluster.arn
  version  = "1.30"
  vpc_config { subnet_ids = aws_subnet.private[*].id }
}

resource "aws_eks_node_group" "general" {
  cluster_name    = aws_eks_cluster.prod.name
  node_group_name = "general"
  node_role_arn   = aws_iam_role.eks_node.arn
  subnet_ids      = aws_subnet.private[*].id
  instance_types  = ["m7g.large"]             # Graviton nodes
  ami_type        = "AL2023_ARM_64_STANDARD"  # arm64 AMI to match Graviton
  scaling_config {
    min_size     = 2
    max_size     = 10
    desired_size = 3
  }
}

EKS failure reference — the classics

Symptom	Meaning	Likely cause	Fix
Pod `Pending`	No node can schedule it	No capacity / taints / resource requests too big	Scale nodes (CA/Karpenter); check requests, taints, AZ
Pod `ImagePullBackOff`	Can’t pull the image	Bad tag, no ECR auth, no route	Fix tag; node role ECR perms; NAT/endpoint
Pod `CrashLoopBackOff`	Container keeps dying	App crash, bad config, failing probe	`kubectl logs`; fix startup/probe
`0/3 nodes available`	Scheduler can’t place	Taints, insufficient resources, AZ mismatch	Tolerations; bigger nodes; spread AZs
Service has no endpoints	LB can’t reach pods	Selector mismatch, failing readiness	Fix label selector; readiness probe
`AccessDenied` from pod	Pod can’t call AWS API	Missing IRSA / Pod Identity	Bind a service account to an IAM role
Node `NotReady`	kubelet unhealthy	CNI/disk/network issue	Check node, CNI add-on, disk pressure

Fargate — the serverless data plane

Fargate isn’t a separate orchestrator — it’s a serverless data plane for ECS and EKS. Instead of running and patching EC2 nodes, you ask for a task or pod and AWS provisions an isolated micro-VM sized exactly to your request, billed per-second of vCPU and memory. You never see the host. It removes the entire node-management burden (no AMIs, no patching, no cluster autoscaler for nodes, no SSH) at a per-compute premium over equivalently-utilised EC2. It earns that premium when your load is variable, your team is lean, or your density is low; it loses to EC2 when you can pack a node tightly and buy it reserved or Spot.

Question	If yes → Fargate	If yes → EC2-backed
Is load variable/spiky?	✓ (pay per task)
Is the team lean on ops?	✓ (no nodes)
Can you pack a node ≥70%?		✓ (bin-pack wins)
Need GPU / special hardware?		✓
Want Reserved/Spot EC2 pricing at scale?		✓ (lowest cost)
Need DaemonSets (EKS)?		✓ (Fargate has none)
Want minimum time-to-first-deploy?	✓
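
The "pack a node ≥70%" row can be sanity-checked with a little arithmetic. The sketch below uses the indicative On-Demand rates quoted in this guide's cost section (Fargate per vCPU-hour and GB-hour, and an `m7g.large` at about $0.0856/hr); treat them as placeholders and substitute current prices. The break-even is simply the node's hourly price divided by what Fargate would charge for the same vCPU and memory.

```python
FARGATE_VCPU_HOUR = 0.04048   # indicative, from this guide's price table
FARGATE_GB_HOUR = 0.004445    # indicative, from this guide's price table


def fargate_hourly(vcpu: float, memory_gb: float) -> float:
    """Hourly Fargate cost for a given amount of vCPU and memory."""
    return vcpu * FARGATE_VCPU_HOUR + memory_gb * FARGATE_GB_HOUR


def break_even_packing(node_hourly: float, node_vcpu: float, node_memory_gb: float) -> float:
    """Fraction of the node that tasks must occupy before EC2 beats Fargate."""
    return node_hourly / fargate_hourly(node_vcpu, node_memory_gb)


# m7g.large: 2 vCPU / 8 GB at an indicative $0.0856/hr
ratio = break_even_packing(0.0856, node_vcpu=2, node_memory_gb=8)
print(f"EC2 wins once the node is more than {ratio:.0%} packed")
```

Under these assumed rates the break-even lands in the low 70s percent, consistent with the rule of thumb in the table. Reserved or Spot pricing on the node pushes the break-even lower; the engineering time spent operating nodes is not in this calculation at all.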

Valid Fargate task sizes are fixed combinations — you can’t ask for arbitrary CPU:memory:

Task vCPU	Valid memory range
0.25 vCPU	0.5, 1, 2 GB
0.5 vCPU	1–4 GB (1 GB steps)
1 vCPU	2–8 GB (1 GB steps)
2 vCPU	4–16 GB (1 GB steps)
4 vCPU	8–30 GB (1 GB steps)
8 vCPU	16–60 GB (4 GB steps)
16 vCPU	32–120 GB (8 GB steps)
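
The table translates directly into a lookup that can reject an invalid pair before the API does. This is a minimal sketch that transcribes the rows above into the units the CLI expects (CPU units at 1,024 per vCPU, memory in MiB); the function names are hypothetical.

```python
GIB = 1024  # MiB per GiB

# CPU units -> valid memory values in MiB, transcribed from the table above
FARGATE_SIZES = {
    256: [512, 1 * GIB, 2 * GIB],
    512: list(range(1 * GIB, 4 * GIB + 1, GIB)),
    1024: list(range(2 * GIB, 8 * GIB + 1, GIB)),
    2048: list(range(4 * GIB, 16 * GIB + 1, GIB)),
    4096: list(range(8 * GIB, 30 * GIB + 1, GIB)),
    8192: list(range(16 * GIB, 60 * GIB + 1, 4 * GIB)),
    16384: list(range(32 * GIB, 120 * GIB + 1, 8 * GIB)),
}


def is_valid_fargate_size(cpu_units: int, memory_mib: int) -> bool:
    """True if the pair is one of the fixed Fargate combinations."""
    return memory_mib in FARGATE_SIZES.get(cpu_units, [])


def smallest_fit(cpu_units_needed: int, memory_mib_needed: int):
    """Lowest-CPU valid (cpu, memory) pair covering the request, or None."""
    for cpu in sorted(FARGATE_SIZES):
        if cpu < cpu_units_needed:
            continue
        for memory in FARGATE_SIZES[cpu]:
            if memory >= memory_mib_needed:
                return cpu, memory
    return None


print(is_valid_fargate_size(512, 1024))   # the pair used in the CLI example
print(is_valid_fargate_size(256, 4096))   # 0.25 vCPU cannot take 4 GB
print(smallest_fit(256, 3072))            # memory need forces a CPU step up
```

`smallest_fit` shows the practical consequence of fixed combinations: a memory-hungry task may have to buy more vCPU than it uses just to reach a valid memory size.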

Architecture at a glance

The diagram below traces a single product across all the compute homes it might legitimately use — not “pick one” but “place each workload on the service whose billing and operational model fits its lifecycle.” Follow it left to right as the request and event path. Traffic enters through the edge and routing zone (CloudFront for static/cache, an ALB or API Gateway as the front door). The synchronous, request-shaped work lands in the request compute zone: a containerised API on ECS Fargate (no nodes to run) for the steady microservice, and a thin Lambda behind API Gateway for the spiky, event-shaped endpoints. Heavier or special work sits in the specialised compute zone: an EC2 Auto Scaling fleet for the GPU/Windows/licensed workload that needs a real machine, and an EKS cluster for the platform team that already lives in Kubernetes. Asynchronous work flows through the event & async zone — an SQS queue and EventBridge decoupling producers from a fleet of Lambda consumers and Fargate workers — and everything emits logs and metrics to the observability zone (CloudWatch), which is where every failure below is confirmed.

The numbered badges mark the five places a compute choice most often goes wrong: a Lambda hitting its 15-minute wall, an ECS task that can’t pull its image, an EKS pod stuck Pending with no node, an EC2 fleet bleeding money while idle, and a Fargate task rejected for an invalid CPU:memory pair. The legend narrates each as symptom · how to confirm · fix. Read the picture as the map: arrival path across the top, the compute menu in the middle tiers, and the diagnostic pins on the exact hop where each mistake bites.

Real-world scenario

FreightLink, a logistics SaaS in Pune, ran everything on EC2 — eighteen m5.large and c5.xlarge instances across two AZs, hand-rolled with Ansible, patched on a monthly maintenance window that nobody enjoyed. Their AWS bill was ₹6.8 lakh/month and climbing, and a quarterly review found the embarrassing truth: average fleet CPU was 14%. They were paying for a 24×7 fleet sized for a peak that lasted ninety minutes a day, and the on-call rotation spent most of its energy on OS patching and capacity guesswork rather than the product.

The platform lead ran the workloads through exactly the model in this article — match the lifecycle to the meter — and re-homed them one class at a time. The shipment-label generator, a CPU job that ran for ~40 seconds whenever a label was requested (a few thousand times a day, in bursts), was the worst fit for an always-on instance: it moved to Lambda at 1,024 MB with arm64, triggered by SQS. It now costs about ₹4,000/month and scales to zero between bursts — down from a dedicated c5.xlarge running flat-out 24×7. The customer-facing API and tracking services, six stateless microservices with steady mid-day traffic, moved to ECS on Fargate behind an ALB, sized per service (0.5–1 vCPU each) with service autoscaling on ALB requests-per-target. Deploys that used to mean an Ansible run and a held breath became a terraform apply that rolled tasks with zero downtime, and the team stopped SSHing into anything.

Two workloads stayed close to the metal — correctly. The route-optimisation engine used a GPU and a licensed solver, so it stayed on EC2 (g5.xlarge, now on a 1-year Compute Savings Plan because its load was steady). And the data-science platform the analytics team had already built on Kubernetes stayed on EKS — re-platforming it to ECS would have thrown away their Helm charts and operators for no benefit; instead they moved its node group to Graviton and Spot for the batch pools. The one stumble: the first Lambda cut over with the default 3-second timeout inherited from a copy-pasted template, and large multi-page labels intermittently failed with Task timed out after 3.00 seconds. Confirmed in CloudWatch Logs in two minutes, fixed by raising the timeout to 60 s and the memory (which also raised vCPU, halving the runtime). Six weeks later the bill was ₹3.9 lakh/month — a 43% cut — fleet CPU on the remaining EC2 was a healthy 55%, and the monthly patching window was gone for everything except the two instance-based workloads that genuinely needed it. The lesson FreightLink internalised: EC2 wasn’t wrong, defaulting to it was.

Advantages and disadvantages

Each service is a bundle of trade-offs; the table makes them explicit, then the prose says when each side matters.

Service	Advantages	Disadvantages
EC2	Total control; any OS/kernel/GPU; lowest cost at sustained scale (Reserved/Spot); no cold start when always-on	You patch & scale everything; idle cost; operational toil; over-provisioning risk
Lambda	No servers; scales to zero and to thousands; pay-per-use; fastest path for event glue	Hard 15-min/10 GB walls; cold starts; cost trap at sustained load; harder local parity
ECS	Simpler than K8s; free control plane; deep AWS integration; Fargate or EC2	AWS-only (less portable); smaller ecosystem than K8s
EKS	Full Kubernetes API; huge ecosystem; portable; battle-tested at scale	$0.10/hr/cluster; steep curve; you own upgrades/add-ons/CVEs
Fargate	No nodes to manage; per-task billing; right-size each task; fast to ship	Premium over packed EC2; fixed size combos; no DaemonSets/GPU

Control matters when the workload has a hard requirement the managed options can’t satisfy — a kernel module, a GPU, BYOL licensing, persistent local NVMe, or sustained 24×7 load where Reserved EC2 is simply the cheapest place to run. There, EC2’s “disadvantages” are the price of admission and worth paying. Convenience matters when the workload is ordinary and your scarcest resource is engineering time: a stateless API, a queue consumer, an event handler. There, Lambda or Fargate’s premium buys back the weeks you’d otherwise spend patching and scaling — almost always a good trade for a lean team. Portability (EKS) matters when you genuinely run multi-cloud or have deep Kubernetes investment; it’s a real cost you should only pay for a real need, not a hypothetical one. The failure mode in every direction is the same: choosing the bundle for its headline feature while ignoring the column of costs that comes attached.

Hands-on lab

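A low-cost walk-through that deploys the same trivial container as a Fargate task and the same logic as a Lambda, so you feel the two models side by side. The Lambda half fits the free tier; Fargate has no free tier, but one short-lived 0.25 vCPU task costs a fraction of a cent. Region ap-south-1. Tear everything down at the end.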

1. Set up variables and a log group.

export AWS_REGION=ap-south-1
# Separate line so the region above is already exported when this call runs
export ACCT=$(aws sts get-caller-identity --query Account --output text)
aws logs create-log-group --log-group-name /lab/compute || true

2. Deploy a Lambda (the event/glue model). A tiny function that returns a greeting — stands in for “event-shaped work.”

cat > index.py <<'PY'
def handler(event, context):
    return {"statusCode": 200, "body": "hello from Lambda"}
PY
zip function.zip index.py

# Minimal execution role (trust + basic logging) — created once
aws iam create-role --role-name lab-lambda-exec \
  --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"lambda.amazonaws.com"},"Action":"sts:AssumeRole"}]}'
aws iam attach-role-policy --role-name lab-lambda-exec \
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
sleep 10  # let the role propagate

aws lambda create-function --function-name lab-hello \
  --runtime python3.12 --architectures arm64 --handler index.handler \
  --timeout 10 --memory-size 128 \
  --role arn:aws:iam::${ACCT}:role/lab-lambda-exec \
  --zip-file fileb://function.zip

3. Invoke it and watch it scale from zero.

aws lambda invoke --function-name lab-hello out.json && cat out.json
# Expected: {"statusCode": 200, "body": "hello from Lambda"}

4. Deploy the same idea as a Fargate task (the container model). Use a public amazonlinux image that prints and exits — stands in for “container-shaped work.”

aws ecs create-cluster --cluster-name lab-cluster

# Assumes the ecsTaskExecutionRole role already exists in this account
aws ecs register-task-definition --family lab-hello \
  --requires-compatibilities FARGATE --network-mode awsvpc \
  --cpu 256 --memory 512 \
  --execution-role-arn arn:aws:iam::${ACCT}:role/ecsTaskExecutionRole \
  --container-definitions '[{"name":"hello","image":"public.ecr.aws/amazonlinux/amazonlinux:2023",
    "command":["/bin/sh","-c","echo hello from Fargate"],"essential":true,
    "logConfiguration":{"logDriver":"awslogs","options":{
      "awslogs-group":"/lab/compute","awslogs-region":"ap-south-1","awslogs-stream-prefix":"hello"}}}]'

# Run one task in a public subnet (replace subnet/SG with yours)
aws ecs run-task --cluster lab-cluster --launch-type FARGATE \
  --task-definition lab-hello \
  --network-configuration 'awsvpcConfiguration={subnets=[subnet-xxxx],securityGroups=[sg-xxxx],assignPublicIp=ENABLED}'

5. Confirm the Fargate task ran by reading its log stream in CloudWatch (/lab/compute), where you’ll see hello from Fargate. Notice the difference you just felt: the Lambda returned in milliseconds with nothing to provision; the Fargate task took tens of seconds to pull and start — that’s the cold-start tax of the container path, and the per-task isolation you pay for.

6. Teardown — leave nothing billing.

aws lambda delete-function --function-name lab-hello
aws ecs delete-cluster --cluster lab-cluster
aws logs delete-log-group --log-group-name /lab/compute
aws iam detach-role-policy --role-name lab-lambda-exec \
  --policy-arn arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
aws iam delete-role --role-name lab-lambda-exec

Common mistakes & troubleshooting

The real failure modes, each as symptom → root cause → how to confirm → fix. These are the ones that actually page teams.

#	Symptom	Root cause	Confirm with	Fix
1	`Task timed out after 900.00 seconds` (or 3.00)	Work exceeds Lambda’s timeout (or default 3 s left in place)	CloudWatch Logs for the function	Raise timeout (≤900 s); if it needs >15 min, move to ECS/EC2
2	Lambda `429 TooManyRequestsException`	Hit the account concurrency limit (default 1,000 per Region)	Lambda `Throttles` metric	Raise account concurrency quota; add reserved concurrency; smooth with SQS
3	Lambda OOM / `Runtime exited`	Function exceeded its memory	Logs show `Runtime exited`; memory near max	Raise `memory-size` (also raises vCPU)
4	ECS task `CannotPullContainerError`	Execution role lacks ECR perms, or no route to ECR	ECS task `stoppedReason`	Attach `AmazonECSTaskExecutionRolePolicy`; add NAT or ECR/S3 VPC endpoints
5	ECS service flaps; tasks cycle	Health check fails (bad path / slow start)	Service events; target group health	Fix `/healthz`; raise health-check grace/start period
6	Fargate `register-task-definition` rejected	Invalid CPU:memory combination	API error message	Use a valid pair (e.g. 256 CPU → 0.5/1/2 GB)
7	EKS pod stuck `Pending`	No node has room / taints / requests too big	`kubectl describe pod` events	Scale nodes (Karpenter/CA); lower requests; fix taints/AZ
8	EKS pod `ImagePullBackOff`	Bad tag, node role lacks ECR, no route	`kubectl describe pod`	Fix tag; node role ECR perms; NAT/endpoint
9	EKS pod `AccessDenied` calling AWS	No IRSA / Pod Identity binding	App logs; `kubectl describe sa`	Bind service account to an IAM role (IRSA)
10	EC2 fleet bill high, CPU ~10%	Over-provisioned / idle always-on fleet	Cost Explorer + CloudWatch CPU	Right-size; move spiky work to Lambda; Savings Plan the rest
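11	`t3` instance mysteriously slow	CPU credits exhausted in standard credit mode, throttled to baseline	`CPUCreditBalance` near zero	Switch the credit mode to Unlimited (T3 launches in Unlimited by default, so check it was not changed) or move to `m` family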
12	EKS cluster cost surprise	Forgotten `$0.10/hr` per non-prod cluster	Billing per cluster	Consolidate clusters; namespaces over clusters where safe
13	Spot task/instance killed mid-job	Spot interruption (2-min notice)	Interruption notices / events	Make work idempotent/checkpointed; use capacity-optimized; mix On-Demand
14	Lambda cold starts hurt p99	Scale-from-zero + heavy init (or JVM)	Duration init metric / X-Ray	Provisioned concurrency; SnapStart (Java, Python, .NET); trim package

The meta-mistake behind half of these is choosing the service by familiarity and then fighting its model: forcing long work into Lambda (1, 3), running spiky work on always-on EC2 (10), or adopting EKS without budgeting for its control-plane cost and upgrade toil (12). Re-home the workload and the symptom disappears.
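
The "forcing long work into Lambda" mistake is cheap to catch at design time. The sketch below checks a workload against three of the Lambda ceilings this guide quotes (15-minute timeout, 10 GB memory, 6 MB synchronous payload). The `Workload` type and its fields are hypothetical, and the figures you feed it should be worst-case measurements rather than averages.

```python
from dataclasses import dataclass

LAMBDA_MAX_TIMEOUT_S = 900        # 15 minutes
LAMBDA_MAX_MEMORY_MB = 10_240     # 10 GB
LAMBDA_MAX_SYNC_PAYLOAD_MB = 6


@dataclass
class Workload:
    name: str
    worst_case_duration_s: float
    peak_memory_mb: int
    sync_payload_mb: float = 0.0


def lambda_blockers(w: Workload) -> list[str]:
    """Reasons this workload cannot run on Lambda; empty list means it fits."""
    blockers = []
    if w.worst_case_duration_s > LAMBDA_MAX_TIMEOUT_S:
        blockers.append(f"runs {w.worst_case_duration_s:.0f} s, ceiling is {LAMBDA_MAX_TIMEOUT_S} s")
    if w.peak_memory_mb > LAMBDA_MAX_MEMORY_MB:
        blockers.append(f"needs {w.peak_memory_mb} MB, ceiling is {LAMBDA_MAX_MEMORY_MB} MB")
    if w.sync_payload_mb > LAMBDA_MAX_SYNC_PAYLOAD_MB:
        blockers.append(f"payload {w.sync_payload_mb} MB, ceiling is {LAMBDA_MAX_SYNC_PAYLOAD_MB} MB")
    return blockers


etl = Workload("nightly-etl", worst_case_duration_s=4 * 3600, peak_memory_mb=4096)
labels = Workload("label-generator", worst_case_duration_s=40, peak_memory_mb=1024)

for workload in (etl, labels):
    print(workload.name, lambda_blockers(workload) or "fits Lambda's hard limits")
```

Passing this check only means Lambda is possible, not that it is the cheapest home; sustained high-volume work can fit every limit and still cost more than a container.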

Best practices

Choose by lifecycle, not by habit. Event-driven & short → Lambda; containerised with no K8s need → ECS; already on Kubernetes → EKS; OS/GPU/licensing/sustained → EC2. Re-derive it per workload.
Default the data plane to Fargate, move to EC2-backed only when bin-packing or Reserved/Spot pricing demonstrably wins. Don’t run a node fleet you don’t need.
Prefer Graviton (arm64) for Lambda, Fargate and EC2 wherever your dependencies support it — ~20% cheaper and often faster. Recompile native deps and test.
Buy commitment for steady spend. Compute Savings Plans cover EC2, Fargate and Lambda; cover your steady baseline and leave the spiky top On-Demand/Spot.
Make fault-tolerant work run on Spot (batch, CI, stateless workers) with checkpointing/idempotency and a capacity-optimized allocation strategy.
Always set health checks (ECS container health + ALB target health; EKS readiness/liveness probes). No health check means you ship crash-looping tasks to users.
Set Lambda timeouts and memory deliberately — never ship the default 3 s; raise memory to raise vCPU for CPU-bound work (it can be cheaper by finishing faster).
Cap blast radius with reserved concurrency on functions that hit a fragile downstream (a small RDS), so one function can’t exhaust connections.
Keep one cluster, many namespaces on EKS where isolation allows — each extra cluster is another $0.10/hr plus another upgrade to run.
Right-size on a schedule. Review fleet/utilisation quarterly; instances outlive their workloads and idle vCPUs are pure waste.
Push everything to CloudWatch (and X-Ray for traces). The first move in every failure above is reading a log; make sure there is one.
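
The Spot practice above depends on work being resumable, so here is a minimal sketch of the checkpoint-and-resume pattern. It is an illustration under stated assumptions: a local JSON file stands in for durable storage (on real Spot capacity the checkpoint would have to live somewhere that survives the instance, such as S3, EFS or DynamoDB), and `interrupt_after` merely simulates a reclaim.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable, Optional, Sequence


def load_checkpoint(path: Path) -> int:
    """Index of the next unprocessed item (0 if no checkpoint exists)."""
    if path.exists():
        return json.loads(path.read_text())["next_index"]
    return 0


def save_checkpoint(path: Path, next_index: int) -> None:
    """Write to a temp file, then rename, so a kill never leaves a torn file."""
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps({"next_index": next_index}))
    os.replace(tmp, path)


def run_batch(items: Sequence[str], checkpoint: Path,
              process: Callable[[str], None],
              interrupt_after: Optional[int] = None) -> int:
    """Process items from the checkpoint onward and return the next index.

    process() must be idempotent: an item is re-run if the worker dies
    after process() but before save_checkpoint().
    """
    next_index = load_checkpoint(checkpoint)
    handled = 0
    while next_index < len(items):
        if interrupt_after is not None and handled >= interrupt_after:
            break  # simulated Spot reclaim
        process(items[next_index])
        next_index += 1
        handled += 1
        save_checkpoint(checkpoint, next_index)
    return next_index


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        ckpt = Path(workdir) / "checkpoint.json"
        seen: list[str] = []
        batch = ["a", "b", "c", "d", "e"]
        run_batch(batch, ckpt, seen.append, interrupt_after=2)  # first worker is reclaimed
        run_batch(batch, ckpt, seen.append)                     # replacement resumes
        print(seen)
```

Checkpointing after every item is the simplest choice; for very cheap items, checkpointing every N items trades a little repeated work for fewer writes.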

Security notes

Least-privilege roles per workload. Every compute service assumes an identity: EC2 → instance profile, Lambda → execution role, ECS → task role, EKS → IRSA/Pod Identity. Scope each to exactly the APIs that workload calls — never a wildcard *. The task role (your app’s identity) is distinct from the execution role (pull image / write logs); don’t conflate them.
Run compute in private subnets. API/worker tasks, functions touching VPC resources, and nodes belong in private subnets with egress via NAT or, better, VPC endpoints (for ECR, S3, Secrets Manager, CloudWatch) so traffic to AWS APIs never leaves the AWS backbone — which also fixes the most common CannotPullContainerError / ResourceInitializationError.
Inject secrets, never bake them. Pull from Secrets Manager or SSM Parameter Store at runtime (ECS secrets, Lambda env from Secrets, EKS External Secrets) — never put credentials in an AMI, image layer or environment file in source control.
Patch what you own. AWS patches the Lambda runtime and the Fargate host; you patch EC2 AMIs and EKS/ECS-EC2 nodes. Automate AMI rebuilds and node rotation, and track EKS add-on and Kubernetes-version CVEs — an unpatched node is your exposure, not AWS’s.
Encrypt at rest and in transit. EBS volumes and ephemeral storage encrypted with KMS; TLS on every front door (ALB/API Gateway); mTLS inside the mesh where the threat model warrants it.
Lock down the EKS API. Make the control-plane endpoint private (or tightly IP-restricted), use RBAC and IAM together, and disable anonymous access; the Kubernetes API is a high-value target.
Isolate Spot/preemptible blast radius. Don’t run security-sensitive, hard-to-checkpoint work on Spot where a 2-minute eviction could interrupt a partial sensitive operation.

Cost & sizing

The bill is driven by what you pay per unit times how much you run idle. The levers that move it most, ordered by impact:

Lever	Typical saving	Applies to	How
Stop paying for idle	Up to ~85% on the workload	EC2 → Lambda/Fargate	Move spiky/intermittent work off always-on instances
Savings Plans / Reserved	up to ~72% on EC2; Compute Savings Plans up to ~66%, less on Fargate and Lambda	EC2, Fargate, Lambda	Commit 1–3 yr to your steady baseline
Spot	up to ~90%	EC2, Fargate, EKS nodes	Fault-tolerant/stateless workloads
Graviton (`arm64`)	~20%	EC2, Lambda, Fargate	Recompile + test on ARM
Right-size	10–40%	All	Match instance/task/memory to real usage
Lambda memory tuning	varies (can cut cost)	Lambda	More memory → more vCPU → shorter run
Consolidate EKS clusters	$73/mo each	EKS	Namespaces over clusters where safe

Rough figures (ap-south-1 / Mumbai, On-Demand, indicative — always check the calculator):

Compute unit	Indicative price	Notes
`t3.medium` (2 vCPU/4 GB)	~$0.0448/hr (~₹2,700/mo if 24×7)	Burstable; if always busy it throttles (standard mode) or bills surplus credits (unlimited mode)
`m7g.large` (2 vCPU/8 GB)	~$0.0856/hr (~₹5,200/mo)	Graviton general purpose
`c7g.xlarge` (4 vCPU/8 GB)	~$0.145/hr	Compute-optimised Graviton
Lambda	~$0.20 / 1M requests + ~$0.0000166667/GB-s	`arm64` ~20% less; free tier 1M req + 400k GB-s/mo
Fargate	~$0.04048/vCPU-hr + ~$0.004445/GB-hr	Per-second; Fargate Spot ~70% off
EKS control plane	$0.10/hr (~$73/mo) per cluster	Fixed, on top of the data plane
App Runner	per-second active + provisioned floor	Managed HTTP container
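
To turn the unit prices above into monthly figures, the arithmetic can be sketched as follows. The rates are the indicative ones from the table, not authoritative prices; 730 hours per month is the usual averaging convention; and the example workloads are hypothetical. Free-tier allowances and data-transfer charges are deliberately ignored.

```python
HOURS_PER_MONTH = 730

# Indicative rates copied from the table above; replace with current prices.
FARGATE_VCPU_HOUR = 0.04048
FARGATE_GB_HOUR = 0.004445
LAMBDA_PER_MILLION_REQUESTS = 0.20
LAMBDA_PER_GB_SECOND = 0.0000166667
EKS_CLUSTER_HOUR = 0.10


def fargate_monthly(vcpu: float, memory_gb: float, tasks: int = 1,
                    hours: float = HOURS_PER_MONTH) -> float:
    """Cost of running `tasks` identical Fargate tasks for `hours`."""
    return tasks * hours * (vcpu * FARGATE_VCPU_HOUR + memory_gb * FARGATE_GB_HOUR)


def lambda_monthly(requests: int, avg_duration_s: float, memory_mb: int) -> float:
    """Request charge plus GB-second charge, before any free tier."""
    gb_seconds = requests * avg_duration_s * (memory_mb / 1024)
    return requests / 1_000_000 * LAMBDA_PER_MILLION_REQUESTS + gb_seconds * LAMBDA_PER_GB_SECOND


def eks_control_plane_monthly(clusters: int) -> float:
    """Fixed control-plane charge, independent of the data plane."""
    return clusters * HOURS_PER_MONTH * EKS_CLUSTER_HOUR


print(f"Fargate 0.5 vCPU / 1 GB, 24x7: ${fargate_monthly(0.5, 1):.2f}")
print(f"Lambda 1M req x 200 ms x 512 MB: ${lambda_monthly(1_000_000, 0.2, 512):.2f}")
print(f"Three EKS control planes: ${eks_control_plane_monthly(3):.2f}")
```

With these rates the single always-on Fargate task comes to roughly $18 a month, the million-request Lambda to under $2, and three idle clusters to about $219, which is the "forgotten non-prod cluster" cost in concrete terms.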

Free-tier anchors worth knowing: EC2 750 hours/month of t2.micro/t3.micro for 12 months; Lambda a perpetual 1M requests + 400,000 GB-seconds/month; EKS has no free tier — the $0.10/hr starts immediately, which is exactly why idle non-prod clusters quietly add up. Size by starting small and scaling on a real metric: for EC2/ECS, pick the smallest type that holds your p95 with headroom and let autoscaling handle peaks; for Lambda, set memory by profiling (the cheapest run is often not the smallest memory, because higher memory finishes faster).
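
The claim that the cheapest Lambda configuration is often not the smallest one follows from billing in GB-seconds: cost per invocation is memory × duration × rate. The sketch below picks the cheapest setting from a profile of measured durations. The durations are made-up sample numbers for a CPU-bound function, and the rate is the indicative one quoted above.

```python
LAMBDA_PER_GB_SECOND = 0.0000166667  # indicative rate from the table above


def invocation_cost(memory_mb: int, duration_s: float) -> float:
    """Compute charge for one invocation (request fee excluded)."""
    return (memory_mb / 1024) * duration_s * LAMBDA_PER_GB_SECOND


def cheapest_memory(profile: dict[int, float]) -> int:
    """Memory setting with the lowest cost, given measured durations per setting."""
    return min(profile, key=lambda mb: invocation_cost(mb, profile[mb]))


# Hypothetical profiling results: memory (MB) -> measured duration (s)
profile = {512: 95.0, 1024: 40.0, 2048: 18.0}

for memory_mb, duration_s in profile.items():
    print(f"{memory_mb} MB: {duration_s:>5.1f} s  ${invocation_cost(memory_mb, duration_s):.6f}")
print("cheapest:", cheapest_memory(profile), "MB")
```

In this sample the 2,048 MB setting is both the fastest and the cheapest, because the runtime falls faster than the memory rises. Once extra memory stops shortening the run (an I/O-bound function, for example), larger settings only add cost.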

Interview & exam questions

1. When would you choose Lambda over ECS Fargate for an HTTP API? When traffic is spiky or low-average and each request is short (well under 15 minutes), so scaling to zero between bursts saves money and you want zero infrastructure to operate. Choose Fargate when the service is steady, needs a long-lived process, has heavy/large dependencies, or you want consistent low latency without cold-start management. (AWS SAA-C03.)

2. What are Lambda’s hard limits, and which is most often hit first? 15-minute max timeout, 10 GB max memory, 10 GB ephemeral /tmp, 6 MB synchronous payload and 250 MB unzipped package (10 GB as a container image), plus a default concurrency quota of 1,000 per Region, which is a soft limit you can ask to raise. In practice the timeout and concurrency limits bite first: long jobs are killed at 15 minutes with Task timed out, and bursty workloads throttle at 1,000.

3. ECS or EKS for a five-service startup with a three-person platform team? ECS. EKS adds a $0.10/hr-per-cluster cost and, more importantly, the full Kubernetes operational surface (version upgrades, add-on CVEs, CNI/IRSA debugging), which a three-person team running five services can’t justify. Choose EKS only if they already have deep Kubernetes investment or a hard portability requirement.

4. Explain the difference between an ECS execution role and a task role. The execution role is used by the ECS agent/Fargate to pull the container image from ECR and write logs to CloudWatch. The task role is the identity your application code assumes to call AWS APIs (S3, DynamoDB, etc.). They’re separate so you can least-privilege the app independently of the platform’s pull/log permissions.

5. How do EC2 purchase models change cost, and when is Spot appropriate? On-Demand is the flexible baseline; Reserved Instances and Savings Plans discount up to ~72% for a 1–3 year commitment to steady usage; Spot discounts up to ~90% for spare capacity that can be reclaimed with a 2-minute notice. Spot suits fault-tolerant, stateless, checkpointed work (batch, CI, stateless workers) — never a stateful single instance that can’t tolerate eviction.

6. What is a cold start and which services pay it? The latency to initialise fresh capacity when a request arrives with nothing warm — runtime init, image pull, connection priming. Lambda (from zero or beyond warm concurrency), Fargate tasks (pull + start), and autoscaler-provisioned EC2/EKS nodes all pay it; an always-on EC2 instance does not. Mitigate with provisioned concurrency, SnapStart, smaller packages, or keeping warm capacity.

7. Fargate vs EC2 launch type for ECS — what decides it? Fargate removes node management and bills per task — best for variable load and lean teams. EC2-backed wins when you can bin-pack a node tightly (≥~70% utilisation), need GPU/special hardware, or want Reserved/Spot EC2 pricing at scale. They’re the data-plane choice; the orchestrator (ECS) is unchanged either way.

8. Why might raising a Lambda’s memory reduce its cost? Memory and vCPU scale together — more memory means more CPU, so a CPU-bound function finishes faster. Because you’re billed per GB-second, a function that runs in half the time at double the memory can cost the same or less, while being far faster. Profile across memory sizes to find the cost/latency sweet spot.

9. A pod is stuck Pending on EKS. Walk through diagnosis. kubectl describe pod and read the events: common causes are no node with enough free CPU/memory (resource requests too high or cluster at capacity), taints with no matching toleration, or an AZ/volume affinity mismatch. Fix by scaling the data plane (Cluster Autoscaler/Karpenter), lowering requests, adding tolerations, or correcting AZ placement.

10. What does the EKS control-plane charge buy you, and how do you avoid wasting it? $0.10/hr per cluster pays for a managed, 3-AZ, highly-available, AWS-patched Kubernetes control plane (API server + etcd). Avoid waste by consolidating environments into fewer clusters (namespaces and RBAC for isolation) instead of spinning up a cluster per team/env that then sits mostly idle.

11. Which compute service for a 4-hour nightly ETL batch job? Not Lambda: the job runs far past the 15-minute wall, which disqualifies functions outright. Run the container to completion as an AWS Batch job or a scheduled ECS task (Fargate for simplicity, EC2/Spot for cost), or on an EC2 Spot fleet if it’s large and checkpointable.

12. How do Savings Plans differ from Reserved Instances? Reserved Instances commit to a specific instance family (Standard) or a swappable set (Convertible) in a region. Compute Savings Plans commit to a dollars-per-hour spend and apply flexibly across instance family, size, region, OS and across EC2, Fargate and Lambda — more flexible, slightly lower max discount than an exact-match Standard RI.

Quick check

1. A job runs for 18 minutes per execution. Can it run on Lambda? Why or why not?
2. You have six stateless microservices, a lean team, and no Kubernetes experience. ECS or EKS — and on what data plane?
3. Name two EC2 purchase models that discount steady 24×7 usage and one that suits fault-tolerant batch.
4. What’s the difference between an ECS execution role and a task role?
5. Your EC2 fleet’s average CPU is 12% and the bill is high. What’s the likely problem and the first fix?

Answers

1. No. Lambda’s hard maximum timeout is 15 minutes (900 s); an 18-minute job will be killed with Task timed out. Run it as an ECS task (Fargate or EC2) or on EC2/Batch where there’s no duration wall.
2. ECS on Fargate. ECS avoids the Kubernetes operational surface and the $0.10/hr-per-cluster cost a lean, non-K8s team can’t justify; Fargate removes node management so they ship without running a fleet.
3. Steady: Reserved Instances and Savings Plans (up to ~72%). Fault-tolerant batch: Spot (up to ~90%, with 2-minute interruption).
4. The execution role lets the platform pull the image from ECR and write logs; the task role is the identity your application assumes to call AWS APIs. Keep them separate and least-privilege each.
5. The fleet is over-provisioned / idle — paying for always-on capacity it doesn’t use. First fix: re-home spiky/intermittent work to Lambda or Fargate (scale to zero), right-size what remains, and cover the steady baseline with a Savings Plan.

Glossary

Instance — a virtual machine rented from EC2; the maximum-control compute unit.
Instance family — an EC2 class (t, m, c, r, g, …) tuned to a CPU:memory:accelerator ratio.
AMI (Amazon Machine Image) — the disk image an EC2 instance boots from; you own its patch level.
Auto Scaling Group (ASG) — keeps a target number of EC2 instances running and scales them on a policy.
CPU credits — the burst budget for t-family instances; exhaust them and you throttle to a baseline (unless Unlimited).
Lambda — function-as-a-service; runs your handler per event, scaling automatically, billed per GB-second + request.
Cold start — the latency to initialise fresh capacity (runtime/image/connection) when nothing is warm.
Concurrency — simultaneous executions (Lambda) or tasks (ECS); Lambda’s default account limit is 1,000 concurrent executions per Region.
Provisioned concurrency — pre-warmed Lambda execution environments that eliminate cold start for up to N concurrent invocations.
Task (ECS) — a running group of one or more containers; the ECS unit of work.
Task definition — the blueprint declaring a task’s containers, CPU/memory, roles, ports and health check.
Service (ECS) — keeps a desired number of tasks running and integrates with a load balancer.
Launch type — whether ECS tasks run on EC2 nodes you own or on Fargate (serverless).
Pod (EKS) — the smallest deployable Kubernetes unit; one or more containers scheduled together.
Node group (EKS) — a managed set of EC2 nodes that host pods; the data plane you may own.
Control plane — the orchestrator’s API/scheduler; free on ECS, $0.10/hr per cluster on EKS.
Fargate — the serverless data plane for ECS and EKS; per-second vCPU/GB billing, no nodes to manage.
Execution role — the IAM identity the platform uses to pull images and write logs.
Task role / IRSA — the IAM identity the application assumes to call AWS APIs (IRSA = IAM Roles for Service Accounts on EKS).
Savings Plan — a commitment to $/hour spend that discounts EC2, Fargate and Lambda flexibly.
Spot — spare-capacity pricing up to ~90% off, reclaimable with a 2-minute notice.
Graviton (arm64) — AWS’s ARM processors; ~20% cheaper per vCPU, often faster, across EC2/Lambda/Fargate.

Next steps

Go one level deeper on containers with ECS vs EKS vs Fargate: choosing your container path.
Master the serverless integration shapes in AWS Lambda event-driven patterns.
Put a front door on your compute with ALB vs NLB vs API Gateway compared.
Land it in the right network and blast radius via VPC, subnets and security groups and Regions and Availability Zones explained.
Connect it to state with RDS vs DynamoDB vs Aurora compared and the identity it assumes via AWS Organizations and IAM foundations.