AWS ECS vs EKS vs Fargate: Choose Your Container Path

Quick take: ECS is the easy AWS-native path. EKS is Kubernetes when you genuinely need it. Fargate removes nodes from both. The hard decision is not ECS vs EKS — it is whether you actually need Kubernetes, and separately, whether you want to own the servers.

A SaaS company adopted Amazon EKS because it was “the industry standard.” Six months later, three platform engineers spent their weeks managing node groups, the VPC CNI, an ingress controller, the cluster autoscaler and a sprawl of Helm charts — all to run a handful of stateless HTTP services that did nothing Kubernetes-specific. They migrated the web tier to ECS on Fargate and cut platform toil in half. The data platform, which leaned on the Spark Operator and custom controllers, stayed on EKS because Kubernetes was genuinely earning its keep there. That is the whole article in one anecdote: AWS gives you two orchestrators (ECS, EKS) crossed with two launch types (Fargate, EC2), and the cost of choosing wrong is measured in engineer-years, not dollars.

This is the decision guide I wish that team had read first. We treat the choice as two orthogonal axes, not one menu. Axis one — orchestrator — is ECS (AWS’s own scheduler, no control-plane fee, deep IAM/CloudWatch integration) versus EKS (conformant Kubernetes, portable, ecosystem-rich, but you operate add-ons and upgrades and pay $0.10/hr per cluster). Axis two — launch type — is Fargate (serverless: no nodes to patch, scale or right-size, billed per vCPU-second) versus EC2 (you own the instances: cheaper at steady state, Spot/Graviton/GPU available, daemonsets and privileged mode possible). Four corners: ECS+Fargate, ECS+EC2, EKS+Fargate, EKS+EC2 (and EKS+Karpenter, the modern node-provisioning answer). Each corner has a different operating model, a different bill, and a different set of 2 a.m. failure modes.

By the end you will stop choosing by brand recognition. You will know that an awsvpc task needs an ALB target group of target-type ip or it will never pass health checks; that a task stuck in PROVISIONING in a private subnet almost always means missing ECR/S3/logs VPC endpoints; that CannotPullContainerError is an execution-role problem, not a task-role one; that EKS Fargate quietly forbids DaemonSets and hostNetwork; and that the cheapest steady-state path is usually EC2 Spot on Graviton with Karpenter, while the cheapest operationally is Fargate. Because this is a reference you will return to mid-decision and mid-incident, the trade-offs, the limits, the task-definition fields and the failure modes are all laid out as scannable tables — read the prose once, then keep the tables open when the architecture review (or the pager) starts.

What problem this solves

Containers need an orchestrator: something to place them on hosts, restart them when they die, roll out new versions, wire them to load balancers, and scale them with demand. AWS does not give you one answer — it gives you a 2×2, and the marketing pages make all four corners sound equally good. They are not. The wrong corner is expensive in the way that hurts most: not a surprise invoice (though that too), but a permanent tax on every deploy, every patch cycle, every on-call rotation.

What breaks without a deliberate choice: a five-person startup stands up EKS “to be cloud-native,” then discovers that keeping the cluster alive — Kubernetes minor-version upgrades every ~14 months before support ends, VPC CNI / CoreDNS / kube-proxy add-on bumps, IP-exhaustion from the CNI’s per-pod ENI model, ingress-controller CVEs, Helm-chart drift — is now a full-time job that produces zero customer value. Conversely, a platform team standardizes on ECS for simplicity, then spends a year reinventing Helm-style templating, operators and CRDs in CloudFormation because they actually did need Kubernetes’ extensibility. Both teams chose on the wrong axis. The orchestrator axis is about extensibility and portability; the launch-type axis is about who owns the servers. Conflating them is the root mistake.

Who hits this: essentially every team that has outgrown a single EC2 box or a Lambda and wants to run long-lived containers. It bites hardest on teams that (a) adopt Kubernetes for resume-driven reasons, (b) run Fargate at high steady-state utilization and overpay versus EC2, (c) deploy into private subnets without the VPC endpoints awsvpc networking requires, or (d) confuse the execution role with the task role and then can’t pull an image or read a secret. The fix is almost never “switch orchestrators in a panic”; it’s “decide the two axes on their actual merits, then implement the networking and IAM correctly.”

To frame the whole field before the deep dive, here is the 2×2 with the one question each corner forces and the single fact that most often makes the decision:

Corner	One-line identity	Question it forces	Deciding fact	Best when
ECS + Fargate	AWS-native, no nodes	“Do I really need k8s? No.”	Lowest total ops; per-vCPU premium	Stateless services, batch, side-projects, small teams
ECS + EC2	AWS-native, own nodes	“Need GPU/Spot/custom AMI on ECS?”	Cheaper steady-state; you patch AMIs	Cost-sensitive steady load, GPU, daemons on ECS
EKS + Fargate	k8s API, no nodes	“Want k8s but hate node ops?”	k8s API minus DaemonSets/GPU/hostNet	Portable manifests, low-ops k8s, per-pod isolation
EKS + EC2 (Karpenter)	Full k8s, own nodes	“Operators/CRDs + Spot/GPU?”	Max power & cost control; max toil	Spark/ML, service mesh, multi-cloud, big platforms

Learning objectives

By the end of this article you can:

Separate the orchestrator decision (ECS vs EKS) from the launch-type decision (Fargate vs EC2) and reason about each on its own axis instead of as a single menu choice.
State precisely when Kubernetes earns its complexity — existing Helm/CRDs/operators, portability/multi-cloud, advanced scheduling — and when it is pure overhead you should avoid.
Choose between Fargate and EC2 using utilization, GPU/daemon needs, Spot tolerance and the per-vCPU cost premium, and back it with real numbers.
Author a task definition (ECS) and a Pod/Deployment (EKS) and explain every field that bites: CPU/memory pairs, networkMode awsvpc, execution role vs task role, log driver, health check.
Wire containers to an ALB correctly, including why awsvpc/Fargate tasks require target-type ip and how the health check must target the container port.
Diagnose the canonical container failures — PROVISIONING hangs, CannotPullContainerError, ResourceInitializationError, OOM (exit 137), IP exhaustion, ALB 503s — to a specific root cause with the exact command to confirm it.
Right-size and cost-model all four corners (Fargate vCPU-seconds, EC2/Spot/Graviton, the EKS control-plane fee, Savings Plans) in rough INR/USD and pick the cheapest appropriate path.
Map each choice to the relevant certifications (SAP-C02, DVA-C02, DOP-C02) and the Kubernetes (CKA) mindset.

Prerequisites & where this fits

You should already be comfortable with the AWS container fundamentals: a container image lives in a registry (Amazon ECR or another OCI registry); a task (ECS) or Pod (Kubernetes) is one or more containers scheduled together; a service keeps N copies running and registers them with a load balancer. You should know how to run the AWS CLI and read JSON output, what a VPC, subnet, security group and route table are, and that IAM roles grant AWS permissions. Basic Kubernetes literacy (Deployment, Service, namespace) helps for the EKS sections but is not required to follow the decision logic.

This sits in the Compute → Containers track and is the decision upstream of all the hands-on container work. It assumes the compute landscape from AWS Compute: EC2, Lambda, ECS and EKS — Which One to Choose? (that article picks the category; this one picks within containers). It depends on the networking from AWS VPC, Subnets and Security Groups Explained — awsvpc task networking, VPC endpoints and SGs are where most container outages actually live — and on the load-balancer choice from AWS ALB vs NLB vs API Gateway Compared, because the ALB target-type detail below is the single most common ECS wiring bug. Identity grounding comes from AWS Organizations & IAM Foundations.

A quick map of who owns what during a container incident, so you page the right person fast:

Layer	What lives here	Who usually owns it	Failure classes it can cause
Client / DNS / TLS	Name resolution, certs, retries	Frontend / SRE	5xx only if misrouted; mostly red herrings
ALB / target group	Listener, health check, target-type	Network / platform	503 (no healthy targets), 504 (slow app)
Orchestrator (ECS/EKS)	Scheduling, desired count, rollout	Platform team	Tasks not placed, stuck rollout, throttling
Launch type (Fargate/EC2)	Capacity, ENI attach, node health	Platform / AWS	PROVISIONING hang, node pressure, IP exhaustion
Image / ECR	Pull auth, image tag, size	App + platform	`CannotPullContainerError`, slow cold start
Task / Pod (your code)	Process, port bind, memory	App / dev team	Crash loop, OOM (137), wrong port
IAM (exec + task role)	Pull/log/secrets vs app APIs	App + security	AccessDenied, secret resolve fail

Core concepts

Six mental models make every later decision and diagnosis obvious.

The choice is two axes, not one. Orchestrator (ECS vs EKS) decides the API and ecosystem you program against and operate. Launch type (Fargate vs EC2) decides who owns the servers. They are independent: you can run ECS on Fargate or EC2, and EKS on Fargate or EC2 (or both at once). Decide them separately. The orchestrator question is “do I need Kubernetes’ extensibility and portability?” The launch-type question is “do I want to patch, scale and right-size servers, in exchange for lower cost and more control?”

ECS is the AWS-native, no-cluster-fee orchestrator. Amazon ECS (Elastic Container Service) schedules tasks (defined by a task definition — a versioned JSON describing containers, CPU/memory, networking, roles, logging). A service maintains a desired count and integrates natively with ALB/NLB, CloudWatch, IAM, App Mesh and Service Connect. There is no charge for the ECS control plane — you pay only for the compute (Fargate or EC2). ECS concepts map cleanly onto AWS primitives, so there is little to learn beyond AWS itself. The trade: it is AWS-only and less extensible than Kubernetes.

EKS is conformant Kubernetes, with a control-plane fee and add-on operations. Amazon EKS (Elastic Kubernetes Service) runs an upstream-conformant Kubernetes control plane that AWS manages (highly available across AZs) for $0.10 per cluster-hour (~$73/month). You get the entire Kubernetes API: Deployments, CRDs, operators, Helm, the Horizontal/Vertical Pod Autoscaler, network policies, and portability across clouds. The cost is operational: you own the add-on lifecycle (VPC CNI, CoreDNS, kube-proxy), cluster version upgrades (a new Kubernetes minor arrives roughly every 4 months, and each gets about 14 months of standard support on EKS), the ingress/load-balancer controller, IP planning for the CNI, and the broader Kubernetes blast radius. You get power and portability in exchange for toil.

Fargate is serverless containers — no nodes, billed per vCPU-second. AWS Fargate runs your task/pod on AWS-managed capacity. You specify CPU and memory; AWS finds the host, attaches an ENI (awsvpc), pulls the image and runs the container. No EC2 to patch, scale, secure or right-size. You pay per vCPU-second and GB-second while the task runs (per-second, 1-minute minimum). The trade-offs: a per-vCPU premium over EC2 at steady state (~20–50% depending on Region/commitment), no DaemonSets/privileged/GPU, fixed CPU↔memory ratios, slower cold starts than a warm EC2 node, and ephemeral storage capped (20 GiB default, up to 200 GiB configurable).
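
As a rough illustration of the per-second billing described above, the sketch below estimates what a single task run costs. The two hourly rates are assumptions (approximately the published Linux/x86 on-demand rates for us-east-1 at the time of writing; check the Fargate pricing page for your Region), and `fargate_task_cost` is a hypothetical helper, not an AWS API. Ephemeral storage above the default 20 GiB and data transfer are left out.

```python
# Assumed illustrative on-demand rates in USD; verify against current pricing.
VCPU_PER_HOUR = 0.04048
GB_PER_HOUR = 0.004445
MIN_BILLED_SECONDS = 60  # per-second billing with a one-minute minimum


def fargate_task_cost(vcpu: float, memory_gb: float, seconds: float) -> float:
    """Estimate the compute cost (USD) of one Fargate task run."""
    billed_seconds = max(seconds, MIN_BILLED_SECONDS)
    per_second = (vcpu * VCPU_PER_HOUR + memory_gb * GB_PER_HOUR) / 3600
    return billed_seconds * per_second


if __name__ == "__main__":
    # A 0.5 vCPU / 1 GB task that runs for a 730-hour month.
    print(round(fargate_task_cost(0.5, 1, 730 * 3600), 2))
    # A 10-second batch task is billed as if it ran for 60 seconds.
    print(fargate_task_cost(0.25, 0.5, 10) == fargate_task_cost(0.25, 0.5, 60))
```

Multiplying the monthly figure by the desired count gives a first-order service cost to compare against an EC2 instance of similar size.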

EC2 launch type means you own the nodes — cheaper and more flexible, but yours to operate. With the EC2 launch type, tasks/pods run on EC2 instances you provision (an ECS capacity provider / Auto Scaling Group, or on EKS a managed node group or Karpenter). You choose instance families (Graviton/arm64 for ~20–40% better price-performance, GPU for ML, memory-optimized for caches), use Spot for up to ~90% savings on interruptible work, bring custom AMIs, run DaemonSets/privileged containers, and bin-pack many tasks per instance. The cost: you patch AMIs, manage scaling and capacity, and carry the security of the host OS.

awsvpc networking gives each task its own ENI — and its own failure modes. On Fargate (always) and increasingly on EC2, ECS/EKS use the awsvpc network mode: each task/pod gets its own Elastic Network Interface with a VPC IP, its own security group, and first-class VPC routing. This is clean (per-task SGs, no port conflicts) but introduces three classics: IP exhaustion (each task burns a subnet IP; the EKS CNI burns several), the ALB target-type ip requirement (the LB targets the task’s ENI IP, not a host), and VPC-endpoint dependence in private subnets (pulling from ECR and writing to CloudWatch need a route to AWS — a NAT Gateway or interface/gateway endpoints).

The vocabulary in one table

Before the deep sections, pin down every moving part. The glossary repeats these for lookup; this table is the mental model side by side:

Concept	One-line definition	Where it lives	Why it matters to the choice
Orchestrator	ECS or EKS — the scheduler/API	Account / Region	Axis 1: extensibility & portability
Launch type	Fargate or EC2 — who owns hosts	Per service/profile	Axis 2: cost & control vs ops
Task definition	Versioned JSON: containers, CPU/mem, roles	ECS	The unit you deploy on ECS
Service	Keeps N tasks/pods running + LB-wired	ECS / EKS	Steady-state app; rollout target
Pod / Deployment	k8s scheduling unit / replica controller	EKS	The unit you deploy on EKS
Execution role	Pull image, write logs, read secrets	ECS task def	Wrong → `CannotPullContainerError`
Task role / IRSA	The app’s own AWS permissions	Task / Pod	Wrong → app `AccessDenied`
Capacity provider	Maps a service to Fargate/EC2 capacity	ECS	How EC2/Spot/Fargate mix is set
Managed node group	AWS-managed EC2 ASG for EKS	EKS	Node lifecycle without raw ASGs
Karpenter	Just-in-time node provisioner for EKS	EKS	Modern EC2 scaling; bin-packs Spot
VPC CNI	EKS plugin giving pods VPC IPs	EKS	IP exhaustion; prefix delegation
awsvpc ENI	Per-task/pod network interface + SG	Subnet	IP burn; ALB target-type ip
VPC endpoint	Private route to ECR/S3/logs/STS	VPC	Missing → PROVISIONING/pull fails

Axis 1 — ECS or EKS? Deciding whether you need Kubernetes

This is the consequential decision, and it is not about which is “better” — it is about whether your workload needs Kubernetes’ extensibility and portability enough to pay for operating it. Default to ECS. Reach for EKS only when you can name a concrete Kubernetes capability you depend on.

What ECS gives you (and what it doesn’t)

ECS is the path of least resistance on AWS. Everything is an AWS primitive you already understand; there is no second API to learn, no add-on fleet to keep current, and no control-plane bill.

Capability	ECS	Notes
Control-plane cost	$0 (free)	You pay only Fargate/EC2 compute
Learning curve	Low (AWS concepts only)	Task def ≈ “JSON of containers”
Native ALB/NLB integration	Yes (target group + service)	First-class, no extra controller
IAM per task	Yes (task role)	Clean least-privilege per workload
Service discovery	Cloud Map / Service Connect	DNS + L7 mesh-lite, no sidecar to run
Autoscaling	Service Auto Scaling (target tracking)	On CPU/mem/ALB request count
Secrets	Secrets Manager / SSM injection	Declared in task def
Observability	CloudWatch Logs/Container Insights	Native; OTel via ADOT sidecar
Custom controllers / operators	No	The big gap vs k8s
CRDs / extensible API	No	Can’t extend the API
Portability off AWS	No	AWS-only
Ecosystem (Helm/charts)	No	Use CloudFormation/CDK/Terraform

What EKS gives you (and what it costs)

EKS is Kubernetes — the full API, the ecosystem, the portability. The price is a control-plane fee plus a permanent operational surface.

Capability	EKS	Notes
Control-plane cost	$0.10/hr (~$73/mo) per cluster	Plus compute; consolidate clusters
Learning curve	High (Kubernetes + AWS)	YAML, controllers, RBAC, CNI
API extensibility (CRDs)	Yes	Operators, custom resources
Operators ecosystem	Yes	Spark, Flink, Strimzi, cert-manager…
Helm / chart ecosystem	Yes	Huge reuse for off-the-shelf software
Portability / multi-cloud	Yes (conformant)	Same manifests on GKE/AKS/on-prem
Advanced scheduling	Yes	Affinity, taints/tolerations, topology
Network policies	Yes (CNI/Calico)	Pod-level micro-segmentation
HPA + VPA + KEDA	Yes	Event-driven & vertical autoscaling
Add-on lifecycle (you own)	CNI, CoreDNS, kube-proxy	Version-bump on every cluster upgrade
Cluster upgrades (you own)	~every 14 months before EOL	In-place; test add-on compat
LB controller (you install)	AWS Load Balancer Controller	Provisions ALB/NLB from Ingress/Service
IP planning (you own)	VPC CNI per-pod ENI	Prefix delegation / custom networking

The decision table — does this workload need Kubernetes?

Check your workload against each signal below. One genuine yes on an EKS signal can justify EKS; if every EKS signal is a no, choose ECS, full stop.

Signal	If YES → lean	Why
You already run Helm charts / operators / CRDs	EKS	Reusing the k8s ecosystem is the point
You need multi-cloud / on-prem portability	EKS	Conformant API runs the same elsewhere
You run Spark/Flink/ML on Kubernetes operators	EKS	Operators are the value (e.g. Spark Operator)
You need advanced scheduling (affinity, topology, gang)	EKS	ECS scheduling is comparatively basic
Your org has deep Kubernetes skills already	EKS	The toil is cheaper when you know k8s
You need a service mesh (Istio/Linkerd)	EKS	Mesh ecosystems are k8s-native
You just need to run stateless containers + ALB	ECS	k8s buys you nothing here
Team is small / no k8s expertise	ECS	Don’t pay the cluster tax for nothing
You want lowest operational surface	ECS	No add-ons, no upgrades, no CNI
Cost of the control plane matters at small scale	ECS	$0 vs $73/mo per cluster
You want resume-driven Kubernetes	ECS	Not a technical reason; resist
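
The table above can be read as a simple rule: any genuine EKS signal leans toward EKS, otherwise default to ECS. A minimal sketch of that rule follows; the signal names are hypothetical labels invented for this example, and the output is a starting point for discussion rather than a verdict.

```python
# Hypothetical labels for the "lean EKS" rows of the decision table.
EKS_SIGNALS = {
    "existing_helm_operators_crds",
    "multicloud_or_onprem_portability",
    "spark_flink_ml_operators",
    "advanced_scheduling",
    "deep_kubernetes_skills",
    "service_mesh",
}


def choose_orchestrator(signals: set[str]) -> tuple[str, list[str]]:
    """Return the suggested orchestrator and the signals that justified it."""
    reasons = sorted(signals & EKS_SIGNALS)
    return ("EKS" if reasons else "ECS", reasons)


if __name__ == "__main__":
    print(choose_orchestrator({"stateless_http_behind_alb", "small_team"}))
    print(choose_orchestrator({"service_mesh", "small_team"}))
```

Signals that are not in the EKS set (small team, lowest ops, resume-driven Kubernetes) are ignored on purpose: they never argue for EKS.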

Operating-toil comparison (the part the bill doesn’t show)

The control-plane fee is the visible cost. The invisible one is recurring engineering time. This is where most “we should have used ECS” regret originates.

Recurring task	ECS	EKS	Notes
Patch the orchestrator	AWS (none for you)	AWS does control plane; you do add-ons	Add-on bumps every upgrade
Minor-version upgrades	None	Yes, ~yearly before EOL	Test CNI/CoreDNS/app compat
Networking plugin (CNI)	None (native)	You tune (prefix deleg., custom net)	IP exhaustion is an EKS-only class
Load-balancer wiring	Native service↔TG	Install/operate LB Controller	A Deployment you keep current
Ingress	ALB via service	Ingress + controller	More moving parts
RBAC / access	IAM only	IAM + Kubernetes RBAC + aws-auth/Access Entries	Two systems to keep in sync
Secrets	Native injection	CSI driver / External Secrets	Extra components
Disaster of a bad upgrade	Rare	Real risk (add-on/app breakage)	Blue/green clusters mitigate

Axis 2 — Fargate or EC2? Deciding who owns the servers

Independent of the orchestrator, decide whether you want to operate nodes. Fargate trades money for the elimination of node operations; EC2 trades operations for lower cost and more capability. Both work under ECS and EKS.

Fargate — the no-nodes model

Property	Fargate behaviour	Implication
Host management	None (AWS-managed)	No AMI patching, no node scaling
Billing	Per vCPU-second + GB-second	Pay only while the task runs
Sizing	Fixed CPU↔memory combinations	Can’t pick arbitrary ratios
Networking	Always awsvpc (own ENI)	Per-task SG; burns a subnet IP
GPU	Not supported	ML/GPU must use EC2
Privileged / hostNetwork / DaemonSet	Not supported	No node-level agents on Fargate
Ephemeral storage	20 GiB default (up to 200 GiB)	No persistent local disk
Spot equivalent	Fargate Spot (up to ~70% off, interruptible; ECS only)	Great for batch/dev
Cold start	Seconds (image pull + ENI attach)	Slower than a warm EC2 node
Per-vCPU cost vs EC2	~20–50% premium at steady state	The core trade-off
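
The premium row above implies a break-even point. If Fargate costs (1 + p) times EC2 per vCPU, and your EC2 nodes are only a fraction u utilized, the effective EC2 price per used vCPU is 1/u times the list price, so Fargate is cheaper whenever u < 1 / (1 + p). The sketch below encodes that arithmetic under those simplifying assumptions; it ignores engineer time, Spot, and Savings Plans, and the function names are hypothetical.

```python
def breakeven_utilization(fargate_premium: float) -> float:
    """EC2 utilization above which EC2 beats Fargate on raw compute cost."""
    if fargate_premium < 0:
        raise ValueError("premium must be non-negative")
    return 1.0 / (1.0 + fargate_premium)


def cheaper_launch_type(utilization: float, fargate_premium: float) -> str:
    """Compare on compute cost only, given average EC2 node utilization (0-1]."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    if utilization > breakeven_utilization(fargate_premium):
        return "EC2"
    return "Fargate"


if __name__ == "__main__":
    for premium in (0.2, 0.3, 0.5):
        print(premium, round(breakeven_utilization(premium), 3))
    print(cheaper_launch_type(0.4, 0.3))
    print(cheaper_launch_type(0.9, 0.3))
```

With the 20-50% premium range quoted above, the break-even sits between roughly 83% and 67% utilization, which is why spiky workloads favor Fargate and tightly bin-packed steady workloads favor EC2.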

EC2 launch type — own the nodes

Property	EC2 behaviour	Implication
Host management	Yours (patch, scale, secure)	Operational cost
Billing	Per instance-hour (or Spot/RI/SP)	Cheaper at steady, high utilization
Sizing	Any instance family/size	Graviton, GPU, memory/compute-optimized
Bin-packing	Many tasks per instance	Higher density = lower unit cost
Spot	Up to ~90% off (interruptible)	Big savings on tolerant workloads
Graviton (arm64)	~20–40% better price/perf	Rebuild image multi-arch
GPU	Supported (g/p families)	Required for ML inference/training
DaemonSets / privileged	Supported	Node agents, log shippers, security tools
Custom AMI / kernel	Supported	Compliance, special drivers
Scaling mechanism	ASG / capacity provider / Karpenter	Karpenter = fast, bin-packing JIT nodes

Fargate-vs-EC2 decision table

If your workload…	Choose	Why
Is spiky / low-and-variable utilization	Fargate	Pay per second; no idle nodes to fund
Has a small team / wants min ops	Fargate	No node patching or scaling
Runs steady & high utilization 24×7	EC2	Bin-pack + RI/SP beats per-task pricing
Needs GPU (ML inference/training)	EC2	Fargate has no GPU
Needs DaemonSets / node agents / privileged	EC2	Fargate forbids them
Can tolerate interruptions (batch, CI, dev)	Fargate Spot / EC2 Spot	Up to 70–90% savings
Wants Graviton price-performance	EC2 (arm64) or Fargate arm64	Both support arm64; EC2 cheaper
Has bursty batch with no infra team	Fargate	Scales to zero between runs
Needs custom AMI / kernel modules	EC2	Fargate is a sealed runtime
Wants the cheapest possible steady compute	EC2 Spot + Graviton + Karpenter	Lowest unit cost, highest toil

The four corners, side by side

	Fargate	EC2
ECS	Lowest ops; AWS-native; no nodes; per-vCPU premium. Default for most services.	AWS-native + Spot/Graviton/GPU/daemons; you patch AMIs. Cost-optimized ECS.
EKS	k8s API, no nodes; no DaemonSet/GPU/hostNet; per-pod isolation; billed per pod on requested vCPU/memory. Low-ops k8s.	Full k8s power: operators, GPU, Spot via Karpenter, daemonsets. Max toil. Spark/ML/mesh.

ECS deep dive — the task definition, every field that bites

On ECS you deploy task definitions. A task definition is immutable and versioned (family:revision); you register a new revision and update the service to it. The fields below are where real incidents originate.

Task-level settings

Field	What it sets	Choices / values	Default	Gotcha
`requiresCompatibilities`	Launch type compatibility	`FARGATE` / `EC2`	—	Fargate forces `awsvpc` + valid CPU/mem pair
`networkMode`	Task networking	`awsvpc` / `bridge` / `host` / `none`	`bridge` (EC2)	Fargate = awsvpc only; ALB needs target-type ip
`cpu` (task)	vCPU units (1024 = 1 vCPU)	256–16384 (Fargate set)	—	Fargate: only specific CPU↔mem pairs
`memory` (task)	MiB	Tied to CPU on Fargate	—	EC2: optional but recommended as a cap
`executionRoleArn`	Pull image, logs, secrets	An IAM role	—	Missing → `CannotPullContainerError`
`taskRoleArn`	App’s AWS permissions	An IAM role	—	The app’s calls (S3/DDB) use THIS, not exec role
`ephemeralStorage.sizeInGiB`	Scratch disk (Fargate)	21–200	20	Not persistent; gone on stop
`runtimePlatform`	OS/arch	`LINUX/X86_64`, `LINUX/ARM64`, Windows	x86_64	arm64 = Graviton savings; rebuild image
`pidMode` / `ipcMode`	Shared namespaces	`task`/`host`	none	`host` not allowed on Fargate

Fargate CPU↔memory valid combinations

Fargate does not let you pick arbitrary CPU/memory. Pick a row.

vCPU (`cpu`)	Memory options (`memory`)
0.25 (256)	0.5, 1, 2 GB
0.5 (512)	1, 2, 3, 4 GB
1 (1024)	2–8 GB (1 GB steps)
2 (2048)	4–16 GB (1 GB steps)
4 (4096)	8–30 GB (1 GB steps)
8 (8192)	16–60 GB (4 GB steps)
16 (16384)	32–120 GB (8 GB steps)
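
Because an invalid pair is rejected at registration time, it can help to check sizes before deploying. The sketch below encodes the table above as data; `valid_memory_mib` and `is_valid_fargate_size` are hypothetical helper names, and the table should be re-checked against the current Fargate documentation before relying on it.

```python
GIB = 1024  # task-level memory is expressed in MiB


def valid_memory_mib(cpu_units: int) -> list[int]:
    """Return the memory values (MiB) allowed for a Fargate task-level cpu value."""
    table = {
        256: [512, 1 * GIB, 2 * GIB],
        512: [n * GIB for n in range(1, 5)],
        1024: [n * GIB for n in range(2, 9)],
        2048: [n * GIB for n in range(4, 17)],
        4096: [n * GIB for n in range(8, 31)],
        8192: [n * GIB for n in range(16, 61, 4)],
        16384: [n * GIB for n in range(32, 121, 8)],
    }
    if cpu_units not in table:
        raise ValueError(f"{cpu_units} is not a Fargate cpu value")
    return table[cpu_units]


def is_valid_fargate_size(cpu_units: int, memory_mib: int) -> bool:
    """True when the cpu/memory pair appears in the table above."""
    try:
        return memory_mib in valid_memory_mib(cpu_units)
    except ValueError:
        return False


if __name__ == "__main__":
    print(is_valid_fargate_size(256, 512))
    print(is_valid_fargate_size(256, 4096))
    print(is_valid_fargate_size(8192, 18 * GIB))
```

A check like this fits naturally in CI, ahead of the `register-task-definition` call.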

Container-level settings (inside `containerDefinitions`)

Field	What it sets	Notes / gotcha
`image`	ECR/OCI image URI	Pin a digest/tag, never `:latest` in prod
`portMappings.containerPort`	Port the app listens on	Must match the ALB health check + target group port
`essential`	If false, container dying doesn’t kill task	Sidecars often `essential:false`
`logConfiguration`	Log driver	`awslogs` (CloudWatch) or `awsfirelens` (FireLens→anywhere)
`healthCheck`	Container-level health (Docker)	Separate from the ALB health check
`secrets`	Inject from Secrets Manager/SSM	Needs execution role permission
`environment`	Plain env vars	Never put secrets here
`ulimits` / `linuxParameters`	nofile, capabilities	`add`/`drop` Linux capabilities here
`dependsOn`	Container start ordering	E.g. app waits for a proxy to be HEALTHY
`cpu` / `memoryReservation`	Per-container CPU units and soft memory limit	`memory` is the hard cap; container values must fit within the task-level sizing
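
To show how the task-level and container-level fields fit together, here is a minimal sketch that builds a Fargate task definition as a Python dict and prints it as JSON suitable for `--cli-input-json`. The account ID, role names, log group, secret ARN and `/healthz` path are all placeholders invented for this example, and the health check assumes `curl` exists in the image.

```python
import json

# Hypothetical identifiers, for illustration only.
ACCOUNT_ID = "123456789012"
REGION = "ap-south-1"


def build_task_definition(image_tag: str, container_port: int = 8080) -> dict:
    """Return a Fargate task definition as a plain dict."""
    return {
        "family": "web",
        "requiresCompatibilities": ["FARGATE"],
        "networkMode": "awsvpc",  # the only mode Fargate accepts
        "cpu": "512",             # task-level values are strings and must form a valid pair
        "memory": "1024",
        "runtimePlatform": {"cpuArchitecture": "X86_64", "operatingSystemFamily": "LINUX"},
        # Execution role: used by ECS to pull the image, write logs, resolve secrets.
        "executionRoleArn": f"arn:aws:iam::{ACCOUNT_ID}:role/web-execution",
        # Task role: used by the application code for its own AWS calls.
        "taskRoleArn": f"arn:aws:iam::{ACCOUNT_ID}:role/web-task",
        "containerDefinitions": [{
            "name": "web",
            "image": f"{ACCOUNT_ID}.dkr.ecr.{REGION}.amazonaws.com/web:{image_tag}",
            "essential": True,
            "portMappings": [{"containerPort": container_port, "protocol": "tcp"}],
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "/ecs/web",
                    "awslogs-region": REGION,
                    "awslogs-stream-prefix": "web",
                },
            },
            "healthCheck": {
                "command": ["CMD-SHELL",
                            f"curl -f http://localhost:{container_port}/healthz || exit 1"],
                "interval": 30, "timeout": 5, "retries": 3, "startPeriod": 15,
            },
            "secrets": [{
                "name": "DB_PASSWORD",
                "valueFrom": f"arn:aws:secretsmanager:{REGION}:{ACCOUNT_ID}"
                             ":secret:web/db-password-AbCdEf",
            }],
        }],
    }


if __name__ == "__main__":
    print(json.dumps(build_task_definition("1.4.2"), indent=2))
```

Redirecting the output to `taskdef.json` produces a file in the shape that `aws ecs register-task-definition --cli-input-json file://taskdef.json` expects.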

Create an ECS Fargate service (CLI)

# 1) Register the task definition (JSON in file)
aws ecs register-task-definition --cli-input-json file://taskdef.json

# 2) Create a target group of TYPE IP (awsvpc requires this!)
aws elbv2 create-target-group --name web-tg \
  --protocol HTTP --port 8080 --vpc-id vpc-0abc \
  --target-type ip --health-check-path /healthz

# 3) Create the service on Fargate, wired to the ALB.
#    The target group must already be attached to an ALB listener, or create-service is rejected.
aws ecs create-service --cluster prod --service-name web \
  --task-definition web:7 --desired-count 3 --launch-type FARGATE \
  --network-configuration 'awsvpcConfiguration={subnets=[subnet-1,subnet-2],securityGroups=[sg-web],assignPublicIp=DISABLED}' \
  --load-balancers 'targetGroupArn=arn:...:targetgroup/web-tg/...,containerName=web,containerPort=8080'

The same in Terraform

resource "aws_ecs_service" "web" {
  name            = "web"
  cluster         = aws_ecs_cluster.prod.id
  task_definition = aws_ecs_task_definition.web.arn
  desired_count   = 3
  launch_type     = "FARGATE"

  network_configuration {
    subnets          = var.private_subnets
    security_groups  = [aws_security_group.web.id]
    assign_public_ip = false
  }
  load_balancer {
    target_group_arn = aws_lb_target_group.web.arn # target_type = "ip"
    container_name   = "web"
    container_port   = 8080
  }
}

ECS capacity providers — how Fargate/EC2/Spot mix is set

Capacity provider	Backs	Use it for
`FARGATE`	On-demand Fargate	Baseline reliable capacity
`FARGATE_SPOT`	Interruptible Fargate (up to ~70% off)	Batch, dev, fault-tolerant tiers
ASG capacity provider	Your EC2 Auto Scaling Group	EC2 launch type; managed scaling
Capacity-provider strategy	Weighted mix (e.g. 1 on-demand : 3 Spot)	Cost/reliability blend with a base count
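
To make the base-plus-weights idea concrete, the sketch below estimates how a desired count would be divided across providers: base tasks are placed first, then the remainder is split in proportion to the weights. This is an approximation of the steady-state split written for this article (`split_tasks` is a hypothetical helper); ECS places tasks incrementally, so the real distribution can differ slightly at any moment.

```python
import math


def split_tasks(desired: int, strategy: list[dict]) -> dict[str, int]:
    """Approximate how many tasks land on each capacity provider."""
    counts = {item["capacityProvider"]: 0 for item in strategy}
    remaining = desired

    # 1) Satisfy the base count first.
    for item in strategy:
        take = min(item.get("base", 0), remaining)
        counts[item["capacityProvider"]] += take
        remaining -= take
    if remaining == 0:
        return counts

    # 2) Split what is left in proportion to the weights.
    total_weight = sum(item.get("weight", 0) for item in strategy)
    if total_weight == 0:
        raise ValueError("no weights available to place tasks beyond the base")
    exact = {item["capacityProvider"]: remaining * item.get("weight", 0) / total_weight
             for item in strategy}
    for name, share in exact.items():
        counts[name] += math.floor(share)

    # 3) Hand out rounding leftovers to the largest fractional shares.
    leftover = remaining - sum(math.floor(share) for share in exact.values())
    by_fraction = sorted(exact, key=lambda n: exact[n] - math.floor(exact[n]), reverse=True)
    for name in by_fraction[:leftover]:
        counts[name] += 1
    return counts


if __name__ == "__main__":
    strategy = [
        {"capacityProvider": "FARGATE", "base": 2, "weight": 1},
        {"capacityProvider": "FARGATE_SPOT", "weight": 3},
    ]
    print(split_tasks(10, strategy))
```

The base count is what keeps a minimum of on-demand tasks alive if Spot capacity is reclaimed.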

EKS deep dive — clusters, node options, and the add-ons you own

On EKS you deploy standard Kubernetes objects (Deployment, Service, Ingress). The differences from a generic cluster are where the nodes come from, how pods get IPs, and which add-ons you keep current.

EKS compute options

Compute option	What it is	Pros	Cons
Managed node groups	AWS-managed EC2 ASG of workers	Simple lifecycle, AWS-patched AMIs, drain on update	Less flexible than Karpenter; coarse scaling
Self-managed nodes	Your own ASG/AMI	Full control (custom AMI/kernel)	You own everything, including upgrades
Karpenter	JIT node provisioner (controller)	Fast, bin-packs, picks cheapest fit, Spot-native	A controller you operate; newer mental model
EKS on Fargate	Serverless pods via Fargate profiles	No nodes; per-pod isolation	No DaemonSet/GPU/hostNetwork; profile selectors
EKS Auto Mode	AWS-managed compute+addons	Lowest ops EKS; AWS runs nodes/CNI/LB	Newer; less control; premium

The add-ons you must keep current (the toil, enumerated)

Add-on	Job	If you neglect it
VPC CNI (`aws-node`)	Gives pods VPC IPs	IP exhaustion; pods stuck `ContainerCreating`
CoreDNS	In-cluster DNS	Service discovery breaks
kube-proxy	Service VIP routing	Service traffic fails
AWS Load Balancer Controller	ALB/NLB from Ingress/Service	No external load balancing
Cluster Autoscaler / Karpenter	Node scaling	Pods `Pending`, no capacity
EBS/EFS CSI driver	Persistent volumes	PVCs won’t bind
Metrics Server	HPA input	HPA can’t scale
cert-manager / ExternalDNS (optional)	TLS / DNS automation	Manual cert/DNS toil

Fargate profiles (EKS) — and their hard limits

A Fargate profile declares which pods (by namespace + labels) run on Fargate instead of nodes. The limits below decide whether your workload even fits.

Limitation on EKS Fargate	Detail	Consequence
No DaemonSets	Can’t schedule one pod per node	Node-level agents (logging, security) won’t run; use sidecars
No GPU	No accelerator support	ML/GPU pods must use EC2 nodes
No `hostNetwork` / `hostPort`	Pod can’t share host net	Some CNIs/agents incompatible
No privileged containers	Sealed runtime	Security/observability tooling that needs it fails
One pod per “node”	Each pod = its own micro-VM	Higher isolation; different cost profile
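Logging without node agents	No DaemonSet log shipper; use the built-in Fluent Bit log router (`aws-logging` ConfigMap in the `aws-observability` namespace) or a sidecar	Configure log routing explicitly; nothing ships logs by default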
Profile selectors required	Pods must match a profile to land	Mismatched pods stay `Pending`
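
The limits above are mechanical enough to check before a pod is ever scheduled. The following is a minimal sketch of such a pre-flight check over a manifest already loaded as a Python dict (for example from YAML); `fargate_blockers` is a hypothetical helper and covers only the limits listed in the table, not every Fargate restriction.

```python
def fargate_blockers(manifest: dict) -> list[str]:
    """List reasons a Kubernetes manifest cannot run on EKS Fargate."""
    problems = []
    if manifest.get("kind") == "DaemonSet":
        problems.append("DaemonSet: not schedulable on Fargate")

    spec = manifest.get("spec", {})
    # Workload kinds nest the pod spec under spec.template.spec; a bare Pod uses spec.
    pod_spec = spec.get("template", {}).get("spec", spec)

    if pod_spec.get("hostNetwork"):
        problems.append("hostNetwork: true")

    containers = pod_spec.get("containers", []) + pod_spec.get("initContainers", [])
    for container in containers:
        name = container.get("name", "?")
        if container.get("securityContext", {}).get("privileged"):
            problems.append(f"{name}: privileged container")
        if any("hostPort" in port for port in container.get("ports", [])):
            problems.append(f"{name}: hostPort")
        if "nvidia.com/gpu" in container.get("resources", {}).get("limits", {}):
            problems.append(f"{name}: GPU request")
    return problems


if __name__ == "__main__":
    agent = {
        "kind": "DaemonSet",
        "spec": {"template": {"spec": {
            "hostNetwork": True,
            "containers": [{"name": "agent", "securityContext": {"privileged": True}}],
        }}},
    }
    for problem in fargate_blockers(agent):
        print(problem)
```

An empty list only means none of these specific blockers were found; the pod still needs to match a Fargate profile selector to land on Fargate.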

IRSA vs Pod Identity — granting AWS permissions to pods

Mechanism	How it works	When to use
IRSA (IAM Roles for Service Accounts)	OIDC trust → annotate a ServiceAccount with a role ARN	Mature, widely supported, fine-grained per-SA
EKS Pod Identity	Pod Identity Agent + association; no per-cluster OIDC trust setup	Newer, simpler at scale; fewer trust-policy edits
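
The practical difference between the two mechanisms shows up in the IAM role's trust policy. The sketch below builds both as Python dicts so they can be compared side by side; the account ID, OIDC provider ID, namespace and service-account name are placeholders, and the helper names are invented for this example.

```python
import json


def irsa_trust_policy(account_id: str, oidc_provider: str,
                      namespace: str, service_account: str) -> dict:
    """Trust policy for IRSA. oidc_provider is the issuer URL without https://."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {
                "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{oidc_provider}"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {"StringEquals": {
                f"{oidc_provider}:aud": "sts.amazonaws.com",
                # Pins the role to exactly one ServiceAccount in one namespace.
                f"{oidc_provider}:sub": f"system:serviceaccount:{namespace}:{service_account}",
            }},
        }],
    }


def pod_identity_trust_policy() -> dict:
    """Trust policy for EKS Pod Identity: no cluster-specific OIDC details."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "pods.eks.amazonaws.com"},
            "Action": ["sts:AssumeRole", "sts:TagSession"],
        }],
    }


if __name__ == "__main__":
    provider = "oidc.eks.ap-south-1.amazonaws.com/id/EXAMPLED539D4633E53DE1B71EXAMPLE"
    print(json.dumps(irsa_trust_policy("123456789012", provider, "apps", "web-sa"), indent=2))
    print(json.dumps(pod_identity_trust_policy(), indent=2))
```

The IRSA policy embeds the cluster's OIDC provider, so it has to be edited per cluster; the Pod Identity policy is the same everywhere, with the pod-to-role binding living in a separate association.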

Minimal EKS on Fargate, then a Deployment

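# Cluster on Fargate; eksctl's --fargate profile covers only "default" and "kube-system"
eksctl create cluster --name prod --region ap-south-1 --fargate
# Add a profile for the "apps" namespace, or the Deployment's pods stay Pending
eksctl create fargateprofile --cluster prod --region ap-south-1 --name apps --namespace apps
kubectl create namespace apps

# Install the AWS Load Balancer Controller (Helm) so Ingress provisions an ALB.
# On Fargate, pass region and vpcId explicitly (vpc-0abc is a placeholder); the controller's IAM role must already exist.
helm repo add eks https://aws.github.io/eks-charts
helm install aws-lb-controller eks/aws-load-balancer-controller \
  -n kube-system --set clusterName=prod \
  --set region=ap-south-1 --set vpcId=vpc-0abc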

apiVersion: apps/v1
kind: Deployment
metadata: { name: web, namespace: apps }
spec:
  replicas: 3
  selector: { matchLabels: { app: web } }
  template:
    metadata: { labels: { app: web } }
    spec:
      serviceAccountName: web-sa   # IRSA-annotated for the app's AWS perms
      containers:
        - name: web
          image: 123456789012.dkr.ecr.ap-south-1.amazonaws.com/web:1.4.2  # placeholder account ID
          ports: [{ containerPort: 8080 }]
          resources:
            requests: { cpu: "250m", memory: "512Mi" }
            limits:   { cpu: "500m", memory: "1Gi" }

Networking — awsvpc, ALB target types, and the VPC endpoints you forget

This section is where the most outages live. awsvpc networking is clean but unforgiving, and the ALB/endpoint requirements are non-negotiable.

Network modes (ECS)

Mode	Each task gets	ALB target type	Use when
awsvpc	Own ENI + IP + SG	ip	Fargate (forced); EC2 when you want per-task SGs
bridge	Shared host net (Docker bridge)	instance	Legacy EC2; dynamic host ports
host	Host’s network namespace	instance	Max perf, no isolation; EC2 only
none	No external networking	n/a	Batch with no inbound

ALB target-type — the #1 ECS wiring bug

Target type	Registers	Required for	Symptom if wrong
ip	Task/pod ENI IP	awsvpc / Fargate	If an instance-type TG is used instead: targets never register, ALB returns 503
instance	EC2 instance + host port	`bridge`/`host` EC2	If used for awsvpc tasks: registration or health checks fail

If your service is Fargate or awsvpc and you created an instance target group, registration fails or the ALB has no healthy targets → clients get 503. Recreate the target group with --target-type ip and point its health check at the container port.
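
The rule above is simple enough to lint for. Here is a minimal sketch that maps a network mode to the target type it needs and flags the two most common mismatches; the function names are hypothetical, and `"traffic-port"` is the target group default meaning "health-check the port traffic is sent to".

```python
def required_target_type(network_mode: str) -> str:
    """Map an ECS network mode to the ALB target type it requires."""
    mode = network_mode.lower()
    if mode == "awsvpc":
        return "ip"
    if mode in ("bridge", "host"):
        return "instance"
    raise ValueError(f"network mode {network_mode!r} cannot sit behind an ALB")


def wiring_problems(network_mode: str, target_type: str, container_port: int,
                    health_check_port: str = "traffic-port") -> list[str]:
    """Return human-readable problems with a service-to-target-group wiring."""
    problems = []
    expected = required_target_type(network_mode)
    if target_type != expected:
        problems.append(
            f"target type is {target_type!r} but {network_mode} needs {expected!r}")
    # For awsvpc tasks the health check should hit the container port.
    if (expected == "ip" and health_check_port != "traffic-port"
            and int(health_check_port) != container_port):
        problems.append(
            f"health check port {health_check_port} != container port {container_port}")
    return problems


if __name__ == "__main__":
    print(wiring_problems("awsvpc", "instance", 8080))
    print(wiring_problems("awsvpc", "ip", 8080, health_check_port="80"))
    print(wiring_problems("awsvpc", "ip", 8080))
```

Target type cannot be changed on an existing target group, which is why the fix is to recreate it rather than edit it.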

VPC endpoints private tasks need (or a NAT Gateway)

A task in a private subnet with assignPublicIp=DISABLED must reach AWS APIs to pull the image and ship logs. Either route via a NAT Gateway or add these endpoints (cheaper at scale, and required if you have no NAT):

Endpoint	Type	Why the task needs it
`com.amazonaws.<region>.ecr.api`	Interface	ECR auth / metadata
`com.amazonaws.<region>.ecr.dkr`	Interface	Pull image layers
`com.amazonaws.<region>.s3`	Gateway	ECR layers live in S3 (must add!)
`com.amazonaws.<region>.logs`	Interface	CloudWatch Logs (awslogs driver)
`com.amazonaws.<region>.secretsmanager`	Interface	If injecting secrets
`com.amazonaws.<region>.ssm` / `ssmmessages`	Interface	SSM params / ECS Exec
`com.amazonaws.<region>.sts`	Interface	IRSA / role assumption (EKS)
`com.amazonaws.<region>.ecs` / `ecs-agent` / `ecs-telemetry`	Interface	ECS agent comms (EC2 launch)

Forgetting the S3 gateway endpoint is the classic: ecr.api/ecr.dkr resolve, auth succeeds, but the layer download (which goes to S3) hangs → task stuck in PROVISIONING or CannotPullContainerError.
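One way to avoid the forgotten-endpoint trap is to derive the required list from what the task actually does and diff it against what the VPC has. The following is a minimal sketch under that assumption; `required_endpoints` and `missing_endpoints` are hypothetical helpers, and the list of existing endpoints would come from your own inventory (for example the output of `aws ec2 describe-vpc-endpoints`).

```python
def required_endpoints(region, uses_secrets=False, uses_ssm_or_exec=False):
    """Build the endpoint service names a private-subnet task needs (per the table above)."""
    prefix = f"com.amazonaws.{region}."
    names = ["ecr.api", "ecr.dkr", "s3", "logs"]  # image pull + log shipping
    if uses_secrets:
        names.append("secretsmanager")
    if uses_ssm_or_exec:
        names += ["ssm", "ssmmessages"]
    return [prefix + name for name in names]


def missing_endpoints(required, present):
    """Return the required service names that are absent from the VPC, in order."""
    present_set = set(present)
    return [name for name in required if name not in present_set]


# Example inventory (made up for illustration): the S3 gateway endpoint was forgotten.
existing = [
    "com.amazonaws.ap-south-1.ecr.api",
    "com.amazonaws.ap-south-1.ecr.dkr",
    "com.amazonaws.ap-south-1.logs",
]
print(missing_endpoints(required_endpoints("ap-south-1"), existing))
```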

EKS VPC CNI — IP exhaustion math

The EKS VPC CNI gives each pod a real VPC IP, pre-allocating a warm pool per node. Without prefix delegation, a node’s pod density is capped by its ENI/IP limits, and large clusters exhaust /24s fast.

Lever	Effect	Trade-off
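Prefix delegation	Assigns /28 prefixes (16 IPs per slot) → up to ~16× more pod IPs per node (Nitro instances only; max-pods still caps density)	Needs contiguous free /28 blocks, so fragmented subnets hurt; enable early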
Custom networking	Pods in a secondary CIDR	More config; preserves primary subnet IPs
Bigger subnets (/19+)	More headroom	Plan CIDRs up front; hard to change later
Fewer, larger nodes	Fewer warm-pool IPs wasted	Larger blast radius per node
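The pod-density arithmetic behind this table can be sketched in a few lines. This follows the commonly used max-pods formula for the VPC CNI; the ENI and IP limits in the example (3 ENIs, 10 IPv4 addresses each, typical of an m5.large-class instance) and the cap of 110 pods are assumptions to verify against the EC2 and EKS documentation for your instance type.

```python
def max_pods(enis, ips_per_eni, prefix_delegation=False, cap=110):
    """Estimate pods per node under the VPC CNI.

    Each ENI keeps one address as its primary IP, leaving (ips_per_eni - 1)
    slots. Without prefix delegation a slot holds one pod IP; with it, a slot
    holds a /28 prefix (16 IPs). The +2 accounts for host-network pods.
    """
    slots = enis * (ips_per_eni - 1)
    if prefix_delegation:
        return min(slots * 16 + 2, cap)  # cap is an assumed recommended ceiling
    return slots + 2


print(max_pods(3, 10))                          # without prefix delegation
print(max_pods(3, 10, prefix_delegation=True))  # with prefix delegation, capped
```

With these assumed limits the node goes from 29 pods to the cap of 110, which is why enabling prefix delegation usually moves the bottleneck from IP slots to CPU and memory.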

Deployments & rollouts — keeping the service alive during change

ECS deployment controllers

Controller	Behaviour	Use when
ECS rolling (default)	Replaces tasks per min/max healthy %	Default; simple rolling update
CodeDeploy blue/green	Shifts ALB traffic to a new task set	Safe canary/linear/all-at-once with rollback
EXTERNAL	You drive task sets via API	Custom deployment tooling

Deployment-tuning knobs (ECS)

Setting	Controls	Default	Gotcha
`minimumHealthyPercent`	How many tasks stay up during deploy	100	Too high + no spare capacity = stuck deploy
`maximumPercent`	Burst capacity during deploy	200	No issue on Fargate (no nodes to fill); on EC2 the cluster needs spare capacity for the extra tasks
`deploymentCircuitBreaker.rollback`	Auto-roll-back on failed deploy	off	Turn ON — saves a bad rollout
Health-check grace period	Ignore ALB health for N s after start	0	Set it for slow-booting apps or you’ll thrash
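To see what the two percentages mean for a concrete service, the bounds can be computed directly. This sketch assumes the documented rounding (the minimum rounds up, the maximum rounds down); `ecs_deploy_bounds` is a hypothetical helper, not an AWS API.

```python
def ecs_deploy_bounds(desired, min_healthy_pct=100, max_pct=200):
    """Return (fewest tasks kept running, most tasks allowed) during a rolling deploy."""
    floor_tasks = -(-desired * min_healthy_pct // 100)  # ceiling division
    ceiling_tasks = desired * max_pct // 100            # floor division
    return floor_tasks, ceiling_tasks


print(ecs_deploy_bounds(4))            # defaults: 100 / 200
print(ecs_deploy_bounds(3, 50, 150))   # tighter burst, looser floor
```

With the defaults and a desired count of 4, ECS must keep 4 tasks up and may run up to 8, so the cluster needs room for 4 extra tasks. That is the headroom an EC2-backed service has to have.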

Kubernetes rollout knobs (EKS)

Setting	Controls	Notes
`strategy.rollingUpdate.maxUnavailable`	Pods down during rollout	Lower = safer, slower
`strategy.rollingUpdate.maxSurge`	Extra pods during rollout	Needs node headroom (or Karpenter scales)
`readinessProbe`	When a pod joins the LB	Wrong path → pod never Ready → no endpoints
`livenessProbe`	When kubelet restarts a pod	Too aggressive = crash-loop you caused
`PodDisruptionBudget`	Min available during drains	Protects availability during node upgrades
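The Kubernetes knobs round in the opposite directions from what people often expect: a percentage `maxSurge` rounds up, a percentage `maxUnavailable` rounds down. A minimal sketch of the resulting bounds, with `rollout_bounds` as a hypothetical helper and the 25%/25% Deployment defaults assumed:

```python
def rollout_bounds(replicas, max_surge_pct=25, max_unavailable_pct=25):
    """Return (minimum available pods, maximum total pods) during a rolling update."""
    surge = -(-replicas * max_surge_pct // 100)          # rounds up
    unavailable = replicas * max_unavailable_pct // 100  # rounds down
    return replicas - unavailable, replicas + surge


print(rollout_bounds(10))
```

For 10 replicas that yields at least 8 available and at most 13 total, so the nodes (or Karpenter) must be able to place 3 extra pods mid-rollout.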

Architecture at a glance

Trace one HTTPS request and you can see the whole 2×2 in a single path. A client hits an Application Load Balancer on :443 (TLS terminates here). Because the containers run with awsvpc networking, the ALB’s target group must be target-type ip — it sends traffic to the task’s own ENI IP, not to a host. From there the request enters the control plane you chose: ECS (AWS-native, no cluster fee) or EKS (the Kubernetes API at $0.10/hr). That orchestrator schedules the work onto the data plane you chose: Fargate (serverless, 0.25–16 vCPU, nothing to patch) or EC2 nodes (you own the AMI; Graviton, Spot and GPU live here). Whichever combination, the actual workload is the same container — an image pulled from ECR over :443, running as a task or pod under a task role (ECS) or IRSA (EKS). Finally every task leans on shared platform dependencies: CloudWatch for logs and metrics, and IAM split into an execution role (pull the image, write logs, read secrets) and a task role (the app’s own AWS calls).

The numbered badges mark the five places this architecture most often goes wrong or forces a decision. (1) is the ALB target-type trap — awsvpc demands ip, and an instance target group gives you a 503 with no healthy targets. (2) is the orchestrator fork itself: pay the EKS control-plane fee and add-on toil only for genuine Kubernetes needs. (3) is the Fargate-vs-EC2 trade — serverless simplicity versus steady-state cost, GPU and daemons. (4) is the dreaded PROVISIONING hang: an ENI that can’t attach because the subnet is out of IPs or the private subnet lacks ECR/S3/logs endpoints. (5) is the IAM split — mixing the execution and task roles is why images won’t pull or secrets won’t resolve. Read the diagram left to right and the badges become your pre-flight checklist.

ECS vs EKS over Fargate vs EC2: one HTTPS request through an ALB (target-type ip) into the chosen orchestrator and launch type, the ECR image and task/pod, and the CloudWatch + IAM platform dependencies, with five numbered failure/decision points

Real-world scenario

Northwind Stream (fictional) is a 40-engineer media-analytics SaaS on AWS. Two years ago a newly hired platform lead stood up a single large EKS cluster “to be cloud-native,” and everything went on it: the customer-facing web/API tier (a dozen stateless Go and Node services), a set of scheduled batch jobs, and the data platform (Spark on the Spark Operator, plus a couple of bespoke controllers). It worked — until operating it became the team’s main job.

The symptoms were classic misallocation. The platform group had grown to three full-time engineers whose week was Kubernetes upkeep: a minor-version upgrade every cycle (with the obligatory VPC CNI / CoreDNS compatibility testing), recurring IP-exhaustion alerts as the CNI burned through a /23 (prefix delegation hadn’t been enabled), AWS Load Balancer Controller CVEs to patch, Helm-chart drift, and a painful incident where a bad CoreDNS bump broke service discovery for twenty minutes. None of this produced customer value. Meanwhile the web tier — pure stateless HTTP behind an ALB — used nothing Kubernetes-specific. It was paying the full cluster tax for zero benefit.

The architecture review split the workloads along the two axes honestly:

Workload	Needs k8s?	Utilization	Decision	Why
Web / API tier (12 services)	No	Spiky daytime	ECS + Fargate	Stateless + ALB; no k8s features used; min ops
Scheduled batch (reports)	No	Bursty, scale-to-zero	ECS + Fargate (Spot)	EventBridge-triggered; cheap; no idle nodes
Data platform (Spark, controllers)	Yes	Steady, heavy, GPU-ish	EKS + EC2 (Karpenter, Spot, Graviton)	Operators/CRDs are the value; cost-tuned nodes
ML inference (GPU)	Maybe	Steady	EKS + EC2 GPU	Operators + GPU; Fargate can’t do GPU

They migrated the web tier to ECS Fargate behind the existing ALBs (recreating the target groups as target-type ip), moved the batch jobs to ECS Fargate Spot triggered by EventBridge Scheduler, and kept the data platform on EKS — but moved its nodes to Karpenter on Graviton Spot, and finally enabled prefix delegation to kill the IP-exhaustion alerts. The numbers afterward: the platform team shrank from three engineers to one; the EKS cluster’s blast radius dropped (only the data platform now depends on it); the batch tier’s compute bill fell sharply because it scaled to zero between runs; and the Spark/ML workloads got cheaper on Karpenter+Graviton+Spot while gaining the JIT bin-packing they’d lacked. The lesson the lead wrote in the post-mortem: “Kubernetes was the right tool for the 20% of our workload that needed it, and an expensive mistake for the 80% that didn’t. We chose on the wrong axis — we picked an orchestrator before asking whether each workload needed one.”

Advantages and disadvantages

Dimension	Advantage	Disadvantage
ECS	$0 control plane; lowest learning curve; native AWS integration; per-task IAM	AWS-only; no CRDs/operators; less extensible
EKS	Full Kubernetes API; portable; huge ecosystem; advanced scheduling	$0.10/hr/cluster; add-on + upgrade toil; bigger blast radius
Fargate	No node patching/scaling/right-sizing; per-second billing; per-task isolation	~20–50% per-vCPU premium; no GPU/DaemonSet/privileged; fixed CPU/mem pairs
EC2	Cheaper at steady state; Spot/Graviton/GPU; daemons; custom AMI; bin-packing	You patch/scale/secure nodes; capacity planning; host security surface

In prose: ECS wins when the workload is “just containers + a load balancer” and you value shipping over operating — which is most workloads, most of the time. EKS wins precisely when you can name the Kubernetes capability you depend on (an operator, CRDs, a mesh, portability) — and when that value clears the bar of the control-plane fee plus the permanent add-on/upgrade surface. On the other axis, Fargate wins on spiky utilization, small teams, and anything you’d rather not babysit; its premium is real but often dwarfed by the salary cost of node operations at small scale. EC2 wins on steady, high-utilization fleets where bin-packing plus Savings Plans/Spot/Graviton make it dramatically cheaper, and on the hard requirements Fargate simply can’t meet (GPU, DaemonSets, custom kernels). The corners are not ranked; they are matched to a workload’s shape.
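The two-axis reasoning above reduces to two independent questions, which a few lines of code make explicit. This is a deliberately simplified sketch of the article's heuristic, not a sizing tool; `choose_corner` and its three flags are hypothetical names.

```python
def choose_corner(needs_kubernetes, needs_gpu_or_daemons=False,
                  steady_high_utilization=False):
    """Pick (orchestrator, launch type) from the two independent axes."""
    # Axis one: only a named Kubernetes requirement justifies EKS.
    orchestrator = "EKS" if needs_kubernetes else "ECS"
    # Axis two: hard requirements or steady load justify owning nodes.
    if needs_gpu_or_daemons or steady_high_utilization:
        launch_type = "EC2"
    else:
        launch_type = "Fargate"
    return orchestrator, launch_type


# The Northwind Stream workloads from the scenario table:
print(choose_corner(needs_kubernetes=False))                               # web / API tier
print(choose_corner(needs_kubernetes=True, steady_high_utilization=True))  # data platform
```

Note that neither answer depends on the other, which is the point of treating the choice as two axes.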

Hands-on lab

This lab deploys a tiny HTTP container on ECS Fargate behind an ALB, hits it, then tears everything down. It is free-tier-adjacent (Fargate and ALB are not free, but a few minutes costs cents). Run in a Region like ap-south-1 (Mumbai). Replace IDs with yours.

Step 0 — prerequisites

aws sts get-caller-identity            # confirm you're authenticated
aws ec2 describe-vpcs --query 'Vpcs[0].VpcId' --output text   # note a VPC id

Step 1 — create a cluster

aws ecs create-cluster --cluster-name lab --capacity-providers FARGATE FARGATE_SPOT

Expected: JSON with "status": "ACTIVE".

Step 2 — an execution role (pull image + write logs)

aws iam create-role --role-name lab-exec \
  --assume-role-policy-document '{"Version":"2012-10-17","Statement":[{"Effect":"Allow","Principal":{"Service":"ecs-tasks.amazonaws.com"},"Action":"sts:AssumeRole"}]}'
aws iam attach-role-policy --role-name lab-exec \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy

Step 3 — register a task definition (taskdef.json)

{
  "family": "lab-web",
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc",
  "cpu": "256", "memory": "512",
  "executionRoleArn": "arn:aws:iam::<acct>:role/lab-exec",
  "containerDefinitions": [{
    "name": "web",
    "image": "public.ecr.aws/nginx/nginx:latest",
    "portMappings": [{ "containerPort": 80 }],
    "essential": true,
    "logConfiguration": {
      "logDriver": "awslogs",
      "options": {
        "awslogs-group": "/ecs/lab-web",
        "awslogs-region": "ap-south-1",
        "awslogs-stream-prefix": "web",
        "awslogs-create-group": "true"
      }
    }
  }]
}

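aws logs create-log-group --log-group-name /ecs/lab-web   # AmazonECSTaskExecutionRolePolicy lacks logs:CreateLogGroup, so create the group up front
aws ecs register-task-definition --cli-input-json file://taskdef.json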
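Fargate only accepts specific CPU/memory pairs, and an invalid pair makes the task definition fail to register. The sketch below validates a pair before you register it. The table reflects the Fargate combinations as I understand them at the time of writing (CPU units, memory in MiB), so treat it as an assumption and check the current ECS documentation; `is_valid_fargate_size` is a hypothetical helper.

```python
# Assumed valid Fargate sizes: CPU units -> allowed memory values in MiB.
FARGATE_SIZES = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
    8192: list(range(16384, 61440 + 1, 4096)),
    16384: list(range(32768, 122880 + 1, 8192)),
}


def is_valid_fargate_size(cpu, memory):
    """Accept the string values used in a task definition, e.g. "256" and "512"."""
    return int(memory) in FARGATE_SIZES.get(int(cpu), [])


print(is_valid_fargate_size("256", "512"))   # the lab's task size
print(is_valid_fargate_size("256", "4096"))  # too much memory for 0.25 vCPU
```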

Step 4 — an ALB + a TARGET-TYPE IP target group (the lab’s whole point)

aws elbv2 create-load-balancer --name lab-alb --type application \
  --subnets subnet-1 subnet-2 --security-groups sg-alb
aws elbv2 create-target-group --name lab-tg --protocol HTTP --port 80 \
  --vpc-id vpc-0abc --target-type ip --health-check-path /
# create a listener on :80 forwarding to lab-tg (ARNs from the two commands above)
aws elbv2 create-listener --load-balancer-arn <alb-arn> --protocol HTTP --port 80 \
  --default-actions Type=forward,TargetGroupArn=<tg-arn>

Step 5 — run the service

aws ecs create-service --cluster lab --service-name web \
  --task-definition lab-web --desired-count 2 --launch-type FARGATE \
  --network-configuration 'awsvpcConfiguration={subnets=[subnet-1,subnet-2],securityGroups=[sg-web],assignPublicIp=ENABLED}' \
  --load-balancers 'targetGroupArn=<tg-arn>,containerName=web,containerPort=80'

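Note: assignPublicIp=ENABLED lets the lab pull the public ECR image without VPC endpoints, provided subnet-1 and subnet-2 are public subnets (routed to an internet gateway). sg-web must also allow inbound port 80 from sg-alb, or the targets never turn healthy. In production you’d use private subnets + the endpoints in the table above.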

Step 6 — verify

aws ecs describe-services --cluster lab --services web \
  --query 'services[0].deployments[0].runningCount'      # → 2 when ready
aws elbv2 describe-target-health --target-group-arn <tg-arn> \
  --query 'TargetHealthDescriptions[].TargetHealth.State' # → ["healthy","healthy"]
curl http://<alb-dns-name>/                               # → nginx welcome HTML

Step 7 — teardown (avoid charges)

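aws ecs update-service --cluster lab --service web --desired-count 0
aws ecs delete-service --cluster lab --service web --force
aws ecs wait services-inactive --cluster lab --services web   # block until the service is fully gone
aws elbv2 delete-listener --listener-arn <listener-arn>
aws elbv2 delete-load-balancer --load-balancer-arn <alb-arn>
aws elbv2 delete-target-group --target-group-arn <tg-arn>
aws ecs delete-cluster --cluster lab
aws logs delete-log-group --log-group-name /ecs/lab-web
# the IAM role costs nothing, but remove it to leave the account clean
aws iam detach-role-policy --role-name lab-exec \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy
aws iam delete-role --role-name lab-exec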

Common mistakes & troubleshooting

Containers fail in a small set of recurring ways, each with a specific root cause and an exact confirm step. Use this section as a playbook: match the symptom, confirm the cause, apply the fix.

#	Symptom	Root cause	Confirm (exact command / path)	Fix
1	Task stuck in PROVISIONING	ENI can’t attach: no free subnet IPs, or private subnet missing ECR/S3/logs endpoints	`aws ecs describe-tasks ... --query 'tasks[0].stoppedReason'` (populated once the task times out and stops); check subnet free IPs with `aws ec2 describe-subnets` (`AvailableIpAddressCount`)	Free IPs / bigger subnet; add ECR(api,dkr)+S3 gateway+logs endpoints
2	`CannotPullContainerError`	Execution role lacks ECR perms, or no route to ECR/S3	Task `stoppedReason`; CloudTrail `ecr:GetAuthorizationToken` deny	Attach `AmazonECSTaskExecutionRolePolicy`; add ECR+S3 endpoints or NAT
3	`ResourceInitializationError: unable to pull secrets`	Execution role can’t read Secrets Manager/SSM, or no endpoint	`stoppedReason`; secret ARN in task def	Grant exec role `secretsmanager:GetSecretValue`; add SM endpoint
4	App throws AccessDenied calling S3/DDB	Permission put on execution role, not task role	App logs; the call uses the task role	Move the app’s policy to the task role (`taskRoleArn`)
5	ALB returns 503, no healthy targets	Target group is target-type instance for an `awsvpc`/Fargate service	`aws elbv2 describe-target-health` shows no/`unhealthy` targets	Recreate TG `--target-type ip`; health-check the container port
6	Targets unhealthy, app is fine	Health-check path/port wrong; SG blocks ALB→task	`describe-target-health` reason `Target.ResponseCodeMismatch`/`Timeout`	Fix `--health-check-path`/port; allow ALB SG → task SG on the port
7	Container exits with code 137	OOM — exceeded task/container memory	`stoppedReason: OutOfMemoryError`; Container Insights memory	Raise `memory`; fix leak; set `memoryReservation` sensibly
8	Crash loop (task restarts forever)	App throws at startup (bad env/secret/migration)	`aws logs tail /ecs/<svc>` repeating trace; ECS events	Fix config; enable circuit-breaker rollback; add health grace period
9	EKS pods `Pending`	No capacity (no nodes) or no matching Fargate profile	`kubectl describe pod` → `FailedScheduling`/`Insufficient cpu`	Scale nodes/Karpenter; add a Fargate profile matching the labels
10	EKS pods `ContainerCreating` forever	IP exhaustion (VPC CNI) or CNI not ready	`kubectl describe pod` → `failed to assign an IP`; `aws-node` logs	Enable prefix delegation; bigger subnets; restart CNI
11	Spot interruption kills tasks/nodes	Fargate Spot/EC2 Spot reclaimed	ECS events / node `Terminating`; Spot interruption notice	Run an on-demand base via capacity-provider strategy; PDBs (EKS)
12	Deploy stuck, never completes	`minimumHealthyPercent 100` + no spare EC2 capacity	`aws ecs describe-services` deployment `IN_PROGRESS` forever	Lower min-healthy or add capacity; Fargate avoids this
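13	504 Gateway Timeout from ALB	Target did not respond within the ALB idle timeout (60 s by default)	App latency in logs; ALB `TargetResponseTime` metric; ALB idle timeout setting	Speed up the app or fix the slow downstream; raise the ALB idle timeout if long requests are legitimate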
14	`exec format error` on start	arm64 image on x86 task (or vice-versa)	Container logs first line	Build multi-arch image; match `runtimePlatform`
15	ECS Exec fails	Missing SSM endpoints, `enableExecuteCommand` off, or task role lacks `ssmmessages` permissions	`aws ecs execute-command` error; `aws ecs describe-tasks` → `enableExecuteCommand`	Enable exec and redeploy; add `ssm`/`ssmmessages` endpoints; task-role SSM perms

A few reading notes that save the most time:

Always read stoppedReason first. aws ecs describe-tasks --cluster <c> --tasks <id> --query 'tasks[0].stoppedReason' tells the truth for almost every Fargate failure (pull, secrets, ENI, OOM). On EKS, the equivalent is kubectl describe pod events.
Execution role vs task role is the most common IAM confusion. Execution role = the platform pulling your image, writing your logs, reading your secrets before/around your code. Task role = your code’s AWS identity. If the image won’t pull or a secret won’t resolve, it’s the execution role. If your app gets AccessDenied calling AWS, it’s the task role.
target-type ip is mandatory for awsvpc/Fargate. This single setting causes more “ALB returns 503” tickets than anything else.
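The "read stoppedReason first" habit can be turned into a tiny triage helper that maps the message to the first suspect from the playbook table. The sketch below is a hedged illustration: the sample strings are abbreviated examples rather than exact AWS output, and `triage` is a hypothetical function.

```python
# (substring to look for, first suspect) pairs, taken from the playbook table above.
RULES = [
    ("CannotPullContainerError", "execution role or network path to ECR/S3"),
    ("unable to pull secrets", "execution role or Secrets Manager/SSM endpoint"),
    ("OutOfMemoryError", "task/container memory limit"),
]


def triage(stopped_reason):
    """Return the first suspect for an ECS stoppedReason string."""
    lowered = stopped_reason.lower()
    for needle, suspect in RULES:
        if needle.lower() in lowered:
            return suspect
    return "unclassified: read the ECS service events and container logs"


print(triage("CannotPullContainerError: pull image manifest has been retried"))
print(triage("ResourceInitializationError: unable to pull secrets or registry auth"))
```

In both example cases the answer points at the execution role or the network path, never the task role, which matches the role split described above.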

Best practices

Default to ECS; adopt EKS only with a named Kubernetes requirement. Write the requirement down (operator X, CRD Y, portability to Z). If you can’t, you don’t need EKS.
Default to Fargate; move to EC2 when the bill or a hard requirement says so. Start serverless; switch steady, high-utilization or GPU/daemon workloads to EC2 once utilization and cost data justify it.
Always use target-type ip target groups for awsvpc/Fargate, and health-check the container port — not a host port.
Split IAM correctly: execution role for pull/logs/secrets, task role for the app. Keep both least-privilege; never reuse one for both jobs.
In private subnets, add the ECR(api+dkr) + S3 gateway + logs endpoints (and secretsmanager/sts/ssm as needed) — or you’ll chase PROVISIONING hangs.
Turn on the ECS deployment circuit breaker with rollback so a bad task definition auto-reverts instead of taking the service down.
Pin image digests/tags; never :latest in production. Reproducible deploys and clean rollbacks depend on it.
Run an on-demand base + Spot burst (capacity-provider strategy on ECS; Karpenter Spot with on-demand fallback on EKS) for cost without fragility.
On EKS, enable VPC CNI prefix delegation early and plan CIDRs generously — IP exhaustion is painful to fix after the fact.
On EKS, keep add-ons current and test upgrades on a non-prod cluster (or use blue/green clusters). CNI/CoreDNS/kube-proxy versions must be compatible with the control-plane version.
Right-size with Container Insights / metrics, then commit (Savings Plans for Fargate+EC2) once usage is steady.
Prefer Graviton (arm64) for both Fargate and EC2 where your stack supports it; build multi-arch images so you’re not locked to x86.
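The "on-demand base + Spot burst" practice is easier to reason about with numbers. The sketch below models an ECS capacity-provider strategy as a base (tasks placed first) plus weights (share of the remainder). The handling of rounding leftovers is my own assumption, not documented ECS behavior, and `split_tasks` is a hypothetical helper, so use divisible examples when sanity-checking a real strategy.

```python
def split_tasks(desired, strategy):
    """Distribute tasks across providers.

    strategy: list of (provider_name, base, weight) tuples.
    """
    counts = {name: 0 for name, _, _ in strategy}
    remaining = desired
    # Step 1: satisfy each provider's base first.
    for name, base, _ in strategy:
        take = min(base, remaining)
        counts[name] += take
        remaining -= take
    # Step 2: split what is left in proportion to the weights.
    total_weight = sum(weight for _, _, weight in strategy)
    if remaining and total_weight:
        shares = [(name, remaining * weight / total_weight)
                  for name, _, weight in strategy]
        for name, share in shares:
            counts[name] += int(share)
        leftover = desired - sum(counts.values())
        # Assumption: rounding leftovers go to the largest fractional shares.
        by_fraction = sorted(shares, key=lambda s: s[1] - int(s[1]), reverse=True)
        for name, _ in by_fraction[:leftover]:
            counts[name] += 1
    return counts


# Base of 2 on-demand tasks, then a 1:3 on-demand:Spot split of the rest.
print(split_tasks(10, [("FARGATE", 2, 1), ("FARGATE_SPOT", 0, 3)]))
```

With a desired count of 10, this places 4 tasks on FARGATE and 6 on FARGATE_SPOT, so a full Spot reclaim still leaves 4 tasks serving traffic.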

Security notes

Least-privilege task roles. Scope the task role to exactly the AWS APIs the app calls; never grant *. On EKS, use IRSA or Pod Identity so each ServiceAccount maps to a minimal role — not node-wide instance-profile permissions.
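Don’t put secrets in plaintext environment entries. Inject them from Secrets Manager/SSM Parameter Store via the task definition secrets block (or External Secrets/CSI on EKS); the execution role resolves them at start-up, so only the secret’s ARN, never its value, appears in the task-definition history. The app can still leak a value by logging its own environment, so keep secrets out of log output too.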
Private subnets + endpoints, no public IPs. Run tasks with assignPublicIp=DISABLED in private subnets and reach AWS via VPC endpoints, keeping image pulls and log shipping off the public internet.
Per-task security groups (awsvpc) let you micro-segment: the ALB SG → app SG on the container port only; app SG → database SG on the DB port only. On EKS, add network policies for pod-to-pod controls.
Image provenance. Scan images in ECR (enhanced scanning / Inspector), pin digests, and ideally enforce signed images. A vulnerable base image is a vulnerable fleet.
EKS RBAC + IAM together. Lock down Access Entries / aws-auth so only the right principals reach the API server, and use Kubernetes RBAC for in-cluster authorization. Restrict the public API endpoint or make it private.
No privileged unless required. On Fargate it’s impossible (a security feature). On EC2/EKS, drop Linux capabilities you don’t need and avoid privileged: true.
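The "never grant *" rule for task roles is easy to lint. The following sketch scans an IAM policy document (as a Python dict) for Allow statements with wildcard actions. It is a minimal illustration under simple assumptions: it ignores `NotAction`, conditions and resource wildcards, and `wildcard_statements` is a hypothetical helper.

```python
def wildcard_statements(policy):
    """Return the Allow statements whose Action is '*' or 'service:*'."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # IAM allows a single statement object
        statements = [statements]
    flagged = []
    for statement in statements:
        if statement.get("Effect") != "Allow":
            continue
        actions = statement.get("Action", [])
        if isinstance(actions, str):  # a single action may be a bare string
            actions = [actions]
        if any(a == "*" or a.endswith(":*") for a in actions):
            flagged.append(statement)
    return flagged


# Example task-role policy (made up for illustration).
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {"Effect": "Allow", "Action": "s3:GetObject", "Resource": "arn:aws:s3:::my-bucket/*"},
        {"Effect": "Allow", "Action": ["dynamodb:*"], "Resource": "*"},
    ],
}
print(len(wildcard_statements(policy)))
```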

Cost & sizing

What drives the bill differs sharply by corner. Roughly (us-east-1-class on-demand list; INR at ~₹84/USD; verify current pricing):

Cost driver	Applies to	Rough figure	Notes
Fargate vCPU	ECS/EKS Fargate	~$0.04048 / vCPU-hr	Per-second, 1-min minimum
Fargate memory	ECS/EKS Fargate	~$0.004445 / GB-hr	Billed alongside vCPU
Fargate Spot	ECS Fargate	up to ~70% off	Interruptible; batch/dev
EKS control plane	Every EKS cluster	$0.10/hr (~$73/mo / ~₹6,100)	Per cluster — consolidate!
EC2 on-demand	EC2 launch type	Instance-hour	Cheaper than Fargate at high utilization
EC2 Spot	EC2 launch type	up to ~90% off	Interruptible; Karpenter handles it
Graviton (arm64)	Fargate & EC2	~20–40% better price/perf	Multi-arch image required
NAT Gateway	Private tasks w/o endpoints	~$0.045/hr + $0.045/GB	Endpoints often cheaper at scale
Interface VPC endpoint	Private tasks	~$0.01/hr per endpoint per AZ + ~$0.01/GB	Fixed hourly charge multiplies by endpoint count × AZs; adds up with many
ALB	Fronting services	~$0.0225/hr + LCU	Shared across many target groups
CloudWatch Logs	All	~$0.50/GB ingest + storage	Sample/route via FireLens to cut cost
Container Insights	Optional	Per metric/log	Useful but priced; scope it
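To make the Fargate rows concrete, here is the arithmetic for a small always-on task next to the EKS control-plane fee. The rates are the rough list figures from the table above (verify current pricing), and 730 hours per month is an assumed averaging convention.

```python
FARGATE_VCPU_HR = 0.04048    # USD per vCPU-hour, from the table above
FARGATE_GB_HR = 0.004445     # USD per GB-hour, from the table above
EKS_CLUSTER_HR = 0.10        # USD per cluster-hour
HOURS_PER_MONTH = 730        # assumed average month


def fargate_monthly(vcpu, memory_gb, tasks=1, hours=HOURS_PER_MONTH):
    """Monthly on-demand cost in USD for always-on Fargate tasks."""
    hourly = vcpu * FARGATE_VCPU_HR + memory_gb * FARGATE_GB_HR
    return hourly * hours * tasks


print(round(fargate_monthly(0.25, 0.5), 2))           # one 0.25 vCPU / 0.5 GB task
print(round(fargate_monthly(0.25, 0.5, tasks=2), 2))  # the lab's two tasks
print(round(EKS_CLUSTER_HR * HOURS_PER_MONTH, 2))     # one EKS control plane
```

One small task comes to about $9 a month, so a single EKS control plane costs roughly as much as eight of them before any workload runs.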

Right-sizing guidance

Decision	Heuristic
Fargate task size	Start at the smallest valid CPU/mem pair that fits; scale out, not up, first
Fargate vs EC2 crossover	Above ~60–70% steady utilization 24×7, EC2 (with Reserved Instances or Savings Plans) usually wins
Spot mix	On-demand base for availability + Spot burst for cost (e.g. 1:3)
EKS cluster count	Consolidate — each cluster is $73/mo; use namespaces/RBAC, not extra clusters
arm64 adoption	Default new services to Graviton if the stack supports it
Logs spend	Route with FireLens, sample debug logs, set retention
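The crossover heuristic can be derived rather than memorized: EC2 wins once the fraction of the instance you actually use exceeds the ratio of its hourly price to the Fargate price for the same capacity. The sketch below assumes a 2 vCPU / 8 GB instance at $0.096/hr on-demand and a 30% commitment discount; both figures are illustrative assumptions, and the Fargate rates come from the cost table above.

```python
FARGATE_VCPU_HR = 0.04048   # USD per vCPU-hour, from the cost table
FARGATE_GB_HR = 0.004445    # USD per GB-hour, from the cost table


def breakeven_utilization(ec2_hourly, vcpu, memory_gb):
    """Utilization above which the EC2 instance is cheaper than equivalent Fargate."""
    fargate_hourly = vcpu * FARGATE_VCPU_HR + memory_gb * FARGATE_GB_HR
    return ec2_hourly / fargate_hourly


on_demand = 0.096                      # assumed price for a 2 vCPU / 8 GB instance
committed = on_demand * (1 - 0.30)     # assumed 30% Savings Plan discount

print(round(breakeven_utilization(on_demand, 2, 8), 3))
print(round(breakeven_utilization(committed, 2, 8), 3))
```

Under these assumptions the break-even is about 82% utilization on-demand and about 58% with the discount, which is consistent with the ~60–70% rule of thumb in the table.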

Free-tier-ish notes

ECS itself is free (you pay only compute) — there is no orchestrator charge.
EKS has no free tier: the $0.10/hr control-plane fee starts immediately, so don’t leave idle clusters running.
Fargate/ALB are not free but a few minutes of the lab above costs only cents — just remember to tear down.

Interview & exam questions

Q1. ECS vs EKS in one sentence — when each? ECS is AWS’s native orchestrator with no control-plane fee and the lowest operational surface — use it for straightforward containerized services. EKS is conformant Kubernetes ($0.10/hr/cluster) — use it when you need the Kubernetes ecosystem (operators, CRDs, Helm), portability, or advanced scheduling. (SAP-C02)

Q2. Fargate vs EC2 launch type — the trade-off? Fargate is serverless (no nodes to patch/scale/right-size, per-second billing) at a per-vCPU premium and without GPU/DaemonSet/privileged support. EC2 is cheaper at steady high utilization and supports Spot, Graviton, GPU, daemons and custom AMIs, but you own host operations. (DVA-C02 / SAP-C02)

Q3. Your Fargate service behind an ALB returns 503 with no healthy targets. Why? Almost certainly the target group is target-type instance while Fargate uses awsvpc, so targets register the wrong way. Recreate the target group with --target-type ip and point the health check at the container port. (DVA-C02)

Q4. A Fargate task is stuck in PROVISIONING in a private subnet. First two suspects? No free IPs in the subnet for the task ENI, or missing VPC endpoints (ECR api+dkr, the S3 gateway endpoint for layers, and logs). Confirm via stoppedReason and subnet free-IP count. (SAP-C02)

Q5. Difference between the ECS execution role and task role? The execution role lets the ECS agent pull the image, write logs and read secrets (platform-side). The task role is the application’s own AWS identity for its API calls (S3, DynamoDB, etc.). CannotPullContainerError → execution role; app AccessDenied → task role. (DVA-C02)

Q6. When is EKS genuinely the right call over ECS? When you depend on Kubernetes-specific capabilities: existing Helm charts/operators/CRDs, a service mesh, advanced scheduling, multi-cloud portability, or workloads like Spark/Flink/ML that run on Kubernetes operators. Brand recognition is not a reason. (SAP-C02)

Q7. What can’t you run on EKS Fargate? DaemonSets, GPU workloads, privileged containers and hostNetwork/hostPort pods. Node-level agents (logging/security) must be sidecars; GPU/daemon needs require EC2 node groups. (CKA mindset / SAP-C02)

Q8. How do you give an EKS pod AWS permissions securely? Use IRSA (IAM Roles for Service Accounts via OIDC) or EKS Pod Identity to bind a least-privilege role to a ServiceAccount — not the node instance profile, which would over-grant every pod on the node. (DOP-C02)

Q9. Container exits with code 137 — what happened and how do you confirm? It was OOM-killed for exceeding its memory limit. Confirm via stoppedReason: OutOfMemoryError (ECS) or the pod’s OOMKilled reason (EKS) and Container Insights memory metrics; fix by raising memory or fixing the leak. (DVA-C02)

Q10. How do you cut container compute cost without sacrificing availability? Run an on-demand base plus Spot burst (capacity-provider strategy on ECS; Karpenter Spot with on-demand fallback on EKS), adopt Graviton/arm64, right-size from metrics, and commit Savings Plans once usage is steady. (SAP-C02)

Q11. Why does the EKS control plane cost matter for cluster strategy? Each cluster is $0.10/hr (~$73/mo). Spinning up a cluster per team/app multiplies that fee and the add-on toil. Consolidate with namespaces and RBAC instead, reserving separate clusters for genuine isolation needs. (SAP-C02)

Q12. What is Karpenter and why prefer it over the Cluster Autoscaler? Karpenter is a just-in-time node provisioner for EKS that watches pending pods and launches right-sized, cheapest-fit nodes (Spot-native, bin-packing) in seconds, then consolidates. It’s faster and more cost-efficient than ASG-based Cluster Autoscaler’s fixed node groups. (DOP-C02)

Quick check

You have a dozen stateless HTTP services, a five-person team, and no Kubernetes experience. Which corner of the 2×2 do you pick, and why?
Your Fargate service behind an ALB shows “no healthy targets” and clients get 503. What is the single most likely misconfiguration?
Name two VPC endpoints (besides logs) a private-subnet Fargate task needs to pull an image, and the easy one to forget.
The app logs AccessDenied calling DynamoDB. Execution role or task role — which do you fix?
Give one workload that genuinely justifies EKS over ECS and one that genuinely justifies EC2 over Fargate.

Answers

ECS + Fargate. Stateless containers + ALB use nothing Kubernetes-specific, and Fargate removes node ops — the lowest total operational surface for a small team. EKS would add a control-plane fee and an add-on/upgrade burden for zero benefit.
The target group is target-type instance instead of ip. awsvpc/Fargate tasks must be targeted by ENI IP; recreate the target group with --target-type ip and health-check the container port.
ecr.api and ecr.dkr (interface endpoints) — plus the easy-to-forget S3 gateway endpoint, because ECR image layers are stored in S3 and the download stalls without it.
The task role. The app’s own AWS calls use the task role; the execution role only covers image pull, log writes and secret reads.
EKS-justifying: Spark/ML on the Spark Operator (or anything needing CRDs/operators, a mesh, or multi-cloud portability). EC2-justifying: a GPU inference service (Fargate has no GPU) or a steady 24×7 fleet where Spot+Graviton+bin-packing is far cheaper.

Glossary

Amazon ECS — AWS’s native container orchestrator; schedules tasks, integrates with ALB/IAM/CloudWatch; no control-plane fee.
Amazon EKS — AWS-managed, conformant Kubernetes; full k8s API and ecosystem; $0.10/hr per cluster.
AWS Fargate — Serverless compute for containers; runs tasks/pods with no nodes to manage; billed per vCPU/GB-second.
EC2 launch type — Running tasks/pods on EC2 instances you provision; cheaper at steady state; supports Spot/Graviton/GPU/daemons.
Task definition — Immutable, versioned JSON describing an ECS task’s containers, CPU/memory, roles, networking and logging.
Execution role — IAM role the ECS agent uses to pull the image, write logs and read secrets (platform-side).
Task role — IAM role that is the application’s own AWS identity for its API calls.
awsvpc — Network mode giving each task/pod its own ENI, IP and security group; required by Fargate.
Target-type ip — ALB target-group mode that registers task/pod ENI IPs; required for awsvpc/Fargate services.
Capacity provider — ECS construct mapping a service to Fargate/Fargate-Spot/EC2 capacity (with weighted strategies).
Managed node group — AWS-managed EC2 Auto Scaling Group of EKS worker nodes with lifecycle handling.
Karpenter — EKS just-in-time node provisioner that launches right-sized, cheapest-fit (Spot) nodes and consolidates.
VPC CNI — EKS networking plugin that assigns each pod a real VPC IP; tuned via prefix delegation/custom networking.
IRSA — IAM Roles for Service Accounts; binds a least-privilege IAM role to a Kubernetes ServiceAccount via OIDC.
Fargate profile — EKS construct selecting which pods (namespace + labels) run on Fargate instead of nodes.
Graviton (arm64) — AWS Arm-based processors offering ~20–40% better price-performance; needs multi-arch images.

Next steps

Go deep on the launch-type cost mechanics in AWS Compute: EC2, Lambda, ECS and EKS — Which One to Choose?.
Get the load-balancer choice right (the target-type detail lives here) in AWS ALB vs NLB vs API Gateway Compared.
Master the awsvpc/endpoint/SG networking your tasks depend on in AWS VPC, Subnets and Security Groups Explained.
Lock down the execution/task roles and account guardrails via AWS Organizations & IAM Foundations.
Pair containers with event-driven glue (EventBridge-triggered batch tasks) using AWS Lambda Event-Driven Patterns.