AWS Containers

Amazon ECS & ECR, In Depth: Task Definitions, Services, Fargate vs EC2 & the Registry

A container bundles your application and everything it needs to run into one immutable artefact, and on a laptop docker run makes that feel trivial. Production is where the real questions start: where does the image live, who is allowed to pull it, how many copies run, what happens when one dies at 3am, how does a new version roll out without dropping requests, and how do other services find it. Amazon ECS (Elastic Container Service) and Amazon ECR (Elastic Container Registry) are AWS’s answers to exactly those questions, and together they are the fastest way to take a Dockerfile to a resilient, auto-scaling, load-balanced service without ever touching Kubernetes.

ECR is the registry — a managed, private (or public) place to store and version your container images, with vulnerability scanning and automated cleanup built in. ECS is the orchestrator — it takes a task definition (a JSON blueprint of your containers) and runs the requested number of copies as tasks, keeps them healthy as a service, registers them behind a load balancer, scales them on demand, and replaces them safely on each deploy. ECS runs those tasks either on Fargate (serverless — AWS owns the host) or on EC2 instances you provide, and choosing between them is one of the decisions this lesson makes easy.

This is the exhaustive version. We will walk ECR in full (registry types, push/pull, scanning, lifecycle policies, tag immutability), then every building block of ECS — the cluster, then the task definition field by field, then the difference between a task and a service, then launch types (Fargate vs EC2) with a decision table, then deployment types (rolling, blue/green, external) and the deployment circuit breaker, then service auto scaling, ALB integration, and service discovery / Service Connect. We finish with the question every interviewer asks — ECS vs EKS — and a hands-on lab you can run on the Free Tier. By the end you can ship a production container service on AWS and answer the certification questions about it cold.

Learning objectives

By the end of this lesson you will be able to:

Prerequisites & where this fits

You need an AWS account, the AWS CLI configured (aws configure), Docker installed locally to build and push an image, and a working grasp of IAM (ECS uses two distinct IAM roles) and VPC basics (the awsvpc network mode gives each task its own elastic network interface in your subnets, governed by a security group). Familiarity with the Application Load Balancer helps, since that is how most ECS services receive traffic. This is a Containers lesson in the AWS Zero-to-Hero course; it builds on the EC2, VPC, and ELB deep dives and is the foundation for the production companion, Production Amazon ECS on Fargate. After this, the course moves on to the Amazon CloudFront deep dive, which covers the CDN that often sits in front of an ECS-backed application.

Core concepts: the ECS object model

Before any settings, fix the mental model. ECS has a small, clean object hierarchy, and almost every confusion in interviews comes from blurring two of these terms. Learn them precisely.

The single most important distinction to internalise now is task vs service. A task is one running copy that, once it exits, stays exited. A service is the long-running supervisor that says “I want N healthy copies at all times” and makes that true — replacing failures, balancing across Availability Zones, and rolling out new versions. You run one-off jobs as standalone tasks (or via scheduled tasks); you run long-lived applications (web APIs, workers) as services.
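
To make the distinction concrete, a service can be pictured as a loop that compares the desired count with what is actually running and acts on the difference. The sketch below is only a mental model built around a hypothetical reconcile function; it is not how the ECS scheduler is implemented.

```shell
# reconcile is a hypothetical helper used only to illustrate the idea.
# Arguments: desired task count, currently running task count.
reconcile() {
  desired=$1
  running=$2
  diff=$((desired - running))
  if [ "$diff" -gt 0 ]; then
    echo "start $diff"
  elif [ "$diff" -lt 0 ]; then
    echo "stop $((0 - diff))"
  else
    echo "steady"
  fi
}

reconcile 3 3   # nothing to do
reconcile 3 1   # two tasks died, so the service starts two replacements
reconcile 2 3   # desired count was lowered, so one task is stopped
```

A standalone task has no such loop: when it exits, nothing brings it back.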


Part 1 — Amazon ECR (the registry)

ECR registry types: private vs public

ECR stores your container images so ECS (and EKS, Lambda, or anything that speaks the Docker/OCI protocol) can pull them. There are two registry types.

| Registry | What it is | Who can pull | Auth to pull | Typical use |
| --- | --- | --- | --- | --- |
| Private registry | One per account per Region; holds private repositories | Only principals you grant via IAM / repository policy | Required (token via aws ecr get-login-password) | Your application images: the default and the common case |
| Public registry (ECR Public / Amazon ECR Public Gallery) | A globally reachable registry at public.ecr.aws | Anyone, anonymously (auth only needed to push or for higher pull rate limits) | Not required to pull | Distributing images to the world (base images, open-source tools) |

For everything in this lesson we use the private registry — your application image is not something the public should pull. Each account gets one private registry per Region, addressed as <account-id>.dkr.ecr.<region>.amazonaws.com, containing as many repositories as you create.

Creating a repository: every setting

When you create a private repository (aws ecr create-repository, or console ECR → Create repository), these are the settings that matter.

| Setting | What it does | Choices | Default | When to change | Gotcha |
| --- | --- | --- | --- | --- | --- |
| Repository name | The repo’s name (can include a namespace path, e.g. team-a/checkout) | Free text, lowercase | | Use a clear team/app convention | Cannot be renamed after creation; you would create a new repo and re-push |
| Tag immutability | Whether a tag, once pushed, can be overwritten | MUTABLE or IMMUTABLE | MUTABLE | Set IMMUTABLE for production | With MUTABLE, re-pushing latest (or any tag) silently moves the tag to a new image, which is a supply-chain and rollback hazard |
| Scan on push | Run a vulnerability scan automatically when an image is pushed | On / Off (basic), or enhanced scanning at the registry level | Off (basic) | Turn on for any image you ship | Basic scan runs once on push; enhanced (via Amazon Inspector) continuously rescans as new CVEs are published |
| Encryption | Encrypt images at rest | AES256 (Amazon S3-managed) or AWS KMS (AWS-managed or your CMK) | AES256 | Use a CMK when you need key control/audit or cross-account key policies | Encryption type is fixed at creation; switching means a new repository |

ECR is just a store; access is controlled by IAM identity policies (what a principal may do) plus an optional repository policy (a resource policy on the repo — the mechanism for cross-account pulls and for granting AWS services access). The execution role your ECS task uses must have ECR read permissions, which we cover below.

Pushing and pulling images

ECR speaks the standard Docker registry protocol, so the workflow is docker login, then docker push or docker pull, with the login token obtained from the API. The canonical push flow:

ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
REGION=ap-south-1
REPO=demo-web

# 1. Authenticate Docker to your private registry (token valid 12 hours)
aws ecr get-login-password --region "$REGION" \
  | docker login --username AWS --password-stdin \
    "$ACCOUNT.dkr.ecr.$REGION.amazonaws.com"

# 2. Build and tag with the full ECR URI
docker build -t "$ACCOUNT.dkr.ecr.$REGION.amazonaws.com/$REPO:1.0.0" .

# 3. Push
docker push "$ACCOUNT.dkr.ecr.$REGION.amazonaws.com/$REPO:1.0.0"

Key facts: the login token from get-login-password is valid for 12 hours; the registry endpoint is per-account-per-Region; and you should tag images with a meaningful, unique version (a Git SHA or semantic version) rather than relying on latest. To pull (from ECS, CI, or a laptop) you authenticate the same way and docker pull the URI — but ECS does the pull for you using its execution role, so you rarely pull by hand in production.

Referencing images immutably. A tag is a movable label; a digest (@sha256:…) is the content hash and never changes. For reproducible, tamper-evident deploys, reference images by digest in your task definition (or enable tag immutability so a tag behaves like a digest).
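
As a small illustration of the tag-versus-digest point, the sketch below rewrites a tagged image reference into its digest form. The helper name to_digest_ref and the sample digest value are placeholders invented for this example; a real digest would come from the push output or from the describe-images call shown in the comment.

```shell
# to_digest_ref is a hypothetical helper: it swaps a trailing ":tag" for "@<digest>".
# Arguments: image reference with a tag, digest string (sha256:...).
to_digest_ref() {
  image=$1
  digest=$2
  # ${image%:*} drops everything from the last colon onward, i.e. the tag.
  printf '%s@%s\n' "${image%:*}" "$digest"
}

# A real digest could be looked up with, for example:
#   aws ecr describe-images --repository-name demo-web --image-ids imageTag=1.0.0 \
#     --query 'imageDetails[0].imageDigest' --output text
# The digest below is a made-up placeholder, not a real image hash.
to_digest_ref "111122223333.dkr.ecr.ap-south-1.amazonaws.com/demo-web:1.0.0" "sha256:PLACEHOLDER"
```

The resulting repo@sha256:... string can be used wherever the task definition expects an image URI.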

Image scanning: basic vs enhanced

ECR can scan images for known operating-system and language-package vulnerabilities (CVEs).

| Scan type | Engine | When it runs | Coverage | Cost |
| --- | --- | --- | --- | --- |
| Basic scanning | ECR’s built-in scanner (CVE feeds) | On push (if enabled) or on demand | OS packages | Free |
| Enhanced scanning | Amazon Inspector | On push and continuously as new CVEs appear | OS and programming-language packages (e.g. npm, pip, Maven) | Charged per image/scan via Inspector |

Basic scanning is a sensible free baseline; enhanced scanning is what you want for production because it keeps re-evaluating images already in the registry as the threat landscape changes — an image that was clean last month may carry a critical CVE today. Findings are surfaced in the console, via the API, and (for enhanced) in Amazon Inspector and EventBridge, so you can alert or block on severity.

Lifecycle policies: automated cleanup

Without housekeeping, repositories accumulate hundreds of old images and quietly run up storage cost. A lifecycle policy is a set of rules that expire images automatically based on age or count.

A policy is JSON with prioritised rules; each rule selects images by tag status (tagged with given prefixes, untagged, or any) and a count- or age-based condition. Example — keep the 10 newest prod-tagged images and delete untagged images older than 14 days:

{
  "rules": [
    {
      "rulePriority": 1,
      "description": "Keep last 10 prod images",
      "selection": {
        "tagStatus": "tagged",
        "tagPrefixList": ["prod"],
        "countType": "imageCountMoreThan",
        "countNumber": 10
      },
      "action": { "type": "expire" }
    },
    {
      "rulePriority": 2,
      "description": "Expire untagged after 14 days",
      "selection": {
        "tagStatus": "untagged",
        "countType": "sinceImagePushed",
        "countUnit": "days",
        "countNumber": 14
      },
      "action": { "type": "expire" }
    }
  ]
}

Rules evaluate in priority order (lower number first), and expire is currently the only action. Untagged images pile up every time you overwrite a mutable tag, so a “delete untagged after N days” rule is almost always worth having. Note expiry is permanent — there is no recycle bin — so scope your selections carefully and never let a rule match an image a running service still references.
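
To see what a count-based rule does to a repository, the sketch below models imageCountMoreThan in isolation. It assumes a hypothetical helper, expire_beyond_count, that reads image identifiers newest first; it ignores tag prefixes and rule priorities, so treat it as a simplified picture rather than ECR's actual evaluator.

```shell
# expire_beyond_count is a hypothetical helper for illustration only.
# It reads image identifiers on stdin, newest first, and prints the ones
# that fall outside the newest <keep> images (the ones a rule would expire).
expire_beyond_count() {
  keep=$1
  tail -n +"$((keep + 1))"
}

# Five prod images, newest first; keeping 3 leaves the two oldest to expire.
printf '%s\n' prod-5 prod-4 prod-3 prod-2 prod-1 | expire_beyond_count 3
```

Running a similar dry run against a real repository is worth doing before applying any policy, precisely because expiry cannot be undone.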


Part 2 — Amazon ECS (the orchestrator)

The cluster

A cluster is a logical grouping of your tasks and services and the capacity boundary they share. What “capacity” means depends on the launch model:

A cluster also carries cluster-wide settings such as Container Insights (enhanced CloudWatch metrics/logs) and a default capacity-provider strategy. You can run many services and hundreds of tasks in one cluster; clusters are free — you pay only for the compute (Fargate vCPU/GB-seconds, or the EC2 instances).

The task definition: every field

The task definition is the heart of ECS — the JSON blueprint ECS uses to launch tasks. It is organised as a family (a name) with auto-incrementing revisions (my-app:7); registering a change creates a new revision, and you deploy by pointing a service at it. The fields below are grouped as you encounter them.

Task-level settings

| Field | What it is | Choices / range | Notes & gotchas |
| --- | --- | --- | --- |
| family | The task definition name; revisions increment under it | Free text | You deploy a family; ECS tracks family:revision |
| requiresCompatibilities | Which launch types this definition supports | FARGATE, EC2, EXTERNAL | Determines which fields are valid (Fargate forbids some, e.g. host network mode) |
| networkMode | How task networking works | awsvpc, bridge, host, none | Fargate requires awsvpc; see the network-mode section below |
| cpu / memory (task level) | The CPU/memory envelope for the whole task | See Fargate matrix below | On Fargate these are required and must be a valid pair; on EC2 they are optional caps |
| taskRoleArn | The task role: the IAM role your application code assumes to call AWS APIs | An IAM role ARN | This is what your container uses to reach S3, DynamoDB, etc.; apply least privilege here |
| executionRoleArn | The execution role: the IAM role the ECS agent uses to pull the image and write logs | An IAM role ARN | Needs ECR pull + CloudWatch Logs + (if used) Secrets Manager/SSM read |
| runtimePlatform | OS and CPU architecture | LINUX/WINDOWS; X86_64/ARM64 | Use ARM64 (Graviton) on Fargate for ~20% lower cost when your image supports it |
| volumes | Task-level volume definitions containers can mount | bind mounts, Docker volumes, EFS, FSx (EC2), Fargate ephemeral | The storage layer; see volumes below |
| placementConstraints | Rules restricting where (EC2) tasks land | e.g. memberOf an attribute expression | EC2 launch type only; ignored on Fargate |
| ephemeralStorage | Size of Fargate scratch storage | 21–200 GiB when set explicitly (Fargate) | 20 GiB is included by default; raise it for large temp data |
| pidMode / ipcMode | Share PID/IPC namespaces across containers | task/host/none | Advanced; host not allowed on Fargate |
| tags / proxyConfiguration | Metadata and App Mesh/Service-Connect proxy wiring | | proxyConfiguration is used by App Mesh; Service Connect manages its own proxy |

The Fargate CPU/memory matrix

Fargate does not accept arbitrary CPU/memory — only specific combinations, and the valid memory range is constrained by the CPU value. Memorise the shape (the exact upper bounds have grown over time; these are the widely supported tiers):

| cpu (vCPU) | Valid memory range |
| --- | --- |
| 256 (.25 vCPU) | 512, 1024, 2048 MiB |
| 512 (.5 vCPU) | 1024–4096 MiB (1 GiB steps) |
| 1024 (1 vCPU) | 2048–8192 MiB (1 GiB steps) |
| 2048 (2 vCPU) | 4096–16384 MiB (1 GiB steps) |
| 4096 (4 vCPU) | 8192–30720 MiB (1 GiB steps) |
| 8192 (8 vCPU) | 16384–61440 MiB (4 GiB steps) |
| 16384 (16 vCPU) | 32768–122880 MiB (8 GiB steps) |

The whole task shares this budget. If you run a sidecar (a log router or proxy), it draws from the same pool — size the task for the sum, then optionally cap each container with container-level cpu/memory.
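
Because an invalid pair is only rejected when the task definition is registered, it can help to check the pair earlier, for instance in CI. The sketch below encodes the tiers listed in the matrix; fargate_pair_ok is a hypothetical helper name, and the tiers are the ones given in this lesson, which may lag behind what AWS currently accepts.

```shell
# fargate_pair_ok is a hypothetical helper. Arguments: cpu units, memory in MiB.
# Returns 0 when the pair appears in the matrix above, non-zero otherwise.
fargate_pair_ok() {
  cpu=$1
  mem=$2
  case $cpu in
    256)
      # The smallest tier allows three discrete values rather than a stepped range.
      case $mem in
        512|1024|2048) return 0 ;;
        *) return 1 ;;
      esac
      ;;
    512)   min=1024  max=4096   step=1024 ;;
    1024)  min=2048  max=8192   step=1024 ;;
    2048)  min=4096  max=16384  step=1024 ;;
    4096)  min=8192  max=30720  step=1024 ;;
    8192)  min=16384 max=61440  step=4096 ;;
    16384) min=32768 max=122880 step=8192 ;;
    *) return 1 ;;
  esac
  # Inside the range and aligned to the tier's step size.
  [ "$mem" -ge "$min" ] && [ "$mem" -le "$max" ] && [ $(( (mem - min) % step )) -eq 0 ]
}

fargate_pair_ok 256 512  && echo "256/512 is valid"
fargate_pair_ok 256 4096 || echo "256/4096 is rejected"
```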

Container-level settings (the container definition)

Each entry in containerDefinitions configures one container. The important fields:

| Field | What it is | Notes & gotchas |
| --- | --- | --- |
| name | Container name (unique within the task) | Used by dependsOn, links, and load-balancer target wiring |
| image | The image URI to run | Use the full ECR URI; reference by digest for immutability |
| cpu (container) | Soft/relative CPU share for this container | Optional sub-allocation of the task cpu |
| memory (hard limit) | Container is killed if it exceeds this | Set at least one of memory/memoryReservation; OOM-kill is a common silent failure |
| memoryReservation (soft limit) | Reserved amount; container can burst above it if the host has room (EC2) | Lets you pack more on EC2 hosts |
| essential | If true, the whole task stops when this container exits | Mark your main app essential; sidecars usually false |
| portMappings | Container ports to expose (and host ports / names) | With awsvpc, hostPort = containerPort; name them for Service Connect |
| environment | Plain-text env vars | Never put secrets here; they are visible in the definition |
| secrets | Inject values from Secrets Manager / SSM Parameter Store as env vars | The secure way to pass credentials; needs execution-role read access |
| environmentFiles | Bulk env vars from a file in S3 | Handy for many variables |
| logConfiguration | Where stdout/stderr go | awslogs (CloudWatch), awsfirelens (FireLens → anywhere), splunk, etc. |
| healthCheck | A command run inside the container to report health | Distinct from the ALB health check; controls container/task health |
| dependsOn | Ordering: start/stop this container relative to others by condition | START, COMPLETE, SUCCESS, HEALTHY (e.g. wait for a migration container) |
| command / entryPoint / workingDirectory | Override the image’s CMD/ENTRYPOINT/workdir | |
| ulimits / linuxParameters | nofile limits, capabilities, initProcessEnabled, shared memory | initProcessEnabled: true reaps zombie processes; useful for many apps |
| mountPoints / volumesFrom | Mount task volumes into this container | Pairs with task-level volumes |
| readonlyRootFilesystem | Make the container’s root FS read-only | A strong security default; write only to mounted volumes |
| user | Run as a non-root UID/GID | Avoid running as root |
| stopTimeout | Grace period after SIGTERM before SIGKILL | Give your app time to drain |

A minimal Fargate task definition for a web container, registered with aws ecs register-task-definition --cli-input-json file://taskdef.json:

{
  "family": "demo-web",
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc",
  "cpu": "256",
  "memory": "512",
  "runtimePlatform": { "cpuArchitecture": "ARM64", "operatingSystemFamily": "LINUX" },
  "executionRoleArn": "arn:aws:iam::111122223333:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::111122223333:role/demo-web-task-role",
  "containerDefinitions": [
    {
      "name": "web",
      "image": "111122223333.dkr.ecr.ap-south-1.amazonaws.com/demo-web:1.0.0",
      "essential": true,
      "portMappings": [{ "name": "http", "containerPort": 80, "protocol": "tcp" }],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/demo-web",
          "awslogs-region": "ap-south-1",
          "awslogs-stream-prefix": "web",
          "awslogs-create-group": "true"
        }
      },
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost/ || exit 1"],
        "interval": 30, "timeout": 5, "retries": 3, "startPeriod": 10
      }
    }
  ]
}

Network modes explained

The networkMode decides how a task’s containers get networking — and it is a favourite interview topic.

| Mode | How it works | Who can use it | When to pick it | Trade-off / gotcha |
| --- | --- | --- | --- | --- |
| awsvpc | Each task gets its own elastic network interface (ENI) with a private IP in your subnet and its own security group | Fargate (required) and EC2 | Almost always: first-class VPC networking, per-task security groups, works with ALB IP targets | On EC2, each task consumes an ENI; instance types cap ENIs per host (mitigated by ENI trunking) |
| bridge | Docker’s default virtual bridge on the host; containers share the host’s network via NAT and port mappings | EC2 only | Legacy / dense packing where per-task ENIs are not needed | Needs dynamic host ports + ALB to avoid port clashes; no per-task security group |
| host | Containers bind directly to the host’s network interface and ports | EC2 only | Maximum network performance; ports must be unique per host | No port remapping, so only one task per host can use a given port; no per-task SG |
| none | No external networking | EC2 only | Batch/compute that needs no network | Container cannot reach the network |

For essentially all new work, use awsvpc: it gives every task a real VPC IP and its own security group, integrates cleanly with the ALB (IP target type), and is the only mode Fargate supports. Read the production companion lesson for ENI/IP-address planning under awsvpc, which is where large fleets actually hit limits: Production Amazon ECS on Fargate: task networking, auto scaling, and safe rolling deployments.

Task role vs execution role — the classic confusion

ECS uses two IAM roles and they are constantly mixed up. The distinction is who uses the role and for what.

| Role | Used by | Used for | Typical permissions |
| --- | --- | --- | --- |
| Execution role (executionRoleArn) | The ECS agent / Fargate infrastructure (before/around your container) | Pulling the image from ECR, writing logs to CloudWatch, and fetching secrets referenced in the task definition | AmazonECSTaskExecutionRolePolicy (ECR read + Logs) plus secretsmanager:GetSecretValue / ssm:GetParameters if you inject secrets |
| Task role (taskRoleArn) | Your application code inside the container | Calling AWS APIs your app needs at runtime (read S3, write DynamoDB, publish to SNS…) | Exactly the least-privilege set your app requires, and nothing more |

The mnemonic: the execution role gets the task running (pull, log, secrets); the task role is what the running app can do. A task that fails to start with a “CannotPullContainerError” or “unable to retrieve secret” almost always has an execution-role problem; an AccessDenied from inside your code at runtime is a task-role problem.

Volumes and storage

Containers are ephemeral; for files that must persist or be shared you attach a volume defined at the task level and mounted into containers via mountPoints.

| Volume type | What it is | Lifetime | Use it for |
| --- | --- | --- | --- |
| Bind mount | A path on the host (EC2) or task-scoped scratch (Fargate) | Task lifetime | Sharing files between containers in the same task (e.g. a sidecar reading the app’s logs) |
| Docker volume | A Docker-managed volume (EC2 only) | Task or instance | Local persistence on a container instance |
| Amazon EFS | A shared, elastic NFS file system mounted into the task | Independent of the task (durable & shared) | Shared state across tasks/AZs; persistent data on Fargate |
| Amazon FSx for Windows | Windows shared file storage | Independent | Windows workloads (EC2) |
| Fargate ephemeral storage | The task’s scratch disk (20–200 GiB) | Task lifetime | Temp files, caches; not durable |

For anything that must survive a task replacement or be shared between tasks, use EFS — it is the durable, multi-AZ option and works on Fargate. Fargate’s own disk is wiped when the task stops.

Secrets, logging, and health checks

Task vs service (revisited, with settings)

You can run a task definition two ways:

Key service settings:

| Setting | What it does | Notes |
| --- | --- | --- |
| desiredCount | How many tasks the service keeps running | The number auto scaling adjusts |
| launchType / capacityProviderStrategy | Where tasks run | FARGATE, EC2, or a capacity-provider mix (e.g. Fargate Spot weighting) |
| deploymentConfiguration | Rolling-deploy bounds + circuit breaker | minimumHealthyPercent, maximumPercent, deploymentCircuitBreaker (see below) |
| deploymentController | Which deployment engine | ECS (rolling), CODE_DEPLOY (blue/green), EXTERNAL |
| loadBalancers | Target group(s) + container/port to register | Wires the service to an ALB/NLB |
| healthCheckGracePeriodSeconds | Ignore LB health checks for new tasks for N seconds after start | Stops slow-booting apps being killed before they are ready |
| placementStrategy / placementConstraints | How (EC2) tasks spread/bin-pack | spread across AZ, binpack, random |
| serviceConnectConfiguration / serviceRegistries | Service Connect or Cloud Map discovery | See discovery section |
| enableExecuteCommand | Allow aws ecs execute-command (ECS Exec) into a running task | The container-shell debugging path; needs SSM + task-role perms |
| propagateTags / enableECSManagedTags | Tag propagation from definition/service to tasks | Cost allocation |

Service scheduler strategies: REPLICA (the default — maintain N copies, spread across AZs) or DAEMON (run exactly one task per active container instance — EC2 only — for node agents like log shippers or monitoring).

Launch types: Fargate vs EC2

The single biggest architectural choice for an ECS workload is where the tasks run.

| Dimension | Fargate (serverless) | EC2 (self-managed hosts) |
| --- | --- | --- |
| Who owns the host | AWS: no instances to see, patch, or scale | You: an Auto Scaling group of container instances you patch and right-size |
| Pricing model | Per-task vCPU + memory per second (1-minute minimum) | Per EC2 instance running, regardless of how full it is |
| Operational overhead | Minimal: pick CPU/memory and go | You manage AMIs, the ECS agent, scaling, bin-packing |
| Scaling speed | Fast; no host warm-up | Must also scale the instance fleet (mitigate with capacity-provider managed scaling / warm pools) |
| Density / cost at scale | Can be pricier for steady, packable, high-utilisation fleets | Cheaper when you keep instances well utilised (good bin-packing) |
| GPU / special hardware | Limited | Full access: GPU instances, specific families, larger sizes |
| Spot | FARGATE_SPOT | EC2 Spot in the ASG / capacity provider |
| Network mode | awsvpc only | awsvpc, bridge, host, none |
| Daemon tasks | Not applicable | DAEMON scheduling supported |

Rule of thumb: start on Fargate. It removes an entire layer of undifferentiated operational work (patching, scaling, bin-packing) and is the right default for most services. Move workloads to EC2 when you have a clear reason: steady, high-utilisation fleets where careful bin-packing beats per-task pricing; GPU or specialised instance needs; daemon workloads; or per-host customisation Fargate does not allow. Many teams run both in one cluster by giving each service its own capacity-provider strategy (e.g. baseline on FARGATE with burst on FARGATE_SPOT for one service, an Auto Scaling group provider for another). Note that a single strategy cannot mix Fargate and Auto Scaling group capacity providers.

Deployment types

When you deploy a new task-definition revision to a service, the deployment controller decides how traffic shifts from old tasks to new.

| Type | Controller | How it works | Rollback | When to use |
| --- | --- | --- | --- | --- |
| Rolling update | ECS (default) | ECS gradually replaces old tasks with new ones in place, bounded by minimumHealthyPercent / maximumPercent | Automatic via the circuit breaker (or manual redeploy) | The default; simplest; in-place, no extra infrastructure |
| Blue/green | CODE_DEPLOY | CodeDeploy stands up a parallel (“green”) task set, shifts ALB traffic to it (all-at-once, canary, or linear), then tears down “blue” | Instant: shift traffic back to blue | Zero-downtime releases with pre-traffic validation and instant rollback |
| External | EXTERNAL | You manage task sets and traffic shifting yourself via the API (or a third-party tool) | You implement it | Custom deployment tooling / advanced control |

Rolling updates and the deployment circuit breaker

A rolling deployment is governed by two percentages of the desired count:

With min=100, max=200 ECS does a classic surge: spin up replacements, wait for them to pass health checks, then drain and stop the old ones — no capacity dip. Tighter bounds (e.g. min=50, max=100) trade capacity headroom for fewer concurrent tasks.
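
The arithmetic behind those bounds is easy to check by hand. The sketch below assumes a hypothetical helper, rolling_bounds, that turns a desired count plus minimumHealthyPercent and maximumPercent into the lowest and highest number of tasks ECS may run during the deployment; it relies on the documented rounding (the minimum rounds up, the maximum rounds down) and ignores everything else the scheduler considers.

```shell
# rolling_bounds is a hypothetical helper for illustration.
# Arguments: desired count, minimumHealthyPercent, maximumPercent.
# Prints "<floor> <ceiling>": the task-count range allowed mid-deployment.
rolling_bounds() {
  desired=$1
  min_pct=$2
  max_pct=$3
  # minimumHealthyPercent is rounded up to the nearest whole task.
  floor_tasks=$(( (desired * min_pct + 99) / 100 ))
  # maximumPercent is rounded down to the nearest whole task.
  ceiling_tasks=$(( desired * max_pct / 100 ))
  echo "$floor_tasks $ceiling_tasks"
}

rolling_bounds 4 100 200   # surge: never below 4, up to 8 during the rollout
rolling_bounds 4 50 100    # in place: may dip to 2, never above 4
rolling_bounds 3 50 150    # odd counts show the rounding: between 2 and 4
```

If the floor and ceiling come out equal to the desired count (for example min=100, max=100), ECS has no room to start or stop anything and the deployment cannot progress.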

The deployment circuit breaker (deploymentCircuitBreaker: { enable: true, rollback: true }) is the safety net: if too many new tasks fail to start or stay healthy, ECS marks the deployment failed and (with rollback: true) automatically rolls back to the last known-good revision — instead of retrying a broken image forever. Always enable it for production rolling deployments. For richer release strategies (canary/linear traffic shifting with validation hooks), use blue/green via CodeDeploy.

ALB integration

Most ECS web services sit behind an Application Load Balancer. You attach the service to a target group, name the container and port to register, and ECS keeps the target group’s membership in sync as tasks come and go.

ECS also supports the Network Load Balancer (TCP/UDP, ultra-low latency, static IP) for non-HTTP workloads. Path/host-based routing, TLS termination, and listener rules all live on the ALB exactly as they do for EC2 targets.

Service auto scaling

A service scales its desired count automatically via Application Auto Scaling (the same engine behind DynamoDB and Aurora autoscaling), using three policy types:

| Policy | How it works | Best for |
| --- | --- | --- |
| Target tracking | Keep a metric at a target (e.g. average CPU at 60%, or ALBRequestCountPerTarget at 1000) | The default: set a goal, AWS adds/removes tasks to hold it |
| Step scaling | Add/remove a step of tasks when an alarm breaches by a range | Fine-grained control over scale increments |
| Scheduled scaling | Change min/max/desired on a schedule (cron) | Predictable daily/weekly patterns (scale up before business hours) |

Target tracking on CPU utilisation or ALB request count per target covers most web services. You set a minimum and maximum capacity (the bounds auto scaling stays within) and the scaling policy adjusts desired count between them. The companion production lesson covers tuning cooldowns, combining policies, and scaling the EC2 capacity layer underneath.
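
To build intuition for how the target and the capacity bounds interact, the sketch below uses a simple proportional model: scale the current count by the ratio of the observed metric to the target, then clamp to the minimum and maximum. This is an assumption made for illustration, not the actual Application Auto Scaling algorithm, and next_desired_count is a hypothetical helper name.

```shell
# next_desired_count is a hypothetical helper, not an AWS API.
# Arguments: current count, observed metric, target metric, min capacity, max capacity.
next_desired_count() {
  current=$1
  metric=$2
  target=$3
  min=$4
  max=$5
  # ceil(current * metric / target) using integer arithmetic.
  want=$(( (current * metric + target - 1) / target ))
  # Auto scaling never leaves the configured capacity bounds.
  if [ "$want" -lt "$min" ]; then want=$min; fi
  if [ "$want" -gt "$max" ]; then want=$max; fi
  echo "$want"
}

next_desired_count 4 90 60 2 10   # CPU at 90% against a 60% target: scale out
next_desired_count 4 90 60 2 5    # same load, but the maximum caps the result
next_desired_count 4 15 60 2 10   # light load: scale in, but not below the minimum
```

The real service also applies cooldowns and scales in more cautiously than it scales out, which this model leaves out.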

Service discovery and ECS Service Connect

When services need to call each other (not the public internet), they need a stable name to resolve. ECS offers two mechanisms.

| Mechanism | What it is | How callers reach a service | Adds |
| --- | --- | --- | --- |
| Service discovery (Cloud Map) | ECS registers each task in AWS Cloud Map, which creates DNS records (e.g. web.internal) | DNS resolution to task IPs | Simple name → IP; DNS-based, so client-side caching and no built-in retries/metrics |
| ECS Service Connect | A managed proxy sidecar (Envoy) injected into tasks; services talk via logical names with the proxy handling routing | A logical endpoint name; the proxy load-balances per request | Per-request load balancing, retries, connection draining, and built-in traffic metrics; no DNS-caching pitfalls |

Service discovery (Cloud Map) is the lightweight, DNS-based option. Service Connect is the newer, richer option: it gives you client-side load balancing, automatic retries, health-aware routing, and telemetry without an ALB between every pair of services, and it sidesteps the stale-DNS problems that plague raw DNS discovery. For internal service-to-service traffic at any scale, Service Connect is usually the better choice; reserve the ALB for north-south (ingress) traffic. The dedicated comparison — when to reach for Service Connect, Cloud Map, or an internal load balancer — is in ECS Service Connect deep dive: service discovery, traffic resilience, and migrating off ALBs.

ECS vs EKS

Both run containers on AWS; the difference is the control plane and ecosystem.

| Dimension | Amazon ECS | Amazon EKS (managed Kubernetes) |
| --- | --- | --- |
| Orchestrator | AWS-proprietary, fully managed; no control plane to run | Kubernetes, the open standard; AWS manages the control plane |
| Learning curve | Low: a handful of concepts (task def, task, service, cluster) | High: pods, deployments, services, ingress, RBAC, CRDs, the whole K8s surface |
| Control-plane cost | None (you pay only for compute) | A per-cluster hourly charge plus compute |
| Portability | AWS-only | Portable: same manifests run on any Kubernetes (other clouds, on-prem) |
| Ecosystem | Tight AWS integration (IAM, ALB, CloudWatch) out of the box | Vast CNCF ecosystem (Helm, Operators, Istio, Argo, Karpenter…) |
| Compute | Fargate or EC2 | Fargate, EC2 managed node groups, or Karpenter |
| Best when | You want the simplest path to production containers on AWS and are happy being AWS-native | You need Kubernetes specifically: portability, an existing K8s investment/skills, or the CNCF ecosystem |

The short answer: choose ECS when you want to ship containers on AWS with the least operational and conceptual overhead and have no specific need for Kubernetes; choose EKS when Kubernetes itself is a requirement — for multi-cloud portability, an existing Kubernetes platform/skill set, or the rich CNCF tooling. ECS is “the AWS way”; EKS is “Kubernetes, managed by AWS”. Neither is universally better — they optimise for different priorities.

Amazon ECS & ECR fundamentals

The diagram traces the full path: you build an image and push it to an ECR repository; a task definition references that image; a service in a cluster launches the desired number of tasks (on Fargate or EC2); an ALB routes inbound traffic to the tasks; service auto scaling adjusts the count; and Service Connect / Cloud Map lets services find each other.

Hands-on lab

You will build a tiny container image, push it to ECR, run it as an ECS service on Fargate, verify it, and clean everything up. The cost is minimal for the brief time it runs, though Fargate usage may not be covered by your account’s Free Tier allowance and tasks bill per second while running, so do the cleanup promptly. Region used: ap-south-1 (Mumbai); substitute your own.

Prerequisites: Docker running locally, the AWS CLI configured, and a default VPC with public subnets (each Region has one unless it was deleted). Set helper variables:

export AWS_REGION=ap-south-1
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export REPO=demo-web
export CLUSTER=demo-cluster

1. Create the ECR repository (immutable, scan on push)

aws ecr create-repository \
  --repository-name "$REPO" \
  --image-tag-mutability IMMUTABLE \
  --image-scanning-configuration scanOnPush=true \
  --region "$AWS_REGION"

2. Build and push a minimal image

Create a one-file site and a Dockerfile in an empty directory:

mkdir demo-web && cd demo-web
echo '<h1>Hello from ECS on Fargate</h1>' > index.html
printf 'FROM public.ecr.aws/nginx/nginx:stable\nCOPY index.html /usr/share/nginx/html/index.html\n' > Dockerfile

aws ecr get-login-password --region "$AWS_REGION" \
  | docker login --username AWS --password-stdin "$ACCOUNT.dkr.ecr.$AWS_REGION.amazonaws.com"

docker build -t "$ACCOUNT.dkr.ecr.$AWS_REGION.amazonaws.com/$REPO:1.0.0" .
docker push "$ACCOUNT.dkr.ecr.$AWS_REGION.amazonaws.com/$REPO:1.0.0"

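Expected: the push reports each layer as Pushed and prints a final digest. Confirm the tag and check the scan status (it reads IN_PROGRESS until the scan-on-push finishes):

aws ecr describe-images --repository-name "$REPO" --region "$AWS_REGION" \
  --query 'imageDetails[].{tags:imageTags,scan:imageScanStatus.status}'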

3. Create the cluster

aws ecs create-cluster --cluster-name "$CLUSTER" --region "$AWS_REGION"

4. Ensure the execution role exists

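If you have launched ECS tasks from the console before, the account probably already has ecsTaskExecutionRole. If not, create it with the trust policy for ECS tasks and attach the managed policy (the create-role call is skipped harmlessly when the role exists):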

cat > trust.json <<'EOF'
{ "Version": "2012-10-17", "Statement": [{ "Effect": "Allow",
  "Principal": { "Service": "ecs-tasks.amazonaws.com" }, "Action": "sts:AssumeRole" }] }
EOF
aws iam create-role --role-name ecsTaskExecutionRole \
  --assume-role-policy-document file://trust.json 2>/dev/null || true
aws iam attach-role-policy --role-name ecsTaskExecutionRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy

5. Register the task definition

Write taskdef.json, then register it. The heredoc is unquoted, so the shell substitutes $ACCOUNT, $AWS_REGION, and $REPO for you:

cat > taskdef.json <<EOF
{
  "family": "demo-web",
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc",
  "cpu": "256",
  "memory": "512",
  "executionRoleArn": "arn:aws:iam::$ACCOUNT:role/ecsTaskExecutionRole",
  "containerDefinitions": [
    {
      "name": "web",
      "image": "$ACCOUNT.dkr.ecr.$AWS_REGION.amazonaws.com/$REPO:1.0.0",
      "essential": true,
      "portMappings": [{ "containerPort": 80, "protocol": "tcp" }],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/demo-web",
          "awslogs-region": "$AWS_REGION",
          "awslogs-stream-prefix": "web",
          "awslogs-create-group": "true"
        }
      }
    }
  ]
}
EOF
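# AmazonECSTaskExecutionRolePolicy does not grant logs:CreateLogGroup, so
# awslogs-create-group alone cannot create the group; create it up front.
aws logs create-log-group --log-group-name /ecs/demo-web --region "$AWS_REGION" 2>/dev/null || true
aws ecs register-task-definition --cli-input-json file://taskdef.json --region "$AWS_REGION"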

6. Run it as a service on Fargate

Grab a default-VPC subnet and a security group, then create a one-task service with a public IP (for the lab; production tasks live in private subnets behind an ALB):

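SUBNET=$(aws ec2 describe-subnets --region "$AWS_REGION" \
  --filters "Name=default-for-az,Values=true" --query 'Subnets[0].SubnetId' --output text)
# Look up the subnet's VPC so the security group is taken from the same VPC
VPC=$(aws ec2 describe-subnets --region "$AWS_REGION" --subnet-ids "$SUBNET" \
  --query 'Subnets[0].VpcId' --output text)
SG=$(aws ec2 describe-security-groups --region "$AWS_REGION" \
  --filters "Name=group-name,Values=default" "Name=vpc-id,Values=$VPC" \
  --query 'SecurityGroups[0].GroupId' --output text)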

aws ecs create-service \
  --cluster "$CLUSTER" \
  --service-name demo-web-svc \
  --task-definition demo-web \
  --desired-count 1 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[$SUBNET],securityGroups=[$SG],assignPublicIp=ENABLED}" \
  --deployment-configuration "deploymentCircuitBreaker={enable=true,rollback=true},minimumHealthyPercent=100,maximumPercent=200" \
  --region "$AWS_REGION"

The default security group allows all traffic within itself but not from the internet. To actually browse the page, add an inbound rule for TCP 80 from your IP to $SG. For a quick functional check, the task reaching RUNNING is enough.
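The inbound rule described above can be added from the CLI. This is a minimal sketch that assumes the $SG and $AWS_REGION variables from the earlier steps; the helper name open_http and the use of https://checkip.amazonaws.com to discover your public IP are illustrative choices, not part of the lab itself.

```shell
# Hypothetical helper: allow inbound TCP 80 to a security group from one IPv4 address.
# Usage: open_http <security-group-id> <ipv4-address>
open_http() {
  # The /32 suffix restricts the rule to exactly this address.
  aws ec2 authorize-security-group-ingress \
    --group-id "$1" \
    --protocol tcp --port 80 \
    --cidr "$2/32" \
    --region "$AWS_REGION"
}

# Example call (commented out so that pasting the block changes nothing):
# open_http "$SG" "$(curl -s https://checkip.amazonaws.com)"
```

Because the default security group is shared, it is worth removing the rule again during cleanup with aws ec2 revoke-security-group-ingress and the same arguments.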

7. Validate

aws ecs describe-services --cluster "$CLUSTER" --services demo-web-svc \
  --region "$AWS_REGION" --query 'services[0].{desired:desiredCount,running:runningCount,status:status}'

Expected once stable: running equals desired (1) and status is ACTIVE. List the task and read its public IP:

TASK=$(aws ecs list-tasks --cluster "$CLUSTER" --service-name demo-web-svc \
  --region "$AWS_REGION" --query 'taskArns[0]' --output text)
ENI=$(aws ecs describe-tasks --cluster "$CLUSTER" --tasks "$TASK" --region "$AWS_REGION" \
  --query "tasks[0].attachments[0].details[?name=='networkInterfaceId'].value" --output text)
aws ec2 describe-network-interfaces --network-interface-ids "$ENI" --region "$AWS_REGION" \
  --query 'NetworkInterfaces[0].Association.PublicIp' --output text

If you opened port 80, curl http://<that-ip>/ returns the Hello from ECS on Fargate page.
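If the running count stays at 0 instead, the stopped task usually says why. The sketch below looks up one stopped task for the service and prints its stop code and reasons; the function name why_stopped is hypothetical, and it assumes $AWS_REGION is set as in the lab. ECS only keeps stopped tasks visible for a limited time, so run it soon after a failure.

```shell
# Hypothetical helper: show why a stopped task in a service ended.
# Usage: why_stopped <cluster> <service>
why_stopped() {
  task_arn=$(aws ecs list-tasks --cluster "$1" --service-name "$2" \
    --desired-status STOPPED --region "$AWS_REGION" \
    --query 'taskArns[0]' --output text)
  # With --output text an empty list is rendered as the string "None".
  if [ -z "$task_arn" ] || [ "$task_arn" = "None" ]; then
    echo "no stopped tasks found"
    return 0
  fi
  aws ecs describe-tasks --cluster "$1" --tasks "$task_arn" --region "$AWS_REGION" \
    --query 'tasks[0].{stopCode:stopCode,reason:stoppedReason,container:containers[0].reason}' \
    --output json
}

# Example call (commented out because it needs AWS credentials):
# why_stopped "$CLUSTER" demo-web-svc
```

The stoppedReason text is where errors such as CannotPullContainerError or ResourceInitializationError show up.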

Cleanup

Delete in order — service, cluster, repository, logs — so nothing keeps billing:

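aws ecs update-service --cluster "$CLUSTER" --service demo-web-svc \
  --desired-count 0 --region "$AWS_REGION"
aws ecs delete-service --cluster "$CLUSTER" --service demo-web-svc \
  --force --region "$AWS_REGION"
# Wait for the service to finish draining; delete-cluster fails while tasks still run
aws ecs wait services-inactive --cluster "$CLUSTER" --services demo-web-svc \
  --region "$AWS_REGION"
aws ecs delete-cluster --cluster "$CLUSTER" --region "$AWS_REGION"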
aws ecr delete-repository --repository-name "$REPO" --force --region "$AWS_REGION"
aws logs delete-log-group --log-group-name /ecs/demo-web --region "$AWS_REGION" 2>/dev/null || true

Cost note: a 0.25 vCPU / 512 MB Fargate task (cpu 256, memory 512) costs on the order of one to two US cents per hour while running, depending on Region; run through the lab and clean up in the same session and the cost is negligible. ECR storage is billed per GB-month (the lab image is tens of MB), and the first 500 MB-month of private storage is free under the Free Tier; basic scanning and CloudWatch Logs ingestion for this tiny workload are effectively free. The cluster itself costs nothing; you pay only for running tasks and stored data.

Common mistakes & troubleshooting

Symptom | Likely cause | Fix
Task stuck PENDING/STOPPED with CannotPullContainerError | Execution role lacks ECR permissions, or the task in a private subnet has no route to ECR | Attach AmazonECSTaskExecutionRolePolicy; give the subnet a NAT gateway or ECR + S3 VPC endpoints
Task stops immediately with ResourceInitializationError: unable to retrieve secret | Execution role can’t read the Secrets Manager/SSM value (or no network path to the service) | Grant secretsmanager:GetSecretValue/ssm:GetParameters to the execution role; add VPC endpoints if private
App gets AccessDenied calling S3/DynamoDB at runtime | Wrong role: the permission is on the execution role, not the task role | Put the app’s API permissions on the task role (taskRoleArn)
Invalid CPU/memory combination on register | Fargate only accepts specific CPU/memory pairs | Match the Fargate CPU/memory matrix (e.g. cpu 256 → memory 512/1024/2048)
Deploy never finishes; tasks cycle PROVISIONING → STOPPED | New tasks fail health checks (bad image, wrong port, slow boot) | Enable the deployment circuit breaker (auto-rollback); fix the health-check path/port; set healthCheckGracePeriodSeconds
Healthy tasks but 502/503 from the ALB | Target-group container/port mismatch, or SG blocks the ALB→task path | Register the correct container name + port; allow the ALB SG to reach the task SG on the container port
bridge-mode tasks fail to place (“ports already in use”) | Static host ports clash on the EC2 host | Use dynamic host ports (hostPort: 0) with an ALB, or switch to awsvpc
Untagged images and storage cost creeping up | Mutable tags overwrite, orphaning old images; no cleanup | Use immutable tags and an ECR lifecycle policy to expire untagged/old images
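The "Invalid CPU/memory combination" row is easy to catch before you register anything. The function below is a sketch of the Fargate task-size matrix as I understand it at the time of writing (CPU in units, memory in MiB); the name fargate_size_ok is hypothetical, and the ranges are worth checking against the current AWS documentation before relying on them.

```shell
# Hypothetical helper: exit status 0 if <cpu> (CPU units) and <memory> (MiB)
# form a valid Fargate task size, non-zero otherwise.
# Usage: fargate_size_ok <cpu> <memory>
fargate_size_ok() {
  cpu=$1
  mem=$2
  case "$cpu" in
    # 0.25 vCPU accepts exactly three memory values.
    256)   case "$mem" in 512|1024|2048) return 0 ;; *) return 1 ;; esac ;;
    # Larger sizes accept a range in fixed increments.
    512)   min=1024;  max=4096;   step=1024 ;;
    1024)  min=2048;  max=8192;   step=1024 ;;
    2048)  min=4096;  max=16384;  step=1024 ;;
    4096)  min=8192;  max=30720;  step=1024 ;;
    8192)  min=16384; max=61440;  step=4096 ;;
    16384) min=32768; max=122880; step=8192 ;;
    *)     return 1 ;;
  esac
  [ "$mem" -ge "$min" ] && [ "$mem" -le "$max" ] && [ $(( (mem - min) % step )) -eq 0 ]
}

# Example: guard a registration step on the sizes used in the lab.
if fargate_size_ok 256 512; then echo "256/512 is a valid Fargate size"; fi
```

Running the check in a deploy script turns a failed register-task-definition call into an early, readable error.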

Best practices

Security notes

The container security model on ECS rests on a few pillars:

Image provenance: scan images (ideally with enhanced scanning) and use immutable tags so a deployed tag cannot be silently swapped; reference images by digest for the strongest guarantee.
Least-privilege identity: the task role should grant only the AWS APIs the app calls, and the execution role only pull/log/secrets permissions. Never reuse a broad role across services.
Secrets: keep credentials in Secrets Manager / SSM and inject them via the secrets block so they never appear in the (readable) task definition or in environment listings.
Network isolation: with awsvpc, give each service its own security group and run tasks in private subnets, reaching AWS services through VPC endpoints and the internet (if needed) through a NAT gateway.
Runtime hardening: run as a non-root user, set readonlyRootFilesystem, drop Linux capabilities you do not need, and avoid privileged mode.
Registry access: lock down repository policies; cross-account pulls should be explicit and scoped.
Auditing: ECS, ECR, and the IAM role assumptions are all logged in CloudTrail; ECS Exec sessions are auditable via SSM.
Exposure: prefer Service Connect/Cloud Map internal traffic over exposing services publicly, and put AWS WAF with an ALB (or CloudFront) in front of anything internet-facing.

Interview & exam questions

  1. What is the difference between a task and a service in ECS? A task is one running instantiation of a task definition — one or more containers running together on a host — and once it stops it is gone. A service is a controller that maintains a desired count of tasks: it replaces failures, spreads tasks across AZs, registers them with a load balancer, drives auto scaling, and orchestrates deployments. Tasks are for one-off/batch work; services are for long-lived applications.

  2. Explain the task role versus the execution role. The execution role is used by the ECS agent/Fargate infrastructure to pull the image from ECR, write logs to CloudWatch, and fetch secrets referenced in the definition. The task role is assumed by your application code at runtime to call AWS APIs (S3, DynamoDB, etc.). Execution role = get the task running; task role = what the running app can do.

  3. What are the ECS network modes, and which does Fargate require? awsvpc (each task gets its own ENI/IP and security group; required by Fargate), bridge (Docker bridge with port mapping, EC2 only), host (binds directly to the host network, EC2 only), and none. awsvpc is the recommended mode on EC2 as well, but a Linux EC2 task definition that omits networkMode still falls back to bridge.

  4. When would you choose Fargate over the EC2 launch type, and vice versa? Choose Fargate for minimal operations (no hosts to patch/scale), per-task pricing, and fast scaling — the right default for most services. Choose EC2 for steady high-utilisation fleets where bin-packing beats per-task pricing, for GPU/special instances, for daemon workloads, or for host-level customisation Fargate doesn’t allow.

  5. What does the deployment circuit breaker do? On a rolling deployment, if too many new tasks fail to start or stay healthy, the circuit breaker marks the deployment failed and (with rollback: true) automatically rolls the service back to the last healthy revision — preventing an endless loop of launching a broken image.

  6. How do minimumHealthyPercent and maximumPercent control a rolling deploy? minimumHealthyPercent is the floor of healthy tasks ECS must keep during the deploy; maximumPercent is the ceiling of total (old+new) tasks. min=100, max=200 lets ECS add new tasks before removing old ones (no capacity dip); lower values trade headroom for fewer concurrent tasks.

  7. How do blue/green deployments work on ECS, and what do they add over rolling? With the CodeDeploy controller, ECS stands up a parallel green task set, shifts ALB traffic to it (all-at-once, canary, or linear) with optional validation hooks, then retires the blue set. They add pre-traffic validation and instant rollback (shift traffic back) that in-place rolling updates lack.

  8. What is ECR tag immutability and why does it matter? With IMMUTABLE repositories, a tag cannot be overwritten once pushed. This prevents a tag like prod or latest from silently pointing at a different image — a supply-chain and rollback hazard — making deployments reproducible and auditable.

  9. What is an ECR lifecycle policy? A prioritised set of rules that automatically expire images by age or count (e.g. keep the last 10 prod images, delete untagged images older than 14 days), controlling storage cost and clutter. Expiry is permanent, so selections must avoid images that running services still reference.

  10. How does an ECS service integrate with an Application Load Balancer? The service is attached to a target group with a named container+port; ECS keeps the target group in sync as tasks start/stop. With awsvpc, the ip target type registers each task’s ENI IP directly. A health-check grace period protects slow-booting tasks, and deregistration delay drains connections on stop.

  11. What is the difference between service discovery (Cloud Map) and ECS Service Connect? Cloud Map is DNS-based discovery (names resolve to task IPs) — simple but subject to client DNS caching and with no built-in retries/metrics. Service Connect injects a managed proxy that gives per-request load balancing, retries, health-aware routing, and traffic metrics — better for internal service-to-service traffic.

  12. When would you choose ECS over EKS? Choose ECS for the simplest path to production containers on AWS with no Kubernetes overhead and tight AWS integration (and no control-plane cost). Choose EKS when you specifically need Kubernetes — portability/multi-cloud, existing K8s skills/investment, or the CNCF ecosystem (Helm, Operators, Karpenter).

  13. Basic vs enhanced image scanning in ECR? Basic uses ECR’s built-in scanner on OS packages, runs on push/on demand, and is free. Enhanced uses Amazon Inspector, covers OS and language packages, runs on push and continuously as new CVEs appear, and is billed per scan.
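To make question 9 concrete, here is a sketch of a lifecycle policy that mirrors its example: keep the last 10 images whose tags start with prod, and expire untagged images older than 14 days. The file name lifecycle.json and the prod prefix are illustrative assumptions; the aws calls are left commented out because they need credentials and the $REPO / $AWS_REGION variables from the lab.

```shell
# Write an example ECR lifecycle policy to lifecycle.json.
# Rules are evaluated in rulePriority order (lowest number first).
cat > lifecycle.json <<'EOF'
{
  "rules": [
    {
      "rulePriority": 1,
      "description": "Keep only the 10 most recent prod-tagged images",
      "selection": {
        "tagStatus": "tagged",
        "tagPrefixList": ["prod"],
        "countType": "imageCountMoreThan",
        "countNumber": 10
      },
      "action": { "type": "expire" }
    },
    {
      "rulePriority": 2,
      "description": "Expire untagged images older than 14 days",
      "selection": {
        "tagStatus": "untagged",
        "countType": "sinceImagePushed",
        "countUnit": "days",
        "countNumber": 14
      },
      "action": { "type": "expire" }
    }
  ]
}
EOF

# Preview what would be expired, then apply (commented out; both call AWS):
# aws ecr start-lifecycle-policy-preview --repository-name "$REPO" \
#   --lifecycle-policy-text file://lifecycle.json --region "$AWS_REGION"
# aws ecr put-lifecycle-policy --repository-name "$REPO" \
#   --lifecycle-policy-text file://lifecycle.json --region "$AWS_REGION"
```

Running the preview first is the safer habit, since expiry cannot be undone.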

Quick check

  1. Which IAM role pulls the image from ECR and writes the task’s logs?
  2. Which network mode is mandatory on Fargate, and what does each task get under it?
  3. What two percentages bound a rolling deployment, and what does min=100, max=200 achieve?
  4. Name the three ECS deployment controllers.
  5. What does an ECR lifecycle policy do, and is the action reversible?

Answers

  1. The execution role (executionRoleArn) — it handles image pull, CloudWatch Logs, and secret retrieval. (The task role is what your app code uses at runtime.)
  2. awsvpc — each task gets its own ENI with a private IP in your subnet and its own security group.
  3. minimumHealthyPercent (floor of healthy tasks) and maximumPercent (ceiling of total tasks). min=100, max=200 lets ECS launch new tasks before stopping old ones, so capacity never dips during the deploy.
  4. ECS (rolling update), CODE_DEPLOY (blue/green), and EXTERNAL (you manage task sets/traffic).
  5. It automatically expires images by age or count to control storage and clutter; the expire action is permanent — there is no recycle bin.
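Answer 3 becomes easier to reason about with numbers. The sketch below turns a desired count and the two percentages into task counts, assuming the rounding ECS documents (the healthy floor rounds up, the total ceiling rounds down); the function name rolling_bounds is hypothetical.

```shell
# Hypothetical helper: compute the task-count bounds of a rolling deployment.
# Usage: rolling_bounds <desiredCount> <minimumHealthyPercent> <maximumPercent>
rolling_bounds() {
  desired=$1
  min_pct=$2
  max_pct=$3
  floor=$(( (desired * min_pct + 99) / 100 ))   # healthy floor, rounded up
  ceiling=$(( desired * max_pct / 100 ))        # total ceiling, rounded down
  echo "keep>=$floor total<=$ceiling"
}

rolling_bounds 4 100 200   # prints: keep>=4 total<=8
rolling_bounds 4 50 150    # prints: keep>=2 total<=6
rolling_bounds 3 50 100    # prints: keep>=2 total<=3
```

The last case shows the trade-off: with max=100 ECS cannot add tasks, so it has to stop an old task before starting a new one and capacity dips during the deploy.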

Exercise

Take the lab service and harden it toward production:

  1. Move the service into private subnets behind an internet-facing ALB: create a target group with target type ip, attach it to the service (--load-balancers), and set a health-check grace period. Confirm the page is reachable through the ALB DNS name, not a task public IP.
  2. Add a task role granting read-only access to one S3 bucket, and prove from inside the container (via ECS Exec, aws ecs execute-command) that the app identity can list that bucket but nothing else.
  3. Inject a value from SSM Parameter Store (or Secrets Manager) using the secrets block instead of a plain-text environment variable; verify the execution role needed the read permission and that the value is not visible in the task definition.
  4. Configure target-tracking auto scaling on ALBRequestCountPerTarget (min 1, max 4); generate load and watch the desired count rise and fall.
  5. Add an ECR lifecycle policy (keep last 5 images; expire untagged after 7 days), push a few revisions, and confirm old/untagged images are reaped.
  6. Switch the service to blue/green via CodeDeploy and run a canary deploy (10% for 5 minutes, then 100%), then trigger a rollback.
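For exercise 2, one possible starting point is a small generator for the task-role policy. It is a sketch only: the function name s3_readonly_policy, the role name demo-web-task-role, and the bucket name are hypothetical, and the commented aws iam calls assume the trust.json file from step 4 of the lab.

```shell
# Hypothetical helper: print a read-only IAM policy for a single S3 bucket.
# Usage: s3_readonly_policy <bucket-name>
s3_readonly_policy() {
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    { "Effect": "Allow", "Action": "s3:ListBucket", "Resource": "arn:aws:s3:::$1" },
    { "Effect": "Allow", "Action": "s3:GetObject", "Resource": "arn:aws:s3:::$1/*" }
  ]
}
EOF
}

# Example usage (commented out; the iam calls need credentials):
# s3_readonly_policy my-example-bucket > task-policy.json
# aws iam create-role --role-name demo-web-task-role \
#   --assume-role-policy-document file://trust.json
# aws iam put-role-policy --role-name demo-web-task-role \
#   --policy-name s3-read-only --policy-document file://task-policy.json
```

ListBucket applies to the bucket ARN and GetObject to the objects under it, which is why the policy needs two resources. The role ARN then goes into taskRoleArn in the task definition, separate from executionRoleArn.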

Certification mapping

Exam | Objective area this supports
DVA-C02 (Developer – Associate) | Development & deployment with AWS services: packaging apps as containers, ECR push/pull and image management, authoring task definitions (roles, secrets, logging), running services, and rolling/blue-green deployments with rollback.
SAA-C03 (Solutions Architect – Associate) | Design resilient, cost-optimised architectures: choosing Fargate vs EC2, ECS vs EKS, awsvpc networking and ALB integration, service auto scaling, and service discovery/Service Connect for decoupled microservices.
SOA-C02 (SysOps Administrator – Associate) | Deployment, monitoring & troubleshooting: operating ECS services, Container Insights/CloudWatch logging, deployment health and circuit breaker, and diagnosing task-launch and load-balancer issues.

Glossary

Next steps

Continue the course with the Amazon CloudFront deep dive — the CDN that commonly sits in front of an ECS-backed application for global caching and TLS. Then go deeper on running containers in production:

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
