A container bundles your application and everything it needs to run into one immutable artefact, and on a laptop docker run makes that feel trivial. Production is where the real questions start: where does the image live, who is allowed to pull it, how many copies run, what happens when one dies at 3am, how does a new version roll out without dropping requests, and how do other services find it. Amazon ECS (Elastic Container Service) and Amazon ECR (Elastic Container Registry) are AWS’s answers to exactly those questions, and together they are the fastest way to take a Dockerfile to a resilient, auto-scaling, load-balanced service without ever touching Kubernetes.
ECR is the registry — a managed, private (or public) place to store and version your container images, with vulnerability scanning and automated cleanup built in. ECS is the orchestrator — it takes a task definition (a JSON blueprint of your containers) and runs the requested number of copies as tasks, keeps them healthy as a service, registers them behind a load balancer, scales them on demand, and replaces them safely on each deploy. ECS runs those tasks either on Fargate (serverless — AWS owns the host) or on EC2 instances you provide, and choosing between them is one of the decisions this lesson makes easy.
This is the exhaustive version. We will walk through ECR in full (registry types, push/pull, scanning, lifecycle policies, tag immutability), then every building block of ECS: the cluster, then the task definition field by field, then the difference between a task and a service, then launch types (Fargate vs EC2) with a decision table, then deployment types (rolling, blue/green, external) and the deployment circuit breaker, then service auto scaling, ALB integration, and service discovery / Service Connect. We finish with the question every interviewer asks, ECS vs EKS, and a hands-on lab you can run on the Free Tier. By the end you can ship a production container service on AWS and answer the certification questions about it cold.
Learning objectives
By the end of this lesson you will be able to:
- Create and manage an ECR repository — push and pull images, choose between mutable and immutable tags, enable scan-on-push (basic and enhanced), and write lifecycle policies to expire old images automatically.
- Explain the ECS object model — cluster, task definition, container definition, task, and service — and how they relate.
- Author a task definition with confidence: CPU/memory at task and container level, network mode (awsvpc vs bridge vs host vs none), the task role vs the execution role, volumes, secrets, logging, health checks, and the Fargate CPU/memory matrix.
- Distinguish a task (one running copy) from a service (a controller that maintains a desired count) and configure desired count, minimum/maximum healthy percent, and the deployment circuit breaker.
- Choose between the Fargate and EC2 launch types and justify the trade-off.
- Pick the right deployment type — rolling update, blue/green (via CodeDeploy), or external — and explain how each shifts traffic and rolls back.
- Wire a service to an Application Load Balancer and to service discovery / ECS Service Connect, and configure service auto scaling.
- Articulate when to use ECS versus EKS.
Prerequisites & where this fits
You need an AWS account, the AWS CLI configured (aws configure), Docker installed locally to build and push an image, and a working grasp of IAM (ECS uses two distinct IAM roles) and VPC basics (the awsvpc network mode gives each task its own elastic network interface in your subnets, governed by a security group). Familiarity with the Application Load Balancer helps, since that is how most ECS services receive traffic. This is a Containers lesson in the AWS Zero-to-Hero course; it builds on the EC2, VPC, and ELB deep dives and is the foundation for the production companion, Production Amazon ECS on Fargate. After this, the course moves on to the Amazon CloudFront deep dive, which covers the CDN that often sits in front of an ECS-backed application.
Core concepts: the ECS object model
Before any settings, fix the mental model. ECS has a small, clean object hierarchy, and almost every confusion in interviews comes from blurring two of these terms. Learn them precisely.
- Image — an immutable, layered package of your application and its dependencies, built from a Dockerfile and stored in a registry (ECR). Identified by a tag (myapp:1.4.2) or, immutably, by a digest (myapp@sha256:…).
- Registry / repository — the registry (ECR) is the service that stores images; a repository is a named collection of related image versions within it (one repository per application image, typically).
- Container definition — the spec for one container inside a task: its image, port mappings, environment variables, secrets, resource limits, log configuration, and health check. A task definition holds one or more of these.
- Task definition — the blueprint (a versioned JSON document, organised into a family with numbered revisions) describing one or more containers that should run together as a unit, plus task-level settings (CPU/memory, network mode, the two IAM roles, volumes, launch-type compatibility). It is a template; it does not run anything by itself.
- Task — a single running instantiation of a task definition: one or more containers running together on one host, scheduled and tracked by ECS. The unit of scheduling. When it stops, it is gone — ECS does not “restart” a task in place; it launches a fresh one.
- Service — a controller that runs and maintains a specified number of tasks (the desired count) from a task definition, replaces unhealthy ones, registers them with a load balancer, integrates with auto scaling, and orchestrates deployments. A service is to a task what an Auto Scaling group is to an EC2 instance.
- Cluster — a logical grouping (and capacity boundary) into which tasks and services are placed. With Fargate a cluster needs no servers at all; with the EC2 launch type the cluster is backed by container instances (EC2 hosts running the ECS agent).
- Launch type / capacity provider — where tasks run: Fargate (serverless) or EC2 (your instances). Capacity providers add auto-managed capacity and Spot strategies on top.
The single most important distinction to internalise now is task vs service. A task is one running copy that, once it exits, stays exited. A service is the long-running supervisor that says “I want N healthy copies at all times” and makes that true — replacing failures, balancing across Availability Zones, and rolling out new versions. You run one-off jobs as standalone tasks (or via scheduled tasks); you run long-lived applications (web APIs, workers) as services.
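The task-versus-service distinction can be made concrete with a toy model. The sketch below is not how ECS is implemented; the Task and Service classes and the reconcile method are hypothetical names invented for illustration. It only shows the behaviour described above: a stopped task stays stopped, and the supervisor restores the desired count by launching a fresh task with a new identity.

```python
from dataclasses import dataclass, field
from itertools import count

_task_ids = count(1)  # hypothetical ID generator; real ECS task IDs look different


@dataclass
class Task:
    """One running copy of a task definition; once stopped it never restarts."""
    definition: str
    task_id: int = field(default_factory=lambda: next(_task_ids))
    running: bool = True

    def stop(self) -> None:
        self.running = False


@dataclass
class Service:
    """Toy supervisor that keeps `desired_count` running tasks of one definition."""
    definition: str
    desired_count: int
    tasks: list = field(default_factory=list)

    def reconcile(self) -> int:
        """Forget stopped tasks, launch fresh ones, and return how many were launched."""
        self.tasks = [t for t in self.tasks if t.running]
        missing = max(self.desired_count - len(self.tasks), 0)  # scale-in is not modelled
        for _ in range(missing):
            self.tasks.append(Task(self.definition))
        return missing


service = Service("demo-web:1", desired_count=3)
service.reconcile()                      # initial launch of three tasks
first_ids = [t.task_id for t in service.tasks]
service.tasks[0].stop()                  # simulate a crash at 3am
launched = service.reconcile()           # the service launches a NEW task in its place
```

A standalone task corresponds to creating a Task directly with no Service around it: when it stops, nothing brings it back.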
Part 1 — Amazon ECR (the registry)
ECR registry types: private vs public
ECR stores your container images so ECS (and EKS, Lambda, or anything that speaks the Docker/OCI protocol) can pull them. There are two registry types.
| Registry | What it is | Who can pull | Auth to pull | Typical use |
|---|---|---|---|---|
| Private registry | One per account per Region; holds private repositories | Only principals you grant via IAM / repository policy | Required (token via aws ecr get-login-password) | Your application images — the default and the common case |
| Public registry (ECR Public / Amazon ECR Public Gallery) | A globally reachable registry at public.ecr.aws | Anyone, anonymously (auth only needed to push or for higher pull rate limits) | Not required to pull | Distributing images to the world (base images, open-source tools) |
For everything in this lesson we use the private registry — your application image is not something the public should pull. Each account gets one private registry per Region, addressed as <account-id>.dkr.ecr.<region>.amazonaws.com, containing as many repositories as you create.
Creating a repository: every setting
When you create a private repository (aws ecr create-repository, or console ECR → Create repository), these are the settings that matter.
| Setting | What it does | Choices | Default | When to change | Gotcha |
|---|---|---|---|---|---|
| Repository name | The repo’s name (can include a namespace path, e.g. team-a/checkout) | Free text, lowercase | — | Use a clear team/app convention | Cannot be renamed after creation — you would create a new repo and re-push |
| Tag immutability | Whether a tag, once pushed, can be overwritten | MUTABLE or IMMUTABLE | MUTABLE | Set IMMUTABLE for production | With MUTABLE, re-pushing latest (or any tag) silently moves the tag to a new image — a supply-chain and rollback hazard |
| Scan on push | Run a vulnerability scan automatically when an image is pushed | On / Off (basic), or enhanced scanning at the registry level | Off (basic) | Turn on for any image you ship | Basic scan runs once on push; enhanced (via Amazon Inspector) continuously rescans as new CVEs are published |
| Encryption | Encrypt images at rest | AES256 (Amazon S3-managed) or AWS KMS (AWS-managed or your CMK) | AES256 | Use a CMK when you need key control/audit or cross-account key policies | Encryption type is fixed at creation; switching means a new repository |
ECR is just a store; access is controlled by IAM identity policies (what a principal may do) plus an optional repository policy (a resource policy on the repo — the mechanism for cross-account pulls and for granting AWS services access). The execution role your ECS task uses must have ECR read permissions, which we cover below.
Pushing and pulling images
ECR speaks the standard Docker registry protocol, so the workflow is docker login → docker push/docker pull, with the login token obtained from the API. The canonical push flow:
```bash
ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
REGION=ap-south-1
REPO=demo-web

# 1. Authenticate Docker to your private registry (token valid 12 hours)
aws ecr get-login-password --region "$REGION" \
  | docker login --username AWS --password-stdin \
      "$ACCOUNT.dkr.ecr.$REGION.amazonaws.com"

# 2. Build and tag with the full ECR URI
#    (if the task definition targets ARM64/Graviton, build for that
#    architecture, for example by adding --platform linux/arm64)
docker build -t "$ACCOUNT.dkr.ecr.$REGION.amazonaws.com/$REPO:1.0.0" .

# 3. Push
docker push "$ACCOUNT.dkr.ecr.$REGION.amazonaws.com/$REPO:1.0.0"
```
Key facts: the login token from get-login-password is valid for 12 hours; the registry endpoint is per-account-per-Region; and you should tag images with a meaningful, unique version (a Git SHA or semantic version) rather than relying on latest. To pull (from ECS, CI, or a laptop) you authenticate the same way and docker pull the URI — but ECS does the pull for you using its execution role, so you rarely pull by hand in production.
Referencing images immutably. A tag is a movable label; a digest (@sha256:…) is the content hash and never changes. For reproducible, tamper-evident deploys, reference images by digest in your task definition (or enable tag immutability so that a pushed tag cannot be overwritten to point at a different image).
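As a small illustration of the two reference styles, the sketch below builds a private-registry image reference either by tag or by digest and checks whether a given reference is digest-pinned. The function names ecr_image_ref and is_digest_pinned are hypothetical helpers, not part of any AWS SDK; the registry address format is the one given earlier in this lesson.

```python
import re

# A digest is "sha256:" followed by 64 lowercase hexadecimal characters.
_DIGEST = re.compile(r"^sha256:[0-9a-f]{64}$")


def ecr_image_ref(account_id, region, repo, *, tag=None, digest=None):
    """Build a private ECR image reference from exactly one of tag or digest."""
    if (tag is None) == (digest is None):
        raise ValueError("pass exactly one of tag= or digest=")
    base = f"{account_id}.dkr.ecr.{region}.amazonaws.com/{repo}"
    if digest is not None:
        if not _DIGEST.match(digest):
            raise ValueError(f"not a sha256 digest: {digest!r}")
        return f"{base}@{digest}"      # immutable: names the content itself
    return f"{base}:{tag}"             # movable label unless the repo is IMMUTABLE


def is_digest_pinned(image_ref):
    """True when the reference ends in @sha256:<64 hex chars>."""
    _, separator, digest = image_ref.rpartition("@")
    return bool(separator) and bool(_DIGEST.match(digest))
```

A check like is_digest_pinned could run in CI against the image field of a task definition before it is registered, if your team decides to require digests for production.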
Image scanning: basic vs enhanced
ECR can scan images for known operating-system and language-package vulnerabilities (CVEs).
| Scan type | Engine | When it runs | Coverage | Cost |
|---|---|---|---|---|
| Basic scanning | ECR’s built-in scanner (CVE feeds) | On push (if enabled) or on demand | OS packages | Free |
| Enhanced scanning | Amazon Inspector | On push and continuously as new CVEs appear | OS and programming-language packages (e.g. npm, pip, Maven) | Charged per image/scan via Inspector |
Basic scanning is a sensible free baseline; enhanced scanning is what you want for production because it keeps re-evaluating images already in the registry as the threat landscape changes — an image that was clean last month may carry a critical CVE today. Findings are surfaced in the console, via the API, and (for enhanced) in Amazon Inspector and EventBridge, so you can alert or block on severity.
Lifecycle policies: automated cleanup
Without housekeeping, repositories accumulate hundreds of old images and quietly run up storage cost. A lifecycle policy is a set of rules that expire images automatically based on age or count.
A policy is JSON with prioritised rules; each rule selects images by tag status (tagged with given prefixes, untagged, or any) and a count- or age-based condition. Example — keep the 10 newest prod-tagged images and delete untagged images older than 14 days:
```json
{
  "rules": [
    {
      "rulePriority": 1,
      "description": "Keep last 10 prod images",
      "selection": {
        "tagStatus": "tagged",
        "tagPrefixList": ["prod"],
        "countType": "imageCountMoreThan",
        "countNumber": 10
      },
      "action": { "type": "expire" }
    },
    {
      "rulePriority": 2,
      "description": "Expire untagged after 14 days",
      "selection": {
        "tagStatus": "untagged",
        "countType": "sinceImagePushed",
        "countUnit": "days",
        "countNumber": 14
      },
      "action": { "type": "expire" }
    }
  ]
}
```
Rules evaluate in priority order (lower number first), and expire is currently the only action. Untagged images pile up every time you overwrite a mutable tag, so a “delete untagged after N days” rule is almost always worth having. Note expiry is permanent — there is no recycle bin — so scope your selections carefully and never let a rule match an image a running service still references.
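To see what the example policy would do before applying it, it can help to model it offline. The sketch below is a simplified simulation of just those two rules, not ECR's actual evaluator; the Image class and images_to_expire function are hypothetical. Because one rule selects tagged images and the other untagged ones, the two selections never overlap, so priority ordering does not affect the result here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Image:
    digest: str
    tags: tuple            # an empty tuple means the image is untagged
    pushed_at: datetime


def images_to_expire(images, now, keep_prod=10, untagged_days=14):
    """Return the digests the two example rules would expire (simplified model)."""
    expired = set()

    # Rule 1: among images with a tag starting with "prod", keep the newest `keep_prod`.
    prod = [i for i in images if any(t.startswith("prod") for t in i.tags)]
    prod.sort(key=lambda i: i.pushed_at, reverse=True)
    expired.update(i.digest for i in prod[keep_prod:])

    # Rule 2: untagged images pushed more than `untagged_days` ago.
    cutoff = now - timedelta(days=untagged_days)
    expired.update(i.digest for i in images if not i.tags and i.pushed_at < cutoff)
    return expired
```

Feeding this a listing of your repository (for example, the output of aws ecr describe-images converted into Image objects) gives a rough preview; ECR also offers its own lifecycle policy preview, which is the authoritative check.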
Part 2 — Amazon ECS (the orchestrator)
The cluster
A cluster is a logical grouping of your tasks and services and the capacity boundary they share. What “capacity” means depends on the launch model:
- With Fargate, the cluster needs no servers — you simply place Fargate tasks/services into it and AWS provisions the compute invisibly. A Fargate-only cluster costs nothing until tasks run.
- With the EC2 launch type, the cluster is backed by container instances — EC2 hosts running the ECS container agent that register themselves into the cluster and report available CPU/memory/ports. You manage that fleet (typically via an Auto Scaling group).
- Capacity providers sit on top and let ECS manage capacity: the FARGATE and FARGATE_SPOT providers for serverless (mixing on-demand and Spot by a strategy you define), or an Auto Scaling group capacity provider for EC2 with managed scaling (ECS scales the ASG to fit pending tasks) and managed termination protection.
A cluster also carries cluster-wide settings such as Container Insights (enhanced CloudWatch metrics/logs) and a default capacity-provider strategy. You can run many services and hundreds of tasks in one cluster; clusters are free — you pay only for the compute (Fargate vCPU/GB-seconds, or the EC2 instances).
The task definition: every field
The task definition is the heart of ECS — the JSON blueprint ECS uses to launch tasks. It is organised as a family (a name) with auto-incrementing revisions (my-app:7); registering a change creates a new revision, and you deploy by pointing a service at it. The fields below are grouped as you encounter them.
Task-level settings
| Field | What it is | Choices / range | Notes & gotchas |
|---|---|---|---|
| family | The task definition name; revisions increment under it | Free text | You deploy a family; ECS tracks family:revision |
| requiresCompatibilities | Which launch types this definition supports | FARGATE, EC2, EXTERNAL | Determines which fields are valid (Fargate forbids some, e.g. host network mode) |
| networkMode | How task networking works | awsvpc, bridge, host, none | Fargate requires awsvpc; see the network-mode section below |
| cpu / memory (task level) | The CPU/memory envelope for the whole task | See Fargate matrix below | On Fargate these are required and must be a valid pair; on EC2 they are optional caps |
| taskRoleArn | The task role — IAM role your application code assumes to call AWS APIs | An IAM role ARN | This is what your container uses to reach S3, DynamoDB, etc. — least-privilege here |
| executionRoleArn | The execution role — IAM role the ECS agent uses to pull the image and write logs | An IAM role ARN | Needs ECR pull + CloudWatch Logs + (if used) Secrets Manager/SSM read |
| runtimePlatform | OS and CPU architecture | LINUX/WINDOWS; X86_64/ARM64 | Use ARM64 (Graviton) on Fargate for ~20% lower cost when your image supports it |
| volumes | Task-level volume definitions containers can mount | bind mounts, Docker volumes, EFS, FSx (EC2), Fargate ephemeral | The storage layer; see volumes below |
| placementConstraints | Rules restricting where (EC2) tasks land | e.g. memberOf an attribute expression | EC2 launch type only; ignored on Fargate |
| ephemeralStorage | Size of Fargate scratch storage | 21–200 GiB when set explicitly (Fargate) | 20 GiB is included by default; raise for large temp data |
| pidMode / ipcMode | Share PID/IPC namespaces across containers | pidMode: task/host; ipcMode: task/host/none | Advanced; host not allowed on Fargate |
| tags / proxyConfiguration | Resource tags and App Mesh proxy wiring | — | proxyConfiguration is used by App Mesh; Service Connect manages its own proxy |
The Fargate CPU/memory matrix
Fargate does not accept arbitrary CPU/memory — only specific combinations, and the valid memory range is constrained by the CPU value. Memorise the shape (the exact upper bounds have grown over time; these are the widely supported tiers):
| cpu (vCPU) | Valid memory range |
|---|---|
| 256 (.25 vCPU) | 512, 1024, 2048 MiB |
| 512 (.5 vCPU) | 1024–4096 MiB (1 GiB steps) |
| 1024 (1 vCPU) | 2048–8192 MiB (1 GiB steps) |
| 2048 (2 vCPU) | 4096–16384 MiB (1 GiB steps) |
| 4096 (4 vCPU) | 8192–30720 MiB (1 GiB steps) |
| 8192 (8 vCPU) | 16384–61440 MiB (4 GiB steps) |
| 16384 (16 vCPU) | 32768–122880 MiB (8 GiB steps) |
The whole task shares this budget. If you run a sidecar (a log router or proxy), it draws from the same pool — size the task for the sum, then optionally cap each container with container-level cpu/memory.
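Because an invalid pair is rejected only when you register the task definition, a local check can save a round trip. The sketch below encodes the table above as data; is_valid_fargate_size is a hypothetical helper, and the bounds are the tiers listed in this lesson, which may differ from what AWS currently accepts.

```python
# cpu units -> (minimum MiB, maximum MiB, step MiB), copied from the matrix above.
_RANGES = {
    512: (1024, 4096, 1024),
    1024: (2048, 8192, 1024),
    2048: (4096, 16384, 1024),
    4096: (8192, 30720, 1024),
    8192: (16384, 61440, 4096),
    16384: (32768, 122880, 8192),
}

# The smallest tier allows an explicit list rather than a stepped range.
_EXPLICIT = {256: (512, 1024, 2048)}


def is_valid_fargate_size(cpu, memory):
    """Return True when (cpu units, memory MiB) is a combination from the matrix."""
    if cpu in _EXPLICIT:
        return memory in _EXPLICIT[cpu]
    if cpu not in _RANGES:
        return False
    low, high, step = _RANGES[cpu]
    return low <= memory <= high and (memory - low) % step == 0
```

For example, 1 vCPU with 3 GiB (1024, 3072) passes, while 8 vCPU with 17 GiB (8192, 17408) fails because that tier moves in 4 GiB steps.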
Container-level settings (the container definition)
Each entry in containerDefinitions configures one container. The important fields:
| Field | What it is | Notes & gotchas |
|---|---|---|
| name | Container name (unique within the task) | Used by dependsOn, links, and load-balancer target wiring |
| image | The image URI to run | Use the full ECR URI; reference by digest for immutability |
| cpu (container) | Soft/relative CPU share for this container | Optional sub-allocation of the task cpu |
| memory (hard limit) | Container is killed if it exceeds this | Set at least one of memory/memoryReservation; OOM-kill is a common silent failure |
| memoryReservation (soft limit) | Reserved amount; container can burst above it if the host has room (EC2) | Lets you pack more on EC2 hosts |
| essential | If true, the whole task stops when this container exits | Mark your main app essential; sidecars usually false |
| portMappings | Container ports to expose (and host ports / names) | With awsvpc, hostPort = containerPort; name them for Service Connect |
| environment | Plain-text env vars | Never put secrets here — visible in the definition |
| secrets | Inject values from Secrets Manager / SSM Parameter Store as env vars | The secure way to pass credentials; needs execution-role read access |
| environmentFiles | Bulk env vars from a file in S3 | Handy for many variables |
| logConfiguration | Where stdout/stderr go | awslogs (CloudWatch), awsfirelens (FireLens → anywhere), splunk, etc. |
| healthCheck | A command run inside the container to report health | Distinct from the ALB health check; controls container/task health |
| dependsOn | Ordering: start/stop this container relative to others by condition | START, COMPLETE, SUCCESS, HEALTHY — e.g. wait for a migration container |
| command / entryPoint / workingDirectory | Override the image’s CMD/ENTRYPOINT/workdir | — |
| ulimits / linuxParameters | nofile limits, capabilities, initProcessEnabled, shared memory | initProcessEnabled: true reaps zombie processes — useful for many apps |
| mountPoints / volumesFrom | Mount task volumes into this container | Pairs with task-level volumes |
| readonlyRootFilesystem | Make the container’s root FS read-only | A strong security default; write only to mounted volumes |
| user | Run as a non-root UID/GID | Avoid running as root |
| stopTimeout | Grace period after SIGTERM before SIGKILL | Give your app time to drain |
A minimal Fargate task definition for a web container, registered with aws ecs register-task-definition --cli-input-json file://taskdef.json:
```json
{
  "family": "demo-web",
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc",
  "cpu": "256",
  "memory": "512",
  "runtimePlatform": { "cpuArchitecture": "ARM64", "operatingSystemFamily": "LINUX" },
  "executionRoleArn": "arn:aws:iam::111122223333:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::111122223333:role/demo-web-task-role",
  "containerDefinitions": [
    {
      "name": "web",
      "image": "111122223333.dkr.ecr.ap-south-1.amazonaws.com/demo-web:1.0.0",
      "essential": true,
      "portMappings": [{ "name": "http", "containerPort": 80, "protocol": "tcp" }],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/demo-web",
          "awslogs-region": "ap-south-1",
          "awslogs-stream-prefix": "web",
          "awslogs-create-group": "true"
        }
      },
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost/ || exit 1"],
        "interval": 30, "timeout": 5, "retries": 3, "startPeriod": 10
      }
    }
  ]
}
```
Network modes explained
The networkMode decides how a task’s containers get networking — and it is a favourite interview topic.
| Mode | How it works | Who can use it | When to pick it | Trade-off / gotcha |
|---|---|---|---|---|
| awsvpc | Each task gets its own elastic network interface (ENI) with a private IP in your subnet and its own security group | Fargate (required) and EC2 | Almost always — first-class VPC networking, per-task security groups, works with ALB IP targets | On EC2, each task consumes an ENI; instance types cap ENIs per host (mitigated by ENI trunking) |
| bridge | Docker’s default virtual bridge on the host; containers share the host’s network via NAT and port mappings | EC2 only | Legacy / dense packing where per-task ENIs are not needed | Needs dynamic host ports + ALB to avoid port clashes; no per-task security group |
| host | Containers bind directly to the host’s network interface and ports | EC2 only | Maximum network performance; ports must be unique per host | No port remapping — only one task per host can use a given port; no per-task SG |
| none | No external networking | EC2 only | Batch/compute that needs no network | Container cannot reach the network |
For essentially all new work, use awsvpc: it gives every task a real VPC IP and its own security group, integrates cleanly with the ALB (IP target type), and is the only mode Fargate supports. Read the production companion lesson for ENI/IP-address planning under awsvpc, which is where large fleets actually hit limits: Production Amazon ECS on Fargate: task networking, auto scaling, and safe rolling deployments.
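The compatibility column of the table above lends itself to a quick lookup. The sketch below is an illustrative check, assuming only the Fargate and EC2 rows from this lesson; the NETWORK_MODES dictionary and network_mode_problems function are hypothetical names, and the EXTERNAL launch type is deliberately not modelled.

```python
# Facts copied from the network-mode table in this lesson.
NETWORK_MODES = {
    "awsvpc": {"launch_types": {"FARGATE", "EC2"}, "per_task_security_group": True},
    "bridge": {"launch_types": {"EC2"}, "per_task_security_group": False},
    "host": {"launch_types": {"EC2"}, "per_task_security_group": False},
    "none": {"launch_types": {"EC2"}, "per_task_security_group": False},
}

_MODELLED = {"FARGATE", "EC2"}  # EXTERNAL is outside the scope of this sketch


def network_mode_problems(requires_compatibilities, network_mode):
    """Return a list of problems; an empty list means the combination is allowed."""
    if network_mode not in NETWORK_MODES:
        return [f"unknown network mode: {network_mode!r}"]
    allowed = NETWORK_MODES[network_mode]["launch_types"]
    return [
        f"{launch_type} does not support networkMode {network_mode!r}"
        for launch_type in requires_compatibilities
        if launch_type in _MODELLED and launch_type not in allowed
    ]
```

Run against the requiresCompatibilities and networkMode fields of a task definition, it flags the most common mistake: a definition meant for Fargate that still carries bridge or host networking.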
Task role vs execution role — the classic confusion
ECS uses two IAM roles and they are constantly mixed up. The distinction is who uses the role and for what.
| Role | Used by | Used for | Typical permissions |
|---|---|---|---|
| Execution role (executionRoleArn) | The ECS agent / Fargate infrastructure (before/around your container) | Pulling the image from ECR, writing logs to CloudWatch, and fetching secrets referenced in the task definition | AmazonECSTaskExecutionRolePolicy (ECR read + Logs) plus secretsmanager:GetSecretValue / ssm:GetParameters if you inject secrets |
| Task role (taskRoleArn) | Your application code inside the container | Calling AWS APIs your app needs at runtime (read S3, write DynamoDB, publish to SNS…) | Exactly the least-privilege set your app requires — nothing more |
The mnemonic: the execution role gets the task running (pull, log, secrets); the task role is what the running app can do. A task that fails to start with a “CannotPullContainerError” or “unable to retrieve secret” almost always has an execution-role problem; an AccessDenied from inside your code at runtime is a task-role problem.
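One way to keep the two roles apart is to write their policy documents side by side. The sketch below generates them as plain dictionaries; the function names, the bucket name, and the secret ARN are hypothetical placeholders. Both roles are assumed by the ECS tasks service principal, so they share a trust policy and differ only in what they permit.

```python
import json


def ecs_tasks_trust_policy():
    """Trust policy used by both roles: the ECS tasks service may assume them."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "ecs-tasks.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }


def execution_role_secret_access(secret_arns):
    """Extra execution-role permissions, needed only when secrets are injected."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["secretsmanager:GetSecretValue"],
            "Resource": list(secret_arns),
        }],
    }


def task_role_read_bucket(bucket_name):
    """Example task-role policy: the running app may read objects in one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": [f"arn:aws:s3:::{bucket_name}/*"],
        }],
    }


if __name__ == "__main__":
    # Hypothetical bucket name, shown only to illustrate the output shape.
    print(json.dumps(task_role_read_bucket("demo-web-assets"), indent=2))
```

The execution role would also carry the managed AmazonECSTaskExecutionRolePolicy named in the table; the secret-access document is the part teams most often forget to add.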
Volumes and storage
Containers are ephemeral; for files that must persist or be shared you attach a volume defined at the task level and mounted into containers via mountPoints.
| Volume type | What it is | Lifetime | Use it for |
|---|---|---|---|
| Bind mount | A path on the host (EC2) or task-scoped scratch (Fargate) | Task lifetime | Sharing files between containers in the same task (e.g. a sidecar reading the app’s logs) |
| Docker volume | A Docker-managed volume (EC2 only) | Task or instance | Local persistence on a container instance |
| Amazon EFS | A shared, elastic NFS file system mounted into the task | Independent of the task — durable & shared | Shared state across tasks/AZs; persistent data on Fargate |
| Amazon FSx for Windows | Windows shared file storage | Independent | Windows workloads (EC2) |
| Fargate ephemeral storage | The task’s scratch disk (20–200 GiB) | Task lifetime | Temp files, caches — not durable |
For anything that must survive a task replacement or be shared between tasks, use EFS — it is the durable, multi-AZ option and works on Fargate. Fargate’s own disk is wiped when the task stops.
Secrets, logging, and health checks
- Secrets — reference Secrets Manager or SSM Parameter Store entries in the secrets block; ECS fetches them at task start using the execution role and injects them as environment variables, so the secret values never appear in the task definition (only their ARNs or parameter names do). This is the correct way to pass database passwords and API keys.
- Logging — the awslogs driver streams stdout/stderr to CloudWatch Logs (set the group, Region, and stream prefix; awslogs-create-group: true auto-creates the group, which additionally requires logs:CreateLogGroup on the execution role). For richer routing (filtering, multiple destinations, third-party SIEMs) use the awsfirelens driver, which runs a Fluent Bit/Fluentd sidecar.
- Container health check — a command ECS runs inside the container to decide if it is healthy; this is separate from the load balancer’s health check. The ALB decides whether to send traffic; the container health check influences whether ECS considers the task healthy and may replace it. In a service behind an ALB, the ALB/target-group health check is usually the source of truth for traffic.
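A lightweight lint can catch credentials that slipped into the plain-text environment block instead of the secrets block. The sketch below is a heuristic only: plaintext_secret_names is a hypothetical helper, the name patterns are an assumption about common naming habits, and the sample values and ARN are invented.

```python
import re

# Heuristic: variable names that usually indicate a credential.
_SUSPICIOUS = re.compile(r"(PASSWORD|SECRET|TOKEN|API_?KEY)", re.IGNORECASE)


def plaintext_secret_names(container_definition):
    """Return names in `environment` that look like they belong in `secrets`."""
    return [
        entry["name"]
        for entry in container_definition.get("environment", [])
        if _SUSPICIOUS.search(entry["name"])
    ]


# Hypothetical container definition fragment used to exercise the check.
container = {
    "name": "web",
    "environment": [
        {"name": "LOG_LEVEL", "value": "info"},
        {"name": "DB_PASSWORD", "value": "not-a-real-password"},   # should be a secret
    ],
    "secrets": [
        {
            "name": "API_TOKEN",
            "valueFrom": "arn:aws:secretsmanager:ap-south-1:111122223333:secret:demo/api-token",
        }
    ],
}

flagged = plaintext_secret_names(container)
```

Entries under secrets are not flagged because only a reference (valueFrom) is stored there; the value itself is resolved by ECS at task start.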
Task vs service (revisited, with settings)
You can run a task definition two ways:
- Run a standalone task (aws ecs run-task) — launches one (or --count N) task that runs to completion or until stopped, with no supervision. Perfect for batch jobs, migrations, and ad-hoc work. Scheduled tasks (via EventBridge Scheduler) run a task on a cron/rate schedule — a serverless cron-for-containers.
- Create a service (aws ecs create-service) — launches and maintains a desired count of tasks, replaces failures, spreads tasks across AZs, registers them with a load balancer, drives auto scaling, and orchestrates deployments. This is how you run anything long-lived.
Key service settings:
| Setting | What it does | Notes |
|---|---|---|
| desiredCount | How many tasks the service keeps running | The number auto scaling adjusts |
| launchType / capacityProviderStrategy | Where tasks run | FARGATE, EC2, or a capacity-provider mix (e.g. Fargate Spot weighting) |
| deploymentConfiguration | Rolling-deploy bounds + circuit breaker | minimumHealthyPercent, maximumPercent, deploymentCircuitBreaker (see below) |
| deploymentController | Which deployment engine | ECS (rolling), CODE_DEPLOY (blue/green), EXTERNAL |
| loadBalancers | Target group(s) + container/port to register | Wires the service to an ALB/NLB |
| healthCheckGracePeriodSeconds | Ignore LB health checks for new tasks for N seconds after start | Stops slow-booting apps being killed before they are ready |
| placementStrategy / placementConstraints | How (EC2) tasks spread/bin-pack | spread across AZ, binpack, random |
| serviceConnectConfiguration / serviceRegistries | Service Connect or Cloud Map discovery | See discovery section |
| enableExecuteCommand | Allow aws ecs execute-command (ECS Exec) into a running task | The container-shell debugging path; needs SSM + task-role perms |
| propagateTags / enableECSManagedTags | Tag propagation from definition/service to tasks | Cost allocation |
Service scheduler strategies: REPLICA (the default — maintain N copies, spread across AZs) or DAEMON (run exactly one task per active container instance — EC2 only — for node agents like log shippers or monitoring).
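To show how the service settings above fit together, the sketch below assembles them into one input document that could be saved as JSON and passed to aws ecs create-service --cli-input-json. It is a minimal example under stated assumptions: build_service_input is a hypothetical helper, the container name web and port 80 match the demo task definition earlier in this lesson, and every ARN, subnet, and security-group ID is supplied by the caller.

```python
import json


def build_service_input(cluster, name, task_definition, target_group_arn,
                        subnets, security_groups, desired_count=2):
    """Assemble a Fargate REPLICA service definition as a plain dictionary."""
    return {
        "cluster": cluster,
        "serviceName": name,
        "taskDefinition": task_definition,          # family or family:revision
        "desiredCount": desired_count,
        "launchType": "FARGATE",
        "schedulingStrategy": "REPLICA",
        "deploymentController": {"type": "ECS"},    # rolling update
        "deploymentConfiguration": {
            "minimumHealthyPercent": 100,
            "maximumPercent": 200,
            "deploymentCircuitBreaker": {"enable": True, "rollback": True},
        },
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": list(subnets),
                "securityGroups": list(security_groups),
                "assignPublicIp": "DISABLED",
            }
        },
        "loadBalancers": [{
            "targetGroupArn": target_group_arn,
            "containerName": "web",
            "containerPort": 80,
        }],
        "healthCheckGracePeriodSeconds": 60,        # assumed value; tune to boot time
        "enableExecuteCommand": False,
        "propagateTags": "SERVICE",
    }


def to_cli_json(service_input):
    """Serialise for use with --cli-input-json file://service.json."""
    return json.dumps(service_input, indent=2)
```

Keeping the document in version control next to the task definition makes the chosen deployment bounds and grace period reviewable rather than buried in a console click path.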
Launch types: Fargate vs EC2
The single biggest architectural choice for an ECS workload is where the tasks run.
| Dimension | Fargate (serverless) | EC2 (self-managed hosts) |
|---|---|---|
| Who owns the host | AWS — no instances to see, patch, or scale | You — an Auto Scaling group of container instances you patch and right-size |
| Pricing model | Per-task vCPU + memory per second (1-minute minimum) | Per EC2 instance running, regardless of how full it is |
| Operational overhead | Minimal — pick CPU/memory and go | You manage AMIs, the ECS agent, scaling, bin-packing |
| Scaling speed | Fast; no host warm-up | Must also scale the instance fleet (mitigate with capacity-provider managed scaling / warm pools) |
| Density / cost at scale | Can be pricier for steady, packable, high-utilisation fleets | Cheaper when you keep instances well utilised (good bin-packing) |
| GPU / special hardware | Limited | Full access — GPU instances, specific families, larger sizes |
| Spot | FARGATE_SPOT | EC2 Spot in the ASG / capacity provider |
| Network mode | awsvpc only | awsvpc, bridge, host, none |
| Daemon tasks | Not applicable | DAEMON scheduling supported |
Rule of thumb: start on Fargate — it removes an entire layer of undifferentiated operational work (patching, scaling, bin-packing) and is the right default for most services. Move workloads to EC2 when you have a clear reason: steady, high-utilisation fleets where careful bin-packing beats per-task pricing; GPU or specialised instance needs; daemon workloads; or per-host customisation Fargate does not allow. Many teams run both in one cluster via capacity-provider strategies (e.g. baseline on Fargate, burst on FARGATE_SPOT).
Deployment types
When you deploy a new task-definition revision to a service, the deployment controller decides how traffic shifts from old tasks to new.
| Type | Controller | How it works | Rollback | When to use |
|---|---|---|---|---|
| Rolling update | ECS (default) | ECS gradually replaces old tasks with new ones in place, bounded by minimumHealthyPercent / maximumPercent | Automatic via the circuit breaker (or manual redeploy) | The default; simplest; in-place, no extra infrastructure |
| Blue/green | CODE_DEPLOY | CodeDeploy stands up a parallel (“green”) task set, shifts ALB traffic to it (all-at-once, canary, or linear), then tears down “blue” | Instant — shift traffic back to blue | Zero-downtime releases with pre-traffic validation and instant rollback |
| External | EXTERNAL | You manage task sets and traffic shifting yourself via the API (or a third-party tool) | You implement it | Custom deployment tooling / advanced control |
Rolling updates and the deployment circuit breaker
A rolling deployment is governed by two percentages of the desired count:
- minimumHealthyPercent — the floor of healthy tasks ECS must keep running during the deploy (e.g. 100 means never drop below the desired count — ECS adds new tasks before removing old ones).
- maximumPercent — the ceiling of total tasks (old + new) during the deploy (e.g. 200 means ECS may temporarily run double the desired count).
With min=100, max=200 ECS does a classic surge: spin up replacements, wait for them to pass health checks, then drain and stop the old ones — no capacity dip. Tighter bounds (e.g. min=50, max=100) trade capacity headroom for fewer concurrent tasks.
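The arithmetic behind those two percentages is worth working through once. The sketch below converts them into task counts for a given desired count. It assumes the floor is rounded up and the ceiling is rounded down to whole tasks, which is my reading of the ECS documentation rather than something stated in this lesson; rolling_bounds and can_make_progress are hypothetical helper names.

```python
def rolling_bounds(desired_count, minimum_healthy_percent, maximum_percent):
    """Return (minimum healthy tasks, maximum total tasks) during a rolling deploy.

    Assumption: the lower bound rounds up and the upper bound rounds down.
    """
    min_healthy = -(-desired_count * minimum_healthy_percent // 100)  # ceiling division
    max_total = desired_count * maximum_percent // 100                # floor division
    return min_healthy, max_total


def can_make_progress(desired_count, minimum_healthy_percent, maximum_percent):
    """In this simplified model a deploy needs room to either start or stop a task."""
    min_healthy, max_total = rolling_bounds(
        desired_count, minimum_healthy_percent, maximum_percent
    )
    return max_total > min_healthy
```

With a desired count of 4, min=100 and max=200 give bounds of (4, 8): ECS can start up to four new tasks before stopping any old one. With min=50 and max=100 the bounds are (2, 4): ECS must stop old tasks first, so capacity dips to two during the deploy. Setting both to 100 leaves no room in either direction, which in this model means the deployment cannot move.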
The deployment circuit breaker (deploymentCircuitBreaker: { enable: true, rollback: true }) is the safety net: if too many new tasks fail to start or stay healthy, ECS marks the deployment failed and (with rollback: true) automatically rolls back to the last known-good revision — instead of retrying a broken image forever. Always enable it for production rolling deployments. For richer release strategies (canary/linear traffic shifting with validation hooks), use blue/green via CodeDeploy.
ALB integration
Most ECS web services sit behind an Application Load Balancer. You attach the service to a target group, name the container and port to register, and ECS keeps the target group’s membership in sync as tasks come and go.
- Target type ip: with awsvpc, ECS registers each task’s ENI IP directly in the target group. This is the modern, recommended pattern (and the only option on Fargate).
- Target type instance: for EC2 with bridge or host mode, the ALB targets the instance and a host port. In bridge mode, ECS supports dynamic port mapping so many tasks of the same image can share a host without port clashes; host mode binds the container port directly on the host, so only one such task fits per instance.
- Health checks: the target group health check (path, codes, thresholds) decides whether the ALB sends traffic to a task; pair it with healthCheckGracePeriodSeconds so slow-booting tasks are not killed before they are ready.
- Connection draining / deregistration delay: when a task is stopping, the ALB stops sending new connections and lets in-flight ones finish (default 300s). Set your container stopTimeout to cover graceful shutdown.
ECS also supports the Network Load Balancer (TCP/UDP, ultra-low latency, static IP) for non-HTTP workloads. Path/host-based routing, TLS termination, and listener rules all live on the ALB exactly as they do for EC2 targets.
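The attachment described above can be sketched with two CLI calls. This is a rough outline under several assumptions: the function names are hypothetical, the container is named web and listens on port 80 (as in the lab), and an ALB with a listener forwarding to the target group already exists before the service is created. Region and credentials come from your CLI configuration.

```shell
# create_ip_target_group NAME VPC_ID
# Creates an HTTP target group whose targets are task IPs (needed for awsvpc
# and Fargate) and prints its ARN.
create_ip_target_group() {
  aws elbv2 create-target-group \
    --name "$1" --vpc-id "$2" \
    --protocol HTTP --port 80 --target-type ip \
    --health-check-path / \
    --query 'TargetGroups[0].TargetGroupArn' --output text
}

# create_alb_service CLUSTER SERVICE TASKDEF TG_ARN SUBNETS SECURITY_GROUP
# Creates a Fargate service registered in the target group. SUBNETS is a
# comma-separated list of private subnet IDs. The target group must already
# be attached to an ALB listener, otherwise ECS rejects the request.
create_alb_service() {
  aws ecs create-service \
    --cluster "$1" --service-name "$2" --task-definition "$3" \
    --desired-count 2 --launch-type FARGATE \
    --load-balancers "targetGroupArn=$4,containerName=web,containerPort=80" \
    --health-check-grace-period-seconds 60 \
    --network-configuration "awsvpcConfiguration={subnets=[$5],securityGroups=[$6],assignPublicIp=DISABLED}"
}
```

The 60-second grace period and desired count of 2 are illustrative values; pick a grace period slightly longer than your slowest observed startup.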
Service auto scaling
A service scales its desired count automatically via Application Auto Scaling (the same engine behind DynamoDB and Aurora autoscaling), using three policy types:
| Policy | How it works | Best for |
|---|---|---|
| Target tracking | Keep a metric at a target (e.g. average CPU at 60%, or ALBRequestCountPerTarget at 1000) | The default: set a goal, AWS adds/removes tasks to hold it |
| Step scaling | Add/remove a step of tasks when an alarm breaches by a range | Fine-grained control over scale increments |
| Scheduled scaling | Change min/max/desired on a schedule (cron) | Predictable daily/weekly patterns (scale up before business hours) |
Target tracking on CPU utilisation or ALB request count per target covers most web services. You set a minimum and maximum capacity (the bounds auto scaling stays within) and the scaling policy adjusts desired count between them. The companion production lesson covers tuning cooldowns, combining policies, and scaling the EC2 capacity layer underneath.
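As a hedged illustration, target tracking on CPU takes two Application Auto Scaling calls: register the service as a scalable target, then attach the policy. The helper names, the policy name cpu-target-tracking, and the cooldown values are assumptions made for this sketch. Region and credentials come from your CLI configuration.

```shell
# target_tracking_config TARGET_CPU_PERCENT
# Prints a target-tracking configuration that holds average service CPU at
# the given target. Cooldowns (seconds) are illustrative, not recommendations.
target_tracking_config() {
  cat <<EOF
{
  "TargetValue": $1,
  "PredefinedMetricSpecification": {
    "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
  },
  "ScaleOutCooldown": 60,
  "ScaleInCooldown": 120
}
EOF
}

# enable_cpu_scaling CLUSTER SERVICE MIN MAX TARGET_CPU_PERCENT
enable_cpu_scaling() {
  resource_id="service/$1/$2"
  # 1. Register the service's desired count as a scalable target with bounds.
  aws application-autoscaling register-scalable-target \
    --service-namespace ecs \
    --scalable-dimension ecs:service:DesiredCount \
    --resource-id "$resource_id" \
    --min-capacity "$3" --max-capacity "$4"
  # 2. Attach the target-tracking policy to that scalable target.
  aws application-autoscaling put-scaling-policy \
    --service-namespace ecs \
    --scalable-dimension ecs:service:DesiredCount \
    --resource-id "$resource_id" \
    --policy-name cpu-target-tracking \
    --policy-type TargetTrackingScaling \
    --target-tracking-scaling-policy-configuration "$(target_tracking_config "$5")"
}

target_tracking_config 60   # print the JSON that would be sent
```

For example, enable_cpu_scaling demo-cluster demo-web-svc 1 4 60 would keep the lab service between 1 and 4 tasks while aiming for 60% average CPU.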
Service discovery and ECS Service Connect
When services need to call each other (not the public internet), they need a stable name to resolve. ECS offers two mechanisms.
| Mechanism | What it is | How callers reach a service | Adds |
|---|---|---|---|
| Service discovery (Cloud Map) | ECS registers each task in AWS Cloud Map, which creates DNS records (e.g. web.internal) | DNS resolution to task IPs | Simple name → IP; DNS-based, so client-side caching and no built-in retries/metrics |
| ECS Service Connect | A managed proxy sidecar (Envoy) injected into tasks; services talk via logical names with the proxy handling routing | A logical endpoint name; the proxy load-balances per request | Per-request load balancing, retries, connection draining, and built-in traffic metrics — no DNS-caching pitfalls |
Service discovery (Cloud Map) is the lightweight, DNS-based option. Service Connect is the newer, richer option: it gives you client-side load balancing, automatic retries, health-aware routing, and telemetry without an ALB between every pair of services, and it sidesteps the stale-DNS problems that plague raw DNS discovery. For internal service-to-service traffic at any scale, Service Connect is usually the better choice; reserve the ALB for north-south (ingress) traffic. The dedicated comparison — when to reach for Service Connect, Cloud Map, or an internal load balancer — is in ECS Service Connect deep dive: service discovery, traffic resilience, and migrating off ALBs.
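For a sense of what enabling Service Connect involves, the sketch below builds a service-level configuration and applies it to an existing service. It rests on assumptions that are not part of the lab: a Cloud Map namespace (here called internal) already exists, and the task definition's port mapping carries a name (here web) that portName can refer to. The helper names are hypothetical, and region and credentials come from your CLI configuration.

```shell
# service_connect_config NAMESPACE PORT_NAME DNS_NAME PORT
# Prints a Service Connect configuration that publishes one named port under
# a logical DNS name that other services in the namespace can call.
service_connect_config() {
  cat <<EOF
{
  "enabled": true,
  "namespace": "$1",
  "services": [
    {
      "portName": "$2",
      "discoveryName": "$3",
      "clientAliases": [{ "port": $4, "dnsName": "$3" }]
    }
  ]
}
EOF
}

# enable_service_connect CLUSTER SERVICE NAMESPACE
# Applies the configuration and starts a new deployment so running tasks are
# replaced by tasks that include the proxy.
enable_service_connect() {
  aws ecs update-service --cluster "$1" --service "$2" \
    --service-connect-configuration "$(service_connect_config "$3" web web 80)" \
    --force-new-deployment
}

service_connect_config internal web web 80   # print the JSON that would be sent
```

Other services in the same namespace would then reach this one at http://web:80, with the proxy handling load balancing and retries.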
ECS vs EKS
Both run containers on AWS; the difference is the control plane and ecosystem.
| Dimension | Amazon ECS | Amazon EKS (managed Kubernetes) |
|---|---|---|
| Orchestrator | AWS-proprietary, fully managed; no control plane to run | Kubernetes — the open standard; AWS manages the control plane |
| Learning curve | Low — a handful of concepts (task def, task, service, cluster) | High — pods, deployments, services, ingress, RBAC, CRDs, the whole K8s surface |
| Control-plane cost | None (you pay only for compute) | A per-cluster hourly charge plus compute |
| Portability | AWS-only | Portable — same manifests run on any Kubernetes (other clouds, on-prem) |
| Ecosystem | Tight AWS integration (IAM, ALB, CloudWatch) out of the box | Vast CNCF ecosystem (Helm, Operators, Istio, Argo, Karpenter…) |
| Compute | Fargate or EC2 | Fargate, EC2 managed node groups, or Karpenter |
| Best when | You want the simplest path to production containers on AWS and are happy being AWS-native | You need Kubernetes specifically — portability, an existing K8s investment/skills, or the CNCF ecosystem |
The short answer: choose ECS when you want to ship containers on AWS with the least operational and conceptual overhead and have no specific need for Kubernetes; choose EKS when Kubernetes itself is a requirement — for multi-cloud portability, an existing Kubernetes platform/skill set, or the rich CNCF tooling. ECS is “the AWS way”; EKS is “Kubernetes, managed by AWS”. Neither is universally better — they optimise for different priorities.
The diagram traces the full path: you build an image and push it to an ECR repository; a task definition references that image; a service in a cluster launches the desired number of tasks (on Fargate or EC2); an ALB routes inbound traffic to the tasks; service auto scaling adjusts the count; and Service Connect / Cloud Map lets services find each other.
Hands-on lab
You will build a tiny container image, push it to ECR, run it as an ECS service on Fargate, verify it, and clean everything up. This stays within the AWS Free Tier for the brief time it runs, but Fargate tasks bill per second while running, so do the cleanup promptly. Region used: ap-south-1 (Mumbai) — substitute your own.
Prerequisites: Docker running locally, the AWS CLI configured, and a default VPC with public subnets (every account has one). Set helper variables:
export AWS_REGION=ap-south-1
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export REPO=demo-web
export CLUSTER=demo-cluster
1. Create the ECR repository (immutable, scan on push)
aws ecr create-repository \
--repository-name "$REPO" \
--image-tag-mutability IMMUTABLE \
--image-scanning-configuration scanOnPush=true \
--region "$AWS_REGION"
2. Build and push a minimal image
Create a one-file site and a Dockerfile in an empty directory:
mkdir demo-web && cd demo-web
echo '<h1>Hello from ECS on Fargate</h1>' > index.html
printf 'FROM public.ecr.aws/nginx/nginx:stable\nCOPY index.html /usr/share/nginx/html/index.html\n' > Dockerfile
aws ecr get-login-password --region "$AWS_REGION" \
| docker login --username AWS --password-stdin "$ACCOUNT.dkr.ecr.$AWS_REGION.amazonaws.com"
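# Fargate defaults to x86_64, so pin the platform; otherwise an image built on an ARM machine (e.g. Apple Silicon) will not start
docker build --platform linux/amd64 -t "$ACCOUNT.dkr.ecr.$AWS_REGION.amazonaws.com/$REPO:1.0.0" .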
docker push "$ACCOUNT.dkr.ecr.$AWS_REGION.amazonaws.com/$REPO:1.0.0"
Expected: the push reports each layer Pushed and a final digest. Confirm and view scan results:
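aws ecr describe-images --repository-name "$REPO" --region "$AWS_REGION" \
--query 'imageDetails[].imageTags'
aws ecr describe-image-scan-findings --repository-name "$REPO" --image-id imageTag=1.0.0 \
--region "$AWS_REGION" --query 'imageScanFindings.findingSeverityCounts'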
3. Create the cluster
aws ecs create-cluster --cluster-name "$CLUSTER" --region "$AWS_REGION"
4. Ensure the execution role exists
Most accounts already have ecsTaskExecutionRole. If not, create it with the trust policy for ECS tasks and attach the managed policy:
cat > trust.json <<'EOF'
{ "Version": "2012-10-17", "Statement": [{ "Effect": "Allow",
"Principal": { "Service": "ecs-tasks.amazonaws.com" }, "Action": "sts:AssumeRole" }] }
EOF
aws iam create-role --role-name ecsTaskExecutionRole \
--assume-role-policy-document file://trust.json 2>/dev/null || true
aws iam attach-role-policy --role-name ecsTaskExecutionRole \
--policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy
5. Register the task definition
Write taskdef.json, then register it. The heredoc delimiter is unquoted, so the shell substitutes $ACCOUNT, $AWS_REGION, and $REPO for you:
cat > taskdef.json <<EOF
{
"family": "demo-web",
"requiresCompatibilities": ["FARGATE"],
"networkMode": "awsvpc",
"cpu": "256",
"memory": "512",
"executionRoleArn": "arn:aws:iam::$ACCOUNT:role/ecsTaskExecutionRole",
"containerDefinitions": [
{
"name": "web",
"image": "$ACCOUNT.dkr.ecr.$AWS_REGION.amazonaws.com/$REPO:1.0.0",
"essential": true,
"portMappings": [{ "containerPort": 80, "protocol": "tcp" }],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/demo-web",
"awslogs-region": "$AWS_REGION",
"awslogs-stream-prefix": "web",
"awslogs-create-group": "true"
}
}
}
]
}
EOF
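# AmazonECSTaskExecutionRolePolicy does not grant logs:CreateLogGroup, so create the log group up front
aws logs create-log-group --log-group-name /ecs/demo-web --region "$AWS_REGION" 2>/dev/null || true
aws ecs register-task-definition --cli-input-json file://taskdef.json --region "$AWS_REGION"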
6. Run it as a service on Fargate
Grab a default-VPC subnet and a security group, then create a one-task service with a public IP (for the lab; production tasks live in private subnets behind an ALB):
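# Resolve the default VPC first so the subnet and security group are guaranteed to belong to the same VPC
VPC=$(aws ec2 describe-vpcs --region "$AWS_REGION" \
--filters "Name=is-default,Values=true" --query 'Vpcs[0].VpcId' --output text)
SUBNET=$(aws ec2 describe-subnets --region "$AWS_REGION" \
--filters "Name=vpc-id,Values=$VPC" "Name=default-for-az,Values=true" --query 'Subnets[0].SubnetId' --output text)
SG=$(aws ec2 describe-security-groups --region "$AWS_REGION" \
--filters "Name=vpc-id,Values=$VPC" "Name=group-name,Values=default" --query 'SecurityGroups[0].GroupId' --output text)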
aws ecs create-service \
--cluster "$CLUSTER" \
--service-name demo-web-svc \
--task-definition demo-web \
--desired-count 1 \
--launch-type FARGATE \
--network-configuration "awsvpcConfiguration={subnets=[$SUBNET],securityGroups=[$SG],assignPublicIp=ENABLED}" \
--deployment-configuration "deploymentCircuitBreaker={enable=true,rollback=true},minimumHealthyPercent=100,maximumPercent=200" \
--region "$AWS_REGION"
The default security group allows all traffic within itself but not from the internet. To actually browse the page, add an inbound rule for TCP 80 from your IP to $SG. For a quick functional check, the task reaching RUNNING is enough.
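If you do want to browse the page, the inbound rule can be added and later removed from the CLI. This is a minimal sketch: the helper names are hypothetical, and you supply your own public IPv4 address (the example address in the comment is a documentation placeholder). Region and credentials come from your CLI configuration.

```shell
# open_http_from SG_ID IPV4
# Allows inbound TCP 80 to the security group from a single address.
open_http_from() {
  aws ec2 authorize-security-group-ingress --group-id "$1" \
    --protocol tcp --port 80 --cidr "$2/32"
}

# close_http_from SG_ID IPV4
# Removes the same rule again; run it as part of cleanup.
close_http_from() {
  aws ec2 revoke-security-group-ingress --group-id "$1" \
    --protocol tcp --port 80 --cidr "$2/32"
}

# Usage:
#   open_http_from "$SG" 203.0.113.7
#   close_http_from "$SG" 203.0.113.7
```

Scoping the rule to a /32 keeps the default security group closed to everyone else while the lab runs.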
7. Validate
aws ecs describe-services --cluster "$CLUSTER" --services demo-web-svc \
--region "$AWS_REGION" --query 'services[0].{desired:desiredCount,running:runningCount,status:status}'
Expected once stable: running equals desired (1) and status is ACTIVE. List the task and read its public IP:
TASK=$(aws ecs list-tasks --cluster "$CLUSTER" --service-name demo-web-svc \
--region "$AWS_REGION" --query 'taskArns[0]' --output text)
ENI=$(aws ecs describe-tasks --cluster "$CLUSTER" --tasks "$TASK" --region "$AWS_REGION" \
--query "tasks[0].attachments[0].details[?name=='networkInterfaceId'].value" --output text)
aws ec2 describe-network-interfaces --network-interface-ids "$ENI" --region "$AWS_REGION" \
--query 'NetworkInterfaces[0].Association.PublicIp' --output text
If you opened port 80, curl http://<that-ip>/ returns the Hello from ECS on Fargate page.
Cleanup
Delete in order — service, cluster, repository, logs — so nothing keeps billing:
aws ecs update-service --cluster "$CLUSTER" --service demo-web-svc \
--desired-count 0 --region "$AWS_REGION"
aws ecs delete-service --cluster "$CLUSTER" --service demo-web-svc \
--force --region "$AWS_REGION"
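# delete-cluster fails while the service is still draining, so wait until it is inactive
aws ecs wait services-inactive --cluster "$CLUSTER" --services demo-web-svc --region "$AWS_REGION"
aws ecs delete-cluster --cluster "$CLUSTER" --region "$AWS_REGION"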
aws ecr delete-repository --repository-name "$REPO" --force --region "$AWS_REGION"
aws logs delete-log-group --log-group-name /ecs/demo-web --region "$AWS_REGION" 2>/dev/null || true
Cost note: a Fargate task with 256 CPU units (0.25 vCPU) and 512 MB of memory costs a few US cents per hour while running; run through the lab and clean up in the same session and the cost is negligible. ECR storage is billed per GB-month (the lab image is tens of MB) and the first 500 MB-month of private storage is free; basic scanning and CloudWatch Logs ingestion for this tiny workload are effectively free. The cluster itself costs nothing: you pay only for running tasks and stored data.
Common mistakes & troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| Task stuck PENDING/STOPPED with CannotPullContainerError | Execution role lacks ECR permissions, or the task in a private subnet has no route to ECR | Attach AmazonECSTaskExecutionRolePolicy; give the subnet a NAT gateway or ECR + S3 VPC endpoints |
| Task stops immediately with ResourceInitializationError: unable to retrieve secret | Execution role can’t read the Secrets Manager/SSM value (or no network path to the service) | Grant secretsmanager:GetSecretValue/ssm:GetParameters to the execution role; add VPC endpoints if private |
| App gets AccessDenied calling S3/DynamoDB at runtime | Wrong role: the permission is on the execution role, not the task role | Put the app’s API permissions on the task role (taskRoleArn) |
| Invalid CPU/memory combination on register | Fargate only accepts specific CPU/memory pairs | Match the Fargate CPU/memory matrix (e.g. cpu 256 → memory 512/1024/2048) |
| Deploy never finishes; tasks cycle PROVISIONING→STOPPED | New tasks fail health checks (bad image, wrong port, slow boot) | Enable the deployment circuit breaker (auto-rollback); fix the health-check path/port; set healthCheckGracePeriodSeconds |
| Healthy tasks but 502/503 from the ALB | Target-group container/port mismatch, or SG blocks the ALB→task path | Register the correct container name + port; allow the ALB SG to reach the task SG on the container port |
| bridge-mode tasks fail to place with “ports already in use” | Static host ports clash on the EC2 host | Use dynamic host ports (hostPort: 0) with an ALB, or switch to awsvpc |
| Untagged images and storage cost creeping up | Mutable tags overwrite, orphaning old images; no cleanup | Use immutable tags and an ECR lifecycle policy to expire untagged/old images |
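Most rows in this table start from the same question: why did the task stop? A rough sketch of how you might pull that answer from the CLI follows. The helper names are hypothetical, and ECS only keeps stopped tasks visible for a limited time, so run this soon after the failure. Region and credentials come from your CLI configuration.

```shell
# why_stopped CLUSTER SERVICE
# Shows the stop code, stop reason, and per-container exit details for one
# recently stopped task of the service.
why_stopped() {
  task=$(aws ecs list-tasks --cluster "$1" --service-name "$2" \
    --desired-status STOPPED --query 'taskArns[0]' --output text)
  if [ "$task" = "None" ]; then
    echo "no stopped tasks found"
    return 0
  fi
  aws ecs describe-tasks --cluster "$1" --tasks "$task" \
    --query 'tasks[0].{stopCode:stopCode,stoppedReason:stoppedReason,containers:containers[].{name:name,exitCode:exitCode,reason:reason}}' \
    --output json
}

# recent_events CLUSTER SERVICE
# Prints the five most recent service events (placement failures, failed
# health checks, and deployment messages show up here).
recent_events() {
  aws ecs describe-services --cluster "$1" --services "$2" \
    --query 'services[0].events[:5].message' --output json
}
```

The stoppedReason string usually names the error from the Symptom column (for example CannotPullContainerError), which tells you which row to follow.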
Best practices
- Default to Fargate; move to EC2 only for a concrete reason (GPU, steady high-utilisation packing, daemon workloads, host customisation).
- Use awsvpc networking so every task has its own IP and security group, and use the ALB ip target type.
- Two roles, least privilege: execution role = pull/log/secrets only; task role = exactly the app’s runtime API needs.
- Immutable image tags + reference by digest, plus an ECR lifecycle policy to expire old/untagged images automatically.
- Enable enhanced scanning for production images so they are continuously re-evaluated against new CVEs.
- Always enable the deployment circuit breaker with rollback on rolling deployments; use blue/green for releases that need pre-traffic validation and instant rollback.
- Set min=100, max=200 for zero-downtime rolling deploys, and a sensible health-check grace period for slow-booting apps.
- Inject secrets from Secrets Manager / SSM, never as plain-text environment variables.
- Ship logs to CloudWatch (or FireLens) and turn on Container Insights for cluster/service metrics.
- Run as non-root with a read-only root filesystem and write only to mounted volumes.
- Use Service Connect for internal service-to-service traffic; reserve ALBs for ingress.
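The lifecycle-policy practice above can be sketched as follows. The rule descriptions, the helper names, and the 14-day / 10-image values are illustrative assumptions. The rule with tagStatus any is given the highest rulePriority number, which ECR requires. Region and credentials come from your CLI configuration.

```shell
# lifecycle_policy UNTAGGED_DAYS KEEP_COUNT
# Prints an ECR lifecycle policy with two rules: expire untagged images after
# UNTAGGED_DAYS, then keep only the newest KEEP_COUNT images overall.
lifecycle_policy() {
  cat <<EOF
{
  "rules": [
    {
      "rulePriority": 1,
      "description": "Expire untagged images older than $1 days",
      "selection": {
        "tagStatus": "untagged",
        "countType": "sinceImagePushed",
        "countUnit": "days",
        "countNumber": $1
      },
      "action": { "type": "expire" }
    },
    {
      "rulePriority": 2,
      "description": "Keep only the newest $2 images",
      "selection": {
        "tagStatus": "any",
        "countType": "imageCountMoreThan",
        "countNumber": $2
      },
      "action": { "type": "expire" }
    }
  ]
}
EOF
}

# apply_lifecycle_policy REPOSITORY
apply_lifecycle_policy() {
  aws ecr put-lifecycle-policy --repository-name "$1" \
    --lifecycle-policy-text "$(lifecycle_policy 14 10)"
}

lifecycle_policy 14 10   # print the JSON that would be sent
```

Because expiry is permanent, it is worth checking that a keep count of 10 comfortably covers every image a running service or rollback target still references before applying the policy.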
Security notes
The container security model on ECS rests on a few pillars. Image provenance: scan images (enhanced scanning) and use immutable tags so a deployed tag cannot be silently swapped — reference by digest for the strongest guarantee. Least-privilege identity: the task role should grant only the AWS APIs the app calls, and the execution role only pull/log/secrets — never reuse a broad role across services. Secrets: keep credentials in Secrets Manager / SSM and inject them via the secrets block so they never appear in the (readable) task definition or in environment listings. Network isolation: with awsvpc, give each service its own security group and run tasks in private subnets, reaching AWS services through VPC endpoints and the internet (if needed) through a NAT gateway. Runtime hardening: run as a non-root user, set readonlyRootFilesystem, drop Linux capabilities you do not need, and avoid privileged mode. Registry access: lock down repository policies; cross-account pulls should be explicit and scoped. Auditing: ECS, ECR, and the IAM role assumptions are all logged in CloudTrail; ECS Exec sessions are auditable via SSM. Finally, prefer Service Connect/Cloud Map internal traffic over exposing services publicly, and put a WAF + the ALB (or CloudFront) in front of anything internet-facing.
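Several of these pillars map directly onto container-definition fields. The sketch below prints a hardened container definition fragment; it is illustrative only. The helper name, the DB_PASSWORD variable, and the user ID are assumptions, and the image is assumed to be built to run as a non-root user and to write only to mounted volumes (the stock nginx image from the lab would need changes to run this way).

```shell
# hardened_container NAME IMAGE SECRET_ARN
# Prints a container definition fragment that runs as a non-root user, uses a
# read-only root filesystem, drops all Linux capabilities, and receives one
# secret by reference rather than as a plain-text environment variable.
hardened_container() {
  cat <<EOF
{
  "name": "$1",
  "image": "$2",
  "essential": true,
  "user": "1000:1000",
  "readonlyRootFilesystem": true,
  "linuxParameters": { "capabilities": { "drop": ["ALL"] } },
  "secrets": [
    { "name": "DB_PASSWORD", "valueFrom": "$3" }
  ]
}
EOF
}

# Example with placeholder values:
hardened_container api example.com/api:1.0.0 \
  arn:aws:ssm:ap-south-1:111122223333:parameter/demo/db_password
```

Only the ARN appears in the task definition; the execution role needs permission to read that parameter so the value can be injected when the task starts.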
Interview & exam questions
- What is the difference between a task and a service in ECS? A task is one running instantiation of a task definition (one or more containers running together on a host), and once it stops it is gone. A service is a controller that maintains a desired count of tasks: it replaces failures, spreads tasks across AZs, registers them with a load balancer, drives auto scaling, and orchestrates deployments. Tasks are for one-off/batch work; services are for long-lived applications.
- Explain the task role versus the execution role. The execution role is used by the ECS agent/Fargate infrastructure to pull the image from ECR, write logs to CloudWatch, and fetch secrets referenced in the definition. The task role is assumed by your application code at runtime to call AWS APIs (S3, DynamoDB, etc.). Execution role = get the task running; task role = what the running app can do.
- What are the ECS network modes, and which does Fargate require? awsvpc (each task gets its own ENI/IP and security group; required by Fargate), bridge (Docker bridge with port mapping, EC2 only), host (bind directly to host network, EC2 only), and none. awsvpc is the modern default everywhere.
- When would you choose Fargate over the EC2 launch type, and vice versa? Choose Fargate for minimal operations (no hosts to patch/scale), per-task pricing, and fast scaling; it is the right default for most services. Choose EC2 for steady high-utilisation fleets where bin-packing beats per-task pricing, for GPU/special instances, for daemon workloads, or for host-level customisation Fargate doesn’t allow.
- What does the deployment circuit breaker do? On a rolling deployment, if too many new tasks fail to start or stay healthy, the circuit breaker marks the deployment failed and (with rollback: true) automatically rolls the service back to the last healthy revision, preventing an endless loop of launching a broken image.
- How do minimumHealthyPercent and maximumPercent control a rolling deploy? minimumHealthyPercent is the floor of healthy tasks ECS must keep during the deploy; maximumPercent is the ceiling of total (old+new) tasks. min=100, max=200 lets ECS add new tasks before removing old ones (no capacity dip); lower values trade headroom for fewer concurrent tasks.
- How do blue/green deployments work on ECS, and what do they add over rolling? With the CodeDeploy controller, ECS stands up a parallel green task set, shifts ALB traffic to it (all-at-once, canary, or linear) with optional validation hooks, then retires the blue set. They add pre-traffic validation and instant rollback (shift traffic back) that in-place rolling updates lack.
- What is ECR tag immutability and why does it matter? With IMMUTABLE repositories, a tag cannot be overwritten once pushed. This prevents a tag like prod or latest from silently pointing at a different image (a supply-chain and rollback hazard), making deployments reproducible and auditable.
- What is an ECR lifecycle policy? A prioritised set of rules that automatically expire images by age or count (e.g. keep the last 10 prod images, delete untagged images older than 14 days), controlling storage cost and clutter. Expiry is permanent, so selections must avoid images that running services still reference.
- How does an ECS service integrate with an Application Load Balancer? The service is attached to a target group with a named container+port; ECS keeps the target group in sync as tasks start/stop. With awsvpc, the ip target type registers each task’s ENI IP directly. A health-check grace period protects slow-booting tasks, and deregistration delay drains connections on stop.
- What is the difference between service discovery (Cloud Map) and ECS Service Connect? Cloud Map is DNS-based discovery (names resolve to task IPs): simple, but subject to client DNS caching and with no built-in retries/metrics. Service Connect injects a managed proxy that gives per-request load balancing, retries, health-aware routing, and traffic metrics, which makes it better for internal service-to-service traffic.
- When would you choose ECS over EKS? Choose ECS for the simplest path to production containers on AWS with no Kubernetes overhead and tight AWS integration (and no control-plane cost). Choose EKS when you specifically need Kubernetes: portability/multi-cloud, existing K8s skills/investment, or the CNCF ecosystem (Helm, Operators, Karpenter).
- Basic vs enhanced image scanning in ECR? Basic uses ECR’s built-in scanner on OS packages, runs on push/on demand, and is free. Enhanced uses Amazon Inspector, covers OS and language packages, runs on push and continuously as new CVEs appear, and is billed per scan.
Quick check
- Which IAM role pulls the image from ECR and writes the task’s logs?
- Which network mode is mandatory on Fargate, and what does each task get under it?
- What two percentages bound a rolling deployment, and what does min=100, max=200 achieve?
- Name the three ECS deployment controllers.
- What does an ECR lifecycle policy do, and is the action reversible?
Answers
- The execution role (executionRoleArn): it handles image pull, CloudWatch Logs, and secret retrieval. (The task role is what your app code uses at runtime.)
- awsvpc: each task gets its own ENI with a private IP in your subnet and its own security group.
- minimumHealthyPercent (floor of healthy tasks) and maximumPercent (ceiling of total tasks). min=100, max=200 lets ECS launch new tasks before stopping old ones, so capacity never dips during the deploy.
- ECS (rolling update), CODE_DEPLOY (blue/green), and EXTERNAL (you manage task sets/traffic).
- It automatically expires images by age or count to control storage and clutter; the expire action is permanent, and there is no recycle bin.
Exercise
Take the lab service and harden it toward production:
- Move the service into private subnets behind an internet-facing ALB: create a target group with target type ip, attach it to the service (--load-balancers), and set a health-check grace period. Confirm the page is reachable through the ALB DNS name, not a task public IP.
- Add a task role granting read-only access to one S3 bucket, and prove from inside the container (via ECS Exec, aws ecs execute-command) that the app identity can list that bucket but nothing else.
- Inject a value from SSM Parameter Store (or Secrets Manager) using the secrets block instead of a plain-text environment variable; verify the execution role needed the read permission and that the value is not visible in the task definition.
- Configure target-tracking auto scaling on ALBRequestCountPerTarget (min 1, max 4); generate load and watch the desired count rise and fall.
- Add an ECR lifecycle policy (keep last 5 images; expire untagged after 7 days), push a few revisions, and confirm old/untagged images are reaped.
- Switch the service to blue/green via CodeDeploy and run a canary deploy (10% for 5 minutes, then 100%), then trigger a rollback.
Certification mapping
| Exam | Objective area this supports |
|---|---|
| DVA-C02 (Developer – Associate) | Development & deployment with AWS services — packaging apps as containers, ECR push/pull and image management, authoring task definitions (roles, secrets, logging), running services, and rolling/blue-green deployments with rollback. |
| SAA-C03 (Solutions Architect – Associate) | Design resilient, cost-optimised architectures — choosing Fargate vs EC2, ECS vs EKS, awsvpc networking and ALB integration, service auto scaling, and service discovery/Service Connect for decoupled microservices. |
| SOA-C02 (SysOps Administrator – Associate) | Deployment, monitoring & troubleshooting — operating ECS services, Container Insights/CloudWatch logging, deployment health and circuit breaker, and diagnosing task-launch and load-balancer issues. |
Glossary
- ECR (Elastic Container Registry): AWS’s managed registry for storing, versioning, and scanning container images.
- Repository: a named collection of image versions within ECR (typically one per application image).
- Tag immutability: a repository setting (IMMUTABLE) preventing a tag from being overwritten once pushed.
- Lifecycle policy: prioritised rules that automatically expire ECR images by age or count.
- Image scanning (basic/enhanced): CVE scanning of images; basic (ECR, OS packages, on push) vs enhanced (Amazon Inspector, OS + language packages, continuous).
- ECS (Elastic Container Service): AWS’s proprietary container orchestrator that runs tasks/services from task definitions.
- Cluster: a logical grouping and capacity boundary for ECS tasks and services.
- Task definition: the versioned JSON blueprint (family + revisions) describing containers and task-level settings.
- Container definition: the spec for one container inside a task (image, ports, env, secrets, logging, health check).
- Task: a single running instantiation of a task definition; the unit of scheduling.
- Service: a controller maintaining a desired count of tasks, with load-balancer, scaling, and deployment integration.
- Launch type: where tasks run, either Fargate (serverless) or EC2 (your container instances).
- Capacity provider: managed capacity for a cluster (FARGATE, FARGATE_SPOT, or an Auto Scaling group provider).
- Network mode: task networking model, one of awsvpc, bridge, host, or none.
- awsvpc mode: each task gets its own ENI/IP and security group; required by Fargate.
- Execution role: the IAM role the ECS agent uses to pull images, write logs, and fetch secrets.
- Task role: the IAM role your application code assumes at runtime to call AWS APIs.
- Desired count: the number of tasks a service keeps running.
- Deployment circuit breaker: auto-detects a failing deployment and (optionally) rolls it back to the last healthy revision.
- Rolling update / blue-green / external: the three ECS deployment controllers (ECS, CODE_DEPLOY, EXTERNAL).
- Service Connect: a managed proxy giving per-request load balancing, retries, and metrics for service-to-service traffic.
- Service discovery (Cloud Map): DNS-based registration/resolution of tasks by name.
- Fargate CPU/memory matrix: the fixed set of valid CPU/memory combinations Fargate accepts.
Next steps
Continue the course with the Amazon CloudFront deep dive — the CDN that commonly sits in front of an ECS-backed application for global caching and TLS. Then go deeper on running containers in production:
- Production Amazon ECS on Fargate: Task Networking, Auto Scaling, and Safe Rolling Deployments (awsvpc ENI/IP planning, scaling-policy tuning, deployment circuit breakers, and graceful task lifecycle).
- ECS Service Connect Deep Dive: Service Discovery, Traffic Resilience, and Migrating Off ALBs (when to use Service Connect, Cloud Map, or an internal load balancer for service-to-service traffic).