A container bundles your application and everything it needs to run into one immutable artefact, and on a laptop docker run makes that feel trivial. Production is where the real questions start: where does the image live, who is allowed to pull it, how many copies run, what happens when one dies at 3am, how does a new version roll out without dropping requests, and how do other services find it. Amazon ECS (Elastic Container Service) and Amazon ECR (Elastic Container Registry) are AWS’s answers to exactly those questions, and together they give you a short path from a Dockerfile to a resilient, auto-scaling, load-balanced service without running Kubernetes.

ECR is the registry — a managed, private (or public) place to store and version your container images, with vulnerability scanning and automated cleanup built in. ECS is the orchestrator — it takes a task definition (a JSON blueprint of your containers) and runs the requested number of copies as tasks, keeps them healthy as a service, registers them behind a load balancer, scales them on demand, and replaces them safely on each deploy. ECS runs those tasks either on Fargate (serverless — AWS owns the host) or on EC2 instances you provide, and choosing between them is one of the decisions this lesson makes easy.

This is the exhaustive version. We will walk ECR in full (registry types, push/pull, scanning, lifecycle policies, tag immutability), then every building block of ECS — the cluster, then the task definition field by field, then the difference between a task and a service, then launch types (Fargate vs EC2) with a decision table, then deployment types (rolling, blue/green, external) and the deployment circuit breaker, then service auto scaling, ALB integration, and service discovery / Service Connect. We finish with the question every interviewer asks — ECS vs EKS — and a hands-on lab you can run on the Free Tier. By the end you can ship a production container service on AWS and answer the certification questions about it cold.

Learning objectives

By the end of this lesson you will be able to:

Create and manage an ECR repository — push and pull images, choose between mutable and immutable tags, enable scan-on-push (basic and enhanced), and write lifecycle policies to expire old images automatically.
Explain the ECS object model — cluster, task definition, container definition, task, and service — and how they relate.
Author a task definition with confidence: CPU/memory at task and container level, network mode (awsvpc vs bridge vs host vs none), the task role vs the execution role, volumes, secrets, logging, health checks, and the Fargate CPU/memory matrix.
Distinguish a task (one running copy) from a service (a controller that maintains a desired count) and configure desired count, minimum/maximum healthy percent, and the deployment circuit breaker.
Choose between the Fargate and EC2 launch types and justify the trade-off.
Pick the right deployment type — rolling update, blue/green (via CodeDeploy), or external — and explain how each shifts traffic and rolls back.
Wire a service to an Application Load Balancer and to service discovery / ECS Service Connect, and configure service auto scaling.
Articulate when to use ECS versus EKS.

Prerequisites & where this fits

You need an AWS account, the AWS CLI configured (aws configure), Docker installed locally to build and push an image, and a working grasp of IAM (ECS uses two distinct IAM roles) and VPC basics (the awsvpc network mode gives each task its own elastic network interface in your subnets, governed by a security group). Familiarity with the Application Load Balancer helps, since that is how most ECS services receive traffic. This is a Containers lesson in the AWS Zero-to-Hero course; it builds on the EC2, VPC, and ELB deep dives and is the foundation for the production companion, Production Amazon ECS on Fargate. After this, the course moves on to the Amazon CloudFront deep dive, which covers the CDN that often sits in front of an ECS-backed application.

Core concepts: the ECS object model

Before any settings, fix the mental model. ECS has a small, clean object hierarchy, and almost every confusion in interviews comes from blurring two of these terms. Learn them precisely.

Image — an immutable, layered package of your application and its dependencies, built from a Dockerfile and stored in a registry (ECR). Identified by a tag (myapp:1.4.2) or, immutably, by a digest (myapp@sha256:…).
Registry / repository — the registry (ECR) is the service that stores images; a repository is a named collection of related image versions within it (one repository per application image, typically).
Container definition — the spec for one container inside a task: its image, port mappings, environment variables, secrets, resource limits, log configuration, and health check. A task definition holds one or more of these.
Task definition — the blueprint (a versioned JSON document, organised into a family with numbered revisions) describing one or more containers that should run together as a unit, plus task-level settings (CPU/memory, network mode, the two IAM roles, volumes, launch-type compatibility). It is a template; it does not run anything by itself.
Task — a single running instantiation of a task definition: one or more containers running together on one host, scheduled and tracked by ECS. The unit of scheduling. When it stops, it is gone — ECS does not “restart” a task in place; it launches a fresh one.
Service — a controller that runs and maintains a specified number of tasks (the desired count) from a task definition, replaces unhealthy ones, registers them with a load balancer, integrates with auto scaling, and orchestrates deployments. A service is to a task what an Auto Scaling group is to an EC2 instance.
Cluster — a logical grouping (and capacity boundary) into which tasks and services are placed. With Fargate a cluster needs no servers at all; with the EC2 launch type the cluster is backed by container instances (EC2 hosts running the ECS agent).
Launch type / capacity provider — where tasks run: Fargate (serverless) or EC2 (your instances). Capacity providers add auto-managed capacity and Spot strategies on top.

The single most important distinction to internalise now is task vs service. A task is one running copy that, once it exits, stays exited. A service is the long-running supervisor that says “I want N healthy copies at all times” and makes that true — replacing failures, balancing across Availability Zones, and rolling out new versions. You run one-off jobs as standalone tasks (or via scheduled tasks); you run long-lived applications (web APIs, workers) as services.

Part 1 — Amazon ECR (the registry)

ECR registry types: private vs public

ECR stores your container images so ECS (and EKS, Lambda, or anything that speaks the Docker/OCI protocol) can pull them. There are two registry types.

Registry	What it is	Who can pull	Auth to pull	Typical use
Private registry	One per account per Region; holds private repositories	Only principals you grant via IAM / repository policy	Required (token via `aws ecr get-login-password`)	Your application images — the default and the common case
Public registry (ECR Public / Amazon ECR Public Gallery)	A globally reachable registry at `public.ecr.aws`	Anyone, anonymously (auth only needed to push or for higher pull rate limits)	Not required to pull	Distributing images to the world (base images, open-source tools)

For everything in this lesson we use the private registry — your application image is not something the public should pull. Each account gets one private registry per Region, addressed as <account-id>.dkr.ecr.<region>.amazonaws.com, containing as many repositories as you create.
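As a small illustration of that address format, the sketch below assembles a registry hostname and a full image URI in Python. The function names are hypothetical helpers, not part of any AWS SDK, and the sketch assumes a standard commercial Region (other partitions use a different domain suffix).

```python
def ecr_registry_host(account_id: str, region: str) -> str:
    """Return the private registry hostname for one account in one Region."""
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("AWS account IDs are 12 digits")
    return f"{account_id}.dkr.ecr.{region}.amazonaws.com"


def ecr_image_uri(account_id: str, region: str, repository: str, tag: str) -> str:
    """Return the full image URI used by docker push/pull and by ECS."""
    return f"{ecr_registry_host(account_id, region)}/{repository}:{tag}"


print(ecr_image_uri("111122223333", "ap-south-1", "team-a/checkout", "1.4.2"))
```

Because the registry is per account and per Region, the same repository name in two Regions yields two different URIs.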

Creating a repository: every setting

When you create a private repository (aws ecr create-repository, or console ECR → Create repository), these are the settings that matter.

Setting	What it does	Choices	Default	When to change	Gotcha
Repository name	The repo’s name (can include a namespace path, e.g. `team-a/checkout`)	Free text, lowercase	—	Use a clear `team/app` convention	Cannot be renamed after creation — you would create a new repo and re-push
Tag immutability	Whether a tag, once pushed, can be overwritten	`MUTABLE` or `IMMUTABLE`	`MUTABLE`	Set `IMMUTABLE` for production	With `MUTABLE`, re-pushing `latest` (or any tag) silently moves the tag to a new image — a supply-chain and rollback hazard
Scan on push	Run a vulnerability scan automatically when an image is pushed	On / Off (basic), or enhanced scanning at the registry level	Off (basic)	Turn on for any image you ship	Basic scan runs once on push; enhanced (via Amazon Inspector) continuously rescans as new CVEs are published
Encryption	Encrypt images at rest	`AES256` (Amazon S3-managed) or AWS KMS (AWS-managed or your CMK)	`AES256`	Use a CMK when you need key control/audit or cross-account key policies	Encryption type is fixed at creation; switching means a new repository

ECR is just a store; access is controlled by IAM identity policies (what a principal may do) plus an optional repository policy (a resource policy on the repo — the mechanism for cross-account pulls and for granting AWS services access). The execution role your ECS task uses must have ECR read permissions, which we cover below.
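To make the settings table concrete, here is a minimal sketch that builds a create-repository request as a Python dictionary. The key names follow the ECR CreateRepository API; the helper name repository_request and its production flag are illustrative assumptions.

```python
import json


def repository_request(name: str, production: bool = True) -> dict:
    """Build a create-repository request from the settings in the table."""
    if name != name.lower():
        raise ValueError("repository names must be lowercase")
    return {
        "repositoryName": name,
        # Immutable tags for production; mutable tags are the ECR default.
        "imageTagMutability": "IMMUTABLE" if production else "MUTABLE",
        # Basic scan-on-push; enhanced scanning is configured per registry.
        "imageScanningConfiguration": {"scanOnPush": True},
        # Fixed at creation time, so choose deliberately.
        "encryptionConfiguration": {"encryptionType": "AES256"},
    }


print(json.dumps(repository_request("team-a/checkout"), indent=2))
```

One way to use the output is to save it to a file and pass it to aws ecr create-repository --cli-input-json file://repo.json.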

Pushing and pulling images

ECR speaks the standard Docker registry protocol, so the workflow is docker login → docker push/docker pull, with the login token obtained from the API. The canonical push flow:

ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
REGION=ap-south-1
REPO=demo-web

# 1. Authenticate Docker to your private registry (token valid 12 hours)
aws ecr get-login-password --region "$REGION" \
  | docker login --username AWS --password-stdin \
    "$ACCOUNT.dkr.ecr.$REGION.amazonaws.com"

# 2. Build and tag with the full ECR URI
docker build -t "$ACCOUNT.dkr.ecr.$REGION.amazonaws.com/$REPO:1.0.0" .

# 3. Push
docker push "$ACCOUNT.dkr.ecr.$REGION.amazonaws.com/$REPO:1.0.0"

Key facts: the login token from get-login-password is valid for 12 hours; the registry endpoint is per-account-per-Region; and you should tag images with a meaningful, unique version (a Git SHA or semantic version) rather than relying on latest. To pull (from ECS, CI, or a laptop) you authenticate the same way and docker pull the URI — but ECS does the pull for you using its execution role, so you rarely pull by hand in production.

Referencing images immutably. A tag is a movable label; a digest (@sha256:…) is the content hash and never changes. For reproducible, tamper-evident deploys, reference images by digest in your task definition (or enable tag immutability so a tag behaves like a digest).
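The tag-versus-digest distinction can be checked mechanically. The sketch below is a simplified parser (it assumes sha256 digests and does not handle a reference that carries both a tag and a digest); the function names are hypothetical.

```python
import re

DIGEST_RE = re.compile(r"^sha256:[0-9a-f]{64}$")


def split_image_reference(ref: str) -> tuple[str, str, str]:
    """Split an image reference into (repository, kind, value).

    kind is "digest" for repo@sha256:..., otherwise "tag".
    """
    if "@" in ref:
        repo, digest = ref.split("@", 1)
        if not DIGEST_RE.match(digest):
            raise ValueError(f"not a sha256 digest: {digest}")
        return repo, "digest", digest
    # Only a colon in the last path segment separates a tag, so a registry
    # port such as localhost:5000/app is not mistaken for one.
    last_segment = ref.rsplit("/", 1)[-1]
    if ":" in last_segment:
        repo, tag = ref.rsplit(":", 1)
        return repo, "tag", tag
    return ref, "tag", "latest"  # Docker's default when no tag is given


def is_pinned(ref: str) -> bool:
    """True only when the reference is a content digest."""
    return split_image_reference(ref)[1] == "digest"
```

A deploy pipeline could refuse any task definition whose image fails is_pinned, unless the repository is known to use immutable tags.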

Image scanning: basic vs enhanced

ECR can scan images for known operating-system and language-package vulnerabilities (CVEs).

Scan type	Engine	When it runs	Coverage	Cost
Basic scanning	ECR’s built-in scanner (CVE feeds)	On push (if enabled) or on demand	OS packages	Free
Enhanced scanning	Amazon Inspector	On push and continuously as new CVEs appear	OS and programming-language packages (e.g. npm, pip, Maven)	Charged per image/scan via Inspector

Basic scanning is a sensible free baseline; enhanced scanning is what you want for production because it keeps re-evaluating images already in the registry as the threat landscape changes — an image that was clean last month may carry a critical CVE today. Findings are surfaced in the console, via the API, and (for enhanced) in Amazon Inspector and EventBridge, so you can alert or block on severity.
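Blocking on severity can be as simple as counting findings at or above a threshold. The sketch below assumes a dictionary shaped like the findingSeverityCounts map returned by aws ecr describe-image-scan-findings; the function name and the default threshold are illustrative choices, and severities outside the listed order (such as UNDEFINED) are ignored here.

```python
SEVERITY_ORDER = ["INFORMATIONAL", "LOW", "MEDIUM", "HIGH", "CRITICAL"]


def blocking_findings(severity_counts: dict, threshold: str = "HIGH") -> int:
    """Count findings whose severity is at or above the threshold."""
    floor = SEVERITY_ORDER.index(threshold)
    return sum(
        count
        for severity, count in severity_counts.items()
        if severity in SEVERITY_ORDER and SEVERITY_ORDER.index(severity) >= floor
    )


counts = {"CRITICAL": 1, "HIGH": 2, "LOW": 7}
if blocking_findings(counts) > 0:
    print("deploy blocked: fix HIGH/CRITICAL findings first")
```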

Lifecycle policies: automated cleanup

Without housekeeping, repositories accumulate hundreds of old images and quietly run up storage cost. A lifecycle policy is a set of rules that expire images automatically based on age or count.

A policy is JSON with prioritised rules; each rule selects images by tag status (tagged with given prefixes, untagged, or any) and a count- or age-based condition. Example — keep the 10 newest prod-tagged images and delete untagged images older than 14 days:

{
  "rules": [
    {
      "rulePriority": 1,
      "description": "Keep last 10 prod images",
      "selection": {
        "tagStatus": "tagged",
        "tagPrefixList": ["prod"],
        "countType": "imageCountMoreThan",
        "countNumber": 10
      },
      "action": { "type": "expire" }
    },
    {
      "rulePriority": 2,
      "description": "Expire untagged after 14 days",
      "selection": {
        "tagStatus": "untagged",
        "countType": "sinceImagePushed",
        "countUnit": "days",
        "countNumber": 14
      },
      "action": { "type": "expire" }
    }
  ]
}

Rules evaluate in priority order (lower number first), and expire is currently the only action. Untagged images pile up every time you overwrite a mutable tag, so a “delete untagged after N days” rule is almost always worth having. Note expiry is permanent — there is no recycle bin — so scope your selections carefully and never let a rule match an image a running service still references.
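Because expiry is permanent, it can help to dry-run a policy against an image list first. The following is a simplified model of rule evaluation, not ECR's actual implementation: it assumes ages are given in days, treats an image matched by a higher-priority rule as unavailable to later rules, and uses a hypothetical image record shape (digest, tags, pushed_at). ECR also offers a real lifecycle policy preview, which remains the authoritative check.

```python
from datetime import datetime, timedelta


def images_to_expire(images, rules, now):
    """Return the digests a simplified lifecycle evaluation would expire.

    images: dicts with "digest", "tags" (list of str), "pushed_at" (datetime).
    rules:  the "rules" list from a lifecycle policy document.
    """
    expired, claimed = [], set()
    for rule in sorted(rules, key=lambda r: r["rulePriority"]):
        sel = rule["selection"]
        matched = []
        for img in images:
            if img["digest"] in claimed:
                continue  # already matched by a higher-priority rule
            if sel["tagStatus"] == "untagged":
                ok = not img["tags"]
            elif sel["tagStatus"] == "tagged":
                prefixes = sel.get("tagPrefixList", [])
                ok = bool(img["tags"]) and all(
                    any(t.startswith(p) for t in img["tags"]) for p in prefixes
                )
            else:  # "any"
                ok = True
            if ok:
                matched.append(img)
        claimed.update(img["digest"] for img in matched)
        if sel["countType"] == "imageCountMoreThan":
            # Keep the newest countNumber images, expire the rest.
            newest_first = sorted(matched, key=lambda i: i["pushed_at"], reverse=True)
            doomed = newest_first[sel["countNumber"]:]
        else:  # "sinceImagePushed"; this sketch assumes countUnit is "days"
            cutoff = now - timedelta(days=sel["countNumber"])
            doomed = [i for i in matched if i["pushed_at"] < cutoff]
        expired.extend(i["digest"] for i in doomed)
    return expired
```

Comparing the returned digests with the images your running services reference is a cheap guard against expiring something still in use.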

Part 2 — Amazon ECS (the orchestrator)

The cluster

A cluster is a logical grouping of your tasks and services and the capacity boundary they share. What “capacity” means depends on the launch model:

With Fargate, the cluster needs no servers — you simply place Fargate tasks/services into it and AWS provisions the compute invisibly. A Fargate-only cluster costs nothing until tasks run.
With the EC2 launch type, the cluster is backed by container instances — EC2 hosts running the ECS container agent that register themselves into the cluster and report available CPU/memory/ports. You manage that fleet (typically via an Auto Scaling group).
Capacity providers sit on top and let ECS manage capacity: the FARGATE and FARGATE_SPOT providers for serverless (mixing on-demand and Spot by a strategy you define), or an Auto Scaling group capacity provider for EC2 with managed scaling (ECS scales the ASG to fit pending tasks) and managed termination protection.

A cluster also carries cluster-wide settings such as Container Insights (enhanced CloudWatch metrics/logs) and a default capacity-provider strategy. You can run many services and hundreds of tasks in one cluster; clusters are free — you pay only for the compute (Fargate vCPU/GB-seconds, or the EC2 instances).
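As a sketch of how those cluster-wide settings fit together, the snippet below builds a create-cluster request with both Fargate capacity providers, a default strategy, and Container Insights enabled. The key names follow the ECS CreateCluster API; the helper name, the base of 1, and the weights are illustrative assumptions rather than recommendations.

```python
import json


def cluster_request(name: str, spot_weight: int = 3) -> dict:
    """Build a Fargate-only cluster request with a default strategy."""
    return {
        "clusterName": name,
        "capacityProviders": ["FARGATE", "FARGATE_SPOT"],
        "defaultCapacityProviderStrategy": [
            # base: tasks always placed on this provider before weights apply.
            {"capacityProvider": "FARGATE", "base": 1, "weight": 1},
            # weight: relative share of the remaining tasks.
            {"capacityProvider": "FARGATE_SPOT", "weight": spot_weight},
        ],
        "settings": [{"name": "containerInsights", "value": "enabled"}],
    }


print(json.dumps(cluster_request("demo-cluster"), indent=2))
```

The resulting JSON can be passed to aws ecs create-cluster --cli-input-json file://cluster.json.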

The task definition: every field

The task definition is the heart of ECS — the JSON blueprint ECS uses to launch tasks. It is organised as a family (a name) with auto-incrementing revisions (my-app:7); registering a change creates a new revision, and you deploy by pointing a service at it. The fields below are grouped as you encounter them.

Task-level settings

Field	What it is	Choices / range	Notes & gotchas
family	The task definition name; revisions increment under it	Free text	You deploy a family; ECS tracks `family:revision`
requiresCompatibilities	Which launch types this definition supports	`FARGATE`, `EC2`, `EXTERNAL`	Determines which fields are valid (Fargate forbids some, e.g. `host` network mode)
networkMode	How task networking works	`awsvpc`, `bridge`, `host`, `none`	Fargate requires `awsvpc`; see the network-mode section below
cpu / memory (task level)	The CPU/memory envelope for the whole task	See Fargate matrix below	On Fargate these are required and must be a valid pair; on EC2 they are optional caps
taskRoleArn	The task role — IAM role your application code assumes to call AWS APIs	An IAM role ARN	This is what your container uses to reach S3, DynamoDB, etc. — least-privilege here
executionRoleArn	The execution role — IAM role the ECS agent uses to pull the image and write logs	An IAM role ARN	Needs ECR pull + CloudWatch Logs + (if used) Secrets Manager/SSM read
runtimePlatform	OS and CPU architecture	`LINUX`/`WINDOWS`; `X86_64`/`ARM64`	Use `ARM64` (Graviton) on Fargate for ~20% lower cost when your image supports it
volumes	Task-level volume definitions containers can mount	bind mounts, Docker volumes, EFS, FSx (EC2), Fargate ephemeral	The storage layer; see volumes below
placementConstraints	Rules restricting where (EC2) tasks land	e.g. `memberOf` an attribute expression	EC2 launch type only; ignored on Fargate
ephemeralStorage	Size of Fargate scratch storage	20–200 GiB (Fargate)	Default 20 GiB free; raise for large temp data
pidMode / ipcMode	Share PID/IPC namespaces across containers	`task`/`host`/`none`	Advanced; `host` not allowed on Fargate
tags / proxyConfiguration	Resource tags and App Mesh proxy wiring	—	`proxyConfiguration` is used by App Mesh; Service Connect manages its own proxy

The Fargate CPU/memory matrix

Fargate does not accept arbitrary CPU/memory — only specific combinations, and the valid memory range is constrained by the CPU value. Memorise the shape (the exact upper bounds have grown over time; these are the widely supported tiers):

`cpu` (vCPU)	Valid `memory` range
256 (.25 vCPU)	512, 1024, 2048 MiB
512 (.5 vCPU)	1024–4096 MiB (1 GiB steps)
1024 (1 vCPU)	2048–8192 MiB (1 GiB steps)
2048 (2 vCPU)	4096–16384 MiB (1 GiB steps)
4096 (4 vCPU)	8192–30720 MiB (1 GiB steps)
8192 (8 vCPU)	16384–61440 MiB (4 GiB steps)
16384 (16 vCPU)	32768–122880 MiB (8 GiB steps)

The whole task shares this budget. If you run a sidecar (a log router or proxy), it draws from the same pool — size the task for the sum, then optionally cap each container with container-level cpu/memory.
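The matrix is easy to encode, which lets you catch an invalid pair before registration fails. This sketch transcribes the tiers from the table above; the function names are hypothetical, and the upper bounds should be re-checked against current AWS documentation since they have changed over time.

```python
def fargate_memory_options(cpu: int) -> list[int]:
    """Return the valid task memory values (MiB) for a Fargate cpu value."""
    gib = 1024
    table = {
        256: [512, 1024, 2048],
        512: list(range(1 * gib, 4 * gib + 1, gib)),
        1024: list(range(2 * gib, 8 * gib + 1, gib)),
        2048: list(range(4 * gib, 16 * gib + 1, gib)),
        4096: list(range(8 * gib, 30 * gib + 1, gib)),
        8192: list(range(16 * gib, 60 * gib + 1, 4 * gib)),
        16384: list(range(32 * gib, 120 * gib + 1, 8 * gib)),
    }
    if cpu not in table:
        raise ValueError(f"unsupported Fargate cpu value: {cpu}")
    return table[cpu]


def is_valid_fargate_size(cpu: int, memory: int) -> bool:
    """True when (cpu units, memory MiB) is a pair from the table."""
    return memory in fargate_memory_options(cpu)


print(is_valid_fargate_size(256, 512), is_valid_fargate_size(256, 4096))
```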

Container-level settings (the container definition)

Each entry in containerDefinitions configures one container. The important fields:

Field	What it is	Notes & gotchas
name	Container name (unique within the task)	Used by `dependsOn`, `links`, and load-balancer target wiring
image	The image URI to run	Use the full ECR URI; reference by digest for immutability
cpu (container)	Soft/relative CPU share for this container	Optional sub-allocation of the task `cpu`
memory (hard limit)	Container is killed if it exceeds this	Set at least one of `memory`/`memoryReservation`; OOM-kill is a common silent failure
memoryReservation (soft limit)	Reserved amount; container can burst above it if the host has room (EC2)	Lets you pack more on EC2 hosts
essential	If `true`, the whole task stops when this container exits	Mark your main app essential; sidecars usually `false`
portMappings	Container ports to expose (and host ports / names)	With `awsvpc`, `hostPort` = `containerPort`; name them for Service Connect
environment	Plain-text env vars	Never put secrets here — visible in the definition
secrets	Inject values from Secrets Manager / SSM Parameter Store as env vars	The secure way to pass credentials; needs execution-role read access
environmentFiles	Bulk env vars from a file in S3	Handy for many variables
logConfiguration	Where stdout/stderr go	`awslogs` (CloudWatch), `awsfirelens` (FireLens → anywhere), `splunk`, etc.
healthCheck	A command run inside the container to report health	Distinct from the ALB health check; controls container/task health
dependsOn	Ordering: start/stop this container relative to others by condition	`START`, `COMPLETE`, `SUCCESS`, `HEALTHY` — e.g. wait for a migration container
command / entryPoint / workingDirectory	Override the image’s CMD/ENTRYPOINT/workdir	—
ulimits / linuxParameters	nofile limits, capabilities, `initProcessEnabled`, shared memory	`initProcessEnabled: true` reaps zombie processes — useful for many apps
mountPoints / volumesFrom	Mount task volumes into this container	Pairs with task-level `volumes`
readonlyRootFilesystem	Make the container’s root FS read-only	A strong security default; write only to mounted volumes
user	Run as a non-root UID/GID	Avoid running as root
stopTimeout	Grace period after SIGTERM before SIGKILL	Give your app time to drain

A minimal Fargate task definition for a web container, registered with aws ecs register-task-definition --cli-input-json file://taskdef.json:

{
  "family": "demo-web",
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc",
  "cpu": "256",
  "memory": "512",
  "runtimePlatform": { "cpuArchitecture": "ARM64", "operatingSystemFamily": "LINUX" },
  "executionRoleArn": "arn:aws:iam::111122223333:role/ecsTaskExecutionRole",
  "taskRoleArn": "arn:aws:iam::111122223333:role/demo-web-task-role",
  "containerDefinitions": [
    {
      "name": "web",
      "image": "111122223333.dkr.ecr.ap-south-1.amazonaws.com/demo-web:1.0.0",
      "essential": true,
      "portMappings": [{ "name": "http", "containerPort": 80, "protocol": "tcp" }],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/demo-web",
          "awslogs-region": "ap-south-1",
          "awslogs-stream-prefix": "web",
          "awslogs-create-group": "true"
        }
      },
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost/ || exit 1"],
        "interval": 30, "timeout": 5, "retries": 3, "startPeriod": 10
      }
    }
  ]
}
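Several of the rules above can be checked before you register a definition. The sketch below is a deliberately small linter for Fargate task definitions loaded as a Python dictionary (for example with json.load); the function name and the set of checks are illustrative, and ECS itself performs far more validation.

```python
def lint_fargate_task_definition(taskdef: dict) -> list[str]:
    """Return a list of problems found in a Fargate task definition."""
    problems = []
    if "FARGATE" not in taskdef.get("requiresCompatibilities", []):
        return problems  # this sketch only checks Fargate definitions
    if taskdef.get("networkMode") != "awsvpc":
        problems.append("Fargate requires networkMode awsvpc")
    for key in ("cpu", "memory"):
        if key not in taskdef:
            problems.append(f"Fargate requires task-level {key}")
    containers = taskdef.get("containerDefinitions", [])
    # "essential" defaults to true when it is omitted.
    if not any(c.get("essential", True) for c in containers):
        problems.append("at least one container must be essential")
    uses_awslogs = any(
        c.get("logConfiguration", {}).get("logDriver") == "awslogs"
        for c in containers
    )
    if uses_awslogs and "executionRoleArn" not in taskdef:
        problems.append("awslogs needs an executionRoleArn to write logs")
    return problems
```

Running this in CI against taskdef.json gives faster feedback than waiting for a failed registration or a task that never starts.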

Network modes explained

The networkMode decides how a task’s containers get networking — and it is a favourite interview topic.

Mode	How it works	Who can use it	When to pick it	Trade-off / gotcha
`awsvpc`	Each task gets its own elastic network interface (ENI) with a private IP in your subnet and its own security group	Fargate (required) and EC2	Almost always — first-class VPC networking, per-task security groups, works with ALB IP targets	On EC2, each task consumes an ENI; instance types cap ENIs per host (mitigated by ENI trunking)
`bridge`	Docker’s default virtual bridge on the host; containers share the host’s network via NAT and port mappings	EC2 only	Legacy / dense packing where per-task ENIs are not needed	Needs dynamic host ports + ALB to avoid port clashes; no per-task security group
`host`	Containers bind directly to the host’s network interface and ports	EC2 only	Maximum network performance; ports must be unique per host	No port remapping — only one task per host can use a given port; no per-task SG
`none`	No external networking	EC2 only	Batch/compute that needs no network	Container cannot reach the network

For essentially all new work, use awsvpc: it gives every task a real VPC IP and its own security group, integrates cleanly with the ALB (IP target type), and is the only mode Fargate supports. Read the production companion lesson for ENI/IP-address planning under awsvpc, which is where large fleets actually hit limits: Production Amazon ECS on Fargate: task networking, auto scaling, and safe rolling deployments.

Task role vs execution role — the classic confusion

ECS uses two IAM roles and they are constantly mixed up. The distinction is who uses the role and for what.

Role	Used by	Used for	Typical permissions
Execution role (`executionRoleArn`)	The ECS agent / Fargate infrastructure (before/around your container)	Pulling the image from ECR, writing logs to CloudWatch, and fetching secrets referenced in the task definition	`AmazonECSTaskExecutionRolePolicy` (ECR read + Logs) plus `secretsmanager:GetSecretValue` / `ssm:GetParameters` if you inject secrets
Task role (`taskRoleArn`)	Your application code inside the container	Calling AWS APIs your app needs at runtime (read S3, write DynamoDB, publish to SNS…)	Exactly the least-privilege set your app requires — nothing more

The mnemonic: the execution role gets the task running (pull, log, secrets); the task role is what the running app can do. A task that fails to start with a “CannotPullContainerError” or “unable to retrieve secret” usually has an execution-role problem (for pull errors, also check that the task has a network route to ECR); an AccessDenied from inside your code at runtime is a task-role problem.
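Both roles are assumed by the same service principal, ecs-tasks.amazonaws.com; what differs is the permissions attached. The sketch below builds that shared trust policy and an extra execution-role statement for secret injection. The helper names are hypothetical, and the statement assumes no customer-managed KMS key is involved (which would also need kms:Decrypt).

```python
import json


def ecs_tasks_trust_policy() -> dict:
    """Trust policy used by both the execution role and the task role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "ecs-tasks.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def secrets_read_statement(secret_arns: list[str]) -> dict:
    """Extra execution-role statement, scoped to the listed ARNs only."""
    return {
        "Effect": "Allow",
        "Action": ["secretsmanager:GetSecretValue", "ssm:GetParameters"],
        "Resource": sorted(secret_arns),
    }


print(json.dumps(ecs_tasks_trust_policy(), indent=2))
```

Scoping Resource to specific ARNs rather than "*" keeps the execution role from reading secrets that belong to other services.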

Volumes and storage

Containers are ephemeral; for files that must persist or be shared you attach a volume defined at the task level and mounted into containers via mountPoints.

Volume type	What it is	Lifetime	Use it for
Bind mount	A path on the host (EC2) or task-scoped scratch (Fargate)	Task lifetime	Sharing files between containers in the same task (e.g. a sidecar reading the app’s logs)
Docker volume	A Docker-managed volume (EC2 only)	Task or instance	Local persistence on a container instance
Amazon EFS	A shared, elastic NFS file system mounted into the task	Independent of the task — durable & shared	Shared state across tasks/AZs; persistent data on Fargate
Amazon FSx for Windows	Windows shared file storage	Independent	Windows workloads (EC2)
Fargate ephemeral storage	The task’s scratch disk (20–200 GiB)	Task lifetime	Temp files, caches — not durable

For anything that must survive a task replacement or be shared between tasks, use EFS — it is the durable, multi-AZ option and works on Fargate. Fargate’s own disk is wiped when the task stops.

Secrets, logging, and health checks

Secrets — reference Secrets Manager or SSM Parameter Store entries in the secrets block; ECS fetches them at task start using the execution role and injects them as environment variables, so the secret values never appear in the task definition (only the references to them do). This is the correct way to pass database passwords and API keys.
Logging — the awslogs driver streams stdout/stderr to CloudWatch Logs (set the group, Region, and stream prefix; awslogs-create-group: true auto-creates the group). For richer routing (filtering, multiple destinations, third-party SIEMs) use the awsfirelens driver, which runs a Fluent Bit/Fluentd sidecar.
Container health check — a command ECS runs inside the container to decide if it is healthy; this is separate from the load balancer’s health check. The ALB decides whether to send traffic; the container health check influences whether ECS considers the task healthy and may replace it. In a service behind an ALB, the ALB/target-group health check is usually the source of truth for traffic.
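The secrets block is a list of name/valueFrom pairs in the container definition. As a rough sketch, the helper below builds that list and rejects literal values; the function name is hypothetical, and insisting on full ARNs is a choice made for this example.

```python
def container_secrets(mapping: dict) -> list[dict]:
    """Turn {ENV_NAME: secret or parameter ARN} into a "secrets" list."""
    entries = []
    for name, arn in sorted(mapping.items()):
        # Guard against a literal secret value being pasted in by mistake.
        if not arn.startswith("arn:"):
            raise ValueError(f"{name}: expected an ARN reference, not a literal value")
        entries.append({"name": name, "valueFrom": arn})
    return entries


secrets = container_secrets(
    {"DB_PASSWORD": "arn:aws:secretsmanager:ap-south-1:111122223333:secret:demo/db"}
)
print(secrets)
```

The result goes under the container definition's secrets key, and every referenced ARN must be readable by the execution role.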

Task vs service (revisited, with settings)

You can run a task definition two ways:

Run a standalone task (aws ecs run-task) — launches one (or --count N) task that runs to completion or until stopped, with no supervision. Perfect for batch jobs, migrations, and ad-hoc work. Scheduled tasks (via EventBridge Scheduler) run a task on a cron/rate schedule — a serverless cron-for-containers.
Create a service (aws ecs create-service) — launches and maintains a desired count of tasks, replaces failures, spreads tasks across AZs, registers them with a load balancer, drives auto scaling, and orchestrates deployments. This is how you run anything long-lived.

Key service settings:

Setting	What it does	Notes
desiredCount	How many tasks the service keeps running	The number auto scaling adjusts
launchType / capacityProviderStrategy	Where tasks run	`FARGATE`, `EC2`, or a capacity-provider mix (e.g. Fargate Spot weighting)
deploymentConfiguration	Rolling-deploy bounds + circuit breaker	`minimumHealthyPercent`, `maximumPercent`, `deploymentCircuitBreaker` (see below)
deploymentController	Which deployment engine	`ECS` (rolling), `CODE_DEPLOY` (blue/green), `EXTERNAL`
loadBalancers	Target group(s) + container/port to register	Wires the service to an ALB/NLB
healthCheckGracePeriodSeconds	Ignore LB health checks for new tasks for N seconds after start	Stops slow-booting apps being killed before they are ready
placementStrategy / placementConstraints	How (EC2) tasks spread/bin-pack	`spread` across AZ, `binpack`, `random`
serviceConnectConfiguration / serviceRegistries	Service Connect or Cloud Map discovery	See discovery section
enableExecuteCommand	Allow `aws ecs execute-command` (ECS Exec) into a running task	The container-shell debugging path; needs SSM + task-role perms
propagateTags / enableECSManagedTags	Tag propagation from definition/service to tasks	Cost allocation

Service scheduler strategies: REPLICA (the default — maintain N copies, spread across AZs) or DAEMON (run exactly one task per active container instance — EC2 only — for node agents like log shippers or monitoring).
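Putting the key service settings together, the sketch below builds a create-service request for a Fargate service behind a load balancer. The key names follow the ECS CreateService API; the helper name, the container name web, port 80, and the 60-second grace period are assumptions carried over from the demo task definition.

```python
import json


def service_request(cluster, name, task_definition, subnets,
                    security_groups, target_group_arn, desired_count=2):
    """Build a create-service request for a Fargate REPLICA service."""
    return {
        "cluster": cluster,
        "serviceName": name,
        "taskDefinition": task_definition,  # family or family:revision
        "desiredCount": desired_count,
        "launchType": "FARGATE",
        "schedulingStrategy": "REPLICA",
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": subnets,
                "securityGroups": security_groups,
                # Private subnets then need NAT or VPC endpoints to reach ECR.
                "assignPublicIp": "DISABLED",
            }
        },
        "loadBalancers": [
            {"targetGroupArn": target_group_arn,
             "containerName": "web", "containerPort": 80}
        ],
        "healthCheckGracePeriodSeconds": 60,
        "deploymentConfiguration": {
            "minimumHealthyPercent": 100,
            "maximumPercent": 200,
            "deploymentCircuitBreaker": {"enable": True, "rollback": True},
        },
    }
```

Serialised with json.dumps, the request can be passed to aws ecs create-service --cli-input-json file://service.json.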

Launch types: Fargate vs EC2

The single biggest architectural choice for an ECS workload is where the tasks run.

Dimension	Fargate (serverless)	EC2 (self-managed hosts)
Who owns the host	AWS — no instances to see, patch, or scale	You — an Auto Scaling group of container instances you patch and right-size
Pricing model	Per-task vCPU + memory per second (1-minute minimum)	Per EC2 instance running, regardless of how full it is
Operational overhead	Minimal — pick CPU/memory and go	You manage AMIs, the ECS agent, scaling, bin-packing
Scaling speed	Fast; no host warm-up	Must also scale the instance fleet (mitigate with capacity-provider managed scaling / warm pools)
Density / cost at scale	Can be pricier for steady, packable, high-utilisation fleets	Cheaper when you keep instances well utilised (good bin-packing)
GPU / special hardware	No GPU support; only the fixed CPU/memory sizes	Full access: GPU instances, specific families, larger sizes
Spot	`FARGATE_SPOT`	EC2 Spot in the ASG / capacity provider
Network mode	`awsvpc` only	`awsvpc`, `bridge`, `host`, `none`
Daemon tasks	Not applicable	`DAEMON` scheduling supported

Rule of thumb: start on Fargate. It removes an entire layer of undifferentiated operational work (patching, scaling, bin-packing) and is the right default for most services. Move workloads to EC2 when you have a clear reason: steady, high-utilisation fleets where careful bin-packing beats per-task pricing; GPU or specialised instance needs; daemon workloads; or per-host customisation Fargate does not allow. Many teams run both in one cluster, and a capacity-provider strategy can further split a Fargate service across providers (e.g. baseline on FARGATE, burst on FARGATE_SPOT).

Deployment types

When you deploy a new task-definition revision to a service, the deployment controller decides how traffic shifts from old tasks to new.

Type	Controller	How it works	Rollback	When to use
Rolling update	`ECS` (default)	ECS gradually replaces old tasks with new ones in place, bounded by `minimumHealthyPercent` / `maximumPercent`	Automatic via the circuit breaker (or manual redeploy)	The default; simplest; in-place, no extra infrastructure
Blue/green	`CODE_DEPLOY`	CodeDeploy stands up a parallel (“green”) task set, shifts ALB traffic to it (all-at-once, canary, or linear), then tears down “blue”	Instant — shift traffic back to blue	Zero-downtime releases with pre-traffic validation and instant rollback
External	`EXTERNAL`	You manage task sets and traffic shifting yourself via the API (or a third-party tool)	You implement it	Custom deployment tooling / advanced control

Rolling updates and the deployment circuit breaker

A rolling deployment is governed by two percentages of the desired count:

minimumHealthyPercent — the floor of healthy tasks ECS must keep running during the deploy (e.g. 100 means never drop below the desired count — ECS adds new tasks before removing old ones).
maximumPercent — the ceiling of total tasks (old + new) during the deploy (e.g. 200 means ECS may temporarily run double the desired count).

With min=100, max=200 ECS does a classic surge: spin up replacements, wait for them to pass health checks, then drain and stop the old ones, with no capacity dip. Tighter bounds (e.g. min=50, max=100) avoid running any extra tasks but let capacity dip: ECS must stop up to half the old tasks before it can start their replacements.
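The two percentages translate into concrete task counts. This sketch computes them with integer arithmetic, rounding the healthy floor up and the total ceiling down as the ECS documentation describes; the function name is a hypothetical helper.

```python
def rolling_bounds(desired_count: int, minimum_healthy_percent: int,
                   maximum_percent: int) -> tuple[int, int]:
    """Return (min healthy tasks, max total tasks) during a rolling deploy."""
    # Ceiling division: the healthy floor is rounded up.
    min_healthy = -(-desired_count * minimum_healthy_percent // 100)
    # Floor division: the total ceiling is rounded down.
    max_total = desired_count * maximum_percent // 100
    return min_healthy, max_total


for bounds in [(4, 100, 200), (4, 50, 100), (3, 50, 150)]:
    print(bounds, "->", rolling_bounds(*bounds))
```

The gap between the two numbers is the room ECS has to work with: when they are equal, a deployment cannot make progress.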

The deployment circuit breaker (deploymentCircuitBreaker: { enable: true, rollback: true }) is the safety net: if too many new tasks fail to start or stay healthy, ECS marks the deployment failed and (with rollback: true) automatically rolls back to the last known-good revision — instead of retrying a broken image forever. Always enable it for production rolling deployments. For richer release strategies (canary/linear traffic shifting with validation hooks), use blue/green via CodeDeploy.

ALB integration

Most ECS web services sit behind an Application Load Balancer. You attach the service to a target group, name the container and port to register, and ECS keeps the target group’s membership in sync as tasks come and go.

Target type ip — with awsvpc, ECS registers each task’s ENI IP directly in the target group. This is the modern, recommended pattern (and the only option on Fargate).
Target type instance — for EC2 with bridge/host mode, the ALB targets the instance + dynamic host port. ECS uses dynamic port mapping so many tasks of the same image can share a host without port clashes.
Health checks — the target group health check (path, codes, thresholds) decides whether the ALB sends traffic to a task; pair it with healthCheckGracePeriodSeconds so slow-booting tasks are not killed before they are ready.
Connection draining / deregistration delay — when a task is stopping, the ALB stops sending new connections and lets in-flight ones finish (default 300s). Set your container stopTimeout to cover graceful shutdown.

ECS also supports the Network Load Balancer (TCP/UDP, ultra-low latency, static IP) for non-HTTP workloads. Path/host-based routing, TLS termination, and listener rules all live on the ALB exactly as they do for EC2 targets.
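As a rough sketch of the wiring described above, the two functions below create an `ip` target group and attach a new service to it. Everything named here is an assumption for illustration: the function names, the `demo-web-tg` target group, the `demo-web-alb` service, the `web` container on port 80 and the 60-second grace period. The sketch also assumes an ALB already exists, because ECS rejects a target group that is not associated with a load balancer.

```shell
# Sketch only: both functions are hypothetical wrappers; nothing runs until you call them.

# Create a target group whose targets are task IPs (required for awsvpc/Fargate).
# Usage: tg_arn=$(create_ip_target_group <vpc-id>)
create_ip_target_group() {
  aws elbv2 create-target-group \
    --name demo-web-tg --protocol HTTP --port 80 \
    --vpc-id "$1" --target-type ip --health-check-path / \
    --region "$AWS_REGION" \
    --query 'TargetGroups[0].TargetGroupArn' --output text
}

# Attach the service to that target group. Assumes a listener on your ALB
# already forwards to the target group.
# Usage: create_service_behind_alb <cluster> <tg-arn> <subnet-a,subnet-b> <task-sg>
create_service_behind_alb() {
  aws ecs create-service --cluster "$1" --service-name demo-web-alb \
    --task-definition demo-web --desired-count 2 --launch-type FARGATE \
    --load-balancers "targetGroupArn=$2,containerName=web,containerPort=80" \
    --health-check-grace-period-seconds 60 \
    --network-configuration "awsvpcConfiguration={subnets=[$3],securityGroups=[$4],assignPublicIp=DISABLED}" \
    --region "$AWS_REGION"
}
```

Between the two calls, point an ALB listener at the target group (for example with `aws elbv2 create-listener`) and allow the ALB's security group to reach the task security group on port 80. Tasks launched with `assignPublicIp=DISABLED` need a NAT gateway or VPC endpoints to pull from ECR.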

Service auto scaling

A service scales its desired count automatically via Application Auto Scaling (the same engine behind DynamoDB and Aurora autoscaling), using three policy types:

Policy	How it works	Best for
Target tracking	Keep a metric at a target (e.g. average CPU at 60%, or `ALBRequestCountPerTarget` at 1000)	The default — set a goal, AWS adds/removes tasks to hold it
Step scaling	Add/remove a step of tasks when an alarm breaches by a range	Fine-grained control over scale increments
Scheduled scaling	Change min/max/desired on a schedule (cron)	Predictable daily/weekly patterns (scale up before business hours)

Target tracking on CPU utilisation or ALB request count per target covers most web services. You set a minimum and maximum capacity (the bounds auto scaling stays within) and the scaling policy adjusts desired count between them. The companion production lesson covers tuning cooldowns, combining policies, and scaling the EC2 capacity layer underneath.
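A minimal sketch of target tracking on CPU, assuming illustrative values: a 60% target, bounds of 1 to 4 tasks, and 60s/120s cooldowns. The policy name `cpu-60`, the file name and the function name are hypothetical.

```shell
# Target-tracking configuration: hold average service CPU near 60%.
cat > cpu-target-tracking.json <<'EOF'
{
  "TargetValue": 60.0,
  "PredefinedMetricSpecification": {
    "PredefinedMetricType": "ECSServiceAverageCPUUtilization"
  },
  "ScaleOutCooldown": 60,
  "ScaleInCooldown": 120
}
EOF

# Sketch only: defines a function; nothing is changed until you call it.
# Usage: enable_cpu_scaling <cluster> <service>
enable_cpu_scaling() {
  resource_id="service/$1/$2"

  # 1. Register the service's desired count as a scalable target with min/max bounds.
  aws application-autoscaling register-scalable-target \
    --service-namespace ecs --scalable-dimension ecs:service:DesiredCount \
    --resource-id "$resource_id" --min-capacity 1 --max-capacity 4 \
    --region "$AWS_REGION" || return 1

  # 2. Attach the target-tracking policy defined in the JSON file above.
  aws application-autoscaling put-scaling-policy \
    --service-namespace ecs --scalable-dimension ecs:service:DesiredCount \
    --resource-id "$resource_id" --policy-name cpu-60 \
    --policy-type TargetTrackingScaling \
    --target-tracking-scaling-policy-configuration file://cpu-target-tracking.json \
    --region "$AWS_REGION"
}
```

To track `ALBRequestCountPerTarget` instead, the metric specification also needs a `ResourceLabel` that identifies the load balancer and target group.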

Service discovery and ECS Service Connect

When services call each other inside the VPC rather than over the public internet, they need a stable name to resolve. ECS offers two mechanisms.

Mechanism	What it is	How callers reach a service	Adds
Service discovery (Cloud Map)	ECS registers each task in AWS Cloud Map, which creates DNS records (e.g. `web.internal`)	DNS resolution to task IPs	Simple name → IP; DNS-based, so client-side caching and no built-in retries/metrics
ECS Service Connect	A managed proxy sidecar (Envoy) injected into tasks; services talk via logical names with the proxy handling routing	A logical endpoint name; the proxy load-balances per request	Per-request load balancing, retries, connection draining, and built-in traffic metrics — no DNS-caching pitfalls

Service discovery (Cloud Map) is the lightweight, DNS-based option. Service Connect is the newer, richer option: it gives you client-side load balancing, automatic retries, health-aware routing, and telemetry without an ALB between every pair of services, and it sidesteps the stale-DNS problems that plague raw DNS discovery. For internal service-to-service traffic at any scale, Service Connect is usually the better choice; reserve the ALB for north-south (ingress) traffic. The dedicated comparison — when to reach for Service Connect, Cloud Map, or an internal load balancer — is in ECS Service Connect deep dive: service discovery, traffic resilience, and migrating off ALBs.
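A hedged sketch of what enabling Service Connect on a service can look like. The namespace `internal`, the port name `web` and the alias `web:80` are assumptions, as are the file and function names. The namespace must already exist in Cloud Map, and `portName` has to match a named entry in the task definition's `portMappings`.

```shell
# Service Connect settings for one service: publish the port named "web" as web:80
# to other Service Connect-enabled services in the same namespace.
cat > service-connect.json <<'EOF'
{
  "enabled": true,
  "namespace": "internal",
  "services": [
    {
      "portName": "web",
      "discoveryName": "web",
      "clientAliases": [{ "port": 80, "dnsName": "web" }]
    }
  ]
}
EOF

# Sketch only: defines a function; nothing is changed until you call it.
# Usage: enable_service_connect <cluster> <service>
enable_service_connect() {
  aws ecs update-service --cluster "$1" --service "$2" \
    --service-connect-configuration file://service-connect.json \
    --force-new-deployment --region "$AWS_REGION"
}
```

Callers in the same namespace would then reach the service at `http://web:80`, with the proxy choosing a healthy task per request.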

ECS vs EKS

Both run containers on AWS; the difference is the control plane and ecosystem.

Dimension	Amazon ECS	Amazon EKS (managed Kubernetes)
Orchestrator	AWS-proprietary, fully managed; no control plane to run	Kubernetes — the open standard; AWS manages the control plane
Learning curve	Low — a handful of concepts (task def, task, service, cluster)	High — pods, deployments, services, ingress, RBAC, CRDs, the whole K8s surface
Control-plane cost	None (you pay only for compute)	A per-cluster hourly charge plus compute
Portability	AWS-only	Portable — same manifests run on any Kubernetes (other clouds, on-prem)
Ecosystem	Tight AWS integration (IAM, ALB, CloudWatch) out of the box	Vast CNCF ecosystem (Helm, Operators, Istio, Argo, Karpenter…)
Compute	Fargate or EC2	Fargate, EC2 managed node groups, or Karpenter
Best when	You want the simplest path to production containers on AWS and are happy being AWS-native	You need Kubernetes specifically — portability, an existing K8s investment/skills, or the CNCF ecosystem

The short answer: choose ECS when you want to ship containers on AWS with the least operational and conceptual overhead and have no specific need for Kubernetes; choose EKS when Kubernetes itself is a requirement — for multi-cloud portability, an existing Kubernetes platform/skill set, or the rich CNCF tooling. ECS is “the AWS way”; EKS is “Kubernetes, managed by AWS”. Neither is universally better — they optimise for different priorities.

[Figure: Amazon ECS & ECR fundamentals]

The diagram traces the full path: you build an image and push it to an ECR repository; a task definition references that image; a service in a cluster launches the desired number of tasks (on Fargate or EC2); an ALB routes inbound traffic to the tasks; service auto scaling adjusts the count; and Service Connect / Cloud Map lets services find each other.

Hands-on lab

You will build a tiny container image, push it to ECR, run it as an ECS service on Fargate, verify it, and clean everything up. ECR storage and CloudWatch Logs for this lab fit within the AWS Free Tier, but Fargate has no always-free allowance and bills per second while a task is running, so do the cleanup promptly. Region used: ap-south-1 (Mumbai); substitute your own.

Prerequisites: Docker running locally, the AWS CLI configured, and a default VPC with public subnets (every region has one unless it was deleted). Set helper variables:

export AWS_REGION=ap-south-1
export ACCOUNT=$(aws sts get-caller-identity --query Account --output text)
export REPO=demo-web
export CLUSTER=demo-cluster

1. Create the ECR repository (immutable, scan on push)

aws ecr create-repository \
  --repository-name "$REPO" \
  --image-tag-mutability IMMUTABLE \
  --image-scanning-configuration scanOnPush=true \
  --region "$AWS_REGION"

2. Build and push a minimal image

Create a one-file site and a Dockerfile in an empty directory:

mkdir demo-web && cd demo-web
echo '<h1>Hello from ECS on Fargate</h1>' > index.html
printf 'FROM public.ecr.aws/nginx/nginx:stable\nCOPY index.html /usr/share/nginx/html/index.html\n' > Dockerfile

aws ecr get-login-password --region "$AWS_REGION" \
  | docker login --username AWS --password-stdin "$ACCOUNT.dkr.ecr.$AWS_REGION.amazonaws.com"

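# Fargate defaults to x86_64, so force that platform when building on an ARM machine (e.g. Apple Silicon).
docker build --platform linux/amd64 -t "$ACCOUNT.dkr.ecr.$AWS_REGION.amazonaws.com/$REPO:1.0.0" .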
docker push "$ACCOUNT.dkr.ecr.$AWS_REGION.amazonaws.com/$REPO:1.0.0"

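Expected: the push reports each layer Pushed and a final digest. Confirm the tag and view the scan status (it can take a minute to move past IN_PROGRESS):

aws ecr describe-images --repository-name "$REPO" --region "$AWS_REGION" \
  --query 'imageDetails[].{tags:imageTags,scan:imageScanStatus.status,findings:imageScanFindingsSummary.findingSeverityCounts}'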

3. Create the cluster

aws ecs create-cluster --cluster-name "$CLUSTER" --region "$AWS_REGION"

4. Ensure the execution role exists

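Accounts that have launched ECS tasks from the console usually already have ecsTaskExecutionRole. If yours does not, create it with the trust policy for ECS tasks and attach the managed policy (the create-role call below is skipped harmlessly when the role exists):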

cat > trust.json <<'EOF'
{ "Version": "2012-10-17", "Statement": [{ "Effect": "Allow",
  "Principal": { "Service": "ecs-tasks.amazonaws.com" }, "Action": "sts:AssumeRole" }] }
EOF
aws iam create-role --role-name ecsTaskExecutionRole \
  --assume-role-policy-document file://trust.json 2>/dev/null || true
aws iam attach-role-policy --role-name ecsTaskExecutionRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy

5. Register the task definition

Write taskdef.json, then register it. The heredoc is unquoted, so the shell fills in $ACCOUNT, $AWS_REGION and $REPO from the helper variables:

cat > taskdef.json <<EOF
{
  "family": "demo-web",
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc",
  "cpu": "256",
  "memory": "512",
  "executionRoleArn": "arn:aws:iam::$ACCOUNT:role/ecsTaskExecutionRole",
  "containerDefinitions": [
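    {
      "name": "web",
      "image": "$ACCOUNT.dkr.ecr.$AWS_REGION.amazonaws.com/$REPO:1.0.0",
      "essential": true,
      "portMappings": [{ "containerPort": 80, "protocol": "tcp" }],
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/demo-web",
          "awslogs-region": "$AWS_REGION",
          "awslogs-stream-prefix": "web"
        }
      }
    }
  ]
}
EOF
# The managed execution-role policy does not include logs:CreateLogGroup, so create the
# log group up front instead of relying on the awslogs-create-group option.
aws logs create-log-group --log-group-name /ecs/demo-web --region "$AWS_REGION" 2>/dev/null || true
aws ecs register-task-definition --cli-input-json file://taskdef.json --region "$AWS_REGION"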

6. Run it as a service on Fargate

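Grab the default VPC, one of its subnets and its default security group (scoping both lookups to the same VPC, since every VPC has a group named default), then create a one-task service with a public IP (for the lab; production tasks live in private subnets behind an ALB):

VPC=$(aws ec2 describe-vpcs --region "$AWS_REGION" \
  --filters "Name=is-default,Values=true" --query 'Vpcs[0].VpcId' --output text)
SUBNET=$(aws ec2 describe-subnets --region "$AWS_REGION" \
  --filters "Name=vpc-id,Values=$VPC" "Name=default-for-az,Values=true" --query 'Subnets[0].SubnetId' --output text)
SG=$(aws ec2 describe-security-groups --region "$AWS_REGION" \
  --filters "Name=vpc-id,Values=$VPC" "Name=group-name,Values=default" --query 'SecurityGroups[0].GroupId' --output text)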

aws ecs create-service \
  --cluster "$CLUSTER" \
  --service-name demo-web-svc \
  --task-definition demo-web \
  --desired-count 1 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[$SUBNET],securityGroups=[$SG],assignPublicIp=ENABLED}" \
  --deployment-configuration "deploymentCircuitBreaker={enable=true,rollback=true},minimumHealthyPercent=100,maximumPercent=200" \
  --region "$AWS_REGION"

The default security group allows all traffic within itself but not from the internet. To actually browse the page, add an inbound rule for TCP 80 from your IP to $SG. For a quick functional check, the task reaching RUNNING is enough.
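One way to add that rule, sketched as a function so nothing changes until you call it. It assumes `curl` is installed and uses the `checkip.amazonaws.com` endpoint to discover your public IP. The function name is hypothetical.

```shell
# Sketch only: allow inbound HTTP to the lab security group from your current public IP.
# Usage: allow_http_from_my_ip "$SG"
allow_http_from_my_ip() {
  my_ip=$(curl -s https://checkip.amazonaws.com) || return 1
  aws ec2 authorize-security-group-ingress \
    --group-id "$1" --protocol tcp --port 80 \
    --cidr "${my_ip}/32" --region "$AWS_REGION"
}
```

The default security group outlives the lab, so remove the rule afterwards with `aws ec2 revoke-security-group-ingress` and the same arguments.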

7. Validate

aws ecs describe-services --cluster "$CLUSTER" --services demo-web-svc \
  --region "$AWS_REGION" --query 'services[0].{desired:desiredCount,running:runningCount,status:status}'

Expected once stable: running equals desired (1) and status is ACTIVE. List the task and read its public IP:

TASK=$(aws ecs list-tasks --cluster "$CLUSTER" --service-name demo-web-svc \
  --region "$AWS_REGION" --query 'taskArns[0]' --output text)
ENI=$(aws ecs describe-tasks --cluster "$CLUSTER" --tasks "$TASK" --region "$AWS_REGION" \
  --query "tasks[0].attachments[0].details[?name=='networkInterfaceId'].value" --output text)
aws ec2 describe-network-interfaces --network-interface-ids "$ENI" --region "$AWS_REGION" \
  --query 'NetworkInterfaces[0].Association.PublicIp' --output text

If you opened port 80, curl http://<that-ip>/ returns the Hello from ECS on Fargate page.

Cleanup

Delete in order — service, cluster, repository, logs — so nothing keeps billing:

aws ecs update-service --cluster "$CLUSTER" --service demo-web-svc \
  --desired-count 0 --region "$AWS_REGION"
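aws ecs delete-service --cluster "$CLUSTER" --service demo-web-svc \
  --force --region "$AWS_REGION"
# Wait for the service to finish draining; delete-cluster can fail while it is still active.
aws ecs wait services-inactive --cluster "$CLUSTER" --services demo-web-svc --region "$AWS_REGION"
aws ecs delete-cluster --cluster "$CLUSTER" --region "$AWS_REGION"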
aws ecr delete-repository --repository-name "$REPO" --force --region "$AWS_REGION"
aws logs delete-log-group --log-group-name /ecs/demo-web --region "$AWS_REGION" 2>/dev/null || true

Cost note: a 0.25 vCPU (256 CPU units) / 512 MB Fargate task costs a few US cents per hour while running; run through the lab and clean up in the same session and the cost is negligible. ECR storage is billed per GB-month (the nginx-based lab image is tens of MB compressed) and the Free Tier includes 500 MB-month of private storage; basic scanning and CloudWatch Logs ingestion for this tiny workload are effectively free. The cluster itself costs nothing: you pay only for running tasks and stored data.

Common mistakes & troubleshooting

Symptom	Likely cause	Fix
Task stuck `PENDING`/`STOPPED` with CannotPullContainerError	Execution role lacks ECR permissions, or the task in a private subnet has no route to ECR	Attach `AmazonECSTaskExecutionRolePolicy`; give the subnet a NAT gateway or ECR + S3 VPC endpoints
Task stops immediately with ResourceInitializationError: unable to retrieve secret	Execution role can’t read the Secrets Manager/SSM value (or no network path to the service)	Grant `secretsmanager:GetSecretValue`/`ssm:GetParameters` to the execution role; add VPC endpoints if private
App gets `AccessDenied` calling S3/DynamoDB at runtime	Wrong role — permission is on the execution role, not the task role	Put the app’s API permissions on the task role (`taskRoleArn`)
Invalid CPU/memory combination on register	Fargate only accepts specific CPU/memory pairs	Match the Fargate CPU/memory matrix (e.g. cpu 256 → memory 512/1024/2048)
Deploy never finishes; tasks cycle `PROVISIONING`→`STOPPED`	New tasks fail health checks (bad image, wrong port, slow boot)	Enable the deployment circuit breaker (auto-rollback); fix the health-check path/port; set `healthCheckGracePeriodSeconds`
Healthy tasks but 502/503 from the ALB	Target-group container/port mismatch, or SG blocks the ALB→task path	Register the correct container name + port; allow the ALB SG to reach the task SG on the container port
`bridge`-mode tasks fail to place — “ports already in use”	Static host ports clash on the EC2 host	Use dynamic host ports (`hostPort: 0`) with an ALB, or switch to `awsvpc`
Untagged images and storage cost creeping up	Mutable tags overwrite, orphaning old images; no cleanup	Use immutable tags and an ECR lifecycle policy to expire untagged/old images
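Most rows in the table start from the same question: why did the task stop? A small sketch for pulling that answer out of the API. The function name is hypothetical, and it assumes `AWS_REGION` is set as in the lab.

```shell
# Sketch only: print the stop code and reasons for a service's recently stopped tasks.
# Usage: why_stopped <cluster> <service>
why_stopped() {
  tasks=$(aws ecs list-tasks --cluster "$1" --service-name "$2" \
    --desired-status STOPPED --query 'taskArns' --output text --region "$AWS_REGION")
  [ -n "$tasks" ] && [ "$tasks" != "None" ] || { echo "no stopped tasks found"; return 0; }

  # $tasks is intentionally unquoted so each ARN becomes its own argument.
  aws ecs describe-tasks --cluster "$1" --tasks $tasks --region "$AWS_REGION" \
    --query 'tasks[].{task:taskArn,stopCode:stopCode,reason:stoppedReason,containers:containers[].reason}'
}
```

ECS keeps stopped tasks visible only for a limited time, so run this soon after a failure. The `reason` fields usually carry the error strings shown in the Symptom column.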

Best practices

Default to Fargate; move to EC2 only for a concrete reason (GPU, steady high-utilisation packing, daemon workloads, host customisation).
Use awsvpc networking so every task has its own IP and security group, and use ALB ip target type.
Two roles, least privilege: execution role = pull/log/secrets only; task role = exactly the app’s runtime API needs.
Immutable image tags + reference by digest, plus an ECR lifecycle policy to expire old/untagged images automatically.
Enable enhanced scanning for production images so they are continuously re-evaluated against new CVEs.
Always enable the deployment circuit breaker with rollback on rolling deployments; use blue/green for releases that need pre-traffic validation and instant rollback.
Set min=100, max=200 for zero-downtime rolling deploys, and a sensible health-check grace period for slow-booting apps.
Inject secrets from Secrets Manager / SSM, never as plain-text environment variables.
Ship logs to CloudWatch (or FireLens) and turn on Container Insights for cluster/service metrics.
Run as non-root with a read-only root filesystem and write only to mounted volumes.
Use Service Connect for internal service-to-service traffic; reserve ALBs for ingress.
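The lifecycle-policy practice above can be sketched as follows. The thresholds (14 days for untagged images, the last 10 images whose tags start with `prod`) are illustrative, and the file and function names are hypothetical.

```shell
# Lifecycle rules are evaluated by rulePriority, lowest number first.
cat > lifecycle.json <<'EOF'
{
  "rules": [
    {
      "rulePriority": 1,
      "description": "Expire untagged images 14 days after push",
      "selection": {
        "tagStatus": "untagged",
        "countType": "sinceImagePushed",
        "countUnit": "days",
        "countNumber": 14
      },
      "action": { "type": "expire" }
    },
    {
      "rulePriority": 2,
      "description": "Keep only the 10 most recent images tagged prod*",
      "selection": {
        "tagStatus": "tagged",
        "tagPrefixList": ["prod"],
        "countType": "imageCountMoreThan",
        "countNumber": 10
      },
      "action": { "type": "expire" }
    }
  ]
}
EOF

# Sketch only: defines a function; nothing is changed until you call it.
# Usage: apply_lifecycle_policy <repository-name>
apply_lifecycle_policy() {
  aws ecr put-lifecycle-policy --repository-name "$1" \
    --lifecycle-policy-text file://lifecycle.json --region "$AWS_REGION"
}
```

Because expiry is permanent, it is worth dry-running a policy first, for example with `aws ecr start-lifecycle-policy-preview`, before applying it to a repository that production services pull from.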

Security notes

The container security model on ECS rests on a few pillars. Image provenance: scan images (enhanced scanning) and use immutable tags so a deployed tag cannot be silently swapped — reference by digest for the strongest guarantee. Least-privilege identity: the task role should grant only the AWS APIs the app calls, and the execution role only pull/log/secrets — never reuse a broad role across services. Secrets: keep credentials in Secrets Manager / SSM and inject them via the secrets block so they never appear in the (readable) task definition or in environment listings. Network isolation: with awsvpc, give each service its own security group and run tasks in private subnets, reaching AWS services through VPC endpoints and the internet (if needed) through a NAT gateway. Runtime hardening: run as a non-root user, set readonlyRootFilesystem, drop Linux capabilities you do not need, and avoid privileged mode. Registry access: lock down repository policies; cross-account pulls should be explicit and scoped. Auditing: ECS, ECR, and the IAM role assumptions are all logged in CloudTrail; ECS Exec sessions are auditable via SSM. Finally, prefer Service Connect/Cloud Map internal traffic over exposing services publicly, and put a WAF + the ALB (or CloudFront) in front of anything internet-facing.
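The runtime-hardening and secrets points can be sketched as a container-definition fragment. This is an illustrative config fragment, not the lab's task definition: the container name, UID, parameter path and angle-bracket placeholders are all assumptions. An image such as stock nginx would also need writable mounted volumes for its cache and PID paths before `readonlyRootFilesystem` works.

```shell
# Writes an example container definition fragment; merge the fields into your own
# task definition's containerDefinitions entry.
cat > hardened-container.json <<'EOF'
{
  "name": "app",
  "image": "<account>.dkr.ecr.<region>.amazonaws.com/<repo>@sha256:<digest>",
  "essential": true,
  "user": "1000:1000",
  "readonlyRootFilesystem": true,
  "privileged": false,
  "linuxParameters": { "capabilities": { "drop": ["ALL"] } },
  "secrets": [
    {
      "name": "DB_PASSWORD",
      "valueFrom": "arn:aws:ssm:<region>:<account>:parameter/demo/db-password"
    }
  ]
}
EOF
```

The `secrets` entry is resolved by the execution role at task start, so the value itself never appears in the task definition. Only the parameter ARN does.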

Interview & exam questions

What is the difference between a task and a service in ECS? A task is one running instantiation of a task definition — one or more containers running together on a host — and once it stops it is gone. A service is a controller that maintains a desired count of tasks: it replaces failures, spreads tasks across AZs, registers them with a load balancer, drives auto scaling, and orchestrates deployments. Tasks are for one-off/batch work; services are for long-lived applications.
Explain the task role versus the execution role. The execution role is used by the ECS agent/Fargate infrastructure to pull the image from ECR, write logs to CloudWatch, and fetch secrets referenced in the definition. The task role is assumed by your application code at runtime to call AWS APIs (S3, DynamoDB, etc.). Execution role = get the task running; task role = what the running app can do.
What are the ECS network modes, and which does Fargate require? awsvpc (each task gets its own ENI/IP and security group; required by Fargate), bridge (Docker bridge with port mapping, EC2 only), host (bind directly to host network, EC2 only), and none. awsvpc is the recommended mode everywhere, although a Linux EC2 task definition that omits networkMode still falls back to bridge.
When would you choose Fargate over the EC2 launch type, and vice versa? Choose Fargate for minimal operations (no hosts to patch/scale), per-task pricing, and fast scaling — the right default for most services. Choose EC2 for steady high-utilisation fleets where bin-packing beats per-task pricing, for GPU/special instances, for daemon workloads, or for host-level customisation Fargate doesn’t allow.
What does the deployment circuit breaker do? On a rolling deployment, if too many new tasks fail to start or stay healthy, the circuit breaker marks the deployment failed and (with rollback: true) automatically rolls the service back to the last healthy revision — preventing an endless loop of launching a broken image.
How do minimumHealthyPercent and maximumPercent control a rolling deploy? minimumHealthyPercent is the floor of healthy tasks ECS must keep during the deploy; maximumPercent is the ceiling of total (old+new) tasks. min=100, max=200 lets ECS add new tasks before removing old ones (no capacity dip); lower values trade headroom for fewer concurrent tasks.
How do blue/green deployments work on ECS, and what do they add over rolling? With the CodeDeploy controller, ECS stands up a parallel green task set, shifts ALB traffic to it (all-at-once, canary, or linear) with optional validation hooks, then retires the blue set. They add pre-traffic validation and instant rollback (shift traffic back) that in-place rolling updates lack.
What is ECR tag immutability and why does it matter? With IMMUTABLE repositories, a tag cannot be overwritten once pushed. This prevents a tag like prod or latest from silently pointing at a different image — a supply-chain and rollback hazard — making deployments reproducible and auditable.
What is an ECR lifecycle policy? A prioritised set of rules that automatically expire images by age or count (e.g. keep the last 10 prod images, delete untagged images older than 14 days), controlling storage cost and clutter. Expiry is permanent, so selections must avoid images that running services still reference.
How does an ECS service integrate with an Application Load Balancer? The service is attached to a target group with a named container+port; ECS keeps the target group in sync as tasks start/stop. With awsvpc, the ip target type registers each task’s ENI IP directly. A health-check grace period protects slow-booting tasks, and deregistration delay drains connections on stop.
What is the difference between service discovery (Cloud Map) and ECS Service Connect? Cloud Map is DNS-based discovery (names resolve to task IPs) — simple but subject to client DNS caching and with no built-in retries/metrics. Service Connect injects a managed proxy that gives per-request load balancing, retries, health-aware routing, and traffic metrics — better for internal service-to-service traffic.
When would you choose ECS over EKS? Choose ECS for the simplest path to production containers on AWS with no Kubernetes overhead and tight AWS integration (and no control-plane cost). Choose EKS when you specifically need Kubernetes — portability/multi-cloud, existing K8s skills/investment, or the CNCF ecosystem (Helm, Operators, Karpenter).
Basic vs enhanced image scanning in ECR? Basic uses ECR’s built-in scanner on OS packages, runs on push/on demand, and is free. Enhanced uses Amazon Inspector, covers OS and language packages, runs on push and continuously as new CVEs appear, and is billed per scan.

Quick check

Which IAM role pulls the image from ECR and writes the task’s logs?
Which network mode is mandatory on Fargate, and what does each task get under it?
What two percentages bound a rolling deployment, and what does min=100, max=200 achieve?
Name the three ECS deployment controllers.
What does an ECR lifecycle policy do, and is the action reversible?

Answers

The execution role (executionRoleArn) — it handles image pull, CloudWatch Logs, and secret retrieval. (The task role is what your app code uses at runtime.)
awsvpc — each task gets its own ENI with a private IP in your subnet and its own security group.
minimumHealthyPercent (floor of healthy tasks) and maximumPercent (ceiling of total tasks). min=100, max=200 lets ECS launch new tasks before stopping old ones, so capacity never dips during the deploy.
ECS (rolling update), CODE_DEPLOY (blue/green), and EXTERNAL (you manage task sets/traffic).
It automatically expires images by age or count to control storage and clutter; the expire action is permanent — there is no recycle bin.

Exercise

Take the lab service and harden it toward production:

Move the service into private subnets behind an internet-facing ALB: create a target group with target type ip, attach it to the service (--load-balancers), and set a health-check grace period. Confirm the page is reachable through the ALB DNS name, not a task public IP.
Add a task role granting read-only access to one S3 bucket, and prove from inside the container (via ECS Exec, aws ecs execute-command) that the app identity can list that bucket but nothing else.
Inject a value from SSM Parameter Store (or Secrets Manager) using the secrets block instead of a plain-text environment variable; verify the execution role needed the read permission and that the value is not visible in the task definition.
Configure target-tracking auto scaling on ALBRequestCountPerTarget (min 1, max 4); generate load and watch the desired count rise and fall.
Add an ECR lifecycle policy (keep last 5 images; expire untagged after 7 days), push a few revisions, and confirm old/untagged images are reaped.
Switch the service to blue/green via CodeDeploy and run a canary deploy (10% for 5 minutes, then 100%), then trigger a rollback.

Certification mapping

Exam	Objective area this supports
DVA-C02 (Developer – Associate)	Development & deployment with AWS services — packaging apps as containers, ECR push/pull and image management, authoring task definitions (roles, secrets, logging), running services, and rolling/blue-green deployments with rollback.
SAA-C03 (Solutions Architect – Associate)	Design resilient, cost-optimised architectures — choosing Fargate vs EC2, ECS vs EKS, `awsvpc` networking and ALB integration, service auto scaling, and service discovery/Service Connect for decoupled microservices.
SOA-C02 (SysOps Administrator – Associate)	Deployment, monitoring & troubleshooting — operating ECS services, Container Insights/CloudWatch logging, deployment health and circuit breaker, and diagnosing task-launch and load-balancer issues.

Glossary

ECR (Elastic Container Registry) — AWS’s managed registry for storing, versioning, and scanning container images.
Repository — a named collection of image versions within ECR (typically one per application image).
Tag immutability — a repository setting (IMMUTABLE) preventing a tag from being overwritten once pushed.
Lifecycle policy — prioritised rules that automatically expire ECR images by age or count.
Image scanning (basic/enhanced) — CVE scanning of images; basic (ECR, OS packages, on push) vs enhanced (Amazon Inspector, OS + language packages, continuous).
ECS (Elastic Container Service) — AWS’s proprietary container orchestrator that runs tasks/services from task definitions.
Cluster — a logical grouping and capacity boundary for ECS tasks and services.
Task definition — the versioned JSON blueprint (family + revisions) describing containers and task-level settings.
Container definition — the spec for one container inside a task (image, ports, env, secrets, logging, health check).
Task — a single running instantiation of a task definition; the unit of scheduling.
Service — a controller maintaining a desired count of tasks, with load-balancer, scaling, and deployment integration.
Launch type — where tasks run: Fargate (serverless) or EC2 (your container instances).
Capacity provider — managed capacity for a cluster (FARGATE, FARGATE_SPOT, or an Auto Scaling group provider).
Network mode — task networking model: awsvpc, bridge, host, or none.
awsvpc mode — each task gets its own ENI/IP and security group; required by Fargate.
Execution role — the IAM role the ECS agent uses to pull images, write logs, and fetch secrets.
Task role — the IAM role your application code assumes at runtime to call AWS APIs.
Desired count — the number of tasks a service keeps running.
Deployment circuit breaker — auto-detects a failing deployment and (optionally) rolls it back to the last healthy revision.
Rolling update / blue-green / external — the three ECS deployment controllers (ECS, CODE_DEPLOY, EXTERNAL).
Service Connect — a managed proxy giving per-request load balancing, retries, and metrics for service-to-service traffic.
Service discovery (Cloud Map) — DNS-based registration/resolution of tasks by name.
Fargate CPU/memory matrix — the fixed set of valid CPU/memory combinations Fargate accepts.

Next steps

Continue the course with the Amazon CloudFront deep dive — the CDN that commonly sits in front of an ECS-backed application for global caching and TLS. Then go deeper on running containers in production:

Production Amazon ECS on Fargate: Task Networking, Auto Scaling, and Safe Rolling Deployments — awsvpc ENI/IP planning, scaling-policy tuning, deployment circuit breakers, and graceful task lifecycle.
ECS Service Connect Deep Dive: Service Discovery, Traffic Resilience, and Migrating Off ALBs — when to use Service Connect, Cloud Map, or an internal load balancer for service-to-service traffic.
