Quick take — A reusable Terraform module for AWS ECS on Fargate: provisions an aws_ecs_cluster with Container Insights plus a load-balanced aws_ecs_service, task definition, target-tracking autoscaling, and CloudWatch logging. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.
Quickstart (copy-paste)
Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):
provider "aws" {
region = "us-east-1"
}
module "ecs" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-ecs?ref=v1.0.0"
name_prefix = "..." # Prefix for cluster/service names (team or environment).
service_name = "..." # Logical service name; also the primary container name (…
aws_region = "..." # Region for the awslogs log driver.
container_image = "..." # Fully qualified image, including tag or digest.
execution_role_arn = "..." # ECS task execution role ARN.
subnet_ids = ["...", "..."] # Subnets for awsvpc ENIs (>= 2, validated).
security_group_ids = ["...", "..."] # Security groups on task ENIs.
target_group_arn = "..." # ALB/NLB target group ARN.
}
Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.
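The Quickstart expects an execution role and a target group to exist already. As a rough sketch of those prerequisites, something like the following may be enough for a first deploy. The resource names ecs_execution and app, the vpc_id variable, and the /healthz path are placeholders chosen for illustration, not part of the module.

```hcl
# Hypothetical prerequisites for the Quickstart; all names are illustrative.
variable "vpc_id" {
  description = "VPC that hosts the load balancer and the tasks (assumed input)."
  type        = string
}

# Trust policy letting ECS tasks assume the execution role.
data "aws_iam_policy_document" "ecs_tasks_assume" {
  statement {
    actions = ["sts:AssumeRole"]
    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "ecs_execution" {
  name               = "ecs-execution-example"
  assume_role_policy = data.aws_iam_policy_document.ecs_tasks_assume.json
}

# AWS managed policy: pull images from ECR and write to CloudWatch Logs.
resource "aws_iam_role_policy_attachment" "ecs_execution" {
  role       = aws_iam_role.ecs_execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

# awsvpc tasks register by ENI IP, so the target type must be "ip".
resource "aws_lb_target_group" "app" {
  name        = "app-example"
  port        = 8080
  protocol    = "HTTP"
  vpc_id      = var.vpc_id
  target_type = "ip"

  health_check {
    path    = "/healthz"
    matcher = "200"
  }
}
```

Two caveats with this sketch: the managed policy covers ECR and CloudWatch Logs only, so a service that uses the secrets input also needs read access to those secrets on the execution role; and the target group has to be attached to a load balancer listener before the service is created.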
What this module is
Amazon ECS (Elastic Container Service) is AWS’s native container orchestrator. A cluster is the logical boundary that capacity and services live in; a service is the long-running controller that keeps N copies of a task definition healthy, registers them behind a load balancer, and rolls out new revisions without dropping traffic. On their own, aws_ecs_cluster and aws_ecs_service look deceptively small — but a production service is never just those two resources. You also need a task definition with the right requires_compatibilities, a CloudWatch log group, an awsvpc network configuration with the correct security groups, an ALB target group wired to the container port, task and execution IAM roles, and an Application Auto Scaling target with a scaling policy.
This module wraps most of that into one opinionated, Fargate-first unit; the IAM roles, target group, and security groups stay with the caller and are passed in as inputs. You pass a container image, CPU/memory, a target group ARN, and subnet/SG IDs; the module returns a running, autoscaling, log-emitting service. Wrapping it as a module means every team ships ECS the same way (ECS Exec via enable_execute_command for debugging, circuit-breaker rollback on failed deploys, Container Insights on by default) instead of copy-pasting a 200-line service block and quietly forgetting half of it.
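ECS Exec only works when the task role allows the SSM message channels, and task_role_arn defaults to null in this module. A minimal sketch of the policy a caller might attach, assuming a caller-managed role named checkout_task (the policy and document names are hypothetical):

```hcl
# Sketch: permissions the task role needs for ECS Exec.
# aws_iam_role.checkout_task is assumed to exist in the calling stack.
data "aws_iam_policy_document" "ecs_exec" {
  statement {
    effect = "Allow"
    actions = [
      "ssmmessages:CreateControlChannel",
      "ssmmessages:CreateDataChannel",
      "ssmmessages:OpenControlChannel",
      "ssmmessages:OpenDataChannel",
    ]
    resources = ["*"]
  }
}

resource "aws_iam_role_policy" "ecs_exec" {
  name   = "ecs-exec"
  role   = aws_iam_role.checkout_task.id
  policy = data.aws_iam_policy_document.ecs_exec.json
}
```

Without a task role carrying these actions, enable_execute_command = true is accepted by the API but sessions into the task are likely to fail.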
When to use it
- You run stateless HTTP/gRPC services (APIs, web frontends, BFFs) on Fargate and want load-balanced, autoscaling deployments without managing EC2 hosts.
- You want safe rollouts: deployment circuit breaker with automatic rollback, plus health-check-gated traffic shifting.
- You need per-service autoscaling rather than a fixed task count; the module ships a target-tracking policy on average CPU utilization.
- You are standardizing many microservices and want a single contract for logging, secrets injection, and ECS Exec debugging.
- Reach for something else when: you need Kubernetes-native primitives (use EKS), the workload is a short-lived batch job (use ECS RunTask/Step Functions or AWS Batch, not a long-running service), or you need GPU/host-level access that only EC2 launch type provides (this module is Fargate-only by design).
Module structure
terraform-module-aws-ecs/
├── versions.tf # provider + Terraform version pins
├── main.tf # cluster, task def, service, log group, autoscaling
├── variables.tf # var-driven inputs with validation
└── outputs.tf # cluster + service identifiers and ARNs
versions.tf
terraform {
required_version = ">= 1.5.0"
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 5.0"
}
}
}
main.tf
locals {
name = "${var.name_prefix}-${var.service_name}"
tags = merge(
{
"Service" = var.service_name
"ManagedBy" = "terraform"
"Module" = "terraform-module-aws-ecs"
},
var.tags
)
}
# ---------------------------------------------------------------------------
# CloudWatch log group for container stdout/stderr
# ---------------------------------------------------------------------------
resource "aws_cloudwatch_log_group" "this" {
name = "/ecs/${local.name}"
retention_in_days = var.log_retention_days
kms_key_id = var.log_kms_key_arn
tags = local.tags
}
# ---------------------------------------------------------------------------
# Cluster with Container Insights
# ---------------------------------------------------------------------------
resource "aws_ecs_cluster" "this" {
name = local.name
setting {
name = "containerInsights"
value = var.container_insights ? "enabled" : "disabled"
}
tags = local.tags
}
resource "aws_ecs_cluster_capacity_providers" "this" {
cluster_name = aws_ecs_cluster.this.name
capacity_providers = ["FARGATE", "FARGATE_SPOT"]
default_capacity_provider_strategy {
capacity_provider = var.use_fargate_spot ? "FARGATE_SPOT" : "FARGATE"
weight = 1
base = var.use_fargate_spot ? 0 : 1
}
}
# ---------------------------------------------------------------------------
# Task definition (awsvpc + Fargate)
# ---------------------------------------------------------------------------
resource "aws_ecs_task_definition" "this" {
family = local.name
requires_compatibilities = ["FARGATE"]
network_mode = "awsvpc"
cpu = var.task_cpu
memory = var.task_memory
execution_role_arn = var.execution_role_arn
task_role_arn = var.task_role_arn
runtime_platform {
operating_system_family = "LINUX"
cpu_architecture = var.cpu_architecture
}
container_definitions = jsonencode([
{
name = var.service_name
image = var.container_image
essential = true
cpu = var.task_cpu
memory = var.task_memory
portMappings = [
{
containerPort = var.container_port
protocol = "tcp"
}
]
environment = [
for k, v in var.environment : { name = k, value = v }
]
secrets = [
for k, arn in var.secrets : { name = k, valueFrom = arn }
]
logConfiguration = {
logDriver = "awslogs"
options = {
"awslogs-group" = aws_cloudwatch_log_group.this.name
"awslogs-region" = var.aws_region
"awslogs-stream-prefix" = var.service_name
}
}
healthCheck = {
command = ["CMD-SHELL", var.container_health_check_command]
interval = 30
timeout = 5
retries = 3
startPeriod = 30
}
}
])
tags = local.tags
}
# ---------------------------------------------------------------------------
# Service
# ---------------------------------------------------------------------------
resource "aws_ecs_service" "this" {
name = local.name
cluster = aws_ecs_cluster.this.id
task_definition = aws_ecs_task_definition.this.arn
desired_count = var.desired_count
launch_type = var.use_fargate_spot ? null : "FARGATE"
enable_execute_command = var.enable_execute_command
health_check_grace_period_seconds = var.health_check_grace_period_seconds
deployment_circuit_breaker {
enable = true
rollback = true
}
deployment_maximum_percent = var.deployment_maximum_percent
deployment_minimum_healthy_percent = var.deployment_minimum_healthy_percent
dynamic "capacity_provider_strategy" {
for_each = var.use_fargate_spot ? [1] : []
content {
capacity_provider = "FARGATE_SPOT"
weight = 1
}
}
network_configuration {
subnets = var.subnet_ids
security_groups = var.security_group_ids
assign_public_ip = var.assign_public_ip
}
load_balancer {
target_group_arn = var.target_group_arn
container_name = var.service_name
container_port = var.container_port
}
lifecycle {
ignore_changes = [desired_count]
}
tags = local.tags
depends_on = [aws_ecs_cluster_capacity_providers.this]
}
# ---------------------------------------------------------------------------
# Target-tracking autoscaling
# ---------------------------------------------------------------------------
resource "aws_appautoscaling_target" "this" {
count = var.enable_autoscaling ? 1 : 0
max_capacity = var.autoscaling_max_capacity
min_capacity = var.autoscaling_min_capacity
resource_id = "service/${aws_ecs_cluster.this.name}/${aws_ecs_service.this.name}"
scalable_dimension = "ecs:service:DesiredCount"
service_namespace = "ecs"
}
resource "aws_appautoscaling_policy" "cpu" {
count = var.enable_autoscaling ? 1 : 0
name = "${local.name}-cpu-tracking"
policy_type = "TargetTrackingScaling"
resource_id = aws_appautoscaling_target.this[0].resource_id
scalable_dimension = aws_appautoscaling_target.this[0].scalable_dimension
service_namespace = aws_appautoscaling_target.this[0].service_namespace
target_tracking_scaling_policy_configuration {
predefined_metric_specification {
predefined_metric_type = "ECSServiceAverageCPUUtilization"
}
target_value = var.autoscaling_cpu_target
scale_in_cooldown = 300
scale_out_cooldown = 60
}
}
variables.tf
variable "name_prefix" {
description = "Prefix for cluster/service names (e.g. team or environment)."
type = string
}
variable "service_name" {
description = "Logical service name; also the primary container name."
type = string
validation {
condition = can(regex("^[a-z0-9][a-z0-9-]{0,30}$", var.service_name))
error_message = "service_name must be lowercase alphanumeric/hyphens, max 31 chars."
}
}
variable "aws_region" {
description = "Region for the awslogs log driver."
type = string
}
variable "container_image" {
description = "Fully qualified container image, including tag or digest."
type = string
}
variable "container_port" {
description = "Port the container listens on (wired to the target group)."
type = number
default = 8080
}
variable "task_cpu" {
description = "Task-level CPU units. Must be a valid Fargate combo with task_memory."
type = number
default = 512
validation {
condition = contains([256, 512, 1024, 2048, 4096, 8192, 16384], var.task_cpu)
error_message = "task_cpu must be a valid Fargate CPU value (256, 512, 1024, ...)."
}
}
variable "task_memory" {
description = "Task-level memory (MiB). Must be a valid Fargate combo with task_cpu."
type = number
default = 1024
}
variable "cpu_architecture" {
description = "Fargate CPU architecture: X86_64 or ARM64 (Graviton, cheaper)."
type = string
default = "ARM64"
validation {
condition = contains(["X86_64", "ARM64"], var.cpu_architecture)
error_message = "cpu_architecture must be X86_64 or ARM64."
}
}
variable "desired_count" {
description = "Initial task count (ignored after create when autoscaling manages it)."
type = number
default = 2
}
variable "execution_role_arn" {
description = "ECS task execution role ARN (pulls images, writes logs, reads secrets)."
type = string
}
variable "task_role_arn" {
description = "IAM role assumed by the running container for AWS API calls."
type = string
default = null
}
variable "subnet_ids" {
description = "Subnet IDs for the awsvpc ENIs (private subnets recommended)."
type = list(string)
validation {
condition = length(var.subnet_ids) >= 2
error_message = "Provide at least two subnets across AZs for high availability."
}
}
variable "security_group_ids" {
description = "Security groups attached to task ENIs."
type = list(string)
}
variable "assign_public_ip" {
description = "Assign a public IP to tasks (only for public subnets without NAT)."
type = bool
default = false
}
variable "target_group_arn" {
description = "ALB/NLB target group ARN tasks register into."
type = string
}
variable "health_check_grace_period_seconds" {
description = "Grace period before the LB health check can mark a new task unhealthy."
type = number
default = 60
}
variable "container_health_check_command" {
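description = "Shell command for the container-level health check. The default assumes curl exists in the image and the app listens on 8080; override it when container_port changes."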
type = string
default = "curl -f http://localhost:8080/healthz || exit 1"
}
variable "deployment_minimum_healthy_percent" {
description = "Minimum percent of tasks kept healthy during a deploy."
type = number
default = 100
}
variable "deployment_maximum_percent" {
description = "Maximum percent of tasks allowed during a deploy."
type = number
default = 200
}
variable "enable_execute_command" {
description = "Enable ECS Exec (SSM) into running tasks for debugging."
type = bool
default = true
}
variable "container_insights" {
description = "Enable CloudWatch Container Insights on the cluster."
type = bool
default = true
}
variable "use_fargate_spot" {
description = "Run tasks on FARGATE_SPOT capacity instead of on-demand FARGATE."
type = bool
default = false
}
variable "log_retention_days" {
description = "CloudWatch log retention in days."
type = number
default = 30
}
variable "log_kms_key_arn" {
description = "Optional KMS key ARN to encrypt the log group."
type = string
default = null
}
variable "environment" {
description = "Plain environment variables injected into the container."
type = map(string)
default = {}
}
variable "secrets" {
description = "Secrets map: env var name => Secrets Manager/SSM Parameter ARN."
type = map(string)
default = {}
}
variable "enable_autoscaling" {
description = "Provision Application Auto Scaling target + CPU policy."
type = bool
default = true
}
variable "autoscaling_min_capacity" {
description = "Minimum task count for autoscaling."
type = number
default = 2
}
variable "autoscaling_max_capacity" {
description = "Maximum task count for autoscaling."
type = number
default = 10
}
variable "autoscaling_cpu_target" {
description = "Target average CPU utilization percent for scaling."
type = number
default = 60
}
variable "tags" {
description = "Additional tags merged onto all resources."
type = map(string)
default = {}
}
outputs.tf
output "cluster_id" {
description = "ECS cluster ID (ARN)."
value = aws_ecs_cluster.this.id
}
output "cluster_arn" {
description = "ECS cluster ARN."
value = aws_ecs_cluster.this.arn
}
output "cluster_name" {
description = "ECS cluster name."
value = aws_ecs_cluster.this.name
}
output "service_id" {
description = "ECS service ARN/ID."
value = aws_ecs_service.this.id
}
output "service_name" {
description = "ECS service name."
value = aws_ecs_service.this.name
}
output "task_definition_arn" {
description = "Full ARN (with revision) of the active task definition."
value = aws_ecs_task_definition.this.arn
}
output "task_definition_family" {
description = "Task definition family name."
value = aws_ecs_task_definition.this.family
}
output "log_group_name" {
description = "CloudWatch log group receiving container logs."
value = aws_cloudwatch_log_group.this.name
}
output "autoscaling_target_resource_id" {
description = "Application Auto Scaling resource ID, or null if disabled."
value = try(aws_appautoscaling_target.this[0].resource_id, null)
}
How to use it
module "ecs_cluster_service" {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-ecs?ref=v1.0.0"
name_prefix = "prod"
service_name = "checkout-api"
aws_region = "ap-south-1"
container_image = "123456789012.dkr.ecr.ap-south-1.amazonaws.com/checkout-api:2026.06.1"
container_port = 8080
task_cpu = 1024
task_memory = 2048
cpu_architecture = "ARM64"
execution_role_arn = aws_iam_role.ecs_execution.arn
task_role_arn = aws_iam_role.checkout_task.arn
subnet_ids = module.network.private_subnet_ids
security_group_ids = [aws_security_group.checkout_tasks.id]
target_group_arn = aws_lb_target_group.checkout.arn
environment = {
LOG_LEVEL = "info"
APP_REGION = "ap-south-1"
}
secrets = {
DB_PASSWORD = aws_secretsmanager_secret.db.arn
STRIPE_API_KEY = aws_secretsmanager_secret.stripe.arn
}
enable_autoscaling = true
autoscaling_min_capacity = 3
autoscaling_max_capacity = 20
autoscaling_cpu_target = 55
tags = {
Environment = "prod"
CostCenter = "payments"
}
}
# Downstream: alarm on the service's CPU, using the cluster_name and
# service_name outputs as CloudWatch dimensions.
resource "aws_cloudwatch_metric_alarm" "checkout_high_cpu" {
alarm_name = "checkout-api-cpu-high"
namespace = "AWS/ECS"
metric_name = "CPUUtilization"
statistic = "Average"
comparison_operator = "GreaterThanThreshold"
threshold = 85
period = 60
evaluation_periods = 5
dimensions = {
ClusterName = module.ecs_cluster_service.cluster_name
ServiceName = module.ecs_cluster_service.service_name
}
alarm_actions = [aws_sns_topic.oncall.arn]
}
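The module's built-in policy tracks average CPU only. If request-driven scaling is also wanted, one option is to attach a second target-tracking policy from the calling stack via the autoscaling_target_resource_id output. This is a sketch: aws_lb.public is a hypothetical load balancer in front of the target group, and the target value is illustrative.

```hcl
# Sketch: extra target-tracking policy on ALB requests per target.
# aws_lb.public is a hypothetical load balancer fronting the target group.
resource "aws_appautoscaling_policy" "checkout_requests" {
  name               = "checkout-api-request-tracking"
  policy_type        = "TargetTrackingScaling"
  resource_id        = module.ecs_cluster_service.autoscaling_target_resource_id
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ALBRequestCountPerTarget"
      # Format: <load balancer ARN suffix>/<target group ARN suffix>
      resource_label = "${aws_lb.public.arn_suffix}/${aws_lb_target_group.checkout.arn_suffix}"
    }

    target_value       = 500 # requests per target; illustrative, tune per service
    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}
```

With more than one target-tracking policy on the same target, Application Auto Scaling scales out when any policy calls for it and scales in only when all of them agree, so adding this policy should not cause the CPU policy to shrink the service early.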
With Terragrunt
Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.
1. Root config — live/terragrunt.hcl (inherited by every module):
remote_state {
backend = "s3"
generate = { path = "backend.tf", if_exists = "overwrite" }
config = {
# ...s3 state bucket/container + key per path...
}
}
2. Module config — live/prod/ecs/terragrunt.hcl:
include "root" {
path = find_in_parent_folders()
}
terraform {
source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-ecs?ref=v1.0.0"
}
inputs = {
name_prefix = "..."
service_name = "..."
aws_region = "..."
container_image = "..."
execution_role_arn = "..."
subnet_ids = ["...", "..."]
security_group_ids = ["...", "..."]
target_group_arn = "..."
}
3. Deploy one environment, or roll out all modules together:
(cd live/prod/ecs && terragrunt apply) # this module only
(cd live/prod && terragrunt run-all apply) # every module under live/prod
Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
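The root config shown earlier only covers remote state. To keep the provider in one place as well, a generate block along these lines could sit next to remote_state in live/terragrunt.hcl. Treat it as a sketch; the region is a placeholder value.

```hcl
# live/terragrunt.hcl (root): write provider.tf into every module directory.
# The region below is illustrative; use the one your environments run in.
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region = "ap-south-1"
}
EOF
}
```

The overwrite_terragrunt setting only replaces files that Terragrunt itself generated, which avoids clobbering a hand-written provider.tf that a module might already contain.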
Inputs
| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| name_prefix | string | — | Yes | Prefix for cluster/service names (team or environment). |
| service_name | string | — | Yes | Logical service name; also the primary container name (validated). |
| aws_region | string | — | Yes | Region for the awslogs log driver. |
| container_image | string | — | Yes | Fully qualified image, including tag or digest. |
| container_port | number | 8080 | No | Container listen port, wired to the target group. |
| task_cpu | number | 512 | No | Task-level CPU units (validated Fargate value). |
| task_memory | number | 1024 | No | Task-level memory (MiB); must form a valid Fargate combo. |
| cpu_architecture | string | ARM64 | No | X86_64 or ARM64 (Graviton). |
| desired_count | number | 2 | No | Initial task count; ignored after create. |
| execution_role_arn | string | — | Yes | ECS task execution role ARN. |
| task_role_arn | string | null | No | IAM role assumed by the running container. |
| subnet_ids | list(string) | — | Yes | Subnets for awsvpc ENIs (>= 2, validated). |
| security_group_ids | list(string) | — | Yes | Security groups on task ENIs. |
| assign_public_ip | bool | false | No | Assign public IP to tasks. |
| target_group_arn | string | — | Yes | ALB/NLB target group ARN. |
| health_check_grace_period_seconds | number | 60 | No | LB health-check grace period for new tasks. |
| container_health_check_command | string | curl … /healthz | No | Container-level health check command. |
| deployment_minimum_healthy_percent | number | 100 | No | Min healthy percent during deploys. |
| deployment_maximum_percent | number | 200 | No | Max percent during deploys. |
| enable_execute_command | bool | true | No | Enable ECS Exec (SSM) into tasks. |
| container_insights | bool | true | No | Enable Container Insights on the cluster. |
| use_fargate_spot | bool | false | No | Use FARGATE_SPOT capacity. |
| log_retention_days | number | 30 | No | CloudWatch log retention. |
| log_kms_key_arn | string | null | No | KMS key for log encryption. |
| environment | map(string) | {} | No | Plain env vars for the container. |
| secrets | map(string) | {} | No | Env var name => Secrets Manager/SSM ARN. |
| enable_autoscaling | bool | true | No | Provision autoscaling target + CPU policy. |
| autoscaling_min_capacity | number | 2 | No | Minimum task count. |
| autoscaling_max_capacity | number | 10 | No | Maximum task count. |
| autoscaling_cpu_target | number | 60 | No | Target average CPU percent. |
| tags | map(string) | {} | No | Extra tags merged onto all resources. |
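Note that task_cpu is validated on its own, while the task_cpu/task_memory pairing is only checked by AWS at apply time. One way to catch a bad pair at plan time would be a precondition inside the module, sketched below. The local and resource names are hypothetical, and the memory ranges reflect the Fargate task-size table as documented by AWS at the time of writing.

```hcl
# Sketch: fail the plan when task_cpu/task_memory is not a valid Fargate pair.
# Would live inside the module next to main.tf; names are hypothetical.
locals {
  # CPU units => allowed memory values in MiB.
  fargate_memory_by_cpu = tomap({
    "256"   = tolist([512, 1024, 2048])
    "512"   = range(1024, 4096 + 1, 1024)
    "1024"  = range(2048, 8192 + 1, 1024)
    "2048"  = range(4096, 16384 + 1, 1024)
    "4096"  = range(8192, 30720 + 1, 1024)
    "8192"  = range(16384, 61440 + 1, 4096)
    "16384" = range(32768, 122880 + 1, 8192)
  })
}

resource "terraform_data" "fargate_size_guard" {
  lifecycle {
    precondition {
      condition = contains(
        local.fargate_memory_by_cpu[tostring(var.task_cpu)],
        var.task_memory
      )
      error_message = "task_memory is not a valid Fargate memory size for the chosen task_cpu."
    }
  }
}
```

A precondition on terraform_data is used here because the module pins Terraform >= 1.5.0, and variable validation blocks that reference other variables need a newer release.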
Outputs
| Name | Description |
|---|---|
| cluster_id | ECS cluster ID (ARN). |
| cluster_arn | ECS cluster ARN. |
| cluster_name | ECS cluster name (used in CloudWatch dimensions). |
| service_id | ECS service ARN/ID. |
| service_name | ECS service name. |
| task_definition_arn | Full ARN (with revision) of the active task definition. |
| task_definition_family | Task definition family name. |
| log_group_name | CloudWatch log group receiving container logs. |
| autoscaling_target_resource_id | Application Auto Scaling resource ID, or null when disabled. |
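As one way to put log_group_name to work, a metric filter could count error lines for a dashboard or alarm. This sketch assumes the application logs the literal token ERROR; the filter name, metric name, and namespace are made up for illustration.

```hcl
# Sketch: turn ERROR log lines into a CloudWatch metric.
# Names and namespace are illustrative, not part of the module.
resource "aws_cloudwatch_log_metric_filter" "checkout_errors" {
  name           = "checkout-api-error-count"
  log_group_name = module.ecs_cluster_service.log_group_name
  pattern        = "ERROR"

  metric_transformation {
    name      = "CheckoutApiErrorCount"
    namespace = "Custom/Checkout"
    value     = "1"
  }
}
```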
Enterprise scenario
A payments platform runs roughly 40 microservices on Fargate across dev, staging, and prod accounts. Each service team instantiates this module from a thin per-service stack, passing only the image tag and the team’s IAM roles, so every service inherits the same guardrails: Container Insights for the SRE dashboards, deployment circuit breaker so a bad checkout-api:2026.06.1 rolls back automatically instead of paging the on-call at 2am, secrets pulled from Secrets Manager rather than baked into images, and ARM64 Graviton tasks that cut compute cost ~20% versus x86. When the platform team needs to raise the default log retention for PCI compliance, they bump one variable default and cut a new module tag — all 40 services pick it up on their next apply.
Best practices
- Pin images by digest in production, not floating tags like :latest. A mutable tag means two terraform apply runs can launch different code with no diff; a digest (@sha256:...) makes deployments deterministic and forces a visible task-definition revision.
- Keep deployment_circuit_breaker with rollback = true and set deployment_minimum_healthy_percent = 100 for customer-facing services so a failed rollout never reduces capacity below the current footprint.
- Run tasks in private subnets with assign_public_ip = false, reach the internet via NAT or VPC endpoints, and scope task security groups to only the ALB’s SG on the container port, never 0.0.0.0/0.
- Inject credentials via secrets, never environment. Values in environment are visible in the task definition and the console; secrets resolves Secrets Manager/SSM ARNs at launch and keeps them out of state in plaintext.
- Let autoscaling own desired_count. This module’s ignore_changes = [desired_count] prevents Terraform from fighting Application Auto Scaling and snapping the service back to its baseline on every plan.
- Prefer ARM64 Graviton and consider use_fargate_spot for non-critical / async workers to cut cost, but keep latency-sensitive request paths on on-demand Fargate to avoid Spot interruptions, and name resources ${name_prefix}-${service_name} so cluster, logs, and alarms line up across accounts.
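The security-group advice above could look roughly like this in the calling stack. It is a sketch under assumptions: var.vpc_id and aws_security_group.alb are hypothetical and expected to exist in the caller, and port 8080 matches the module's default container_port.

```hcl
# Sketch: task security group that only accepts traffic from the ALB.
# var.vpc_id and aws_security_group.alb are assumed to exist in the caller.
resource "aws_security_group" "checkout_tasks" {
  name_prefix = "checkout-tasks-"
  description = "Checkout API tasks"
  vpc_id      = var.vpc_id

  ingress {
    description     = "Container port, from the ALB security group only"
    from_port       = 8080
    to_port         = 8080
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }

  egress {
    description = "HTTPS out to ECR, CloudWatch Logs, and Secrets Manager"
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```

Services that talk to a database or other internal dependencies would need extra egress rules for those ports; the HTTPS rule here only covers the AWS APIs a task typically calls at startup.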