Terraform Module: AWS ECS Cluster & Service — Fargate workloads with rolling deploys and autoscaling baked in

Quick take — A reusable Terraform module for AWS ECS on Fargate: provisions an aws_ecs_cluster with Container Insights plus a load-balanced aws_ecs_service, task definition, target-tracking autoscaling, and CloudWatch logging. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "aws" {
  region = "us-east-1"
}

module "ecs" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-ecs?ref=v1.0.0"

  name_prefix        = "..."           # Prefix for cluster/service names (team or environment).
  service_name       = "..."           # Logical service name; also the primary container name (…
  aws_region         = "..."           # Region for the awslogs log driver.
  container_image    = "..."           # Fully qualified image, including tag or digest.
  execution_role_arn = "..."           # ECS task execution role ARN.
  subnet_ids         = ["...", "..."]  # Subnets for awsvpc ENIs (>= 2, validated).
  security_group_ids = ["...", "..."]  # Security groups on task ENIs.
  target_group_arn   = "..."           # ALB/NLB target group ARN.
}

Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.

What this module is

Amazon ECS (Elastic Container Service) is AWS’s native container orchestrator. A cluster is the logical boundary that capacity and services live in; a service is the long-running controller that keeps N copies of a task definition healthy, registers them behind a load balancer, and rolls out new revisions without dropping traffic. On their own, aws_ecs_cluster and aws_ecs_service look deceptively small — but a production service is never just those two resources. You also need a task definition with the right requires_compatibilities, a CloudWatch log group, an awsvpc network configuration with the correct security groups, an ALB target group wired to the container port, task and execution IAM roles, and an Application Auto Scaling target with a scaling policy.

This module wraps most of that into one opinionated, Fargate-first unit. You pass a container image, CPU/memory, an execution role ARN, a target group ARN, and subnet/SG IDs; the module creates the cluster, task definition, service, log group, and autoscaling, and returns a running, autoscaling, log-emitting service. The IAM roles, target group, and security groups stay with the caller. Wrapping it as a module means every team ships ECS the same way (ECS Exec via enable_execute_command for debugging, circuit-breaker rollback on failed deploys, Container Insights on by default) instead of copy-pasting a 200-line service block and quietly forgetting half of it.
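The execution role is one of the pieces the caller brings. A minimal sketch of what that role might look like follows. The resource names (ecs_execution, read_secrets) and the var.secret_arns input are hypothetical, and the inline policy only matters when the secrets input is used.

```hcl
# Hypothetical input: ARNs of the Secrets Manager secrets the service reads.
variable "secret_arns" {
  type    = list(string)
  default = []
}

# Trust policy: let ECS tasks assume the role.
data "aws_iam_policy_document" "ecs_tasks_assume" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "ecs_execution" {
  name               = "ecs-execution" # assumed name
  assume_role_policy = data.aws_iam_policy_document.ecs_tasks_assume.json
}

# AWS-managed policy covering ECR pulls and CloudWatch Logs writes.
resource "aws_iam_role_policy_attachment" "ecs_execution" {
  role       = aws_iam_role.ecs_execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}

# The managed policy does not grant secret reads, so add them explicitly.
data "aws_iam_policy_document" "read_secrets" {
  count = length(var.secret_arns) > 0 ? 1 : 0

  statement {
    actions   = ["secretsmanager:GetSecretValue"]
    resources = var.secret_arns
  }
}

resource "aws_iam_role_policy" "read_secrets" {
  count  = length(var.secret_arns) > 0 ? 1 : 0
  name   = "read-secrets"
  role   = aws_iam_role.ecs_execution.id
  policy = data.aws_iam_policy_document.read_secrets[0].json
}
```

The role's ARN (aws_iam_role.ecs_execution.arn) is what would be passed as execution_role_arn. Secrets stored as SSM parameters would need ssm:GetParameters instead of the Secrets Manager action shown here.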

When to use it

You run stateless HTTP/gRPC services (APIs, web frontends, BFFs) on Fargate and want load-balanced, autoscaling deployments without managing EC2 hosts.
You want safe rollouts: deployment circuit breaker with automatic rollback, plus health-check-gated traffic shifting.
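You need per-service autoscaling driven by CPU utilization rather than a fixed task count. The module ships a CPU target-tracking policy; scaling on ALB request count would have to be attached alongside it.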
You are standardizing many microservices and want a single contract for logging, secrets injection, and ECS Exec debugging.
Reach for something else when: you need Kubernetes-native primitives (use EKS), the workload is a short-lived batch job (use ECS RunTask/Step Functions or AWS Batch, not a long-running service), or you need GPU/host-level access that only EC2 launch type provides (this module is Fargate-only by design).
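The module itself only ships the CPU policy. If request-count scaling is wanted, one option is to attach a second target-tracking policy next to the module, reusing its autoscaling_target_resource_id output. This is a sketch rather than part of the module: module.ecs refers to the Quickstart instance, aws_lb.public and aws_lb_target_group.app are hypothetical resources in the calling stack, and the target of 1000 requests per task is a placeholder.

```hcl
# Assumes enable_autoscaling = true (the default), so the output is not null.
resource "aws_appautoscaling_policy" "requests" {
  name               = "requests-per-target-tracking" # assumed name
  policy_type        = "TargetTrackingScaling"
  resource_id        = module.ecs.autoscaling_target_resource_id
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ALBRequestCountPerTarget"
      # Required format: <load balancer arn_suffix>/<target group arn_suffix>
      resource_label = "${aws_lb.public.arn_suffix}/${aws_lb_target_group.app.arn_suffix}"
    }

    target_value       = 1000 # placeholder; derive from load tests
    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}
```

With both policies attached, Application Auto Scaling scales out when either metric calls for it and scales in only when both agree.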

Module structure

terraform-module-aws-ecs/
├── versions.tf      # provider + Terraform version pins
├── main.tf          # cluster, task def, service, log group, autoscaling
├── variables.tf     # var-driven inputs with validation
└── outputs.tf       # cluster + service identifiers and ARNs

versions.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

main.tf

locals {
  name = "${var.name_prefix}-${var.service_name}"

  tags = merge(
    {
      "Service"   = var.service_name
      "ManagedBy" = "terraform"
      "Module"    = "terraform-module-aws-ecs"
    },
    var.tags
  )
}

# ---------------------------------------------------------------------------
# CloudWatch log group for container stdout/stderr
# ---------------------------------------------------------------------------
resource "aws_cloudwatch_log_group" "this" {
  name              = "/ecs/${local.name}"
  retention_in_days = var.log_retention_days
  kms_key_id        = var.log_kms_key_arn
  tags              = local.tags
}

# ---------------------------------------------------------------------------
# Cluster with Container Insights
# ---------------------------------------------------------------------------
resource "aws_ecs_cluster" "this" {
  name = local.name

  setting {
    name  = "containerInsights"
    value = var.container_insights ? "enabled" : "disabled"
  }

  tags = local.tags
}

resource "aws_ecs_cluster_capacity_providers" "this" {
  cluster_name       = aws_ecs_cluster.this.name
  capacity_providers = ["FARGATE", "FARGATE_SPOT"]

  default_capacity_provider_strategy {
    capacity_provider = var.use_fargate_spot ? "FARGATE_SPOT" : "FARGATE"
    weight            = 1
    base              = var.use_fargate_spot ? 0 : 1
  }
}

# ---------------------------------------------------------------------------
# Task definition (awsvpc + Fargate)
# ---------------------------------------------------------------------------
resource "aws_ecs_task_definition" "this" {
  family                   = local.name
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = var.task_cpu
  memory                   = var.task_memory
  execution_role_arn       = var.execution_role_arn
  task_role_arn            = var.task_role_arn

  runtime_platform {
    operating_system_family = "LINUX"
    cpu_architecture        = var.cpu_architecture
  }

  container_definitions = jsonencode([
    {
      name      = var.service_name
      image     = var.container_image
      essential = true
      cpu       = var.task_cpu
      memory    = var.task_memory

      portMappings = [
        {
          containerPort = var.container_port
          protocol      = "tcp"
        }
      ]

      environment = [
        for k, v in var.environment : { name = k, value = v }
      ]

      secrets = [
        for k, arn in var.secrets : { name = k, valueFrom = arn }
      ]

      logConfiguration = {
        logDriver = "awslogs"
        options = {
          "awslogs-group"         = aws_cloudwatch_log_group.this.name
          "awslogs-region"        = var.aws_region
          "awslogs-stream-prefix" = var.service_name
        }
      }

      healthCheck = {
        command     = ["CMD-SHELL", var.container_health_check_command]
        interval    = 30
        timeout     = 5
        retries     = 3
        startPeriod = 30
      }
    }
  ])

  tags = local.tags
}

# ---------------------------------------------------------------------------
# Service
# ---------------------------------------------------------------------------
resource "aws_ecs_service" "this" {
  name            = local.name
  cluster         = aws_ecs_cluster.this.id
  task_definition = aws_ecs_task_definition.this.arn
  desired_count   = var.desired_count
  launch_type     = var.use_fargate_spot ? null : "FARGATE"

  enable_execute_command = var.enable_execute_command

  health_check_grace_period_seconds = var.health_check_grace_period_seconds

  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }

  deployment_maximum_percent         = var.deployment_maximum_percent
  deployment_minimum_healthy_percent = var.deployment_minimum_healthy_percent

  dynamic "capacity_provider_strategy" {
    for_each = var.use_fargate_spot ? [1] : []
    content {
      capacity_provider = "FARGATE_SPOT"
      weight            = 1
    }
  }

  network_configuration {
    subnets          = var.subnet_ids
    security_groups  = var.security_group_ids
    assign_public_ip = var.assign_public_ip
  }

  load_balancer {
    target_group_arn = var.target_group_arn
    container_name   = var.service_name
    container_port   = var.container_port
  }

  lifecycle {
    ignore_changes = [desired_count]
  }

  tags = local.tags

  depends_on = [aws_ecs_cluster_capacity_providers.this]
}

# ---------------------------------------------------------------------------
# Target-tracking autoscaling
# ---------------------------------------------------------------------------
resource "aws_appautoscaling_target" "this" {
  count = var.enable_autoscaling ? 1 : 0

  max_capacity       = var.autoscaling_max_capacity
  min_capacity       = var.autoscaling_min_capacity
  resource_id        = "service/${aws_ecs_cluster.this.name}/${aws_ecs_service.this.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

resource "aws_appautoscaling_policy" "cpu" {
  count = var.enable_autoscaling ? 1 : 0

  name               = "${local.name}-cpu-tracking"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.this[0].resource_id
  scalable_dimension = aws_appautoscaling_target.this[0].scalable_dimension
  service_namespace  = aws_appautoscaling_target.this[0].service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    target_value       = var.autoscaling_cpu_target
    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}

variables.tf

variable "name_prefix" {
  description = "Prefix for cluster/service names (e.g. team or environment)."
  type        = string
}

variable "service_name" {
  description = "Logical service name; also the primary container name."
  type        = string

  validation {
    condition     = can(regex("^[a-z0-9][a-z0-9-]{0,30}$", var.service_name))
    error_message = "service_name must be lowercase alphanumeric/hyphens, max 31 chars."
  }
}

variable "aws_region" {
  description = "Region for the awslogs log driver."
  type        = string
}

variable "container_image" {
  description = "Fully qualified container image, including tag or digest."
  type        = string
}

variable "container_port" {
  description = "Port the container listens on (wired to the target group)."
  type        = number
  default     = 8080
}

variable "task_cpu" {
  description = "Task-level CPU units. Must be a valid Fargate combo with task_memory."
  type        = number
  default     = 512

  validation {
    condition     = contains([256, 512, 1024, 2048, 4096, 8192, 16384], var.task_cpu)
    error_message = "task_cpu must be a valid Fargate CPU value (256, 512, 1024, ...)."
  }
}

variable "task_memory" {
  description = "Task-level memory (MiB). Must be a valid Fargate combo with task_cpu."
  type        = number
  default     = 1024
}

variable "cpu_architecture" {
  description = "Fargate CPU architecture: X86_64 or ARM64 (Graviton, cheaper)."
  type        = string
  default     = "ARM64"

  validation {
    condition     = contains(["X86_64", "ARM64"], var.cpu_architecture)
    error_message = "cpu_architecture must be X86_64 or ARM64."
  }
}

variable "desired_count" {
  description = "Initial task count. Later changes are ignored (lifecycle ignore_changes), even with autoscaling disabled."
  type        = number
  default     = 2
}

variable "execution_role_arn" {
  description = "ECS task execution role ARN (pulls images, writes logs, reads secrets)."
  type        = string
}

variable "task_role_arn" {
  description = "IAM role assumed by the running container for AWS API calls."
  type        = string
  default     = null
}

variable "subnet_ids" {
  description = "Subnet IDs for the awsvpc ENIs (private subnets recommended)."
  type        = list(string)

  validation {
    condition     = length(var.subnet_ids) >= 2
    error_message = "Provide at least two subnets across AZs for high availability."
  }
}

variable "security_group_ids" {
  description = "Security groups attached to task ENIs."
  type        = list(string)
}

variable "assign_public_ip" {
  description = "Assign a public IP to tasks (only for public subnets without NAT)."
  type        = bool
  default     = false
}

variable "target_group_arn" {
  description = "ALB/NLB target group ARN tasks register into."
  type        = string
}

variable "health_check_grace_period_seconds" {
  description = "Grace period before the LB health check can mark a new task unhealthy."
  type        = number
  default     = 60
}

variable "container_health_check_command" {
  description = "Shell command for the container-level health check. The default assumes curl is in the image and port 8080; override it when container_port changes."
  type        = string
  default     = "curl -f http://localhost:8080/healthz || exit 1"
}

variable "deployment_minimum_healthy_percent" {
  description = "Minimum percent of tasks kept healthy during a deploy."
  type        = number
  default     = 100
}

variable "deployment_maximum_percent" {
  description = "Maximum percent of tasks allowed during a deploy."
  type        = number
  default     = 200
}

variable "enable_execute_command" {
  description = "Enable ECS Exec (SSM) into running tasks for debugging. Requires a task_role_arn with ssmmessages permissions."
  type        = bool
  default     = true
}

variable "container_insights" {
  description = "Enable CloudWatch Container Insights on the cluster."
  type        = bool
  default     = true
}

variable "use_fargate_spot" {
  description = "Run tasks on FARGATE_SPOT capacity instead of on-demand FARGATE."
  type        = bool
  default     = false
}

variable "log_retention_days" {
  description = "CloudWatch log retention in days."
  type        = number
  default     = 30
}

variable "log_kms_key_arn" {
  description = "Optional KMS key ARN to encrypt the log group."
  type        = string
  default     = null
}

variable "environment" {
  description = "Plain environment variables injected into the container."
  type        = map(string)
  default     = {}
}

variable "secrets" {
  description = "Secrets map: env var name => Secrets Manager/SSM Parameter ARN."
  type        = map(string)
  default     = {}
}

variable "enable_autoscaling" {
  description = "Provision Application Auto Scaling target + CPU policy."
  type        = bool
  default     = true
}

variable "autoscaling_min_capacity" {
  description = "Minimum task count for autoscaling."
  type        = number
  default     = 2
}

variable "autoscaling_max_capacity" {
  description = "Maximum task count for autoscaling."
  type        = number
  default     = 10
}

variable "autoscaling_cpu_target" {
  description = "Target average CPU utilization percent for scaling."
  type        = number
  default     = 60
}

variable "tags" {
  description = "Additional tags merged onto all resources."
  type        = map(string)
  default     = {}
}
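The task_memory variable is described as needing to form a valid Fargate combination with task_cpu, but nothing above enforces that. One way to catch a bad pair at plan time is a precondition, sketched below. A validation block that references another variable needs a newer Terraform (1.9+) than the module's >= 1.5.0 pin, hence the precondition. The local and resource names are hypothetical, and the memory table reflects Fargate's published CPU/memory combinations as best recalled here, so verify it against the AWS documentation before relying on it.

```hcl
locals {
  # Valid task memory values (MiB) per Fargate task CPU value.
  fargate_memory_by_cpu = {
    "256"   = tolist([512, 1024, 2048])
    "512"   = range(1024, 4096 + 1, 1024)    # 1-4 GiB, 1 GiB steps
    "1024"  = range(2048, 8192 + 1, 1024)    # 2-8 GiB, 1 GiB steps
    "2048"  = range(4096, 16384 + 1, 1024)   # 4-16 GiB, 1 GiB steps
    "4096"  = range(8192, 30720 + 1, 1024)   # 8-30 GiB, 1 GiB steps
    "8192"  = range(16384, 61440 + 1, 4096)  # 16-60 GiB, 4 GiB steps
    "16384" = range(32768, 122880 + 1, 8192) # 32-120 GiB, 8 GiB steps
  }
}

# Creates no infrastructure; exists only to carry the precondition.
resource "terraform_data" "fargate_size_guard" {
  lifecycle {
    precondition {
      condition = contains(
        local.fargate_memory_by_cpu[tostring(var.task_cpu)],
        var.task_memory
      )
      error_message = "task_memory is not a valid Fargate memory value for the chosen task_cpu."
    }
  }
}
```

The existing task_cpu validation guarantees the map lookup always finds a key. The module defaults (512 CPU, 1024 MiB) pass this check.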

outputs.tf

output "cluster_id" {
  description = "ECS cluster ID (ARN)."
  value       = aws_ecs_cluster.this.id
}

output "cluster_arn" {
  description = "ECS cluster ARN."
  value       = aws_ecs_cluster.this.arn
}

output "cluster_name" {
  description = "ECS cluster name."
  value       = aws_ecs_cluster.this.name
}

output "service_id" {
  description = "ECS service ARN/ID."
  value       = aws_ecs_service.this.id
}

output "service_name" {
  description = "ECS service name."
  value       = aws_ecs_service.this.name
}

output "task_definition_arn" {
  description = "Full ARN (with revision) of the active task definition."
  value       = aws_ecs_task_definition.this.arn
}

output "task_definition_family" {
  description = "Task definition family name."
  value       = aws_ecs_task_definition.this.family
}

output "log_group_name" {
  description = "CloudWatch log group receiving container logs."
  value       = aws_cloudwatch_log_group.this.name
}

output "autoscaling_target_resource_id" {
  description = "Application Auto Scaling resource ID, or null if disabled."
  value       = try(aws_appautoscaling_target.this[0].resource_id, null)
}

How to use it

module "ecs_cluster_service" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-ecs?ref=v1.0.0"

  name_prefix  = "prod"
  service_name = "checkout-api"
  aws_region   = "ap-south-1"

  container_image  = "123456789012.dkr.ecr.ap-south-1.amazonaws.com/checkout-api:2026.06.1"
  container_port   = 8080
  task_cpu         = 1024
  task_memory      = 2048
  cpu_architecture = "ARM64"

  execution_role_arn = aws_iam_role.ecs_execution.arn
  task_role_arn      = aws_iam_role.checkout_task.arn

  subnet_ids         = module.network.private_subnet_ids
  security_group_ids = [aws_security_group.checkout_tasks.id]
  target_group_arn   = aws_lb_target_group.checkout.arn

  environment = {
    LOG_LEVEL  = "info"
    APP_REGION = "ap-south-1"
  }

  secrets = {
    DB_PASSWORD    = aws_secretsmanager_secret.db.arn
    STRIPE_API_KEY = aws_secretsmanager_secret.stripe.arn
  }

  enable_autoscaling       = true
  autoscaling_min_capacity = 3
  autoscaling_max_capacity = 20
  autoscaling_cpu_target   = 55

  tags = {
    Environment = "prod"
    CostCenter  = "payments"
  }
}

# Downstream: alarm on the service's average CPU, using the cluster_name and
# service_name outputs as the CloudWatch dimensions.
resource "aws_cloudwatch_metric_alarm" "checkout_high_cpu" {
  alarm_name          = "checkout-api-cpu-high"
  namespace           = "AWS/ECS"
  metric_name         = "CPUUtilization"
  statistic           = "Average"
  comparison_operator = "GreaterThanThreshold"
  threshold           = 85
  period              = 60
  evaluation_periods  = 5

  dimensions = {
    ClusterName = module.ecs_cluster_service.cluster_name
    ServiceName = module.ecs_cluster_service.service_name
  }

  alarm_actions = [aws_sns_topic.oncall.arn]
}
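The example references aws_lb_target_group.checkout without showing it. Because Fargate tasks use awsvpc networking, the target group has to register targets by IP. A sketch of a compatible target group follows; module.network.vpc_id is a hypothetical output, and the health check path and timings are assumptions to adjust per service.

```hcl
resource "aws_lb_target_group" "checkout" {
  name        = "checkout-api"
  port        = 8080
  protocol    = "HTTP"
  vpc_id      = module.network.vpc_id # hypothetical output of the network module
  target_type = "ip"                  # awsvpc tasks register by ENI IP, not instance ID

  health_check {
    path                = "/healthz" # assumed; match the application's health endpoint
    matcher             = "200"
    interval            = 15
    healthy_threshold   = 2
    unhealthy_threshold = 3
  }

  # Shorter than the 300-second default so deploys drain faster; tune to request duration.
  deregistration_delay = 30
}
```

The target group also needs to be attached to a load balancer listener before the service is created, otherwise ECS rejects the service with an error about the target group having no associated load balancer.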

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config — live/terragrunt.hcl (inherited by every module):

remote_state {
  backend = "s3"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...S3 state bucket + key per path...
  }
}

2. Module config — live/prod/ecs/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-ecs?ref=v1.0.0"
}

inputs = {
  name_prefix        = "..."
  service_name       = "..."
  aws_region         = "..."
  container_image    = "..."
  execution_role_arn = "..."
  subnet_ids         = ["...", "..."]
  security_group_ids = ["...", "..."]
  target_group_arn   = "..."
}

3. Deploy one environment, or roll out all modules together:

(cd live/prod/ecs && terragrunt apply)      # this module only
(cd live/prod && terragrunt run-all apply)  # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
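The root config shown in step 1 covers only the backend. To keep the provider in the same place, a generate block could sit next to remote_state, roughly as follows. The region value is an assumption and should match the aws_region input.

```hcl
# live/terragrunt.hcl (root), alongside the remote_state block
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region = "ap-south-1" # assumed; keep in sync with the aws_region input
}
EOF
}
```

Terragrunt writes provider.tf into each module's working directory before running Terraform, so the module source itself stays free of provider configuration.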

Inputs

Name	Type	Default	Required	Description
name_prefix	string	—	Yes	Prefix for cluster/service names (team or environment).
service_name	string	—	Yes	Logical service name; also the primary container name (validated).
aws_region	string	—	Yes	Region for the awslogs log driver.
container_image	string	—	Yes	Fully qualified image, including tag or digest.
container_port	number	8080	No	Container listen port, wired to the target group.
task_cpu	number	512	No	Task-level CPU units (validated Fargate value).
task_memory	number	1024	No	Task-level memory (MiB); must form a valid Fargate combo.
cpu_architecture	string	ARM64	No	X86_64 or ARM64 (Graviton).
desired_count	number	2	No	Initial task count; ignored after create.
execution_role_arn	string	—	Yes	ECS task execution role ARN.
task_role_arn	string	null	No	IAM role assumed by the running container.
subnet_ids	list(string)	—	Yes	Subnets for awsvpc ENIs (>= 2, validated).
security_group_ids	list(string)	—	Yes	Security groups on task ENIs.
assign_public_ip	bool	false	No	Assign public IP to tasks.
target_group_arn	string	—	Yes	ALB/NLB target group ARN.
health_check_grace_period_seconds	number	60	No	LB health-check grace period for new tasks.
container_health_check_command	string	curl … /healthz	No	Container-level health check command.
deployment_minimum_healthy_percent	number	100	No	Min healthy percent during deploys.
deployment_maximum_percent	number	200	No	Max percent during deploys.
enable_execute_command	bool	true	No	Enable ECS Exec (SSM) into tasks.
container_insights	bool	true	No	Enable Container Insights on the cluster.
use_fargate_spot	bool	false	No	Use FARGATE_SPOT capacity.
log_retention_days	number	30	No	CloudWatch log retention.
log_kms_key_arn	string	null	No	KMS key for log encryption.
environment	map(string)	{}	No	Plain env vars for the container.
secrets	map(string)	{}	No	Env var name => Secrets Manager/SSM ARN.
enable_autoscaling	bool	true	No	Provision autoscaling target + CPU policy.
autoscaling_min_capacity	number	2	No	Minimum task count.
autoscaling_max_capacity	number	10	No	Maximum task count.
autoscaling_cpu_target	number	60	No	Target average CPU percent.
tags	map(string)	{}	No	Extra tags merged onto all resources.

Outputs

Name	Description
cluster_id	ECS cluster ID (ARN).
cluster_arn	ECS cluster ARN.
cluster_name	ECS cluster name (used in CloudWatch dimensions).
service_id	ECS service ARN/ID.
service_name	ECS service name.
task_definition_arn	Full ARN (with revision) of the active task definition.
task_definition_family	Task definition family name.
log_group_name	CloudWatch log group receiving container logs.
autoscaling_target_resource_id	Application Auto Scaling resource ID, or null when disabled.

Enterprise scenario

A payments platform runs roughly 40 microservices on Fargate across dev, staging, and prod accounts. Each service team instantiates this module from a thin per-service stack, passing only the image tag and the team’s IAM roles, so every service inherits the same guardrails: Container Insights for the SRE dashboards, deployment circuit breaker so a bad checkout-api:2026.06.1 rolls back automatically instead of paging the on-call at 2am, secrets pulled from Secrets Manager rather than baked into images, and ARM64 Graviton tasks that cut compute cost ~20% versus x86. When the platform team needs to raise the default log retention for PCI compliance, they bump one variable default and cut a new module tag; each of the 40 services picks it up on the first apply after its pinned ?ref is moved to that tag.

Best practices

Pin images by digest in production, not floating tags like :latest. A mutable tag means two terraform apply runs can launch different code with no diff; a digest (@sha256:...) makes deployments deterministic and forces a visible task-definition revision.
Keep deployment_circuit_breaker with rollback = true and set deployment_minimum_healthy_percent = 100 for customer-facing services so a failed rollout never reduces capacity below the current footprint.
Run tasks in private subnets with assign_public_ip = false, reach the internet via NAT or VPC endpoints, and scope task security groups to only the ALB’s SG on the container port — never 0.0.0.0/0.
Inject credentials via secrets, never environment. Values in environment are stored in plaintext in the task definition, the console, and Terraform state; secrets stores only the Secrets Manager/SSM ARN, and ECS resolves the value at task launch using the execution role.
Let autoscaling own desired_count. This module’s ignore_changes = [desired_count] prevents Terraform from fighting Application Auto Scaling and snapping the service back to its baseline on every apply.
Prefer ARM64 Graviton and consider use_fargate_spot for non-critical / async workers to cut cost, but keep latency-sensitive request paths on on-demand Fargate to avoid Spot interruptions.
Name resources ${name_prefix}-${service_name} so cluster, logs, and alarms line up across accounts.
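Two of these practices, digest pinning and security group scoping, can be sketched in the calling stack as follows. The repository name, tag, and aws_security_group.alb are hypothetical; aws_security_group.checkout_tasks is the task security group from the usage example.

```hcl
# Resolve a tag to its immutable digest at plan time.
data "aws_ecr_repository" "checkout" {
  name = "checkout-api"
}

data "aws_ecr_image" "checkout" {
  repository_name = data.aws_ecr_repository.checkout.name
  image_tag       = "2026.06.1"
}

locals {
  # Yields <repository_url>@sha256:..., suitable for the container_image input.
  checkout_image = "${data.aws_ecr_repository.checkout.repository_url}@${data.aws_ecr_image.checkout.image_digest}"
}

# Allow only the ALB's security group to reach the tasks on the container port.
resource "aws_vpc_security_group_ingress_rule" "alb_to_tasks" {
  security_group_id            = aws_security_group.checkout_tasks.id
  referenced_security_group_id = aws_security_group.alb.id # hypothetical ALB security group
  ip_protocol                  = "tcp"
  from_port                    = 8080
  to_port                      = 8080
}
```

Passing local.checkout_image as container_image means a re-pushed tag shows up as a visible change in the plan rather than silently altering what new tasks pull.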