
Terraform Module: AWS ECS Cluster & Service — Fargate workloads with rolling deploys and autoscaling baked in

Quick take — A reusable Terraform module for AWS ECS on Fargate: provisions an aws_ecs_cluster with Container Insights plus a load-balanced aws_ecs_service, task definition, target-tracking autoscaling, and CloudWatch logging. New here? Jump to the Quickstart below to deploy it in minutes; read on for how it works and when to reach for it.

Quickstart (copy-paste)

Minimal, runnable configuration — drop this in a .tf file and fill in the "..." placeholders (each required input is commented):

provider "aws" {
  region = "us-east-1"
}

module "ecs" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-ecs?ref=v1.0.0"

  name_prefix        = "..."           # Prefix for cluster/service names (team or environment).
  service_name       = "..."           # Logical service name; also the primary container name (…
  aws_region         = "..."           # Region for the awslogs log driver.
  container_image    = "..."           # Fully qualified image, including tag or digest.
  execution_role_arn = "..."           # ECS task execution role ARN.
  subnet_ids         = ["...", "..."]  # Subnets for awsvpc ENIs (>= 2, validated).
  security_group_ids = ["...", "..."]  # Security groups on task ENIs.
  target_group_arn   = "..."           # ALB/NLB target group ARN.
}

Then terraform init && terraform apply. Every other input has a sensible default — see Inputs below to override behaviour.

What this module is

Amazon ECS (Elastic Container Service) is AWS’s native container orchestrator. A cluster is the logical boundary that capacity and services live in; a service is the long-running controller that keeps N copies of a task definition healthy, registers them behind a load balancer, and rolls out new revisions without dropping traffic. On their own, aws_ecs_cluster and aws_ecs_service look deceptively small — but a production service is never just those two resources. You also need a task definition with the right requires_compatibilities, a CloudWatch log group, an awsvpc network configuration with the correct security groups, an ALB target group wired to the container port, task and execution IAM roles, and an Application Auto Scaling target with a scaling policy.

This module wraps most of that into one opinionated, Fargate-first unit. You pass a container image, CPU/memory, a target group ARN, IAM role ARNs, and subnet/SG IDs; the module creates the cluster, task definition, log group, service, and autoscaling, and returns a running, autoscaling, log-emitting service. The target group, security groups, and IAM roles themselves are created outside the module and passed in. Wrapping it as a module means every team ships ECS the same way (ECS Exec via enable_execute_command for debugging, circuit-breaker rollback on failed deploys, Container Insights on by default) instead of copy-pasting a 200-line service block and quietly forgetting half of it.
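Because the module takes target_group_arn and execution_role_arn as inputs rather than creating them, the caller has to bring both. The following is a minimal sketch of those two prerequisites, not part of the module itself. The resource names, the vpc_id variable, and the /healthz path are assumptions chosen for illustration.

```hcl
# Assumption: the VPC ID comes from the caller's own network stack.
variable "vpc_id" {
  type = string
}

# Target group for Fargate tasks. awsvpc tasks register by ENI IP address,
# so target_type must be "ip" rather than the default "instance".
resource "aws_lb_target_group" "checkout" {
  name        = "checkout-api"
  port        = 8080
  protocol    = "HTTP"
  vpc_id      = var.vpc_id
  target_type = "ip"

  health_check {
    path    = "/healthz" # hypothetical health endpoint
    matcher = "200"
  }
}

# Trust policy that lets ECS tasks assume the execution role.
data "aws_iam_policy_document" "ecs_tasks_assume" {
  statement {
    actions = ["sts:AssumeRole"]

    principals {
      type        = "Service"
      identifiers = ["ecs-tasks.amazonaws.com"]
    }
  }
}

resource "aws_iam_role" "ecs_execution" {
  name               = "ecs-task-execution"
  assume_role_policy = data.aws_iam_policy_document.ecs_tasks_assume.json
}

# AWS-managed policy covering image pulls from ECR and writes to CloudWatch Logs.
resource "aws_iam_role_policy_attachment" "ecs_execution" {
  role       = aws_iam_role.ecs_execution.name
  policy_arn = "arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy"
}
```

If the secrets input is used, the execution role would additionally need read access to the referenced Secrets Manager secrets or SSM parameters, which the managed policy above does not grant.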

When to use it

Module structure

terraform-module-aws-ecs/
├── versions.tf      # provider + Terraform version pins
├── main.tf          # cluster, task def, service, log group, autoscaling
├── variables.tf     # var-driven inputs with validation
└── outputs.tf       # cluster + service identifiers and ARNs

versions.tf

terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

main.tf

locals {
  name = "${var.name_prefix}-${var.service_name}"

  tags = merge(
    {
      "Service"   = var.service_name
      "ManagedBy" = "terraform"
      "Module"    = "terraform-module-aws-ecs"
    },
    var.tags
  )
}

# ---------------------------------------------------------------------------
# CloudWatch log group for container stdout/stderr
# ---------------------------------------------------------------------------
resource "aws_cloudwatch_log_group" "this" {
  name              = "/ecs/${local.name}"
  retention_in_days = var.log_retention_days
  kms_key_id        = var.log_kms_key_arn
  tags              = local.tags
}

# ---------------------------------------------------------------------------
# Cluster with Container Insights
# ---------------------------------------------------------------------------
resource "aws_ecs_cluster" "this" {
  name = local.name

  setting {
    name  = "containerInsights"
    value = var.container_insights ? "enabled" : "disabled"
  }

  tags = local.tags
}

resource "aws_ecs_cluster_capacity_providers" "this" {
  cluster_name       = aws_ecs_cluster.this.name
  capacity_providers = ["FARGATE", "FARGATE_SPOT"]

  default_capacity_provider_strategy {
    capacity_provider = var.use_fargate_spot ? "FARGATE_SPOT" : "FARGATE"
    weight            = 1
    base              = var.use_fargate_spot ? 0 : 1
  }
}

# ---------------------------------------------------------------------------
# Task definition (awsvpc + Fargate)
# ---------------------------------------------------------------------------
resource "aws_ecs_task_definition" "this" {
  family                   = local.name
  requires_compatibilities = ["FARGATE"]
  network_mode             = "awsvpc"
  cpu                      = var.task_cpu
  memory                   = var.task_memory
  execution_role_arn       = var.execution_role_arn
  task_role_arn            = var.task_role_arn

  runtime_platform {
    operating_system_family = "LINUX"
    cpu_architecture        = var.cpu_architecture
  }

  container_definitions = jsonencode([
    {
      name      = var.service_name
      image     = var.container_image
      essential = true
      cpu       = var.task_cpu
      memory    = var.task_memory

      portMappings = [
        {
          containerPort = var.container_port
          protocol      = "tcp"
        }
      ]

      environment = [
        for k, v in var.environment : { name = k, value = v }
      ]

      secrets = [
        for k, arn in var.secrets : { name = k, valueFrom = arn }
      ]

      logConfiguration = {
        logDriver = "awslogs"
        options = {
          "awslogs-group"         = aws_cloudwatch_log_group.this.name
          "awslogs-region"        = var.aws_region
          "awslogs-stream-prefix" = var.service_name
        }
      }

      healthCheck = {
        command     = ["CMD-SHELL", var.container_health_check_command]
        interval    = 30
        timeout     = 5
        retries     = 3
        startPeriod = 30
      }
    }
  ])

  tags = local.tags
}

# ---------------------------------------------------------------------------
# Service
# ---------------------------------------------------------------------------
resource "aws_ecs_service" "this" {
  name            = local.name
  cluster         = aws_ecs_cluster.this.id
  task_definition = aws_ecs_task_definition.this.arn
  desired_count   = var.desired_count
  launch_type     = var.use_fargate_spot ? null : "FARGATE"

  enable_execute_command = var.enable_execute_command

  health_check_grace_period_seconds = var.health_check_grace_period_seconds

  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }

  deployment_maximum_percent         = var.deployment_maximum_percent
  deployment_minimum_healthy_percent = var.deployment_minimum_healthy_percent

  dynamic "capacity_provider_strategy" {
    for_each = var.use_fargate_spot ? [1] : []
    content {
      capacity_provider = "FARGATE_SPOT"
      weight            = 1
    }
  }

  network_configuration {
    subnets          = var.subnet_ids
    security_groups  = var.security_group_ids
    assign_public_ip = var.assign_public_ip
  }

  load_balancer {
    target_group_arn = var.target_group_arn
    container_name   = var.service_name
    container_port   = var.container_port
  }

  lifecycle {
    ignore_changes = [desired_count]
  }

  tags = local.tags

  depends_on = [aws_ecs_cluster_capacity_providers.this]
}

# ---------------------------------------------------------------------------
# Target-tracking autoscaling
# ---------------------------------------------------------------------------
resource "aws_appautoscaling_target" "this" {
  count = var.enable_autoscaling ? 1 : 0

  max_capacity       = var.autoscaling_max_capacity
  min_capacity       = var.autoscaling_min_capacity
  resource_id        = "service/${aws_ecs_cluster.this.name}/${aws_ecs_service.this.name}"
  scalable_dimension = "ecs:service:DesiredCount"
  service_namespace  = "ecs"
}

resource "aws_appautoscaling_policy" "cpu" {
  count = var.enable_autoscaling ? 1 : 0

  name               = "${local.name}-cpu-tracking"
  policy_type        = "TargetTrackingScaling"
  resource_id        = aws_appautoscaling_target.this[0].resource_id
  scalable_dimension = aws_appautoscaling_target.this[0].scalable_dimension
  service_namespace  = aws_appautoscaling_target.this[0].service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
    target_value       = var.autoscaling_cpu_target
    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}

variables.tf

variable "name_prefix" {
  description = "Prefix for cluster/service names (e.g. team or environment)."
  type        = string
}

variable "service_name" {
  description = "Logical service name; also the primary container name."
  type        = string

  validation {
    condition     = can(regex("^[a-z0-9][a-z0-9-]{0,30}$", var.service_name))
    error_message = "service_name must be lowercase alphanumeric/hyphens, max 31 chars."
  }
}

variable "aws_region" {
  description = "Region for the awslogs log driver."
  type        = string
}

variable "container_image" {
  description = "Fully qualified container image, including tag or digest."
  type        = string
}

variable "container_port" {
  description = "Port the container listens on (wired to the target group)."
  type        = number
  default     = 8080
}

variable "task_cpu" {
  description = "Task-level CPU units. Must be a valid Fargate combo with task_memory."
  type        = number
  default     = 512

  validation {
    condition     = contains([256, 512, 1024, 2048, 4096, 8192, 16384], var.task_cpu)
    error_message = "task_cpu must be a valid Fargate CPU value (256, 512, 1024, ...)."
  }
}

variable "task_memory" {
  description = "Task-level memory (MiB). Must be a valid Fargate combo with task_cpu."
  type        = number
  default     = 1024
}

variable "cpu_architecture" {
  description = "Fargate CPU architecture: X86_64 or ARM64 (Graviton, cheaper)."
  type        = string
  default     = "ARM64"

  validation {
    condition     = contains(["X86_64", "ARM64"], var.cpu_architecture)
    error_message = "cpu_architecture must be X86_64 or ARM64."
  }
}

variable "desired_count" {
  description = "Initial task count. Later changes are ignored (see ignore_changes on the service) so autoscaling can own it."
  type        = number
  default     = 2
}

variable "execution_role_arn" {
  description = "ECS task execution role ARN (pulls images, writes logs, reads secrets)."
  type        = string
}

variable "task_role_arn" {
  description = "IAM role assumed by the running container for AWS API calls."
  type        = string
  default     = null
}

variable "subnet_ids" {
  description = "Subnet IDs for the awsvpc ENIs (private subnets recommended)."
  type        = list(string)

  validation {
    condition     = length(var.subnet_ids) >= 2
    error_message = "Provide at least two subnets across AZs for high availability."
  }
}

variable "security_group_ids" {
  description = "Security groups attached to task ENIs."
  type        = list(string)
}

variable "assign_public_ip" {
  description = "Assign a public IP to tasks (only for public subnets without NAT)."
  type        = bool
  default     = false
}

variable "target_group_arn" {
  description = "ALB/NLB target group ARN tasks register into."
  type        = string
}

variable "health_check_grace_period_seconds" {
  description = "Grace period before the LB health check can mark a new task unhealthy."
  type        = number
  default     = 60
}

variable "container_health_check_command" {
  description = "Shell command for the container-level health check. The default needs curl in the image and assumes port 8080."
  type        = string
  default     = "curl -f http://localhost:8080/healthz || exit 1"
}

variable "deployment_minimum_healthy_percent" {
  description = "Minimum percent of tasks kept healthy during a deploy."
  type        = number
  default     = 100
}

variable "deployment_maximum_percent" {
  description = "Maximum percent of tasks allowed during a deploy."
  type        = number
  default     = 200
}

variable "enable_execute_command" {
  description = "Enable ECS Exec (SSM) into running tasks for debugging. Needs a task_role_arn with ssmmessages permissions."
  type        = bool
  default     = true
}

variable "container_insights" {
  description = "Enable CloudWatch Container Insights on the cluster."
  type        = bool
  default     = true
}

variable "use_fargate_spot" {
  description = "Run tasks on FARGATE_SPOT capacity instead of on-demand FARGATE."
  type        = bool
  default     = false
}

variable "log_retention_days" {
  description = "CloudWatch log retention in days."
  type        = number
  default     = 30
}

variable "log_kms_key_arn" {
  description = "Optional KMS key ARN to encrypt the log group."
  type        = string
  default     = null
}

variable "environment" {
  description = "Plain environment variables injected into the container."
  type        = map(string)
  default     = {}
}

variable "secrets" {
  description = "Secrets map: env var name => Secrets Manager/SSM Parameter ARN."
  type        = map(string)
  default     = {}
}

variable "enable_autoscaling" {
  description = "Provision Application Auto Scaling target + CPU policy."
  type        = bool
  default     = true
}

variable "autoscaling_min_capacity" {
  description = "Minimum task count for autoscaling."
  type        = number
  default     = 2
}

variable "autoscaling_max_capacity" {
  description = "Maximum task count for autoscaling."
  type        = number
  default     = 10
}

variable "autoscaling_cpu_target" {
  description = "Target average CPU utilization percent for scaling."
  type        = number
  default     = 60
}

variable "tags" {
  description = "Additional tags merged onto all resources."
  type        = map(string)
  default     = {}
}
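task_cpu is validated above, but task_memory is not, so an invalid CPU/memory pairing only fails when AWS rejects the task definition at apply time. One possible guard, sketched here as an assumption rather than something the module ships, is a precondition that checks the pair at plan time. The names fargate_memory_by_cpu and fargate_size_guard are hypothetical; terraform_data requires Terraform 1.4 or later, which the module's version pin already satisfies.

```hcl
locals {
  # Valid Fargate task memory values (MiB) for each task CPU value.
  # range() excludes its upper limit, hence the "+ 1".
  fargate_memory_by_cpu = {
    "256"   = [512, 1024, 2048]
    "512"   = range(1024, 4096 + 1, 1024)
    "1024"  = range(2048, 8192 + 1, 1024)
    "2048"  = range(4096, 16384 + 1, 1024)
    "4096"  = range(8192, 30720 + 1, 1024)
    "8192"  = range(16384, 61440 + 1, 4096)
    "16384" = range(32768, 122880 + 1, 8192)
  }
}

# Placeholder resource whose only job is to carry the precondition.
# The same precondition could instead sit in the lifecycle block of
# aws_ecs_task_definition.this.
resource "terraform_data" "fargate_size_guard" {
  lifecycle {
    precondition {
      condition = contains(
        local.fargate_memory_by_cpu[tostring(var.task_cpu)],
        var.task_memory
      )
      error_message = "task_memory ${var.task_memory} MiB is not a valid Fargate size for task_cpu ${var.task_cpu}."
    }
  }
}
```

A precondition is used instead of a variable validation block because cross-variable references inside validation blocks are only supported from Terraform 1.9, while the module pins >= 1.5.0.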

outputs.tf

output "cluster_id" {
  description = "ECS cluster ID (ARN)."
  value       = aws_ecs_cluster.this.id
}

output "cluster_arn" {
  description = "ECS cluster ARN."
  value       = aws_ecs_cluster.this.arn
}

output "cluster_name" {
  description = "ECS cluster name."
  value       = aws_ecs_cluster.this.name
}

output "service_id" {
  description = "ECS service ARN/ID."
  value       = aws_ecs_service.this.id
}

output "service_name" {
  description = "ECS service name."
  value       = aws_ecs_service.this.name
}

output "task_definition_arn" {
  description = "Full ARN (with revision) of the active task definition."
  value       = aws_ecs_task_definition.this.arn
}

output "task_definition_family" {
  description = "Task definition family name."
  value       = aws_ecs_task_definition.this.family
}

output "log_group_name" {
  description = "CloudWatch log group receiving container logs."
  value       = aws_cloudwatch_log_group.this.name
}

output "autoscaling_target_resource_id" {
  description = "Application Auto Scaling resource ID, or null if disabled."
  value       = try(aws_appautoscaling_target.this[0].resource_id, null)
}

How to use it

module "ecs_cluster_service" {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-ecs?ref=v1.0.0"

  name_prefix  = "prod"
  service_name = "checkout-api"
  aws_region   = "ap-south-1"

  container_image  = "123456789012.dkr.ecr.ap-south-1.amazonaws.com/checkout-api:2026.06.1"
  container_port   = 8080
  task_cpu         = 1024
  task_memory      = 2048
  cpu_architecture = "ARM64"

  execution_role_arn = aws_iam_role.ecs_execution.arn
  task_role_arn      = aws_iam_role.checkout_task.arn

  subnet_ids         = module.network.private_subnet_ids
  security_group_ids = [aws_security_group.checkout_tasks.id]
  target_group_arn   = aws_lb_target_group.checkout.arn

  environment = {
    LOG_LEVEL  = "info"
    APP_REGION = "ap-south-1"
  }

  secrets = {
    DB_PASSWORD    = aws_secretsmanager_secret.db.arn
    STRIPE_API_KEY = aws_secretsmanager_secret.stripe.arn
  }

  enable_autoscaling       = true
  autoscaling_min_capacity = 3
  autoscaling_max_capacity = 20
  autoscaling_cpu_target   = 55

  tags = {
    Environment = "prod"
    CostCenter  = "payments"
  }
}

# Downstream: alarm on the service's CPU, using the module's cluster_name
# and service_name outputs as the CloudWatch dimensions.
resource "aws_cloudwatch_metric_alarm" "checkout_high_cpu" {
  alarm_name          = "checkout-api-cpu-high"
  namespace           = "AWS/ECS"
  metric_name         = "CPUUtilization"
  statistic           = "Average"
  comparison_operator = "GreaterThanThreshold"
  threshold           = 85
  period              = 60
  evaluation_periods  = 5

  dimensions = {
    ClusterName = module.ecs_cluster_service.cluster_name
    ServiceName = module.ecs_cluster_service.service_name
  }

  alarm_actions = [aws_sns_topic.oncall.arn]
}
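The log_group_name output can feed downstream resources in the same way. As a minimal sketch, assuming the application writes the literal string ERROR on failure, a metric filter could turn those log lines into a custom metric for a central dashboard. The filter name, metric name, and namespace below are hypothetical.

```hcl
# Counts log events containing "ERROR" in the service's log group.
resource "aws_cloudwatch_log_metric_filter" "checkout_errors" {
  name           = "checkout-api-errors"
  log_group_name = module.ecs_cluster_service.log_group_name
  pattern        = "ERROR"

  metric_transformation {
    name      = "CheckoutApiErrorCount" # hypothetical metric name
    namespace = "Custom/Checkout"       # hypothetical namespace
    value     = "1"                     # each matching event adds 1
  }
}
```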

With Terragrunt

Terragrunt keeps this module DRY across environments — define the backend and provider once in a root config, then a thin terragrunt.hcl per environment supplies only the inputs that differ.

1. Root config: live/terragrunt.hcl (inherited by every module):

remote_state {
  backend = "s3"
  generate = { path = "backend.tf", if_exists = "overwrite" }
  config = {
    # ...s3 state bucket/container + key per path...
  }
}

2. Module config: live/prod/ecs/terragrunt.hcl:

include "root" {
  path = find_in_parent_folders()
}

terraform {
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-ecs?ref=v1.0.0"
}

inputs = {
  name_prefix        = "..."
  service_name       = "..."
  aws_region         = "..."
  container_image    = "..."
  execution_role_arn = "..."
  subnet_ids         = ["...", "..."]
  security_group_ids = ["...", "..."]
  target_group_arn   = "..."
}

3. Deploy this module on its own, or roll out every module in the environment together (both commands are run from the repository root):

cd live/prod/ecs && terragrunt apply        # this module only
cd live/prod && terragrunt run-all apply    # every module under live/prod

Why Terragrunt here: the backend and provider live in one place instead of being copy-pasted into every module; inputs is overridden per environment (dev / stage / prod) without forking the module; and run-all orchestrates dependencies across modules. Reach for it once you have more than one environment or more than a handful of modules — for a single stack, the plain Quickstart above is enough.
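The root config shown in step 1 only covers the backend. To keep the provider in one place as well, the root terragrunt.hcl could also carry a generate block along these lines. This is a sketch under assumptions: the region value is a placeholder, and the generated file name provider.tf is an arbitrary choice.

```hcl
# live/terragrunt.hcl (root), alongside the remote_state block.
# Terragrunt writes provider.tf into each module's working directory.
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region = "us-east-1" # assumption: replace with the environment's region
}
EOF
}
```

With this in place, the per-environment terragrunt.hcl files stay limited to the source and inputs blocks.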

Inputs

| Name | Type | Default | Required | Description |
|---|---|---|---|---|
| name_prefix | string | - | Yes | Prefix for cluster/service names (team or environment). |
| service_name | string | - | Yes | Logical service name; also the primary container name (validated). |
| aws_region | string | - | Yes | Region for the awslogs log driver. |
| container_image | string | - | Yes | Fully qualified image, including tag or digest. |
| container_port | number | 8080 | No | Container listen port, wired to the target group. |
| task_cpu | number | 512 | No | Task-level CPU units (validated Fargate value). |
| task_memory | number | 1024 | No | Task-level memory (MiB); must form a valid Fargate combo. |
| cpu_architecture | string | ARM64 | No | X86_64 or ARM64 (Graviton). |
| desired_count | number | 2 | No | Initial task count; ignored after create. |
| execution_role_arn | string | - | Yes | ECS task execution role ARN. |
| task_role_arn | string | null | No | IAM role assumed by the running container. |
| subnet_ids | list(string) | - | Yes | Subnets for awsvpc ENIs (>= 2, validated). |
| security_group_ids | list(string) | - | Yes | Security groups on task ENIs. |
| assign_public_ip | bool | false | No | Assign public IP to tasks. |
| target_group_arn | string | - | Yes | ALB/NLB target group ARN. |
| health_check_grace_period_seconds | number | 60 | No | LB health-check grace period for new tasks. |
| container_health_check_command | string | curl … /healthz | No | Container-level health check command. |
| deployment_minimum_healthy_percent | number | 100 | No | Min healthy percent during deploys. |
| deployment_maximum_percent | number | 200 | No | Max percent during deploys. |
| enable_execute_command | bool | true | No | Enable ECS Exec (SSM) into tasks. |
| container_insights | bool | true | No | Enable Container Insights on the cluster. |
| use_fargate_spot | bool | false | No | Use FARGATE_SPOT capacity. |
| log_retention_days | number | 30 | No | CloudWatch log retention. |
| log_kms_key_arn | string | null | No | KMS key for log encryption. |
| environment | map(string) | {} | No | Plain env vars for the container. |
| secrets | map(string) | {} | No | Env var name => Secrets Manager/SSM ARN. |
| enable_autoscaling | bool | true | No | Provision autoscaling target + CPU policy. |
| autoscaling_min_capacity | number | 2 | No | Minimum task count. |
| autoscaling_max_capacity | number | 10 | No | Maximum task count. |
| autoscaling_cpu_target | number | 60 | No | Target average CPU percent. |
| tags | map(string) | {} | No | Extra tags merged onto all resources. |

Outputs

| Name | Description |
|---|---|
| cluster_id | ECS cluster ID (ARN). |
| cluster_arn | ECS cluster ARN. |
| cluster_name | ECS cluster name (used in CloudWatch dimensions). |
| service_id | ECS service ARN/ID. |
| service_name | ECS service name. |
| task_definition_arn | Full ARN (with revision) of the active task definition. |
| task_definition_family | Task definition family name. |
| log_group_name | CloudWatch log group receiving container logs. |
| autoscaling_target_resource_id | Application Auto Scaling resource ID, or null when disabled. |

Enterprise scenario

A payments platform runs roughly 40 microservices on Fargate across dev, staging, and prod accounts. Each service team instantiates this module from a thin per-service stack, passing only the image tag and the team’s IAM roles, so every service inherits the same guardrails: Container Insights for the SRE dashboards, a deployment circuit breaker so a bad checkout-api:2026.06.1 rolls back automatically instead of paging the on-call at 2am, secrets pulled from Secrets Manager rather than baked into images, and ARM64 Graviton tasks that cost roughly 20% less than equivalent x86 tasks. When the platform team needs to raise the default log retention for PCI compliance, they bump one variable default and cut a new module tag; each of the 40 services picks it up on its next apply after moving its ?ref= pin to that tag.
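A thin per-service stack of the kind described might look like the sketch below. Everything outside the module's documented inputs is an assumption for illustration: the variable names, the platform object that carries the shared networking values, and the idea that the pipeline supplies image_tag.

```hcl
# Values the delivery pipeline or the team supplies per deploy.
variable "image_tag" {
  type = string
}

variable "ecr_repository_url" {
  type = string
}

variable "execution_role_arn" {
  type = string
}

variable "task_role_arn" {
  type = string
}

# Hypothetical bundle of shared values published by the platform team.
variable "platform" {
  type = object({
    subnet_ids         = list(string)
    security_group_ids = list(string)
    target_group_arn   = string
  })
}

module "checkout_api" {
  # Moving this ref to a newer tag is how the service opts in to new module defaults.
  source = "git::https://dev.azure.com/teknohut/kloudvin/_git/terraform-modules//terraform-module-aws-ecs?ref=v1.0.0"

  name_prefix     = "prod"
  service_name    = "checkout-api"
  aws_region      = "ap-south-1"
  container_image = "${var.ecr_repository_url}:${var.image_tag}"

  execution_role_arn = var.execution_role_arn
  task_role_arn      = var.task_role_arn

  subnet_ids         = var.platform.subnet_ids
  security_group_ids = var.platform.security_group_ids
  target_group_arn   = var.platform.target_group_arn
}
```

Every input not listed here falls back to the module default, which is what lets a change to a default reach all services through a single tag bump.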

Best practices
