Multi-Environment 3-Tier Infrastructure with Terragrunt & CI/CD Approval Gates

Every team eventually arrives at the same wall. You have a tidy set of Terraform modules, you have a dev environment that works, and then someone says the sentence that changes everything: “we need uat, staging, and prod too, and prod can’t be applied by whoever happens to run the pipeline.” That is the moment a single root configuration stops being enough. You now need four near-identical environments that share module code but differ in size, redundancy, and — crucially — in who is allowed to change them. This lesson is the centrepiece of the Terragrunt track: we take a real 3-tier web application (load balancer → compute → database, with object storage and a VPC underneath) and build the full dev→uat→staging→prod promotion pipeline with Terragrunt for DRY configuration and a graduated CI/CD approval-gate model — dev auto-applies, uat and staging need a human to click approve, and prod is gated behind GitHub Environments with required reviewers. We will use OIDC so there are no long-lived cloud keys anywhere, and we will wire drift detection so the pipeline tells you when reality has wandered from code.

This is the layout and the workflow that real platform teams run. By the end you will have a repository structure you can copy, worked terragrunt.hcl snippets for every layer, and an approval model that satisfies an auditor. The examples target AWS for concreteness, but the pattern is cloud-agnostic — the same structure works against Azure or GCP by swapping the backend block and the provider, and everything runs identically on OpenTofu.

Learning objectives

By the end of this lesson you will be able to:

Design a repository that separates a global module library from a live tree of per-environment instantiations for a 3-tier application.
Write per-environment terragrunt.hcl files that compose only the modules an app needs (vpc, ec2/compute, alb, rds, s3) and wire them with dependency blocks.
Derive remote-state keys and provider roles from the directory path so dev and prod can never collide or cross-contaminate.
Implement a graduated approval-gate pipeline: dev auto-apply, uat/staging manual approval, prod via GitHub Environments + required reviewers (with the Azure DevOps, GitLab CI, Atlantis, and Spacelift equivalents).
Configure OIDC keyless authentication so CI assumes a per-environment role with no stored secrets.
Add drift detection so scheduled plans surface out-of-band changes before they bite.

Prerequisites

You should be comfortable with core Terraform (HCL, providers, the init→plan→apply→destroy workflow, and remote state) and with authoring reusable modules — covered in Terraform Fundamentals and Authoring Terraform Modules earlier in this track. You should also know Terragrunt’s basic blocks (terraform, include, remote_state, inputs, dependency, generate) from Terragrunt Fundamentals: DRY Configurations, Remote State & Dependencies. This lesson sits in the Terragrunt module of the Terraform Zero-to-Hero course and is the advanced capstone that ties modules, Terragrunt, and CI/CD together before we move on to troubleshooting and architecture. A free GitHub account and a sandbox cloud account (AWS free tier is fine for the lab) are enough to follow along; production-scale resources are described but you can validate the whole structure with validate/plan and never spend a rupee.

The shape of the problem: one app, four environments

A 3-tier application has three logical tiers — a presentation/load-balancing tier, an application/compute tier, and a data tier — usually with shared foundational pieces (a network and some object storage). Stamped across four environments, that is a lot of resources, and the naive approach (copy the root module four times) produces duplication that drifts apart within weeks. The Terragrunt approach keeps the definition of each tier in one reusable module and instantiates it per environment with only the differences spelled out.

The environments are not equal. They differ deliberately along axes that matter for cost and safety:

Aspect	dev	uat	staging	prod
Purpose	fast iteration	business/UAT sign-off	production rehearsal	live traffic
Compute size	small (t3.small)	medium	prod-like	prod (right-sized)
Instance count / autoscale	1	2	2–4	3–20
RDS	single-AZ, small	single-AZ	Multi-AZ	Multi-AZ + read replicas
NAT gateways	single (cost saving)	single	one per AZ	one per AZ
Deletion protection	off	off	on	on
Apply policy	auto on merge	manual approval	manual approval	required reviewers
Blast radius	throwaway	low	medium	highest

The last two rows are the heart of this lesson. The infrastructure differences are inputs; the who-can-apply differences are CI/CD approval gates. Terragrunt handles the first, your CI platform handles the second, and the repository layout is what makes both tractable.

Repository layout: global modules + a per-app live tree

Separate two things that change at different rates and for different reasons. Modules are reusable, versioned building blocks that change rarely and deliberately. The live tree is the per-environment instantiation that changes constantly. Mixing them is the original sin of Terraform repositories.

infra/
  modules/                          # GLOBAL reusable module library (versioned)
    app-vpc/                        # VPC, subnets, NAT, route tables
      main.tf  variables.tf  outputs.tf  versions.tf  README.md
    app-ec2/                        # compute tier (ASG / launch template)
    app-alb/                        # load-balancing tier (ALB + target group)
    app-rds/                        # data tier (RDS instance / cluster)
    app-s3/                         # object storage (assets / state-adjacent)
  apps/
    3tier-app/                      # ONE application, four environments
      root.hcl                      # shared backend + provider generation
      env.hcl                       # (optional) app-wide common inputs
      envs/
        dev/
          env.hcl                   # account_id, environment="dev", region
          vpc/terragrunt.hcl
          s3/terragrunt.hcl
          rds/terragrunt.hcl
          ec2/terragrunt.hcl
          alb/terragrunt.hcl
        uat/
          env.hcl
          vpc/terragrunt.hcl
          ...
        staging/
          env.hcl
          ...
        prod/
          env.hcl
          ...
  .github/workflows/                # CI/CD: plan on PR, gated apply on merge

Two conventions are doing the heavy lifting here:

The directory path is the identity of a unit. apps/3tier-app/envs/prod/rds is unambiguous, and we will derive the state key directly from that path so two units can never overwrite each other’s state.
modules/ is global; apps/ composes from it. The 3-tier app does not redefine a VPC — it consumes the app-vpc module. A second app would reuse the same library. This is the difference between a module library and a pile of copy-pasted resources.

A note on where modules live: in this lesson they sit in the same repo under modules/ for readability, but in production you almost always pin to a versioned source — a Git tag (git::git@github.com:acme/infra-modules.git//app-vpc?ref=v1.4.0) or a private registry. Path-based sources are fine for a monorepo; versioned sources are what make promotion a deliberate, reviewable act (more on that below).

The global module library: compose only what the app needs

The point of a global library is that an app composes only the modules it requires. Our 3-tier app needs five: app-vpc, app-s3, app-rds, app-ec2, app-alb. Each is an ordinary Terraform module with typed inputs and clear outputs; Terragrunt never changes how a module is written, only how it is invoked.

Module	Tier	Key inputs (per env)	Key outputs (consumed by)
`app-vpc`	foundation	`vpc_cidr`, `az_count`, `single_nat_gateway`	`vpc_id`, `private_subnet_ids`, `public_subnet_ids` → ec2, alb, rds
`app-s3`	foundation	`bucket_name`, `versioning`, `force_destroy`	`bucket_id`, `bucket_arn` → ec2
`app-rds`	data	`instance_class`, `multi_az`, `subnet_ids`, `deletion_protection`	`db_endpoint`, `db_sg_id` → ec2
`app-ec2`	compute	`instance_type`, `min/max_size`, `subnet_ids`, `db_endpoint`, `db_sg_id`, `bucket_arn`	`asg_name`, `instance_sg_id` → alb
`app-alb`	presentation	`subnet_ids`, `target_group_port`, `instance_sg_id`	`alb_dns_name`

The dependency direction is the natural one for a 3-tier app: VPC and S3 have no dependencies; RDS and EC2 need the VPC’s private subnets; EC2 also needs the RDS endpoint and the S3 bucket ARN; ALB needs the VPC’s public subnets and the EC2 security group to allow traffic through. Terragrunt derives the apply order from the dependency blocks that express these relationships, so you never hand-write “vpc first.”

Here is the skeleton of one module so the inputs/outputs contract is concrete (app-rds):

# modules/app-rds/variables.tf
# HCL's single-line block form accepts only one argument, so variables with a default are written multi-line.
variable "identifier"     { type = string }
variable "instance_class" { type = string }
variable "allocated_storage" {
  type    = number
  default = 20
}
variable "multi_az" {
  type    = bool
  default = false
}
variable "subnet_ids" { type = list(string) }
variable "vpc_id"     { type = string }
variable "deletion_protection" {
  type    = bool
  default = true
}
variable "tags" {
  type    = map(string)
  default = {}
}

# modules/app-rds/outputs.tf
output "db_endpoint" { value = aws_db_instance.this.endpoint }
output "db_sg_id"    { value = aws_security_group.db.id }

The module says nothing about which environment it is in — that is entirely the live tree’s job.

DRY foundations: root.hcl generates backend and provider

Define the backend and provider once in apps/3tier-app/root.hcl, and let every unit inherit it via include. This is the core DRY win: a new environment or a new tier never repeats a backend or provider block.

# apps/3tier-app/root.hcl
locals {
  env_vars = read_terragrunt_config(find_in_parent_folders("env.hcl"))
  account_id  = local.env_vars.locals.account_id
  environment = local.env_vars.locals.environment
  aws_region  = local.env_vars.locals.aws_region
}

# 1) Remote state: one definition, path-derived key → per-env isolation
remote_state {
  backend = "s3"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
  config = {
    bucket       = "acme-tfstate-${local.account_id}"
    key          = "${path_relative_to_include()}/terraform.tfstate"
    region       = local.aws_region
    encrypt      = true
    use_lockfile = true     # S3-native lock file (Terraform 1.10+); no DynamoDB table needed
  }
}

# 2) Provider: generated per unit, role derived from the env's account_id
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region = "${local.aws_region}"
  assume_role {
    role_arn = "arn:aws:iam::${local.account_id}:role/terraform-exec"
  }
  default_tags {
    tags = {
      Environment = "${local.environment}"
      Application = "3tier-app"
      ManagedBy   = "terragrunt"
    }
  }
}
EOF
}

# 3) App-wide inputs every unit can rely on
inputs = {
  environment = local.environment
  aws_region  = local.aws_region
}

Two details make this safe by construction:

key = "${path_relative_to_include()}/terraform.tfstate" turns the directory path into the state key. The prod RDS unit lands at envs/prod/rds/terraform.tfstate; the dev RDS unit at envs/dev/rds/terraform.tfstate. State isolation is automatic — there is no shared state and no chance of a dev apply touching prod’s objects.
The provider’s assume_role.role_arn is computed from each environment’s account_id. If dev and prod live in separate cloud accounts (the recommended blast-radius boundary), a dev apply physically cannot authenticate to prod. The guarantee comes from structure, not from a runbook step someone might skip.
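The path-to-key mapping can be reproduced outside Terragrunt to see why two units cannot collide. This is a minimal sketch, not Terragrunt's implementation: state_key is a hypothetical helper that imitates "${path_relative_to_include()}/terraform.tfstate" using plain string handling, and it assumes both arguments are written relative to the same starting directory.

```shell
#!/bin/sh
# state_key is a hypothetical helper (not a Terragrunt command).
# It mimics "${path_relative_to_include()}/terraform.tfstate".
# Usage: state_key ROOT_DIR UNIT_DIR
state_key() {
  root_dir=${1%/}   # strip one trailing slash, if present
  unit_dir=${2%/}
  case "$unit_dir" in
    "$root_dir"/*)
      # Drop the root prefix; what remains is the unit's relative path.
      printf '%s/terraform.tfstate\n' "${unit_dir#"$root_dir"/}"
      ;;
    *)
      echo "state_key: $unit_dir is not under $root_dir" >&2
      return 1
      ;;
  esac
}

state_key infra/apps/3tier-app infra/apps/3tier-app/envs/dev/rds
state_key infra/apps/3tier-app infra/apps/3tier-app/envs/prod/rds
```

The two calls print envs/dev/rds/terraform.tfstate and envs/prod/rds/terraform.tfstate. Because the key is a pure function of the directory, two units can only share a key if they share a directory.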

Each environment carries its identity in a tiny env.hcl:

# apps/3tier-app/envs/dev/env.hcl
locals {
  environment = "dev"
  account_id  = "111111111111"
  aws_region  = "ap-south-1"
}

# apps/3tier-app/envs/prod/env.hcl
locals {
  environment = "prod"
  account_id  = "444444444444"   # separate account = hard isolation
  aws_region  = "ap-south-1"
}

Worked terragrunt.hcl: composing and wiring the tiers

Now the per-unit files. Each one is short: include the root (which activates the generated backend and provider), point terraform.source at the module, declare dependency blocks for what it consumes, and set environment-specific inputs.

Foundation — the VPC (no dependencies):

# apps/3tier-app/envs/dev/vpc/terragrunt.hcl
include "root" {
  path = find_in_parent_folders("root.hcl")
}

terraform {
  source = "${dirname(find_in_parent_folders("root.hcl"))}/../../modules/app-vpc"
  # production: source = "git::git@github.com:acme/infra-modules.git//app-vpc?ref=v1.4.0"
}

inputs = {
  vpc_cidr           = "10.10.0.0/16"
  az_count           = 2
  single_nat_gateway = true     # dev: one NAT to save ~₹3,000/mo
}

The prod VPC is the same module, differing only in inputs:

# apps/3tier-app/envs/prod/vpc/terragrunt.hcl
include "root" { path = find_in_parent_folders("root.hcl") }
terraform { source = "git::git@github.com:acme/infra-modules.git//app-vpc?ref=v1.4.0" }

inputs = {
  vpc_cidr           = "10.40.0.0/16"
  az_count           = 3
  single_nat_gateway = false    # prod: one NAT per AZ for resilience
}

Data tier — RDS (depends on the VPC):

# apps/3tier-app/envs/prod/rds/terragrunt.hcl
include "root" { path = find_in_parent_folders("root.hcl") }
terraform { source = "git::git@github.com:acme/infra-modules.git//app-rds?ref=v1.4.0" }

dependency "vpc" {
  config_path = "../vpc"
  mock_outputs = {
    vpc_id             = "vpc-00000000000000000"
    private_subnet_ids = ["subnet-aaaa", "subnet-bbbb", "subnet-cccc"]
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan", "init"]
}

inputs = {
  identifier          = "3tier-prod"
  instance_class      = "db.r6g.large"
  multi_az            = true
  allocated_storage   = 200
  deletion_protection = true
  vpc_id              = dependency.vpc.outputs.vpc_id
  subnet_ids          = dependency.vpc.outputs.private_subnet_ids
}

The mock_outputs block is the part people get wrong. When you plan the RDS unit before the VPC has ever been applied, the VPC’s real outputs do not exist and the plan would fail trying to read them. Mock values let plan/validate/init proceed with placeholders. The mock_outputs_allowed_terraform_commands allowlist is the safety latch: it ensures apply and destroy are never fed fake subnet IDs — an apply only runs once the real outputs exist.
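The decision described above can be written down as a small truth table. The following is a toy model for illustration only: output_source is a hypothetical helper, and it assumes Terragrunt's default behaviour, in which real outputs are used whenever the dependency has been applied.

```shell
#!/bin/sh
# Toy model of when mock outputs are used; not part of Terragrunt.
# Usage: output_source COMMAND HAS_REAL_OUTPUTS   (yes|no)
ALLOWED_COMMANDS="validate plan init"   # mirrors mock_outputs_allowed_terraform_commands

output_source() {
  if [ "$2" = "yes" ]; then
    echo "real"            # an applied dependency supplies its real outputs
    return 0
  fi
  for allowed in $ALLOWED_COMMANDS; do
    if [ "$allowed" = "$1" ]; then
      echo "mock"          # placeholders are acceptable for read-only commands
      return 0
    fi
  done
  echo "error: no outputs available for '$1'" >&2
  return 1                 # apply/destroy are never fed placeholders
}

output_source plan no
output_source apply yes
output_source apply no || echo "apply blocked until the VPC is applied"
```

The first call prints mock, the second prints real, and the third fails, which is the safety latch at work: the only way to apply RDS is to apply the VPC first.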

Compute tier — EC2 (depends on VPC, RDS, and S3 — the composition point):

# apps/3tier-app/envs/prod/ec2/terragrunt.hcl
include "root" { path = find_in_parent_folders("root.hcl") }
terraform { source = "git::git@github.com:acme/infra-modules.git//app-ec2?ref=v1.4.0" }

dependency "vpc" {
  config_path  = "../vpc"
  mock_outputs = { private_subnet_ids = ["subnet-aaaa", "subnet-bbbb"] }
  mock_outputs_allowed_terraform_commands = ["validate", "plan", "init"]
}
dependency "rds" {
  config_path  = "../rds"
  mock_outputs = { db_endpoint = "mock.endpoint:5432", db_sg_id = "sg-rds-mock" }
  mock_outputs_allowed_terraform_commands = ["validate", "plan", "init"]
}
dependency "s3" {
  config_path  = "../s3"
  mock_outputs = { bucket_arn = "arn:aws:s3:::mock-bucket" }
  mock_outputs_allowed_terraform_commands = ["validate", "plan", "init"]
}

inputs = {
  instance_type = "m6i.large"
  min_size      = 3
  max_size      = 20
  subnet_ids    = dependency.vpc.outputs.private_subnet_ids
  db_endpoint   = dependency.rds.outputs.db_endpoint
  db_sg_id      = dependency.rds.outputs.db_sg_id
  bucket_arn    = dependency.s3.outputs.bucket_arn
}

Presentation tier — ALB (depends on VPC and EC2’s security group):

# apps/3tier-app/envs/prod/alb/terragrunt.hcl
include "root" { path = find_in_parent_folders("root.hcl") }
terraform { source = "git::git@github.com:acme/infra-modules.git//app-alb?ref=v1.4.0" }

dependency "vpc" {
  config_path  = "../vpc"
  mock_outputs = { public_subnet_ids = ["subnet-pub1", "subnet-pub2"] }
  mock_outputs_allowed_terraform_commands = ["validate", "plan", "init"]
}
dependency "ec2" {
  config_path  = "../ec2"
  mock_outputs = { instance_sg_id = "sg-app-mock", asg_name = "mock-asg" }
  mock_outputs_allowed_terraform_commands = ["validate", "plan", "init"]
}

inputs = {
  subnet_ids        = dependency.vpc.outputs.public_subnet_ids
  target_group_port = 8080
  instance_sg_id    = dependency.ec2.outputs.instance_sg_id
}

Because each unit declares what it consumes, Terragrunt computes the full DAG: vpc and s3 first (in parallel), then rds, then ec2, then alb. You never encoded that order — it emerged from the dependency graph.
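The same ordering falls out of any topological sort, which makes it easy to sanity-check a dependency graph by hand. A minimal sketch, assuming the standard tsort utility is installed; the edge list is transcribed manually from the dependency blocks shown above, and apply_order is a hypothetical name rather than anything Terragrunt provides.

```shell
#!/bin/sh
# Each line is "upstream downstream": the upstream unit must be applied first.
# Edges are copied by hand from the dependency blocks of the rds, ec2 and alb units.
apply_order() {
  tsort <<'EOF'
vpc rds
vpc ec2
vpc alb
rds ec2
s3 ec2
ec2 alb
EOF
}

apply_order
```

tsort prints one valid order, one unit per line. The relative position of vpc and s3 may vary because neither depends on the other, which is exactly why Terragrunt can run them in parallel.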

Running a whole environment

run-all walks every terragrunt.hcl under a directory, builds the DAG, and runs your command in topological order, parallelising independent units:

# Stand up the entire dev environment in dependency order
cd infra/apps/3tier-app/envs/dev
terragrunt run-all plan
terragrunt run-all apply

Terragrunt’s current direction. Recent Terragrunt introduces units and stacks (terragrunt.stack.hcl) as a first-class way to describe a whole environment as one composable artefact, and the CLI is consolidating run-all behaviour under the run command (e.g. terragrunt run --all plan). The find_in_parent_folders/dependency/generate model in this lesson remains fully supported and is what the overwhelming majority of repositories use today; stacks are the direction of travel for describing the envs/ tree more declaratively. Both run identically on OpenTofu by setting terraform_binary = "tofu" (or TG_TF_PATH=tofu).

Promotion: the same code flows dev → uat → staging → prod

Promotion is the payoff of this structure. Module code is identical across environments; only inputs and account wiring differ, and those live in small, reviewable files. A change flows like this:

Edit the module in modules/ (or the modules repo) and cut a release tag, e.g. v1.5.0.
Bump dev by changing ref=v1.4.0 → ref=v1.5.0 in envs/dev/.../terragrunt.hcl. Open a PR; CI shows the plan; merge auto-applies to dev.
Bake, watch dev, then bump uat to v1.5.0 — same one-line diff, separate PR, requires manual approval to apply.
Bump staging, approve, apply. This is your production rehearsal.
Bump prod — the prod PR diff is a single ref= line, which is exactly what you want a reviewer to scrutinise. Apply is gated behind required reviewers.

The reviewer of the prod change sees a one-line diff and a terraform plan, not a wall of resources. Environment-specific behaviour (sizes, Multi-AZ, NAT topology) stays in inputs and never moves between environments. Pin module versions per environment rather than floating all environments off main — the entire point of promotion is that prod runs code that already survived dev, uat, and staging.
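The version bump itself is mechanical, so it can be scripted to keep the diff to exactly one line. The sketch below is one possible way to do it: bump_ref is a hypothetical helper, the demo file is a throwaway copy of the source line used earlier, and tags containing a slash would need a different sed delimiter.

```shell
#!/bin/sh
# bump_ref is a hypothetical helper: it rewrites "?ref=OLD" to "?ref=NEW"
# in one terragrunt.hcl and leaves every other line untouched.
# Usage: bump_ref FILE OLD_TAG NEW_TAG
bump_ref() {
  # Escape dots so "v1.4.0" matches only itself.
  old_pattern=$(printf '%s' "$2" | sed 's/\./\\./g')
  sed "s/?ref=${old_pattern}\"/?ref=$3\"/" "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# Demo on a temporary file so nothing in a real repository is modified.
demo_dir=$(mktemp -d)
demo_file="$demo_dir/terragrunt.hcl"
printf '%s\n' \
  'terraform { source = "git::git@github.com:acme/infra-modules.git//app-vpc?ref=v1.4.0" }' \
  > "$demo_file"

bump_ref "$demo_file" v1.4.0 v1.5.0
cat "$demo_file"
rm -rf "$demo_dir"
```

The printed line now ends in ?ref=v1.5.0. Running the helper against one environment per pull request preserves the property the promotion flow relies on: each environment's change is a separate, reviewable one-line diff.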

The approval-gate model: dev-auto / uat-staging-manual / prod-reviewers

This is where infrastructure-as-code becomes change governance. The same pipeline definition applies a different gate per environment. The model:

Environment	Trigger	Gate	Who approves	Rationale
dev	merge to `main`	none — auto-apply	nobody	fast feedback; throwaway blast radius
uat	merge to `main`	manual approval	any team engineer	business sign-off happens here; cheap insurance
staging	merge to `main`	manual approval	any team engineer	production rehearsal; protect the dress run
prod	merge to `main`	required reviewers + (optional) wait timer	a named approver group, not the author	highest blast radius; audited, second-person change

The graduation is deliberate: friction rises with blast radius. Dev gets none so engineers iterate freely; prod gets a named, audited, second-person approval so no single person can change live infrastructure alone. The “required reviewers ≠ author” rule is the one auditors care about most — it enforces separation of duties.

GitHub Actions: Environments + required reviewers

GitHub’s native mechanism is Environments. You create four environments (dev, uat, staging, prod) in repo settings; on the protected ones you enable required reviewers (and optionally a wait timer and a branch restriction). A job that references environment: prod pauses before running until a listed reviewer approves — in the GitHub UI, via API, or via notification. Crucially, the approval is recorded on the deployment, giving you the audit trail for free.

# .github/workflows/deploy.yml
name: terragrunt-deploy
on:
  push:
    branches: [main]

permissions:
  id-token: write     # OIDC — mint a short-lived token, no stored keys
  contents: read

jobs:
  dev:
    runs-on: ubuntu-latest
    environment: dev                       # no protection rules → auto-applies
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::111111111111:role/gha-terragrunt-dev
          aws-region: ap-south-1
      - uses: gruntwork-io/terragrunt-action@v2
        with:
          tg_command: "run-all apply --terragrunt-non-interactive"
          tg_dir: "infra/apps/3tier-app/envs/dev"

  uat:
    needs: dev
    runs-on: ubuntu-latest
    environment: uat                       # protection: manual approval (one reviewer)
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::222222222222:role/gha-terragrunt-uat
          aws-region: ap-south-1
      - uses: gruntwork-io/terragrunt-action@v2
        with:
          tg_command: "run-all apply --terragrunt-non-interactive"
          tg_dir: "infra/apps/3tier-app/envs/uat"

  staging:
    needs: uat
    runs-on: ubuntu-latest
    environment: staging                   # protection: manual approval
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::333333333333:role/gha-terragrunt-staging
          aws-region: ap-south-1
      - uses: gruntwork-io/terragrunt-action@v2
        with:
          tg_command: "run-all apply --terragrunt-non-interactive"
          tg_dir: "infra/apps/3tier-app/envs/staging"

  prod:
    needs: staging
    runs-on: ubuntu-latest
    environment: prod                      # protection: REQUIRED REVIEWERS (named group) + wait timer
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::444444444444:role/gha-terragrunt-prod
          aws-region: ap-south-1
      - uses: gruntwork-io/terragrunt-action@v2
        with:
          tg_command: "run-all apply --terragrunt-non-interactive"
          tg_dir: "infra/apps/3tier-app/envs/prod"

The gating is entirely declarative: dev has no protection rules and runs immediately; uat/staging carry a single-approver rule; prod carries required reviewers plus an optional wait timer. The needs: chain enforces order so prod can never run before staging has succeeded. A companion on: pull_request workflow should run run-all plan (read-only) so reviewers see the diff before they merge — apply on merge, plan on PR is the canonical split.
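A companion plan workflow may only need to plan the environments a pull request actually touches. The sketch below shows one way to derive that list. It assumes the changed paths arrive on standard input (for example from git diff --name-only against the base branch), and changed_envs is a hypothetical helper, not a Terragrunt or GitHub feature.

```shell
#!/bin/sh
# changed_envs is a hypothetical helper: it reads changed file paths on stdin
# and prints each environment under envs/ that has at least one changed file.
changed_envs() {
  sed -n 's#^infra/apps/3tier-app/envs/\([^/]*\)/.*#\1#p' | sort -u
}

# Sample input standing in for the output of "git diff --name-only".
printf '%s\n' \
  'infra/apps/3tier-app/envs/dev/rds/terragrunt.hcl' \
  'infra/apps/3tier-app/envs/dev/ec2/terragrunt.hcl' \
  'infra/apps/3tier-app/envs/prod/rds/terragrunt.hcl' \
  'README.md' | changed_envs
```

For the sample input this prints dev and prod. Paths outside envs/ (such as a change under modules/) produce no output here, so a real workflow would need its own rule for those, for instance planning every environment.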

The same gate on other platforms

The model is portable; only the gating primitive changes:

Platform	Auto-apply (dev)	Manual gate (uat/staging)	Reviewer gate (prod)
GitHub Actions	environment with no rules	environment + 1 required reviewer	environment + required reviewers (named team) + wait timer + branch limit
Azure DevOps	pipeline stage, no checks	Environment with an Approvals check	Environment with Approvals (group) + Business Hours / exclusive-lock checks
GitLab CI	job runs on merge	`when: manual` job	protected environment + `when: manual` + deployment approval rules
Atlantis	`automerge` / auto-apply on dev workspace	`apply_requirements: [approved]`	`apply_requirements: [approved, mergeable]` + CODEOWNERS on `envs/prod/**`
Spacelift	autodeploy stack	manual confirm + an approval policy	login/approval policy requiring a second approver on the prod stack

Whatever the platform, the principle is identical: dev removes friction, prod adds an audited second pair of eyes, and uat/staging sit in between. A neat reinforcement on GitHub and Atlantis is CODEOWNERS: require the platform/SRE team as code owners on infra/apps/3tier-app/envs/prod/**, so a prod change cannot even merge without their review — defence in depth alongside the deployment gate.

OIDC keyless authentication: no long-lived secrets

Long-lived cloud access keys stored in CI are the single most common way infrastructure pipelines get breached. OIDC (OpenID Connect) removes them entirely. The CI platform acts as an identity provider; your cloud trusts that provider and issues a short-lived credential scoped to a specific role, only for the duration of the job, only for workflows that match a subject claim you control.

The flow, end to end:

You register the CI platform’s OIDC issuer as an identity provider in the cloud (one-time setup).
You create a role per environment (gha-terragrunt-dev, …-prod) whose trust policy says: trust tokens from this issuer, but only when the sub claim matches my repo and (for prod) a protected ref or environment.
At job start, the CI runner requests a signed OIDC token; the cloud verifies it against the issuer and the trust condition, then returns temporary credentials.
Terragrunt’s generated provider then assume_roles into the per-environment terraform-exec role to actually create resources.

The AWS trust policy that scopes prod to the protected environment looks like this:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "Federated": "arn:aws:iam::444444444444:oidc-provider/token.actions.githubusercontent.com" },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "token.actions.githubusercontent.com:aud": "sts.amazonaws.com",
          "token.actions.githubusercontent.com:sub": "repo:acme/infra:environment:prod"
        }
      }
    }
  ]
}

The sub condition is the security boundary: only a job running in the prod GitHub Environment (which itself requires reviewer approval) can assume the prod role. A pull-request job, or a job in the dev environment, simply will not match. The cloud-specific equivalents are Azure Workload Identity Federation (federated credentials on a managed identity / app registration, used with azure/login’s OIDC mode) and GCP Workload Identity Federation (a workload identity pool + provider mapped to a service account). In all three, the result is the same: zero stored secrets, short-lived tokens, per-environment scoping.
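To make the boundary concrete, the sketch below compares the subject claims GitHub issues in different job contexts against the prod trust condition. It only models the string comparison that StringEquals performs; the helper names are hypothetical, and the repository acme/infra is the one used in the policy above.

```shell
#!/bin/sh
# Subject-claim formats GitHub uses for different job contexts.
sub_for_environment()  { printf 'repo:%s:environment:%s\n' "$1" "$2"; }
sub_for_branch()       { printf 'repo:%s:ref:refs/heads/%s\n' "$1" "$2"; }
sub_for_pull_request() { printf 'repo:%s:pull_request\n' "$1"; }

# StringEquals is an exact, case-sensitive match; model it as string equality.
trust_allows() { [ "$1" = "$2" ]; }

PROD_TRUST_SUB="repo:acme/infra:environment:prod"

for candidate in \
  "$(sub_for_environment acme/infra prod)" \
  "$(sub_for_environment acme/infra dev)" \
  "$(sub_for_branch acme/infra main)" \
  "$(sub_for_pull_request acme/infra)"
do
  if trust_allows "$PROD_TRUST_SUB" "$candidate"; then
    echo "ALLOW $candidate"
  else
    echo "DENY  $candidate"
  fi
done
```

Only the first candidate is allowed. A push to main without environment: prod, a dev-environment job, and a pull-request job all present a different subject and are denied.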

Figure: Multi-environment 3-tier Terragrunt + approval gates

The diagram shows the global module library on the left feeding the four environment columns, each composing the five tiers via terragrunt.hcl, with the CI/CD pipeline along the bottom applying the graduated gates — auto for dev, manual for uat/staging, required reviewers for prod — and OIDC minting per-environment credentials.

Drift detection: catch out-of-band change before it bites

Drift is when reality diverges from code — someone widens a security-group rule in the console during an incident, and now your state and the world disagree. The fix is a scheduled plan: run run-all plan on every environment on a timer, and fail (or alert) if any unit shows a non-empty diff.

# .github/workflows/drift.yml
name: drift-detection
on:
  schedule:
    - cron: "0 6 * * *"     # 06:00 daily
permissions:
  id-token: write
  contents: read
jobs:
  drift:
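    strategy:
      fail-fast: false                    # drift in one environment must not cancel the other checks
      matrix:
        include:                          # account IDs match the roles in deploy.yml
          - { env: dev,     account_id: "111111111111" }
          - { env: uat,     account_id: "222222222222" }
          - { env: staging, account_id: "333333333333" }
          - { env: prod,    account_id: "444444444444" }
    runs-on: ubuntu-latest
    environment: ${{ matrix.env }}        # read-only role; plan never needs apply rights
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::${{ matrix.account_id }}:role/gha-terragrunt-${{ matrix.env }}-readonly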
          aws-region: ap-south-1
      - uses: gruntwork-io/terragrunt-action@v2
        with:
          tg_command: "run-all plan -detailed-exitcode"
          tg_dir: "infra/apps/3tier-app/envs/${{ matrix.env }}"

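-detailed-exitcode makes plan exit with 0 when nothing has changed, 1 on error, and 2 when there is a diff; any non-zero code fails the job and surfaces the drift in your alerts. Use a read-only role for drift detection: a scheduled job should never hold apply rights. Because the job references environment: ${{ matrix.env }}, any protection rules on that environment (such as prod’s required reviewers) also apply to the scheduled run, so consider dedicated unprotected environments for drift jobs, with OIDC trust limited to the read-only roles. Managed platforms (Spacelift, Terraform Cloud/HCP, env0) ship drift detection as a built-in feature; the scheduled-plan pattern above is the free, portable equivalent.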
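If the drift job should distinguish real drift from a broken plan (for example to route them to different alerts), the exit code can be classified explicitly. A minimal sketch: classify_plan_exit is a hypothetical helper, and fake_plan stands in for the real terragrunt command so the example runs without cloud access.

```shell
#!/bin/sh
# classify_plan_exit is a hypothetical helper naming the three outcomes of
# "plan -detailed-exitcode": 0 = no changes, 1 = error, 2 = changes present.
classify_plan_exit() {
  case "$1" in
    0) echo "in-sync" ;;
    2) echo "drift" ;;
    *) echo "error" ;;
  esac
}

# In a pipeline the status would come from the real command, e.g.:
#   status=0; terragrunt run-all plan -detailed-exitcode || status=$?
# Here a stand-in function simulates a plan that found changes.
fake_plan() { return 2; }

status=0
fake_plan || status=$?
echo "dev: $(classify_plan_exit "$status")"
```

With the simulated exit code of 2 this prints dev: drift. Treating every other non-zero code as an error keeps a failed plan (expired credentials, a syntax error) from being reported as drift.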

Hands-on lab

This lab builds and validates the structure end to end using only local tooling and the free tier — you can do the whole thing with validate/plan and spend nothing. We will scaffold the layout, prove that state keys and provider roles are path-derived, and confirm the dependency graph.

Tooling (all free): Terraform 1.x or OpenTofu, Terragrunt, and (optionally) the AWS CLI for the OIDC-free local part. Verify:

terraform -version      # or: tofu -version
terragrunt -version

Steps:

Create the skeleton:

mkdir -p infra/modules/{app-vpc,app-s3,app-rds,app-ec2,app-alb}
mkdir -p infra/apps/3tier-app/envs/{dev,uat,staging,prod}
cd infra/apps/3tier-app

Write root.hcl and a dev/env.hcl (copy the snippets above; for the lab you can set bucket to a name you own and drop the assume_role block so it runs against your default credentials).
Add a trivial app-vpc module (a single null_resource is enough to prove wiring without spending), then a dev/vpc/terragrunt.hcl that sources it.
Prove path-derived state keys — initialise and inspect the generated backend:
```
cd envs/dev/vpc
terragrunt init
find .terragrunt-cache -name backend.tf -exec cat {} +   # Terragrunt writes generated files into its cached working copy
```
Expected output: a backend.tf whose key is envs/dev/vpc/terraform.tfstate. Repeat under envs/prod/vpc and confirm the key is envs/prod/vpc/terraform.tfstate — different, automatically.
Prove the dependency graph. Add rds and ec2 units with dependency blocks, then from the env root:
```
cd ..                 # back to envs/dev
terragrunt graph-dependencies
```
Expected output: Graphviz DOT showing vpc upstream of rds and ec2. Pipe to dot -Tpng -o graph.png if you want the picture.
Validate every unit without touching the cloud:
```
terragrunt run-all validate
```
Expected output: each unit reports Success! The configuration is valid.
Dry-run the whole environment (mock outputs let downstream units plan):
```
terragrunt run-all plan
```
Expected output: plans for all units in dependency order; downstream units show approximate diffs fed by mock outputs.

Validation: you have succeeded when (a) dev and prod generate different state keys from the same root.hcl, (b) graph-dependencies shows vpc before rds/ec2, and (c) run-all validate passes for every unit.

Cleanup:

# If you applied anything real:
cd infra/apps/3tier-app/envs/dev
terragrunt run-all destroy --terragrunt-non-interactive
# Return to the directory that contains infra/, then remove caches and the scaffold:
cd ../../../../..
find infra -type d -name '.terragrunt-cache' -prune -exec rm -rf {} +
rm -rf infra

Cost note: if you keep everything at validate/plan, the lab is free. If you actually apply the real 3-tier modules, the meaningful costs are the RDS instance, the NAT gateway(s), and the ALB (each roughly a few hundred to a few thousand rupees per month). Use single_nat_gateway = true, the smallest instance classes, and terragrunt run-all destroy the moment you are done — and never leave a prod-sized RDS or a per-AZ NAT topology running in a sandbox.

Common mistakes & troubleshooting

Symptom	Likely cause	Fix
`Error: Unsupported attribute … dependency.vpc.outputs.vpc_id` on `plan`	dependency not yet applied and no `mock_outputs`	add `mock_outputs` with the keys the consumer reads, plus the command allowlist
`apply` proceeded with obviously fake IDs (e.g. `subnet-aaaa`)	`mock_outputs_allowed_terraform_commands` includes `apply`	restrict the allowlist to `["validate","plan","init"]` so apply uses real outputs
Two environments fighting over the same state	hard-coded `key` instead of `path_relative_to_include()`	derive the key from the path; verify each unit’s `backend.tf` has a distinct key
Prod pipeline applied without approval	job missing `environment:` or environment has no protection rules	reference the protected `environment:` and configure required reviewers on it
`Error: could not assume role … is not authorized to perform sts:AssumeRoleWithWebIdentity`	OIDC trust policy `sub` claim does not match the workflow/environment	align the `sub` condition with `repo:org/repo:environment:<env>` (or ref) exactly
`run-all apply` applies in the wrong order	a `dependency` block is missing, so Terragrunt cannot see the edge	add the `dependency` for every output a unit consumes; re-check `graph-dependencies`
Drift job never fails despite console changes	plan exit code ignored	use `plan -detailed-exitcode`; treat exit `2` as drift
Module change hit prod unexpectedly	environments float off `main` instead of pinned tags	pin `ref=` per environment; promote by bumping the tag deliberately
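
The fake-ID row in the table is cheap to guard against with a lint step. This is a minimal sketch under stated assumptions: `risky_mock_allowlists` is a hypothetical name, and the check is a line-based heuristic that only works when the allowlist is written on a single line.

```shell
# Hypothetical lint: list terragrunt.hcl files whose mock allowlist mentions apply or destroy.
# Heuristic only: assumes mock_outputs_allowed_terraform_commands sits on one line.
risky_mock_allowlists() {
  find "$1" -type f -name 'terragrunt.hcl' ! -path '*/.terragrunt-cache/*' \
    -exec grep -lE 'mock_outputs_allowed_terraform_commands.*"(apply|destroy)"' {} +
}
```

In a PR pipeline you might fail the job when `risky_mock_allowlists infra` prints anything; a multi-line allowlist would need a real HCL parser instead of grep.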

Best practices

Separate modules/ from the live tree. The library is versioned and stable; the live tree changes constantly. Never define a resource directly in envs/.
Compose, don’t copy. The 3-tier app consumes app-vpc/app-ec2/app-alb/app-rds/app-s3; a second app reuses the same library. Duplication is the enemy.
Derive state keys and provider roles from the path. Isolation and account-correctness then come from structure, not discipline.
Pin module versions per environment. Promote dev→uat→staging→prod by bumping a tag, one reviewable line at a time. Never let prod track main.
Plan on PR, apply on merge, gate by environment. Reviewers should see the diff before they approve; the gate enforces who may apply.
One cloud account per environment where you can — it makes the blast-radius boundary a hard wall rather than a tag convention.
Use OIDC everywhere; ban long-lived keys. Scope each role’s trust to the specific repo and environment.
Run scheduled drift detection with a read-only role on every environment, and alert on a non-empty plan.
Put the platform team in CODEOWNERS for envs/prod/** so prod changes need their review to even merge.
Keep the exit ramp open. Terragrunt only generates standard Terraform files and calls the normal binary; commit the generated backend.tf/provider.tf and inline inputs as .tfvars and you are back to vanilla Terraform with state intact.
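
The scheduled drift-detection practice in the list above comes down to reading one exit code. The sketch below shows one way to wire it. The function names are hypothetical, and it relies on the behaviour this lesson describes: a plan run with -detailed-exitcode exits 0 when nothing would change, 2 when the plan is non-empty, and anything else on error.

```shell
# Map a `plan -detailed-exitcode` status to a verdict.
# 0 = no changes, 2 = non-empty plan (drift), anything else = the plan itself failed.
classify_plan_exit() {
  case "$1" in
    0) echo "clean" ;;
    2) echo "drift" ;;
    *) echo "error" ;;
  esac
}

# Hypothetical wrapper for a scheduled job; $1 is an environment directory such as envs/prod.
check_drift() {
  status=0
  (cd "$1" && terragrunt run-all plan -detailed-exitcode --terragrunt-non-interactive) || status=$?
  verdict="$(classify_plan_exit "$status")"
  echo "drift check for $1: $verdict"
  [ "$verdict" = "clean" ]   # non-zero return fails the job on drift or error
}
```

Keeping "drift" and "error" as separate verdicts lets the alert say whether reality moved or the read-only role simply could not plan.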

Security notes

The security posture of this design rests on three pillars. First, no stored secrets: OIDC means there are no long-lived cloud keys in CI to leak; tokens are short-lived and scoped by the sub claim to a specific repository and environment. Second, separation of duties: prod apply requires a named reviewer who is not the author, recorded on the deployment for audit, and reinforced by CODEOWNERS on envs/prod/**. Third, hard isolation: per-environment accounts plus path-derived state and per-account assume_role mean a dev pipeline physically cannot authenticate to prod or touch its state.
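
When a role assumption is refused, the usual cause is a sub claim that does not match the trust condition. The following sketch helps reason about that offline. Both function names are hypothetical, and shell globbing is used only as a stand-in for a wildcard trust condition; it does not call any cloud API.

```shell
# Build the subject a job targeting a GitHub Environment presents in its OIDC token.
expected_sub() {   # usage: expected_sub <org> <repo> <environment>
  printf 'repo:%s/%s:environment:%s\n' "$1" "$2" "$3"
}

# Return 0 when a token subject satisfies a trust-policy pattern.
# Shell globbing approximates a wildcard condition, where * matches any run of characters.
sub_matches() {    # usage: sub_matches <pattern> <sub>
  case "$2" in
    $1) return 0 ;;
    *)  return 1 ;;
  esac
}
```

For example, `sub_matches 'repo:org/repo:environment:prod' "$(expected_sub org repo prod)"` succeeds, while a subject issued for a branch ref such as `repo:org/repo:ref:refs/heads/main` fails the same pattern. A broad pattern like `repo:org/repo:*` would accept both, which is exactly the over-scoping to avoid on the prod role.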

A few more deliberate choices: state buckets are encrypted (encrypt = true) and access-controlled because state can contain secrets. An RDS connection detail or an initially-set password lands in state in plaintext, so the backend must be treated as sensitive. Grant CI roles least privilege: the drift role is read-only, and the apply roles are scoped to the resources the app actually manages, not account-wide admin. Enable deletion protection on prod RDS, enable versioning on prod S3 buckets (S3 has no deletion-protection flag of its own), and consider a prevent_destroy lifecycle on irreplaceable resources, so a bad plan cannot delete the data tier. Finally, run policy-as-code (OPA/Conftest, Checkov, or tfsec) as a gate in the PR pipeline so a plan that opens 0.0.0.0/0 or disables encryption fails review automatically; this is covered in the policy-gates lessons of this track.

Interview & exam questions

Why use Terragrunt for a multi-environment 3-tier app instead of Terraform workspaces? Workspaces share one backend and one provider config and branch on terraform.workspace, which couples environments and offers no per-account provider isolation. Terragrunt keeps each environment in its own directory with a path-derived state key and a per-account assume_role, giving hard isolation, DRY backend/provider generation, and inter-module dependencies — the things a four-environment, multi-account setup actually needs.
How does Terragrunt guarantee dev and prod never share state? The remote_state block derives key from path_relative_to_include(), so each unit’s state path equals its directory path (envs/prod/rds/... vs envs/dev/rds/...). The keys differ by construction; there is no shared state and no manual key to get wrong.
What do mock_outputs and mock_outputs_allowed_terraform_commands do, and why is the allowlist critical? mock_outputs supply placeholder values so a consumer can plan/validate/init before its dependency has been applied. The allowlist restricts those mocks to read-only commands so apply and destroy are never fed fake IDs — apply only runs against real outputs. Including apply in the allowlist is a serious bug.
Describe the graduated approval-gate model and the reasoning behind it. Dev auto-applies (no gate) for fast feedback on a throwaway environment; uat and staging require a manual approval as cheap insurance and to protect the production rehearsal; prod requires named reviewers who are not the author, plus optionally a wait timer. Friction scales with blast radius, and prod enforces an audited second-person change.
How does OIDC remove the need for stored cloud credentials, and what scopes the access? The CI platform issues a short-lived signed token; the cloud trusts that issuer and returns temporary credentials only when the token’s sub claim matches a trust condition you set. Scoping to repo:org/repo:environment:prod means only a job in the approved prod environment can assume the prod role — no static keys exist.
A teammate widened a security-group rule in the console during an incident. How does your pipeline catch it, and how do you reconcile? A scheduled drift job runs run-all plan -detailed-exitcode; the non-empty plan returns exit code 2, fails the job, and alerts. You reconcile by either codifying the change (update the module/inputs and apply) or reverting it (re-apply to bring reality back to code) — deliberately, via the normal gated pipeline.
Why pin module source to a tag per environment rather than floating off main? Promotion’s whole value is that prod runs code already proven in dev, uat, and staging. If every environment tracks main, a merge changes all environments at once — you have effectively deployed straight to prod. Per-environment tags make promotion a deliberate, reviewable, one-line act.
How would you enforce that a prod change cannot be merged without the platform team’s review? Add the platform/SRE team as code owners on infra/apps/3tier-app/envs/prod/** via CODEOWNERS, with required reviews from code owners enabled on the protected branch. This blocks the merge; the GitHub Environment’s required reviewers additionally block the apply — defence in depth.
What is the dependency order Terragrunt computes for this app, and how does it know? vpc and s3 (no deps) → rds (needs vpc) → ec2 (needs vpc, rds, s3) → alb (needs vpc, ec2). Terragrunt infers it from the dependency blocks; run-all builds the DAG and applies in topological order, parallelising independent units. You never write the order explicitly.
Compare doing the prod gate in GitHub Actions vs Azure DevOps vs Atlantis. GitHub uses Environments with required reviewers (and optional wait timer/branch limit). Azure DevOps uses an Environment with an Approvals check (plus Business Hours / exclusive-lock checks). Atlantis uses apply_requirements: [approved, mergeable] plus CODEOWNERS on the prod path. The primitive differs; the principle — an audited second approver on prod — is identical.
Where can secrets end up in this system, and how do you protect them? In state (RDS connection details, initially-set passwords) — so the backend is encrypted and access-controlled, and real secrets are kept out of Terraform (Vault / cloud secret manager). In CI — eliminated by OIDC. In logs — avoid printing sensitive outputs; mark variables sensitive.
How do you keep an exit ramp from Terragrunt? Terragrunt only generates standard backend.tf/provider.tf and calls the normal terraform/tofu binary. To leave, commit the generated files, inline the inputs as .tfvars, and you are back to vanilla Terraform with state untouched. Adopt Terragrunt for the duplication it removes, and keep the generated output boring enough that walking away stays possible.
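
The pinning answer above can also be enforced by a check instead of by memory. This is a rough sketch under several assumptions: `unpinned_sources` is a hypothetical name, module sources are assumed to be remote URLs carrying a `?ref=` query, and release tags are assumed to look like `v1.4.0` or `1.4.0`. Local relative sources would be flagged as well, so adjust the pattern to your tagging scheme.

```shell
# Hypothetical check: print source lines that are not pinned to a version-like tag.
# Assumes tags look like v1.4.0 or 1.4.0; a floating ref such as ref=main is reported.
unpinned_sources() {
  find "$1" -type f -name 'terragrunt.hcl' ! -path '*/.terragrunt-cache/*' \
    -exec grep -HnE '^[[:space:]]*source[[:space:]]*=' {} + \
    | grep -vE 'ref=v?[0-9]+\.[0-9]+'
}
```

Running `unpinned_sources infra/apps/3tier-app/envs/prod` on every PR that touches the prod path gives reviewers a direct answer to "is prod tracking main?" before they approve.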

Quick check

What Terragrunt function makes each unit’s state key unique without you hand-writing it?
Which environments in this model auto-apply, which need a manual approval, and which need required reviewers?
What is the purpose of mock_outputs_allowed_terraform_commands, and which commands must it exclude?
In OIDC, what part of the token is used to scope a job to the prod environment’s role?
What flag makes a scheduled run-all plan fail when drift is present?

Answers

path_relative_to_include() — used as the key in the remote_state config, it turns the directory path into the state key.
dev auto-applies; uat and staging require a manual approval; prod requires required reviewers (a named group, not the author).
It restricts mock outputs to read-only commands so apply/destroy always use real dependency outputs; it must exclude apply and destroy.
The sub (subject) claim, matched against a condition like repo:org/repo:environment:prod in the role’s trust policy.
-detailed-exitcode — it returns exit code 2 on a non-empty plan, failing the job.

Exercise

Extend the repository to a second region for prod only, to support an active/passive DR posture, without breaking the dev→uat→staging→prod model:

Add envs/prod-dr/env.hcl with the same account_id as prod but a different aws_region, and stamp the vpc/rds/ec2/alb units there (no S3 if you replicate the bucket cross-region).
Confirm via the generated backend.tf that prod-dr state keys are distinct from prod (they will be, because the path differs).
Add a prod-dr job to the pipeline that depends on prod and is gated behind the same required-reviewers Environment, so DR changes are governed identically to prod.
Add prod-dr to the drift-detection matrix with a read-only role.
Stretch: make the RDS in prod-dr a cross-region read replica by adding a dependency "primary_rds" { config_path = "../../prod/rds" } and passing its identifier as the replica source. Note the new edge in graph-dependencies and reason about what run-all destroy ordering must now be.

Write down: which files you added versus changed (you should add many and change almost none of the existing units), and one sentence on why the approval gate for prod-dr should match prod rather than staging.

Certification mapping

This lesson maps directly to the HashiCorp Terraform Associate (003) objectives on remote state and backends, module sources and versioning, the core workflow, and managing multiple environments — Terragrunt is the orchestration layer, but the underlying Terraform concepts (state isolation, module composition, dependency ordering) are exactly what the exam tests. The CI/CD approval-gate and OIDC material maps to the cloud DevOps professional exams: AWS Certified DevOps Engineer – Professional (DOP-C02) (deployment governance, OIDC/short-lived credentials, multi-account strategy), Microsoft Azure DevOps Engineer Expert (AZ-400) (Environments, approvals and checks, deployment gates, workload identity federation), and Google Cloud Professional DevOps Engineer (progressive delivery and workload identity federation). The drift-detection and policy-gate practices also surface in those DevOps exams’ “secure and govern infrastructure” domains.

Glossary

Live tree — the per-environment instantiation of infrastructure (apps/.../envs/dev|uat|staging|prod), as opposed to the reusable modules/ library.
Unit — a single deployable Terragrunt directory (one terragrunt.hcl sourcing one module); the atom that run-all operates on. Recent Terragrunt also uses stacks to group units.
dependency block — declares that one unit consumes another’s outputs; Terragrunt uses these edges to build the apply-order DAG.
mock_outputs — placeholder values that let a consumer plan/validate before its dependency is applied; gated by a command allowlist so apply never uses them.
path_relative_to_include() — Terragrunt function returning the path from the included (parent) config to the current unit; used to derive a unique state key.
Approval gate — a CI/CD control that pauses a deployment until a condition is met (manual click, named reviewer, wait timer).
GitHub Environment — a deployment target with optional protection rules (required reviewers, wait timer, branch restriction) and recorded approvals.
Required reviewers — a protection rule naming people/teams who must approve before a job targeting that environment runs.
OIDC (OpenID Connect) — a federation standard letting CI obtain short-lived cloud credentials by presenting a signed token, removing stored keys.
sub claim — the subject field of an OIDC token (e.g. repo:org/repo:environment:prod) used to scope which workflows may assume a role.
Drift — divergence between deployed reality and the code/state; detected by a scheduled plan.
Promotion — flowing the same module version from dev → uat → staging → prod, deliberately, one reviewable tag bump per environment.
Separation of duties — the control that the person approving a prod change is not the person who authored it.

Next steps

You now have the full multi-environment, gated, keyless pipeline for a 3-tier app. The natural next lesson is Terraform Troubleshooting: State, Providers, Drift, Dependencies & Debugging, which gives you the symptom→cause→fix playbooks for when a run-all goes wrong, state gets stuck or corrupted, or a provider/auth error blocks a gated apply. From there, deepen the governance side with the OPA/Conftest and Checkov/tfsec policy-gate lessons (turn “a reviewer should catch this” into “the pipeline rejects it automatically”), and the GitHub Actions Terraform OIDC lesson for the PR-automation details. When you are ready to see where this pattern sits in the bigger picture, The Terraform Architecting Ladder places Terragrunt DRY multi-account at rung four and shows the path up to a full enterprise IaC platform.