A two-account Terragrunt repo is easy. Forty accounts across three regions, with two hundred units whose apply order is dictated by a dependency graph nobody can hold in their head, is a different machine. The wrapper that removed your backend.tf duplication is now the thing standing between a one-line PR and a thirty-minute serialized apply that fails on unit 137 and leaves the run half-applied.
This article is about that machine: how Terragrunt builds the dependency DAG, how run --all and run --graph traverse it, how to make plans survive a greenfield where downstream outputs do not yet exist, and — the part that actually matters at scale — how to run only the units a change touched, in parallel, safely, in CI. It assumes you already know include, remote_state generation, and the live/modules split. If you do not, start with the DRY multi-environment article and come back.
A note on CLI syntax. Terragrunt v0.88.0 redesigned the CLI: terragrunt run-all apply is now terragrunt run --all apply; graph-dependencies is now dag graph; --terragrunt-include-dir is now --queue-include-dir; --terragrunt-non-interactive is now --non-interactive. The legacy forms still work as deprecated aliases, so older pipelines will not break, but every command below uses the current syntax. If your CI logs warn about deprecated commands, that is what they mean.
1. Structure the live tree so the path is the identity
The hierarchy is account / region / environment / component, and it is not cosmetic. Terragrunt derives the state key from path_relative_to_include(), the provider role from the account file you are standing under, and the DAG from the config_path references between sibling directories. The directory layout is the deployment topology.
infra/
  modules/                 # versioned, reusable TF/OpenTofu modules
  live/
    root.hcl               # backend + provider generation, version pins
    _envcommon/            # per-component config shared across all envs
      network.hcl
      eks.hcl
    prod/
      account.hcl          # account_id, account_name
      us-east-1/
        region.hcl         # aws_region
        platform/          # the "environment" layer
          network/
            terragrunt.hcl
          eks/
            terragrunt.hcl
          rds/
            terragrunt.hcl
      eu-west-1/
        region.hcl
        platform/
          network/
          eks/
    staging/
      account.hcl
      us-east-1/ ...
Two structural rules pay off at scale:
- One module instantiation per leaf directory. A leaf with a terragrunt.hcl is a unit, Terragrunt’s atomic node in the DAG. Never put two modules in one directory; you lose the ability to plan, target, and roll them back independently.
- _envcommon/ for component-level DRY. The EKS inputs that never differ between staging and prod (addon versions, IRSA wiring, log retention) live in _envcommon/eks.hcl and are pulled in with a second include. Only the genuinely environment-specific values (cluster name, node counts) stay in the leaf. This is what keeps the promotion diff to a handful of lines across two hundred units.
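To make "the path is the identity" concrete, here is a minimal sketch that splits a unit path into its four fields and prints the state key the root config would derive from it. The helper name unit_identity is hypothetical, and it assumes paths are given relative to live/ with exactly the account / region / environment / component shape described above.

```shell
# Hypothetical helper: split a unit path (relative to live/) into the four
# identity fields and print the state key derived from the same path.
unit_identity() {
  rel=${1%/}                         # drop a trailing slash, if any
  old_ifs=$IFS; IFS=/
  set -- $rel                        # split on "/" into $1..$4
  IFS=$old_ifs
  if [ "$#" -ne 4 ]; then
    echo "expected account/region/environment/component, got: $rel" >&2
    return 1
  fi
  printf 'account=%s region=%s environment=%s component=%s\n' "$1" "$2" "$3" "$4"
  # Mirrors key = "${path_relative_to_include()}/terraform.tfstate"
  printf 'state_key=%s/terraform.tfstate\n' "$rel"
}

# Usage: unit_identity prod/us-east-1/platform/eks
```

A path that does not have four segments is rejected, which is a cheap way to catch a unit placed at the wrong depth before it ever gets a state key.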
2. Keep the root DRY with locals and read_terragrunt_config
The root config is read by every unit, so it carries the expensive-to-repeat facts exactly once: backend, provider, and the version pins that keep a 200-unit run reproducible.
# live/root.hcl
locals {
  account_vars = read_terragrunt_config(find_in_parent_folders("account.hcl"))
  region_vars  = read_terragrunt_config(find_in_parent_folders("region.hcl"))
  account_id   = local.account_vars.locals.account_id
  aws_region   = local.region_vars.locals.aws_region
}

# Pin the toolchain. A 200-unit run is only reproducible if every unit
# runs the same OpenTofu and Terragrunt versions.
# The floor is 1.10 because use_lockfile (below) is not available earlier.
terraform_version_constraint  = ">= 1.10.0, < 2.0.0"
terragrunt_version_constraint = ">= 0.88.0"

remote_state {
  backend = "s3"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
  config = {
    bucket       = "acme-tfstate-${local.account_id}"
    key          = "${path_relative_to_include()}/terraform.tfstate"
    region       = local.aws_region
    encrypt      = true
    use_lockfile = true # S3-native locking (Terraform/OpenTofu 1.10+); no DynamoDB table needed
  }
}

generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<-EOF
    provider "aws" {
      region = "${local.aws_region}"
      assume_role {
        role_arn = "arn:aws:iam::${local.account_id}:role/terraform-exec"
      }
      default_tags {
        tags = { ManagedBy = "terragrunt", Account = "${local.account_vars.locals.account_name}" }
      }
    }
  EOF
}
terraform_version_constraint makes Terragrunt fail fast if the binary on the runner drifts from the pinned range, instead of letting a newer OpenTofu silently rewrite your state format mid-run. Pin the Terragrunt binary itself in CI tooling (a mise/asdf .tool-versions file or a container tag) — the terragrunt_version_constraint attribute is a guardrail, not a version manager.
3. The dependency graph: dependency vs dependencies
Two blocks build the DAG, and the distinction is load-bearing.
- dependencies (plural) declares ordering only. It is a list of paths that must be applied before this unit. No data crosses the edge.
- dependency (singular) declares ordering and data flow. It reads the target unit’s outputs and exposes them as dependency.<name>.outputs.<key>.
You almost always want dependency — if a unit needs ordering, it usually needs an output too. Reach for dependencies only for pure sequencing with no data (for example, “do not touch the app tier until the IAM bootstrap unit has run”).
# live/prod/us-east-1/platform/eks/terragrunt.hcl
include "root" {
  path = find_in_parent_folders("root.hcl")
}

include "envcommon" {
  path   = "${dirname(find_in_parent_folders("root.hcl"))}/_envcommon/eks.hcl"
  expose = true # make its locals readable here
}

dependency "network" {
  config_path = "../network"
  mock_outputs = {
    vpc_id             = "vpc-mock00000000000"
    private_subnet_ids = ["subnet-mock1", "subnet-mock2", "subnet-mock3"]
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan", "init"]
  mock_outputs_merge_strategy_with_state  = "shallow"
}

# Pure ordering, no data: wait for the org-wide IAM baseline.
dependencies {
  paths = ["../../../_baseline/iam"]
}

inputs = {
  cluster_name = "prod-platform"
  vpc_id       = dependency.network.outputs.vpc_id
  subnet_ids   = dependency.network.outputs.private_subnet_ids
}
Terragrunt reads every terragrunt.hcl under the run root, resolves each config_path, and assembles a directed acyclic graph. For plan/apply it walks the graph so dependencies run before dependents; for destroy it walks it in reverse, tearing down dependents first. You never write the order. A cycle (A depends on B depends on A) is a hard error at graph-construction time, which is exactly when you want to find it.
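The ordering described above is an ordinary topological sort, and the standard tsort utility is enough to illustrate it. This is a sketch of the idea, not of Terragrunt's implementation: the edge list and unit names are hypothetical, and each line is a "dependency dependent" pair.

```shell
# Hypothetical edge list: "dependency dependent", one pair per line.
edges='network eks
network rds
iam eks
eks app'

# Apply order: every dependency is printed before its dependents.
apply_order() { printf '%s\n' "$1" | tsort; }

# Destroy order: the same list reversed, so dependents go first.
# (sed idiom that reverses lines without relying on tac)
destroy_order() { apply_order "$1" | sed '1!G;h;$!d'; }

# Usage:
#   apply_order "$edges"
#   destroy_order "$edges"
```

The relative order of independent units (rds versus iam here) is not fixed, which is the same freedom that lets independent units run in parallel. Feeding tsort a cycle makes it report a loop, the analogue of the hard error at graph-construction time.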
4. Mock outputs: surviving plan-time and greenfield applies
This is the single most misunderstood mechanic in Terragrunt, and the one that breaks naive CI.
When you plan the EKS unit but network has never been applied, network has no outputs. Reading dependency.network.outputs.vpc_id would fail and abort the plan — so on a fresh repo you could never produce a full-stack plan. mock_outputs supplies placeholder values so plan, validate, and init proceed against fakes.
The mock_outputs_allowed_terraform_commands allowlist is the safety interlock. It must exclude apply and destroy. With the allowlist above, an apply that cannot find real outputs will fail rather than feed a fake subnet ID into a real cluster. That failure is correct: it means you tried to apply a dependent before its dependency, and Terragrunt’s run --all ordering exists precisely so you never hit it in practice.
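Because the interlock is only as good as the allowlist in each unit, it can be worth linting for it. The following is a minimal sketch of such a check; the function name lint_mock_allowlist is hypothetical, and it assumes the allowlist is written on a single line, as in the example above.

```shell
# Hypothetical CI lint: list every terragrunt.hcl under $1 whose mock
# allowlist names apply or destroy, and return non-zero if any is found.
# Assumes the allowlist sits on one line; a multi-line list would be missed.
lint_mock_allowlist() {
  root=${1:-.}
  bad=$(find "$root" -name terragrunt.hcl -type f \
          ! -path '*/.terragrunt-cache/*' \
          -exec grep -lE 'mock_outputs_allowed_terraform_commands[^#]*"(apply|destroy)"' {} +)
  if [ -n "$bad" ]; then
    printf 'mocks allowed on apply/destroy in:\n%s\n' "$bad" >&2
    return 1
  fi
}

# Usage: lint_mock_allowlist infra/live
```

Run as a pre-merge check, this turns "the allowlist must exclude apply and destroy" from a convention into something a PR cannot quietly break.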
mock_outputs_merge_strategy_with_state controls what happens once the dependency has partial real state — common when you add a new output to an already-applied module:
| Strategy | Behavior |
|---|---|
| no_merge (default) | If real state exists, use it as-is and ignore mocks. A newly-added output that the applied state lacks will be missing, failing the plan. |
| shallow | Real outputs win; mocks fill only top-level keys the state does not yet have. The usual choice. |
| deep_map_only | Like shallow, but recurses into map-typed outputs, filling absent keys inside maps. |
Use shallow as your default. The failure mode of no_merge (add an output to a module, and every downstream plan breaks until you re-apply the dependency first) is a needless ordering constraint on a read-only operation. shallow lets the plan proceed on a mock for the one new key while using real values for everything else.
A separate knob, skip_outputs = true, tells Terragrunt to never call terragrunt output on the dependency (it still enforces ordering). Do not combine it with mock_outputs expecting “mocks only when real outputs are absent”: skip_outputs means “always mock,” mock_outputs means “mock only as a fallback.” They answer different questions.
5. Orchestrate with run --all and run --graph
run --all is the workhorse: it discovers every unit under the current directory, builds the DAG, and executes your command in topological order, parallelizing independent units.
# Stand up an entire region in dependency order.
cd infra/live/prod/us-east-1
terragrunt run --all plan
terragrunt run --all apply --non-interactive
Two operational truths:
- A greenfield run --all plan is approximate, not byte-exact. Downstream units plan against mocked outputs, so their plans show placeholder ARNs and counts. Read it as a sanity check on intent and ordering, not as the literal diff that apply will produce. The real plan for a downstream unit is only exact after its dependency has applied.
- run --all apply auto-approves by default. Across many units there is no sane way to interactively confirm each one, so Terragrunt adds -auto-approve. If that makes you nervous, --no-auto-approve restores per-unit confirmation (rarely what you want in CI, frequently what you want for a hand-driven prod teardown).
For destroys, the graph runs in reverse — and this is where run --graph earns its place. run --all destroy from a region root tears down everything under it. When you want to destroy one unit and everything that depends on it (its downstream cone), without touching unrelated units, use run --graph:
# Destroy the network unit AND every unit that depends on it,
# in the correct reverse order. Run from inside the target unit.
cd infra/live/prod/us-east-1/platform/network
terragrunt run --graph destroy
run --graph is anchored to the current unit and traverses the dependency edges out from it; run --all is anchored to a directory and processes everything beneath it. Knowing which you mean is the difference between deleting a VPC’s dependents cleanly and deleting an entire region.
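The "downstream cone" is just the set of units reachable from the anchor by following dependent edges. As a sketch of that traversal (not of Terragrunt's code), the hypothetical helper below does a breadth-first walk over "dependency dependent" pairs read from stdin; the unit names in the usage line are made up.

```shell
# Print the given unit plus everything that transitively depends on it.
# Edges arrive on stdin as "dependency dependent" pairs, one per line.
downstream_cone() {
  awk -v start="$1" '
    { kids[$1] = kids[$1] " " $2 }            # dependency -> its dependents
    END {
      queue[1] = start; seen[start] = 1; head = 1; tail = 1
      while (head <= tail) {
        unit = queue[head++]
        print unit
        n = split(kids[unit], next_units, " ")
        for (i = 1; i <= n; i++)
          if (!(next_units[i] in seen)) {
            seen[next_units[i]] = 1
            queue[++tail] = next_units[i]
          }
      }
    }'
}

# Usage:
#   printf 'network eks\nnetwork rds\niam eks\neks app\n' | downstream_cone network
```

With those edges, anchoring at network selects network, eks, rds, and app but leaves iam alone. That is the difference from a directory-anchored run, which would take everything under the directory regardless of edges.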
Inspect the graph itself before trusting any of this:
cd infra/live/prod/us-east-1
terragrunt dag graph | dot -Tsvg > dag.svg # Graphviz DOT to a diagram
6. Selective execution: only the units a change touched
At 200 units, run --all plan over the whole repo on every PR is minutes of wasted compute and a wall of noise. The goal is to run only the affected units.
Terragrunt gives you two mechanisms. The blunt one is glob inclusion:
# Plan only the EKS units across every prod region.
terragrunt run --all plan --queue-include-dir "prod/*/platform/eks"
# Plan everything in us-east-1 except RDS.
terragrunt run --all plan \
--queue-include-dir "prod/us-east-1/*" \
--queue-exclude-dir "prod/us-east-1/platform/rds"
The sharper one is change-aware and is what you actually want in CI. --filter-affected targets the units modified between the default branch and HEAD:
# Plan only units whose code changed vs the default branch — and,
# because it respects the DAG, the dependents of those units too.
terragrunt run --all plan --filter-affected
There is a subtlety the blunt globs miss: a change to a shared file (_envcommon/eks.hcl, or a local module under modules/) affects every unit that reads it, even though no leaf terragrunt.hcl changed. --queue-include-units-reading catches exactly that class:
# If _envcommon/eks.hcl changed, plan every unit that includes/reads it.
terragrunt run --all plan \
--queue-include-units-reading "_envcommon/eks.hcl"
The three --queue-* flags above are now aliases for the newer --filter query language, so current docs may show --filter; the queue-prefixed forms remain valid and read more clearly for directory-shaped selection. Two formerly-common flags are now deprecated because their behavior is the default: --queue-strict-include (inclusion is strict now) and --queue-exclude-external (external dependencies are excluded by default).
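To see why change detection needs more than a list of edited files, it helps to sketch the naive version by hand. The hypothetical helper below maps each changed path to the nearest ancestor directory that holds a terragrunt.hcl. It is an illustration of the idea under simple assumptions (paths relative to the repo root, one unit per leaf), not a replacement for the built-in flags.

```shell
# Map changed files (paths on stdin, relative to the repo root $1) to the
# nearest ancestor directory containing a terragrunt.hcl; print unique units.
affected_units() {
  repo=${1:-.}
  while IFS= read -r file; do
    dir=$(dirname "$file")
    while [ "$dir" != "." ] && [ "$dir" != "/" ]; do
      if [ -f "$repo/$dir/terragrunt.hcl" ]; then
        printf '%s\n' "$dir"
        break
      fi
      dir=$(dirname "$dir")
    done
  done | sort -u
}

# Usage (the base ref origin/main is an assumption about your default branch):
#   git diff --name-only origin/main...HEAD | affected_units "$(git rev-parse --show-toplevel)"
```

Note what it misses: an edit to _envcommon/eks.hcl maps to no unit at all, and dependents of a changed unit are not added. Those two gaps are exactly what the reader-aware and DAG-aware selection above exists to close.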
7. CI: detect affected units and parallelize safely
The naive pipeline runs run --all over the whole repo and serializes. The scalable one computes the affected set, plans it on PRs, and applies it on merge, bounded by parallelism so you do not exhaust provider rate limits or the runner.
# .github/workflows/terragrunt.yml
name: terragrunt
on:
  pull_request:
  push:
    branches: [main]
jobs:
  terragrunt:
    runs-on: ubuntu-latest
    permissions:
      id-token: write   # OIDC; no long-lived AWS keys
      contents: read
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0   # --filter-affected needs full history to diff
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::222222222222:role/ci-terraform
          aws-region: us-east-1
      - uses: gruntwork-io/terragrunt-action@v3
        with:
          tofu_version: 1.10.0   # use_lockfile in root.hcl needs 1.10 or newer
          tg_version: 0.88.0
          tg_dir: infra/live/prod
          # PRs plan the affected set; pushes to main apply it.
          # --queue-ignore-errors is added on plan only, so apply stays fail-fast.
          tg_command: >-
            run --all
            ${{ github.event_name == 'pull_request' && 'plan' || 'apply' }}
            --filter-affected
            --non-interactive
            --parallelism 8
            ${{ github.event_name == 'pull_request' && '--queue-ignore-errors' || '' }}
The flags that make this safe at scale:
- fetch-depth: 0: --filter-affected diffs against the default branch, which requires real git history. A shallow checkout silently makes it find nothing (and plan nothing, which looks like success).
- --parallelism 8 caps concurrent units. The DAG width can be dozens of independent units; without a cap you will hit AWS API throttling and OOM the runner. Tune to your account’s rate limits; 4-10 is a sane band.
- --non-interactive forces non-prompting behavior. It is mandatory in CI, where a hung prompt is a hung job.
- --queue-ignore-errors on the plan job surfaces every broken unit in one run instead of aborting on the first. You get the full list of failures per PR rather than fixing them one slow round-trip at a time.
There is one trap worth stating plainly: --queue-ignore-errors does not mean “apply what you can and skip the rest” in a way that is safe for apply. On apply, a failed dependency means its dependents should not proceed — they would apply against stale or mock data. Keep --queue-ignore-errors for plan/validate; on the apply job, prefer the default fail-fast behavior so a failed network unit stops its EKS dependent rather than applying it blind.
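If you drive Terragrunt from a script rather than the action, the same two guardrails can be sketched in a few lines: refuse to run on a shallow clone, and pass --queue-ignore-errors on plan only. Both function names are hypothetical; the flags are the ones discussed above, and the event names are GitHub's.

```shell
# Fail early when the checkout is shallow, since a diff against the default
# branch needs real history.
require_full_history() {
  if [ "$(git rev-parse --is-shallow-repository)" = "true" ]; then
    echo "shallow clone: fetch full history before computing affected units" >&2
    return 1
  fi
}

# Print the Terragrunt arguments for a GitHub event name.
# Plans collect every failure; applies keep the default fail-fast behavior.
tg_args() {
  case "$1" in
    pull_request)
      echo "run --all plan --filter-affected --non-interactive --parallelism 8 --queue-ignore-errors" ;;
    push)
      echo "run --all apply --filter-affected --non-interactive --parallelism 8" ;;
    *)
      echo "unsupported event: $1" >&2
      return 1 ;;
  esac
}

# Usage:
#   require_full_history && terragrunt $(tg_args "$GITHUB_EVENT_NAME")
```

Keeping the flag choice in one function makes the plan/apply asymmetry explicit and reviewable, instead of leaving it implied by whichever flags happen to be in the workflow file.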
Pin both binaries in the action (tofu_version, tg_version) so the runner cannot drift from your *_version_constraint pins and fail the whole run on a version check. For environments beyond a sandbox, also pin every module source to a tag — source = "git::...//eks?ref=v1.5.0" — so a plan today and an apply on merge run identical module code. Promotion then becomes a reviewed one-line ref= bump per environment.
Verify
Confirm the orchestration behaves before you trust it on prod.
cd infra/live/prod/us-east-1
# 1. The DAG is acyclic and ordered as you expect.
terragrunt dag graph | dot -Tsvg > /tmp/dag.svg
# network has no inbound edges; eks and rds depend on it.
# 2. A full-stack validate touches no cloud state but exercises every unit.
terragrunt run --all validate --non-interactive
# 3. Change detection selects the right set. From a feature branch:
git checkout -b verify/affected
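echo '# verify' >> platform/eks/terragrunt.hcl   # simulate an EKS-only change; touch alone leaves no diff
git commit -qam "verify: eks-only change"         # the comparison is against HEAD, so commit it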
terragrunt run --all plan --filter-affected --non-interactive
# Expect: eks (and any dependents) planned; network and rds skipped.
# 4. Shared-file fan-out works.
terragrunt run --all plan \
--queue-include-units-reading "$(git rev-parse --show-toplevel)/infra/live/_envcommon/eks.hcl" \
--non-interactive
# Expect: every unit that includes _envcommon/eks.hcl appears.
# 5. State keys are isolated per unit.
aws s3 ls s3://acme-tfstate-222222222222/prod/us-east-1/ --recursive
# Expect distinct keys: .../platform/network/terraform.tfstate, .../platform/eks/...
You are checking four properties: the graph is acyclic and ordered correctly, --filter-affected plans only what changed plus its dependents, a shared-file edit fans out to every reader, and each unit owns a distinct state key.
Where this approach stops scaling
run --all over a single repo has a ceiling. Two patterns push it out. First, partition the run root: never run run --all from the repo root in production — anchor it at an account or region so the DAG and blast radius stay bounded. Second, when units genuinely form a deployable bundle (a whole environment promoted at once), evaluate Terragrunt stacks (terragrunt.stack.hcl), which compose units into a higher-level node you version and run as one. Stacks are newer and some flag interactions still have rough edges, so validate on a non-critical environment first.
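Partitioning the run root can be mechanical. As a sketch, assuming every region directory carries a region.hcl as in the layout above, the hypothetical helper below lists those directories so that each one can be used as a bounded run root.

```shell
# List every directory under $1 that contains a region.hcl; each one is a
# candidate run root for a bounded `run --all`.
region_roots() {
  find "${1:-.}" -name region.hcl -type f \
    ! -path '*/.terragrunt-cache/*' \
    -exec dirname {} \; | sort
}

# Usage: one bounded plan per region instead of one run from the repo root.
#   for root in $(region_roots infra/live); do
#     (cd "$root" && terragrunt run --all plan --non-interactive)
#   done
```

Each iteration builds only that region's DAG, so a failure or a slow unit in one region does not hold up the others, and the blast radius of any single command stays at one region.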
The discipline underneath all of it is the same one that makes the live/modules split worth maintaining: the directory path is the identity, the DAG is derived not authored, and every selective-execution flag just runs a subset of that derived graph. Get the graph right and run --all is a detail. Get it wrong and no flag will save the apply.