The Terraform Architecting Ladder: From a Single Module to an Enterprise IaC Platform

The most expensive Terraform mistake I review is almost never a syntax error. It is a structural one — a team that has built a sprawling, multi-account, policy-gated platform to manage a handful of resources two engineers touch once a month, or its mirror image: a business running its entire production estate out of a single main.tf with local state on one laptop, one terraform apply away from an outage nobody can reconstruct. Both teams skipped the only question that matters in Terraform architecture: what do the team and the estate actually demand, and what is the simplest structure that meets it with room to grow?

Terraform architecture is not a contest of impressive tooling — it is the disciplined act of adding exactly as much structure as team size, blast radius and governance burden force you to, and not one layer more. So this lesson teaches Terraform as a ladder: six designs for managing the same infrastructure, from a single root config you can run in an afternoon to an enterprise platform hundreds of engineers self-serve against. Each rung adds one named capability — reuse, state safety, environment isolation, multi-account DRY composition, automated governance, self-service with drift control — and each has a price in complexity. The skill it builds is reading a situation (how many people, environments, how much risk, what the auditor asks) and landing on the right rung.

This lesson assumes you know the moving parts; here we assemble them into architectures. Two companions go deep where this stays wide: DRY multi-environment infrastructure with Terragrunt unpacks rung 4 in full, and Terraform remote state at scale is the backbone of rungs 2 through 6.

Learning objectives

By the end of this lesson you will be able to:

Prerequisites & where this fits

This is an Advanced lesson in the Terraform Zero-to-Hero course (Architecture module). You will get the most from it having already met the core workflow and state from the fundamentals lesson, authored a reusable module, worked through Terragrunt fundamentals (dependency, run-all), and seen the multi-environment 3-tier centerpiece. This lesson is the synthesis — it shows all of them driven by requirements across a realistic progression rather than as isolated techniques, and is the bridge into the portfolio and certification capstones.

A note on versions: everything targets Terraform 1.x (the 1.9/1.10 line current in 2026) and notes OpenTofu — the MPL-licensed fork many teams adopted after the BSL relicensing — as a drop-in alternative wherever it matters; the ladder is identical for both. Terragrunt references assume its current direction toward first-class stacks and units. The reasoning, not the exact version, is the lesson. A running example keeps the variable honest: every rung manages the same infrastructure — “AcmeStack”, a network, a few compute instances, a database and a bucket — so the only thing that changes between designs is how many people and environments depend on it, and how much a bad change costs.

How to read the requirement axes

Every rung is described against the same set of axes. Internalise these — they are the vocabulary of a Terraform structural decision, and they are exactly what a good reviewer probes.

| Axis | What it asks | Why it drives structure |
| --- | --- | --- |
| Team size & concurrency | One person, or many committing at once? | Concurrent applies force remote state + locking; many teams force isolation and code review gates. |
| Environments | One, or dev/stage/prod (and more)? | Multiple environments force a strategy to keep their configs DRY yet independently safe. |
| Accounts / subscriptions | One cloud account, or many (per-env, per-team)? | Multi-account forces per-account backends, provider/credential routing and DRY generation. |
| Blast radius | If one apply goes wrong, what breaks? | Drives state splitting: the smaller the state, the smaller the failure domain. |
| Governance & compliance | Must changes be reviewed, scanned, policy-checked, audited? | Forces policy-as-code, CI gates, approvals and an audit trail. |
| Self-service | Do non-experts need to provision safely? | Forces a platform layer (golden modules, guardrails, a portal) so people can’t foot-gun. |
| Change velocity & drift | How often does it change; does it drift out-of-band? | Forces automation, drift detection and continuous reconciliation. |

Keep this table in your head as you read each rung. The structure is always a response to a specific movement in these axes — never an aesthetic preference or a résumé line.

Rung 1 — Single root configuration, local state

Scenario. You are learning Terraform, prototyping a stack, or building a throwaway sandbox. One person works on it; there is exactly one environment (in effect, yours); nobody else applies; if it breaks, you destroy and recreate it. There is no audit requirement and no concept of “production”. This is also the legitimate starting point for almost every real project on day one.

The design. A single directory — one root module — holding main.tf, variables.tf, outputs.tf and versions.tf (with pinned required_version and provider versions). Resources are declared inline; state is the default local terraform.tfstate file sitting next to the code. The workflow is the bare core loop: terraform init, plan, apply, destroy. No modules of your own yet, no backend, no automation.

acmestack/
  main.tf        # vnet, instances, db, bucket — declared inline
  variables.tf
  outputs.tf
  versions.tf    # required_version + provider pins
  terraform.tfstate   # local state (gitignored)
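
As a rough illustration of the pinning mentioned above, a versions.tf for this rung might look like the following. This is a minimal sketch: the AWS provider and the exact version constraints are assumptions chosen for the example, not something the lesson prescribes.

```hcl
# versions.tf (illustrative; provider choice and version numbers are assumptions)
terraform {
  # Accept any 1.x release from 1.9 upward, but refuse a future 2.0
  required_version = ">= 1.9.0, < 2.0.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0" # any 5.x release, never 6.0
    }
  }
}
```

Pinning costs nothing at this rung and means a later terraform init on another machine resolves the same major versions you tested with.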

Key decisions & tradeoffs. The defining decision is local state, on the principle of doing the simplest thing that works: zero setup, instant feedback, a tight learning loop. The tradeoffs are stark and must be named. State lives on one machine (lose the laptop, lose the map of reality). There is no shared locking: the local backend only takes an OS-level lock on the state file on that one machine, which cannot coordinate a second person working from another copy (acceptable here because there is no second person). Secrets land in plaintext in terraform.tfstate (so it must be .gitignored and never committed; see security notes). And there is no isolation, because there is only one environment. The related decision is no custom modules: inlining is correct while the stack is small and singular, and premature modularisation is its own anti-pattern.

| Concern | Choice at this rung | The catch |
| --- | --- | --- |
| State | Local terraform.tfstate | Single machine; no backup; no shared locking |
| Reuse | Inline resources, no modules | Copy-paste if you ever clone the stack |
| Isolation | None (one environment) | Cannot safely run a second copy |
| Collaboration | Single user | A second person applying from their own copy diverges from, or overwrites, your state |
| Secrets | Plaintext in local state | Must be gitignored; never commit |

When this is enough. Genuinely: learning, demos, spikes, personal sandboxes, and the very first commit of any project. Do not add a backend, modules or a pipeline to a thing you are about to throw away or have not yet shaped. Stop here unless a second person needs to apply, the state matters enough to need durability and locking, or you have started copy-pasting the config to make a second environment — any one of those is the signal to climb to rung 2.

Rung 2 — Reusable modules + remote state

Scenario. AcmeStack is now real and shared. Two or three engineers commit to it; the state genuinely matters (it maps live infrastructure the team depends on); two people might run apply close enough in time to collide. You have also noticed you are copy-pasting blocks — the same “instance with disk, NIC and tags” appears three times. There is still effectively one environment, or one that you stand up by hand, but the team and the value of the state have crossed a threshold.

The design. Two upgrades, together. First, extract reusable modules: factor recurring resource groupings into local child modules (modules/network, modules/compute, modules/database) with typed inputs, validation, sensitive outputs and a README; the root module shrinks to a composition that wires modules together. Second, move state to a remote backend with locking — an S3 bucket + DynamoDB lock table, an Azure Storage account + blob lease, or a GCS bucket (which has built-in locking). Now state is durable, shared and safe under concurrency.

acmestack/
  main.tf          # composes the modules
  backend.tf       # remote backend (S3/azurerm/gcs) with locking
  modules/
    network/       # main.tf variables.tf outputs.tf — typed, validated
    compute/
    database/
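
To make "typed inputs, validation, sensitive" concrete, here is a small sketch of what the variable declarations in such modules could look like. The variable names (cidr_block, tags, admin_password) are hypothetical and only illustrate the pattern.

```hcl
# modules/network/variables.tf (illustrative names)
variable "cidr_block" {
  type        = string
  description = "CIDR range for the network."

  validation {
    # cidrhost() raises an error on a malformed CIDR; can() converts that into false
    condition     = can(cidrhost(var.cidr_block, 0))
    error_message = "cidr_block must be a valid CIDR range, for example 10.0.0.0/16."
  }
}

variable "tags" {
  type        = map(string)
  description = "Tags applied to every resource the module creates."
  default     = {}
}

# modules/database/variables.tf (illustrative)
variable "admin_password" {
  type        = string
  description = "Initial administrator password."
  sensitive   = true # redacted in plan/apply output, but still stored in state
}
```

Note that sensitive only hides the value from CLI output; it is one more reason the state itself has to live somewhere encrypted and access-controlled.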

Key decisions & tradeoffs. The first decision is the module boundary: reuse versus over-abstraction. A module that thinly wraps one resource adds indirection for nothing; one that bundles a coherent, repeated unit pays for itself immediately — so extract on the second or third repetition, around a concept you can name. The second decision is the backend and its locking story, and the load-bearing point is locking: a remote state without locking is more dangerous than local state, because now multiple people can race it. (S3 historically needed a DynamoDB table; modern Terraform also supports S3-native lockfile locking — either way, locking must be on.) State now also lives off any laptop, encrypted at rest, backed up and versioned.
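
The backend decision described above might be expressed roughly as follows for S3. This is a sketch under assumptions: the bucket name, key and region are placeholders, and use_lockfile requires Terraform 1.10 or later (older versions would use the DynamoDB table shown in the comment).

```hcl
# backend.tf (illustrative; bucket, key and region are hypothetical)
terraform {
  backend "s3" {
    bucket  = "acme-tfstate-example"
    key     = "acmestack/terraform.tfstate"
    region  = "eu-west-1"
    encrypt = true # encrypt state at rest

    # S3-native locking (Terraform 1.10+): a lock object is written next to the state
    use_lockfile = true

    # Older alternative: a DynamoDB table with a "LockID" string hash key
    # dynamodb_table = "acme-tf-locks"
  }
}
```

Bucket versioning is configured on the bucket itself rather than in this block, so it needs to be switched on wherever the bucket is created.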

| Upgrade | Rung 1 | Rung 2 |
| --- | --- | --- |
| Reuse | Inline resources | Local child modules with typed inputs/outputs |
| State location | Local file | Remote backend (S3/azurerm/gcs) |
| Locking | None shared (file lock on one machine only) | State locking on (DynamoDB or S3 lockfile / blob lease / GCS) |
| Durability | One laptop | Durable, encrypted, versioned bucket |
| Team | One person | A few engineers, concurrency-safe |

This is the rung Terraform remote state at scale opens on — backend choice, locking semantics and partial config are its subject, and everything above here builds on the remote-state foundation laid at this rung.

When this is enough. Small teams managing a single, coherent estate with one environment (or one stood up by hand): an internal tool, a startup’s early footprint, a shared sandbox-plus-prod where “prod” is one thing. You now have durability, locking and reuse — the three things rung 1 lacked. Stop here unless you need genuinely separate, independently-applyable environments (dev and stage and prod), at which point the single state and backend become the bottleneck that pushes you to rung 3.

Rung 3 — Multi-environment (workspaces or folder-per-env)

Scenario. AcmeStack now has to exist as distinct environments — at least dev, staging and prod — that evolve at different speeds and must fail independently. A mistake in dev must be impossible to apply to prod by accident. The environments are largely the same shape (that is the point of having them) but differ in size, counts and a handful of settings. The team is small-to-medium and wants to avoid duplicating the whole configuration three times.

The design. Keep the modules from rung 2, but introduce an environment strategy with per-environment state isolation. There are two mainstream shapes, and choosing between them is the whole decision at this rung:

  1. CLI workspaces — one configuration, multiple named workspaces (terraform workspace new staging), each with its own state file under the same backend (the backend stores state keyed per workspace). Differences are driven by terraform.workspace and per-workspace .tfvars.
  2. Directory-per-environment: a folder per env (environments/dev, environments/staging, environments/prod), each with its own backend key and its own *.tfvars, all consuming the same shared modules.

acmestack/
  modules/ ...                 # shared, as in rung 2
  environments/
    dev/      main.tf  backend.tf(key=dev)   dev.tfvars
    staging/  main.tf  backend.tf(key=stg)   staging.tfvars
    prod/     main.tf  backend.tf(key=prod)  prod.tfvars

Key decisions & tradeoffs. The headline decision is workspaces vs folders — a genuine fork with a well-known trap. Workspaces are cheap (no duplication) but share one backend, one provider config and one set of code, making it dangerously easy to point the same code at the wrong environment; the weak isolation is why many teams have applied a dev change to prod after forgetting which workspace was selected. HashiCorp’s own guidance is that workspaces suit short-lived or near-identical variants, but separate directories (or separate root modules) are the safer pattern for long-lived dev/stage/prod — each environment gets a physically distinct backend key, can pin its own provider/module versions, and cannot be confused for another. The cost of folders is some duplication of thin root wiring — exactly the pain rung 4 exists to remove. The second decision is expressing per-env differences via .tfvars, keeping the shape identical and only the values different, so promotion is “same code, different inputs”.
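
If you do use workspaces, one way to reduce the wrong-environment risk described above is to key every per-environment setting off terraform.workspace through a single lookup map. The sketch below assumes hypothetical setting names and sizes.

```hcl
# Workspace-driven settings (illustrative values)
locals {
  env_settings = {
    dev     = { instance_count = 1, instance_size = "small" }
    staging = { instance_count = 2, instance_size = "medium" }
    prod    = { instance_count = 4, instance_size = "large" }
  }

  # Indexing with an unknown key is an error, so planning in the "default"
  # workspace (or a mistyped one) fails instead of silently using some fallback
  settings = local.env_settings[terraform.workspace]
}

output "active_environment" {
  value = "${terraform.workspace}: ${local.settings.instance_count} x ${local.settings.instance_size}"
}
```

This guards against unknown workspaces, but not against selecting a valid one by mistake, which is the isolation weakness the paragraph above warns about.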

| Strategy | State isolation | Isolation strength | Duplication | Best for |
| --- | --- | --- | --- | --- |
| CLI workspaces | Per-workspace key, shared backend/providers | Weak: easy to target the wrong env | Minimal | Short-lived or near-identical variants |
| Folder-per-env | Separate backend key per folder | Strong: physically distinct | Some root-wiring duplication | Long-lived dev/stage/prod |
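
For the folder-per-env shape, a single environment root might look roughly like this. It is a sketch only: the bucket, key, variable names and module inputs are assumptions, and the backend block is folded into the same file here for brevity rather than kept in a separate backend.tf.

```hcl
# environments/prod/main.tf (illustrative)
terraform {
  backend "s3" {
    bucket       = "acme-tfstate-example"             # hypothetical bucket
    key          = "acmestack/prod/terraform.tfstate" # unique key per environment
    region       = "eu-west-1"
    encrypt      = true
    use_lockfile = true # Terraform 1.10+; use dynamodb_table on older versions
  }
}

variable "instance_count" { type = number }
variable "instance_size" { type = string }

# Same shared module as dev and staging; only the inputs differ
module "compute" {
  source         = "../../modules/compute"
  instance_count = var.instance_count
  instance_size  = var.instance_size
}

# environments/prod/prod.tfvars would then hold only values, for example:
#   instance_count = 4
#   instance_size  = "large"
```

Because the backend key is hard-coded per folder, running apply in environments/dev cannot reach prod state regardless of what is selected or exported in the shell.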

When this is enough. Teams running a handful of environments in one cloud account (or where multi-account is not yet a requirement), happy to repeat a little root-module wiring per environment, and not yet needing automated governance. This is a very common and perfectly respectable place to stop — many solid teams live here for years. Stop here unless you are spreading across multiple accounts/subscriptions (per-env or per-team), the per-env duplication has become real maintenance pain, or you need to wire dependencies between stacks and run them together — those push you to rung 4.

Rung 4 — Terragrunt DRY across multiple accounts + dependencies

Scenario. AcmeStack now spans multiple cloud accounts — a common enterprise pattern of one account (or subscription/project) per environment, sometimes per team, for hard isolation and billing separation. You are feeling two distinct pains: duplication (the same backend block, provider block and wiring copy-pasted into every environment folder, with only a key or region changing) and dependencies (the compute stack needs the network stack’s outputs; the database stack needs both; you want to apply them in the right order across many units with one command). The team is medium-sized and the estate has grown to many stacks per environment.

The design. Introduce Terragrunt as a thin orchestration layer over the same Terraform modules. A root terragrunt.hcl defines the backend and provider once and generates them into every unit (remote_state, generate "provider"), with per-account/per-env values resolved from the folder hierarchy via find_in_parent_folders() and path_relative_to_include(). Each leaf unit is a tiny terragrunt.hcl that points at a module source and supplies inputs. Inter-stack wiring uses dependency blocks to pass one unit’s outputs into another’s inputs, and run-all applies the whole graph in dependency order. Per-account credentials are routed per environment, so prod’s apply uses prod’s account.

live/
  terragrunt.hcl                      # root: remote_state + generate provider (DRY)
  dev/   account.hcl                  # account-id / creds for dev
    network/   terragrunt.hcl         # source=modules/network ; inputs
    compute/   terragrunt.hcl         # dependency "network" -> inputs.subnet_ids
    database/  terragrunt.hcl
  prod/  account.hcl
    network/   ...                    # same modules, prod inputs, prod backend key
    compute/   ...
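
A root terragrunt.hcl matching this layout could be sketched as below. Assumptions are loud here: account.hcl is presumed to expose locals named account_id and region, the bucket and lock table names are placeholders, and newer Terragrunt releases recommend naming the root file root.hcl instead.

```hcl
# live/terragrunt.hcl (illustrative root configuration)
locals {
  # Reads the nearest account.hcl above the unit being run (dev/ or prod/)
  account    = read_terragrunt_config(find_in_parent_folders("account.hcl"))
  account_id = local.account.locals.account_id # assumed local in account.hcl
  region     = local.account.locals.region     # assumed local in account.hcl
}

# Backend defined once and generated into every unit as backend.tf
remote_state {
  backend = "s3"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
  config = {
    bucket         = "acme-tfstate-${local.account_id}" # one bucket per account
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = local.region
    encrypt        = true
    dynamodb_table = "acme-tf-locks"
  }
}

# Provider defined once and generated into every unit as provider.tf
generate "provider" {
  path      = "provider.tf"
  if_exists = "overwrite_terragrunt"
  contents  = <<EOF
provider "aws" {
  region              = "${local.region}"
  allowed_account_ids = ["${local.account_id}"]
}
EOF
}
```

The allowed_account_ids line is a cheap guardrail: if credentials for the wrong account are active, the provider refuses to run rather than applying prod code into dev or the reverse.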

Key decisions & tradeoffs. The first is adopting Terragrunt at all: it removes rung-3 duplication (backend/provider generated once, environments become inputs-only) and adds cross-stack dependency wiring and run-all, at the price of a second tool to learn and pin plus indirection that can obscure what plain Terraform would do. It earns its keep with many units across many accounts; for three small environments it can be over-engineering (rung-3 folders may be plenty). Note Terragrunt’s current direction toward first-class stacks/units, which formalises this composition. The second decision is per-account state-key and credential layout — each environment gets its own backend (often in its own account) and credentials, bounding blast radius by account and stack. The third is dependency granularity: splitting network/compute/database into separate states shrinks each blast radius and enables dependency wiring, but too-fine splitting makes run-all graphs sprawling and slow — split around stable seams (network rarely changes; app frequently does).
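
The leaf unit and dependency wiring described above might look like this for the compute stack. The module path, input names and mock values are hypothetical; the sketch assumes the modules directory sits beside live/ and that the network module exposes a subnet_ids output.

```hcl
# live/dev/compute/terragrunt.hcl (illustrative leaf unit)
include "root" {
  # Finds the root terragrunt.hcl; newer Terragrunt prefers an explicit root.hcl name
  path = find_in_parent_folders()
}

terraform {
  source = "../../../modules/compute" # assumed location of the shared module
}

dependency "network" {
  config_path = "../network"

  # Placeholder outputs so plan/validate work before the network unit is applied
  mock_outputs = {
    subnet_ids = ["subnet-00000000"]
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}

inputs = {
  subnet_ids     = dependency.network.outputs.subnet_ids
  instance_count = 2
}
```

Restricting the mocks to validate and plan matters: it keeps a placeholder subnet from ever reaching a real apply.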

| Concern | Rung 3 (folders) | Rung 4 (Terragrunt multi-account) |
| --- | --- | --- |
| Backend/provider config | Repeated per env folder | Generated once, inherited everywhere |
| Accounts | Usually one | Multiple (per-env/per-team), creds routed per unit |
| Inter-stack wiring | Manual terraform_remote_state | dependency blocks + run-all ordering |
| Per-env difference | .tfvars | inputs in tiny leaf terragrunt.hcl |
| Extra tooling | None | Terragrunt (a second binary to pin) |

This rung is treated end-to-end in DRY multi-environment infrastructure with Terragrunt — backend/provider generation, the dependency/run-all graph and dev→prod promotion are its whole subject, so that lesson is your build manual here.

When this is enough. Medium-to-large teams running many stacks across multiple accounts who have removed duplication and wired dependencies, but whose governance is still “we review pull requests and trust each other” — a great many competent platform teams. Stop here unless you need enforced governance (automated policy checks, required approvals, scanning, a private registry, an audit trail) — that compliance pressure pushes you to rung 5.

Rung 5 — Module registry + policy-as-code + CI/CD approval gates

Scenario. AcmeStack is now governed. Security and compliance require that no infrastructure change reaches prod without automated checks and a recorded approval: every plan must be policy-checked (no public S3 buckets, mandatory tags, allowed regions/SKUs only), scanned for misconfiguration and secrets, and gated by required reviewers, with the whole thing logged for audit. Multiple teams now consume shared modules, so those modules need to be versioned and discoverable rather than copied between repos. Credentials must not be long-lived secrets sitting in CI.

The design. Three layers added on top of rung 4 (which the apply still runs underneath). First, a module registry: publish reusable modules with SemVer tags to a private registry (a private Terraform Registry, Terraform Cloud/Spacelift private registry, or Git-ref-pinned modules), with generated docs (terraform-docs) and tests (terraform validate, Terratest); consumers pin versions. Second, policy-as-code and scanning in CI: a pipeline runs fmt → validate → tflint → security scan (tfsec/Trivy/Checkov) → plan → policy evaluation against the plan JSON (OPA/Conftest or Sentinel), and only then a gated apply. Third, CI/CD approval gates with keyless auth: dev auto-applies; staging/prod require manual approval (GitHub Environments with required reviewers, Azure DevOps approvals, or Atlantis/Spacelift policies), and CI authenticates to the cloud via OIDC, which is short-lived and keyless, with no stored secrets.

CI pipeline (per PR / per env):
  fmt → validate → tflint → tfsec/checkov(scan) →
  plan(-out) → opa/sentinel(policy on plan.json) →
  [dev: auto-apply]  [staging/prod: manual approval] → apply
auth: OIDC federation → short-lived cloud creds (no static keys)
registry: modules published with SemVer tags + terraform-docs + Terratest

Key decisions & tradeoffs. The first is where policy is enforced — pre-merge (fast feedback, admin-bypassable) versus on the plan in the apply pipeline (authoritative, harder to bypass); mature setups do both, with the plan-time gate as the source of truth. The cost is friction: every gate adds latency and the odd false positive, so policies must be curated, not maximalist, or engineers route around them. The second decision, OIDC over static credentials, is close to non-negotiable: short-lived federated tokens remove the single most common Terraform breach — a long-lived cloud key leaked from CI. The third is registry and versioning discipline: publishing modules with SemVer and pinning consumers turns “everyone copies the VPC module” into “everyone depends on network ~> 3.2” — the difference between coordinated and chaotic upgrades — at the cost of release process and the discipline of not breaking SemVer.
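
The versioning discipline above shows up on the consumer side as a pinned module source. A minimal sketch follows; the organisation name, registry address, Git URL and input are all hypothetical.

```hcl
# Consuming a shared module from a private registry (illustrative address)
module "network" {
  source  = "app.terraform.io/acme-example/network/aws"
  version = "~> 3.2" # accepts 3.2 and later 3.x releases, never 4.0

  cidr_block = "10.0.0.0/16" # hypothetical module input
}

# Git-ref alternative when no registry is available. The version argument is
# not supported for Git sources, so the tag is pinned in the URL instead:
#
# module "network" {
#   source     = "git::https://git.example.com/acme/terraform-aws-network.git?ref=v3.2.1"
#   cidr_block = "10.0.0.0/16"
# }
```

With a registry, a breaking change has to ship as 4.0 and every consumer opts in deliberately; with copied modules there is no such boundary.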

| Layer | What it adds | Tools (2026) | The cost |
| --- | --- | --- | --- |
| Module registry | Versioned, discoverable shared modules | Private TFC/Spacelift registry, Git tags, terraform-docs, Terratest | Release process + SemVer discipline |
| Scanning | Catch misconfig/secrets pre-apply | tfsec, Trivy, Checkov | False positives; needs tuning |
| Policy-as-code | Enforce org rules on the plan | OPA/Conftest, Sentinel | Friction; curate, don’t maximise |
| Approval gates | Required human sign-off + audit trail | GitHub Environments, Azure DevOps approvals, Atlantis | Latency on every prod change |
| Keyless auth | No long-lived secrets in CI | OIDC federation | One-time federation setup |
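
For the keyless-auth layer, the cloud side of OIDC federation is itself Terraform. The sketch below shows one possible AWS trust policy for a GitHub Actions pipeline; it assumes the IAM OIDC provider for GitHub already exists and is passed in by ARN, and the repository, environment and role names are hypothetical.

```hcl
# IAM role that only a specific repo + GitHub Environment can assume (illustrative)
variable "github_oidc_provider_arn" {
  type        = string
  description = "ARN of an existing IAM OIDC provider for token.actions.githubusercontent.com."
}

data "aws_iam_policy_document" "ci_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [var.github_oidc_provider_arn]
    }

    # Token must be issued for AWS STS
    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # Token must come from this repository's "prod" environment
    condition {
      test     = "StringLike"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:acme-example/acmestack:environment:prod"]
    }
  }
}

resource "aws_iam_role" "ci_prod" {
  name               = "acmestack-ci-prod"
  assume_role_policy = data.aws_iam_policy_document.ci_trust.json
}
```

Scoping the subject claim to the environment ties the credential to the approval gate: a job that has not passed the prod environment's required reviewers cannot obtain the prod role.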

Companion lessons map onto this rung exactly — policy-as-code with OPA/Conftest on the plan and Sentinel policy sets, and scanning with tfsec/Trivy/Checkov — and the CI promotion/approval pattern is the back half of the multi-environment 3-tier centerpiece.

When this is enough. Organisations with real compliance obligations, multiple consuming teams, and a need to prove (not just assert) that every change was checked and approved. This is where most serious enterprises land and stay — governance, shared-module reuse and keyless security without the cost of a full self-service platform. Stop here unless the number of teams and demand for self-service grows to where a central team gating every change becomes the bottleneck, drift goes undetected across a large estate, and you need a product-grade platform — rung 6.

Rung 6 — Enterprise IaC platform

Scenario. This is the apex. AcmeStack is now a fraction of an estate that dozens of teams and hundreds of engineers provision against. The central platform team cannot be in the loop for every change without becoming a bottleneck, so the requirement flips from gating to enabling: teams must self-serve golden, pre-approved infrastructure safely; drift must be detected and reconciled continuously across thousands of resources; governance must be centralised and consistent; and the whole thing must have an audit trail, cost visibility and an SLA. The platform itself is now a product with internal customers.

The design. Adopt a managed/orchestrated IaC platform, such as Terraform Cloud/Enterprise (HCP Terraform), Spacelift, or Atlantis (self-hosted), and wrap everything below it. The platform provides remote runs (plan/apply executed centrally, not on laptops or ad-hoc CI), VCS-driven workflows with speculative plans on PRs, a private module registry as the source of golden modules, integrated policy-as-code (Sentinel/OPA) enforced on every run, RBAC and SSO, OIDC/dynamic provider credentials, and state managed by the platform with full history and locking. On top sit self-service capabilities (no-code/golden modules, run templates, or a Backstage-style portal so a developer fills a form and gets a compliant stack) and continuous drift detection that schedules plans, surfaces out-of-band changes, and can auto-remediate or alert. Governance (policy, cost estimation, run approvals) is configured centrally and applied everywhere.

Enterprise IaC platform (TFC/Enterprise · Spacelift · Atlantis)
  ├─ Remote runs (central plan/apply) ── state mgmt + history + locking
  ├─ VCS-driven: speculative plan on PR → policy → cost estimate → approve → apply
  ├─ Private module registry (golden modules, SemVer, no-code modules)
  ├─ Policy-as-code (Sentinel/OPA) enforced on every run + RBAC/SSO + OIDC
  ├─ Self-service: golden/no-code modules · Backstage portal · run templates
  └─ Continuous drift detection → alert / auto-reconcile

Key decisions & tradeoffs. The first is buy vs self-host: HCP Terraform/Enterprise and Spacelift are managed (fast, supported, licensed per-resource/seat and genuinely expensive at scale); Atlantis is open-source self-hosted (cheap in licence, costly in team-time to run and harden). Either way you are funding a platform team and product, justified only by the leverage it gives many consuming teams. The second decision is the shape of self-service: golden/no-code modules and a portal remove foot-guns, but every guardrail is a constraint someone will eventually need to escape, so the platform needs a sanctioned “break glass” path or people route around it. The third is drift strategy: detect-and-alert (safe, needs humans) versus auto-reconcile (powerful, but auto-applying to fix drift can cause incidents when the drift was an intentional emergency fix) — most platforms start alert-only and graduate specific stacks to auto-remediation. The overriding point for any sponsor: this is the most expensive and complex rung, in licence and in the SRE organisation it requires, justified by organisational scale — many teams, large estate, real compliance — not ambition.
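
If the platform chosen is HCP Terraform, the per-stack change is small: the backend block gives way to a cloud block, and runs and state move to the platform. A minimal sketch, with a hypothetical organisation and workspace name:

```hcl
# Connecting a root module to HCP Terraform (illustrative names)
terraform {
  cloud {
    organization = "acme-example"

    workspaces {
      name = "acmestack-prod"
    }
  }
}
```

Everything else in this rung (policy sets, RBAC, drift schedules, the portal) is configured on the platform rather than in each stack, which is exactly why the cost sits with the platform team and not with consumers.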

| Capability | What it delivers | The catch |
| --- | --- | --- |
| Remote runs + managed state | Central, audited plan/apply; no laptop applies | Licence cost; platform is now critical infra |
| Integrated policy + cost | Sentinel/OPA + cost estimate on every run | Central team owns policy lifecycle |
| Private registry + no-code | Golden, self-serve modules | Guardrails need a break-glass escape |
| RBAC/SSO + OIDC | Enterprise auth, keyless cloud access | Identity integration effort |
| Continuous drift detection | Estate-wide drift surfaced/reconciled | Auto-remediation can cause incidents |
| Self-service portal (Backstage) | Developers provision via a form | A whole portal to build and run |

Drift is the recurring operational theme here; the detection-and-reconciliation strategy this rung depends on is treated in full in the dedicated Terraform drift detection and reconciliation lesson, and the registry/state foundations come straight from Terraform remote state at scale.

When this is enough. When the organisation, not any single workload, demands it: many autonomous teams blocked by a central bottleneck, an estate too large to govern by reviewing every PR by hand, compliance that requires centralised policy and audit, and the engineering maturity to run a platform as a product. This is the ceiling of the ladder — there is nothing above it; the work beyond this rung is operating it well (golden-path quality, drift hygiene, cost governance), not adding more architecture. For the vast majority of teams, reaching this rung would be a textbook over-engineering error.

The Terraform architecting ladder

The diagram lays the six rungs side by side so the shape of the climb is visible at a glance: each step adds a specific capability — durability and locking → reuse → environment isolation → multi-account DRY composition → enforced governance → self-service at scale — while complexity and operational burden rise non-linearly, and the lesson is to climb exactly as high as the team and the estate force you and no higher.

How to choose a rung from requirements

You never pick a rung by taste. You read the axes and let them point. Here is the decision distilled into a single table: read it top to bottom and settle on the last row whose requirement you genuinely have, since each row assumes the needs of the rows above it.

| If the situation is… | …the rung is | Why |
| --- | --- | --- |
| Learning, a spike, a throwaway sandbox; one person | 1: Single root + local state | Zero setup, tight loop; nothing of value to protect yet |
| A few engineers, shared state that matters, copy-paste creeping in | 2: Modules + remote state | Durability, locking and reuse, the three things rung 1 lacks |
| Distinct dev/stage/prod that must fail independently | 3: Multi-env (workspaces/folders) | Per-env state isolation; prefer folders for long-lived envs |
| Many stacks across multiple accounts + cross-stack dependencies | 4: Terragrunt DRY multi-account | Removes duplication; dependency/run-all ordering |
| Compliance needs enforced policy, scanning, approvals, shared versioned modules | 5: Registry + policy-as-code + CI gates | Provable governance + keyless OIDC + module reuse |
| Many teams need safe self-service; drift across a huge estate; central governance | 6: Enterprise IaC platform | Enabling at scale; justified only by organisational size |

Four rules govern the whole climb:

  1. The team and the estate drive the rung — not fashion, not résumé-building. The single best question in any review is “what requirement forces us off the rung below?” If you cannot answer it crisply, you have over-engineered.
  2. State isolation is the spine of the ladder. Local → remote+locking → per-env → per-account → platform-managed. Each step shrinks the blast radius. Most reliability wins on this ladder are really state-isolation wins.
  3. Remote state with locking (rung 2) is the highest-ROI single step. It removes the most common real failures — lost state and concurrent-apply corruption — for almost no complexity. Most teams should reach it on roughly their second week, not their second year.
  4. Every step up spends simplicity to buy reuse, isolation or governance. Make the trade deliberately, write down what you bought and what it cost in complexity, and you will rarely be wrong.

The honest summary: most competent teams belong on rung 3 or 4. Rung 2 is the floor for anything shared. Rung 5 is for the genuinely compliance-bound and multi-team. Rung 6 is for large organisations with the scale to amortise a platform — and over-engineering everywhere else. Climbing this ladder is easy; the discipline — and the seniority — is knowing when to stop.

Hands-on lab

This lab walks the rung-1 → rung-2 transition — the most important step on the ladder — on your own machine with only the free, offline random and local providers (no cloud account, no charges). Prerequisites: Terraform 1.x or OpenTofu (terraform -version / tofu -version); everything works identically with tofu substituted for terraform.

Step 1 — Rung 1: a single root config with local state.

mkdir -p ladder-lab && cd ladder-lab
cat > main.tf <<'EOF'
terraform {
  required_version = ">= 1.5.0"
  required_providers {
    random = { source = "hashicorp/random", version = "~> 3.6" }
    local  = { source = "hashicorp/local",  version = "~> 2.5" }
  }
}
variable "environment" {
  type    = string
  default = "dev"
  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "environment must be one of dev, staging, prod."
  }
}
resource "random_pet" "stack" { length = 2 }
resource "local_file" "manifest" {
  filename = "${path.module}/manifest-${var.environment}.txt"
  content  = "AcmeStack: ${random_pet.stack.id} (env=${var.environment})\n"
}
output "stack_name" { value = random_pet.stack.id }
EOF
terraform init                 # backend = local by default
terraform apply -auto-approve  # -> "Apply complete! Resources: 2 added"

Validate. You now have a local terraform.tfstate and manifest-dev.txt; terraform state list shows random_pet.stack and local_file.manifest. That is rung 1: one config, local state, one “environment”.

Step 2 — Rung 2 move: extract a reusable module. Factor the stack into a child module, then have the root compose it:

mkdir -p modules/acmestack
cat > modules/acmestack/main.tf <<'EOF'
terraform {
  required_providers {
    random = { source = "hashicorp/random", version = "~> 3.6" }
    local  = { source = "hashicorp/local",  version = "~> 2.5" }
  }
}
variable "environment" { type = string }
resource "random_pet" "stack" { length = 2 }
resource "local_file" "manifest" {
  filename = "${path.root}/manifest-${var.environment}.txt"
  content  = "AcmeStack: ${random_pet.stack.id} (env=${var.environment})\n"
}
output "stack_name" { value = random_pet.stack.id }
EOF
cat > main.tf <<'EOF'
terraform {
  required_version = ">= 1.5.0"
  required_providers {
    random = { source = "hashicorp/random", version = "~> 3.6" }
    local  = { source = "hashicorp/local",  version = "~> 2.5" }
  }
}
variable "environment" {
  type    = string
  default = "dev"
}
module "acmestack" {
  source      = "./modules/acmestack"
  environment = var.environment
}
output "stack_name" { value = module.acmestack.stack_name }
EOF
terraform init                 # registers the module
terraform apply -auto-approve
terraform state list           # -> module.acmestack.random_pet.stack, ...

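Validate. Resources now live under module.acmestack.*, so the composition is real, and the same module could be instantiated again with a different environment. That is rung 2’s reuse. One caveat: because the resource addresses changed and this config declares no moved blocks, the apply destroys the two root-level resources and creates new ones inside the module, so expect a new pet name. That is harmless here, but on real infrastructure you would declare moved blocks (Terraform 1.1+) or run terraform state mv so the refactor plans as a no-op. The piece this offline lab cannot show is rung 2’s other half, a remote backend with locking. The local backend does lock the state file, but only on this machine: a teammate working from their own copy of the state is not blocked by it, and the state itself still lives on one disk. A remote backend (S3 with a lock file or DynamoDB table, an azurerm blob lease, or GCS) holds one shared copy and makes a second concurrent apply fail with a lock error, or wait if you pass -lock-timeout. See Terraform remote state at scale for the real backend setup.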
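Step 2 changes the resource addresses from random_pet.stack and local_file.manifest to module.acmestack.*. A minimal sketch of how that refactor could be made non-destructive, assuming the blocks below are appended to the root main.tf before the Step 2 apply (they are not part of the original lab):

```hcl
# Assumption: added to the root main.tf alongside the module "acmestack" call.
# Each block tells Terraform that an existing state entry now lives at a new
# address, so the plan records a move instead of a destroy plus a create.
moved {
  from = random_pet.stack
  to   = module.acmestack.random_pet.stack
}

moved {
  from = local_file.manifest
  to   = module.acmestack.local_file.manifest
}
```

With these in place, terraform plan should report the two resources as moved rather than replaced, and the pet name from Step 1 is kept. The blocks can be deleted once every copy of the state has been applied past the refactor.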

Cleanup.

terraform destroy -auto-approve
cd .. && rm -rf ladder-lab

Cost note. Zero — the random and local providers create no cloud resources. The only real cost of climbing the actual ladder is backend storage (pennies) and, at the top rungs, platform licences and a platform team.

Common mistakes & troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Building a platform (rung 5/6) for a 3-person team | Over-engineering; copying a big-company setup | Climb down: match the rung to team/estate; rung 2 or 3 is likely right |
| Error acquiring the state lock then stuck | A crashed/abandoned run left a lock (rung 2+) | Confirm no apply is running, then terraform force-unlock <LOCK_ID>; investigate why it crashed |
| A dev change was applied to prod | CLI workspaces misselected; weak isolation | Move long-lived envs to folders/separate roots (rung 3); never rely on the selected workspace for prod safety |
| State on one laptop; teammate can’t apply | Still on rung 1 local state with a team | Migrate to a remote backend with locking (terraform init -migrate-state), the rung-2 step |
| Secrets visible in terraform.tfstate | State always stores values in plaintext | Use a remote backend with encryption + tight IAM; never commit state; treat the bucket as a secret store |
| Terragrunt dependency returns empty/unknown outputs | Dependency not applied yet, or wrong output name | Apply dependencies first (or use mock_outputs for plan), and match the exact output name |
| CI apply fails with credential/permission errors after moving to OIDC | OIDC trust/role policy misconfigured | Fix the federated trust (correct subject claim, audience) and the role’s permissions; OIDC issues are config, not code |
| Drift keeps reappearing on a resource | Out-of-band change, or a provider default Terraform doesn’t manage | Detect with a scheduled plan (rung 6); decide to reconcile in code or ignore_changes if intentionally external |
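The "state on one laptop" fix in the table comes down to adding a backend block and re-running init. A hedged sketch, assuming an S3 bucket that already exists; the bucket name, key and region are hypothetical placeholders, and other clouds use the azurerm or gcs backends instead:

```hcl
terraform {
  backend "s3" {
    bucket  = "example-tfstate-bucket"       # hypothetical; must exist before init
    key     = "ladder-lab/terraform.tfstate" # hypothetical state path in the bucket
    region  = "eu-west-1"                    # hypothetical
    encrypt = true                           # server-side encryption of the state object

    # S3-native lock file; assumes Terraform 1.10 or later.
    # On older versions, locking is configured with dynamodb_table = "<table>".
    use_lockfile = true
  }
}
```

After adding the block, terraform init -migrate-state offers to copy the existing local state into the bucket. From then on, a run that cannot take the lock stops with "Error acquiring the state lock" instead of racing another apply.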

Best practices

Security notes

Interview & exam questions

  1. Walk me through the maturity progression from a single Terraform config to an enterprise platform. (Look for: the six rungs by capability — local→remote+locking→multi-env→multi-account DRY→policy+CI gates→platform; the idea that each step answers a requirement and shrinks blast radius; “climb only when forced”.)
  2. When would you choose CLI workspaces versus a directory-per-environment, and what’s the trap? (Look for: workspaces share one backend/provider/code → weak isolation and the classic “applied to the wrong env” bug; folders/separate roots give physical isolation per backend key → preferred for long-lived dev/stage/prod; workspaces fine for short-lived/near-identical variants.)
  3. Why is remote state with locking the highest-ROI step on this ladder? (Look for: removes lost-state and concurrent-apply corruption — the two most common real failures — for minimal complexity; a remote state without locking is more dangerous than local because it enables races.)
  4. What problem does Terragrunt solve at the multi-account rung, and what does it cost? (Look for: removes backend/provider duplication via generate/remote_state, adds dependency wiring and run-all ordering across many units/accounts; cost is a second tool to learn/pin and an extra layer of indirection; over-engineering for three small envs.)
  5. How do you bound blast radius as a Terraform estate grows? (Look for: split monolithic state into per-stack states along stable seams; per-environment and per-account isolation; smaller states = smaller failure domains; don’t over-split or run-all graphs sprawl.)
  6. Where and how should organisational policy be enforced in a Terraform workflow? (Look for: policy-as-code — OPA/Conftest or Sentinel — evaluated on the plan; ideally both pre-merge for fast feedback and at apply-time as the authoritative gate; curate policies to avoid friction; pair with scanning tfsec/Checkov.)
  7. Why move CI to OIDC instead of static cloud credentials, and at which rung? (Look for: short-lived keyless tokens remove the most common breach — a leaked long-lived key from CI; appears at rung 5 when you automate apply; it’s the secure default for pipelines now.)
  8. What does an “enterprise IaC platform” (rung 6) add over a good CI/CD + Terragrunt setup, and when is it justified? (Look for: remote managed runs/state, integrated policy + cost, private registry/no-code modules, RBAC/SSO, continuous drift detection, self-service portal; justified by organisational scale — many teams, large estate, central governance — not ambition; most expensive rung.)
  9. How do you keep multiple environments DRY without losing isolation? (Look for: shared modules consumed by each env; differences expressed as .tfvars/inputs only, identical shape; per-env separate backend keys/accounts; Terragrunt to generate the repeated backend/provider config; promotion = same code, different inputs.)
  10. What’s your strategy for drift in a large estate, and what’s the risk of auto-remediation? (Look for: scheduled plans to detect, alert vs auto-reconcile; auto-applying to “fix” drift can clobber an intentional emergency change and cause an incident; start alert-only, graduate safe stacks to auto-remediation; ignore_changes for deliberately external attributes.)
  11. Is this ladder strictly linear? Where can you skip or stop? (Look for: not strictly — most teams legitimately stop at rung 3 or 4; rung 6 is organisational, often skipped entirely; rung 2 is the floor for anything shared; the right answer is the cheapest structure that meets the requirement.)
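For question 10, the "deliberately external attribute" case can be illustrated with a lifecycle block. This is a sketch under assumptions: the autoscaling group, its sizes and both input variables are hypothetical, and the scenario is an autoscaler that adjusts desired_capacity outside Terraform.

```hcl
# Hypothetical inputs, declared so the fragment is self-contained.
variable "subnet_ids" {
  type = list(string)
}

variable "launch_template_id" {
  type = string
}

resource "aws_autoscaling_group" "web" {
  name                = "web"
  min_size            = 2
  max_size            = 10
  desired_capacity    = 2 # initial value only; scaling policies change it later
  vpc_zone_identifier = var.subnet_ids

  launch_template {
    id      = var.launch_template_id
    version = "$Latest"
  }

  lifecycle {
    # Changes made to this attribute outside Terraform are intentional,
    # so a scheduled plan should not flag them or try to revert them.
    ignore_changes = [desired_capacity]
  }
}
```

Keeping the ignore list this narrow matters: every other attribute is still compared against the code, so a scheduled plan continues to surface unintended drift.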

Quick check

  1. Which single step is the highest-ROI on the ladder, and what two failures does it remove?
  2. For long-lived dev/stage/prod, which environment strategy is safer — CLI workspaces or folders — and why?
  3. Name the two problems Terragrunt solves at the multi-account rung.
  4. At which rung does OIDC keyless auth appear, and what risk does it eliminate?
  5. What makes an enterprise IaC platform (rung 6) justified — and what is the one-line test for whether you’ve over-climbed?

Answers.

  1. Moving to a remote backend with locking (rung 2). It removes lost or un-backed-up state (state is now durable and off any laptop) and concurrent-apply corruption (locking serialises applies), for almost no added complexity.
  2. Folders / separate root modules. Each environment gets a physically distinct backend key and can pin its own provider/module versions, so a change cannot be applied to the wrong environment; CLI workspaces share one backend/provider/code, giving only weak isolation.
  3. Duplication (it generates the repeated backend and provider config once via remote_state/generate, so environments become inputs-only) and dependencies (it wires one stack’s outputs into another via dependency and applies the graph in order with run-all).
  4. Rung 5, when you automate apply in CI. It eliminates long-lived cloud credentials stored in the pipeline, a common source of leaked keys, by federating to short-lived keyless tokens.
  5. It is justified by organisational scale: many autonomous teams, an estate too large to govern by hand, central compliance, and the maturity to run a platform as a product. The over-climb test: if you cannot name the requirement that forces you off the rung below, you have over-engineered.
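Answer 3 can be made concrete with a minimal Terragrunt sketch. The file names, directory layout, bucket and output names are assumptions for illustration; the two sections are separate files shown together in one block.

```hcl
# --- root.hcl (hypothetical; shared by every unit) ---
# Duplication: the backend config is written once and generated into each unit.
remote_state {
  backend = "s3"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite_terragrunt"
  }
  config = {
    bucket  = "example-tfstate-bucket" # hypothetical
    key     = "${path_relative_to_include()}/terraform.tfstate"
    region  = "eu-west-1" # hypothetical
    encrypt = true
  }
}

# --- prod/app/terragrunt.hcl (hypothetical unit) ---
include "root" {
  path = find_in_parent_folders("root.hcl")
}

terraform {
  source = "../../modules/app" # hypothetical module path
}

# Dependencies: another unit's outputs are wired in as inputs.
dependency "network" {
  config_path = "../network"

  # Placeholder values used only while the network unit is still unapplied.
  mock_outputs = {
    vpc_id = "vpc-00000000"
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}

inputs = {
  environment = "prod"
  vpc_id      = dependency.network.outputs.vpc_id
}
```

Each environment directory then carries inputs only, and the state key is derived from the unit's path, so no two units can share a state file by accident.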

Exercise

The brief. You are the platform architect for “Northwind”, a mid-sized company. The situation as it actually is: there are four product teams (about 25 engineers total); the estate spans three cloud accounts (one per environment: dev, staging, prod); teams currently copy-paste a “service” Terraform module between their repos and it has drifted into four divergent versions; security has just mandated that no change reaches prod without an automated policy check (mandatory tags, no public buckets, approved regions) and a recorded approval, and that CI must stop using the long-lived cloud access key it currently stores. There is no request yet for developer self-service portals, and no team is blocked waiting on a central group. Budget is real but not unlimited. Choose a rung, name the key additions, and state the one thing you would push back on.

Write your answer before reading on.

Model answer. Read the axes. Multiple accounts (one per environment), many stacks, a copy-pasted shared module that has drifted, and a brand-new mandate for enforced policy, recorded approvals and keyless CI: this clearly crosses the governance boundary, so rungs 3 and 4 alone are not sufficient. But there is no self-service requirement and no central-bottleneck pain (nobody is blocked waiting on a platform team), so a full enterprise IaC platform (rung 6, Spacelift/TFC Enterprise + Backstage) is over-engineering: no stated requirement justifies its cost or the platform-team investment. The right rung is 5: module registry + policy-as-code + CI/CD approval gates, sitting on a rung-4 Terragrunt multi-account base (which the three-accounts-and-many-stacks situation already implies). The concrete additions: publish the “service” module to a private registry with SemVer and have all four teams pin a version (this directly fixes the four-divergent-copies problem); add a CI pipeline of fmt/validate/tflint → scan (tfsec/Checkov) → plan → policy (OPA/Conftest or Sentinel on the plan: mandatory tags, no public buckets, approved regions) → gated apply with required reviewers on staging/prod via GitHub Environments (or equivalent); and switch CI auth to OIDC federation to kill the stored access key.

The thing to push back on: if anyone proposes jumping straight to rung 6 (“let’s buy a platform and build a self-service portal while we’re at it”), challenge it — what requirement forces self-service? None is stated and no team is blocked, so a platform spends licence and platform-team budget on a problem Northwind doesn’t have yet, while slowing the actual urgent need (governance). Deliver rung 5 now; revisit a platform only when team count or central-bottleneck pain genuinely appears. Also flag that consolidating four divergent module copies into one versioned module will surface behavioural differences — plan a brief reconciliation so adopting the canonical version doesn’t silently change someone’s infrastructure.
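The "pin a version" addition from the model answer could look like this in each team's root config. It is a sketch only: the registry hostname, organisation, module name, provider segment, version and inputs are all hypothetical, since the brief names neither a registry nor a cloud.

```hcl
module "service" {
  # Private registry address format: <hostname>/<organisation>/<name>/<provider>.
  source = "app.terraform.io/northwind/service/aws" # hypothetical address

  # Accept patch and minor releases of major version 2 only, so a breaking
  # change in the shared module is never picked up without a deliberate bump.
  version = "~> 2.1"

  # Hypothetical inputs; the real ones come out of reconciling the four copies.
  environment = "prod"
  team        = "payments"
}
```

The version constraint is what keeps the four teams from silently diverging again: moving to a new major version becomes a reviewed pull request rather than a copy-paste.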

Certification mapping

This lesson is judgement-and-architecture material that sits above the HashiCorp Terraform Associate (003) objectives and feeds the cloud DevOps design exams.

| Cert | Relevance |
| --- | --- |
| HashiCorp Terraform Associate (003) | Primary. Directly tests the building blocks each rung uses: purpose of state and local vs remote backends with locking, modules and the registry, workspaces, version pinning, and the core workflow. The exam won’t ask “which rung”, but every rung is assembled from its objectives; knowing why you’d choose each makes the option-style questions easy. |
| AWS / Azure / GCP DevOps (DOP-C02 / AZ-400 / Cloud DevOps Engineer) | The upper rungs are squarely here: IaC in CI/CD with approval gates, policy-as-code, OIDC/keyless pipeline auth, artefact/module versioning, and drift management as an operational practice. |
| Terraform Cloud/Enterprise & vendor platform paths | Rung 6 maps to HCP Terraform/Enterprise (and Spacelift) capabilities: remote runs, Sentinel policy sets, private registry, RBAC/SSO, drift detection. |

For the Terraform Associate specifically, drill the rung-2/3 boundary the exam loves: local vs remote state, what locking protects against, workspaces vs separate configurations, and module sources/versioning. Those are reliable points and they are exactly the structural choices this ladder is built from.

Glossary

Next steps

You now have the spine of Terraform structural judgement: situation in, the right rung out. The natural next lesson turns judgement into artefacts — Real-World Terraform Portfolio Projects: From a First Module to a Multi-Cloud Platform — where each rung of this ladder becomes a buildable, GitHub-presentable project with a quantified resume bullet, so you can demonstrate the progression you just learned to reason about.

To deepen the surrounding material:

Vinod is a Senior Cloud Architect (22+ yrs) — available for Azure / AWS / GCP architecture, landing zones, and migrations.
