
Terraform Troubleshooting: State, Providers, Drift, Dependencies & Debugging

The difference between an engineer who is comfortable with Terraform and one who still dreads terraform apply is rarely knowledge of obscure resources. It is a method. When a plan wants to destroy a database, or apply halts with Error acquiring the state lock, or the provider throws a cryptic 403, the strong engineer does not guess and start deleting things — they run a short, fixed sequence that pins the failure to one layer (state, provider/auth, configuration, or the real cloud), form a falsifiable hypothesis, prove it, fix it with the smallest safe change, and verify. Everyone else runs apply again and hopes, or — worse — reaches for terraform state rm because a forum post said so, and turns a five-minute fix into an outage.

This lesson gives you that method and then turns it into playbooks: for each common failure you get a table of symptom → likely cause → the diagnostic command that confirms it → the fix. We cover the failures you genuinely hit in production and that come up in interviews and on the HashiCorp Terraform Associate exam — state problems (a stuck lock, drift, a corrupt or version-mismatched state file, and the careful state surgery of mv/rm/import/replace and the declarative moved/removed blocks); provider and authentication errors (version conflicts, lockfile mismatches, expired or wrong credentials, and API rate limits); plan/apply errors (cycle errors, for_each/count on unknown values, count vs for_each churn, and type mismatches); drift detection and reconciliation; and how to read what Terraform is actually doing with TF_LOG — plus a CI/CD failure playbook. Everything here runs against a free local provider, so you can reproduce every fault on your laptop without a cloud bill. (It applies equally to OpenTofu, the open-source fork — tofu mirrors these commands; differences are flagged.)

Learning objectives

By the end of this lesson you can:

Prerequisites & where this fits

You need a working Terraform 1.x (or OpenTofu) CLI and a directory you can break. You do not need a cloud account: the lab uses the hashicorp/null, hashicorp/random, and hashicorp/local providers, which create fake or local resources at no cost. You should already have the core model from earlier in this course: HCL and the Terraform fundamentals (providers, state and the core workflow), how modules are structured, and ideally the IaC core concepts (state, drift and idempotency). We re-explain each failure from first principles, so it is fine if some of that is still new to you. This is the Troubleshooting lesson of the Terraform/Terragrunt Zero-to-Hero track, the bridge between writing Terraform and operating it under pressure. The next lesson steps up to the Terraform architecting ladder, from a single module to an enterprise platform.

The method: a loop, not a guess

Almost every Terraform problem yields to the same loop. The discipline is to follow it in order rather than jumping to a destructive command you used once before.

  1. Observe — read the error to the end. Terraform’s errors are unusually good: they name the file, the line, the resource address, and very often the fix. Read the whole message (the last lines of a long apply matter most), then run terraform plan to see the intended changes before touching anything. Resist the urge to change state.
  2. Isolate the layer. Terraform failures live in one of four layers — configuration (HCL/types/expressions and the dependency graph), provider/auth (plugin version, credentials, API), state (the mapping between config and real resources), or the real world (the cloud rejected or already changed something). Naming the layer eliminates most of the search space: a cycle error is configuration; 403 is provider/auth; “state lock” is state; “already exists” is the real world drifting from state.
  3. Compare desired vs actual. Terraform is declarative: it makes real infrastructure match configuration, tracked through state. Almost every surprising plan is a gap between two of those three — config changed, the cloud changed out-of-band (drift), or state is wrong (stale, imported badly, or pointing at a deleted resource). Put them side by side: terraform plan, terraform state list, terraform state show <addr>.
  4. Form one hypothesis. State it as a falsifiable sentence: “the plan wants to replace the buckets because the bucket resource uses count and I removed item 0 from the list, so every later instance re-indexed.” A vague hunch (“state is broken”) cannot be tested; a specific claim can.
  5. Fix — the smallest, safest, reversible change. Prefer a config change (moved block, fixing a type) over state surgery; prefer -target to scope an apply over editing state; and always back up state before surgery (cp terraform.tfstate terraform.tfstate.bak, or for remote, terraform state pull > backup.tfstate). If the fix proves the hypothesis, you have also confirmed the root cause.
  6. Verify, then prevent. Run terraform plan and confirm it shows No changes (or exactly the intended diff) — that is “fixed”, not “the apply stopped erroring”. Then ask the prevention question: what guardrail (a CI plan on every PR, state locking, version pinning, a prevent_destroy lifecycle, a drift-detection schedule) stops this recurring?
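
One of the guardrails named in step 6 can be sketched in a few lines. This is a minimal illustration, assuming the hashicorp/random provider used in the lab; the resource name db_password is hypothetical and simply stands in for anything you cannot afford to lose.

```hcl
terraform {
  required_providers {
    random = {
      source = "hashicorp/random"
    }
  }
}

# Hypothetical resource standing in for something precious (a database, a key).
resource "random_password" "db_password" {
  length  = 24
  special = true

  lifecycle {
    # Any plan that would destroy or replace this resource fails with an
    # error instead of being offered for approval.
    prevent_destroy = true
  }
}
```

Note that prevent_destroy only protects the resource while its block is still in the configuration; deleting the whole block removes the guard along with it, so it complements a reviewed CI plan rather than replacing one.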

[Figure: Terraform troubleshooting decision tree]

The decision tree above encodes step 2: start from the symptom at the top, branch on “is it a state error, a provider/auth error, a plan/config error, or unexpected drift?”, and each leaf points you at the playbook below. Under pressure, walking the tree keeps you honest about which layer you are in before you touch state.

The commands that answer everything

You can diagnose the large majority of problems with a handful of commands. Know precisely what each one tells you — this is what stops the flailing.

| Command | Question it answers | Where to look |
| --- | --- | --- |
| terraform plan | What does Terraform want to do, and why? | The +/-/~/-/+ symbols, the # (because …) reasons, the “forces replacement” notes |
| terraform state list | What does state think exists? | The full list of resource addresses Terraform is tracking |
| terraform state show <addr> | What attributes does state hold for one resource? | The recorded attribute values; compare to the real cloud and to config |
| terraform validate | Is the configuration internally valid (syntax, types, references)? | Errors with file + line, before any provider/API call |
| terraform providers | Which providers/versions does this config require and use? | The provider tree + required versions |
| terraform version | Which CLI and provider versions are running? | CLI version + each provider’s resolved version |
| TF_LOG=DEBUG terraform <cmd> | What is Terraform/the provider actually doing (API calls, retries)? | The detailed log stream (set TF_LOG_PATH to capture it) |

A few high-leverage habits: run terraform plan before every apply and read the reasons, not just the counts; reach for terraform validate the instant an error mentions syntax, a type, or an unknown reference (it is fast and needs no credentials); use terraform state show to see what state believes rather than guessing; and remember that plan/apply perform an implicit refresh (they read the real world) — so a plan diff can come from config or from the cloud having drifted. When you need to see the wire, TF_LOG is the truth serum; just remember it can print secrets, so capture it to a file and delete it after.

State problems: locks, drift, corruption & surgery

State is where Terraform troubleshooting earns its reputation, because state is the one component you can damage permanently. The golden rule comes first: back up before you touch it. For local state, cp terraform.tfstate terraform.tfstate.bak; for remote, terraform state pull > backup.tfstate. With a backup, every operation below is reversible.

Think of state in three failure modes: it is locked (someone/something holds the lock), it is wrong (it disagrees with the real world — drift, or a bad import), or it is damaged/misaligned (corrupt JSON, a version mismatch, or resources at the wrong address after a refactor). The fix differs sharply by mode, so name the mode first.

| Symptom | Likely cause | Diagnostic | Fix |
| --- | --- | --- | --- |
| Error acquiring the state lock (with a Lock Info block: ID, Who, Created) | A previous run was interrupted (Ctrl-C, crashed CI job, killed agent) and never released the lock; or a genuine concurrent run | Read the Lock Info: the ID, Who, and Created time. Check whether another apply is actually running (CI, a colleague) | If no run is active, terraform force-unlock <LOCK_ID> (use the ID from the error). Never force-unlock while a real apply is in progress: you risk concurrent writes and a corrupt state |
| Lock won’t release even after force-unlock; backend-specific lock stuck | Backend lock object orphaned, e.g. a stale DynamoDB lock item (S3 backend) or an Azure blob lease still held | Inspect the backend: the DynamoDB lock table item, or the blob’s lease state | Remove the orphaned lock at the backend (delete the DynamoDB lock item / break the blob lease) only after confirming nothing is running; then re-run |
| Plan wants to create a resource that already exists in the cloud (Error: … already exists on apply) | The resource exists in the real world but not in state: created manually, or state rm’d, or never imported | terraform state list (it’s absent); confirm it exists in the cloud console/CLI | Import it: terraform import <addr> <cloud-id> (or an import {} block + apply), then plan should show No changes |
| Plan wants to destroy/recreate a resource you didn’t change | Drift (someone changed it out-of-band), or state holds a stale/wrong value, or a count/for_each re-index | terraform plan and read the # (because …)/“forces replacement”; terraform state show <addr> vs the real resource | If it is drift you want to keep, update config to match; if you want to reject it, apply to reset to config. If state is stale: terraform apply -refresh-only. If re-index: see the for_each/count playbook below |
| Plan shows resources you already deleted manually (wants to “create” them back, or errors on refresh) | State still references a resource that no longer exists in the cloud | terraform plan (or terraform refresh/-refresh-only); terraform state show <addr> then check the cloud | Let refresh reconcile: terraform apply -refresh-only marks it gone; or terraform state rm <addr> to drop the stale entry if refresh can’t (e.g. provider can’t read it) |
| Error: state snapshot was created by Terraform vX, which is newer than current vY | The state was written by a newer Terraform/OpenTofu version than the CLI you’re running | terraform version vs the version in CI / a colleague’s machine | Upgrade your CLI to match (pin it in CI and via required_version). Do not hand-edit the version field: that risks corruption |
| Error: Failed to load state: … invalid character / unexpected end of JSON (corrupt state) | The state file was truncated/garbled: interrupted write, partial upload, manual edit, merge conflict committed | Try terraform state pull (does it parse?); inspect the file; check backend versioning | Restore the last good version from your backup or backend versioning (S3 object versions, Terraform Cloud state history, blob snapshots). Re-run plan to confirm No changes |
| After a module/resource rename or move, plan wants to destroy the old and create the new | Terraform keys resources by address; renaming changes the address, so it reads as delete-then-create | terraform plan shows a - for the old address and + for the new, identical resource | Tell Terraform it’s the same object: a moved {} block (declarative, reviewable, preferred) or terraform state mv <old> <new>; then plan shows No changes |
| Resource in the wrong module / needs to move between state files | Refactor split a config, or you moved a resource into/out of a module | terraform state list shows it at the old address | Same as above: moved {} block within a config; across state files, terraform state mv -state-out=… (or pull/push) with backups |
| You want to stop managing a resource without destroying it | Hand-off to another team/config, or it should now be unmanaged | Decide: forget (keep the real resource) vs destroy | removed {} block with lifecycle { destroy = false } (Terraform 1.7+) to forget it declaratively, or terraform state rm <addr>. (Plain state rm only forgets; it never deletes the real resource) |

State surgery: the four scalpels (and the two declarative blocks)

“State surgery” sounds scary; done deliberately it is routine. There are four imperative scalpels and two newer declarative blocks. Know exactly what each does to state versus the real world — this is the distinction that prevents accidents.

| Operation | What it does | Touches real infra? | Use it when | The gotcha |
| --- | --- | --- | --- | --- |
| terraform import <addr> <id> (or import {} block) | Records an existing cloud resource into state at <addr> | No (read-only on the cloud) | A resource exists but Terraform doesn’t track it | Import only writes state: you must write matching config too, then plan until it’s No changes. import {} blocks (1.5+) are reviewable and can even generate config |
| terraform state mv <src> <dst> | Renames/moves a resource’s entry within (or across) state | No | A rename/refactor changed an address and you don’t want destroy+create | Prefer a moved {} block in config: it’s reviewed in the PR and survives for collaborators; state mv is an unreviewed, out-of-band mutation |
| terraform state rm <addr> | Forgets a resource: removes it from state only | No (resource keeps running) | Hand a resource to another config, or drop a stale/duplicate entry | It does not destroy the resource; the real thing keeps costing money and is now unmanaged. To delete it, remove it from config and apply (or use terraform destroy) instead |
| terraform apply -replace=<addr> (replaces terraform taint) | Forces destroy-and-recreate of one resource as part of that apply | Yes (destroys + recreates) | A resource is in a bad runtime state and must be rebuilt | It will delete and recreate, so expect downtime/new IDs. taint/untaint are the older, two-step equivalents |
| moved {} block (config) | Declares old→new address equivalence so refactors don’t destroy/create | No | Renaming resources/modules, or moving into/out of modules | Keep it in config (you can prune it after everyone has applied); it’s the reviewable alternative to state mv |
| removed {} block (config, 1.7+) | Declares a resource removed from config, with destroy = false to forget (or true to destroy) | Optional (you choose) | Stop managing a resource without deleting it, in a reviewable way | The declarative, PR-reviewed alternative to state rm; pair with deleting the resource block |

Two reflexes to burn in. First, state rm is “forget”, not “delete” — the real resource keeps running and now nobody manages it; reach for it only to hand off or drop a stale entry, never to “clean up” something you actually want gone (use destroy). Second, prefer the declarative blocks (moved, removed, import) over the imperative state subcommands wherever the version supports them: they go through code review, run in CI, and apply identically for every teammate — whereas terraform state mv/rm is a silent local mutation that the next person’s plan won’t understand.
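
The three declarative blocks can be sketched side by side. This is an illustrative fragment rather than a complete configuration: all resource names and the bucket name are hypothetical, provider requirements are omitted (Terraform infers hashicorp/* sources for these local names), and the import ID format is an assumption that depends on the resource type.

```hcl
# 1. Rename without destroy/create: the resource block used to be called
#    null_resource.web and is now null_resource.frontend.
resource "null_resource" "frontend" {}

moved {
  from = null_resource.web
  to   = null_resource.frontend
}

# 2. Stop managing a resource but keep the real object (Terraform 1.7+).
#    The matching resource block must already be deleted from config.
removed {
  from = random_pet.legacy

  lifecycle {
    destroy = false # forget only; true would destroy the real object
  }
}

# 3. Adopt an existing object into state (Terraform 1.5+).
#    "example-logs-bucket" is a placeholder for the real bucket name.
import {
  to = aws_s3_bucket.logs
  id = "example-logs-bucket"
}

resource "aws_s3_bucket" "logs" {
  bucket = "example-logs-bucket"
}
```

Because these are ordinary configuration, the rename, the hand-off, and the import all show up in the plan that reviewers see on the pull request, which is the property the imperative state subcommands lack.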

Providers & authentication: versions, lockfiles & credentials

Provider problems split cleanly into versioning (which plugin, pinned how, recorded in the lockfile) and authentication (can the provider talk to the cloud, and as whom). The fix is completely different, so read the error to decide which.

| Symptom | Likely cause | Diagnostic | Fix |
| --- | --- | --- | --- |
| Error: Failed to query available provider packages … no available releases match the given constraints | A required_providers version constraint can’t be satisfied (typo, impossible range, yanked version) | Read the constraint in terraform { required_providers { … } }; terraform providers | Fix the constraint (e.g. ~> 5.0); run terraform init |
| Error: provider … released a new version that is no longer compatible / unexpected major upgrade churn | No version pin, so init pulled a new major with breaking changes | terraform version; check required_providers for a missing/loose constraint | Pin with a sensible constraint (~> 5.40 to accept only later 5.x minor/patch releases, or >= 5.0, < 6.0 to cap the major); commit .terraform.lock.hcl |
| Error: the cached package … does not match any of the checksums recorded in the dependency lock file | The committed .terraform.lock.hcl lacks the checksum for your platform/arch, or the lock is stale | Inspect .terraform.lock.hcl; note your OS/arch vs CI’s | terraform providers lock -platform=linux_amd64 -platform=darwin_arm64 … to record all needed platforms; commit it. To intentionally bump, terraform init -upgrade |
| Error: Inconsistent dependency lock file … provider … is not in the lock file (often in CI) | CI ran init against a lockfile that doesn’t include a required provider/platform, or someone added a provider without re-locking | CI log; compare local vs committed .terraform.lock.hcl | Re-run terraform init -upgrade (or providers lock for all CI platforms) locally and commit the updated lockfile |
| Error: Could not load plugin / Failed to install provider | Network/proxy/mirror blocked the registry download, or a corrupt plugin cache | TF_LOG=DEBUG terraform init (shows the download URL/HTTP error) | Fix proxy/registry access or configure a provider_installation/network mirror; clear .terraform/providers and re-init |
| Error: building AzureRM Client / NoCredentialProviders / Unable to locate credentials / google: could not find default credentials | The provider has no credentials: env vars unset, no CLI login, wrong profile, missing OIDC | The error names the provider; check az account show / aws sts get-caller-identity / gcloud auth list | Provide credentials the provider expects (env vars, az login/aws sso login, a profile, or OIDC in CI); confirm with the cloud’s “who am I” command |
| Error: … AccessDenied / 403 Forbidden / AuthorizationFailed despite being logged in | Authenticated but the identity lacks IAM permission for that action/resource | aws sts get-caller-identity (who?) then check the IAM policy / Azure role assignment / GCP role for the action in the error | Grant the missing permission to that principal (least privilege); re-run. (This is authorisation, not authentication) |
| Error: … expired token / InvalidClientTokenId / 401 mid-apply | Short-lived credentials (SSO/STS/OIDC) expired during a long apply | Token TTL; how long the apply ran | Refresh the session (aws sso login, re-auth) and re-run; for long applies, use credentials with adequate TTL or a CI identity that auto-refreshes |
| Error: … Throttling / Rate exceeded / TooManyRequests / 429 | The cloud API rate-limited you: large apply, high -parallelism, or a noisy account | The error is explicit; correlate with a big plan or parallel runs | Lower terraform apply -parallelism=N (default 10); retry (many providers back off automatically); split the apply with -target; request a quota increase |
| Error: Inconsistent provider configuration / missing required provider configuration in module | A child module needs an explicit provider passed (e.g. aliased/multi-region) and didn’t get one | terraform validate; check the module’s required_providers and providers = { … } wiring | Pass providers explicitly to the module (providers = { aws = aws.useast1 }); declare aliases at the root |

The two reflexes here. First, commit .terraform.lock.hcl and treat it like package-lock.json: it pins exact provider versions and checksums so every machine and CI runner resolves identically — most “works on my laptop, fails in CI” provider errors are a missing platform entry in the lockfile, fixed with terraform providers lock -platform=…. Second, separate authentication from authorisation: a credentials/NoCredentials/401 error means you aren’t logged in as anyone the provider can use; a 403/AccessDenied/AuthorizationFailed means you are logged in, but that identity lacks the IAM permission — confirm “who am I” with the cloud CLI before you change anything.
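
The versioning half of this section comes down to a few lines of configuration. The sketch below assumes the lab's local providers; the specific version numbers are illustrative, not recommendations.

```hcl
terraform {
  # Keep every laptop and CI runner on a compatible CLI.
  required_version = ">= 1.7.0, < 2.0.0"

  required_providers {
    random = {
      source = "hashicorp/random"
      # Pessimistic constraint: accepts 3.6 and later 3.x releases, never 4.0.
      version = "~> 3.6"
    }
    null = {
      source = "hashicorp/null"
      # The same idea written as an explicit range that caps the major version.
      version = ">= 3.2.0, < 4.0.0"
    }
  }
}
```

The constraints only describe the acceptable range; the exact versions and checksums that terraform init selects are what get recorded in .terraform.lock.hcl, which is why the lockfile must be committed alongside this block.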

Plan & apply errors: cycles, for_each/count & types

These are configuration-layer failures — Terraform refuses or mis-plans because of the HCL, the type system, or the dependency graph. They are also the richest source of interview questions, because they test whether you understand how Terraform builds its graph and resolves values.

| Symptom | Likely cause | Diagnostic | Fix |
| --- | --- | --- | --- |
| Error: Cycle: a → b → a (dependency cycle) | Resources/modules reference each other in a loop (often via depends_on, or two security groups referencing each other) | terraform graph (or TF_LOG=DEBUG) to see the edges; read the cycle the error prints | Break the loop: remove the unnecessary depends_on; for mutual SG references use separate rule resources (aws_security_group_rule, or aws_vpc_security_group_ingress_rule/aws_vpc_security_group_egress_rule) instead of inline rules; or introduce an intermediate resource |
| Error: Invalid for_each argument … the "for_each" value depends on resource attributes that cannot be determined until apply | for_each/count is keyed on a value that is unknown at plan time (an attribute of a not-yet-created resource) | Look at what feeds for_each/count; if it references another resource’s computed attribute, that’s it | Key on values known at plan time (input vars, locals, names), not computed IDs. If unavoidable, -target the dependency first, or restructure so keys are static |
| Error: Invalid count argument … value depends on resource attributes that cannot be determined until apply | Same root cause as above, but for count (e.g. count = length(…) over a list attribute that the provider only computes at apply) | Inspect the count expression for a computed dependency | Base count on known input (a variable/local), not on the length of another resource’s not-yet-known output; or apply the dependency first |
| Plan wants to destroy and recreate many resources after you added/removed one list item | The collection uses count (index-based), so removing item N re-indexes everything after it ([2] becomes [1]…) | terraform plan shows a cascade of -/+ on …[n] addresses you didn’t intend to change | Switch the collection to for_each keyed by a stable string (map/set), so each instance has a stable identity and only the changed key is affected. Migrate existing instances with moved {}/state mv |
| Error: The given "for_each" argument value is unsuitable: … must be a map, or set of strings | for_each was given a list (lists aren’t allowed because of duplicates/ordering) or a value of the wrong type | Check the expression’s type (terraform console: type(var.x)) | Convert to a set (toset(var.list)) or a map; for_each requires map or set-of-strings, and keys must be known at plan time |
| Error: Duplicate object key / two different items produced the key "…" in for_each | The expression building the for_each map produced duplicate keys | Inspect the key expression; terraform console to evaluate it | Make keys unique (include a discriminator); if using toset on non-unique values, derive a unique key map instead |
| Error: Invalid index / … is not a valid index for … (accessing [count.index] or a key that doesn’t exist) | Indexing into a count resource that’s empty, or a for_each key that isn’t present | terraform console to inspect the collection; check whether count/for_each produced that index/key | Guard with length()/try()/lookup(); reference for_each instances by key (aws_x.this["name"]), count ones by index (aws_x.this[0]) |
| Error: Inconsistent conditional result types / Invalid value for … : … must be a … (type mismatch) | An expression yields mismatched types (both branches of ? : must match), or a value doesn’t match the variable’s declared type | terraform validate; terraform console to check type(...) of each side | Align the types (cast with tostring/tonumber/tolist; make both ternary branches the same type); fix the variable’s type constraint |
| Error: Unsupported attribute / This object does not have an attribute named "…" | Referencing an attribute that doesn’t exist on that resource/output (typo, wrong provider version, wrong resource) | terraform providers schema -json or the registry docs for the real attribute names | Use the correct attribute; if it disappeared, check the provider version’s changelog (a major upgrade may have renamed/removed it) |
| Error: Reference to undeclared resource / input variable / Module not installed | A name is referenced before it’s declared, or init wasn’t run after adding a module | terraform validate; check the name exists; was terraform init run? | Declare the variable/resource; run terraform init after adding modules/providers |
| Error: Provider produced inconsistent final plan / Provider produced inconsistent result after apply | A provider bug (planned value ≠ applied value), usually a known issue in a specific provider version | terraform version; search the provider’s GitHub issues for the resource + message | Upgrade/downgrade the provider to a fixed version; sometimes a second apply reconciles; report upstream if novel |

The single most valuable idea in this table is count vs for_each identity. count gives each instance a positional address ([0], [1], [2]); remove the middle one and everything after it shifts, so Terraform plans to change or recreate every shifted instance and destroy the last index. for_each gives each instance a named address keyed by a string (["alice"], ["bob"]); add or remove a key and only that instance is created or destroyed, while every other instance is untouched. Rule of thumb: use count only for “N identical copies” or a simple on/off toggle (count = var.enabled ? 1 : 0); use for_each whenever the instances are distinct things you’ll add to or remove from over time. The second idea is “known at plan time”: for_each/count keys, and anything that determines how many resources exist, must be computable during plan. Key them on inputs and locals, never on another resource’s computed attributes (IDs, ARNs) that only exist after apply.
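
The identity difference is easiest to see on a free local resource. The sketch below assumes the hashicorp/null provider from the lab; the variable users and the resource name user are hypothetical.

```hcl
variable "users" {
  type    = list(string)
  default = ["alice", "bob", "carol"]
}

# Before (positional identity): deleting "alice" from the list shifts
# bob to [0] and carol to [1], so every later instance is affected.
# resource "null_resource" "user" {
#   count    = length(var.users)
#   triggers = { name = var.users[count.index] }
# }

# After (named identity): each instance is addressed by its own key.
resource "null_resource" "user" {
  for_each = toset(var.users) # for_each needs a map or set of strings, not a list
  triggers = { name = each.key }
}

# One-time migration so existing instances are re-addressed, not recreated.
moved {
  from = null_resource.user[0]
  to   = null_resource.user["alice"]
}

moved {
  from = null_resource.user[1]
  to   = null_resource.user["bob"]
}

moved {
  from = null_resource.user[2]
  to   = null_resource.user["carol"]
}
```

With the for_each form in place, removing "bob" from var.users should plan a single destroy of null_resource.user["bob"] and nothing else. The keys come from an input variable, so they are known at plan time.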

Drift: detection & reconciliation

Drift is when the real infrastructure no longer matches what Terraform recorded in state — almost always because something changed it out-of-band: a console click, an autoscaler, another tool, or a cloud-side default. Terraform discovers drift during the refresh that plan/apply perform automatically (it reads each resource’s real state and compares). The skill is deciding, per drift, whether to accept reality into your config or re-impose your config onto reality — and never to blindly apply over a change you don’t understand.

| Symptom | Likely cause | Diagnostic | Fix |
| --- | --- | --- | --- |
| plan shows changes you didn’t make (an attribute reverting, a tag missing) | Out-of-band change: someone edited the resource in the console/another tool | terraform plan and read the ~ diff; terraform state show <addr> vs the live resource | Decide intent. Keep the change → update config to match it. Reject the change → apply to reset it to config |
| You want to see drift without planning new config changes | You need to know “what changed in reality?” separate from “what does my new code want?” | terraform plan -refresh-only (shows only real-world vs state differences) | Review; then terraform apply -refresh-only to update state to match reality (without changing any infrastructure) |
| State is stale: it holds old values though the cloud is correct | A change happened out-of-band that you want, and state hasn’t caught up | terraform plan -refresh-only (shows what a state sync would record) | Run terraform apply -refresh-only to record reality into state; then a normal plan is clean |
| A resource was deleted out-of-band; plan now wants to recreate it | The real resource is gone; state still references it | terraform plan/-refresh-only; check the cloud | If you want it back, apply recreates it; if it should stay gone, remove it from config (or state rm) so plan is clean |
| Drift reappears after every apply (the same diff returns) | A controller/policy keeps re-changing the resource, or your config fights a cloud-managed default | Identify what writes the attribute (autoscaler, Azure Policy, a sidecar process) | Stop fighting: add lifecycle { ignore_changes = [that_attribute] } so Terraform stops managing it, or remove it from config and let the controller own it |
| Constant noisy diff on an attribute Terraform shouldn’t own (e.g. autoscaled desired_count) | Terraform and another system both manage the same field | terraform plan shows the same ~ every time | lifecycle { ignore_changes = [desired_count] } to declare that field externally owned |

Two reflexes. First, -refresh-only is your safe lens on reality: terraform plan -refresh-only answers “what changed out-of-band?” with zero risk, and terraform apply -refresh-only syncs state to the real world without altering infrastructure. Use it to absorb intentional manual changes before they collide with your next deploy. (Note: the -refresh-only mode arrived in Terraform 0.15.4; a plain plan refreshes in memory only and does not write the refreshed state, and -refresh=false skips refresh entirely for speed when you’re certain nothing drifted.) Second, decide intent before you apply: every drift is a fork between accepting reality (update config or ignore_changes) and re-imposing config (apply), and an apply you run without understanding the diff is how Terraform reverts a colleague’s emergency hotfix at 2 a.m. The deeper treatment, including a scheduled detection pipeline, is in detecting and reconciling Terraform drift.
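
The “accept reality” branch for a field that another system owns can be sketched as follows. This assumes the AWS provider and an ECS service whose desired_count is managed by an autoscaler; the service name and both variables are hypothetical.

```hcl
variable "cluster_arn" { type = string }
variable "task_definition_arn" { type = string }

resource "aws_ecs_service" "api" {
  name            = "api"
  cluster         = var.cluster_arn
  task_definition = var.task_definition_arn
  desired_count   = 2 # initial value only; the autoscaler owns it afterwards

  lifecycle {
    # Terraform still sets desired_count on create, but later differences
    # on this one attribute no longer produce a diff.
    ignore_changes = [desired_count]
  }
}
```

Keep the ignore list as narrow as the external owner's actual scope: every attribute listed there is one where genuine, unwanted drift will no longer be reported.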

TF_LOG: seeing what Terraform actually does

When the error message isn’t enough — a provider failing opaquely, a hang, a mysterious API rejection, a CI-only failure — TF_LOG turns on Terraform’s internal logging and shows you the real work: graph building, provider plugin handshakes, and every API request/response. It is the single best escalation when “read the error” runs out.

Set the level via the TF_LOG environment variable, and capture it with TF_LOG_PATH:

| Setting | What it gives you | When to use |
| --- | --- | --- |
| TF_LOG=TRACE | Everything: the most verbose (graph walk, every plugin RPC). Firehose | Deep/last-resort debugging, bug reports to HashiCorp/provider maintainers |
| TF_LOG=DEBUG | Detailed flow incl. provider API requests/responses | The usual choice for “why is the provider doing this?” / auth / rate-limit issues |
| TF_LOG=INFO / WARN / ERROR | Progressively less: high-level info, warnings, or errors only | Lighter-touch insight without the firehose |
| TF_LOG=JSON | Machine-readable structured logs | Parsing logs in tooling/CI dashboards |
| TF_LOG_CORE / TF_LOG_PROVIDER | Split logging: core (Terraform itself) vs provider only | Isolate whether the issue is in core or in the plugin |
| TF_LOG_PATH=./tf.log | Writes the log to a file (works with any level above) | Always, when capturing: keeps the terminal usable and gives you a file to grep/attach |

A practical recipe: TF_LOG=DEBUG TF_LOG_PATH=./tf-debug.log terraform apply, then grep the file for the failing resource, the HTTP status (403, 429, 500), or the provider’s request body. For a provider that hangs, TF_LOG=TRACE shows whether it’s stuck waiting on an API call. To pin a problem to core-vs-plugin, use TF_LOG_CORE=ERROR TF_LOG_PROVIDER=DEBUG so the provider’s chatter isn’t drowned out by Terraform’s graph walk.

Security warning: TF_LOG=DEBUG/TRACE logs can contain secrets — tokens, credentials, request bodies with sensitive values. Always write to a file you control (TF_LOG_PATH), never paste raw debug logs into a public issue or chat, and delete the log when done. In CI, do not enable TF_LOG on shared pipelines by default, and scrub artifacts.

CI/CD failure playbook

The same failures look different in a pipeline, where there’s no human at the keyboard, credentials come from OIDC/secrets, and the runner is ephemeral. Most CI-only Terraform failures are one of four things: the lockfile doesn’t include the runner’s platform, auth isn’t wired (OIDC/role), state locking collides between concurrent jobs, or the working directory/init is misconfigured.

Symptom (in CI) | Likely cause | Diagnostic | Fix
Error: Inconsistent dependency lock file / checksum mismatch (green locally, red in CI) | .terraform.lock.hcl lacks checksums for the CI runner’s platform (e.g. linux_amd64) | Compare the platforms the lockfile was locked for with the runner OS/arch | terraform providers lock -platform=linux_amd64 -platform=darwin_arm64 … locally; commit the updated lockfile
NoCredentials / 401 / could not find default credentials, only in CI | OIDC/role not assumed, secret not injected, wrong region/profile env | Pipeline logs for the auth step; aws sts get-caller-identity as a debug step | Wire OIDC (GitHub id-token: write + role-to-assume) or inject credentials; confirm “who am I” before plan
Error acquiring the state lock from a pipeline | A previous CI job was cancelled and left the lock; or two jobs run concurrently on the same state | Check for a stuck/cancelled prior run; the Lock Info shows it | Serialise jobs on that state (concurrency group); force-unlock <ID> the orphaned lock once you’ve confirmed nothing runs
Plan/apply hangs or times out in CI | Waiting on input (no -input=false), an API hang, or interactive approval | Job log shows it stalled at a prompt; TF_LOG=DEBUG | Always run terraform … -input=false; use -auto-approve only behind a real approval gate; set step timeouts
Error: Backend initialization required / Module not installed | terraform init not run (or run in the wrong dir), or backend config missing | The init step / working-directory setting | Run terraform init in the correct directory with backend config; cache .terraform between steps carefully (not the lockfile-sensitive bits)
Plan/apply produces a different result in CI than it does locally | A different provider/Terraform version between laptop and runner | terraform version in both; check required_version and the lockfile | Pin required_version, pin providers, commit the lockfile, and use the same CLI version in CI (e.g. a setup-terraform action with a fixed version)
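To know which -platform value a runner needs, you can derive it from uname. A minimal sketch, assuming the os_arch naming seen in the examples above (linux_amd64, darwin_arm64); tf_platform is a hypothetical helper and maps only the common cases.

```shell
# tf_platform [OS] [ARCH]: print the provider platform string (os_arch) for
# the given uname values, defaulting to the current machine.
# Hypothetical helper; only common OS and architecture names are mapped.
tf_platform() {
  os=${1:-$(uname -s)}
  arch=${2:-$(uname -m)}
  case "$os" in
    Linux)  os=linux ;;
    Darwin) os=darwin ;;
    *) echo "unmapped OS: $os" >&2; return 1 ;;
  esac
  case "$arch" in
    x86_64|amd64)  arch=amd64 ;;
    aarch64|arm64) arch=arm64 ;;
    *) echo "unmapped architecture: $arch" >&2; return 1 ;;
  esac
  printf '%s_%s\n' "$os" "$arch"
}
```

Run tf_platform as a debug step on the runner, then pass the printed value to terraform providers lock -platform=... on your laptop and commit the result.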

The meta-lesson: a CI pipeline should make failures reproducible. Pin the Terraform version and providers, commit the lockfile with every platform CI uses, run with -input=false, separate plan (on PRs, with read-only credentials) from apply (gated, with write credentials), and serialise applies per state. The dedicated DevOps lesson on troubleshooting pipelines, builds and runners generalises this beyond Terraform.
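The plan/apply split can be sketched as two small shell functions. This is an illustration under assumptions, not a complete pipeline: ci_plan, ci_apply and the tfplan file name are hypothetical, credentials and the approval gate are left to your CI system, and the TERRAFORM variable exists only so the binary can be swapped (for example to tofu).

```shell
# Binary to invoke; override with TERRAFORM=tofu (or a stub) if needed.
TERRAFORM="${TERRAFORM:-terraform}"

# PR stage: never prompt, refuse to modify the committed lockfile,
# and save the plan so apply runs exactly what was reviewed.
ci_plan() {
  "$TERRAFORM" init -input=false -lockfile=readonly &&
    "$TERRAFORM" plan -input=false -out=tfplan
}

# Gated stage: apply the saved plan file; a saved plan is applied
# without an interactive approval prompt.
ci_apply() {
  "$TERRAFORM" apply -input=false tfplan
}
```

If the saved plan is passed between jobs as an artifact, treat it like state: it can contain sensitive values.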

Hands-on lab: break it, diagnose it, fix it

You’ll plant several classic faults using free, local providers (null, random, local) — no cloud account, no charges — then walk each through the loop.

1. Set up a throwaway config. Create a directory and a main.tf:

mkdir tf-ts-lab && cd tf-ts-lab
cat > main.tf <<'EOF'
terraform {
  required_version = ">= 1.5.0"
  required_providers {
    random = { source = "hashicorp/random", version = "~> 3.6" }
    local  = { source = "hashicorp/local",  version = "~> 2.5" }
  }
}

variable "names" {
  type    = list(string)
  default = ["alpha", "beta", "gamma"]
}

# Deliberately using count — we'll feel the re-index pain, then fix it.
resource "random_pet" "server" {
  count = length(var.names)
}
EOF
terraform init        # downloads providers, writes .terraform.lock.hcl
terraform apply -auto-approve
terraform state list  # random_pet.server[0..2]

2. Fault A — count re-index churn. Remove the first name and plan:

sed -i.bak 's/\["alpha", "beta", "gamma"\]/["beta", "gamma"]/' main.tf
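terraform plan        # Note: it destroys [2], the slot that held gamma, not alpha's [0]: identity follows position

You removed the first item, yet the plan destroys the last instance: count tracks position, not identity, so beta and gamma each slid down one index. (With attributes derived from var.names[count.index], the shifted instances would also show as changed or replaced.) Fix by switching to for_each keyed by the name (stable identity), and migrate the surviving instances with moved/state mv so they are not destroyed: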

cat > main.tf <<'EOF'
terraform {
  required_version = ">= 1.5.0"
  required_providers {
    random = { source = "hashicorp/random", version = "~> 3.6" }
    local  = { source = "hashicorp/local",  version = "~> 2.5" }
  }
}

variable "names" {
  type    = set(string)
  default = ["beta", "gamma"]
}

resource "random_pet" "server" {
  for_each = var.names
}
EOF
# Migrate the surviving instances from count-index to for_each-key in state:
terraform state mv 'random_pet.server[1]' 'random_pet.server["beta"]'
terraform state mv 'random_pet.server[2]' 'random_pet.server["gamma"]'
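terraform plan        # Now: only random_pet.server[0] (alpha) is planned for destroy; beta and gamma are untouched
terraform apply -auto-approve   # removes alpha's pet; a further plan then shows No changes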

The lesson: for_each gives stable, named identity; the migration is pure state surgery with no infra impact. (A for_each type error is one console call away: run terraform console, then type(["a","b"]) shows the literal is a tuple, a sequence type like a list, which is why for_each rejects it until you toset(...) it.)
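The same migration can be declared in configuration instead of run as state commands, which keeps it reviewable in a pull request. A minimal sketch, assuming the lab's addresses: the helper name add_moved_blocks is hypothetical, and it simply appends two moved blocks to the file you name (main.tf by default). Use it instead of the two state mv commands, not in addition to them.

```shell
# add_moved_blocks [FILE]: append moved blocks that map the surviving
# count-indexed instances to their for_each keys. Hypothetical helper.
add_moved_blocks() {
  cat >> "${1:-main.tf}" <<'EOF'

moved {
  from = random_pet.server[1]
  to   = random_pet.server["beta"]
}

moved {
  from = random_pet.server[2]
  to   = random_pet.server["gamma"]
}
EOF
}
```

With these blocks in place you would expect terraform plan to report the two instances as moved and to propose destroying only random_pet.server[0], the instance that belonged to alpha.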

3. Fault B — drift, via refresh-only. Add a local_file, apply, then change it on disk out-of-band:

cat >> main.tf <<'EOF'

resource "local_file" "note" {
  filename = "${path.module}/note.txt"
  content  = "managed by terraform"
}
EOF
terraform apply -auto-approve
echo "edited out of band" > note.txt          # simulate an out-of-band change
terraform plan -refresh-only                   # shows the drift: state vs reality
terraform plan                                 # a normal plan wants to reset content to config

Decide intent: to keep the manual edit, update content in config to match what is on disk (echo appended a newline, so the value is "edited out of band\n") and apply; to reject it, terraform apply -auto-approve resets the file to the managed content. Run your choice and confirm a subsequent terraform plan shows No changes.

4. Fault C — TF_LOG to see the work. Capture a debug log of a plan and grep it:

TF_LOG=DEBUG TF_LOG_PATH=./tf-debug.log terraform plan
grep -i "provider" tf-debug.log | head        # provider plugin handshakes
rm -f tf-debug.log                              # delete — debug logs can hold secrets

For the stuck-lock drill there’s nothing to run with local state — the muscle memory is the point: when an apply prints Error acquiring the state lock with a Lock Info block, read the ID, confirm nothing is actually running, then terraform force-unlock <ID> — never force-unlock blindly.
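Reading the ID out of the error can be scripted if you saved the failed run's output. A minimal sketch, assuming the Lock Info block prints the identifier after an ID: label as described above; lock_id_from is a hypothetical helper, and it deliberately does not run force-unlock for you.

```shell
# lock_id_from FILE: print the value that follows the first "ID:" label found
# in saved command output. Scans every field so a leading border character on
# the line is tolerated. Hypothetical helper; it only reads text.
lock_id_from() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "ID:") { print $(i + 1); exit } }' "$1"
}
```

Usage: save the output with terraform apply 2>&1 | tee apply.out, run lock_id_from apply.out, and pass the result to terraform force-unlock only after confirming nothing is running.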

Validation. A clean run shows: the count change re-indexing in the plan (then, after the for_each migration, a plan that removes only alpha’s instance); -refresh-only surfacing the file drift and your chosen reconciliation yielding No changes; and a TF_LOG file you read and then deleted.
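The "No changes" checks can also be made machine-readable with plan's -detailed-exitcode flag, which exits 0 for an empty plan, 2 when changes are pending, and 1 on error. A minimal sketch; describe_plan_exit and check_clean are hypothetical helper names, and TERRAFORM is there only so the binary can be swapped.

```shell
# Binary to invoke; override with TERRAFORM=tofu (or a stub) if needed.
TERRAFORM="${TERRAFORM:-terraform}"

# describe_plan_exit CODE: translate a -detailed-exitcode status into words.
describe_plan_exit() {
  case "$1" in
    0) echo "clean: no changes" ;;
    2) echo "pending: plan contains changes" ;;
    *) echo "error: plan failed" ;;
  esac
}

# check_clean: run a plan quietly and report which of the three states it hit.
check_clean() {
  "$TERRAFORM" plan -input=false -detailed-exitcode >/dev/null 2>&1
  describe_plan_exit "$?"
}
```

Run check_clean after each fix in the lab; anything other than "clean: no changes" means the reconciliation is not finished.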

Cleanup (so nothing is left on disk):

terraform destroy -auto-approve
cd .. && rm -rf tf-ts-lab

Cost note: free / local. The null, random, and local providers create nothing in any cloud — the entire lab runs on your laptop with no account and no charges.

Common mistakes & troubleshooting

A meta-table — the mistakes engineers make while troubleshooting Terraform, which keep them stuck or cause outages:

Mistake | Why it bites | Do this instead
Running terraform state rm to “clean up” a resource you want gone | It only forgets; the real resource keeps running, costing money, now unmanaged | Use terraform destroy / remove from config to actually delete; state rm is for hand-off only
force-unlock while an apply might be running | Concurrent writes can corrupt state | Confirm nothing is running (CI, colleagues) first; only then unlock by the printed ID
Editing terraform.tfstate by hand | One bad character corrupts it; you lose the resource mapping | Use state mv/rm/import or moved/removed/import blocks; never hand-edit
Not committing .terraform.lock.hcl | “Works locally, breaks in CI”; surprise provider upgrades | Commit it; record all CI platforms with providers lock -platform=…
Using count for a list of distinct things | Removing/adding an item re-indexes and destroys/recreates others | Use for_each keyed by a stable string; reserve count for N-copies / on-off
Keying for_each/count on a computed attribute | “value … cannot be determined until apply” | Key on inputs/locals known at plan time; -target the dependency if truly needed
apply-ing over a diff you don’t understand | You may revert someone’s intentional out-of-band change (e.g. a hotfix) | plan -refresh-only first; decide accept reality vs re-impose config
Pasting TF_LOG=DEBUG output into a ticket/chat | Debug logs can contain secrets | Capture to TF_LOG_PATH, scrub, and delete; share only the relevant scrubbed lines
Re-running apply to “fix” an error without diagnosing | Wastes time, can compound damage (half-applied state) | Read the error, name the layer, form one hypothesis, then act
Renaming a resource/module and accepting the destroy/create plan | You delete and recreate real infra for a pure rename | Add a moved {} block (or state mv) so it’s recognised as the same object

Best practices

Security notes

Troubleshooting Terraform repeatedly touches your most sensitive assets — credentials and state — so do it safely. State files contain secrets in plaintext (passwords, keys, generated values are stored as-is in JSON); never commit state to git, restrict access to the backend, enable encryption at rest, and treat any state pull/backup file as a secret you must delete. TF_LOG=DEBUG/TRACE can leak those same secrets into logs — always write to a controlled TF_LOG_PATH, scrub before sharing, and never enable verbose logging by default in shared CI. When an error is 403/AccessDenied, resist granting broad permissions (or *:*/Owner) just to make it pass — confirm the principal with the cloud’s “who am I” command and grant the minimum missing action; over-broad CI roles are a real escalation path. Be deliberate with force-unlock (a wrong call mid-apply corrupts shared state) and with -replace/destroy (they delete real infrastructure — gate them behind review and prevent_destroy on critical resources). Finally, use OIDC/short-lived credentials in CI rather than long-lived keys, so a leaked log or artifact isn’t a standing breach. These themes are developed in secrets in IaC — Vault and dynamic credentials.

Interview & exam questions

  1. terraform apply fails with Error acquiring the state lock. Walk me through your response. Read the Lock Info (ID, Who, Created). Determine whether an apply is actually running (CI, a colleague). If a real run is in progress, wait — never break a live lock. If it’s orphaned (a cancelled/crashed run), terraform force-unlock <LOCK_ID> with the ID from the error, then retry. Prevent it by serialising applies per state (CI concurrency groups).
  2. What’s the difference between terraform state rm, terraform destroy, and terraform import? state rm forgets a resource (removes it from state; the real resource keeps running — now unmanaged). destroy deletes the real resource. import adopts an existing real resource into state (and you must then write matching config). The classic trap: using state rm expecting deletion — it doesn’t touch the cloud.
  3. When do you use count vs for_each, and why does it matter for drift/churn? count is positional ([0], [1]) — removing a middle item re-indexes everyone after it, causing destroy/recreate of unchanged resources. for_each is named by a stable key (["x"]) — add/remove a key and only that instance moves. Use count for N identical copies / on-off (var.enabled ? 1 : 0); use for_each for distinct things you’ll add to or remove from.
  4. Explain the error “for_each value depends on resource attributes that cannot be determined until apply.” How do you fix it? for_each/count keys must be known at plan time; you’ve keyed them on a computed attribute (an ID/ARN of a not-yet-created resource). Fix by keying on inputs/locals/names that are static at plan time; if genuinely unavoidable, -target the dependency to create it first, or restructure so the keys don’t depend on apply-time values.
  5. You get a dependency Cycle error between two security groups. What’s happening and how do you break it? Two resources reference each other (often inline rules referencing the other group), so the graph has a loop. Break it by extracting the rules into separate rule resources (aws_security_group_rule / the newer aws_vpc_security_group_ingress_rule) instead of inline blocks, removing the mutual reference; or drop an unnecessary depends_on.
  6. A plan wants to destroy and recreate a resource you only renamed in code. Why, and what’s the right fix? Terraform keys resources by address; a rename is a new address, read as delete-old + create-new. Tell Terraform it’s the same object with a moved {} block (preferred — reviewed in the PR) or terraform state mv old new; then plan shows No changes.
  7. What does terraform plan -refresh-only do, and when do you use it? It compares state to the real world only (it does not consider config changes), showing pure drift. You use it to answer “what changed out-of-band?” safely; terraform apply -refresh-only then updates state to match reality without changing any infrastructure — the safe way to absorb an intentional manual change before your next deploy.
  8. Your config applies fine locally but fails in CI with an inconsistent-lock-file / checksum error. Cause and fix? The committed .terraform.lock.hcl doesn’t include the CI runner’s platform (e.g. linux_amd64). Fix with terraform providers lock -platform=linux_amd64 -platform=darwin_arm64 … locally and commit the updated lockfile. (Always commit the lockfile and pin versions.)
  9. A provider call returns 403 AccessDenied even though you’re authenticated. What’s the distinction and how do you debug? Authentication (who you are) succeeded; authorisation (what you may do) failed — the identity lacks the IAM permission for that action/resource. Confirm the principal (aws sts get-caller-identity / az account show), check its policy/role for the action named in the error, and grant the minimum missing permission. (TF_LOG=DEBUG shows the exact API call.)
  10. How do you debug a provider that’s behaving opaquely or hanging? Set TF_LOG=DEBUG (or TRACE) and TF_LOG_PATH=./tf.log, reproduce, then grep the log for the resource and the HTTP status/request body. Use TF_LOG_CORE/TF_LOG_PROVIDER to isolate core-vs-plugin. Remember debug logs can contain secrets — capture to a file, scrub, and delete.
  11. State got corrupted (invalid JSON) after a failed write. How do you recover? Restore the last good version from your backup or backend versioning (S3 object versions / Terraform Cloud state history / blob snapshots / your state pull backup), then run terraform plan to confirm No changes. Never hand-edit state to “repair” it. Prevent it with a versioned, locking remote backend.
  12. What’s the difference between terraform taint/-replace and changing config? -replace=<addr> (modern) / taint (legacy) forces a destroy-and-recreate of that one resource on the next apply, without any config change — used when a resource is in a bad runtime state. A config change re-plans normally. -replace is destructive (downtime, new IDs), so reserve it for “rebuild this exact thing”.

Quick check

  1. You run terraform state rm aws_s3_bucket.logs to get rid of a bucket. What actually happens to the bucket, and what should you have used to delete it?
  2. A plan destroys and recreates five subnets after you removed one entry from a list. Which meta-argument is in use, what’s the underlying cause, and what’s the fix that preserves the survivors?
  3. You see Error acquiring the state lock with a Lock Info block. What’s the one thing you must confirm before running force-unlock, and which value do you pass to it?
  4. What does terraform apply -refresh-only change — infrastructure, state, or both — and when would you run it?
  5. A provider returns 403 AccessDenied while you’re logged in. Is this an authentication or an authorisation problem, and which command tells you who you are?

Answers

  1. Nothing happens to the bucket: state rm only forgets it (removes it from state); the bucket keeps existing and is now unmanaged. To actually delete it, remove the resource from config and terraform apply (or terraform destroy). state rm is for handing a resource off, not deleting it.
  2. count is in use; it’s positional, so removing a middle item re-indexes every later instance, making Terraform plan destroy/recreate of unchanged subnets. Switch the collection to for_each keyed by a stable string, and migrate existing instances with moved {} blocks or terraform state mv so only the genuinely removed one goes.
  3. Confirm that no apply is actually running (no CI job, no colleague) — breaking a live lock can corrupt shared state. Then pass the lock ID from the error’s Lock Info block: terraform force-unlock <ID>.
  4. It updates state only (to match the real world); it does not change any infrastructure. Run it to absorb an intentional out-of-band change into state, or to inspect pure drift with plan -refresh-only first.
  5. Authorisation — you’re authenticated (logged in) but the identity lacks the IAM permission for that action. aws sts get-caller-identity (or az account show / gcloud auth list) tells you which principal you are, so you can grant the minimum missing permission.

Exercise

Build your own Terraform break-and-fix runbook (timed, free, local). In a fresh directory using only the null/random/local providers, plant five faults spread across the layers and prove you can diagnose each from observation alone, before you fix it.

  1. Scaffold a config (terraform init with random + local).

  2. Plant five faults, covering each layer at least once:

    • Config (re-index): a random_pet collection on count; remove a middle item and capture the destructive plan.
    • Config (type): point a for_each at a list (not a set) and capture the error.
    • State (rename): rename a resource and capture the destroy/create plan.
    • Drift: create a local_file, edit it out-of-band, and capture plan -refresh-only.
    • Provider/version: loosen a provider constraint, init -upgrade, and observe the resolved version change in terraform version.
  3. For each, write down the layer, the diagnostic command, the exact error/diff, the root cause, and the fix — before fixing. Time yourself: under 6 minutes per fault.

  4. Fix each with the smallest safe change (a moved {}/state mv for the rename; for_each + migration for the re-index; toset() for the type; your chosen reconciliation for drift; a version pin for the provider) and verify each yields terraform plan → No changes (except where a change is intended).

  5. Self-assess:

    Criterion | Target
    Identified the correct layer before touching anything | All 5
    Found the root cause from plan/state show/validate/console (not guessing) | All 5
    Fixed with the smallest, reversible change (declarative where possible) | All 5
    Verified plan is clean (No changes) afterwards | All 5
    Whole drill completed | Under 30 minutes
  6. Cleanup: terraform destroy -auto-approve then delete the directory.

Cost note: free / local — the whole exercise uses providers that create nothing in any cloud.

Certification mapping

Glossary

Next steps

You can now diagnose the everyday Terraform failures across every layer — state, providers/auth, configuration, and drift — and read what the engine is actually doing with TF_LOG. The next lesson steps up from “fix this error” to “design an IaC platform that rarely produces these errors in the first place”:
