Terraform State Surgery: Recovering from Corruption, Locks, and Split-Brain

State is the one piece of Terraform you cannot regenerate from your .tf files. The configuration is reproducible; the mapping from configuration to real-world resource IDs is not. When that mapping is locked by a dead process, half-truncated by an interrupted apply, or duplicated across two backends that both think they are authoritative, you are no longer doing infrastructure-as-code. You are doing surgery. This guide is the playbook I reach for during a state incident: how to read the patient’s vitals first, then make the smallest precise cut that restores agreement between configuration, state, and reality.

The cardinal rule of state surgery: back up the state before you touch it, every time, no exceptions. A terraform state pull > backup-$(date +%s).tfstate costs nothing and is the difference between a recoverable mistake and a rebuild.
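
One way to make that habit harder to skip is to wrap it in a small shell function. This is a minimal sketch, not part of Terraform: backup_state is a hypothetical helper name, and it only verifies that the pull succeeded and produced a non-empty file.

```shell
# backup_state [DIR]
# Pull the current state into DIR (default: current directory) under a
# timestamped name and print the path. Hypothetical helper, not a Terraform
# command. Nothing is written if the pull fails or returns an empty file.
backup_state() {
  local dir="${1:-.}"
  local out="$dir/backup-$(date +%s).tfstate"
  local tmp
  tmp="$(mktemp)" || return 1
  if terraform state pull > "$tmp" && [ -s "$tmp" ]; then
    mv "$tmp" "$out" && echo "$out"
  else
    rm -f "$tmp"
    echo "state pull failed or returned nothing; no backup written" >&2
    return 1
  fi
}

# Usage (commented out so sourcing this snippet has no side effects):
#   backup="$(backup_state ./state-backups)" || exit 1
```

Writing to a temporary file first means a failed pull never leaves behind a zero-byte file that looks like a backup.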

1. Anatomy of the state file: serial, lineage, and resource addressing

Before editing state you must be able to read it. A state file is JSON with a small set of fields that govern correctness. Pull the current state and inspect the top-level keys:

terraform state pull > current.tfstate
jq '{version, terraform_version, serial, lineage}' current.tfstate

{
  "version": 4,
  "terraform_version": "1.9.5",
  "serial": 187,
  "lineage": "8f2a1c9e-4b3d-4a77-9f12-2d6e5a0b1c34"
}

Three fields drive every decision you will make:

Field	Meaning	Why it matters in an incident
`serial`	Monotonic counter, incremented on every write	Backends use it for optimistic locking. A lower serial overwriting a higher one means lost writes.
`lineage`	UUID generated when state is first created	Two states with different lineage are not versions of each other. Pushing across a lineage boundary is the canonical split-brain trigger.
`version`	State format version (4 since Terraform 0.12)	Do not confuse with `terraform_version`, which records the CLI release that last wrote the file. A hand edit that breaks the version 4 schema leaves the file unreadable.

Resource addresses are the coordinates you operate on. The address module.network.aws_subnet.private[0] decomposes into module path (module.network), type (aws_subnet), name (private), and index key ([0] for count, ["a"] for for_each). List every address Terraform currently tracks:

terraform state list
# module.network.aws_vpc.this
# module.network.aws_subnet.private[0]
# module.network.aws_subnet.private[1]
# aws_db_instance.primary

Inspect a single resource to see its real-world id, provider, and attributes:

terraform state show aws_db_instance.primary

You now have the vocabulary. Everything below manipulates these addresses, the serial, or the lineage.

2. Breaking stale locks safely with force-unlock

A lock prevents two operations from writing state simultaneously. When a CI job is killed mid-apply, the lock can survive the process. The next run fails:

Error: Error acquiring the state lock

Lock Info:
  ID:        f4c2b3a1-6d5e-4f8a-9b2c-1e7d3a0f5c91
  Operation: OperationTypeApply
  Who:       runner@ci-agent-07
  Created:   2026-06-08 09:14:22.13 +0000 UTC

Do not reflexively force-unlock. First confirm the holding process is actually dead. If a teammate’s apply is still running and you break their lock, you will get two concurrent writers and corrupt state. Verify the CI job is terminated, the laptop is closed, the pipeline shows failed. Then, and only then:

terraform force-unlock f4c2b3a1-6d5e-4f8a-9b2c-1e7d3a0f5c91

The lock ID comes straight from the error. force-unlock removes the lock entry without modifying state itself, so it is low-risk once you have confirmed no live writer.
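
Copying the ID by hand from a scrolling CI log is error-prone. The sketch below pulls it out of the error text instead. lock_id_from_error is a hypothetical helper name, and it assumes the error contains an "ID:" label followed by the lock ID, as in the output above.

```shell
# lock_id_from_error
# Read Terraform's lock error text on stdin and print the token that follows
# the first "ID:" label. Scanning every field (rather than only the first)
# tolerates a leading gutter character on each line.
# Hypothetical helper, not a Terraform command.
lock_id_from_error() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "ID:") { print $(i + 1); exit } }'
}

# Usage (commented out so sourcing this snippet has no side effects):
#   id="$(terraform plan -no-color 2>&1 | lock_id_from_error)"
#   [ -n "$id" ] && echo "confirm the holder is dead, then: terraform force-unlock $id"
```

The usage deliberately prints the command rather than running it: confirming the holder is dead stays a human decision.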

Backend-specific lock cleanup

When force-unlock cannot reach the backend (corrupted lock table, deleted lock object), clean the lock at the source.

AWS S3 + DynamoDB. Terraform 1.10+ supports native S3 lockfiles via use_lockfile = true, but the long-standing pattern uses a DynamoDB table. The lock item's LockID is <bucket>/<key>. The sibling item whose LockID ends in -md5 holds the state checksum, not the lock, so leave it alone unless you are repairing a digest mismatch:

aws dynamodb delete-item \
  --table-name terraform-locks \
  --key '{"LockID": {"S": "my-tfstate-bucket/prod/network.tfstate"}}'

If you are on native S3 lockfiles, the lock is a sibling object instead:

aws s3 rm s3://my-tfstate-bucket/prod/network.tfstate.tflock

Azure Blob Storage. The lock is a blob lease, not a separate object. Break the lease directly:

az storage blob lease break \
  --account-name tfstateprod \
  --container-name tfstate \
  --blob-name prod/network.tfstate \
  --auth-mode login

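Google Cloud Storage. GCS uses a .tflock object alongside the state. The backend names it <prefix>/<workspace>.tflock, so .tflock replaces the .tfstate extension rather than being appended to it:

gsutil rm gs://my-tfstate-bucket/prod/network.tflock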

After clearing the lock, run terraform plan immediately. A plan that succeeds and shows the expected delta confirms the lock was the only problem and state itself is intact.

3. Surgical edits with state mv, rm, replace-provider, and pull/push

These are your scalpels. Every one of them rewrites state, so the backup from the intro is mandatory before each.

state mv — rename or move a resource without destroying it. Use this after a refactor renames a resource or wraps it in a module. It updates the address; the real resource is never touched.

# Resource pulled into a module
terraform state mv aws_s3_bucket.logs module.logging.aws_s3_bucket.this

# count -> for_each conversion: re-key by hand
terraform state mv 'aws_instance.web[0]' 'aws_instance.web["az-a"]'

Note: modern Terraform prefers moved {} blocks in configuration for refactors, since they are reviewable and apply automatically. Reach for state mv during an incident or for one-off corrections that should not live in the config.

state rm — forget a resource without deleting it. This removes the resource from state while leaving the real infrastructure running. Essential when state has a phantom entry whose backing resource was deleted out of band, or when you are handing a resource off to another state.

terraform state rm aws_db_instance.legacy_replica

After rm, Terraform no longer manages that object. If the resource still exists and you want it back under management, you will import it (Section 7).

state replace-provider — rewrite provider source addresses. Needed when state still records a provider under an old source address that init cannot resolve. The classic case is the 0.13 upgrade, where state written by older versions refers to the legacy -/aws address instead of hashicorp/aws:

terraform state replace-provider \
  registry.terraform.io/-/aws \
  registry.terraform.io/hashicorp/aws

state pull / state push — read and write the raw file. pull emits state to stdout for inspection or backup. push uploads a local file to the backend and is the most dangerous command in Terraform. It enforces the serial and lineage guards by default; never use -force to bypass them unless you have personally reconciled both files and understand exactly what you are overwriting.

terraform state pull > backup.tfstate
# ... edit a local copy with jq, validate it ...
terraform state push edited.tfstate

For targeted attribute surgery, edit a pulled copy with jq rather than a text editor (a text editor invites whitespace and quoting corruption), then bump the serial so the backend accepts it as a newer write:

jq '(.resources[] | select(.type=="aws_db_instance") | .instances[0].attributes.deletion_protection) = true | .serial += 1' \
  backup.tfstate > edited.tfstate
terraform state push edited.tfstate
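
Before pushing a hand-edited file, it is worth checking mechanically that the edit did not break the JSON, change the lineage, or forget the serial bump. The following is a sketch of such a gate; check_push_safe is a hypothetical helper name and it assumes jq is installed.

```shell
# check_push_safe BACKUP EDITED
# Succeeds only if EDITED parses as JSON, keeps BACKUP's lineage, and carries
# a strictly higher serial. Any jq error (for example, malformed JSON) also
# counts as failure. Hypothetical helper, not a Terraform command.
check_push_safe() {
  jq -e --slurpfile old "$1" '
    .lineage == $old[0].lineage and .serial > $old[0].serial
  ' "$2" >/dev/null 2>&1
}

# Usage (commented out so sourcing this snippet has no side effects):
#   check_push_safe backup.tfstate edited.tfstate \
#     && terraform state push edited.tfstate
```

This mirrors the guards that state push applies on its own, but it fails locally, before anything touches the backend.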

4. Reconciling state vs reality after out-of-band console changes

Someone changed a resource in the cloud console. Now state disagrees with reality. Terraform calls this drift, and the first move is always to see it, not to clobber it. Refresh state into a separate plan file rather than mutating it:

terraform plan -refresh-only -out=refresh.tfplan
terraform show refresh.tfplan

-refresh-only reconciles state with the provider’s view of the world and shows you exactly which attributes changed, without proposing to revert anything. You then make an explicit decision per attribute:

Adopt reality (the console change was correct): apply the refresh so state records the new values.
```
terraform apply -refresh-only
```
Revert to code (the console change was unauthorized): run a normal terraform apply, which plans the resource back to its declared configuration.
Ignore a volatile attribute (tags injected by a policy engine, autoscaling-managed capacity): add it to lifecycle.ignore_changes so it stops generating noise.
```
resource "aws_autoscaling_group" "app" {
  # ...
  lifecycle {
    ignore_changes = [desired_capacity]
  }
}
```

The trap here is running a bare terraform apply in a panic and reverting a legitimate emergency hotfix that the on-call engineer made at 3 a.m. -refresh-only first, decide second.
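
On a large state, the human-readable refresh plan can run to hundreds of lines. A compact per-resource summary helps with the "decide second" step. The sketch below reads the JSON form of the saved plan and relies on its resource_drift array. drift_summary is a hypothetical helper name and it assumes jq is installed.

```shell
# drift_summary PLAN_JSON
# Print one line per drifted resource: "<address><TAB><actions>".
# PLAN_JSON is the output of `terraform show -json <planfile>`.
# Prints nothing when the plan records no drift.
# Hypothetical helper, not a Terraform command.
drift_summary() {
  jq -r '.resource_drift // [] | .[] | "\(.address)\t\(.change.actions | join(","))"' "$1"
}

# Usage (commented out so sourcing this snippet has no side effects):
#   terraform show -json refresh.tfplan > refresh.json
#   drift_summary refresh.json
```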

5. Recovering a deleted or truncated state from versioning and snapshots

An interrupted apply over a flaky network, or a fat-fingered terraform state rm of the wrong addresses, can leave you with a truncated or emptied state. If your backend has versioning enabled (and it must — see Section 8), recovery is a rollback, not a rebuild.

Detect the damage. A near-empty state with a high serial is the signature of a truncated write:

terraform state pull | jq '{serial, resource_count: (.resources | length)}'
# {"serial": 188, "resource_count": 0}   <- was 40-something yesterday

AWS S3. List object versions newest-first and pull the last good one:

aws s3api list-object-versions \
  --bucket my-tfstate-bucket \
  --prefix prod/network.tfstate \
  --query 'reverse(sort_by(Versions, &LastModified))[:5].[VersionId,LastModified,Size]' \
  --output table

aws s3api get-object \
  --bucket my-tfstate-bucket \
  --key prod/network.tfstate \
  --version-id 3HL4kqC... \
  recovered.tfstate

Azure Blob Storage. With blob versioning enabled, list versions and download the chosen one:

az storage blob list \
  --account-name tfstateprod --container-name tfstate \
  --prefix prod/network.tfstate --include v --auth-mode login \
  --query "[].{name:name, version:versionId, modified:properties.lastModified}" -o table

az storage blob download \
  --account-name tfstateprod --container-name tfstate \
  --name prod/network.tfstate --version-id 2026-06-07T22:14:03.1Z \
  --file recovered.tfstate --auth-mode login

Google Cloud Storage. With object versioning, list generations and copy one back:

gsutil ls -a gs://my-tfstate-bucket/prod/network.tfstate
gsutil cp gs://my-tfstate-bucket/prod/network.tfstate#1717797243000000 recovered.tfstate

Validate the recovered file’s resource count and lineage before pushing. The recovered state has an older serial than the truncated one currently in the backend, so a plain push will be rejected by the serial guard. Bump the serial above the current backend value, then push:

CURRENT=$(terraform state pull | jq '.serial')
jq ".serial = $CURRENT + 1" recovered.tfstate > restore.tfstate
terraform state push restore.tfstate
terraform plan   # expect: No changes. Your infrastructure matches the configuration.

A clean plan after restore is the only acceptable end state.
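
The validate-then-bump step can be folded into one function so that a bad candidate never produces a pushable file. This is a sketch under the assumption that the damaged backend state still parses as JSON. make_restore is a hypothetical helper name and it requires jq.

```shell
# make_restore RECOVERED CURRENT OUT
# Write OUT as a copy of RECOVERED with serial = CURRENT's serial + 1.
# Refuses (and leaves no OUT file) if RECOVERED has no resources or its
# lineage differs from CURRENT. Hypothetical helper, not a Terraform command.
make_restore() {
  jq -e --slurpfile cur "$2" '
    if (.resources | length) == 0 then error("recovered state has no resources")
    elif .lineage != $cur[0].lineage then error("lineage differs from backend state")
    else .serial = ($cur[0].serial + 1) end
  ' "$1" > "$3" || { rm -f "$3"; return 1; }
}

# Usage (commented out so sourcing this snippet has no side effects):
#   terraform state pull > current.tfstate
#   make_restore recovered.tfstate current.tfstate restore.tfstate \
#     && terraform state push restore.tfstate
```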

6. Resolving lineage mismatch and split-brain after a bad migration

Split-brain is the worst state incident because nothing is “broken” in an obvious way — two valid state files both claim authority over the same infrastructure. It usually follows a botched backend migration: a terraform init -migrate-state that half-completed, or a team that ran applies against an old backend while another ran against the new one.

Diagnose by lineage. Pull both candidate states and compare:

diff <(jq -c '{lineage, serial, n: (.resources|length)}' a.tfstate) \
     <(jq -c '{lineage, serial, n: (.resources|length)}' b.tfstate)

< {"lineage":"8f2a1c9e-...","serial":187,"n":42}
> {"lineage":"d1b7e4f0-...","serial":35,"n":40}

Different lineage UUIDs confirm split-brain: these are two independent state histories, not two versions of one. The symptom in the wild is Terraform proposing to create resources that already exist (because one state never learned about them) or to destroy resources another state created.

Resolution strategy. You cannot merge by overwriting — a state push -force of one over the other discards everything the loser tracked, and Terraform will then try to recreate or destroy real infrastructure. Instead:

Pick the authoritative state. Choose the one with the higher resource count and the serial that reflects the most recent real applies. Back up both.

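Identify resources the authoritative state is missing. Compare the real resource IDs recorded in each state, not addresses against IDs. Every line printed is an ID that only the other state knows about:

comm -13 <(terraform state pull | jq -r '.resources[] | select(.mode=="managed") | .instances[].attributes.id' | sort) \
         <(jq -r '.resources[] | select(.mode=="managed") | .instances[].attributes.id' other.tfstate | sort)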

Bring the missing resources into the authoritative state with import (Section 7), keyed by their real IDs from the losing state. Do not copy JSON between files by hand; import regenerates the entry correctly under the right lineage.
Decommission the losing state. Once every resource lives in the authoritative state and a plan is clean, delete the orphaned state object so no future run can target it.
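
The steps above can be partly mechanized by drafting import blocks from the losing state. The sketch below is a starting point only: import_blocks_for_missing is a hypothetical helper name, it requires jq, and it assumes each resource's import ID equals its attributes.id. That assumption does not hold for every resource type, so review each generated block before planning.

```shell
# import_blocks_for_missing AUTHORITATIVE OTHER
# For every managed resource instance in OTHER whose attributes.id is absent
# from AUTHORITATIVE, print a draft import block addressed as in OTHER.
# Hypothetical helper, not a Terraform command.
import_blocks_for_missing() {
  jq -r --slurpfile auth "$1" '
    ($auth[0].resources
      | map(select(.mode == "managed") | .instances[].attributes.id)) as $known
    | .resources[]
    | select(.mode == "managed")
    | . as $r
    | .instances[]
    | select(.attributes.id as $id | ($known | index($id)) == null)
    | (if $r.module then $r.module + "." else "" end) as $mod
    | (if .index_key == null then ""
       elif (.index_key | type) == "number" then "[\(.index_key)]"
       else "[\(.index_key | tojson)]" end) as $key
    | "import {\n  to = \($mod)\($r.type).\($r.name)\($key)\n  id = \(.attributes.id | tojson)\n}\n"
  ' "$2"
}

# Usage (commented out so sourcing this snippet has no side effects):
#   terraform state pull > authoritative.tfstate
#   import_blocks_for_missing authoritative.tfstate other.tfstate > imports.tf
```

The generated file is a draft for review in a pull request, which keeps the "do not copy JSON by hand" rule intact: only addresses and IDs cross over, and import rebuilds the rest.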

The discipline that prevents this: exactly one backend block per state, migrations done with terraform init -migrate-state and verified by a clean plan before any apply touches the new backend, and an immediate freeze on the old backend the moment migration starts.

7. Rebuilding state from scratch with import when backups are gone

Sometimes there is no versioning and no backup — the worst case. The infrastructure is running; the state is gone. You rebuild the mapping resource by resource with import. The configuration in .tf still exists, so you are reconstructing only the state side.

Prefer import blocks (Terraform 1.5+) over the imperative terraform import command. Import blocks are declarative, reviewable in a PR, plannable as a batch, and can generate configuration:

import {
  to = aws_vpc.this
  id = "vpc-0a1b2c3d4e5f67890"
}

import {
  to = aws_subnet.private["az-a"]
  id = "subnet-0123456789abcdef0"
}

import {
  to = aws_db_instance.primary
  id = "prod-primary-db"
}

Plan the imports to preview what will be brought under management without writing state yet:

terraform plan

For resources whose HCL you do not yet have, let Terraform scaffold it (it writes the missing resource blocks to the file you name):

terraform plan -generate-config-out=generated.tf

Review the generated HCL, fold it into your real modules, then apply to commit the import to state:

terraform apply

Each resource type has its own import ID format, and getting it wrong is the main source of friction — an aws_route53_record imports as ZONEID_name_type, an aws_iam_role_policy_attachment as role-name/policy-arn. Check the provider docs’ “Import” section for the exact format per resource before writing the block. Work in dependency order (VPC before subnets before instances) so references resolve. After the last import, the acceptance test is identical to every other recovery: terraform plan reports no changes.

Verify

Run this sequence after any state surgery. All four must pass before you call the incident resolved and unlock the pipeline.

# 1. State is well-formed and non-empty
terraform state pull | jq -e '.resources | length > 0' >/dev/null \
  && echo "OK: state has resources"

# 2. Lineage and serial are sane (single lineage, plausible serial)
terraform state pull | jq '{lineage, serial, version, terraform_version}'

# 3. The decisive check: configuration, state, and reality all agree
terraform plan -detailed-exitcode
#   exit 0 = no changes (clean)   exit 2 = changes pending (investigate)
#   exit 1 = error

# 4. No stale lock remains
terraform plan   # must acquire and release the lock without error

A -detailed-exitcode of 0 is the gold standard: it proves state matches both the code and the live cloud. An exit code of 2 means there is still a delta to reconcile (return to Section 4). Only after a clean plan should you re-enable CI and announce recovery.
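
If the verification runs in a pipeline, the exit code needs to be interpreted explicitly, because 2 is neither success nor a crash. A minimal sketch follows; plan_verdict is a hypothetical helper name.

```shell
# plan_verdict EXIT_CODE
# Translate the exit status of `terraform plan -detailed-exitcode` into a
# one-line verdict. Returns 0 only for a clean plan, so it can gate the step
# that re-enables CI. Hypothetical helper, not a Terraform command.
plan_verdict() {
  case "$1" in
    0) echo "clean: no changes"; return 0 ;;
    2) echo "pending: changes to reconcile"; return 1 ;;
    *) echo "error: plan failed"; return 1 ;;
  esac
}

# Usage (commented out so sourcing this snippet has no side effects):
#   terraform plan -detailed-exitcode >/dev/null 2>&1
#   plan_verdict $? || exit 1
```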

Checklist

Backed up state with terraform state pull > backup-<timestamp>.tfstate before any mutation
Confirmed the lock-holding process is genuinely dead before force-unlock or backend lock deletion
Used state mv / state rm only with full addresses, and verified the real resource was untouched afterward
Ran terraform plan -refresh-only to inspect drift before any apply that could revert a legitimate change
Restored from backend versioning with the serial bumped above the current backend value, never with push -force
Verified single, consistent lineage across the surviving state; decommissioned any split-brain orphan
Rebuilt missing resources with import blocks (not hand-edited JSON) in dependency order
Confirmed recovery with terraform plan -detailed-exitcode returning 0
Enabled bucket/blob versioning, server-side encryption, and least-privilege backend access to prevent recurrence

8. Hardening: versioning, encryption, and access controls

Every incident in this guide is preventable, and the prevention is cheap. The single highest-leverage control is backend versioning — it turns “rebuild from scratch” into “roll back one version.”

AWS S3 backend. Enable versioning and default encryption on the bucket, use native S3 locking, and let the bucket policy enforce TLS:

terraform {
  backend "s3" {
    bucket       = "my-tfstate-bucket"
    key          = "prod/network.tfstate"
    region       = "us-east-1"
    encrypt      = true
    use_lockfile = true        # native S3 locking, Terraform 1.10+
    kms_key_id   = "arn:aws:kms:us-east-1:111122223333:key/abcd-..."
  }
}

aws s3api put-bucket-versioning --bucket my-tfstate-bucket \
  --versioning-configuration Status=Enabled
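
The TLS requirement is not a backend setting; it lives in the bucket policy. One common shape for such a policy is a blanket deny on requests where aws:SecureTransport is false. The sketch below generates it for a given bucket. tls_only_policy is a hypothetical helper name and the statement ID is illustrative.

```shell
# tls_only_policy BUCKET
# Print a bucket policy document that denies every S3 action on BUCKET and
# its objects when the request does not use TLS.
# Hypothetical helper; the Sid value is illustrative.
tls_only_policy() {
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyInsecureTransport",
      "Effect": "Deny",
      "Principal": "*",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::$1",
        "arn:aws:s3:::$1/*"
      ],
      "Condition": { "Bool": { "aws:SecureTransport": "false" } }
    }
  ]
}
EOF
}

# Usage (commented out so sourcing this snippet has no side effects):
#   aws s3api put-bucket-policy --bucket my-tfstate-bucket \
#     --policy "$(tls_only_policy my-tfstate-bucket)"
```

put-bucket-policy replaces the whole policy, so if the bucket already has one, merge this statement into it rather than applying the output as-is.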

Azure backend. Enable blob versioning and soft delete on the storage account, and authenticate with Azure AD rather than a shared key:

terraform {
  backend "azurerm" {
    resource_group_name  = "rg-tfstate"
    storage_account_name = "tfstateprod"
    container_name       = "tfstate"
    key                  = "prod/network.tfstate"
    use_azuread_auth     = true
  }
}

Lock down access regardless of cloud:

Least privilege. The pipeline identity gets read/write on its one state key only — never a wildcard over the whole bucket. A blast-radius limit is what stops one compromised pipeline from corrupting every environment’s state.
One state per blast radius. Separate state per environment and per bounded component. Small states recover faster and fail in isolation.
Encryption at rest and in transit. SSE/KMS on the object, TLS enforced by bucket policy. State files contain resource attributes that are frequently sensitive (connection strings, generated passwords), so treat the backend as a secrets store.
No state in version control, ever. .tfstate and .tfstate.backup belong in .gitignore. A state file in a Git history is a credential leak and a split-brain waiting to happen.

The teams that never need this playbook are not lucky. They enabled versioning on day one, scoped their backends tightly, and treated state push -force as a command that does not exist. Do that, and the worst state incident you will ever face is a force-unlock after a killed CI job — a thirty-second fix instead of an afternoon of surgery.


Written by Vinod
