rawops.dev

P2

Terraform State Lock Stuck — Safe Force-Unlock

Recover from a stuck DynamoDB/Consul/S3 state lock without corrupting state. Covers identifying the holder, verifying the prior run actually died, and when force-unlock is safe.

15 min · 7 steps

Terraform's lock error names the holder, lock ID, timestamp, and backend path. Don't force-unlock without these.

terraform plan 2>&1 | tail -25
Expected: Output contains a `Lock Info:` block with one field per line: `ID: <uuid>`, `Path: <backend path>`, `Operation: <op>`, `Who: <user@host>`, `Created: <timestamp>`. Note the ID; it's the argument to force-unlock.
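
If you want the lock ID in a variable rather than copied by hand, a small filter can pull it out of the error text. This is a minimal sketch: `extract_lock_id` is a hypothetical helper name, and it assumes the ID is UUID-shaped, as the S3/DynamoDB backend produces. Other backends may use a different ID format, in which case copy the ID manually.

```shell
# Print the first UUID-shaped token found on stdin (empty output if none).
# Assumption: the lock ID is a lowercase UUID, which is not true for every backend.
extract_lock_id() {
  grep -oE '[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}' | head -n 1
}
```

Usage would look like `LOCK_ID="$(terraform plan -no-color 2>&1 | extract_lock_id)"`. Check that the variable is non-empty before passing it anywhere.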

A live CI job that's merely slow still holds the lock legitimately. Cancel it and confirm it has exited (a graceful stop normally releases the lock on its own) rather than force-unlocking under it.

# GitHub Actions: check the run that matches the Who=<actor> + Created=<ts>
# GitLab: https://<gitlab>/<project>/-/pipelines
# Jenkins: https://<jenkins>/job/<name>/<build>/console

# If the runner is reachable over SSH:
ps -ef | grep terraform | grep -v grep
# Or the container:
docker ps --filter name=tf-runner
Expected: No running terraform process anywhere, CI job marked failed/cancelled, runner not executing. Anything else and you risk corrupting state.
Force-unlocking a live run lets two writers update the state file at the same time. With S3 versioning enabled you can roll back to an earlier object version; Consul and local state give you no such safety net.

Look at the raw lock object to cross-check its age and ownership before acting.

# S3 + DynamoDB backend:
aws dynamodb get-item --table-name terraform-locks --key '{"LockID":{"S":"<bucket>/path/to/state.tfstate"}}'

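# Consul backend (lock metadata is stored at <path>/.lockinfo, next to the state key):
consul kv get terraform/<project>/.lockinfo

# GCS backend (lock object is <prefix>/<workspace>.tflock, e.g. default.tflock):
gsutil cat gs://<bucket>/<prefix>/default.tflock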
Expected: The lock's `Who`, `Created`, and `Operation` should match Terraform's error (for DynamoDB they sit inside the item's `Info` attribute as JSON). The `Created` timestamp tells you how long the lock has been held: more than 30 minutes with no active runner makes it a safe candidate for force-unlock.
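
The 30-minute rule is easy to get wrong by eye when timestamps are in UTC, so it can help to compute the age. A rough sketch, assuming GNU `date` and an RFC 3339 `Created` value as stored in the lock metadata; `lock_age_seconds` is a hypothetical helper, and its optional second argument (current epoch seconds) exists only to make the result reproducible.

```shell
# Print the age of a lock in seconds.
#   $1: Created timestamp, e.g. 2024-01-01T00:00:00.123456Z
#   $2: optional "now" as epoch seconds (defaults to the current time)
# Assumption: GNU date, which accepts -d with an ISO 8601 / RFC 3339 string.
lock_age_seconds() (
  created="$1"
  now="${2:-$(date -u +%s)}"
  # Drop fractional seconds so the parser only sees whole seconds.
  trimmed="$(printf '%s' "$created" | sed -E 's/\.[0-9]+//')"
  created_epoch="$(date -u -d "$trimmed" +%s)" || exit 1
  echo $(( now - created_epoch ))
)
```

A result above 1800 satisfies the age half of the rule. The "no active runner" half still has to be verified separately.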

Before any destructive action, pull a copy of the current state. This is the single most important step.

terraform state pull > state.backup.$(date +%s).json
ls -la state.backup.*.json

# S3 extra insurance: inspect versioning
aws s3api list-object-versions --bucket <bucket> --prefix path/to/state.tfstate --max-items 5
Expected: Backup file written locally. If S3 versioning is on, any later recovery is just an `aws s3api copy-object` from a known-good version ID.
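
A backup that is empty or truncated is worse than none, because it gives false confidence. The following is a shallow sanity check, not a JSON validation: it only confirms the file is non-empty and contains the `serial` and `lineage` keys that a state snapshot carries. `looks_like_state` is a hypothetical helper name.

```shell
# Return 0 if the file plausibly holds a pulled Terraform state snapshot.
# Shallow check only: non-empty and mentions the "serial" and "lineage" keys.
looks_like_state() {
  [ -s "$1" ] || return 1
  grep -q '"serial"' "$1" && grep -q '"lineage"' "$1"
}
```

For example, `looks_like_state state.backup.1700000000.json || echo "backup looks wrong, stop here"` (the file name is illustrative).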

Release the lock with the ID from step 1. Force-unlock removes only the lock; it does not modify the state itself.

terraform force-unlock -force <lock-uuid>
Expected: `Terraform state has been successfully unlocked!`. Your next `terraform plan` works normally.

Run a plan to confirm the state still parses and lines up with the configuration and real infrastructure.

# Capture terraform's own exit code; piping straight into tail would report tail's status instead.
terraform plan -detailed-exitcode > plan.log 2>&1; echo "exit code: $?"
tail -10 plan.log
# exit 0 = no changes, 2 = changes (expected if real drift), 1 = error (bad)
Expected: Exit code 0 or 2 means the state parsed and planned cleanly. Exit code 1 with parsing errors means restore from backup.
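
When this check runs inside a script, the exit code has to come from terraform itself: in a pipeline such as `terraform plan ... | tail`, `$?` reports the status of `tail`. Below is a small sketch of mapping the three documented codes to the verdicts used in this runbook; `plan_verdict` is a hypothetical helper, and the wording of its messages is only illustrative.

```shell
# Translate a `terraform plan -detailed-exitcode` status into a verdict.
#   $1: exit code captured directly from terraform (not from a pipe)
plan_verdict() {
  case "$1" in
    0) echo "ok: no changes" ;;
    2) echo "ok: changes pending, review for real drift" ;;
    1) echo "error: inspect output, consider restoring from backup" ;;
    *) echo "unknown exit code: $1" ;;
  esac
}

# Intended use (left commented out so the sketch runs without terraform):
# terraform plan -detailed-exitcode > plan.log 2>&1
# plan_verdict "$?"
```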

Most stuck locks come from CI jobs killed by timeouts. Add a cleanup step that runs on cancel, plus an alert on locks held longer than expected.

# GitHub Actions: always clean up on cancel
# - name: Terraform
#   run: terraform apply -auto-approve
#   timeout-minutes: 20
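# - name: Release lock on cancel
#   if: cancelled()
#   # `terraform show -json` does not expose lock info; read the ID from the lock error instead.
#   # Only safe when no other pipeline can run against this state at the same time.
#   run: |
#     LOCK_ID=$(terraform plan -no-color 2>&1 | grep -oE '[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}' | head -n 1)
#     [ -n "$LOCK_ID" ] && terraform force-unlock -force "$LOCK_ID" || true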

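# Alert on lock items older than 30 min. This is a Prometheus-style rule; the metric
# name is hypothetical and must come from your own exporter, since DynamoDB does not
# publish item age as a built-in metric:
# alert: TerraformLockStuck
# expr: aws_dynamodb_item_age{table="terraform-locks"} > 1800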
Expected: Future cancellations release locks automatically. Alerting flags stuck locks before the next deploy hits them.
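
Since no built-in metric reports lock age, one option is a scheduled job that lists lock items and flags the old ones. This is a sketch of the filtering half only, assuming GNU `date` and input lines of the form `<LockID> <Created>` (whitespace-separated, RFC 3339 timestamp). `stale_locks` is a hypothetical helper. It skips the `-md5` items that the S3 backend keeps in the same table, which hold a state digest rather than a lock.

```shell
# Print the LockID of every lock older than a threshold.
#   $1: max age in seconds (e.g. 1800)
#   $2: optional "now" as epoch seconds (defaults to the current time)
#   stdin: one "<LockID> <Created>" pair per line
stale_locks() (
  max_age="$1"
  now="${2:-$(date -u +%s)}"
  while read -r lock_id created; do
    # "-md5" items store the state digest, not a lock.
    case "$lock_id" in *-md5) continue ;; esac
    [ -n "$created" ] || continue
    # Drop fractional seconds, then parse; skip lines that do not parse.
    trimmed="$(printf '%s' "$created" | sed -E 's/\.[0-9]+//')"
    epoch="$(date -u -d "$trimmed" +%s 2>/dev/null)" || continue
    if [ $(( now - epoch )) -gt "$max_age" ]; then
      printf '%s\n' "$lock_id"
    fi
  done
)
```

The input could come from `aws dynamodb scan` on the lock table plus a `jq` projection of `LockID` and the `Created` field inside `Info`. Any non-empty output is a reason to page someone, not to unlock automatically.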