Describe the bug
Starting with Terragrunt v0.86.0, S3 state locks are not properly released when pressing Ctrl-C at the terraform apply confirmation prompt. The Terraform process is killed by signal before it can complete the lock cleanup operation, leaving orphaned .tflock files in S3.
Bisection Results
Through systematic testing, we identified that this regression was introduced in v0.86.0:
v0.72.2: ✅ Working
v0.82.0: ✅ Working
v0.85.0: ✅ Working
v0.85.1: ✅ Working
v0.86.0: ❌ BROKEN (lock not released)
v0.87.0: ❌ Broken
v0.92.1: ❌ Broken
Evidence from Trace Logs
Terragrunt v0.85.1 (Working):
2025-12-02T21:45:09.485-0500 [INFO] backend-s3: Attempting to unlock remote state (S3 Native only)...
2025-12-02T21:45:09.597-0500 [DEBUG] backend-s3: HTTP Request Sent: ... rpc.method=DeleteObject ...
2025-12-02T21:45:09.645-0500 [DEBUG] backend-s3: HTTP Response Received: ... http.status_code=204 ...
2025-12-02T21:45:09.645-0500 [DEBUG] backend-s3: Deleted lock file: ...
2025-12-02T21:45:09.645-0500 [INFO] backend-s3: Unlocked remote state (S3 Native only): ...
exit status 1
Terragrunt v0.86.0 (Broken):
2025-12-02T21:41:07.733-0500 [INFO] backend-s3: Attempting to unlock remote state (S3 Native only)...
signal: killed
The logs clearly show that in v0.86.0, Terraform begins the unlock process but receives signal: killed before it can send the DeleteObject request to S3.
Reproducing bugs
Reproduced on v0.86.0 through v0.92.1. Newer versions could not be tested because of a breaking change in the configs.
Configuration Details
Backend Configuration (generated by Terragrunt):
terraform {
  backend "s3" {
    bucket               = "xxxxxxxxxxxxxxxxxxx"
    encrypt              = true
    key                  = "xxxxxxxxxxxxxxxxxxx.tfstate"
    profile              = "xxxxxxxxxxxxxxxxxxx"
    region               = "us-east-1"
    use_lockfile         = true
    workspace_key_prefix = null
  }
}
Steps To Reproduce
- Configure S3 backend with use_lockfile = true
- Use Terragrunt v0.86.0 or higher
- Run terragrunt apply
- Let the plan complete fully
- Press Ctrl-C at the confirmation prompt
- Observe that the .tflock file remains in S3
- Run terragrunt apply again - it will fail with a lock error
Expected behavior
The S3 lock file should be automatically removed when:
- Interrupting the operation with Ctrl+C
- Any other graceful termination of the Terraform process
Actual Behavior
The .tflock file remains in S3 and must be manually removed using:
aws s3 rm s3://bucket-name/path/to/terraform.tfstate.tflock
or:
terraform force-unlock <LOCK_ID>
Additional context (AI Generated)
Root Cause
The issue was introduced by changes in internal/os/exec/cmd.go where the Command() function signature was changed from:
// v0.85.1
func Command(name string, args ...string) *Cmd
to:
// v0.86.0
func Command(ctx context.Context, name string, args ...string) *Cmd
This change switched from exec.Command() to exec.CommandContext(). When the user interrupts the operation with Ctrl+C, the context is cancelled, and by default exec.CommandContext then terminates the subprocess with os.Process.Kill (SIGKILL on Unix). Terraform is therefore killed before it can complete its cleanup routine.
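The default behaviour described above can be reproduced outside Terragrunt. The following is a minimal sketch, not Terragrunt's code: `killedByCancel` is a hypothetical helper, and `sleep 30` is an assumed stand-in for the Terraform subprocess on a Unix system.

```go
package main

import (
	"context"
	"fmt"
	"os/exec"
)

// killedByCancel (hypothetical helper) starts a long-running child with
// exec.CommandContext and cancels the context right away. With the default
// cancellation behaviour the child is stopped via os.Process.Kill, so it
// gets no chance to run any cleanup.
func killedByCancel() error {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// "sleep 30" stands in for the Terraform subprocess (assumption).
	cmd := exec.CommandContext(ctx, "sleep", "30")
	if err := cmd.Start(); err != nil {
		return err
	}

	// Stands in for the context cancellation triggered by Ctrl+C.
	cancel()

	// Wait reports how the child ended.
	return cmd.Wait()
}

func main() {
	fmt.Println(killedByCancel())
}
```

On a Unix system this should report `signal: killed`, the same string that appears in the v0.86.0 trace log.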
Additional Evidence from Code Diff
The test files were also updated to accept "signal: killed" as a valid outcome, which suggests the change was known to cause processes to be killed by signal:
From shell/run_cmd_unix_test.go:
// v0.85.1
expectedErr := fmt.Sprintf("Failed to execute \"%s 5\" in .\n\nexit status %d", cmdPath, expectedWait)
assert.EqualError(t, actualErr, expectedErr)
// v0.86.0
expectedExitStatusErr := fmt.Sprintf("Failed to execute \"%s 5\" in .\n\nexit status %d", cmdPath, expectedWait)
expectedKilledErr := fmt.Sprintf("Failed to execute \"%s 5\" in .\n\nsignal: killed", cmdPath)
// Now accepts EITHER error
Impact
This is a critical regression that affects all users of Terragrunt v0.86.0+ who:
- Use S3 backend with use_lockfile = true
- Ever need to interrupt a Terraform apply operation with Ctrl+C
- Work in team environments where orphaned locks can block other users
Every interrupted operation requires manual intervention to clean up the orphaned lock file.
Proposed Fix
Terragrunt should handle subprocess termination gracefully by:
- Using a separate context for subprocess execution that isn't immediately cancelled on interrupt
- Forwarding SIGINT/SIGTERM to the Terraform process and waiting for it to exit cleanly
- Only sending SIGKILL if Terraform doesn't exit after a reasonable timeout period
- Ensuring the cleanup phase completes before terminating the subprocess
The current implementation ties the subprocess lifecycle directly to the parent context cancellation, which prevents Terraform from performing necessary cleanup operations.
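One way to sketch the proposed behaviour is with the `Cancel` and `WaitDelay` fields of `exec.Cmd` (available since Go 1.20). This is an illustration under stated assumptions, not a patch for Terragrunt: `demoGraceful` and `script` are hypothetical names, the shell script is a stand-in for Terraform, and SIGTERM is used for the demo where a real implementation would presumably forward whichever signal it received.

```go
package main

import (
	"bufio"
	"context"
	"fmt"
	"os/exec"
	"syscall"
	"time"
)

// script is a shell stand-in for Terraform (assumption): it traps SIGTERM,
// prints "cleanup" in place of the lock release, and exits with status 1.
const script = `trap 'echo cleanup; exit 1' TERM; echo ready; while :; do sleep 1; done`

// demoGraceful (hypothetical helper) cancels the context once the child is
// ready and returns the child's output lines plus the error from Wait.
func demoGraceful() ([]string, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	cmd := exec.CommandContext(ctx, "sh", "-c", script)
	// On cancellation, ask the child to stop instead of killing it outright.
	cmd.Cancel = func() error { return cmd.Process.Signal(syscall.SIGTERM) }
	// Escalate to a kill only if the child is still running after this delay.
	cmd.WaitDelay = 10 * time.Second

	stdout, err := cmd.StdoutPipe()
	if err != nil {
		return nil, err
	}
	if err := cmd.Start(); err != nil {
		return nil, err
	}

	var lines []string
	scanner := bufio.NewScanner(stdout)
	for scanner.Scan() {
		line := scanner.Text()
		lines = append(lines, line)
		if line == "ready" {
			cancel() // the trap is installed; simulate the interrupt
		}
	}

	// All reads from the pipe are finished, so it is safe to call Wait.
	waitErr := cmd.Wait()
	if err := scanner.Err(); err != nil {
		return lines, err
	}
	return lines, waitErr
}

func main() {
	lines, err := demoGraceful()
	fmt.Println(lines, err)
}
```

Here the child gets to run its trap handler and exits with its own status, which mirrors the `exit status 1` seen in the v0.85.1 log. The 10 second `WaitDelay` is an arbitrary value chosen for the sketch; a suitable timeout for a remote lock release would need to be chosen separately.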
References
- Root cause: internal/os/exec/cmd.go line 31
- Usage: shell/run_cmd.go line 115
- Test changes: shell/run_cmd_unix_test.go and shell/run_cmd_windows_test.go