Ctrl-C on Terraform's apply prompt fails to release S3 lock #5167

@ordenull

Describe the bug

Starting with Terragrunt v0.86.0, S3 state locks are not released when the user presses Ctrl-C at the terraform apply confirmation prompt. The Terraform process is killed by a signal before it can finish its lock cleanup, leaving orphaned .tflock files in S3.

Bisection Results

Through systematic testing, we identified that this regression was introduced in v0.86.0:

v0.72.2: ✅ Working
v0.82.0: ✅ Working
v0.85.0: ✅ Working
v0.85.1: ✅ Working
v0.86.0: ❌ BROKEN (lock not released)
v0.87.0: ❌ Broken
v0.92.1: ❌ Broken

Evidence from Trace Logs

Terragrunt v0.85.1 (Working):

2025-12-02T21:45:09.485-0500 [INFO]  backend-s3: Attempting to unlock remote state (S3 Native only)...  
2025-12-02T21:45:09.597-0500 [DEBUG] backend-s3: HTTP Request Sent: ... rpc.method=DeleteObject ...  
2025-12-02T21:45:09.645-0500 [DEBUG] backend-s3: HTTP Response Received: ... http.status_code=204 ...  
2025-12-02T21:45:09.645-0500 [DEBUG] backend-s3: Deleted lock file: ...  
2025-12-02T21:45:09.645-0500 [INFO]  backend-s3: Unlocked remote state (S3 Native only): ...  
exit status 1  

Terragrunt v0.86.0 (Broken):

2025-12-02T21:41:07.733-0500 [INFO]  backend-s3: Attempting to unlock remote state (S3 Native only)...  
signal: killed  

The logs clearly show that in v0.86.0, Terraform begins the unlock process but receives signal: killed before it can send the DeleteObject request to S3.

Reproducing bugs

Reproduced on versions v0.86.0 through v0.92.1. I wasn't able to test newer versions because of a breaking change in the configs.

Configuration Details

Backend Configuration (generated by Terragrunt):

terraform {  
  backend "s3" {  
    bucket               = "xxxxxxxxxxxxxxxxxxx"  
    encrypt              = true  
    key                  = "xxxxxxxxxxxxxxxxxxx.tfstate"  
    profile              = "xxxxxxxxxxxxxxxxxxx"  
    region               = "us-east-1"  
    use_lockfile         = true  
    workspace_key_prefix = null  
  }  
}  

Steps To Reproduce

  1. Configure S3 backend with use_lockfile = true
  2. Use Terragrunt v0.86.0 or higher
  3. Run terragrunt apply
  4. Let the plan complete fully
  5. Press Ctrl-C at the confirmation prompt
  6. Observe that the .tflock file remains in S3
  7. Run terragrunt apply again - it will fail with a lock error

Expected behavior

The S3 lock file should be automatically removed when:

  • Interrupting the operation with Ctrl-C
  • Any other graceful termination of the Terraform process

Actual Behavior

The .tflock file remains in S3 and must be manually removed using:

aws s3 rm s3://bucket-name/path/to/terraform.tfstate.tflock

or:

terraform force-unlock <LOCK_ID>

Additional context (AI Generated)

Root Cause

The issue was introduced by changes in internal/os/exec/cmd.go where the Command() function signature was changed from:

// v0.85.1  
func Command(name string, args ...string) *Cmd  

to:

// v0.86.0  
func Command(ctx context.Context, name string, args ...string) *Cmd  

This change switched from exec.Command() to exec.CommandContext(). When the user interrupts the operation with Ctrl-C, the context is cancelled. By default, exec.CommandContext() then kills the subprocess with os.Process.Kill (SIGKILL on Unix) as soon as the context is done, so Terraform is terminated before it can complete its cleanup routine.

Additional Evidence from Code Diff

The test files were also updated to accept "signal: killed" as a valid outcome, which suggests the Terragrunt team was aware this change could cause processes to be killed by signal:

From shell/run_cmd_unix_test.go:

// v0.85.1  
expectedErr := fmt.Sprintf("Failed to execute \"%s 5\" in .\n\nexit status %d", cmdPath, expectedWait)  
assert.EqualError(t, actualErr, expectedErr)
  
// v0.86.0  
expectedExitStatusErr := fmt.Sprintf("Failed to execute \"%s 5\" in .\n\nexit status %d", cmdPath, expectedWait)  
expectedKilledErr := fmt.Sprintf("Failed to execute \"%s 5\" in .\n\nsignal: killed", cmdPath)  
// Now accepts EITHER error  

Impact

This is a critical regression that affects all users of Terragrunt v0.86.0+ who:

  • Use S3 backend with use_lockfile = true
  • Ever need to interrupt a Terraform apply operation with Ctrl-C
  • Work in team environments where orphaned locks can block other users

Every interrupted operation requires manual intervention to clean up the orphaned lock file.

Proposed Fix

Terragrunt should handle subprocess termination gracefully by:

  1. Using a separate context for subprocess execution that isn't immediately cancelled on interrupt
  2. Forwarding SIGINT/SIGTERM to the Terraform process and waiting for it to exit cleanly
  3. Only sending SIGKILL if Terraform doesn't exit after a reasonable timeout period
  4. Ensuring the cleanup phase completes before terminating the subprocess

The current implementation ties the subprocess lifecycle directly to the parent context cancellation, which prevents Terraform from performing necessary cleanup operations.

References

  • Root cause: internal/os/exec/cmd.go line 31
  • Usage: shell/run_cmd.go line 115
  • Test changes: shell/run_cmd_unix_test.go and shell/run_cmd_windows_test.go

Labels

bug (Something isn't working)
