
fix: prevent JIT source TTL from wiping varfiles/backend mid-execution#2253

Draft
Copilot wants to merge 2 commits into main from
copilot/fix-jit-source-provisioning-ttl-expiry

Conversation


Copilot AI commented Mar 26, 2026

AutoProvisionSource is called twice per command invocation — once directly from resolveAndProvisionComponentPath, and again via the before.terraform.init hook in prepareInitExecution. With ttl: "0s", the second call treats the workdir as always-expired, invokes os.RemoveAll(targetDir), and wipes the varfiles and backend configs written between the two calls. The subprocess then fails with "file does not exist".
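
The failure sequence described above can be sketched in a few lines of Go. This is a simplified illustration, not the Atmos code: `provision` and `varfileSurvives` are hypothetical names, and `provision` assumes a zero TTL so the workdir always counts as expired.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// provision is a hypothetical stand-in for the real provisioner. It assumes a
// zero TTL, so the workdir always counts as expired and is removed and recreated.
func provision(targetDir string) error {
	if err := os.RemoveAll(targetDir); err != nil {
		return err
	}
	return os.MkdirAll(targetDir, 0o755)
}

// varfileSurvives runs the two-call sequence and reports whether the varfile
// written between the two calls still exists afterwards.
func varfileSurvives() (bool, error) {
	root, err := os.MkdirTemp("", "atmos-sketch-")
	if err != nil {
		return false, err
	}
	defer os.RemoveAll(root)

	workdir := filepath.Join(root, "workdir")
	if err := provision(workdir); err != nil { // first call (direct)
		return false, err
	}
	varfile := filepath.Join(workdir, "component.tfvars.json")
	if err := os.WriteFile(varfile, []byte("{}"), 0o644); err != nil {
		return false, err
	}
	if err := provision(workdir); err != nil { // second call (via the hook)
		return false, err
	}
	_, err = os.Stat(varfile)
	if errors.Is(err, fs.ErrNotExist) {
		return false, nil // the second call wiped the varfile
	}
	return err == nil, err
}

func main() {
	ok, err := varfileSurvives()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("varfile survives second provision:", ok)
}
```

In this sketch the second `provision` call removes the file written after the first one, which mirrors the "file does not exist" failure seen by the subprocess.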

Changes

  • pkg/provisioner/source/provision_hook.go — adds an in-memory idempotency guard (invocationDoneKey = "_atmos_source_provisioned") to AutoProvisionSource. A named-return defer sets the marker in componentConfig on successful return. Any subsequent call with the same map (same in-memory invocation) short-circuits immediately. The guard is scoped to the per-invocation componentConfig; separate atmos runs are unaffected.

  • pkg/provisioner/source/provision_hook_test.go — two regression tests:

    • TestAutoProvisionSource_InvocationGuard_PreventsDoubleProvisioning: asserts the guard short-circuits a second call even with ttl: "0s"
    • TestAutoProvisionSource_InvocationGuard_SetAfterProvisioning: asserts the marker is written to componentConfig after a skipped provision (TTL not expired), ensuring the hook path is a no-op
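
As a rough illustration of the guard described in the list above, the sketch below shows the named-return defer pattern on a plain map. Only the key value `_atmos_source_provisioned` comes from this PR; `autoProvision`, `provisionCount`, and the map shape are assumptions, and the real `AutoProvisionSource` signature is not shown here.

```go
package main

import "fmt"

// invocationDoneKey uses the marker value given in the PR description.
const invocationDoneKey = "_atmos_source_provisioned"

// provisionCount counts how often the (hypothetical) provisioning work runs.
var provisionCount int

// autoProvision is a simplified stand-in for AutoProvisionSource.
func autoProvision(componentConfig map[string]any) (err error) {
	// Guard: an earlier call in this invocation already completed.
	if done, _ := componentConfig[invocationDoneKey].(bool); done {
		return nil
	}
	// Set the marker on every successful return path (provisioned or skipped).
	defer func() {
		if err == nil {
			componentConfig[invocationDoneKey] = true
		}
	}()

	provisionCount++ // placeholder for the TTL check and vendoring
	return nil
}

func main() {
	cfg := map[string]any{"source": map[string]any{"ttl": "0s"}}
	_ = autoProvision(cfg) // direct call
	_ = autoProvision(cfg) // hook call, short-circuits on the marker
	fmt.Println(provisionCount, cfg[invocationDoneKey])
}
```

Because the marker lives in the map passed to the function, a fresh map (as in a separate run) is not affected, which matches the per-invocation scoping described above.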
Original prompt

This section details the original issue you should resolve

<issue_title># Bug: JIT source provisioning TTL expiry deletes varfiles/backend, then runs tofu, causing error</issue_title>
<issue_description>### Describe the Bug

When using Just-In-Time (JIT) source provisioning, the source.ttl cleanup runs concurrently with — or before — the tofu subprocess, not after it completes. If the TTL expires at any point while tofu init, tofu plan, or any other tofu command is executing, Atmos deletes the varfiles and backend configuration out from under the running process.

The most reliable way to trigger this is ttl: "0s", which expires immediately and causes a deterministic failure every time. However, any positive TTL short enough to expire before the tofu subprocess finishes (e.g. "30s" on a slow network or large module download) will produce the same failure.

The result is a hard failure from tofu because the generated varfile (and/or backend file) no longer exists on disk:

Error: Failed to read variables file
│
│ Given variables file /tmp/atmos-workdir-*/component.tfvars.json does not exist.

Expected Behavior

The TTL cleanup should be scoped to between invocations, not during one. Provisioned files should never be deleted while the subprocess that depends on them is still running. Specifically:

  • TTL expiry should only be evaluated before provisioning (stale cache check), not during or after subprocess execution.
  • The provisioned workdir should be treated as a lock for the duration of the current command — held open until the subprocess exits, then subject to TTL-based cleanup on the next invocation.

A source.ttl: "0s" is the degenerate case that makes this deterministic, but the fix must cover all TTL values.
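
One way to read the expectation above is that staleness is decided once, from the age of the workdir at the start of the command, and never re-evaluated while the subprocess runs. A minimal sketch of such a check follows; `isExpired` is a hypothetical helper and Atmos's actual TTL logic may differ.

```go
package main

import (
	"fmt"
	"time"
)

// isExpired reports whether a workdir of the given age is stale for the TTL
// string. Hypothetical helper, shown only to illustrate the stale-cache check.
func isExpired(ttl string, age time.Duration) (bool, error) {
	d, err := time.ParseDuration(ttl)
	if err != nil {
		return false, fmt.Errorf("invalid ttl %q: %w", ttl, err)
	}
	// A zero TTL makes every age (including zero) count as expired.
	return age >= d, nil
}

func main() {
	// Decide once, before provisioning; the result is reused for the whole
	// command rather than recomputed after the varfiles are written.
	for _, ttl := range []string{"0s", "30s"} {
		stale, err := isExpired(ttl, 10*time.Second)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("ttl=%s age=10s stale=%v\n", ttl, stale)
	}
}
```

With this shape, `"0s"` is always stale at the single check point, while `"30s"` only becomes stale once the workdir is at least 30 seconds old.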


Actual Behavior

Atmos generates the varfiles and backend, the TTL of 0s immediately expires them, Atmos wipes them, and tofu fails:

│ Error: Failed to read variables file
│ 
│ Given variables file demo-null-label.terraform.tfvars.json does not exist.

Steps to Reproduce

The script below is fully self-contained. It requires only atmos and tofu on PATH and network access to GitHub. Save it as repro.sh and run it.

#!/usr/bin/env bash
# ============================================================
# REPRO: JIT ttl:"0s" deletes varfiles before tofu can read them
# ============================================================

set -euo pipefail

WORKDIR="$(mktemp -d -t atmos-repro-XXXXXX)"
echo "Working in: ${WORKDIR}"
cd "${WORKDIR}"

# --- 1) atmos.yaml ---
cat <<'EOF' > atmos.yaml
base_path: "."

components:
  terraform:
    base_path: "components/terraform"
    command: "tofu"
    workspaces_enabled: true
    apply_auto_approve: false
    deploy_run_init: true
    init_run_reconfigure: true
    auto_generate_backend_file: true

stacks:
  name_template: "{{ .vars.name }}"
  base_path: "stacks"
  included_paths:
    - "**/*"
EOF

# --- 2) Stack with ttl: "0s" on the JIT source ---
mkdir -p stacks
cat <<'EOF' > stacks/demo.yaml
vars:
  name: demo

terraform:
  backend_type: local

components:
  terraform:
    null-label:
      vars:
        # terraform-null-label variables
        namespace: "eg"
        stage: "test"
        name: "demo"
        enabled: true
      source:
        uri: "git::https://github.com/cloudposse/terraform-null-label.git"
        version: "0.25.0"
        ttl: "0s"    # <-- triggers the bug: files are wiped before tofu reads them
      provision:
        workdir:
          enabled: true
EOF

echo
echo "== tree =="
find . -maxdepth 4 -type f -print | sed 's|^\./||'

echo
echo "== discovered stacks =="
atmos describe stacks

echo
echo "== describe component =="
atmos describe component null-label -s demo

echo
echo "== init (this is where the failure occurs with ttl:0s) =="
atmos terraform init null-label -s demo

echo
echo "== plan =="
atmos terraform plan null-label -s demo

echo "Done. Workspace preserved at: ${WORKDIR}"

Run:

bash repro.sh 2>&1 | tee repro.log

Screenshots

No response

Environment

Atmos 1.212.0 on darwin/arm64

Additional Context

No response</issue_description>

Comments on the Issue (you are @copilot in this section)



@mergify mergify bot added the triage (Needs triage) and wip (Work in Progress: Not ready for final review or merge) labels Mar 26, 2026
…process execution

When AutoProvisionSource is called twice within the same command invocation
(once from resolveAndProvisionComponentPath and once via the before.terraform.init
hook in prepareInitExecution), a zero TTL would cause VendorSource to call
os.RemoveAll on the workdir, deleting varfiles and backend configs written
between the two calls.

Fix: add an invocationDoneKey guard that marks componentConfig after the first
call completes. Any subsequent call with the same componentConfig returns
immediately without re-provisioning, preserving the workdir contents for the
subprocess.

The defer-based marker ensures the guard is set for both the "provisioned" and
"skipped" paths without duplicating code.

Co-authored-by: osterman <52489+osterman@users.noreply.github.com>
Agent-Logs-Url: https://github.com/cloudposse/atmos/sessions/2c2dec41-5920-40f9-9420-137f7970ff42
Copilot AI changed the title from "[WIP] Fix JIT source provisioning TTL expiry causing varfile deletion" to "fix: prevent JIT source TTL from wiping varfiles/backend mid-execution" Mar 26, 2026
Copilot AI requested a review from osterman March 26, 2026 01:00
@mergify mergify bot removed the wip (Work in Progress: Not ready for final review or merge) label Mar 26, 2026
@osterman osterman added the patch (A minor, backward compatible change) label Mar 26, 2026

Labels

patch A minor, backward compatible change triage Needs triage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

# Bug: JIT source provisioning TTL expiry deletes varfiles/backend, then runs tofu, causing error

2 participants