fix: prevent IRSA credentials from overriding Atmos-managed credentials on EKS pods #2143

Merged: aknysh merged 13 commits into main from osterman/auth-web-identity-irsa on Mar 26, 2026

Conversation

@osterman osterman commented Mar 4, 2026

what

  • Prevent IRSA/pod-injected AWS env vars from overriding Atmos-managed credentials in subprocess execution
  • Pass os.Environ() through PrepareShellEnvironment to sanitize it (delete problematic vars), then pass the sanitized env to subprocess via WithBaseEnv — avoiding re-reading os.Environ() which would reintroduce IRSA vars
  • Add SanitizedBaseEnv field to ConfigAndStacksInfo to carry sanitized environment through the hooks→terraform/helmfile/packer pipeline
  • Add WithBaseEnv variadic option to ExecuteShellCommand for backward-compatible sanitized env injection
  • Fix auth exec and auth shell to use sanitized env directly instead of re-reading os.Environ()

why

On EKS pods with IRSA (IAM Roles for Service Accounts), the pod identity webhook injects AWS_WEB_IDENTITY_TOKEN_FILE, AWS_ROLE_ARN, and AWS_ROLE_SESSION_NAME into the pod environment. When using Atmos auth on ARC (Actions Runner Controller), these IRSA vars leaked into terraform subprocesses because three code paths re-read os.Environ() after auth sanitization:

  1. Hooks path (terraform/helmfile/packer): authenticateAndWriteEnv only passed ComponentEnvSection (stack YAML vars) to PrepareShellEnvironment — IRSA vars weren't in the input so delete() was a no-op. Then ExecuteShellCommand re-read os.Environ() as the base.
  2. auth exec: executeCommandWithEnv re-read os.Environ() to build subprocess env.
  3. auth shell: ExecAuthShellCommand → MergeSystemEnvSimpleWithGlobal re-read os.Environ().

The AWS SDK credential chain gives web identity tokens higher precedence than shared credential files, so the pod's runner role was used instead of the Atmos-managed tfplan role, causing AccessDenied errors.

Approach

Instead of setting cleared vars to empty string (which pollutes the subprocess env), we pass a clean, sanitized environment:

  1. authenticateAndWriteEnv now passes os.Environ() + ComponentEnvSection to PrepareShellEnvironment, which deletes problematic keys
  2. The sanitized result is stored as SanitizedBaseEnv on ConfigAndStacksInfo
  3. ExecuteShellCommand accepts WithBaseEnv(info.SanitizedBaseEnv) to use the sanitized env instead of re-reading os.Environ()
  4. auth exec and auth shell pass sanitized env directly to subprocess, bypassing the re-read
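
The steps above can be sketched in Go roughly as follows. This is a minimal illustration, not the Atmos implementation: `sanitizeEnv`, `shellConfig`, `Option`, and `buildEnv` are hypothetical names, the `WithBaseEnv` signature is assumed from the description, and the sample values (including `AWS_PROFILE=tfplan` and the ARN) are made up.

```go
package main

import (
	"fmt"
	"strings"
)

// problematicKeys lists the IRSA variables named in this PR description.
var problematicKeys = map[string]bool{
	"AWS_WEB_IDENTITY_TOKEN_FILE": true,
	"AWS_ROLE_ARN":                true,
	"AWS_ROLE_SESSION_NAME":       true,
}

// sanitizeEnv (hypothetical) returns a copy of environ without the problematic keys.
func sanitizeEnv(environ []string) []string {
	out := make([]string, 0, len(environ))
	for _, entry := range environ {
		key, _, _ := strings.Cut(entry, "=")
		if problematicKeys[key] {
			continue // drop the entry entirely instead of blanking it
		}
		out = append(out, entry)
	}
	return out
}

// shellConfig and Option are hypothetical stand-ins for the exec plumbing.
type shellConfig struct {
	baseEnv []string
}

type Option func(*shellConfig)

// WithBaseEnv mirrors the option named in the description; the signature is assumed.
func WithBaseEnv(env []string) Option {
	return func(c *shellConfig) { c.baseEnv = env }
}

// buildEnv resolves the environment a subprocess would receive.
// fallback stands in for os.Environ(); it is used only when no base env is supplied.
func buildEnv(fallback, extra []string, opts ...Option) []string {
	cfg := shellConfig{}
	for _, opt := range opts {
		opt(&cfg)
	}
	base := fallback
	if cfg.baseEnv != nil {
		base = cfg.baseEnv
	}
	env := append([]string{}, base...)
	return append(env, extra...)
}

func main() {
	podEnv := []string{"PATH=/usr/bin", "AWS_ROLE_ARN=arn:aws:iam::111111111111:role/runner"}
	clean := sanitizeEnv(podEnv)
	fmt.Println(buildEnv(podEnv, []string{"AWS_PROFILE=tfplan"}))                     // leaky path
	fmt.Println(buildEnv(podEnv, []string{"AWS_PROFILE=tfplan"}, WithBaseEnv(clean))) // sanitized path
}
```

Without the option, the fallback is used and `AWS_ROLE_ARN` reaches the child; with `WithBaseEnv(clean)` it does not. Making the option variadic means existing call sites keep compiling unchanged, which matches the backward compatibility goal stated in the description.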

references

Fixes credential precedence conflict where IRSA vars override Atmos-managed credentials on EKS pods running ARC (DEV-4216)

Summary by CodeRabbit

  • Bug Fixes

    • Prevented AWS IRSA env vars from leaking into subprocesses by sanitizing auth-related variables (overridden with empty values) so spawned commands use Atmos credentials.
    • Ensured credential-chain caching no longer skips the final role, forcing proper re-authentication when needed.
  • Refactor

    • Preserve and propagate a sanitized environment end-to-end for shell/exec paths so child processes receive the corrected env list.
  • Tests

    • Updated and added tests to validate env sanitization and subprocess propagation.
  • Documentation

    • Added guidance describing the credential-chain caching fix and expected behavior.

…ls on EKS pods

On EKS pods with IRSA (IAM Roles for Service Accounts), the pod identity webhook
injects AWS_WEB_IDENTITY_TOKEN_FILE, AWS_ROLE_ARN, and AWS_ROLE_SESSION_NAME. When
using Atmos auth on ARC (Actions Runner Controller), these IRSA vars were leaking
into the terraform subprocess because PrepareEnvironment only cleared vars from
ComponentEnvSection (stack YAML env vars), not from os.Environ() where the pod
vars live. AWS SDK credential chain gives web identity tokens higher precedence
than shared credential files, so the pod's runner role was used instead of the
Atmos-managed tfplan role.

## Changes

1. Add IRSA vars to problematicAWSEnvVars so they're cleared during the auth flow itself
2. Change PrepareEnvironment to set cleared vars to empty string (not delete) so they
   appear in ComponentEnvList and override inherited IRSA values in the subprocess
3. Update tests to expect empty strings (which override os.Environ()) instead of absent keys
4. Add TestPrepareEnvironment_IRSALeakPrevention to reproduce the full ARC/IRSA scenario

## How it works

When subprocess env is built as os.Environ() + ComponentEnvList, Go's exec.Cmd
respects the last occurrence of each key. Setting IRSA vars to empty string in
ComponentEnvList ensures they override the pod's injected values.
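
A small sketch of the last-occurrence rule this relies on. `lastWins` is a hypothetical helper that models the documented behavior of `exec.Cmd.Env` for duplicate keys; it is not code from `os/exec` or from Atmos, and the ARN is a placeholder.

```go
package main

import (
	"fmt"
	"strings"
)

// lastWins models the rule that, for duplicate keys, only the last value
// in the slice is used. Illustration only.
func lastWins(env []string) map[string]string {
	out := make(map[string]string, len(env))
	for _, entry := range env {
		key, value, ok := strings.Cut(entry, "=")
		if !ok {
			continue // skip malformed entries without '='
		}
		out[key] = value // later entries overwrite earlier ones
	}
	return out
}

func main() {
	inherited := []string{
		"AWS_ROLE_ARN=arn:aws:iam::111111111111:role/runner",
		"HOME=/home/runner",
	}
	componentEnvList := []string{"AWS_ROLE_ARN="}

	merged := lastWins(append(append([]string{}, inherited...), componentEnvList...))
	fmt.Printf("AWS_ROLE_ARN=%q HOME=%q\n", merged["AWS_ROLE_ARN"], merged["HOME"])
}
```

Note that the key is still present in the result, only with an empty value, rather than being absent.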

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
@osterman osterman requested a review from a team as a code owner March 4, 2026 23:40
@github-actions github-actions bot added the size/m Medium size PR label Mar 4, 2026
github-actions bot commented Mar 4, 2026

Dependency Review

✅ No vulnerabilities or license issues found.

Snapshot Warnings

⚠️: No snapshots were found for the head SHA 908093a.
Ensure that dependencies are being submitted on PR branches and consider enabling retry-on-snapshot-warnings. See the documentation for more information and troubleshooting advice.

Scanned Files

None

@osterman osterman added the patch A minor, backward compatible change label Mar 4, 2026
@goruha goruha added the release/feature Create release from this PR label Mar 4, 2026
@goruha goruha temporarily deployed to feature-releases March 4, 2026 23:48 — with GitHub Actions Inactive
@goruha goruha marked this pull request as draft March 4, 2026 23:48
@goruha goruha had a problem deploying to feature-releases March 4, 2026 23:48 — with GitHub Actions Failure
coderabbitai bot commented Mar 4, 2026

Walkthrough

Adds IRSA-related AWS vars to the problematic list and switches PrepareEnvironment to set problematic AWS env vars to empty strings. Propagates a pre-sanitized []string environment through exec plumbing so subprocesses receive the sanitized env unchanged. Tests updated.

Changes

| Cohort / File(s) | Summary |
|---|---|
| AWS IRSA Credential Isolation: pkg/auth/cloud/aws/env.go | Adds AWS_WEB_IDENTITY_TOKEN_FILE, AWS_ROLE_ARN, AWS_ROLE_SESSION_NAME to problematic AWS vars and changes cleanup to set them to "" instead of deleting keys. |
| AWS env tests: pkg/auth/cloud/aws/env_test.go, pkg/auth/cloud/aws/setup_test.go | Adds IRSA-leak prevention test and updates expectations to assert empty-string overrides for AWS credential and IRSA vars instead of removal. |
| Auth exec / shell plumbing: cmd/auth_exec.go, cmd/auth_shell.go, cmd/auth_exec_test.go | Passes sanitized []string env through auth execution instead of converting to map[string]string; tests updated to use []string subprocess patterns. |
| Internal exec API & callers: internal/exec/shell_utils.go, internal/exec/shell_utils_test.go, internal/exec/... | Introduces WithEnvironment([]string) and processEnv []string; updates ExecuteShellCommand/ExecAuthShellCommand to accept/use sanitized env; forwards info.SanitizedEnv from callers. |
| Command pipeline changes: internal/exec/terraform_execute_helpers.go, internal/exec/terraform_execute_helpers_exec.go | Propagates variadic ShellCommandOption through init/workspace setup, forwards options into ExecuteShellCommand, adds workspace fallback (workspace new) logic. |
| Exec callers minor updates: internal/exec/helmfile.go, internal/exec/packer.go, internal/exec/terraform.go | Pass WithEnvironment(info.SanitizedEnv) into final ExecuteShellCommand calls. |
| Test runtime helper: cmd/testing_main_test.go | Adds early-exit test branch (_ATMOS_TEST_EXIT_ONE) used by subprocess tests. |
| Docs: docs/fixes/2026-03-23-auth-credential-chain-skipping-assume-role.md | New documentation describing credential-chain caching bug and fix (doc-only). |
| Manifest: go.mod | Minor manifest edits. |

Sequence Diagram(s)

sequenceDiagram
    rect rgba(0,128,0,0.5)
    participant User
    end
    rect rgba(0,0,255,0.5)
    participant Atmos
    end
    rect rgba(255,165,0,0.5)
    participant AuthMgr
    end
    rect rgba(128,0,128,0.5)
    participant Exec
    end
    rect rgba(255,0,0,0.5)
    participant Subprocess
    end

    User->>Atmos: invoke auth/shell/terraform command
    Atmos->>AuthMgr: PrepareEnvironment() -> sanitizedEnv ([]string)
    AuthMgr-->>Atmos: return sanitizedEnv
    Atmos->>Exec: ExecuteShellCommand / ExecAuthShellCommand(with sanitizedEnv)
    Exec->>Subprocess: start subprocess with Env = sanitizedEnv + ATMOS_* mutations
    Subprocess-->>User: run tool/shell with sanitized environment

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested reviewers

  • aknysh
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main fix: preventing IRSA credentials from overriding Atmos-managed credentials on EKS pods, which is central to the changeset. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 80.95% which is sufficient. The required threshold is 80.00%. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (1)
pkg/auth/cloud/aws/env_test.go (1)

634-641: Replace custom indexOf with strings.IndexByte.

The indexOf function duplicates strings.IndexByte — no need for a custom helper. Your Go version (1.26) supports the for i := range len(s) syntax without issue.

Replace the function and its usage:

  • Line 612: change indexOf(entry, '=') to strings.IndexByte(entry, '=')
  • Lines 634-641: remove the custom function
  • Add "strings" to imports if needed
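
A minimal sketch of the suggested replacement, assuming the test only needs to split a `KEY=VALUE` entry at the first `=`. `splitEnvEntry` is a hypothetical helper for illustration; `strings.IndexByte` is the standard library call the review points to.

```go
package main

import (
	"fmt"
	"strings"
)

// splitEnvEntry (hypothetical) splits "KEY=VALUE" at the first '='.
// ok is false when the entry contains no '='.
func splitEnvEntry(entry string) (key, value string, ok bool) {
	i := strings.IndexByte(entry, '=')
	if i < 0 {
		return "", "", false
	}
	return entry[:i], entry[i+1:], true
}

func main() {
	k, v, ok := splitEnvEntry("AWS_ROLE_ARN=")
	fmt.Printf("%q %q %v\n", k, v, ok)
}
```
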
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/auth/cloud/aws/env_test.go` around lines 634 - 641, The custom indexOf
function duplicates standard library functionality; replace calls to
indexOf(entry, '=') with strings.IndexByte(entry, '=') and delete the indexOf
function definition (function name: indexOf). Ensure the "strings" package is
imported (add to imports if missing) and remove the now-unused indexOf symbol so
the test file builds cleanly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 964c0740-0aca-46ba-9c1d-f8be609d9e89

📥 Commits

Reviewing files that changed from the base of the PR and between 6d1f475 and a0ec9e6.

📒 Files selected for processing (3)
  • pkg/auth/cloud/aws/env.go
  • pkg/auth/cloud/aws/env_test.go
  • pkg/auth/cloud/aws/setup_test.go

coderabbitai bot previously approved these changes Mar 4, 2026
@goruha goruha temporarily deployed to feature-releases March 5, 2026 00:09 — with GitHub Actions Inactive
@goruha goruha temporarily deployed to feature-releases March 5, 2026 00:46 — with GitHub Actions Inactive
github-actions bot commented Mar 5, 2026

These changes were released in v1.208.1-test.0.

codecov bot commented Mar 5, 2026

Codecov Report

❌ Patch coverage is 86.53846% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 77.22%. Comparing base (ce67b78) to head (908093a).
⚠️ Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| internal/exec/shell_utils.go | 71.42% | 2 Missing and 2 partials ⚠️ |
| internal/exec/helmfile.go | 0.00% | 1 Missing ⚠️ |
| internal/exec/terraform_execute_helpers_exec.go | 85.71% | 0 Missing and 1 partial ⚠️ |
| pkg/list/utils/utils.go | 50.00% | 1 Missing ⚠️ |
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #2143      +/-   ##
==========================================
+ Coverage   77.19%   77.22%   +0.03%     
==========================================
  Files        1015     1015              
  Lines       96065    96087      +22     
==========================================
+ Hits        74158    74204      +46     
+ Misses      17717    17696      -21     
+ Partials     4190     4187       -3     
| Flag | Coverage Δ |
|---|---|
| unittests | 77.22% <86.53%> (+0.03%) ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Files with missing lines | Coverage Δ |
|---|---|
| cmd/auth_exec.go | 84.00% <100.00%> (-1.33%) ⬇️ |
| cmd/auth_shell.go | 57.30% <100.00%> (-2.70%) ⬇️ |
| internal/exec/packer.go | 63.75% <100.00%> (+0.24%) ⬆️ |
| internal/exec/terraform.go | 79.48% <100.00%> (+0.26%) ⬆️ |
| internal/exec/terraform_execute_helpers.go | 75.86% <100.00%> (+0.09%) ⬆️ |
| pkg/auth/cloud/aws/env.go | 100.00% <ø> (ø) |
| pkg/auth/hooks.go | 83.84% <100.00%> (+2.10%) ⬆️ |
| pkg/auth/manager_chain.go | 89.04% <100.00%> (-0.55%) ⬇️ |
| pkg/schema/schema.go | 87.70% <ø> (ø) |
| internal/exec/helmfile.go | 9.52% <0.00%> (-0.05%) ⬇️ |
| ... and 3 more | |

... and 6 files with indirect coverage changes

…trings

Replace the empty-string override approach with a clean, sanitized environment.
Pass os.Environ() through PrepareShellEnvironment (which deletes problematic
IRSA/credential vars), store the result as SanitizedEnv, and pass it to
subprocess execution via WithEnvironment — preventing os.Environ() re-reads
that reintroduce pod-injected IRSA vars on EKS.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
github-actions bot commented Mar 5, 2026

These changes were released in v1.208.1-test.1.

…redentials are cached

When findFirstValidCachedCredentials() found valid cached credentials at the
last identity in the chain (e.g., index 1 in [provider, assume-role]),
fetchCachedCredentials advanced startIndex past the end of the chain, causing
authenticateIdentityChain's loop to never execute. This returned stale cached
credentials without performing the actual AssumeRole API call.

In GitHub Actions on EKS runners, this caused Terraform to use the runner's
pod credentials instead of the Atmos-authenticated planner role, because the
credential file contained provider-level credentials that were never replaced
by a fresh AssumeRole call.

The fix skips cached credentials at the target (last) identity and continues
scanning earlier in the chain, ensuring the identity's Authenticate() method
is always called. This aligns with the existing comment: "CRITICAL: Always
re-authenticate through the full chain, even if the target identity has
cached credentials."
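
The off-by-one described above can be illustrated with a short sketch. The names `firstCachedIndex` and `stepsToRun`, the backward scan, and the `skipTarget` flag are assumptions made for this example; they are not the actual signatures of `findFirstValidCachedCredentials` or `fetchCachedCredentials`.

```go
package main

import "fmt"

// firstCachedIndex (hypothetical) scans the chain from the end toward the start
// and returns the index of the nearest identity with valid cached credentials,
// or -1 if none is found. With skipTarget set, the last identity is ignored.
func firstCachedIndex(chain []string, cached map[string]bool, skipTarget bool) int {
	start := len(chain) - 1
	if skipTarget {
		start-- // never resume from the target identity itself
	}
	for i := start; i >= 0; i-- {
		if cached[chain[i]] {
			return i
		}
	}
	return -1
}

// stepsToRun (hypothetical) returns the identities that still need Authenticate().
func stepsToRun(chain []string, cached map[string]bool, skipTarget bool) []string {
	next := firstCachedIndex(chain, cached, skipTarget) + 1
	return chain[next:]
}

func main() {
	chain := []string{"provider", "assume-role"}
	cached := map[string]bool{"provider": true, "assume-role": true}

	// Old behavior: the start index lands past the end, so nothing runs.
	fmt.Println(len(stepsToRun(chain, cached, false)))
	// Fixed behavior: the target identity is always re-authenticated.
	fmt.Println(stepsToRun(chain, cached, true))
}
```

With both entries cached, the first call leaves no steps to run, while the second still runs the `assume-role` step from the cached provider credentials.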

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
mergify bot commented Mar 24, 2026

💥 This pull request now has conflicts. Could you fix it @osterman? 🙏

@mergify mergify bot added the conflict This PR has conflicts label Mar 24, 2026
osterman and others added 3 commits March 23, 2026 21:15
Resolve conflicts in shell_utils.go, terraform.go, and workflow_utils.go:
- shell_utils.go: Merge ShellCommandOption (main's capture/override options) with
  WithEnvironment (branch's sanitized env option) into unified shellCommandConfig
- terraform.go: Use main's refactored executeCommandPipeline, passing
  WithEnvironment(info.SanitizedEnv) to ensure IRSA env var scrubbing
  propagates through init, workspace, and main command phases
- workflow_utils.go: Use main's retry.Do closure pattern

Extract createWorkspaceFallback to keep cyclomatic complexity within bounds.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Documents the bug where findFirstValidCachedCredentials() returns the
last chain index, causing fetchCachedCredentials to advance past the
chain end and skip the actual AssumeRole API call. Also documents the
relationship with PR #2143 (IRSA env var scrubbing).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@osterman osterman temporarily deployed to feature-releases March 24, 2026 03:07 — with GitHub Actions Inactive
@mergify mergify bot removed the conflict This PR has conflicts label Mar 24, 2026
@github-actions

These changes were released in v1.211.1-test.3.

@osterman osterman temporarily deployed to feature-releases March 24, 2026 21:48 — with GitHub Actions Inactive
…/arm64

Gomonkey binary patching crashes with SIGBUS on Apple Silicon (darwin/arm64)
with Go 1.26. Extract componentExistsInStacks() from CheckComponentExists()
and test the logic directly, removing the gomonkey dependency from this package.

Also fix pre-existing lint issues in describe_dependents.go (nestif, gocritic).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mergify bot commented Mar 25, 2026

💥 This pull request now has conflicts. Could you fix it @osterman? 🙏

@mergify mergify bot added the conflict This PR has conflicts label Mar 25, 2026
@coderabbitai coderabbitai bot left a comment

🧹 Nitpick comments (3)
pkg/list/utils/check_component_test.go (2)

9-13: Add at least one behavior-level test for CheckComponentExists (non-empty path).

Right now most coverage is on componentExistsInStacks. Consider adding a seam (DI/function var) so the exported function can be tested as the contract surface, not just its helper.

Based on learnings "Test behavior, not implementation. Never test stub functions. Avoid tautological tests. Make code testable via dependency injection."

Also applies to: 15-140

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/list/utils/check_component_test.go` around lines 9 - 13, The exported
CheckComponentExists currently only has an empty-name test; add a behavior-level
test for non-empty paths by introducing a seam (e.g., a function variable or
dependency injection) so you can stub componentExistsInStacks during tests and
assert CheckComponentExists returns true/false based on the stub; modify the
implementation to call an injectable function (keep original
componentExistsInStacks as default) and add a test that sets the injected
function to return both true and false to verify the exported contract via
CheckComponentExists (referencing CheckComponentExists and
componentExistsInStacks).

15-140: Consolidate these cases into a table-driven test.

The scenarios are solid, but they’re repetitive. A single table-driven test will be shorter and easier to extend.

As per coding guidelines "Use table-driven tests for testing multiple scenarios in Go".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/list/utils/check_component_test.go` around lines 15 - 140, The tests are
repetitive—replace the multiple TestComponentExistsInStacks_* functions with a
single table-driven test named TestComponentExistsInStacks that defines a slice
of test cases (fields: name, stacks map[string]any, component string, want
bool), include all existing scenarios (EmptyMap, InvalidStackData,
NoComponentsKey, InvalidComponentsType, InvalidComponentTypeMap,
ComponentNotFound, ComponentFound, ComponentFoundInSecondStack,
MixedValidInvalidStacks) as cases, iterate cases with t.Run(case.name, func(t
*testing.T) { result := componentExistsInStacks(case.stacks, case.component);
assert.Equal(t, case.want, result) }) and delete the old individual test
functions; keep referencing componentExistsInStacks to locate the logic to test.
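
A rough sketch of the table-driven shape being suggested. Everything here is an assumption for illustration: the body of `componentExistsInStacks` is guessed from the test-case names and the `stackMap["components"]` lookup quoted in this review, and the table is run from a plain function so the example is runnable on its own. In the repository it would be a `TestComponentExistsInStacks` that loops with `t.Run`.

```go
package main

import "fmt"

// componentExistsInStacks is a hypothetical reconstruction, not the Atmos source.
func componentExistsInStacks(stacks map[string]any, component string) bool {
	for _, stackData := range stacks {
		stackMap, ok := stackData.(map[string]any)
		if !ok {
			continue // invalid stack data
		}
		componentsMap, ok := stackMap["components"].(map[string]any)
		if !ok {
			continue // missing or invalid components section
		}
		for _, typeData := range componentsMap {
			typeMap, ok := typeData.(map[string]any)
			if !ok {
				continue // invalid component type map
			}
			if _, found := typeMap[component]; found {
				return true
			}
		}
	}
	return false
}

type testCase struct {
	name      string
	stacks    map[string]any
	component string
	want      bool
}

// cases holds a subset of the scenarios named in the review comment.
func cases() []testCase {
	vpc := map[string]any{
		"components": map[string]any{"terraform": map[string]any{"vpc": map[string]any{}}},
	}
	return []testCase{
		{"EmptyMap", map[string]any{}, "vpc", false},
		{"InvalidStackData", map[string]any{"dev": "not-a-map"}, "vpc", false},
		{"NoComponentsKey", map[string]any{"dev": map[string]any{}}, "vpc", false},
		{"ComponentNotFound", map[string]any{"dev": vpc}, "eks", false},
		{"ComponentFound", map[string]any{"dev": vpc}, "vpc", true},
		{"MixedValidInvalidStacks", map[string]any{"bad": 42, "dev": vpc}, "vpc", true},
	}
}

// failedCases runs every row and returns the names of rows whose result differs.
func failedCases() []string {
	var failed []string
	for _, tc := range cases() {
		if got := componentExistsInStacks(tc.stacks, tc.component); got != tc.want {
			failed = append(failed, tc.name)
		}
	}
	return failed
}

func main() {
	fmt.Println("failed cases:", failedCases())
}
```

Adding a scenario then means adding one row to the table rather than a new test function.
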
pkg/list/utils/utils.go (1)

39-39: Use shared constant for the components key.

Prefer cfg.ComponentsSectionName instead of the string literal "components" so this stays aligned with stack-shape producers in internal/exec.

♻️ Suggested tweak
 import (
+	cfg "github.com/cloudposse/atmos/pkg/config"
 	e "github.com/cloudposse/atmos/internal/exec"
 	"github.com/cloudposse/atmos/pkg/list/errors"
 	"github.com/cloudposse/atmos/pkg/schema"
 )
@@
-		componentsMap, ok := stackMap["components"].(map[string]interface{})
+		componentsMap, ok := stackMap[cfg.ComponentsSectionName].(map[string]interface{})
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/list/utils/utils.go` at line 39, Replace the hard-coded "components" key
lookup with the shared constant by using cfg.ComponentsSectionName when
extracting componentsMap from stackMap (the expression that currently reads
stackMap["components"].(map[string]interface{})); update the lookup where
componentsMap is assigned so it uses cfg.ComponentsSectionName to keep key usage
consistent with internal/exec producers.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 83d5a33c-7d4e-4c1b-9391-8c799f6581d3

📥 Commits

Reviewing files that changed from the base of the PR and between a46a976 and 280f1e4.

📒 Files selected for processing (3)
  • internal/exec/describe_dependents.go
  • pkg/list/utils/check_component_test.go
  • pkg/list/utils/utils.go

coderabbitai bot previously approved these changes Mar 25, 2026
…ntity-irsa

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@osterman osterman temporarily deployed to feature-releases March 25, 2026 15:33 — with GitHub Actions Inactive
@osterman osterman temporarily deployed to feature-releases March 25, 2026 15:34 — with GitHub Actions Inactive
@mergify mergify bot removed the conflict This PR has conflicts label Mar 25, 2026
@osterman osterman temporarily deployed to feature-releases March 25, 2026 16:15 — with GitHub Actions Inactive
@github-actions

These changes were released in v1.213.0-test.0.

@mergify mergify bot temporarily deployed to feature-releases March 25, 2026 17:08 Inactive
@mergify mergify bot temporarily deployed to feature-releases March 25, 2026 17:09 Inactive
@mergify mergify bot temporarily deployed to feature-releases March 25, 2026 17:47 Inactive
@github-actions

These changes were released in v1.213.0-test.3.

@aknysh aknysh temporarily deployed to feature-releases March 26, 2026 01:06 — with GitHub Actions Inactive
@aknysh aknysh temporarily deployed to feature-releases March 26, 2026 01:07 — with GitHub Actions Inactive
@aknysh aknysh temporarily deployed to feature-releases March 26, 2026 01:44 — with GitHub Actions Inactive
@github-actions

These changes were released in v1.213.0-test.4.

@aknysh aknysh merged commit dbcba35 into main Mar 26, 2026
64 of 65 checks passed
@aknysh aknysh deleted the osterman/auth-web-identity-irsa branch March 26, 2026 02:35
osterman added a commit that referenced this pull request Mar 26, 2026
…ls on EKS pods (#2143)

* fix: prevent IRSA credentials from overriding Atmos-managed credentials on EKS pods

On EKS pods with IRSA (IAM Roles for Service Accounts), the pod identity webhook
injects AWS_WEB_IDENTITY_TOKEN_FILE, AWS_ROLE_ARN, and AWS_ROLE_SESSION_NAME. When
using Atmos auth on ARC (Actions Runner Controller), these IRSA vars were leaking
into the terraform subprocess because PrepareEnvironment only cleared vars from
ComponentEnvSection (stack YAML env vars), not from os.Environ() where the pod
vars live. AWS SDK credential chain gives web identity tokens higher precedence
than shared credential files, so the pod's runner role was used instead of the
Atmos-managed tfplan role.

## Changes

1. Add IRSA vars to problematicAWSEnvVars so they're cleared during the auth flow itself
2. Change PrepareEnvironment to set cleared vars to empty string (not delete) so they
   appear in ComponentEnvList and override inherited IRSA values in the subprocess
3. Update tests to expect empty strings (which override os.Environ()) instead of absent keys
4. Add TestPrepareEnvironment_IRSALeakPrevention to reproduce the full ARC/IRSA scenario

## How it works

When subprocess env is built as os.Environ() + ComponentEnvList, Go's exec.Cmd
respects the last occurrence of each key. Setting IRSA vars to empty string in
ComponentEnvList ensures they override the pod's injected values.

Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>

* fix: scrub IRSA env vars via sanitized environment instead of empty strings

Replace the empty-string override approach with a clean, sanitized environment.
Pass os.Environ() through PrepareShellEnvironment (which deletes problematic
IRSA/credential vars), store the result as SanitizedEnv, and pass it to
subprocess execution via WithEnvironment — preventing os.Environ() re-reads
that reintroduce pod-injected IRSA vars on EKS.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: auth credential chain skipping AssumeRole when target identity credentials are cached

When findFirstValidCachedCredentials() found valid cached credentials at the
last identity in the chain (e.g., index 1 in [provider, assume-role]),
fetchCachedCredentials advanced startIndex past the end of the chain, causing
authenticateIdentityChain's loop to never execute. This returned stale cached
credentials without performing the actual AssumeRole API call.

In GitHub Actions on EKS runners, this caused Terraform to use the runner's
pod credentials instead of the Atmos-authenticated planner role, because the
credential file contained provider-level credentials that were never replaced
by a fresh AssumeRole call.

The fix skips cached credentials at the target (last) identity and continues
scanning earlier in the chain, ensuring the identity's Authenticate() method
is always called. This aligns with the existing comment: "CRITICAL: Always
re-authenticate through the full chain, even if the target identity has
cached credentials."

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* docs: add fix doc for auth credential chain skipping AssumeRole

Documents the bug where findFirstValidCachedCredentials() returns the
last chain index, causing fetchCachedCredentials to advance past the
chain end and skip the actual AssumeRole API call. Also documents the
relationship with PR #2143 (IRSA env var scrubbing).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: remove self-referential links from fix doc

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: forward shell options (CI capture) through ExecuteTerraform to executeCommandPipeline

The 0fc44f4 refactoring extracted executeCommandPipeline but forgot to
forward the opts parameter, silently dropping CI stdout/stderr capture
buffers and producing empty CI summaries.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: use UpdateEnvVar for os.StartProcess env dedup and cross-platform tests

Address CodeRabbit review comments on PR #2143:

- Use envpkg.UpdateEnvVar in ExecAuthShellCommand to prevent duplicate env
  keys that os.StartProcess resolves with "first value wins" semantics
- Replace Unix-only echo/sh test cases with cross-platform os.Executable()
  subprocess pattern, matching the established convention in internal/exec/
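
To illustrate why replacing a key in place differs from appending, here is a minimal sketch. `updateEnvVar` is a hypothetical helper written for this example and is not the actual `envpkg.UpdateEnvVar`; it compares keys case-sensitively, which would not match Windows semantics.

```go
package main

import (
	"fmt"
	"strings"
)

// updateEnvVar (hypothetical) sets key=value in env so that the key appears
// exactly once: the first existing entry is replaced, later duplicates are
// dropped, and the entry is appended only if the key was absent.
func updateEnvVar(env []string, key, value string) []string {
	prefix := key + "="
	out := make([]string, 0, len(env)+1)
	replaced := false
	for _, entry := range env {
		if strings.HasPrefix(entry, prefix) {
			if !replaced {
				out = append(out, prefix+value)
				replaced = true
			}
			continue // skip any further duplicates of the key
		}
		out = append(out, entry)
	}
	if !replaced {
		out = append(out, prefix+value)
	}
	return out
}

func main() {
	env := []string{"ATMOS_SHLVL=1", "PATH=/usr/bin"}
	fmt.Println(append(append([]string{}, env...), "ATMOS_SHLVL=2")) // duplicate key
	fmt.Println(updateEnvVar(env, "ATMOS_SHLVL", "2"))               // single key
}
```

With a plain append the key appears twice, so a consumer that honors the first occurrence would still see the old value; the in-place update leaves a single entry.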

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: replace gomonkey with extracted function to fix SIGBUS on darwin/arm64

Gomonkey binary patching crashes with SIGBUS on Apple Silicon (darwin/arm64)
with Go 1.26. Extract componentExistsInStacks() from CheckComponentExists()
and test the logic directly, removing the gomonkey dependency from this package.

Also fix pre-existing lint issues in describe_dependents.go (nestif, gocritic).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Haiku 4.5 <noreply@anthropic.com>
Co-authored-by: Alexander Matveev <26750966+AleksandrMatveev@users.noreply.github.com>
Co-authored-by: Andriy Knysh <aknysh@users.noreply.github.com>
@github-actions

These changes were released in v1.212.1-rc.1.

Labels

patch A minor, backward compatible change release/feature Create release from this PR size/m Medium size PR

4 participants