fix: prevent IRSA credentials from overriding Atmos-managed credentials on EKS pods#2143
fix: prevent IRSA credentials from overriding Atmos-managed credentials on EKS pods#2143
Conversation
…ls on EKS pods On EKS pods with IRSA (IAM Roles for Service Accounts), the pod identity webhook injects AWS_WEB_IDENTITY_TOKEN_FILE, AWS_ROLE_ARN, and AWS_ROLE_SESSION_NAME. When using Atmos auth on ARC (Actions Runner Controller), these IRSA vars were leaking into the terraform subprocess because PrepareEnvironment only cleared vars from ComponentEnvSection (stack YAML env vars), not from os.Environ() where the pod vars live. AWS SDK credential chain gives web identity tokens higher precedence than shared credential files, so the pod's runner role was used instead of the Atmos-managed tfplan role. ## Changes 1. Add IRSA vars to problematicAWSEnvVars so they're cleared during the auth flow itself 2. Change PrepareEnvironment to set cleared vars to empty string (not delete) so they appear in ComponentEnvList and override inherited IRSA values in the subprocess 3. Update tests to expect empty strings (which override os.Environ()) instead of absent keys 4. Add TestPrepareEnvironment_IRSALeakPrevention to reproduce the full ARC/IRSA scenario ## How it works When subprocess env is built as os.Environ() + ComponentEnvList, Go's exec.Cmd respects the last occurrence of each key. Setting IRSA vars to empty string in ComponentEnvList ensures they override the pod's injected values. Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com>
Dependency Review✅ No vulnerabilities or license issues found.Snapshot WarningsEnsure that dependencies are being submitted on PR branches and consider enabling retry-on-snapshot-warnings. See the documentation for more information and troubleshooting advice. Scanned FilesNone |
📝 WalkthroughWalkthroughAdds IRSA-related AWS vars to the problematic list and switches PrepareEnvironment to set problematic AWS env vars to empty strings. Propagates a pre-sanitized []string environment through exec plumbing so subprocesses receive the sanitized env unchanged. Tests updated. Changes
Sequence Diagram(s)sequenceDiagram
rect rgba(0,128,0,0.5)
participant User
end
rect rgba(0,0,255,0.5)
participant Atmos
end
rect rgba(255,165,0,0.5)
participant AuthMgr
end
rect rgba(128,0,128,0.5)
participant Exec
end
rect rgba(255,0,0,0.5)
participant Subprocess
end
User->>Atmos: invoke auth/shell/terraform command
Atmos->>AuthMgr: PrepareEnvironment() -> sanitizedEnv ([]string)
AuthMgr-->>Atmos: return sanitizedEnv
Atmos->>Exec: ExecuteShellCommand / ExecAuthShellCommand(with sanitizedEnv)
Exec->>Subprocess: start subprocess with Env = sanitizedEnv + ATMOS_* mutations
Subprocess-->>User: run tool/shell with sanitized environment
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
pkg/auth/cloud/aws/env_test.go (1)
634-641: Replace customindexOfwithstrings.IndexByte.The
indexOffunction duplicatesstrings.IndexByte— no need for a custom helper. Your Go version (1.26) supports thefor i := range len(s)syntax without issue.Replace the function and its usage:
- Line 612: change
indexOf(entry, '=')tostrings.IndexByte(entry, '=')- Lines 634-641: remove the custom function
- Add
"strings"to imports if needed🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/auth/cloud/aws/env_test.go` around lines 634 - 641, The custom indexOf function duplicates standard library functionality; replace calls to indexOf(entry, '=') with strings.IndexByte(entry, '=') and delete the indexOf function definition (function name: indexOf). Ensure the "strings" package is imported (add to imports if missing) and remove the now-unused indexOf symbol so the test file builds cleanly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@pkg/auth/cloud/aws/env_test.go`:
- Around line 634-641: The custom indexOf function duplicates standard library
functionality; replace calls to indexOf(entry, '=') with
strings.IndexByte(entry, '=') and delete the indexOf function definition
(function name: indexOf). Ensure the "strings" package is imported (add to
imports if missing) and remove the now-unused indexOf symbol so the test file
builds cleanly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 964c0740-0aca-46ba-9c1d-f8be609d9e89
📒 Files selected for processing (3)
pkg/auth/cloud/aws/env.gopkg/auth/cloud/aws/env_test.gopkg/auth/cloud/aws/setup_test.go
|
These changes were released in v1.208.1-test.0. |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #2143 +/- ##
==========================================
+ Coverage 77.19% 77.22% +0.03%
==========================================
Files 1015 1015
Lines 96065 96087 +22
==========================================
+ Hits 74158 74204 +46
+ Misses 17717 17696 -21
+ Partials 4190 4187 -3
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
…trings Replace the empty-string override approach with a clean, sanitized environment. Pass os.Environ() through PrepareShellEnvironment (which deletes problematic IRSA/credential vars), store the result as SanitizedEnv, and pass it to subprocess execution via WithEnvironment — preventing os.Environ() re-reads that reintroduce pod-injected IRSA vars on EKS. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
These changes were released in v1.208.1-test.1. |
…redentials are cached When findFirstValidCachedCredentials() found valid cached credentials at the last identity in the chain (e.g., index 1 in [provider, assume-role]), fetchCachedCredentials advanced startIndex past the end of the chain, causing authenticateIdentityChain's loop to never execute. This returned stale cached credentials without performing the actual AssumeRole API call. In GitHub Actions on EKS runners, this caused Terraform to use the runner's pod credentials instead of the Atmos-authenticated planner role, because the credential file contained provider-level credentials that were never replaced by a fresh AssumeRole call. The fix skips cached credentials at the target (last) identity and continues scanning earlier in the chain, ensuring the identity's Authenticate() method is always called. This aligns with the existing comment: "CRITICAL: Always re-authenticate through the full chain, even if the target identity has cached credentials." Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
💥 This pull request now has conflicts. Could you fix it @osterman? 🙏 |
Resolve conflicts in shell_utils.go, terraform.go, and workflow_utils.go: - shell_utils.go: Merge ShellCommandOption (main's capture/override options) with WithEnvironment (branch's sanitized env option) into unified shellCommandConfig - terraform.go: Use main's refactored executeCommandPipeline, passing WithEnvironment(info.SanitizedEnv) to ensure IRSA env var scrubbing propagates through init, workspace, and main command phases - workflow_utils.go: Use main's retry.Do closure pattern Extract createWorkspaceFallback to keep cyclomatic complexity within bounds. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ntity-irsa # Conflicts: # pkg/auth/hooks.go
Documents the bug where findFirstValidCachedCredentials() returns the last chain index, causing fetchCachedCredentials to advance past the chain end and skip the actual AssumeRole API call. Also documents the relationship with PR #2143 (IRSA env var scrubbing). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
These changes were released in v1.211.1-test.3. |
…/arm64 Gomonkey binary patching crashes with SIGBUS on Apple Silicon (darwin/arm64) with Go 1.26. Extract componentExistsInStacks() from CheckComponentExists() and test the logic directly, removing the gomonkey dependency from this package. Also fix pre-existing lint issues in describe_dependents.go (nestif, gocritic). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
💥 This pull request now has conflicts. Could you fix it @osterman? 🙏 |
There was a problem hiding this comment.
🧹 Nitpick comments (3)
pkg/list/utils/check_component_test.go (2)
9-13: Add at least one behavior-level test forCheckComponentExists(non-empty path).Right now most coverage is on
componentExistsInStacks. Consider adding a seam (DI/function var) so the exported function can be tested as the contract surface, not just its helper.Based on learnings "Test behavior, not implementation. Never test stub functions. Avoid tautological tests. Make code testable via dependency injection."
Also applies to: 15-140
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/list/utils/check_component_test.go` around lines 9 - 13, The exported CheckComponentExists currently only has an empty-name test; add a behavior-level test for non-empty paths by introducing a seam (e.g., a function variable or dependency injection) so you can stub componentExistsInStacks during tests and assert CheckComponentExists returns true/false based on the stub; modify the implementation to call an injectable function (keep original componentExistsInStacks as default) and add a test that sets the injected function to return both true and false to verify the exported contract via CheckComponentExists (referencing CheckComponentExists and componentExistsInStacks).
15-140: Consolidate these cases into a table-driven test.The scenarios are solid, but they’re repetitive. A single table-driven test will be shorter and easier to extend.
As per coding guidelines "Use table-driven tests for testing multiple scenarios in Go".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/list/utils/check_component_test.go` around lines 15 - 140, The tests are repetitive—replace the multiple TestComponentExistsInStacks_* functions with a single table-driven test named TestComponentExistsInStacks that defines a slice of test cases (fields: name, stacks map[string]any, component string, want bool), include all existing scenarios (EmptyMap, InvalidStackData, NoComponentsKey, InvalidComponentsType, InvalidComponentTypeMap, ComponentNotFound, ComponentFound, ComponentFoundInSecondStack, MixedValidInvalidStacks) as cases, iterate cases with t.Run(case.name, func(t *testing.T) { result := componentExistsInStacks(case.stacks, case.component); assert.Equal(t, case.want, result) }) and delete the old individual test functions; keep referencing componentExistsInStacks to locate the logic to test.pkg/list/utils/utils.go (1)
39-39: Use shared constant for the components key.Prefer
cfg.ComponentsSectionNameinstead of the string literal"components"so this stays aligned with stack-shape producers ininternal/exec.♻️ Suggested tweak
import ( + cfg "github.com/cloudposse/atmos/pkg/config" e "github.com/cloudposse/atmos/internal/exec" "github.com/cloudposse/atmos/pkg/list/errors" "github.com/cloudposse/atmos/pkg/schema" ) @@ - componentsMap, ok := stackMap["components"].(map[string]interface{}) + componentsMap, ok := stackMap[cfg.ComponentsSectionName].(map[string]interface{})🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/list/utils/utils.go` at line 39, Replace the hard-coded "components" key lookup with the shared constant by using cfg.ComponentsSectionName when extracting componentsMap from stackMap (the expression that currently reads stackMap["components"].(map[string]interface{})); update the lookup where componentsMap is assigned so it uses cfg.ComponentsSectionName to keep key usage consistent with internal/exec producers.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@pkg/list/utils/check_component_test.go`:
- Around line 9-13: The exported CheckComponentExists currently only has an
empty-name test; add a behavior-level test for non-empty paths by introducing a
seam (e.g., a function variable or dependency injection) so you can stub
componentExistsInStacks during tests and assert CheckComponentExists returns
true/false based on the stub; modify the implementation to call an injectable
function (keep original componentExistsInStacks as default) and add a test that
sets the injected function to return both true and false to verify the exported
contract via CheckComponentExists (referencing CheckComponentExists and
componentExistsInStacks).
- Around line 15-140: The tests are repetitive—replace the multiple
TestComponentExistsInStacks_* functions with a single table-driven test named
TestComponentExistsInStacks that defines a slice of test cases (fields: name,
stacks map[string]any, component string, want bool), include all existing
scenarios (EmptyMap, InvalidStackData, NoComponentsKey, InvalidComponentsType,
InvalidComponentTypeMap, ComponentNotFound, ComponentFound,
ComponentFoundInSecondStack, MixedValidInvalidStacks) as cases, iterate cases
with t.Run(case.name, func(t *testing.T) { result :=
componentExistsInStacks(case.stacks, case.component); assert.Equal(t, case.want,
result) }) and delete the old individual test functions; keep referencing
componentExistsInStacks to locate the logic to test.
In `@pkg/list/utils/utils.go`:
- Line 39: Replace the hard-coded "components" key lookup with the shared
constant by using cfg.ComponentsSectionName when extracting componentsMap from
stackMap (the expression that currently reads
stackMap["components"].(map[string]interface{})); update the lookup where
componentsMap is assigned so it uses cfg.ComponentsSectionName to keep key usage
consistent with internal/exec producers.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 83d5a33c-7d4e-4c1b-9391-8c799f6581d3
📒 Files selected for processing (3)
internal/exec/describe_dependents.gopkg/list/utils/check_component_test.gopkg/list/utils/utils.go
…ntity-irsa Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
These changes were released in v1.213.0-test.0. |
|
These changes were released in v1.213.0-test.3. |
|
These changes were released in v1.213.0-test.4. |
…ls on EKS pods (#2143) * fix: prevent IRSA credentials from overriding Atmos-managed credentials on EKS pods On EKS pods with IRSA (IAM Roles for Service Accounts), the pod identity webhook injects AWS_WEB_IDENTITY_TOKEN_FILE, AWS_ROLE_ARN, and AWS_ROLE_SESSION_NAME. When using Atmos auth on ARC (Actions Runner Controller), these IRSA vars were leaking into the terraform subprocess because PrepareEnvironment only cleared vars from ComponentEnvSection (stack YAML env vars), not from os.Environ() where the pod vars live. AWS SDK credential chain gives web identity tokens higher precedence than shared credential files, so the pod's runner role was used instead of the Atmos-managed tfplan role. ## Changes 1. Add IRSA vars to problematicAWSEnvVars so they're cleared during the auth flow itself 2. Change PrepareEnvironment to set cleared vars to empty string (not delete) so they appear in ComponentEnvList and override inherited IRSA values in the subprocess 3. Update tests to expect empty strings (which override os.Environ()) instead of absent keys 4. Add TestPrepareEnvironment_IRSALeakPrevention to reproduce the full ARC/IRSA scenario ## How it works When subprocess env is built as os.Environ() + ComponentEnvList, Go's exec.Cmd respects the last occurrence of each key. Setting IRSA vars to empty string in ComponentEnvList ensures they override the pod's injected values. Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com> * fix: scrub IRSA env vars via sanitized environment instead of empty strings Replace the empty-string override approach with a clean, sanitized environment. Pass os.Environ() through PrepareShellEnvironment (which deletes problematic IRSA/credential vars), store the result as SanitizedEnv, and pass it to subprocess execution via WithEnvironment — preventing os.Environ() re-reads that reintroduce pod-injected IRSA vars on EKS. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: auth credential chain skipping AssumeRole when target identity credentials are cached When findFirstValidCachedCredentials() found valid cached credentials at the last identity in the chain (e.g., index 1 in [provider, assume-role]), fetchCachedCredentials advanced startIndex past the end of the chain, causing authenticateIdentityChain's loop to never execute. This returned stale cached credentials without performing the actual AssumeRole API call. In GitHub Actions on EKS runners, this caused Terraform to use the runner's pod credentials instead of the Atmos-authenticated planner role, because the credential file contained provider-level credentials that were never replaced by a fresh AssumeRole call. The fix skips cached credentials at the target (last) identity and continues scanning earlier in the chain, ensuring the identity's Authenticate() method is always called. This aligns with the existing comment: "CRITICAL: Always re-authenticate through the full chain, even if the target identity has cached credentials." Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: add fix doc for auth credential chain skipping AssumeRole Documents the bug where findFirstValidCachedCredentials() returns the last chain index, causing fetchCachedCredentials to advance past the chain end and skip the actual AssumeRole API call. Also documents the relationship with PR #2143 (IRSA env var scrubbing). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * docs: remove self-referential links from fix doc Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: forward shell options (CI capture) through ExecuteTerraform to executeCommandPipeline The 0fc44f4 refactoring extracted executeCommandPipeline but forgot to forward the opts parameter, silently dropping CI stdout/stderr capture buffers and producing empty CI summaries. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: use UpdateEnvVar for os.StartProcess env dedup and cross-platform tests Address CodeRabbit review comments on PR #2143: - Use envpkg.UpdateEnvVar in ExecAuthShellCommand to prevent duplicate env keys that os.StartProcess resolves with "first value wins" semantics - Replace Unix-only echo/sh test cases with cross-platform os.Executable() subprocess pattern, matching the established convention in internal/exec/ Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: replace gomonkey with extracted function to fix SIGBUS on darwin/arm64 Gomonkey binary patching crashes with SIGBUS on Apple Silicon (darwin/arm64) with Go 1.26. Extract componentExistsInStacks() from CheckComponentExists() and test the logic directly, removing the gomonkey dependency from this package. Also fix pre-existing lint issues in describe_dependents.go (nestif, gocritic). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Haiku 4.5 <noreply@anthropic.com> Co-authored-by: Alexander Matveev <26750966+AleksandrMatveev@users.noreply.github.com> Co-authored-by: Andriy Knysh <aknysh@users.noreply.github.com>
|
These changes were released in v1.212.1-rc.1. |
what
os.Environ()throughPrepareShellEnvironmentto sanitize it (delete problematic vars), then pass the sanitized env to subprocess viaWithBaseEnv— avoiding re-readingos.Environ()which would reintroduce IRSA varsSanitizedBaseEnvfield toConfigAndStacksInfoto carry sanitized environment through the hooks→terraform/helmfile/packer pipelineWithBaseEnvvariadic option toExecuteShellCommandfor backward-compatible sanitized env injectionauth execandauth shellto use sanitized env directly instead of re-readingos.Environ()why
On EKS pods with IRSA (IAM Roles for Service Accounts), the pod identity webhook injects
AWS_WEB_IDENTITY_TOKEN_FILE,AWS_ROLE_ARN, andAWS_ROLE_SESSION_NAMEinto the pod environment. When using Atmos auth on ARC (Actions Runner Controller), these IRSA vars leaked into terraform subprocesses because three code paths re-reados.Environ()after auth sanitization:authenticateAndWriteEnvonly passedComponentEnvSection(stack YAML vars) toPrepareShellEnvironment— IRSA vars weren't in the input sodelete()was a no-op. ThenExecuteShellCommandre-reados.Environ()as the base.auth exec:executeCommandWithEnvre-reados.Environ()to build subprocess env.auth shell:ExecAuthShellCommand→MergeSystemEnvSimpleWithGlobalre-reados.Environ().AWS SDK credential chain gives web identity tokens higher precedence than shared credential files, so the pod's runner role was used instead of the Atmos-managed tfplan role, causing
AccessDeniederrors.Approach
Instead of setting cleared vars to empty string (which pollutes the subprocess env), we pass a clean, sanitized environment:
authenticateAndWriteEnvnow passesos.Environ()+ComponentEnvSectiontoPrepareShellEnvironment, which deletes problematic keysSanitizedBaseEnvonConfigAndStacksInfoExecuteShellCommandacceptsWithBaseEnv(info.SanitizedBaseEnv)to use the sanitized env instead of re-readingos.Environ()auth execandauth shellpass sanitized env directly to subprocess, bypassing the re-readreferences
Fixes credential precedence conflict where IRSA vars override Atmos-managed credentials on EKS pods running ARC (DEV-4216)
Summary by CodeRabbit
Bug Fixes
Refactor
Tests
Documentation