fix: treat S3 NoSuchBucket as not-provisioned in state reads#2257
fix: treat S3 NoSuchBucket as not-provisioned in state reads#2257
Conversation
…ls on EKS pods (#2143) * fix: prevent IRSA credentials from overriding Atmos-managed credentials on EKS pods On EKS pods with IRSA (IAM Roles for Service Accounts), the pod identity webhook injects AWS_WEB_IDENTITY_TOKEN_FILE, AWS_ROLE_ARN, and AWS_ROLE_SESSION_NAME. When using Atmos auth on ARC (Actions Runner Controller), these IRSA vars were leaking into the terraform subprocess because PrepareEnvironment only cleared vars from ComponentEnvSection (stack YAML env vars), not from os.Environ() where the pod vars live. AWS SDK credential chain gives web identity tokens higher precedence than shared credential files, so the pod's runner role was used instead of the Atmos-managed tfplan role. ## Changes 1. Add IRSA vars to problematicAWSEnvVars so they're cleared during the auth flow itself 2. Change PrepareEnvironment to set cleared vars to empty string (not delete) so they appear in ComponentEnvList and override inherited IRSA values in the subprocess 3. Update tests to expect empty strings (which override os.Environ()) instead of absent keys 4. Add TestPrepareEnvironment_IRSALeakPrevention to reproduce the full ARC/IRSA scenario ## How it works When subprocess env is built as os.Environ() + ComponentEnvList, Go's exec.Cmd respects the last occurrence of each key. Setting IRSA vars to empty string in ComponentEnvList ensures they override the pod's injected values. Co-Authored-By: Claude Haiku 4.5 <noreply@anthropic.com> * fix: scrub IRSA env vars via sanitized environment instead of empty strings Replace the empty-string override approach with a clean, sanitized environment. Pass os.Environ() through PrepareShellEnvironment (which deletes problematic IRSA/credential vars), store the result as SanitizedEnv, and pass it to subprocess execution via WithEnvironment — preventing os.Environ() re-reads that reintroduce pod-injected IRSA vars on EKS. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: auth credential chain skipping AssumeRole when target identity credentials are cached When findFirstValidCachedCredentials() found valid cached credentials at the last identity in the chain (e.g., index 1 in [provider, assume-role]), fetchCachedCredentials advanced startIndex past the end of the chain, causing authenticateIdentityChain's loop to never execute. This returned stale cached credentials without performing the actual AssumeRole API call. In GitHub Actions on EKS runners, this caused Terraform to use the runner's pod credentials instead of the Atmos-authenticated planner role, because the credential file contained provider-level credentials that were never replaced by a fresh AssumeRole call. The fix skips cached credentials at the target (last) identity and continues scanning earlier in the chain, ensuring the identity's Authenticate() method is always called. This aligns with the existing comment: "CRITICAL: Always re-authenticate through the full chain, even if the target identity has cached credentials." Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * docs: add fix doc for auth credential chain skipping AssumeRole Documents the bug where findFirstValidCachedCredentials() returns the last chain index, causing fetchCachedCredentials to advance past the chain end and skip the actual AssumeRole API call. Also documents the relationship with PR #2143 (IRSA env var scrubbing). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * docs: remove self-referential links from fix doc Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: forward shell options (CI capture) through ExecuteTerraform to executeCommandPipeline The 0fc44f4 refactoring extracted executeCommandPipeline but forgot to forward the opts parameter, silently dropping CI stdout/stderr capture buffers and producing empty CI summaries. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: use UpdateEnvVar for os.StartProcess env dedup and cross-platform tests Address CodeRabbit review comments on PR #2143: - Use envpkg.UpdateEnvVar in ExecAuthShellCommand to prevent duplicate env keys that os.StartProcess resolves with "first value wins" semantics - Replace Unix-only echo/sh test cases with cross-platform os.Executable() subprocess pattern, matching the established convention in internal/exec/ Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * fix: replace gomonkey with extracted function to fix SIGBUS on darwin/arm64 Gomonkey binary patching crashes with SIGBUS on Apple Silicon (darwin/arm64) with Go 1.26. Extract componentExistsInStacks() from CheckComponentExists() and test the logic directly, removing the gomonkey dependency from this package. Also fix pre-existing lint issues in describe_dependents.go (nestif, gocritic). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Haiku 4.5 <noreply@anthropic.com> Co-authored-by: Alexander Matveev <26750966+AleksandrMatveev@users.noreply.github.com> Co-authored-by: Andriy Knysh <aknysh@users.noreply.github.com>
When using `--all` with `provision.backend.enabled: true`, YAML functions like `!terraform.state` are evaluated before provisioners create the S3 bucket. NoSuchBucket was not handled gracefully like NoSuchKey, causing the entire operation to crash instead of treating it as "state not provisioned." Handle both typed (*types.NoSuchBucket) and generic API error forms to support AWS and S3-compatible backends (MinIO, Wasabi). Also avoids unnecessary retries on a bucket that cannot appear between attempts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Dependency Review✅ No vulnerabilities or license issues found.Snapshot WarningsEnsure that dependencies are being submitted on PR branches and consider enabling retry-on-snapshot-warnings. See the documentation for more information and troubleshooting advice. Scanned FilesNone |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
📝 WalkthroughWalkthroughPre-sanitized environment slices ( []string ) are threaded end-to-end into subprocess execution, removing re-reading/merging of the host env. Auth hooks now persist sanitized env to schema, auth manager skips terminal cached credentials, and S3 backend treats missing buckets as unprovisioned (nil, nil). Changes
Sequence Diagram(s)sequenceDiagram
participant Hooks as Auth Hooks
participant Manager as Auth Manager
participant Sanitizer as PrepareShellEnvironment
participant Exec as ExecuteShellCommand
participant Sub as Subprocess
Hooks->>Manager: Request Authenticate for identity chain
Manager->>Sanitizer: PrepareShellEnvironment (remove AWS/IRSA vars)
Sanitizer-->>Manager: Return sanitized env ([]string)
Manager-->>Hooks: Provide auth env list
Hooks->>Hooks: Persist sanitized env into ConfigAndStacksInfo.SanitizedEnv
Hooks->>Exec: ExecuteShellCommand with WithEnvironment(SanitizedEnv)
Exec->>Exec: Build cmd.Env from SanitizedEnv (no os.Environ re-read)
Exec->>Sub: Launch subprocess with sanitized env
Sub-->>Exec: Exit code / output
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
pkg/list/utils/utils.go (1)
17-29: 🛠️ Refactor suggestion | 🟠 MajorAdd perf tracking to this public function.
CheckComponentExistsdoes I/O on Line 23, so it should be instrumented withperf.Trackfor consistency and observability.Proposed diff
import ( e "github.com/cloudposse/atmos/internal/exec" + "github.com/cloudposse/atmos/internal/perf" "github.com/cloudposse/atmos/pkg/list/errors" "github.com/cloudposse/atmos/pkg/schema" ) @@ func CheckComponentExists(atmosConfig *schema.AtmosConfiguration, componentName string) bool { + defer perf.Track(atmosConfig, "list.utils.CheckComponentExists")() + if componentName == "" { return false }As per coding guidelines, "Add
defer perf.Track(atmosConfig, "pkg.FuncName")()plus blank line to all public functions ...".🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/list/utils/utils.go` around lines 17 - 29, Add perf tracking to the public function CheckComponentExists: at the top of the function (immediately after the parameter checks or start of body) add a defer perf.Track(atmosConfig, "pkg.CheckComponentExists")() and ensure a blank line follows that defer for readability; this will instrument the I/O call made by ExecuteDescribeStacks and provide consistent observability for CheckComponentExists.
🧹 Nitpick comments (5)
pkg/list/utils/check_component_test.go (2)
15-140: Rebalance coverage toward exported behavior, not mostly private helper internals.Most new coverage targets
componentExistsInStacksdirectly; consider adding broader contract tests throughCheckComponentExists(beyond empty-name) so refactors don’t force test rewrites.Based on learnings, "Applies to
**/*_test.go: Test behavior, not implementation. Never test stub functions. Avoid tautological tests. Make code testable via dependency injection.".🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/list/utils/check_component_test.go` around lines 15 - 140, The tests currently target the internal helper componentExistsInStacks; replace or add tests that exercise the exported CheckComponentExists function (not the stub/private helper) to verify the public contract: call CheckComponentExists with realistic inputs (nil/empty stacks, stacks missing components, stacks with components present in different component types and different stacks) and assert true/false outcomes; remove tautological tests of internal helpers and, if CheckComponentExists depends on external state, refactor it to accept the stacks map as an argument (or inject a provider) so tests can pass controlled inputs and avoid poking private functions like componentExistsInStacks.
15-140: Refactor these cases into a single table-driven test.These scenarios are good, but the current one-test-per-case pattern is repetitive and harder to maintain.
As per coding guidelines, "
**/*_test.go: Use table-driven tests for testing multiple scenarios in Go".🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/list/utils/check_component_test.go` around lines 15 - 140, Combine the many individual tests into one table-driven test named like TestComponentExistsInStacks_TableDriven: create a slice of test cases each with a name, input stacks (map[string]any) and expected bool, then loop with t.Run(tc.name, func(t *testing.T){ result := componentExistsInStacks(tc.stacks, "test-component"); assert.Equal(t, tc.want, result) }) to cover EmptyMap, InvalidStackData, NoComponentsKey, InvalidComponentsType, InvalidComponentTypeMap, ComponentNotFound, ComponentFound, ComponentFoundInSecondStack and MixedValidInvalidStacks cases; remove the old duplicated test functions.pkg/auth/cloud/aws/env_test.go (1)
588-596: Consider using strings.IndexByte.The
indexOfhelper duplicatesstrings.IndexBytefunctionality. Could simplify with:if idx := strings.IndexByte(entry, '='); idx >= 0 {This is a minor nitpick - the current implementation works correctly.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/auth/cloud/aws/env_test.go` around lines 588 - 596, The custom indexOf(s string, b byte) function duplicates standard library behavior; replace its usage with strings.IndexByte (importing "strings" if missing) and remove the indexOf function; e.g., where indexOf(entry, '=') is used, call strings.IndexByte(entry, '=') and handle the >= 0 check the same way to preserve behavior in the tests in pkg/auth/cloud/aws/env_test.go.internal/terraform_backend/terraform_backend_s3_test.go (1)
212-228: Add a direct assertion thatNoSuchBucketdoes not retry.These cases validate
(nil, nil)correctly, but they don’t assert the new short-circuit behavior. Please trackGetObjectcall count and assert exactly one call for bothNoSuchBucketvariants.As per coding guidelines, "Every new feature must include comprehensive unit tests targeting >80% code coverage for all packages."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/terraform_backend/terraform_backend_s3_test.go` around lines 212 - 228, Add a counter to the test cases using the erroringS3Client to track calls to GetObject and assert it is invoked exactly once for both NoSuchBucket variants; specifically, modify the test entries named "no such bucket - generic API error (backend not provisioned)" and "no such bucket - typed error (backend not provisioned)" to wrap the client with a call-counting erroringS3Client (or add a counter field on erroringS3Client) that increments on GetObject, run the operation under test, and add assertions that the returned body is nil and that the GetObject call count equals 1 for each case to verify the short-circuit behavior.pkg/auth/manager_test.go (1)
705-723: Nice update—add one explicit terminal-only cache case for completeness.Consider adding a case where only the last identity has valid cache and earlier steps are missing/invalid, asserting
-1. That makes the new skip-last contract explicit and regression-resistant.Proposed test addition.
func TestManager_findFirstValidCachedCredentials(t *testing.T) { @@ require.Equal(t, 1, idx) + // Only target identity valid, no earlier valid cache -> should return -1. + delete(s.data, "id1") + s.data["id2"] = &testCreds{exp: &validExp} + idx = m.findFirstValidCachedCredentials() + require.Equal(t, -1, idx) + // id2 expired, id1 still valid -> should pick id1. s.data["id2"] = &testCreds{exp: &expiredExp} + s.data["id1"] = &testCreds{exp: &validExp} idx = m.findFirstValidCachedCredentials() require.Equal(t, 1, idx)As per coding guidelines "Every new feature must include comprehensive unit tests targeting >80% code coverage for all packages."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/auth/manager_test.go` around lines 705 - 723, Add a terminal-only cache test: modify the existing test around s.data and m.findFirstValidCachedCredentials() to include a case where only the last identity (the target) has valid cached credentials (e.g., set s.data["id2"] = &testCreds{exp: &validExp} and s.data["id1"] = &testCreds{exp: &expiredExp} or nil), then call m.findFirstValidCachedCredentials() and assert it returns -1; this makes the "skip last identity" contract explicit and prevents regressions.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/fixes/2026-03-26-all-flag-nosuchbucket-backend-provisioning.md`:
- Around line 36-54: Add the missing fenced code block language tags ("text")
for the two markdown code blocks that illustrate the S3/GetTerraform state flows
(the block starting with "S3 GetObject returns NoSuchBucket" and the block
starting with "nil content from S3 reader") so the linter MD040 is satisfied;
update those triple-backtick fences to ```text while leaving the content
unchanged, ensuring the examples around ReadTerraformBackendS3Internal,
ErrGetObjectFromS3, GetTerraformBackend, GetTerraformState,
isRecoverableTerraformError, and ErrTerraformStateNotProvisioned remain intact.
---
Outside diff comments:
In `@pkg/list/utils/utils.go`:
- Around line 17-29: Add perf tracking to the public function
CheckComponentExists: at the top of the function (immediately after the
parameter checks or start of body) add a defer perf.Track(atmosConfig,
"pkg.CheckComponentExists")() and ensure a blank line follows that defer for
readability; this will instrument the I/O call made by ExecuteDescribeStacks and
provide consistent observability for CheckComponentExists.
---
Nitpick comments:
In `@internal/terraform_backend/terraform_backend_s3_test.go`:
- Around line 212-228: Add a counter to the test cases using the
erroringS3Client to track calls to GetObject and assert it is invoked exactly
once for both NoSuchBucket variants; specifically, modify the test entries named
"no such bucket - generic API error (backend not provisioned)" and "no such
bucket - typed error (backend not provisioned)" to wrap the client with a
call-counting erroringS3Client (or add a counter field on erroringS3Client) that
increments on GetObject, run the operation under test, and add assertions that
the returned body is nil and that the GetObject call count equals 1 for each
case to verify the short-circuit behavior.
In `@pkg/auth/cloud/aws/env_test.go`:
- Around line 588-596: The custom indexOf(s string, b byte) function duplicates
standard library behavior; replace its usage with strings.IndexByte (importing
"strings" if missing) and remove the indexOf function; e.g., where
indexOf(entry, '=') is used, call strings.IndexByte(entry, '=') and handle the
>= 0 check the same way to preserve behavior in the tests in
pkg/auth/cloud/aws/env_test.go.
In `@pkg/auth/manager_test.go`:
- Around line 705-723: Add a terminal-only cache test: modify the existing test
around s.data and m.findFirstValidCachedCredentials() to include a case where
only the last identity (the target) has valid cached credentials (e.g., set
s.data["id2"] = &testCreds{exp: &validExp} and s.data["id1"] = &testCreds{exp:
&expiredExp} or nil), then call m.findFirstValidCachedCredentials() and assert
it returns -1; this makes the "skip last identity" contract explicit and
prevents regressions.
In `@pkg/list/utils/check_component_test.go`:
- Around line 15-140: The tests currently target the internal helper
componentExistsInStacks; replace or add tests that exercise the exported
CheckComponentExists function (not the stub/private helper) to verify the public
contract: call CheckComponentExists with realistic inputs (nil/empty stacks,
stacks missing components, stacks with components present in different component
types and different stacks) and assert true/false outcomes; remove tautological
tests of internal helpers and, if CheckComponentExists depends on external
state, refactor it to accept the stacks map as an argument (or inject a
provider) so tests can pass controlled inputs and avoid poking private functions
like componentExistsInStacks.
- Around line 15-140: Combine the many individual tests into one table-driven
test named like TestComponentExistsInStacks_TableDriven: create a slice of test
cases each with a name, input stacks (map[string]any) and expected bool, then
loop with t.Run(tc.name, func(t *testing.T){ result :=
componentExistsInStacks(tc.stacks, "test-component"); assert.Equal(t, tc.want,
result) }) to cover EmptyMap, InvalidStackData, NoComponentsKey,
InvalidComponentsType, InvalidComponentTypeMap, ComponentNotFound,
ComponentFound, ComponentFoundInSecondStack and MixedValidInvalidStacks cases;
remove the old duplicated test functions.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: c97d8230-a9db-4b63-ba25-fcfa8d286afa
📒 Files selected for processing (24)
cmd/auth_exec.gocmd/auth_exec_test.gocmd/auth_shell.gocmd/testing_main_test.godocs/fixes/2026-03-23-auth-credential-chain-skipping-assume-role.mddocs/fixes/2026-03-26-all-flag-nosuchbucket-backend-provisioning.mdinternal/exec/helmfile.gointernal/exec/packer.gointernal/exec/shell_utils.gointernal/exec/shell_utils_test.gointernal/exec/terraform.gointernal/exec/terraform_execute_helpers.gointernal/exec/terraform_execute_helpers_exec.gointernal/terraform_backend/terraform_backend_s3.gointernal/terraform_backend/terraform_backend_s3_test.gopkg/auth/cloud/aws/env.gopkg/auth/cloud/aws/env_test.gopkg/auth/cloud/aws/setup_test.gopkg/auth/hooks.gopkg/auth/manager_chain.gopkg/auth/manager_test.gopkg/list/utils/check_component_test.gopkg/list/utils/utils.gopkg/schema/schema.go
docs/fixes/2026-03-26-all-flag-nosuchbucket-backend-provisioning.md
Outdated
Show resolved
Hide resolved
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #2257 +/- ##
==========================================
+ Coverage 77.19% 77.25% +0.05%
==========================================
Files 1015 1017 +2
Lines 96065 96271 +206
==========================================
+ Hits 74158 74374 +216
+ Misses 17717 17703 -14
- Partials 4190 4194 +4
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
Address CodeRabbit MD040 lint warning. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
what
NoSuchBucketerrors the same asNoSuchKeyin the Terraform state reader — returnnil, nil(state not provisioned) instead of crashing*types.NoSuchBucket(AWS SDK v2) and genericsmithy.APIErrorwith code"NoSuchBucket"(S3-compatible backends like MinIO, Wasabi)NoSuchBucket— the bucket won't appear between retrieswhy
atmos terraform plan --all -s <stack>withprovision.backend.enabled: true, theExecuteDescribeStacksenumeration call evaluates YAML functions (!terraform.state) for ALL components before any provisioners have runGetObjectreturnsNoSuchBucketwhich was not handled gracefully — it fell through to the retry loop, exhausted retries, returnedErrGetObjectFromS3(not recoverable), and crashed the entire--alloperationNoSuchKey(missing state file) was already handled gracefully by returningnil, nil, which flows throughErrTerraformStateNotProvisioned→isRecoverableTerraformError()→ graceful fallback.NoSuchBucketis a superset of the same condition (if the bucket doesn't exist, the state can't exist either)terraform initbefore any state readsreferences
docs/fixes/2026-03-26-all-flag-nosuchbucket-backend-provisioning.md--allpath)--affectedpath)Summary by CodeRabbit
Bug Fixes
terraform plan --allwith backend provisioning when remote state bucket is missingImprovements
Documentation