Date: 2026-03-23
Reported by: Alexander Matveev (9spokes)
On EKS ARC (Actions Runner Controller) runners with IRSA, atmos auth returns stale cached credentials instead of performing the actual AssumeRole API call. Terraform runs with the runner's pod credentials instead of the Atmos-authenticated planner role.
This is the second bug affecting IRSA on EKS. The fix for the first is env var scrubbing: removing AWS_WEB_IDENTITY_TOKEN_FILE, AWS_ROLE_ARN, and AWS_ROLE_SESSION_NAME from the subprocess environment so the AWS SDK doesn't prefer pod identity over file-based credentials. This second bug is in the credential chain cache lookup and causes the AssumeRole step to be skipped entirely.
In findFirstValidCachedCredentials(), the bottom-up scan finds valid cached credentials at the last identity in the chain (the target). The caller fetchCachedCredentials() then advances startIndex past the end of the chain, causing authenticateIdentityChain's loop to never execute.
Walkthrough with a 2-element chain [github-oidc, core-artifacts/terraform]:
1. `findFirstValidCachedCredentials()` scans bottom-up, finds valid cached credentials at index 1 (`core-artifacts/terraform`) → returns `1`
2. `fetchCachedCredentials(1)` loads those cached creds and returns `startIndex = 1 + 1 = 2`
3. `authenticateIdentityChain(ctx, 2, cachedCreds)` runs the loop `for i := 2; i < 2` → loop never executes
4. The cached credentials are returned as-is, without ever calling `identity.Authenticate()` (the actual `AssumeRole` API call)
The cached credentials at the last index represent the output of that step. fetchCachedCredentials is designed to return the cached index + 1 as startIndex so the next step can use the cached credentials as input. But when there is no next step (the cached identity is last in the chain), startIndex equals the chain length and the identity loop is skipped entirely.
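The off-by-one in the walkthrough above can be reproduced with a small standalone model. This is a minimal sketch, not the Atmos implementation: `stepsExecuted` is a hypothetical helper that only mirrors the loop shape quoted in the walkthrough, and the increment stands in for the real `identity.Authenticate()` call.

```go
package main

import "fmt"

// stepsExecuted counts how many identity steps a chain loop would run when
// it starts at startIndex. Hypothetical helper: it mirrors the loop shape
// from the walkthrough (for i := startIndex; i < len(chain); i++).
func stepsExecuted(chain []string, startIndex int) int {
	executed := 0
	for i := startIndex; i < len(chain); i++ {
		executed++ // stands in for identity.Authenticate()
	}
	return executed
}

func main() {
	chain := []string{"github-oidc", "core-artifacts/terraform"}

	cachedIndex := 1              // cache hit at the last identity in the chain
	startIndex := cachedIndex + 1 // the value the cache lookup hands to the loop

	// startIndex == len(chain), so the loop body never runs.
	fmt.Println(stepsExecuted(chain, startIndex))

	// Starting at index 1 instead runs exactly one step (the AssumeRole step).
	fmt.Println(stepsExecuted(chain, 1))
}
```

With a cache hit at the last index, the first call reports zero executed steps, which is the skipped AssumeRole described above.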
This means if the credential file ever contains stale or incorrect credentials (e.g., written by a previous auth step in the same run, or from a provider-level auth that wrote base credentials under the identity's profile), they get returned without validation through the actual AWS STS call.
On EKS runners with IRSA, the credential file can contain provider-level (OIDC) credentials that were cached during the provider authentication step. When findFirstValidCachedCredentials returns the last index, fetchCachedCredentials advances past the chain end, and these provider-level credentials are returned as if they were the assume-role credentials — without the actual AssumeRole call ever happening.
In findFirstValidCachedCredentials(), skip cached credentials at the target (last) identity and continue scanning earlier in the chain. This ensures the identity's Authenticate() method is always called for the final step.
Changed file: pkg/auth/manager_chain.go
    // After validating credentials are not expired:
    if i == len(m.chain)-1 {
        log.Debug("Skipping cached target identity credentials to force re-authentication",
            logKeyChainIndex, i, identityNameKey, identityName)
        continue
    }
    return i

Behavior with the fix for different chain configurations:
| Chain | Cache state | findFirstValidCachedCredentials returns | Result |
|---|---|---|---|
| `[provider, assume-role]` | Cache at index 1 (last) | Skip index 1 → check index 0 → if valid, return 0 | AssumeRole executes with provider creds as input |
| `[provider, assume-role]` | Cache only at index 1, nothing at 0 | Skip index 1 → index 0 invalid → return -1 | Full re-auth from provider |
| `[provider, assume-role]` | Cache at index 0 (provider) | Return 0 | Identity chain starts at 1, AssumeRole executes |
| `[provider, permset, assume-role]` | Cache at index 1 (middle) | Return 1 | fetchCachedCredentials(1) returns startIndex=2, assume-role step executes |
This aligns with the existing comment on lines 30-33 of manager_chain.go:
"CRITICAL: Always re-authenticate through the full chain, even if the target identity has cached credentials."
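The scan behavior listed in the table can be checked against a simplified model of the fixed lookup. This is a sketch under stated assumptions rather than the code in manager_chain.go: `findFirstValid` is a hypothetical function, and the `valid` slice stands in for "cached credentials exist at this index and are not expired".

```go
package main

import "fmt"

// findFirstValid scans from the last index down to 0 and returns the index
// of the first entry with valid cached credentials, or -1 if none qualifies.
// Hypothetical model: valid[i] replaces the real cache and expiry checks.
func findFirstValid(valid []bool) int {
	last := len(valid) - 1
	for i := last; i >= 0; i-- {
		if !valid[i] {
			continue
		}
		if i == last {
			// Skip the target identity so its Authenticate() step always runs.
			continue
		}
		return i
	}
	return -1
}

func main() {
	// [provider, assume-role], cache at both indexes: falls back to index 0.
	fmt.Println(findFirstValid([]bool{true, true}))
	// [provider, assume-role], cache only at index 1: full re-auth (-1).
	fmt.Println(findFirstValid([]bool{false, true}))
	// [provider, assume-role], cache only at index 0: index 0.
	fmt.Println(findFirstValid([]bool{true, false}))
	// [provider, permset, assume-role], cache at index 1 (middle): index 1.
	fmt.Println(findFirstValid([]bool{false, true, false}))
}
```

In this model the function can never return the last index, so a start index derived as "returned index + 1" always stays inside the chain.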
There are two independent bugs that both affect EKS ARC runners with IRSA. Both fixes are needed for correct behavior:
- IRSA env var scrubbing: Pod-injected IRSA env vars (`AWS_WEB_IDENTITY_TOKEN_FILE`, `AWS_ROLE_ARN`, `AWS_ROLE_SESSION_NAME`) leak into the subprocess, causing the AWS SDK to prefer web identity token auth over the Atmos-managed credential files. Fix: scrub these vars from the subprocess environment via `PrepareShellEnvironment`.
- Credential chain cache lookup (this fix): `findFirstValidCachedCredentials` returns the last chain index, `fetchCachedCredentials` advances past the chain end, and the `AssumeRole` call is skipped entirely. Fix: skip the last index in cache lookup to force re-authentication.
Alexander confirmed that both fixes together resolved the issue for 9spokes.
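For the env var scrubbing half of the fix, the idea can be illustrated with a short filter over a `KEY=VALUE` environment slice. This is only a sketch of the technique: `scrubIRSA` and `irsaVars` are hypothetical names, and the actual `PrepareShellEnvironment` in Atmos may be structured differently.

```go
package main

import (
	"fmt"
	"strings"
)

// irsaVars lists the pod-injected IRSA variables named in this document.
var irsaVars = map[string]bool{
	"AWS_WEB_IDENTITY_TOKEN_FILE": true,
	"AWS_ROLE_ARN":                true,
	"AWS_ROLE_SESSION_NAME":       true,
}

// scrubIRSA returns a copy of env (entries in KEY=VALUE form) with the IRSA
// variables removed. Hypothetical helper, not the Atmos implementation.
func scrubIRSA(env []string) []string {
	out := make([]string, 0, len(env))
	for _, kv := range env {
		key, _, _ := strings.Cut(kv, "=")
		if irsaVars[key] {
			continue // drop so the SDK does not pick web identity auth
		}
		out = append(out, kv)
	}
	return out
}

func main() {
	env := []string{
		"PATH=/usr/bin",
		"AWS_ROLE_ARN=arn:aws:iam::111111111111:role/example-runner", // example value
		"AWS_WEB_IDENTITY_TOKEN_FILE=/var/run/secrets/token",         // example value
		"AWS_PROFILE=core-artifacts/terraform",
	}
	fmt.Println(scrubIRSA(env))
}
```

A caller would typically pass the filtered slice as the subprocess environment (for example via `exec.Cmd.Env`), leaving the parent process environment untouched.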
Updated TestManager_findFirstValidCachedCredentials in pkg/auth/manager_test.go:
- When both `id1` and `id2` have valid credentials, the function now returns `id1` (second-to-last, index 1) instead of `id2` (last, index 2), forcing re-authentication of the target identity.