Commit dab6d04

Add retries to tests when waiting on assuming new role via EKS Pod Identity (#713)
*Issue #, if available:*

*Description of changes:*

Tests have been failing occasionally with the following error:

```
I0225 13:22:11.375776 22828 credentials.go:1212] Waiting until IAM role for ServiceAccount s3-csi-driver-sa is assumable for EKS Pod Identity (s3-csi-node-9xnvd, 86ff74d5-992b-4ae5-a4e2-6c3de483636a, kube-system)
I0225 13:22:11.524430 22828 credentials.go:1156] Unexpected error: <*smithy.OperationError | 0xc000ef2690>: operation error EKS Auth: AssumeRoleForPodIdentity, https response error StatusCode: 404, RequestID: 4a91f663-7fbf-46d1-9188-5c63cccaa7ad, ResourceNotFoundException: The token included in the request has no service account role association for it.
```

This error implies that the EKS Auth service has not yet registered the new association.

This change adds `ResourceNotFoundException` to the list of error codes that trigger an SDK retry in the function that waits for the role to become "assumable".

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Signed-off-by: Daniel Carl Jones <djonesoa@amazon.com>
1 parent 08ee3c6 commit dab6d04

1 file changed

Lines changed: 9 additions & 5 deletions

tests/e2e-kubernetes/testsuites/credentials.go

@@ -58,11 +58,14 @@ const (
 )

 const (
-	eksauthAssumeRoleRetryCode            = "AccessDeniedException"
 	eksauthAssumeRoleRetryMaxAttempts     = 0 // This will cause SDK to retry indefinitely, but we do have a timeout on the operation
 	eksauthAssumeRoleRetryMaxBackoffDelay = 10 * time.Second
 )

+var (
+	eksauthAssumeRoleRetryErrorCodes = []string{"AccessDeniedException", "ResourceNotFoundException"}
+)
+
 const (
 	iamListAttachedRolePoliciesTimeout = 1 * time.Minute
 	iamListAttachedRolePoliciesPolling = 5 * time.Second
@@ -1139,9 +1142,10 @@ func assumeRole(ctx context.Context, f *framework.Framework, roleArn string) *st
 	})
 }

-// waitUntilRoleIsAssumable waits until the given role is assumable.
+// Waits until the given role is assumable.
+//
 // This is needed because we're creating new roles in our test cases and then trying to assume those roles,
-// but there is a delay between IAM and STS services and newly created roles/policies does not appear on STS immediately.
+// but there's a delay between IAM and the token service (STS or EKS Auth) resulting in errors such as "access denied" or "not found".
 func waitUntilRoleIsAssumable[Input any, Output any, O any](
 	ctx context.Context,
 	assumeFunc func(context.Context, *Input, ...func(O)) (*Output, error),
@@ -1176,7 +1180,7 @@ func waitUntilRoleIsAssumableEKS[Input any, Output any](
 	input *Input,
 ) *Output {
 	return waitUntilRoleIsAssumable(ctx, assumeFunc, input, func(o *eksauth.Options) {
-		o.Retryer = retry.AddWithErrorCodes(o.Retryer, eksauthAssumeRoleRetryCode)
+		o.Retryer = retry.AddWithErrorCodes(o.Retryer, eksauthAssumeRoleRetryErrorCodes...)
 		o.Retryer = retry.AddWithMaxAttempts(o.Retryer, eksauthAssumeRoleRetryMaxAttempts)
 		o.Retryer = retry.AddWithMaxBackoffDelay(o.Retryer, eksauthAssumeRoleRetryMaxBackoffDelay)
 	})
@@ -1204,7 +1208,7 @@ func waitUntilRoleIsAssumableWithWebIdentity(ctx context.Context, f *framework.F
 }

 func waitUntilRoleIsAssumableWithEKS(ctx context.Context, f *framework.Framework, sa *v1.ServiceAccount, pod *v1.Pod) {
-	// If you're seeing the following error, then it means you've made a typo in the cluster name when running the tests!
+	// If you see the following error, it may mean you've made a typo in the cluster name or the role is being assumed too quickly.
 	// [FAILED] operation error EKS Auth: AssumeRoleForPodIdentity, https response error StatusCode: 404, RequestID:
 	// ResourceNotFoundException: The token included in the request has no service account role association for it.
