Description
Acknowledgements
- I have searched (https://github.com/aws/aws-sdk/issues?q=is%3Aissue) for past instances of this issue
- I have verified all of my SDK modules are up-to-date (you can perform a bulk update with go get -u github.com/aws/aws-sdk-go-v2/...)
Describe the bug
I have two versions of an application running on AWS EKS alongside Istio sidecar containers.
The application often boots faster than the Istio sidecar, so it usually gets several "connection refused" errors from AWS APIs (and from other APIs) while the sidecar is initializing.
The older version of the application uses older versions of the SDK modules and runs flawlessly, withstanding those "connection refused" errors.
The new version of the application uses newer versions of the SDK modules and consistently fails to withstand the "connection refused" errors. It produces many errors like this:
2024/03/16 00:19:09 AwsConfig: GetCallerIdentity: error: operation error STS: GetCallerIdentity, get identity: get credentials: failed to refresh cached credentials, failed to retrieve credentials, operation error STS: AssumeRoleWithWebIdentity, exceeded maximum number of attempts, 3, https response error StatusCode: 0, RequestID: , request send failed, Post "https://sts.sa-east-1.amazonaws.com/": dial tcp 10.25.8.91:443: connect: connection refused
Two interesting notes:
1 - The Istio sidecar takes only a couple of seconds to boot (much less than the SDK's default 20-second maximum backoff), so I would not expect the SDK to give up and report "exceeded maximum number of attempts, 3". This leads me to suspect that the SDK is not retrying properly (with up to 20 seconds of backoff) on "connection refused" errors.
2 - The application uses the default retrier; it does NOT perform any customization.
In fact, both versions of the application (old and new) initialize SDK clients using this exact code: https://github.com/udhos/boilerplate/blob/main/awsconfig/aws.go#L37
The older application with GOOD results uses these versions:
github.com/aws/aws-sdk-go-v2 v1.21.2 // indirect
github.com/aws/aws-sdk-go-v2/aws/protocol/eventstream v1.4.14 // indirect
github.com/aws/aws-sdk-go-v2/config v1.19.1 // indirect
github.com/aws/aws-sdk-go-v2/credentials v1.13.43 // indirect
github.com/aws/aws-sdk-go-v2/feature/dynamodb/attributevalue v1.10.43 // indirect
github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.13.13 // indirect
github.com/aws/aws-sdk-go-v2/internal/configsources v1.1.43 // indirect
github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.4.37 // indirect
github.com/aws/aws-sdk-go-v2/internal/ini v1.3.45 // indirect
github.com/aws/aws-sdk-go-v2/internal/v4a v1.1.6 // indirect
github.com/aws/aws-sdk-go-v2/service/dynamodb v1.23.0 // indirect
github.com/aws/aws-sdk-go-v2/service/dynamodbstreams v1.15.7 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/accept-encoding v1.9.15 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/checksum v1.1.38 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/endpoint-discovery v1.7.37 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/presigned-url v1.9.37 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/s3shared v1.15.6 // indirect
github.com/aws/aws-sdk-go-v2/service/lambda v1.41.0 // indirect
github.com/aws/aws-sdk-go-v2/service/s3 v1.40.2 // indirect
github.com/aws/aws-sdk-go-v2/service/secretsmanager v1.21.6 // indirect
github.com/aws/aws-sdk-go-v2/service/ssm v1.40.0 // indirect
github.com/aws/aws-sdk-go-v2/service/sso v1.15.2 // indirect
github.com/aws/aws-sdk-go-v2/service/ssooidc v1.17.3 // indirect
github.com/aws/aws-sdk-go-v2/service/sts v1.23.2 // indirect
I first found this issue with a newer version of the application running these versions:
github.com/aws/aws-sdk-go-v2 v1.24.1 // indirect
github.com/aws/aws-sdk-go-v2/aws/protocol/eventstream v1.5.4 // indirect
github.com/aws/aws-sdk-go-v2/config v1.26.6 // indirect
github.com/aws/aws-sdk-go-v2/credentials v1.16.16 // indirect
github.com/aws/aws-sdk-go-v2/feature/dynamodb/attributevalue v1.12.17 // indirect
github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.14.11 // indirect
github.com/aws/aws-sdk-go-v2/internal/configsources v1.2.10 // indirect
github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.5.10 // indirect
github.com/aws/aws-sdk-go-v2/internal/ini v1.7.3 // indirect
github.com/aws/aws-sdk-go-v2/internal/v4a v1.2.10 // indirect
github.com/aws/aws-sdk-go-v2/service/dynamodb v1.27.1 // indirect
github.com/aws/aws-sdk-go-v2/service/dynamodbstreams v1.18.7 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/accept-encoding v1.10.4 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/checksum v1.2.10 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/endpoint-discovery v1.8.11 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/presigned-url v1.10.10 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/s3shared v1.16.10 // indirect
github.com/aws/aws-sdk-go-v2/service/lambda v1.49.7 // indirect
github.com/aws/aws-sdk-go-v2/service/s3 v1.48.1 // indirect
github.com/aws/aws-sdk-go-v2/service/secretsmanager v1.26.2 // indirect
github.com/aws/aws-sdk-go-v2/service/ssm v1.45.0 // indirect
github.com/aws/aws-sdk-go-v2/service/sso v1.18.7 // indirect
github.com/aws/aws-sdk-go-v2/service/ssooidc v1.21.7 // indirect
github.com/aws/aws-sdk-go-v2/service/sts v1.26.7 // indirect
Then I tried updating the newer application to the latest SDK module versions, but they produce the same effect (choking on transient "connection refused" errors).
Expected Behavior
The new application with newer SDK versions should withstand "connection refused" errors by retrying, as the previous application with previous SDK versions did.
Current Behavior
The new application with the recent SDK fails to call AWS APIs at boot, reporting:
2024/03/16 00:19:09 AwsConfig: GetCallerIdentity: error: operation error STS: GetCallerIdentity, get identity: get credentials: failed to refresh cached credentials, failed to retrieve credentials, operation error STS: AssumeRoleWithWebIdentity, exceeded maximum number of attempts, 3, https response error StatusCode: 0, RequestID: , request send failed, Post "https://sts.sa-east-1.amazonaws.com/": dial tcp 10.25.8.91:443: connect: connection refused
Reproduction Steps
I don't know how to properly set up a lab environment where the SDK hits "connection refused" for some seconds before succeeding.
I am seeing this issue on a live EKS cluster.
Possible Solution
I suppose the application code could be tweaked to re-create the SDK clients explicitly, rebuilding the retry attempts from outside the SDK, but that seems like a lot of duplicated effort, since the SDK provides a built-in retrier that should work.
Additional Information/Context
No response
AWS Go SDK V2 Module Versions Used
I attempted to upgrade to these versions but got the same result.
github.com/aws/aws-sdk-go-v2 v1.25.3
github.com/aws/aws-sdk-go-v2/config v1.27.7
github.com/aws/aws-sdk-go-v2/credentials v1.17.7
github.com/aws/aws-sdk-go-v2/feature/dynamodb/attributevalue v1.13.9
github.com/aws/aws-sdk-go-v2/service/dynamodb v1.30.4
github.com/aws/aws-sdk-go-v2/service/lambda v1.53.2
github.com/aws/aws-sdk-go-v2/service/s3 v1.52.1
github.com/aws/aws-sdk-go-v2/service/secretsmanager v1.28.3
github.com/aws/aws-sdk-go-v2/service/ssm v1.49.3
github.com/aws/aws-sdk-go-v2/service/sts v1.28.4
github.com/aws/aws-sdk-go-v2/aws/protocol/eventstream v1.6.1 // indirect
github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.15.3 // indirect
github.com/aws/aws-sdk-go-v2/internal/configsources v1.3.3 // indirect
github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.6.3 // indirect
github.com/aws/aws-sdk-go-v2/internal/ini v1.8.0 // indirect
github.com/aws/aws-sdk-go-v2/internal/v4a v1.3.3 // indirect
github.com/aws/aws-sdk-go-v2/service/dynamodbstreams v1.20.2 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/accept-encoding v1.11.1 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/checksum v1.3.5 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/endpoint-discovery v1.9.4 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/presigned-url v1.11.5 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/s3shared v1.17.3 // indirect
github.com/aws/aws-sdk-go-v2/service/sso v1.20.2 // indirect
github.com/aws/aws-sdk-go-v2/service/ssooidc v1.23.2 // indirect
Compiler and Version used
go version go1.22.1 linux/amd64
Operating System and version
Linux 5.10 on amd64 on EC2 on AWS EKS