Recent SDK retrier not retrying "connection refused" properly? #2560

Closed
@udhos

Description

Describe the bug

I have two versions of an application running on AWS EKS along with istio sidecar containers.

The application often boots faster than the istio sidecar, so it usually hits several "connection refused" errors from AWS APIs (and from other APIs) while the sidecar is initializing.

The older version of the application uses older versions of the SDK modules and runs flawlessly, withstanding those "connection refused" errors.

The new version of the application uses newer versions of the SDK modules and is consistently failing to withstand the "connection refused" errors. It produces lots of errors like this:

2024/03/16 00:19:09 AwsConfig: GetCallerIdentity: error: operation error STS: GetCallerIdentity, get identity: get credentials: failed to refresh cached credentials, failed to retrieve credentials, operation error STS: AssumeRoleWithWebIdentity, exceeded maximum number of attempts, 3, https response error StatusCode: 0, RequestID: , request send failed, Post "https://sts.sa-east-1.amazonaws.com/": dial tcp 10.25.8.91:443: connect: connection refused

Two interesting notes:

1 - The istio sidecar takes only a couple of seconds to boot (well under the SDK's default 20-second maximum backoff). So I would not expect the SDK to give up and report "exceeded maximum number of attempts, 3". This led me to suspect the SDK is not retrying properly (with the 20-second backoff) on "connection refused" errors.

2 - The application uses the default retrier; it does NOT perform any customization.
In fact, both versions of the application (old and new) initialize SDK clients using this exact code: https://github.com/udhos/boilerplate/blob/main/awsconfig/aws.go#L37

The older application with GOOD results uses these versions:

github.com/aws/aws-sdk-go-v2 v1.21.2 // indirect
github.com/aws/aws-sdk-go-v2/aws/protocol/eventstream v1.4.14 // indirect
github.com/aws/aws-sdk-go-v2/config v1.19.1 // indirect
github.com/aws/aws-sdk-go-v2/credentials v1.13.43 // indirect
github.com/aws/aws-sdk-go-v2/feature/dynamodb/attributevalue v1.10.43 // indirect
github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.13.13 // indirect
github.com/aws/aws-sdk-go-v2/internal/configsources v1.1.43 // indirect
github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.4.37 // indirect
github.com/aws/aws-sdk-go-v2/internal/ini v1.3.45 // indirect
github.com/aws/aws-sdk-go-v2/internal/v4a v1.1.6 // indirect
github.com/aws/aws-sdk-go-v2/service/dynamodb v1.23.0 // indirect
github.com/aws/aws-sdk-go-v2/service/dynamodbstreams v1.15.7 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/accept-encoding v1.9.15 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/checksum v1.1.38 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/endpoint-discovery v1.7.37 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/presigned-url v1.9.37 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/s3shared v1.15.6 // indirect
github.com/aws/aws-sdk-go-v2/service/lambda v1.41.0 // indirect
github.com/aws/aws-sdk-go-v2/service/s3 v1.40.2 // indirect
github.com/aws/aws-sdk-go-v2/service/secretsmanager v1.21.6 // indirect
github.com/aws/aws-sdk-go-v2/service/ssm v1.40.0 // indirect
github.com/aws/aws-sdk-go-v2/service/sso v1.15.2 // indirect
github.com/aws/aws-sdk-go-v2/service/ssooidc v1.17.3 // indirect
github.com/aws/aws-sdk-go-v2/service/sts v1.23.2 // indirect

I first found this issue with a newer version of the application running these versions:

github.com/aws/aws-sdk-go-v2 v1.24.1 // indirect
github.com/aws/aws-sdk-go-v2/aws/protocol/eventstream v1.5.4 // indirect
github.com/aws/aws-sdk-go-v2/config v1.26.6 // indirect
github.com/aws/aws-sdk-go-v2/credentials v1.16.16 // indirect
github.com/aws/aws-sdk-go-v2/feature/dynamodb/attributevalue v1.12.17 // indirect
github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.14.11 // indirect
github.com/aws/aws-sdk-go-v2/internal/configsources v1.2.10 // indirect
github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.5.10 // indirect
github.com/aws/aws-sdk-go-v2/internal/ini v1.7.3 // indirect
github.com/aws/aws-sdk-go-v2/internal/v4a v1.2.10 // indirect
github.com/aws/aws-sdk-go-v2/service/dynamodb v1.27.1 // indirect
github.com/aws/aws-sdk-go-v2/service/dynamodbstreams v1.18.7 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/accept-encoding v1.10.4 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/checksum v1.2.10 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/endpoint-discovery v1.8.11 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/presigned-url v1.10.10 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/s3shared v1.16.10 // indirect
github.com/aws/aws-sdk-go-v2/service/lambda v1.49.7 // indirect
github.com/aws/aws-sdk-go-v2/service/s3 v1.48.1 // indirect
github.com/aws/aws-sdk-go-v2/service/secretsmanager v1.26.2 // indirect
github.com/aws/aws-sdk-go-v2/service/ssm v1.45.0 // indirect
github.com/aws/aws-sdk-go-v2/service/sso v1.18.7 // indirect
github.com/aws/aws-sdk-go-v2/service/ssooidc v1.21.7 // indirect
github.com/aws/aws-sdk-go-v2/service/sts v1.26.7 // indirect

Then I tried to update the newer application to the latest SDK module versions, but they produce the same effect (choking on transient "connection refused" errors).

Expected Behavior

The new application with newer SDK versions would properly withstand "connection refused" errors by retrying, much like the previous application with previous SDK versions did.

Current Behavior

New application with recent SDK fails with AWS APIs at boot, reporting:

2024/03/16 00:19:09 AwsConfig: GetCallerIdentity: error: operation error STS: GetCallerIdentity, get identity: get credentials: failed to refresh cached credentials, failed to retrieve credentials, operation error STS: AssumeRoleWithWebIdentity, exceeded maximum number of attempts, 3, https response error StatusCode: 0, RequestID: , request send failed, Post "https://sts.sa-east-1.amazonaws.com/": dial tcp 10.25.8.91:443: connect: connection refused

Reproduction Steps

I don't know how to properly set up a lab environment where the SDK hits "connection refused" for some seconds before succeeding.
I am seeing this issue on a live EKS cluster.

Possible Solution

I suppose the application code could be tweaked to re-create the SDK clients explicitly, rebuilding the retry attempts from outside the SDK, but that seems like a lot of duplicated effort, since the SDK provides a built-in retrier that should work.

Additional Information/Context

No response

AWS Go SDK V2 Module Versions Used

I attempted to upgrade to these versions but got the same result.

github.com/aws/aws-sdk-go-v2 v1.25.3
github.com/aws/aws-sdk-go-v2/config v1.27.7
github.com/aws/aws-sdk-go-v2/credentials v1.17.7
github.com/aws/aws-sdk-go-v2/feature/dynamodb/attributevalue v1.13.9
github.com/aws/aws-sdk-go-v2/service/dynamodb v1.30.4
github.com/aws/aws-sdk-go-v2/service/lambda v1.53.2
github.com/aws/aws-sdk-go-v2/service/s3 v1.52.1
github.com/aws/aws-sdk-go-v2/service/secretsmanager v1.28.3
github.com/aws/aws-sdk-go-v2/service/ssm v1.49.3
github.com/aws/aws-sdk-go-v2/service/sts v1.28.4
github.com/aws/aws-sdk-go-v2/aws/protocol/eventstream v1.6.1 // indirect
github.com/aws/aws-sdk-go-v2/feature/ec2/imds v1.15.3 // indirect
github.com/aws/aws-sdk-go-v2/internal/configsources v1.3.3 // indirect
github.com/aws/aws-sdk-go-v2/internal/endpoints/v2 v2.6.3 // indirect
github.com/aws/aws-sdk-go-v2/internal/ini v1.8.0 // indirect
github.com/aws/aws-sdk-go-v2/internal/v4a v1.3.3 // indirect
github.com/aws/aws-sdk-go-v2/service/dynamodbstreams v1.20.2 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/accept-encoding v1.11.1 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/checksum v1.3.5 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/endpoint-discovery v1.9.4 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/presigned-url v1.11.5 // indirect
github.com/aws/aws-sdk-go-v2/service/internal/s3shared v1.17.3 // indirect
github.com/aws/aws-sdk-go-v2/service/sso v1.20.2 // indirect
github.com/aws/aws-sdk-go-v2/service/ssooidc v1.23.2 // indirect

Compiler and Version used

go version go1.22.1 linux/amd64

Operating System and version

Linux 5.10 on amd64 on EC2 on AWS EKS

Labels

closing-soon: This issue will automatically close in 4 days unless further comments are made.