Skip to content

Dial fails when function is suspended during eager refresh #982

@devnev

Description

@devnev

Bug Description

I'm experiencing periodic dial errors from cloudsqlconn running in Vercel function.

The function in question experiences near-zero traffic during low periods, except for a cron ping every 5 minutes. When traffic is limited to just these 5-minute pings, the pings fail on db connection errors at multiples of hourly intervals (i.e. sometimes just an hour between, sometimes 2 or 3 hours).

I've added all the logging and tracing, and this is what I get from cloudsqlconn & pgx logs:

START RequestId: c2a894e2-29bb-4d2e-a657-432e881cb461
2025/05/11 10:10:00 DEBUG: cloudsqlconn: [redacted] Connection info refresh operation started
END RequestId: c2a894e2-29bb-4d2e-a657-432e881cb461
START RequestId: 45f7d3c1-0c86-42bb-a4c9-c0630725b5a9
2025/05/11 10:15:00 DEBUG: cloudsqlconn: [redacted] Connection info refresh operation failed, err = refresh failed: context deadline exceeded
2025/05/11 10:15:00 DEBUG: cloudsqlconn: [redacted] Connection info refresh operation scheduled immediately
2025/05/11 10:15:00 DEBUG: cloudsqlconn: [redacted] Connection info refresh operation started
2025/05/11 10:15:00 DEBUG: cloudsqlconn: [redacted] Removing connection info from cache: refresh failed: context deadline exceeded
2025/05/11 10:15:00 DEBUG: cloudsqlconn: [redacted] Connection info refresh operation failed, err = refresh failed: context canceled
2025/05/11 10:15:00 DEBUG: cloudsqlconn: [redacted] Connection info refresh operation scheduled immediately
2025/05/11 10:15:00 DEBUG: cloudsqlconn: [redacted] Instance is closed, stopping refresh operations
2025/05/11 10:15:00 ERROR [169.254.26.129/iTpgPaTqbJ-000021]: Connect map["database":"redacted" "err":"failed to connect to `redacted`: [redacted]:5432 (): dial error: timeout: context deadline exceeded" "host":"" "port":'ᔸ' "time":"146.621µs"]
END RequestId: 45f7d3c1-0c86-42bb-a4c9-c0630725b5a9

The client's retry request then successfully connects:

START RequestId: 1d5f156e-e6c7-4f52-b6cd-77b5fc322ed0
2025/05/11 10:15:01 DEBUG: cloudsqlconn: [redacted] Connection info added to cache
2025/05/11 10:15:01 DEBUG: cloudsqlconn: [redacted] Connection info refresh operation started
2025/05/11 10:15:02 DEBUG: cloudsqlconn: [redacted] Connection info refresh operation complete
2025/05/11 10:15:02 DEBUG: cloudsqlconn: [redacted] Current certificate expiration = 2025-05-11T11:10:00Z
2025/05/11 10:15:02 DEBUG: cloudsqlconn: [redacted] Connection info refresh operation scheduled at 2025-05-11T11:06:00Z (now + 51m0s)
2025/05/11 10:15:02 DEBUG: cloudsqlconn: [redacted] Now = 2025-05-11T10:15:02Z, Current cert expiration = 2025-05-11T11:10:00Z
2025/05/11 10:15:02 DEBUG: cloudsqlconn: [redacted] Cert is valid = true
2025/05/11 10:15:02 DEBUG: cloudsqlconn: [redacted] Dialing redacted:3307
END RequestId: 1d5f156e-e6c7-4f52-b6cd-77b5fc322ed0

It looks to me like the function is suspended between the first two requests, causing the refresh to fail, and then the subsequent immediate refresh during the second request also fails.

I'm going to switch this to use the lazy refresher to try mitigating, but this feels like something the eager refresher should be able to handle.

Dialer created with

			tokenSource := oauth2adapt.TokenSourceFromTokenProvider(creds)
			d, err := cloudsqlconn.NewDialer(
				context.Background(),
				cloudsqlconn.WithDebugLogger(cloudConnLogger{}),
				// This option might not be necessary as it looks like it's superceded by WithIAMAuthNTokenSources, but
				// no harm in setting it.
				cloudsqlconn.WithTokenSource(tokenSource),
				cloudsqlconn.WithIAMAuthN(),
				cloudsqlconn.WithIAMAuthNTokenSources(tokenSource, tokenSource),
			)

Environment

  1. OS type and version: Vercel deployment so presumably Linux
  2. Go version: 1.24
  3. Cloud SQL Go Connector version: from go.mod:
cloud.google.com/go/cloudsqlconn v1.16.1 h1:0DaCn076Qr7H5TsHm65k/fzjAv4dNUQB2fq9ULzmXMU=
cloud.google.com/go/cloudsqlconn v1.16.1/go.mod h1:T6G4pmABtHqRyTF8M+oUe1KELh6Rs4PVSngAZDBz3zA=

Additional Details

No response

Metadata

Metadata

Assignees

Labels

priority: p2Moderately-important priority. Fix may not be included in next release.type: bugError or flaw in code with unintended results or allowing sub-optimal usage patterns.

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions