Feature Description
When running `alloydb-auth-proxy` as a sidecar in Cloud Run Jobs with `container-dependencies`, there is no reliable way to ensure the proxy can actually reach AlloyDB before the dependent container starts.
Currently, both `/startup` and `/readiness` return `200` as soon as `NotifyStarted()` is called, which happens right after the local TCP listener starts and before any actual connection to AlloyDB is established. In environments using PSC, the network path to AlloyDB can take 10+ seconds to become available after container startup, causing the dependent application container to fail with connection timeouts.
`--run-connection-test` partially addresses this, but it exits immediately on failure with no retry, making it unsuitable for sidecar use cases where transient delays are expected.
A flag (e.g., `--startup-connection-check`) that makes `Serve()` verify actual AlloyDB connectivity (with retries) before calling `notify()` would solve this problem. It would ensure `/startup` and `/readiness` only return `200` after the proxy can actually forward connections to AlloyDB.
Sample code
In `internal/proxy/proxy.go`, the change would be roughly:

```go
func (c *Client) Serve(ctx context.Context, notify func()) error {
	// ... existing listener setup ...

	// New: if the startup connection check is enabled, retry until AlloyDB is reachable.
	if c.conf.StartupConnectionCheck {
		c.logger.Infof("Startup connection check: verifying backend connectivity...")
		for {
			if _, err := c.CheckConnections(ctx); err != nil {
				c.logger.Infof("Startup connection check: not ready, retrying in 1s: %v", err)
				select {
				case <-ctx.Done():
					return ctx.Err()
				case <-time.After(1 * time.Second):
					continue
				}
			}
			c.logger.Infof("Startup connection check: backend is reachable")
			break
		}
	}

	notify() // /startup and /readiness now return 200
	return <-exitCh
}
```
Alternatives Considered
| Workaround | Limitation |
| --- | --- |
| `--run-connection-test` | Exits on first failure with no retry. Unusable as a sidecar when transient connectivity delays (e.g., PSC cold start) are expected. |
| Switching the startup probe from `/startup` to `/readiness` | No difference: both return 200 at the same point (`NotifyStarted()`). Note: the `/readiness` documentation says "when the proxy can connect to all registered instances", but the implementation does not actually check this. |
| Increasing application-side `connect_timeout` / `pool_timeout` | A band-aid that does not prevent the race condition, and it requires changes in every application using the proxy. |
Additional Details
- Running v1.13.7. Also reviewed the v1.14.1 source code and confirmed the same behavior.
- Connectivity method: PSC (`--psc`)
- Runtime: Cloud Run Jobs with the `container-dependencies` annotation
- Auth: `--auto-iam-authn` with `--impersonate-service-account`
- The `/readiness` endpoint documentation states it checks "when the proxy can connect to all registered instances", but the implementation does not perform any backend connectivity check. This may be a separate documentation issue.