Skip to content

Option to verify backend connectivity before health check endpoints return 200 #909

@dekokun

Description

@dekokun

Feature Description

When running alloydb-auth-proxy as a sidecar in Cloud Run Jobs with container-dependencies, there is no reliable way to ensure the proxy can actually reach AlloyDB before the dependent container starts.

Currently, both /startup and /readiness return 200 as soon as NotifyStarted() is called -- which happens right after the local TCP listener starts, before any actual connection to AlloyDB is established. In environments using PSC, the network path to AlloyDB can take 10+ seconds to become available after container startup, causing the dependent application container to fail with connection timeouts.

--run-connection-test partially addresses this but exits immediately on failure with no retry, making it unsuitable for sidecar use cases where transient delays are expected.

A flag (e.g., --startup-connection-check) that makes Serve() verify actual AlloyDB connectivity (with retries) before calling notify() would solve this problem. This would ensure /startup and /readiness only return 200 after the proxy can actually forward connections to AlloyDB.

Sample code

In internal/proxy/proxy.go, the change would be roughly:

func (c *Client) Serve(ctx context.Context, notify func()) error {
    // ... existing listener setup ...

    // New: if startup connection check is enabled, retry until AlloyDB is reachable
    if c.conf.StartupConnectionCheck {
        c.logger.Infof("Startup connection check: verifying backend connectivity...")
        for {
            if _, err := c.CheckConnections(ctx); err != nil {
                c.logger.Infof("Startup connection check: not ready, retrying in 1s: %v", err)
                select {
                case <-ctx.Done():
                    return ctx.Err()
                case <-time.After(1 * time.Second):
                    continue
                }
            }
            c.logger.Infof("Startup connection check: backend is reachable")
            break
        }
    }

    notify() // /startup and /readiness now return 200
    return <-exitCh
}

Alternatives Considered

Workaround Limitation
--run-connection-test Exits on first failure with no retry. Unusable as a sidecar when transient connectivity delays (e.g., PSC cold-start) are expected.
Switching startup probe from /startup to /readiness No difference -- both return 200 at the same timing (NotifyStarted()). Note: the /readiness documentation says "when the proxy can connect to all registered instances" but the implementation does not actually check this.
Increasing application-side connect_timeout / pool_timeout Band-aid that does not prevent the race condition. Requires changes in every application using the proxy.

Additional Details

  • Running v1.13.7. Also reviewed the v1.14.1 source code and confirmed the same behavior.
  • Connectivity method: PSC (--psc)
  • Runtime: Cloud Run Jobs with container-dependencies annotation
  • Auth: --auto-iam-authn with --impersonate-service-account
  • The /readiness endpoint documentation states it checks "when the proxy can connect to all registered instances" but the implementation does not perform any backend connectivity check. This may be a separate documentation issue.

Metadata

Metadata

Assignees

Labels

type: feature request‘Nice-to-have’ improvement, new feature or different behavior or design.

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions