Option to verify backend connectivity before health check endpoints return 200

### Feature Description

When running `alloydb-auth-proxy` as a sidecar in Cloud Run Jobs with [`container-dependencies`](https://cloud.google.com/run/docs/configuring/services/containers#container-dependencies), there is no reliable way to ensure the proxy can actually reach AlloyDB before the dependent container starts.

Currently, both `/startup` and `/readiness` return `200` as soon as `NotifyStarted()` is called -- which happens right after the local TCP listener starts, **before any actual connection to AlloyDB is established**. In environments using PSC, the network path to AlloyDB can take 10+ seconds to become available after container startup, causing the dependent application container to fail with connection timeouts.

`--run-connection-test` partially addresses this but exits immediately on failure with no retry, making it unsuitable for sidecar use cases where transient delays are expected.

A flag (e.g., `--startup-connection-check`) that makes `Serve()` verify actual AlloyDB connectivity (with retries) before calling `notify()` would solve this problem. This would ensure `/startup` and `/readiness` only return `200` after the proxy can actually forward connections to AlloyDB.


### Sample code

In `internal/proxy/proxy.go`, the change would be roughly:

```go
func (c *Client) Serve(ctx context.Context, notify func()) error {
    // ... existing listener setup ...

    // New: if startup connection check is enabled, retry until AlloyDB is reachable
    if c.conf.StartupConnectionCheck {
        c.logger.Infof("Startup connection check: verifying backend connectivity...")
        for {
            if _, err := c.CheckConnections(ctx); err != nil {
                c.logger.Infof("Startup connection check: not ready, retrying in 1s: %v", err)
                select {
                case <-ctx.Done():
                    return ctx.Err()
                case <-time.After(1 * time.Second):
                    continue
                }
            }
            c.logger.Infof("Startup connection check: backend is reachable")
            break
        }
    }

    notify() // /startup and /readiness now return 200
    return <-exitCh
}
```



### Alternatives Considered

| Workaround | Limitation |
|---|---|
| `--run-connection-test` | Exits on first failure with no retry. Unusable as a sidecar when transient connectivity delays (e.g., PSC cold-start) are expected. |
| Switching startup probe from `/startup` to `/readiness` | No difference -- both return 200 at the same timing (`NotifyStarted()`). Note: the `/readiness` documentation says "when the proxy can connect to all registered instances" but the implementation does not actually check this. |
| Increasing application-side `connect_timeout` / `pool_timeout` | Band-aid that does not prevent the race condition. Requires changes in every application using the proxy. |


### Additional Details

- Running v1.13.7. Also reviewed the v1.14.1 source code and confirmed the same behavior.
- Connectivity method: PSC (`--psc`)
- Runtime: Cloud Run Jobs with `container-dependencies` annotation
- Auth: `--auto-iam-authn` with `--impersonate-service-account`
- The `/readiness` endpoint documentation states it checks "when the proxy can connect to all registered instances" but the [implementation](https://github.com/GoogleCloudPlatform/alloydb-auth-proxy/blob/v1.13.7/internal/healthcheck/healthcheck.go#L88-L119) does not perform any backend connectivity check. This may be a separate documentation issue.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Option to verify backend connectivity before health check endpoints return 200 #909

Feature Description

Sample code

Alternatives Considered

Additional Details

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Workaround	Limitation
`--run-connection-test`	Exits on first failure with no retry. Unusable as a sidecar when transient connectivity delays (e.g., PSC cold-start) are expected.
Switching startup probe from `/startup` to `/readiness`	No difference -- both return 200 at the same timing (`NotifyStarted()`). Note: the `/readiness` documentation says "when the proxy can connect to all registered instances" but the implementation does not actually check this.
Increasing application-side `connect_timeout` / `pool_timeout`	Band-aid that does not prevent the race condition. Requires changes in every application using the proxy.

Option to verify backend connectivity before health check endpoints return 200 #909

Description

Feature Description

Sample code

Alternatives Considered

Additional Details

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions