fix(controlplane): raise cold-spawn worker connect budget 90s → 3m (fixes e2e worker-reap flakiness) by EDsCODE · Pull Request #753 · PostHog/duckgres

EDsCODE · 2026-06-10T18:58:17Z

Summary

Raises the cold-spawn worker connect budget from 90s to 3 minutes, fixing the e2e flakiness where the control plane reaps a healthy-but-still-warming worker.

This is the root cause of the recurring e2e failures across PRs since ~16:40 UTC today (acquire worker: spawn sized worker: ... timeout connecting to worker at <ip>:8816 ... DeadlineExceeded ... attempts: 37).

Root cause

A freshly-spawned worker's gRPC health handler deliberately blocks until warmup completes — duckdbservice/flight_handler.go: <-h.pool.warmupDone — so the CP won't route to a worker that hasn't loaded extensions + done the DuckLake ATTACH (httpfs/S3 + metadata). Meanwhile:

waitForPodReady returns the instant the pod is Running+IP (~1.5s) — there is no readiness probe gating on warmup — so the connect budget (waitForWorkerTCP), not the 5-minute workerPodReadyTimeout, is what absorbs the entire attach.
That budget was 90s, which in practice is a loop of about 37 attempts, each a health RPC that times out after ~2s.

Under a burst of concurrent cold spawns — the e2e harness's four parallel per-org lanes (#747), on workers packed tighter after #745/#746 — the DuckLake ATTACH contends on S3/metadata and routinely exceeds 90s. Every health attempt times out, the CP reaps the worker mid-attach, the session fails, and with all four lanes hitting it the whole run fails. It's load-dependent, hence intermittent (passes in low-contention windows).

Diagnosed live on mw-dev

Workers reach Running and bind :8816 in ~1.5s, then hang right after Loaded extension ducklake and are reaped before ever logging pre-warmed successfully.
duckgres_worker_acquire_phase_seconds{phase="spawn"} never records a completion — only gate_wait shows requests entering the spawn phase (5 in-flight, 0 completing).
Not node-freshness (failing worker shared a warm node with workers that connected), not a Cilium drop (none captured; cluster health 35/35), not any PR's code (a branch without the latest main fails identically; the worker binary never runs the changed transpiler/catalog code).

Change

New named constant workerSpawnConnectTimeout = 3m; waitForWorkerTCP in the cold-spawn path uses it instead of the magic 90s. 3m gives comfortable headroom over a contended-but-progressing attach while staying well under the engine-side cap (attachMigrateStatementTimeout, 15m).
The hot-idle reuse path keeps its 30s budget (those workers are already warm).
Crash detection is unaffected — dead/crashed workers are caught independently by the pod informer / PodFailed path, not this health loop.
workerSpawnActivateTimeout is now defined as the sum of its phases (pod-ready + connect + activate) so the three deadlines can't drift out of sync.
Regression tests guard both invariants.

Follow-up (not in this PR — platform/test owners)

This makes the CP tolerant of slow attach; the underlying slowness is the concurrent-attach contention introduced by #747/#745/#746. Worth doing: stagger the lanes' worker demand and/or restore worker CPU headroom, and confirm that warmup isn't doing avoidable network I/O.

🤖 Generated with Claude Code

A freshly-spawned worker's gRPC health handler blocks on warmupDone (duckdbservice doHealthCheck) until extension load + the DuckLake ATTACH (httpfs/S3 + metadata) finish, so the control plane can't route to a not-yet-attached worker. waitForPodReady returns the instant the pod is Running+IP (~1.5s; no readiness probe gates on warmup), which means the connect budget in waitForWorkerTCP, not the 5m pod-ready timeout, is what must absorb the whole attach. That budget was 90s.

Under a burst of concurrent cold spawns (the e2e harness's parallel per-org lanes, #747) the ATTACH contends on S3/metadata and routinely exceeds 90s, so every health attempt (37 x ~2s) times out and the CP reaps a healthy-but-still-attaching worker, failing the session and, across all four lanes, the whole e2e run.

Diagnosed live on mw-dev: workers reach Running and bind :8816 in ~1.5s, hang right after 'Loaded extension ducklake', and are reaped before logging 'pre-warmed successfully'; the duckgres_worker_acquire_phase_seconds{phase=spawn} histogram never records a completion.

Raise the cold-spawn connect budget to 3m (a named constant, workerSpawnConnectTimeout), well under the engine-side attach cap (attachMigrateStatementTimeout, 15m). The hot-idle reuse path keeps its 30s budget (those workers are already warm). Crash detection is unaffected: dead workers are caught independently by the pod informer / PodFailed path. workerSpawnActivateTimeout is now defined as the sum of its phases (pod-ready + connect + activate) so the deadlines can't drift.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

EDsCODE · 2026-06-10T19:24:50Z

Diagnosis correction after deeper investigation (full repro with the actual worker images from ECR):

The "attach contention under parallel lanes" theory in the PR description was wrong. The real chain behind the e2e worker-reaps on #706 was:

A catalog macro added in #706 (feat(compat): PostgreSQL builtin-compatibility macros + transforms (48 functions)), inet_server_addr() AS CAST(NULL AS INET), referenced the INET type, which lives in DuckDB's non-statically-linked inet extension.
Worker warmup runs initPgCatalog (via ConfigureDBConnection), so creating that macro triggered DuckDB extension autoinstall at warmup — fetching http://extensions.duckdb.org/... over plain HTTP port 80.
The worker egress CNP allows world :443/:5432 only → the port-80 SYN is silently dropped → connect() blocks ~2 minutes (reproduced).
The worker's health handler blocks on warmupDone, so the CP's 90s connect budget expired and reaped a healthy-but-downloading worker. Deterministically, every spawn.

That's fixed at the source in #706 (macro no longer references INET, plus TestInitPgCatalogIsAirgapSafe which runs the whole catalog init with autoinstall/autoload disabled and fails on any statement needing a non-static extension).

This PR remains valuable as defense-in-depth: the budget asymmetry is real regardless of trigger — waitForPodReady returns at Running+IP, there is no readiness probe gating warmup, and the 90s inner budget must absorb whatever warmup costs (a one-off ~2min stall under the 3m budget here would have degraded one macro instead of failing every session in the run). Two follow-ups worth considering for whoever owns worker config:

Set autoinstall_known_extensions=false / autoload_known_extensions=false (or autoload-from-local-only) on worker DuckDB instances so no runtime statement can ever reach for the network — silent port-80 drops turn a missing extension into a multi-minute hang instead of a clean error.
Either allow or fast-reject (REJECT, not DROP) outbound :80 in the worker egress policy, so anything that does slip through fails in milliseconds rather than minutes.

EDsCODE closed this Jun 10, 2026

EDsCODE wants to merge 1 commit into main from fix/worker-spawn-connect-budget
