fix(controlplane): raise cold-spawn worker connect budget 90s → 3m (fixes e2e worker-reap flakiness) #753
EDsCODE wants to merge 1 commit
A freshly-spawned worker's gRPC health handler blocks on warmupDone (duckdbservice doHealthCheck) until extension load and the DuckLake ATTACH (httpfs/S3 + metadata) finish, so the control plane can't route to a not-yet-attached worker. waitForPodReady returns the instant the pod is Running+IP (~1.5s; no readiness probe gates on warmup), which means the connect budget in waitForWorkerTCP, not the 5m pod-ready timeout, is what must absorb the whole attach. That budget was 90s.

Under a burst of concurrent cold spawns (the e2e harness's parallel per-org lanes, #747) the ATTACH contends on S3/metadata and routinely exceeds 90s, so every health attempt (37 x ~2s) times out and the CP reaps a healthy-but-still-attaching worker. That fails the session and, across all four lanes, the whole e2e run.

Diagnosed live on mw-dev: workers reach Running and bind :8816 in ~1.5s, hang right after 'Loaded extension ducklake', and are reaped before logging 'pre-warmed successfully'; the duckgres_worker_acquire_phase_seconds{phase=spawn} histogram never records a completion.

Raise the cold-spawn connect budget to 3m (a named constant, workerSpawnConnectTimeout), well under the engine-side attach cap (attachMigrateStatementTimeout, 15m). The hot-idle reuse path keeps its 30s budget (those workers are already warm). Crash detection is unaffected: dead workers are caught independently by the pod informer / PodFailed path. workerSpawnActivateTimeout is now defined as the sum of its phases (pod-ready + connect + activate) so the deadlines can't drift.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Diagnosis correction after deeper investigation (full repro with the actual worker images from ECR): The "attach contention under parallel lanes" theory in the PR description was wrong. The real chain behind the e2e worker-reaps on #706 was:
That's fixed at the source in #706 (macro no longer references INET, plus This PR remains valuable as defense-in-depth: the budget asymmetry is real regardless of trigger —
Summary
Raises the cold-spawn worker connect budget from 90s to 3 minutes, fixing the e2e flakiness where the control plane reaps a healthy-but-still-warming worker.
This is the root cause of the recurring `e2e` failures across PRs since ~16:40 UTC today (`acquire worker: spawn sized worker: ... timeout connecting to worker at <ip>:8816 ... DeadlineExceeded ... attempts: 37`).

Root cause
A freshly-spawned worker's gRPC health handler deliberately blocks until warmup completes (`duckdbservice/flight_handler.go`: `<-h.pool.warmupDone`), so the CP won't route to a worker that hasn't loaded extensions and done the DuckLake `ATTACH` (httpfs/S3 + metadata). Meanwhile, `waitForPodReady` returns the instant the pod is `Running`+IP (~1.5s). There is no readiness probe gating on warmup, so the connect budget (`waitForWorkerTCP`), not the 5-minute `workerPodReadyTimeout`, is what absorbs the entire attach.

Under a burst of concurrent cold spawns, that is, the e2e harness's four parallel per-org lanes (#747) on workers packed tighter after #745/#746, the DuckLake ATTACH contends on S3/metadata and routinely exceeds 90s. Every health attempt times out, the CP reaps the worker mid-attach, the session fails, and with all four lanes hitting it the whole run fails. It's load-dependent, hence intermittent (passes in low-contention windows).
Diagnosed live on mw-dev

- Workers reach `Running` and bind `:8816` in ~1.5s, then hang right after `Loaded extension ducklake` and are reaped before ever logging `pre-warmed successfully`.
- `duckgres_worker_acquire_phase_seconds{phase="spawn"}` never records a completion; only `gate_wait` shows requests entering the spawn phase (5 in-flight, 0 completing).

Change
- `workerSpawnConnectTimeout = 3m`; `waitForWorkerTCP` in the cold-spawn path uses it instead of the magic `90s`. 3m gives comfortable headroom over a contended-but-progressing attach while staying well under the engine-side cap (`attachMigrateStatementTimeout`, 15m).
- The hot-idle reuse path keeps its 30s budget (those workers are already warm).
- Crash detection is unaffected: dead workers are caught independently by the pod informer / `PodFailed` path, not this health loop.
- `workerSpawnActivateTimeout` is now defined as the sum of its phases (pod-ready + connect + activate) so the three deadlines can't drift out of sync.

Follow-up (not in this PR; platform/test owners)
This makes the CP tolerant of slow attach; the underlying slowness is the concurrent-attach contention introduced by #747/#745/#746. Worth: staggering the lanes' worker demand and/or restoring worker CPU headroom, and confirming warmup isn't doing avoidable network I/O.
🤖 Generated with Claude Code