fix(controlplane): raise cold-spawn worker connect budget 90s → 3m (fixes e2e worker-reap flakiness)#753

Closed
EDsCODE wants to merge 1 commit into main from fix/worker-spawn-connect-budget
@EDsCODE EDsCODE commented Jun 10, 2026

Contributor

Summary

Raises the cold-spawn worker connect budget from 90s to 3 minutes, fixing the e2e flakiness where the control plane reaps a healthy-but-still-warming worker.

This is the root cause of the recurring e2e failures across PRs since ~16:40 UTC today (acquire worker: spawn sized worker: ... timeout connecting to worker at <ip>:8816 ... DeadlineExceeded ... attempts: 37).

Root cause

A freshly-spawned worker's gRPC health handler deliberately blocks until warmup completes (duckdbservice/flight_handler.go: <-h.pool.warmupDone), so the CP won't route to a worker that hasn't loaded extensions and done the DuckLake ATTACH (httpfs/S3 + metadata). Meanwhile:

  • waitForPodReady returns the instant the pod is Running+IP (~1.5s) — there is no readiness probe gating on warmup — so the connect budget (waitForWorkerTCP), not the 5-minute workerPodReadyTimeout, is what absorbs the entire attach.
  • That budget was 90s (a loop of ~37 attempts × 2s health RPC).

Under a burst of concurrent cold spawns — the e2e harness's four parallel per-org lanes (#747), on workers packed tighter after #745/#746 — the DuckLake ATTACH contends on S3/metadata and routinely exceeds 90s. Every health attempt times out, the CP reaps the worker mid-attach, the session fails, and with all four lanes hitting it the whole run fails. It's load-dependent, hence intermittent (passes in low-contention windows).

Diagnosed live on mw-dev

  • Workers reach Running and bind :8816 in ~1.5s, then hang right after Loaded extension ducklake and are reaped before ever logging pre-warmed successfully.
  • duckgres_worker_acquire_phase_seconds{phase="spawn"} never records a completion — only gate_wait shows requests entering the spawn phase (5 in-flight, 0 completing).
  • Not node-freshness (failing worker shared a warm node with workers that connected), not a Cilium drop (none captured; cluster health 35/35), not any PR's code (a branch without the latest main fails identically; the worker binary never runs the changed transpiler/catalog code).

Change

  • New named constant workerSpawnConnectTimeout = 3m; waitForWorkerTCP in the cold-spawn path uses it instead of the magic 90s. 3m gives comfortable headroom over a contended-but-progressing attach while staying well under the engine-side cap (attachMigrateStatementTimeout, 15m).
  • The hot-idle reuse path keeps its 30s budget (those workers are already warm).
  • Crash detection is unaffected — dead/crashed workers are caught independently by the pod informer / PodFailed path, not this health loop.
  • workerSpawnActivateTimeout is now defined as the sum of its phases (pod-ready + connect + activate) so the three deadlines can't drift out of sync.
  • Regression tests guard both invariants.

Follow-up (not in this PR — platform/test owners)

This makes the CP tolerant of slow attach; the underlying slowness is the concurrent-attach contention introduced by #747/#745/#746. Worth: staggering the lanes' worker demand and/or restoring worker CPU headroom, and confirming warmup isn't doing avoidable network I/O.

🤖 Generated with Claude Code

A freshly-spawned worker's gRPC health handler blocks on warmupDone
(duckdbservice doHealthCheck) until extension load + the DuckLake ATTACH
(httpfs/S3 + metadata) finish — so the control plane can't route to a
not-yet-attached worker. waitForPodReady returns the instant the pod is
Running+IP (~1.5s; no readiness probe gates on warmup), which means the
connect budget in waitForWorkerTCP — not the 5m pod-ready timeout — is what
must absorb the whole attach.

That budget was 90s. Under a burst of concurrent cold spawns (the e2e
harness's parallel per-org lanes, #747) the ATTACH contends on S3/metadata
and routinely exceeds 90s, so every health attempt (37 x ~2s) times out and
the CP reaps a healthy-but-still-attaching worker — failing the session and,
across all four lanes, the whole e2e run. Diagnosed live on mw-dev: workers
reach Running and bind :8816 in ~1.5s, hang right after 'Loaded extension
ducklake', and are reaped before logging 'pre-warmed successfully'; the
duckgres_worker_acquire_phase_seconds{phase=spawn} histogram never records a
completion.

Raise the cold-spawn connect budget to 3m (a named constant,
workerSpawnConnectTimeout), well under the engine-side attach cap
(attachMigrateStatementTimeout, 15m). The hot-idle reuse path keeps its 30s
budget (those workers are already warm). Crash detection is unaffected — dead
workers are caught independently by the pod informer / PodFailed path.
workerSpawnActivateTimeout is now defined as the sum of its phases
(pod-ready + connect + activate) so the deadlines can't drift.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
EDsCODE commented Jun 10, 2026

Contributor Author

Diagnosis correction after deeper investigation (full repro with the actual worker images from ECR):

The "attach contention under parallel lanes" theory in the PR description was wrong. The real chain behind the e2e worker-reaps on #706 was:

  1. A catalog macro added in #706 (feat(compat): PostgreSQL builtin-compatibility macros + transforms (48 functions)), inet_server_addr() AS CAST(NULL AS INET), referenced the INET type, which lives in DuckDB's non-statically-linked inet extension.
  2. Worker warmup runs initPgCatalog (via ConfigureDBConnection), so creating that macro triggered DuckDB extension autoinstall at warmup — fetching http://extensions.duckdb.org/... over plain HTTP port 80.
  3. The worker egress CNP allows world :443/:5432 only → the port-80 SYN is silently dropped → connect() blocks ~2 minutes (reproduced).
  4. The worker's health handler blocks on warmupDone, so the CP's 90s connect budget expired and reaped a healthy-but-downloading worker. Deterministically, every spawn.

That's fixed at the source in #706 (macro no longer references INET, plus TestInitPgCatalogIsAirgapSafe which runs the whole catalog init with autoinstall/autoload disabled and fails on any statement needing a non-static extension).

This PR remains valuable as defense-in-depth: the budget asymmetry is real regardless of trigger — waitForPodReady returns at Running+IP, there is no readiness probe gating warmup, and the 90s inner budget must absorb whatever warmup costs (a one-off ~2min stall under the 3m budget here would have degraded one macro instead of failing every session in the run). Two follow-ups worth considering for whoever owns worker config:

  • Set autoinstall_known_extensions=false / autoload_known_extensions=false (or autoload-from-local-only) on worker DuckDB instances so no runtime statement can ever reach for the network — silent port-80 drops turn a missing extension into a multi-minute hang instead of a clean error.
  • Either allow or fast-reject (REJECT, not DROP) outbound :80 in the worker egress policy, so anything that does slip through fails in milliseconds rather than minutes.

@EDsCODE EDsCODE closed this Jun 10, 2026