metrics: per-org worker-acquire latency by allocation source + Y-axis fix by fuziontech · Pull Request #865 · PostHog/duckgres

fuziontech · 2026-07-01T22:29:50Z

Two things in one PR (as requested).

1. Worker-acquire latency, by org and by allocation phase

How long a pending session waits to get a worker — sliced by org and by the allocation source (did it reuse a hot worker or pay for a cold spawn?).

The acquire histograms already timed the wait but had no org label, and the end-to-end total wasn't tagged by how the worker was obtained. Now:

org label on all three acquire histograms (duckgres_worker_acquire_total_seconds, …_phase_seconds, …_gate_wait_seconds) — org is threaded from p.orgID / assignment.OrgID at every observe site.
source label on the end-to-end total — idle_reuse | hot_idle_claim | spawn | none. It's bound to the claim before completion, so a failed spawn still attributes its wait to source=spawn (with outcome=error) rather than vanishing.

So histogram_quantile(0.95, sum by (le, source) (rate(duckgres_worker_acquire_total_seconds_bucket{org="X"}[5m]))) answers "p95 wait for org X, per allocation path."

Two allow-listed admin panels surface it on the Metrics page (org-scopable via $ORG):

acquire_p95 — p95 acquire latency by source
acquire_by_source — acquire rate by source (cold-spawn frequency)

Cardinality: orgs are bounded managed-warehouse tenants, and source is only meaningful on success (else none), so the label cross-product stays small.

2. Metrics chart Y-axis fix (the `00000` bug)

The chart Y-axis had no tickFormatter and a narrow fixed width, so large byte-rate values got clipped to a meaningless 00000 (screenshot on the S3 read bytes rate panel). Added a unit-aware compact formatter — binary bytes (19 MB) for B/s, compact SI (20M, 1.5K) otherwise — for both the tick and the tooltip, and widened the axis.
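The PR's formatters live in the UI (lib/format, exercised by lib/format.test.ts) and their implementation is not shown here. Purely to illustrate the rule described above (powers of 1024 for byte rates, powers of 1000 otherwise, at most one decimal), a sketch in Go might look like this. `formatAxis`, `scale`, and the `isBytes` flag are hypothetical names, and the exact unit strings are assumptions modelled on the examples in the text.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
	"strings"
)

var (
	byteUnits = []string{" B", " KB", " MB", " GB", " TB"} // binary: base 1024
	siUnits   = []string{"", "K", "M", "G", "T"}           // compact SI: base 1000
)

// scale divides v by base until it fits, rounds to one decimal, and appends
// the matching unit suffix.
func scale(v, base float64, units []string) string {
	i := 0
	for math.Abs(v) >= base && i < len(units)-1 {
		v /= base
		i++
	}
	// Round to one decimal; if rounding reaches the base, carry to the next unit.
	r := math.Round(v*10) / 10
	if math.Abs(r) >= base && i < len(units)-1 {
		r /= base
		i++
	}
	s := strconv.FormatFloat(r, 'f', 1, 64)
	return strings.TrimSuffix(s, ".0") + units[i]
}

// formatAxis picks the binary-byte scale for byte-rate panels and the
// compact SI scale for everything else.
func formatAxis(v float64, isBytes bool) string {
	if isBytes {
		return scale(v, 1024, byteUnits)
	}
	return scale(v, 1000, siUnits)
}

func main() {
	fmt.Println(formatAxis(19*1024*1024, true)) // 19 MB
	fmt.Println(formatAxis(20_000_000, false))  // 20M
	fmt.Println(formatAxis(1500, false))        // 1.5K
}
```

Short labels like these fit in a fixed axis width, which is presumably why a compact formatter addresses the clipping alongside the wider axis.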

Tests

acquire_metrics_test.go — updated for the new labels; now also asserts a cold-spawn acquire records source=spawn end-to-end through AcquireWorker
metrics_proxy_test.go — its panel loop validates the two new panels render without token corruption
lib/format.test.ts — new vitest for fmtCompact / fmtMetricAxis / fmtMetricValue
harness.sh — asserts the acquire panels are advertised in the allow-list (raw histogram emission is unit-tested; the :9090 metrics port is NetworkPolicy-blocked from the in-Job harness)
Full controlplane + admin Go suites green; UI tsc/lint/44 vitest/build green

🤖 Generated with Claude Code

Two things (per request, same PR):

1. Worker-acquire latency by org and allocation phase.

The acquire histograms (duckgres_worker_acquire_{total,phase,gate_wait}_seconds) already timed the wait but had no `org` label, and the end-to-end total wasn't tagged by HOW the worker was obtained. Now:
- all three carry an `org` label (sliceable per tenant)
- the total carries a `source` label (idle_reuse | hot_idle_claim | spawn | none), so "how long did org X wait, and did it need a cold spawn?" is a dashboard query.

source is bound to the claim BEFORE completion so a failed spawn still attributes its wait to source=spawn (outcome=error). org is threaded from p.orgID / assignment.OrgID at every observe site.

Two allow-listed admin panels expose it: acquire_p95 (p95 by source) and acquire_by_source (acquire rate by source), both org-scopable via $ORG.

2. Metrics chart Y-axis fix.

The axis had no tickFormatter and a narrow fixed width, so large byte-rate values were clipped to a meaningless "00000". Add a unit-aware compact formatter (binary bytes for B/s, compact SI otherwise) for the tick + tooltip, and widen the axis.

Tests: acquire_metrics_test.go updated for the new labels (+ asserts a cold spawn records source=spawn end-to-end); metrics_proxy_test validates the new panels render cleanly; format.test.ts covers the compact/axis/value formatters; harness asserts the acquire panels are advertised (raw histogram emission is unit-tested; the :9090 port is NetworkPolicy-blocked in-Job).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NUq2EVxvKQFq3YEDNLF5HP

github-actions · 2026-07-01T22:30:20Z

Test Impact Plan

Deterministic summary of how this PR changes tests, CI runners, and coverage-risk signals.

Summary

Area	Added	Changed	Deleted
Test files	0	2	0
E2E/journey files	0	1	0
Workflow files	0	0	0

Signals

Test cases: +0 / -0
Assertions: +3 / -3
Skips or known failures added: 0
Workflow continue-on-error added: 0
Workflow path filters added: 0
Test commands removed from justfile: 0
E2E/journey retry lines added: 0

Coverage risk: needs review

Warnings

E2E or journey files changed (needs review)
- tests/e2e-mw-dev/harness.sh

fuziontech merged commit 4829eb5 into main Jul 1, 2026
28 checks passed

fuziontech deleted the worker-acquire-metrics branch July 1, 2026 22:51

fuziontech merged 1 commit into main from worker-acquire-metrics
