metrics: per-org worker-acquire latency by allocation source + Y-axis fix #865
Merged
Two things (per request, same PR):
1. Worker-acquire latency by org and allocation phase. The acquire histograms
(duckgres_worker_acquire_{total,phase,gate_wait}_seconds) already timed the
wait but had no `org` label, and the end-to-end total wasn't tagged by HOW
the worker was obtained. Now:
- all three carry an `org` label (sliceable per tenant)
- the total carries a `source` label — idle_reuse | hot_idle_claim | spawn |
none — so "how long did org X wait, and did it need a cold spawn?" is a
dashboard query. source is bound to the claim BEFORE completion so a failed
spawn still attributes its wait to source=spawn (outcome=error).
org is threaded from p.orgID / assignment.OrgID at every observe site.
Two allow-listed admin panels expose it: acquire_p95 (p95 by source) and
acquire_by_source (acquire rate by source), both org-scopable via $ORG.
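The ordering described in item 1 (bind `source` before the attempt completes, then always observe) can be illustrated with a minimal sketch. The PR's actual change is in Go and is not reproduced here; the TypeScript below only shows the idea. `acquire`, `Observation`, the injected `tryIdle` / `spawn` / `now` callbacks, the in-memory `observations` array, and the `outcome="ok"` value are hypothetical names introduced for illustration.

```typescript
// `source` label values are taken from the PR text; everything else is hypothetical.
type Source = "idle_reuse" | "hot_idle_claim" | "spawn" | "none";
type Outcome = "ok" | "error"; // "error" is from the PR text; "ok" is an assumption

interface Observation {
  org: string;
  source: Source;
  outcome: Outcome;
  seconds: number;
}

// Stand-in for a labelled histogram: one record per acquire attempt.
const observations: Observation[] = [];

function acquire(
  org: string,
  tryIdle: () => boolean, // returns true if an idle worker could be reused
  spawn: () => void, // throws if the cold spawn fails
  now: () => number, // clock in seconds, injected so the sketch is testable
): void {
  const start = now();
  let source: Source = "none";
  let outcome: Outcome = "error";
  try {
    if (tryIdle()) {
      source = "idle_reuse";
    } else {
      // Bind the source BEFORE the spawn runs, so a failure below is
      // still attributed to source=spawn rather than to "none".
      source = "spawn";
      spawn();
    }
    outcome = "ok";
  } finally {
    // Observe on both success and failure; errors still propagate to the caller.
    observations.push({ org, source, outcome, seconds: now() - start });
  }
}
```

With this ordering, a spawn that throws still leaves one observation labelled `source=spawn, outcome=error` carrying the full wait, which is the behaviour the commit message describes.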
2. Metrics chart Y-axis fix. The axis had no tickFormatter and a narrow fixed
width, so large byte-rate values were clipped to a meaningless "00000". Add a
unit-aware compact formatter (binary bytes for B/s, compact SI otherwise) for
the tick + tooltip, and widen the axis.
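A unit-aware compact formatter along these lines might look like the sketch below. It is not the PR's `fmtCompact` / `fmtMetricAxis` implementation, which is not shown here; `compactSI`, `binaryBytes`, `formatTick`, and the rounding rule (at most one decimal) are assumptions chosen to reproduce the examples given in the PR (`19 MB`, `20M`, `1.5K`).

```typescript
const SI_SUFFIXES = ["", "K", "M", "G", "T"];
const BYTE_UNITS = ["B", "KB", "MB", "GB", "TB"];

// Divide |value| by `base` until it fits. The comparison uses the value rounded
// to one decimal, so 999.96 rolls over to the next unit instead of printing "1000".
function scale(value: number, base: number, maxIndex: number): [string, number] {
  let scaled = Math.abs(value);
  let index = 0;
  while (Math.round(scaled * 10) / 10 >= base && index < maxIndex) {
    scaled /= base;
    index += 1;
  }
  // At most one decimal; a trailing ".0" is dropped ("20.0" becomes "20").
  return [scaled.toFixed(1).replace(/\.0$/, ""), index];
}

// Keep the minus sign only when the rounded magnitude is non-zero.
function sign(value: number, text: string): string {
  return value < 0 && text !== "0" ? "-" : "";
}

// Compact SI (powers of 1000): 1500 -> "1.5K", 20000000 -> "20M".
function compactSI(value: number): string {
  if (!Number.isFinite(value)) return "";
  const [text, index] = scale(value, 1000, SI_SUFFIXES.length - 1);
  return sign(value, text) + text + SI_SUFFIXES[index];
}

// Binary bytes (powers of 1024), labelled KB/MB/... as in the PR's "19 MB" example.
function binaryBytes(value: number): string {
  if (!Number.isFinite(value)) return "";
  const [text, index] = scale(value, 1024, BYTE_UNITS.length - 1);
  return sign(value, text) + text + " " + BYTE_UNITS[index];
}

// Pick the formatter from the panel's unit string ("B/s" is the unit named in the PR).
function formatTick(value: number, unit: string): string {
  return unit === "B/s" ? binaryBytes(value) : compactSI(value);
}
```

Because the output is at most a few characters plus a suffix, a function like `formatTick` can be passed as the axis `tickFormatter` and reused for the tooltip without the label overflowing the axis width.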
Tests: acquire_metrics_test.go updated for the new labels (+ asserts a cold
spawn records source=spawn end-to-end); metrics_proxy_test validates the new
panels render cleanly; format.test.ts covers the compact/axis/value formatters;
harness asserts the acquire panels are advertised (raw histogram emission is
unit-tested — the :9090 port is NetworkPolicy-blocked in-Job).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NUq2EVxvKQFq3YEDNLF5HP
Test Impact Plan: deterministic summary of how this PR changes tests, CI runners, and coverage-risk signals.
Coverage risk: needs review
Two things in one PR (as requested).

1. Worker-acquire latency, by org and by allocation phase

How long a pending session waits to get a worker, sliced by org and by the allocation source (did it reuse a hot worker or pay for a cold spawn?).

The acquire histograms already timed the wait but had no `org` label, and the end-to-end total wasn't tagged by how the worker was obtained. Now:

- `org` label on all three acquire histograms (`duckgres_worker_acquire_total_seconds`, `…_phase_seconds`, `…_gate_wait_seconds`): `org` is threaded from `p.orgID` / `assignment.OrgID` at every observe site.
- `source` label on the end-to-end total: `idle_reuse | hot_idle_claim | spawn | none`. It's bound to the claim before completion, so a failed spawn still attributes its wait to `source=spawn` (with `outcome=error`) rather than vanishing.

So

`histogram_quantile(0.95, sum by (le, source) (rate(duckgres_worker_acquire_total_seconds_bucket{org="X"}[5m])))`

answers "p95 wait for org X, per allocation path."

Two allow-listed admin panels surface it on the Metrics page (org-scopable via `$ORG`):

- `acquire_p95`: p95 acquire latency by source
- `acquire_by_source`: acquire rate by source (cold-spawn frequency)

Cardinality: orgs are bounded managed-warehouse tenants, and `source` is only meaningful on success (else `none`), so the label cross-product stays small.

2. Metrics chart Y-axis fix (the `00000` bug)

The chart Y-axis had no `tickFormatter` and a narrow fixed width, so large byte-rate values got clipped to a meaningless `00000` (screenshot on the S3 read bytes rate panel). Added a unit-aware compact formatter for both the tick and the tooltip (binary bytes, e.g. 19 MB, for `B/s`; compact SI, e.g. `20M`, `1.5K`, otherwise), and widened the axis.

Tests

- `acquire_metrics_test.go`: updated for the new labels; now also asserts a cold-spawn acquire records `source=spawn` end-to-end through `AcquireWorker`
- `metrics_proxy_test.go`: its panel loop validates the two new panels render without token corruption
- `lib/format.test.ts`: new vitest for `fmtCompact` / `fmtMetricAxis` / `fmtMetricValue`
- `harness.sh`: asserts the acquire panels are advertised in the allow-list (raw histogram emission is unit-tested; the `:9090` metrics port is NetworkPolicy-blocked from the in-Job harness)

`controlplane` + `admin` Go suites green; UI tsc/lint/44 vitest/build green

🤖 Generated with Claude Code