metrics: per-org worker-acquire latency by allocation source + axis fix
Two things (per request, same PR):

1. Worker-acquire latency by org and allocation phase. The acquire histograms
   (duckgres_worker_acquire_{total,phase,gate_wait}_seconds) already timed the
   wait but had no `org` label, and the end-to-end total wasn't tagged by HOW
   the worker was obtained. Now:
   - all three carry an `org` label (sliceable per tenant)
   - the total carries a `source` label (idle_reuse | hot_idle_claim | spawn |
     none), so "how long did org X wait, and did it need a cold spawn?" is a
     dashboard query. `source` is bound to the claim BEFORE completion, so a
     failed spawn still attributes its wait to source=spawn (outcome=error).
   org is threaded from p.orgID / assignment.OrgID at every observe site.
   Two allow-listed admin panels expose it: acquire_p95 (p95 by source) and
   acquire_by_source (acquire rate by source), both org-scopable via $ORG.

2. Metrics chart Y-axis fix. The axis had no tickFormatter and a narrow fixed
   width, so large byte-rate values were clipped to a meaningless "00000". Add a
   unit-aware compact formatter (binary bytes for B/s, compact SI otherwise) for
   the tick + tooltip, and widen the axis.
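
The formatter in this change lives in the frontend (format.test.ts covers it), and its source is not shown here. As a rough illustration of the rule "binary bytes for B/s, compact SI otherwise", the logic could be sketched in Go as below. The name `formatCompact`, the one-decimal rounding, and the exact suffix spelling are assumptions for illustration, not the shipped implementation.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// formatCompact is a hypothetical helper: it scales v down by powers of 1024
// when unit is "B/s" (binary prefixes) and by powers of 1000 otherwise
// (SI prefixes), keeping at most one decimal place.
func formatCompact(v float64, unit string) string {
	base, prefixes := 1000.0, []string{"", "k", "M", "G", "T"}
	if unit == "B/s" {
		base, prefixes = 1024.0, []string{"", "Ki", "Mi", "Gi", "Ti"}
	}

	// Scale the magnitude until it fits below the base or prefixes run out.
	abs, i := math.Abs(v), 0
	for abs >= base && i < len(prefixes)-1 {
		abs /= base
		i++
	}
	if v < 0 {
		abs = -abs
	}

	// Round to one decimal and drop a trailing ".0".
	num := strconv.FormatFloat(math.Round(abs*10)/10, 'f', -1, 64)
	if unit == "B/s" {
		return num + " " + prefixes[i] + unit
	}
	return num + prefixes[i]
}

func main() {
	fmt.Println(formatCompact(1536, "B/s"))
	fmt.Println(formatCompact(5*1024*1024, "B/s"))
	fmt.Println(formatCompact(2500, ""))
}
```

A short tick label like this fits a fixed axis width, which is the property the clipped "00000" labels lacked.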
Tests: acquire_metrics_test.go updated for the new labels (+ asserts a cold
spawn records source=spawn end-to-end); metrics_proxy_test validates the new
panels render cleanly; format.test.ts covers the compact/axis/value formatters;
harness asserts the acquire panels are advertised (raw histogram emission is
unit-tested — the :9090 port is NetworkPolicy-blocked in-Job).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NUq2EVxvKQFq3YEDNLF5HP
- Help: "Time a connection spent blocked in the per-org FIFO acquire gate (orgAcquireGate) before owning the slow acquisition path, partitioned by outcome (acquired|canceled).",
+ Help: "Time a connection spent blocked in the per-org FIFO acquire gate (orgAcquireGate) before owning the slow acquisition path, partitioned by org and outcome (acquired|canceled).",
- Help: "Duration of individual worker-acquire phases on the remote/k8s backend, partitioned by phase (hot_idle_claim|spawn|activate) and outcome (ok|error).",
+ Help: "Duration of individual worker-acquire phases on the remote/k8s backend, partitioned by org, phase (hot_idle_claim|spawn|activate) and outcome (ok|error).",
- Help: "End-to-end OrgReservedPool.AcquireWorker duration, partitioned by outcome (ok|capacity|error|canceled).",
+ Help: "End-to-end OrgReservedPool.AcquireWorker duration (the time a pending session waits for a worker), partitioned by org, the allocation source (idle_reuse|hot_idle_claim|spawn|none) and outcome (ok|capacity|error|canceled).",
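
The commit message says `source` is bound before completion, so a failed spawn is still recorded as source=spawn with outcome=error. The observe sites themselves are not shown in this diff, so the following is only a minimal, stdlib-only sketch of that ordering. `recorder`, `acquire`, and `failingSpawn` are hypothetical stand-ins, not the real OrgReservedPool.AcquireWorker or the Prometheus histogram.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// observation is one recorded sample together with its label values.
type observation struct {
	org, source, outcome string
	seconds              float64
}

// recorder is a hypothetical in-memory stand-in for the labeled histogram.
type recorder struct{ samples []observation }

func (r *recorder) observe(org, source, outcome string, d time.Duration) {
	r.samples = append(r.samples, observation{org, source, outcome, d.Seconds()})
}

var errSpawnFailed = errors.New("spawn failed")

// failingSpawn simulates a cold spawn that errors out.
func failingSpawn() error { return errSpawnFailed }

// acquire times one acquisition attempt. The source label is assigned
// before the corresponding work runs, so the deferred observation keeps
// it even when that work returns an error.
func acquire(r *recorder, org string, idle bool, spawn func() error) (err error) {
	start := time.Now()
	source := "none" // nothing attempted yet
	defer func() {
		outcome := "ok"
		if err != nil {
			outcome = "error"
		}
		r.observe(org, source, outcome, time.Since(start))
	}()

	if idle {
		source = "idle_reuse"
		return nil
	}
	source = "spawn" // bound before the spawn is attempted
	return spawn()
}

func main() {
	r := &recorder{}
	_ = acquire(r, "org-a", true, nil)           // served from an idle worker
	_ = acquire(r, "org-a", false, failingSpawn) // cold spawn that fails
	for _, s := range r.samples {
		fmt.Printf("org=%s source=%s outcome=%s\n", s.org, s.source, s.outcome)
	}
}
```

Setting the label only after a successful spawn would record failed spawns as source=none, hiding the cold-spawn waits the panels are meant to surface.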