metrics: per-org worker-acquire latency by allocation source + Y-axis fix #865

Merged
fuziontech merged 1 commit into main from worker-acquire-metrics on Jul 1, 2026
@fuziontech

Member

Two things in one PR (as requested).

1. Worker-acquire latency, by org and by allocation phase

How long a pending session waits to get a worker — sliced by org and by the allocation source (did it reuse a hot worker or pay for a cold spawn?).

The acquire histograms already timed the wait but had no org label, and the end-to-end total wasn't tagged by how the worker was obtained. Now:

  • org label on all three acquire histograms (duckgres_worker_acquire_total_seconds, …_phase_seconds, …_gate_wait_seconds) — org is threaded from p.orgID / assignment.OrgID at every observe site.
  • source label on the end-to-end total: idle_reuse | hot_idle_claim | spawn | none. It's bound to the claim before completion, so a failed spawn still attributes its wait to source=spawn (with outcome=error) rather than vanishing.

So histogram_quantile(0.95, sum by (le, source) (rate(duckgres_worker_acquire_total_seconds_bucket{org="X"}[5m]))) answers "p95 wait for org X, per allocation path."

Two allow-listed admin panels surface it on the Metrics page (org-scopable via $ORG):

  • acquire_p95 — p95 acquire latency by source
  • acquire_by_source — acquire rate by source (cold-spawn frequency)

Cardinality: orgs are bounded managed-warehouse tenants, and source is only meaningful on success (else none), so the label cross-product stays small.

2. Metrics chart Y-axis fix (the 00000 bug)

The chart Y-axis had no tickFormatter and a narrow fixed width, so large byte-rate values got clipped to a meaningless 00000 (screenshot on the S3 read bytes rate panel). Added a unit-aware compact formatter — binary bytes (19 MB) for B/s, compact SI (20M, 1.5K) otherwise — for both the tick and the tooltip, and widened the axis.
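
The unit-aware formatting can be sketched as follows. The PR's real formatters (fmtCompact, fmtMetricAxis, fmtMetricValue) live in the UI's TypeScript and their signatures are not shown here, so this is a Go illustration of the rounding logic only. The function names compactSI, binaryBytes, and formatTick are hypothetical, and the unit suffixes and one-decimal rounding are assumptions inferred from the examples in the description (19 MB, 20M, 1.5K).

```go
package main

import (
	"fmt"
	"math"
	"strconv"
)

// scale divides v by base until it drops below base or maxSteps is reached,
// returning the scaled value and how many steps were taken.
// Assumes v is non-negative, as rates are.
func scale(v, base float64, maxSteps int) (float64, int) {
	steps := 0
	for v >= base && steps < maxSteps {
		v /= base
		steps++
	}
	return v, steps
}

// trim rounds to one decimal and drops a trailing ".0" (20.0 becomes "20").
func trim(v float64) string {
	return strconv.FormatFloat(math.Round(v*10)/10, 'f', -1, 64)
}

// compactSI formats with powers of 1000: 1500 becomes "1.5K".
func compactSI(v float64) string {
	units := []string{"", "K", "M", "G", "T"}
	scaled, i := scale(v, 1000, len(units)-1)
	return trim(scaled) + units[i]
}

// binaryBytes formats with powers of 1024, using the short unit names
// shown in the PR description ("19 MB").
func binaryBytes(v float64) string {
	units := []string{"B", "KB", "MB", "GB", "TB"}
	scaled, i := scale(v, 1024, len(units)-1)
	return trim(scaled) + " " + units[i]
}

// formatTick picks the formatter from the panel unit; "B/s" is the only
// byte unit named in the PR, so everything else falls through to SI.
func formatTick(v float64, unit string) string {
	if unit == "B/s" {
		return binaryBytes(v)
	}
	return compactSI(v)
}

func main() {
	fmt.Println(formatTick(19*1024*1024, "B/s"))
	fmt.Println(formatTick(20_000_000, "ops/s"))
	fmt.Println(formatTick(1500, ""))
}
```

Using the same function for the tick and the tooltip keeps the two in agreement, and the short output is what lets the axis fit without clipping.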

Tests

  • acquire_metrics_test.go — updated for the new labels; now also asserts a cold-spawn acquire records source=spawn end-to-end through AcquireWorker
  • metrics_proxy_test.go — its panel loop validates the two new panels render without token corruption
  • lib/format.test.ts — new vitest for fmtCompact / fmtMetricAxis / fmtMetricValue
  • harness.sh — asserts the acquire panels are advertised in the allow-list (raw histogram emission is unit-tested; the :9090 metrics port is NetworkPolicy-blocked from the in-Job harness)
  • Full controlplane + admin Go suites green; UI tsc/lint/44 vitest/build green

🤖 Generated with Claude Code

Two things (per request, same PR):

1. Worker-acquire latency by org and allocation phase. The acquire histograms
   (duckgres_worker_acquire_{total,phase,gate_wait}_seconds) already timed the
   wait but had no `org` label, and the end-to-end total wasn't tagged by HOW
   the worker was obtained. Now:
   - all three carry an `org` label (sliceable per tenant)
   - the total carries a `source` label — idle_reuse | hot_idle_claim | spawn |
     none — so "how long did org X wait, and did it need a cold spawn?" is a
     dashboard query. source is bound to the claim BEFORE completion so a failed
     spawn still attributes its wait to source=spawn (outcome=error).
   org is threaded from p.orgID / assignment.OrgID at every observe site.
   Two allow-listed admin panels expose it: acquire_p95 (p95 by source) and
   acquire_by_source (acquire rate by source), both org-scopable via $ORG.

2. Metrics chart Y-axis fix. The axis had no tickFormatter and a narrow fixed
   width, so large byte-rate values were clipped to a meaningless "00000". Add a
   unit-aware compact formatter (binary bytes for B/s, compact SI otherwise) for
   the tick + tooltip, and widen the axis.

Tests: acquire_metrics_test.go updated for the new labels (+ asserts a cold
spawn records source=spawn end-to-end); metrics_proxy_test validates the new
panels render cleanly; format.test.ts covers the compact/axis/value formatters;
harness asserts the acquire panels are advertised (raw histogram emission is
unit-tested — the :9090 port is NetworkPolicy-blocked in-Job).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NUq2EVxvKQFq3YEDNLF5HP
@github-actions

github-actions Bot commented Jul 1, 2026

Test Impact Plan

Deterministic summary of how this PR changes tests, CI runners, and coverage-risk signals.

Summary

Area               Added  Changed  Deleted
Test files         0      2        0
E2E/journey files  0      1        0
Workflow files     0      0        0

Signals

  • Test cases: +0 / -0
  • Assertions: +3 / -3
  • Skips or known failures added: 0
  • Workflow continue-on-error added: 0
  • Workflow path filters added: 0
  • Test commands removed from justfile: 0
  • E2E/journey retry lines added: 0

Coverage risk: needs review

Warnings

  • E2E or journey files changed (needs review)
    • tests/e2e-mw-dev/harness.sh

@fuziontech fuziontech merged commit 4829eb5 into main Jul 1, 2026
28 checks passed
@fuziontech fuziontech deleted the worker-acquire-metrics branch July 1, 2026 22:51