feat: surface vLLM speculative-decoding metrics + responsive hardware cards #36

Merged
niklasfrick merged 4 commits into main from feat/spec-decode-metrics
Jun 18, 2026
@niklasfrick niklasfrick commented Jun 18, 2026


Summary

Adds vLLM speculative-decoding metrics end-to-end (Rust parse → TS types → UI), and reworks the hardware cards so the dashboard stays a one-pager across viewport sizes.

Changes

  • feat(engines) — parse vLLM speculative-decoding metrics in src/engines/vllm.rs (+ mod.rs), with unit tests.
  • feat(frontend) — surface the spec-decode metrics in the cache card; new TS types (types/metrics.ts), formatters (lib/format.ts), and aggregation (lib/engineAggregate.ts).
  • style(frontend) — declutter the spec-decode card and tighten the responsive grid.
  • feat(frontend) — adapt hardware cards to vertical space: new HBar gauge and useElementSize hook; cards swap square gauges for compact horizontal bars and drop line charts when vertical space is tight.

Metrics contract

EngineSnapshot metrics gain spec-decode fields, propagated through Rust tests, types/metrics.ts, lib/format.ts, Vitest specs (EngineCardSpecDecode, engineAggregate, format, HBar), and the engine components.

Test plan

  • cargo clippy --all-targets --locked -- -D warnings, cargo test --locked
  • cd frontend && npm run build && npm test -- --run
  • End-to-end against a live vLLM engine on the DGX Spark (deploy.sh)

Companion PR #35 (GPU power gauge fix) is stacked on this branch; merge this first, then retarget #35 to main.

vLLM exposes `vllm:spec_decode_*` counters only when the served model has
speculative decoding configured. Parse the draft/accepted/drafts counters
(handling the prometheus_client `_total` suffix), expose them as cumulative
lifetime totals on EngineMetrics, and derive token acceptance rate (TAR),
mean acceptance length, and a live (windowed) TAR from per-poll deltas.
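
The parsing and derivation described above can be sketched roughly as follows. This is a minimal illustration, not the code in src/engines/vllm.rs: the struct and function names are hypothetical, the three metric names are assumed to be the ones vLLM exports, and mean acceptance length is assumed to be defined as 1 + accepted tokens per draft.

```rust
/// Cumulative speculative-decoding counters; `None` means the metric was absent.
/// (Hypothetical type for illustration.)
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct SpecDecodeTotals {
    drafts: Option<u64>,
    draft_tokens: Option<u64>,
    accepted_tokens: Option<u64>,
}

/// Clamp a sample to a non-negative value before the f64 -> u64 cast.
fn to_count(v: f64) -> u64 {
    if v.is_finite() && v > 0.0 { v as u64 } else { 0 }
}

/// Add a sample to a slot, summing across label sets of the same metric.
fn add(slot: &mut Option<u64>, v: u64) {
    *slot = Some(slot.unwrap_or(0) + v);
}

/// Parse the Prometheus text exposition body, keeping only the three counters.
fn parse_spec_decode(body: &str) -> SpecDecodeTotals {
    let mut out = SpecDecodeTotals::default();
    for line in body.lines() {
        let line = line.trim();
        if line.is_empty() || line.starts_with('#') {
            continue; // skip blank lines and HELP/TYPE comments
        }
        // Assumes no trailing timestamp: the value is the last space-separated token.
        let Some((series, value)) = line.rsplit_once(' ') else { continue };
        let Ok(value) = value.parse::<f64>() else { continue };
        // Drop the label set, then the `_total` suffix added by prometheus_client.
        let name = series.split('{').next().unwrap_or(series).trim();
        let name = name.strip_suffix("_total").unwrap_or(name);
        let v = to_count(value);
        match name {
            // Metric names are assumptions about what vLLM exports.
            "vllm:spec_decode_num_drafts" => add(&mut out.drafts, v),
            "vllm:spec_decode_num_draft_tokens" => add(&mut out.draft_tokens, v),
            "vllm:spec_decode_num_accepted_tokens" => add(&mut out.accepted_tokens, v),
            _ => {}
        }
    }
    out
}

/// Token acceptance rate: accepted tokens / drafted tokens.
fn acceptance_rate(t: &SpecDecodeTotals) -> Option<f64> {
    match (t.accepted_tokens, t.draft_tokens) {
        (Some(a), Some(d)) if d > 0 => Some(a as f64 / d as f64),
        _ => None,
    }
}

/// Mean acceptance length, assumed here to be 1 + accepted tokens per draft.
fn mean_acceptance_length(t: &SpecDecodeTotals) -> Option<f64> {
    match (t.accepted_tokens, t.drafts) {
        (Some(a), Some(n)) if n > 0 => Some(1.0 + a as f64 / n as f64),
        _ => None,
    }
}
```

With 100 drafts, 400 drafted tokens, and 300 accepted tokens, this sketch yields a TAR of 0.75 and a mean acceptance length of 4.0. A body without any of the three counters leaves every field `None`.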

The live-TAR snapshot is discarded when a counter is absent for a poll or
appears to go backwards (engine restart) so a stale delta can't inflate or
negate the rate; counters are clamped non-negative before the f64->u64 cast.
Absent metrics leave every field None, which drives the frontend to hide the
speculative-decoding section entirely.
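
The live-TAR bookkeeping can be pictured with a small state machine like the one below. It is a sketch under assumptions rather than the PR's implementation: `Sample`, `LiveTar`, and `observe` are hypothetical names, and after a detected reset the current reading is kept as the new baseline, which is one possible reading of "discarded".

```rust
/// One poll's cumulative counters (hypothetical type for illustration).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Sample {
    draft_tokens: u64,
    accepted_tokens: u64,
}

/// Holds the previous poll so the next one can be turned into a delta.
#[derive(Debug, Default)]
struct LiveTar {
    prev: Option<Sample>,
}

impl LiveTar {
    /// Feed one poll; returns the windowed TAR since the previous poll, if any.
    fn observe(&mut self, current: Option<Sample>) -> Option<f64> {
        let Some(cur) = current else {
            // Counter absent for this poll: drop the snapshot so the next
            // poll cannot compute a delta against stale data.
            self.prev = None;
            return None;
        };
        // Store the current reading and take the previous one, if present.
        let prev = self.prev.replace(cur)?;
        if cur.draft_tokens < prev.draft_tokens
            || cur.accepted_tokens < prev.accepted_tokens
        {
            // Counters went backwards (likely an engine restart): no rate
            // for this window; `cur` is already stored as the new baseline.
            return None;
        }
        let drafted = cur.draft_tokens - prev.draft_tokens;
        let accepted = cur.accepted_tokens - prev.accepted_tokens;
        if drafted == 0 {
            return None; // nothing drafted in this window
        }
        Some(accepted as f64 / drafted as f64)
    }
}
```

The first poll, a poll with missing counters, and a poll after a restart all return `None`; a rate only appears once two consecutive, monotonically increasing readings are available.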

Rename the per-engine and aggregate "Cache" cards to "Cache & Speculative
Decoding" and add a SpecDecodeSection that renders only when the served model
has actually drafted tokens (spec_decode_draft_tokens_total > 0). It shows the
token acceptance rate (lifetime + live), mean acceptance length, and the
cumulative accepted/draft token counters that animate upward.

Aggregation sums the cumulative counters and recomputes TAR and mean
acceptance length from the aggregated sums (volume-weighted, not naively
averaged); live TAR is blended weighted by per-engine draft-token volume.
Adds the formatAcceptanceLength formatter and unit/component tests.
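
The difference between volume-weighted and naively averaged aggregation is easiest to see in a worked sketch. The arithmetic is language-independent; it is shown here in Rust, whereas the PR implements it in lib/engineAggregate.ts. All type and field names are hypothetical, the live-TAR weight is assumed to be each engine's lifetime draft-token count, and mean acceptance length is assumed to be 1 + accepted tokens per draft.

```rust
/// Per-engine speculative-decoding figures (hypothetical shape).
#[derive(Debug, Clone, Copy)]
struct EngineSpecDecode {
    drafts: u64,
    draft_tokens: u64,
    accepted_tokens: u64,
    live_tar: Option<f64>,
}

/// Aggregate across engines (hypothetical shape).
#[derive(Debug, PartialEq)]
struct Aggregate {
    drafts: u64,
    draft_tokens: u64,
    accepted_tokens: u64,
    tar: Option<f64>,
    mean_acceptance_length: Option<f64>,
    live_tar: Option<f64>,
}

fn aggregate(engines: &[EngineSpecDecode]) -> Aggregate {
    // Sum the cumulative counters first.
    let drafts: u64 = engines.iter().map(|e| e.drafts).sum();
    let draft_tokens: u64 = engines.iter().map(|e| e.draft_tokens).sum();
    let accepted_tokens: u64 = engines.iter().map(|e| e.accepted_tokens).sum();

    // Recompute the ratios from the sums so busy engines count for more.
    let tar = (draft_tokens > 0).then(|| accepted_tokens as f64 / draft_tokens as f64);
    let mean_acceptance_length =
        (drafts > 0).then(|| 1.0 + accepted_tokens as f64 / drafts as f64);

    // Blend live TAR, weighting each engine by its draft-token volume
    // (assumption: lifetime draft tokens are the weight).
    let (weighted, weight) = engines
        .iter()
        .filter_map(|e| e.live_tar.map(|r| (r * e.draft_tokens as f64, e.draft_tokens as f64)))
        .fold((0.0, 0.0), |(ws, w), (x, y)| (ws + x, w + y));
    let live_tar = (weight > 0.0).then(|| weighted / weight);

    Aggregate { drafts, draft_tokens, accepted_tokens, tar, mean_acceptance_length, live_tar }
}
```

For two engines with 300/400 and 20/100 accepted/draft tokens, the aggregated TAR is 320/500 = 0.64, whereas averaging the per-engine rates (0.75 and 0.20) would give 0.475 and overstate the small engine's influence.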

The Cache & Speculative Decoding card was the tallest cell in the engine row
(inflating every sibling), and its "ACCEPTANCE · TAR" label overflowed and
truncated. Compact the speculative-decoding section: fold the live TAR into
the "TAR" label inline, drop the separate live subline, and give the
cumulative Accepted/Draft counters their own lines (smaller fonts) so the
abbreviated values never clip on narrow cards.

Also move the engine metric and chart grids to a single 6-up row from `lg`
(1024px) instead of `xl` (1280px) so laptop widths render the engine row on
one line without overflowing the fixed-height one-pager.

Verified visually at 1024/1280/2000px via a throwaway Playwright preview.

The hardware cards only adapted to horizontal space, so on short viewports the
square gauges and trend charts overflowed and broke the one-pager. Measure the
available vertical space with a ResizeObserver and degrade gracefully:

- New `useElementSize` hook (ResizeObserver, SSR/jsdom-safe).
- New `HBar` component: a compact horizontal-bar gauge mirroring `ArcGauge`'s
  data API (single value+threshold, or stacked segments + legend).
- When per-card height is tight, each hardware card drops its line chart and
  swaps its arc gauge for an `HBar` (Memory becomes a stacked segmented bar);
  value-only cards (Clock/Disk/Network) keep their readouts without the chart.
  The CPU core heatmap is dropped in this mode too, alongside the line charts.
- When the dashboard content height is very short, the engine section also
  drops its per-metric trend charts so it stops crowding the hardware grid
  off-screen.

Both decisions key off measured, content-independent heights (per-card height
for the hardware swap, root height for the engine charts), so dropping content
cannot change the measurement that triggered the drop and the layout cannot
enter a feedback loop. Verified across 1050/800/620/520px viewports via a
throwaway Playwright preview. Adds HBar unit tests.
@niklasfrick niklasfrick merged commit 68ee1ff into main Jun 18, 2026
7 checks passed
@niklasfrick niklasfrick deleted the feat/spec-decode-metrics branch June 18, 2026 11:06