feat: surface vLLM speculative-decoding metrics + responsive hardware cards by niklasfrick · Pull Request #36 · niklasfrick/spark-dashboard

niklasfrick · 2026-06-18T09:47:47Z

Summary

Adds vLLM speculative-decoding metrics end-to-end (Rust parse → TS types → UI), and reworks the hardware cards so the dashboard stays a one-pager across viewport sizes.

Changes

feat(engines) — parse vLLM speculative-decoding metrics in src/engines/vllm.rs (+ mod.rs), with unit tests.
feat(frontend) — surface the spec-decode metrics in the cache card; new TS types (types/metrics.ts), formatters (lib/format.ts), and aggregation (lib/engineAggregate.ts).
style(frontend) — declutter the spec-decode card and tighten the responsive grid.
feat(frontend) — adapt hardware cards to vertical space: new HBar gauge and useElementSize hook; cards swap square gauges for compact horizontal bars and drop line charts when vertical space is tight.

Metrics contract

EngineSnapshot metrics gain spec-decode fields, propagated through Rust tests, types/metrics.ts, lib/format.ts, Vitest specs (EngineCardSpecDecode, engineAggregate, format, HBar), and the engine components.

Test plan

cargo clippy --all-targets --locked -- -D warnings, cargo test --locked
cd frontend && npm run build && npm test -- --run
End-to-end against a live vLLM engine on the DGX Spark (deploy.sh)

Companion PR #35 (GPU power gauge fix) is stacked on this branch; merge this first, then retarget #35 to main.

vLLM exposes `vllm:spec_decode_*` counters only when the served model has speculative decoding configured. Parse the draft/accepted/drafts counters (handling the prometheus_client `_total` suffix), expose them as cumulative lifetime totals on EngineMetrics, and derive token acceptance rate (TAR), mean acceptance length, and a live (windowed) TAR from per-poll deltas. The live-TAR snapshot is discarded when a counter is absent for a poll or appears to go backwards (engine restart) so a stale delta can't inflate or negate the rate; counters are clamped non-negative before the f64->u64 cast. Absent metrics leave every field None, which drives the frontend to hide the speculative-decoding section entirely.
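
The per-poll delta logic described above can be sketched as follows. This is a minimal illustration, not the PR's code: the PR implements it in Rust in src/engines/vllm.rs, whereas this sketch uses TypeScript, and `LiveTarTracker`, `SpecDecodeSample`, and `Baseline` are hypothetical names. It also assumes that after a counter goes backwards the current sample becomes the new baseline, which the commit message does not spell out.

```typescript
// Hypothetical shape of one poll's cumulative counters; undefined = metric absent.
interface SpecDecodeSample {
  draftTokens?: number;
  acceptedTokens?: number;
}

interface Baseline {
  draftTokens: number;
  acceptedTokens: number;
}

class LiveTarTracker {
  private prev: Baseline | null = null;

  /** Returns the windowed TAR for this poll, or null when no trustworthy delta exists. */
  update(sample: SpecDecodeSample): number | null {
    const { draftTokens, acceptedTokens } = sample;
    if (draftTokens === undefined || acceptedTokens === undefined) {
      // A counter is absent for this poll: discard the snapshot so a later
      // sample is never diffed against a stale baseline.
      this.prev = null;
      return null;
    }

    // Clamp to non-negative, mirroring the clamp before the integer cast.
    const cur: Baseline = {
      draftTokens: Math.max(0, draftTokens),
      acceptedTokens: Math.max(0, acceptedTokens),
    };
    const prev = this.prev;
    this.prev = cur; // assumption: the current sample always becomes the new baseline
    if (prev === null) return null; // first sample after a reset: nothing to diff

    const dDraft = cur.draftTokens - prev.draftTokens;
    const dAccepted = cur.acceptedTokens - prev.acceptedTokens;
    if (dDraft < 0 || dAccepted < 0) return null; // counter went backwards (engine restart)
    if (dDraft === 0) return null; // no tokens drafted in this window
    return dAccepted / dDraft;
  }
}
```

Returning null rather than 0 for an idle or invalid window lets the caller distinguish "no data" from "nothing accepted".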

Rename the per-engine and aggregate "Cache" cards to "Cache & Speculative Decoding" and add a SpecDecodeSection that renders only when the served model has actually drafted tokens (spec_decode_draft_tokens_total > 0). It shows the token acceptance rate (lifetime + live), mean acceptance length, and the cumulative accepted/draft token counters that animate upward. Aggregation sums the cumulative counters and recomputes TAR and mean acceptance length from the aggregated sums (volume-weighted, not naively averaged); live TAR is blended weighted by per-engine draft-token volume. Adds the formatAcceptanceLength formatter and unit/component tests.
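
The volume-weighted aggregation can be illustrated with a short sketch. The names `EngineSpecDecode`, `AggregateSpecDecode`, and `aggregateSpecDecode` are hypothetical and do not necessarily match lib/engineAggregate.ts. The sketch assumes the live-TAR weight is each engine's cumulative draft-token count. Mean acceptance length is omitted because its formula is not stated here; it would be recomputed from the summed counters in the same way.

```typescript
// Hypothetical per-engine view; null means the engine reports no spec-decode metrics.
interface EngineSpecDecode {
  draftTokens: number | null;
  acceptedTokens: number | null;
  liveTar: number | null;
}

interface AggregateSpecDecode {
  draftTokens: number;
  acceptedTokens: number;
  tar: number;
  liveTar: number | null;
}

function aggregateSpecDecode(engines: EngineSpecDecode[]): AggregateSpecDecode | null {
  let draft = 0;
  let accepted = 0;
  let liveWeighted = 0;
  let liveWeight = 0;

  for (const e of engines) {
    if (e.draftTokens === null || e.acceptedTokens === null) continue;
    draft += e.draftTokens;
    accepted += e.acceptedTokens;
    if (e.liveTar !== null && e.draftTokens > 0) {
      // Assumption: weight the live TAR by cumulative draft-token volume.
      liveWeighted += e.liveTar * e.draftTokens;
      liveWeight += e.draftTokens;
    }
  }

  if (draft <= 0) return null; // nothing drafted anywhere: hide the section

  return {
    draftTokens: draft,
    acceptedTokens: accepted,
    tar: accepted / draft, // recomputed from sums, not averaged per engine
    liveTar: liveWeight > 0 ? liveWeighted / liveWeight : null,
  };
}
```

For example, with one engine at 240 of 300 tokens accepted (TAR 0.8) and another at 20 of 100 (TAR 0.2), the aggregate TAR is 260/400 = 0.65, whereas a naive average of the two rates would report 0.5.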

The Cache & Speculative Decoding card was the tallest cell in the engine row (inflating every sibling), and its "ACCEPTANCE · TAR" label overflowed and truncated. Compact the speculative-decoding section: fold the live TAR into the "TAR" label inline, drop the separate live subline, and give the cumulative Accepted/Draft counters their own lines (smaller fonts) so the abbreviated values never clip on narrow cards. Also move the engine metric and chart grids to a single 6-up row from `lg` (1024px) instead of `xl` (1280px) so laptop widths render the engine row on one line without overflowing the fixed-height one-pager. Verified visually at 1024/1280/2000px via a throwaway Playwright preview.

The hardware cards only adapted to horizontal space, so on short viewports the square gauges and trend charts overflowed and broke the one-pager. Measure the available vertical space with a ResizeObserver and degrade gracefully:

- New `useElementSize` hook (ResizeObserver, SSR/jsdom-safe).
- New `HBar` component: a compact horizontal-bar gauge mirroring `ArcGauge`'s data API (single value+threshold, or stacked segments + legend).
- When per-card height is tight, each hardware card drops its line chart and swaps its arc gauge for an `HBar` (Memory becomes a stacked segmented bar); value-only cards (Clock/Disk/Network) keep their readouts without the chart. The CPU core heatmap is dropped in this mode too, alongside the line charts.
- When the dashboard content height is very short, the engine section also drops its per-metric trend charts so it stops crowding the hardware grid off-screen.

Both decisions key off measured, content-independent heights (per-card height for the hardware swap, root height for the engine charts) so they can't feedback-loop. Verified across 1050/800/620/520px viewports via a throwaway Playwright preview. Adds HBar unit tests.
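
The two height-driven decisions can be sketched as a pure function of the measured heights. This is an illustration only: `decideLayout`, `LayoutDecision`, and both pixel thresholds are hypothetical, and the PR's actual cut-offs are not given here. A null height stands for "not measured yet" (for example under SSR or jsdom), in which case the sketch assumes the full layout is kept.

```typescript
type GaugeKind = "arc" | "hbar";

interface LayoutDecision {
  gauge: GaugeKind;          // arc gauge, or the compact horizontal bar
  showLineChart: boolean;    // per-card trend chart
  showCoreHeatmap: boolean;  // CPU core heatmap
  showEngineCharts: boolean; // per-metric trend charts in the engine section
}

// Hypothetical thresholds, chosen only for illustration.
const TIGHT_CARD_HEIGHT_PX = 220;
const SHORT_ROOT_HEIGHT_PX = 640;

function decideLayout(
  cardHeightPx: number | null,
  rootHeightPx: number | null,
): LayoutDecision {
  // Each decision reads exactly one measured height, and that height does not
  // depend on what the decision renders, so toggling content cannot change
  // the input and cause the layout to oscillate.
  const tightCard = cardHeightPx !== null && cardHeightPx < TIGHT_CARD_HEIGHT_PX;
  const shortRoot = rootHeightPx !== null && rootHeightPx < SHORT_ROOT_HEIGHT_PX;

  return {
    gauge: tightCard ? "hbar" : "arc",
    showLineChart: !tightCard,
    showCoreHeatmap: !tightCard,
    showEngineCharts: !shortRoot,
  };
}
```

Keeping the decision pure makes it easy to unit-test without a DOM; a hook such as `useElementSize` would only need to supply the two numbers.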

niklasfrick added 4 commits June 18, 2026 10:42

niklasfrick merged commit 68ee1ff into main Jun 18, 2026
7 checks passed

niklasfrick deleted the feat/spec-decode-metrics branch June 18, 2026 11:06
