feat: surface vLLM speculative-decoding metrics + responsive hardware cards #36
Merged
vLLM exposes `vllm:spec_decode_*` counters only when the served model has speculative decoding configured. Parse the draft/accepted/drafts counters (handling the prometheus_client `_total` suffix), expose them as cumulative lifetime totals on EngineMetrics, and derive token acceptance rate (TAR), mean acceptance length, and a live (windowed) TAR from per-poll deltas. The live-TAR snapshot is discarded when a counter is absent for a poll or appears to go backwards (engine restart) so a stale delta can't inflate or negate the rate; counters are clamped non-negative before the f64->u64 cast. Absent metrics leave every field None, which drives the frontend to hide the speculative-decoding section entirely.
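The per-poll delta logic described above can be sketched as follows. The PR implements it in Rust; this is only an illustrative TypeScript sketch. The names `SpecDecodeCounters`, `SpecDecodeDerived`, and `SpecDecodeTracker` are hypothetical, and the mean acceptance length formula (`1 + accepted / drafts`) is an assumption based on how vLLM reports that figure, not something this PR states.

```typescript
// Hypothetical shapes; the PR's actual field names may differ.
interface SpecDecodeCounters {
  draftTokens: number;    // cumulative draft tokens proposed
  acceptedTokens: number; // cumulative draft tokens accepted
  drafts: number;         // cumulative draft rounds
}

interface SpecDecodeDerived {
  tar: number | null;                  // lifetime token acceptance rate
  meanAcceptanceLength: number | null; // assumed: 1 + accepted / drafts
  liveTar: number | null;              // TAR over the last poll interval
}

class SpecDecodeTracker {
  private prev: SpecDecodeCounters | null = null;

  // Pass null when the counters were absent from this poll's scrape.
  update(current: SpecDecodeCounters | null): SpecDecodeDerived {
    if (current === null) {
      // Drop the snapshot so the next delta is never computed across a gap.
      this.prev = null;
      return { tar: null, meanAcceptanceLength: null, liveTar: null };
    }
    const prev = this.prev;
    this.prev = current; // current poll becomes the baseline for the next one

    const tar =
      current.draftTokens > 0 ? current.acceptedTokens / current.draftTokens : null;
    const meanAcceptanceLength =
      current.drafts > 0 ? 1 + current.acceptedTokens / current.drafts : null;

    let liveTar: number | null = null;
    if (prev !== null) {
      const dDraft = current.draftTokens - prev.draftTokens;
      const dAccepted = current.acceptedTokens - prev.acceptedTokens;
      const dDrafts = current.drafts - prev.drafts;
      // A counter going backwards is treated as an engine restart: the old
      // snapshot is not used, so no live rate is reported for this poll.
      const wentBackwards = dDraft < 0 || dAccepted < 0 || dDrafts < 0;
      if (!wentBackwards && dDraft > 0) {
        liveTar = dAccepted / dDraft;
      }
    }
    return { tar, meanAcceptanceLength, liveTar };
  }
}
```

In this sketch the first poll after a gap or restart only re-establishes the baseline, so the live rate reappears one poll later.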
Rename the per-engine and aggregate "Cache" cards to "Cache & Speculative Decoding" and add a SpecDecodeSection that renders only when the served model has actually drafted tokens (spec_decode_draft_tokens_total > 0). It shows the token acceptance rate (lifetime + live), mean acceptance length, and the cumulative accepted/draft token counters that animate upward. Aggregation sums the cumulative counters and recomputes TAR and mean acceptance length from the aggregated sums (volume-weighted, not naively averaged); live TAR is blended weighted by per-engine draft-token volume. Adds the formatAcceptanceLength formatter and unit/component tests.
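A minimal sketch of the volume-weighted aggregation described above, for illustration only. The interface and function names are hypothetical rather than the ones in `lib/engineAggregate.ts`. Two assumptions are made: the live TAR blend is weighted by each engine's cumulative draft-token count, and mean acceptance length is `1 + accepted / drafts`.

```typescript
// Hypothetical per-engine shape; null means the engine exposes no spec-decode metrics.
interface EngineSpecDecode {
  draftTokens: number | null;
  acceptedTokens: number | null;
  drafts: number | null;
  liveTar: number | null;
}

interface AggregateSpecDecode {
  draftTokens: number;
  acceptedTokens: number;
  drafts: number;
  tar: number | null;
  meanAcceptanceLength: number | null;
  liveTar: number | null;
}

function aggregateSpecDecode(engines: EngineSpecDecode[]): AggregateSpecDecode | null {
  let draftTokens = 0;
  let acceptedTokens = 0;
  let drafts = 0;
  let liveWeighted = 0; // sum of liveTar * weight
  let liveWeight = 0;   // sum of weights (draft-token volume, assumed cumulative)
  let seen = false;

  for (const e of engines) {
    if (e.draftTokens === null || e.acceptedTokens === null || e.drafts === null) {
      continue; // engine without speculative decoding contributes nothing
    }
    seen = true;
    draftTokens += e.draftTokens;
    acceptedTokens += e.acceptedTokens;
    drafts += e.drafts;
    if (e.liveTar !== null && e.draftTokens > 0) {
      liveWeighted += e.liveTar * e.draftTokens;
      liveWeight += e.draftTokens;
    }
  }
  if (!seen) return null;

  return {
    draftTokens,
    acceptedTokens,
    drafts,
    // Ratios are recomputed from the summed counters, never averaged per engine.
    tar: draftTokens > 0 ? acceptedTokens / draftTokens : null,
    meanAcceptanceLength: drafts > 0 ? 1 + acceptedTokens / drafts : null,
    liveTar: liveWeight > 0 ? liveWeighted / liveWeight : null,
  };
}
```

For example, an engine with 90 of 100 draft tokens accepted and another with 450 of 900 aggregate to a TAR of 0.54. A naive average of the two per-engine rates would report 0.70 and overstate the busy engine's acceptance.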
The Cache & Speculative Decoding card was the tallest cell in the engine row (inflating every sibling), and its "ACCEPTANCE · TAR" label overflowed and truncated. Compact the speculative-decoding section: fold the live TAR into the "TAR" label inline, drop the separate live subline, and give the cumulative Accepted/Draft counters their own lines (smaller fonts) so the abbreviated values never clip on narrow cards. Also move the engine metric and chart grids to a single 6-up row from `lg` (1024px) instead of `xl` (1280px) so laptop widths render the engine row on one line without overflowing the fixed-height one-pager. Verified visually at 1024/1280/2000px via a throwaway Playwright preview.
The hardware cards only adapted to horizontal space, so on short viewports the square gauges and trend charts overflowed and broke the one-pager. Measure the available vertical space with a ResizeObserver and degrade gracefully:
- New `useElementSize` hook (ResizeObserver, SSR/jsdom-safe).
- New `HBar` component: a compact horizontal-bar gauge mirroring `ArcGauge`'s data API (single value+threshold, or stacked segments + legend).
- When per-card height is tight, each hardware card drops its line chart and swaps its arc gauge for an `HBar` (Memory becomes a stacked segmented bar); value-only cards (Clock/Disk/Network) keep their readouts without the chart. The CPU core heatmap is dropped in this mode too, alongside the line charts.
- When the dashboard content height is very short, the engine section also drops its per-metric trend charts so it stops crowding the hardware grid off-screen.
Both decisions key off measured, content-independent heights (per-card height for the hardware swap, root height for the engine charts) so they can't feedback-loop. Verified across 1050/800/620/520px viewports via a throwaway Playwright preview. Adds HBar unit tests.
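The two height-driven decisions can be pictured as pure functions of the measured sizes. This is a rough sketch rather than the PR's code: the function names are hypothetical, both pixel thresholds are made-up placeholders, and treating an unmeasured height (`null`, as under SSR or jsdom) as "full layout" is an assumption.

```typescript
type GaugeKind = "arc" | "hbar";

interface HardwareCardLayout {
  gauge: GaugeKind;
  showLineChart: boolean;
  showCoreHeatmap: boolean;
}

// Placeholder thresholds for illustration; the PR's real values are not stated.
const COMPACT_CARD_HEIGHT_PX = 220;
const MIN_ROOT_HEIGHT_FOR_ENGINE_CHARTS_PX = 640;

// Decide a hardware card's layout from its own measured height.
// null means "not measured yet" and is assumed to fall back to the full layout.
function hardwareCardLayout(cardHeightPx: number | null): HardwareCardLayout {
  const compact = cardHeightPx !== null && cardHeightPx < COMPACT_CARD_HEIGHT_PX;
  return {
    gauge: compact ? "hbar" : "arc",
    showLineChart: !compact,
    showCoreHeatmap: !compact,
  };
}

// Decide whether the engine section keeps its trend charts, from the
// dashboard root height rather than from the engine section's own height.
function showEngineTrendCharts(rootHeightPx: number | null): boolean {
  return rootHeightPx === null || rootHeightPx >= MIN_ROOT_HEIGHT_FOR_ENGINE_CHARTS_PX;
}
```

Because each input is a container height that does not change when charts are hidden or shown, toggling the layout cannot alter the measurement that triggered it, which is what keeps the decisions from oscillating.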
Summary
Adds vLLM speculative-decoding metrics end-to-end (Rust parse → TS types → UI), and reworks the hardware cards so the dashboard stays a one-pager across viewport sizes.
Changes
- feat(engines): parse vLLM speculative-decoding metrics in `src/engines/vllm.rs` (+ `mod.rs`), with unit tests.
- feat(frontend): surface the spec-decode metrics in the cache card; new TS types (`types/metrics.ts`), formatters (`lib/format.ts`), and aggregation (`lib/engineAggregate.ts`).
- style(frontend): declutter the spec-decode card and tighten the responsive grid.
- feat(frontend): adapt hardware cards to vertical space: new `HBar` gauge and `useElementSize` hook; cards swap square gauges for compact horizontal bars and drop line charts when vertical space is tight.
Metrics contract
`EngineSnapshot` metrics gain spec-decode fields, propagated through Rust tests, `types/metrics.ts`, `lib/format.ts`, Vitest specs (`EngineCardSpecDecode`, `engineAggregate`, `format`, `HBar`), and the engine components.
Test plan
- `cargo clippy --all-targets --locked -- -D warnings`, `cargo test --locked`
- `cd frontend && npm run build && npm test -- --run`
- deploy.sh)