feat: cache resolved model info and support per-endpoint engine API keys by niklasfrick · Pull Request #26 · niklasfrick/spark-dashboard

niklasfrick · 2026-05-19T07:43:55Z

Addresses #25 plus follow-on API/metrics work and a full dependency refresh.

Summary

Closes the root cause behind #25: spark-dashboard hit GET /v1/models on every poll tick (default 1s), flooding auth-gated vLLM logs with 401 Unauthorized and wasting traffic even on open deployments.

Model info is cached. Resolve /v1/models once, reuse the cached ModelInfo; re-resolve only on engine restart or every 10 min as a safety net. ~99.8% fewer /v1/models calls. /metrics is still polled live every tick.
Per-endpoint API key. --engine-api-key (index-paired with --engine-url) plus a global --provider-api-key / SPARK_DASHBOARD_PROVIDER_API_KEY fallback that also covers auto-detected engines. Bearer token applied to /health, /v1/models, /metrics; HF requests stay unauthenticated. Keys are redacted from logs.
Implicit "stop on 401": with no key, resolution falls back to the launch-command model name and the unresolved-retry cooldown prevents the per-second retry storm.
Dev loop. dev.sh forwards SPARK_DASHBOARD_PROVIDER_API_KEY into the remote backend launch; documented in dev/README.md and .env.example.
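
The ~99.8% figure is consistent with the polling arithmetic: at a 1s tick, a 10-minute window holds 600 ticks, and the cache replaces 600 /v1/models calls with 1, so 1 − 1/600 ≈ 99.83% of calls disappear. A minimal sketch of the cache decision follows, assuming a hypothetical `ModelCache` type, a numeric engine-instance id to detect restarts, and a 30s cooldown (the PR does not state the cooldown length). It is an illustration of the described behavior, not the actual EngineState code.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the resolved model record.
#[derive(Clone, Debug, PartialEq)]
pub struct ModelInfo {
    pub name: String,
}

/// The 10-minute safety net comes from the PR text.
pub const REFRESH_INTERVAL: Duration = Duration::from_secs(600);
/// Assumed value: the PR mentions a cooldown but not its length.
pub const UNRESOLVED_COOLDOWN: Duration = Duration::from_secs(30);

#[derive(Default)]
pub struct ModelCache {
    cached: Option<ModelInfo>,
    last_success: Option<Instant>,
    last_attempt: Option<Instant>,
    /// Changes when the engine restarts (for example a PID or start time).
    engine_instance: Option<u64>,
}

impl ModelCache {
    /// Returns true when the poll loop should call /v1/models on this tick.
    pub fn needs_resolve(&mut self, instance: u64, now: Instant) -> bool {
        if self.engine_instance != Some(instance) {
            // Engine restarted (or first sighting): invalidate everything.
            *self = ModelCache { engine_instance: Some(instance), ..ModelCache::default() };
            return true;
        }
        let fresh = self
            .last_success
            .map_or(false, |t| now.duration_since(t) < REFRESH_INTERVAL);
        if self.cached.is_some() && fresh {
            return false;
        }
        // No usable cache: retry only once the cooldown has elapsed,
        // which is what prevents a per-tick 401 storm.
        self.last_attempt
            .map_or(true, |t| now.duration_since(t) >= UNRESOLVED_COOLDOWN)
    }

    pub fn record_success(&mut self, info: ModelInfo, now: Instant) {
        self.cached = Some(info);
        self.last_success = Some(now);
        self.last_attempt = Some(now);
    }

    pub fn record_failure(&mut self, now: Instant) {
        self.last_attempt = Some(now);
    }

    pub fn get(&self) -> Option<&ModelInfo> {
        self.cached.as_ref()
    }
}
```

In this sketch the poll loop would call `needs_resolve` each tick, hit /v1/models only when it returns true, and otherwise read `get()` (or the launch-command name when nothing is cached).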

Additional metrics/UI work on this branch: TPOT on the engine latency card, prefix cache hit-rate plotting, cumulative token totals + tok/s in throughput titles, physical/Wi-Fi network interface selection, and engine chart tooltip cleanup.

ModelInfo serde shape is unchanged — no frontend / metrics-contract changes required.

Dependencies

Full dependency refresh to latest stable (chore(deps) commit, no release bump):

Rust: sysinfo 0.38→0.39, bollard 0.20→0.21, Cargo.lock refreshed for compatible transitive updates. No source changes required.
Frontend: all semver-minor/patch (react/react-dom 19.2.6, vite 8.0.13, vitest 4.1.6, eslint 10.4.0, tailwindcss 4.3.0, etc.); no breaking majors available. npm audit fix cleared 5 transitive advisories (fast-uri, hono, ip-address, express-rate-limit) → 0 vulnerabilities.
⚠️ MSRV raised 1.75 → 1.95 (sysinfo 0.39 requirement). rust-version = "1.95" added to Cargo.toml; README updated. The DGX Spark deploy box needs rustup update (≥1.95) before the next deploy.sh run.

Test plan

cargo fmt --all -- --check
cargo clippy --all-targets --locked -- -D warnings
cargo test --locked — 96 passed
cd frontend && npm run build && npm test -- --run — 126 passed
Manual: auth-gated vLLM via dev env — /v1/models 200 once then cached; no per-second 401s; falls back to launch-command name without a key
CI ubuntu-24.04-arm job confirms linux-only crates (bollard 0.21, nvml-wrapper, procfs) compile — not verifiable on macOS locally

Refs #25

Stop hitting /v1/models every poll tick. Model identity is static for a running engine, so resolve it once and reuse the cached ModelInfo: re-resolve only on engine restart or every 10 minutes as a safety net. This drops ~99.8% of /v1/models traffic and ends the per-second 401 retry storm against auth-gated deployments (issue #25).

- ApiKeyResolver: per-endpoint key (--engine-api-key, index-paired with --engine-url) with a global --provider-api-key / SPARK_DASHBOARD_PROVIDER_API_KEY fallback covering auto-detected engines. EngineOverride.api_key has a redacting Debug so keys never hit logs.
- VllmAdapter sends the bearer token on /health, /v1/models, /metrics; HF requests stay unauthenticated.
- EngineState caches ModelInfo with restart invalidation + unresolved retry cooldown; poll loop uses the cache, /metrics still polled live.
- dev.sh forwards SPARK_DASHBOARD_PROVIDER_API_KEY into the remote launch via `nohup env VAR=val`; documented in dev/README, .env.example.

ModelInfo serde shape unchanged: no frontend/metrics-contract changes.

Tests: 11 new unit tests for cache logic + key resolution (mod.rs).
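
The key-resolution order and the redacting Debug described above can be sketched as follows. The names `ApiKey` and `KeyResolver` are hypothetical, and wrapping the secret in a newtype is an assumption made for illustration; the PR only says that EngineOverride.api_key has a redacting Debug and that per-endpoint keys fall back to a global one.

```rust
use std::fmt;

/// Hypothetical newtype that hides the secret when formatted with {:?}.
#[derive(Clone, PartialEq, Eq)]
pub struct ApiKey(String);

impl ApiKey {
    pub fn new(secret: impl Into<String>) -> Self {
        ApiKey(secret.into())
    }

    /// Value for the HTTP Authorization header.
    pub fn bearer_header(&self) -> String {
        format!("Bearer {}", self.0)
    }
}

impl fmt::Debug for ApiKey {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Never print the secret, even in debug logs.
        f.write_str("ApiKey(<redacted>)")
    }
}

/// Hypothetical resolver: keys are index-paired with the configured engine
/// URLs, with one optional global fallback.
pub struct KeyResolver {
    per_endpoint: Vec<Option<ApiKey>>,
    global: Option<ApiKey>,
}

impl KeyResolver {
    pub fn new(per_endpoint: Vec<Option<ApiKey>>, global: Option<ApiKey>) -> Self {
        KeyResolver { per_endpoint, global }
    }

    /// `index` is Some(i) for the i-th configured engine URL and None for an
    /// auto-detected engine, which can only receive the global key.
    pub fn resolve(&self, index: Option<usize>) -> Option<&ApiKey> {
        index
            .and_then(|i| self.per_endpoint.get(i))
            .and_then(|key| key.as_ref())
            .or(self.global.as_ref())
    }
}
```

Returning `None` when neither key exists matches the described behavior of sending unauthenticated requests and falling back to the launch-command model name.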

Add a prefixCacheHit history buffer mirroring the existing kvCache pipeline and render it as a second line on the engine Cache chart (retitled from "KV Cache" to "Cache"). Both series share the 0-100% domain. The static Prefix Hit tile is unchanged.

Surface vLLM's time_per_output_token histogram as TPOT: the average gap between generating each output token, excluding TTFT. Full treatment matching TTFT/ITL/E2E: average, p50/p95/p99 percentiles, tunable SLO goodput, raw buckets, history buffers, latency-card tile honoring the avg/p50/p95/p99 mode toggle, and a line on the Latency time-series chart.

- Backend: TPOT_SLO_MS constant + 4 EngineMetrics fields; extraction in vllm.rs with v1/v0.6 metric-name fallback and warmup guard.
- Frontend: types, SLO defaults/settings (backward-compatible parseStored backfills tpotMs), history wiring, EngineCard tiles (TPOT in Latency card; TPOT goodput keeps the 2x2 grid under Combined), chart series.
- combinedGoodput intentionally stays TTFT/ITL/E2E (TPOT correlates with ITL; folding it in would double-weight decode latency).
- Tests added/updated on both sides.
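
For readers unfamiliar with deriving these numbers from a histogram, here is a rough sketch. It assumes Prometheus-style cumulative buckets with upper bounds in seconds, linear interpolation inside a bucket for percentiles, and goodput defined as the share of observations at or below the SLO. The function names are hypothetical and the extraction in vllm.rs may differ in detail.

```rust
/// One cumulative bucket: `count` observations were <= `le` seconds.
#[derive(Clone, Copy, Debug)]
pub struct Bucket {
    pub le: f64,
    pub count: u64,
}

/// Average in milliseconds from the histogram's sum and count.
/// Returns None before any observation exists (a simple warmup guard).
pub fn average_ms(sum_seconds: f64, count: u64) -> Option<f64> {
    if count == 0 {
        return None;
    }
    Some(sum_seconds / count as f64 * 1000.0)
}

/// Estimates quantile `q` (0.0..=1.0) in milliseconds.
/// `buckets` must be sorted by `le`; the last one may be +Inf.
pub fn quantile_ms(buckets: &[Bucket], q: f64) -> Option<f64> {
    let total = buckets.last()?.count;
    if total == 0 || !(0.0..=1.0).contains(&q) {
        return None;
    }
    let rank = q * total as f64;
    let (mut prev_le, mut prev_count) = (0.0_f64, 0_u64);
    for b in buckets {
        if b.count as f64 >= rank {
            if b.le.is_infinite() {
                // No upper bound to interpolate towards: report the last finite bound.
                return Some(prev_le * 1000.0);
            }
            let in_bucket = b.count.saturating_sub(prev_count) as f64;
            let frac = if in_bucket > 0.0 {
                (rank - prev_count as f64) / in_bucket
            } else {
                0.0
            };
            return Some((prev_le + (b.le - prev_le) * frac) * 1000.0);
        }
        prev_le = b.le;
        prev_count = b.count;
    }
    None
}

/// Percentage of observations in buckets whose upper bound is within the SLO.
pub fn goodput_pct(buckets: &[Bucket], slo_ms: f64) -> Option<f64> {
    let total = buckets.last()?.count;
    if total == 0 {
        return None;
    }
    let within = buckets
        .iter()
        .filter(|b| b.le * 1000.0 <= slo_ms)
        .map(|b| b.count)
        .max()
        .unwrap_or(0);
    Some(within as f64 / total as f64 * 100.0)
}
```

Because goodput is read off bucket boundaries, an SLO that falls between two bounds is effectively rounded down to the lower one in this sketch.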

Surface lifetime processed (prompt) and generated (decode) token counts on the Prompt Processing and Token Generation cards, per-model and summed on the overview. The vLLM counters were already scraped for rate math but never exposed; add total_prompt_tokens / total_generation_tokens to EngineMetrics and pass them through. Frontend shows them next to the Live tile via a new LiveWithTotal primitive, abbreviated K/M/B/T (formatCompactTokens) and counting up with an ease-out AnimatedCounter that honors prefers-reduced-motion and snaps on counter resets. Aggregate view sums them across running engines. Tests: Rust counter-parse test, formatCompactTokens boundaries, AnimatedCounter behavior, aggregate sum.
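
The K/M/B/T abbreviation can be illustrated with a small sketch. The real formatCompactTokens lives in the frontend; this version is written in Rust only for illustration, and its rounding rules (one decimal, truncated rather than rounded, plain integers below 1,000) are assumptions rather than the shipped behavior.

```rust
/// Abbreviates a lifetime token count, e.g. 1_500 -> "1.5K".
/// Hypothetical rules: one decimal place, truncated, no suffix below 1_000.
pub fn format_compact_tokens(n: u64) -> String {
    const UNITS: [(u64, &str); 4] = [
        (1_000_000_000_000, "T"),
        (1_000_000_000, "B"),
        (1_000_000, "M"),
        (1_000, "K"),
    ];
    for &(scale, suffix) in UNITS.iter() {
        if n >= scale {
            // Integer math keeps boundaries stable: 999_999 renders as
            // "999.9K" instead of rounding up to "1000.0K".
            let tenths = n / (scale / 10);
            return format!("{}.{}{}", tenths / 10, tenths % 10, suffix);
        }
    }
    n.to_string()
}
```

Truncating instead of rounding is one way to keep a value from displaying a larger unit's magnitude before it actually crosses the boundary.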

Expose vLLM's prefix_cache_queries_total (already scraped for hit-rate math) on EngineMetrics as a pass-through lifetime counter, ungated by warmup like the token totals. Show it as an animated, abbreviated second-row tile in the Cache card per-engine and summed on the global card. Aggregation sums across running engines via sumOrNull.

collect_network_metrics picked the interface with the most cumulative traffic and summed rx/tx across every interface, so loopback (heavy local engine/websocket traffic) won the name and inflated throughput. Split extraction from a pure select_network_metrics: classify virtual devices (loopback, container, VPN, bridge) by name prefix, prefer non-virtual interfaces with a globally routable IP, break ties by traffic, and scope rx/tx totals to real interfaces. Falls back to the legacy all-interfaces behavior only when no real interface exists. Metrics contract: serde shape unchanged (name/rx/tx) — frontend types, formatters, vitest, and components are N/A; only emitted values change. Rust unit tests added for selection, classification, and IP detection.
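
The selection rules above can be sketched as a pure function. The struct names, the prefix list, and the routability check are assumptions for illustration: here any address that is not loopback, link-local, or unspecified counts as routable, which is looser than a strict "globally routable" test, and the fallback reuses the same ranking rather than reproducing the legacy code exactly.

```rust
use std::net::IpAddr;

/// Hypothetical snapshot of one interface.
#[derive(Clone, Debug)]
pub struct Iface {
    pub name: String,
    pub ips: Vec<IpAddr>,
    pub rx: u64,
    pub tx: u64,
}

#[derive(Debug, PartialEq)]
pub struct NetworkMetrics {
    pub name: String,
    pub rx: u64,
    pub tx: u64,
}

/// Assumed prefix list for loopback, container, VPN, and bridge devices.
const VIRTUAL_PREFIXES: &[&str] = &[
    "lo", "docker", "veth", "br-", "virbr", "tun", "tap", "wg", "tailscale",
];

pub fn is_virtual(name: &str) -> bool {
    VIRTUAL_PREFIXES.iter().any(|p| name.starts_with(p))
}

/// Approximation: not loopback, not link-local, not unspecified.
fn has_routable_ip(ips: &[IpAddr]) -> bool {
    ips.iter().any(|ip| match ip {
        IpAddr::V4(v4) => !(v4.is_loopback() || v4.is_link_local() || v4.is_unspecified()),
        IpAddr::V6(v6) => {
            let link_local = (v6.segments()[0] & 0xffc0) == 0xfe80;
            !(v6.is_loopback() || v6.is_unspecified() || link_local)
        }
    })
}

pub fn select_network_metrics(ifaces: &[Iface]) -> Option<NetworkMetrics> {
    let real: Vec<&Iface> = ifaces.iter().filter(|i| !is_virtual(&i.name)).collect();
    // Fall back to every interface only when no real one exists.
    let pool: Vec<&Iface> = if real.is_empty() {
        ifaces.iter().collect()
    } else {
        real
    };
    // Prefer a routable address first, then break ties by total traffic.
    let best = pool
        .iter()
        .max_by_key(|i| (has_routable_ip(&i.ips), i.rx.saturating_add(i.tx)))?;
    Some(NetworkMetrics {
        name: best.name.clone(),
        // Totals are scoped to the same pool, so loopback no longer inflates them.
        rx: pool.iter().map(|i| i.rx).sum(),
        tx: pool.iter().map(|i| i.tx).sum(),
    })
}
```

Keeping the selection free of I/O is what makes it unit-testable with hand-built interface lists, as the commit describes.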

Prefill/Decode throughput charts now follow the latency convention: unit in the title after a middle dot (Prefill Throughput · tok/s), short legend labels (Live/Avg/Per-req), and a meaningful hover header (Tokens / sec) via a new optional TimeSeriesChart tooltipLabel prop. Other charts are unaffected.

Move the unit from after the dot into brackets on engine chart titles (e.g. Prefill Throughput (tok/s), Latency (ms) · <stat>). Hide the tooltip header on the Latency, E2E Latency, Requests, and Cache charts, and label the E2E single-line series as 'E2E Latency' instead of the bare unit.

Rust:
- sysinfo 0.38 -> 0.39 (now requires rustc 1.95)
- bollard 0.20 -> 0.21
- refresh Cargo.lock for compatible transitive updates
- add rust-version = "1.95" to reflect raised MSRV

Frontend (all semver-minor/patch, no breaking majors available):
- react/react-dom 19.2.6, vite 8.0.13, vitest 4.1.6, eslint 10.4.0, tailwindcss/@tailwindcss/vite 4.3.0, lucide-react 1.16.0, and others
- npm audit fix resolved 5 transitive advisories (fast-uri, hono, ip-address, express-rate-limit) with no breaking changes

MSRV raised 1.75 -> 1.95 (sysinfo 0.39 requirement); README updated. No source changes required; all Rust + frontend gates green.

niklasfrick added 10 commits May 19, 2026 09:43

feat: hide tooltip header on prefill and decode throughput charts too

630f44c

niklasfrick marked this pull request as ready for review May 19, 2026 10:24

niklasfrick merged commit 2e596b4 into main May 19, 2026
5 checks passed

niklasfrick mentioned this pull request May 19, 2026: chore: surface dependency commits in release notes; switch to rebase-merge #28 (merged)

niklasfrick merged 10 commits into main from feat/api-optimizations-and-metrics
