feat: cache resolved model info and support per-endpoint engine API keys #26

Merged: niklasfrick merged 10 commits into main from feat/api-optimizations-and-metrics on May 19, 2026

niklasfrick (Owner) commented on May 19, 2026

Addresses #25 plus follow-on API/metrics work and a full dependency refresh.

Summary

Closes the root cause behind #25: spark-dashboard hit GET /v1/models on every poll tick (default 1s), flooding auth-gated vLLM logs with 401 Unauthorized and wasting traffic even on open deployments.

  • Model info is cached. Resolve /v1/models once, reuse the cached ModelInfo; re-resolve only on engine restart or every 10 min as a safety net. ~99.8% fewer /v1/models calls. /metrics is still polled live every tick.
  • Per-endpoint API key. --engine-api-key (index-paired with --engine-url) plus a global --provider-api-key / SPARK_DASHBOARD_PROVIDER_API_KEY fallback that also covers auto-detected engines. Bearer token applied to /health, /v1/models, /metrics; HF requests stay unauthenticated. Keys are redacted from logs.
  • Implicit "stop on 401": with no key, resolution falls back to the launch-command model name and the unresolved-retry cooldown prevents the per-second retry storm.
  • dev loop: dev.sh forwards SPARK_DASHBOARD_PROVIDER_API_KEY into the remote backend launch; documented in dev/README.md and .env.example.
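
The ~99.8% figure in the first bullet can be checked with a short calculation. It assumes the default 1 s poll interval, the 10 min safety-net refresh, and no engine restart inside the window:

$$
\text{calls before} = \frac{600\ \text{s}}{1\ \text{s}} = 600, \qquad
\text{calls after} = 1, \qquad
\text{reduction} = 1 - \frac{1}{600} \approx 0.9983
$$

That is roughly 99.8% fewer /v1/models requests per 10-minute window. Each engine restart adds one extra resolve on top of this.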

Additional metrics/UI work on this branch: TPOT on the engine latency card, prefix cache hit-rate plotting, cumulative token totals + tok/s in throughput titles, physical/Wi-Fi network interface selection, and engine chart tooltip cleanup.

ModelInfo serde shape is unchanged — no frontend / metrics-contract changes required.

Dependencies

Full dependency refresh to latest stable (chore(deps) commit, no release bump):

  • Rust: sysinfo 0.38→0.39, bollard 0.20→0.21, Cargo.lock refreshed for compatible transitive updates. No source changes required.
  • Frontend: all semver-minor/patch (react/react-dom 19.2.6, vite 8.0.13, vitest 4.1.6, eslint 10.4.0, tailwindcss 4.3.0, etc.); no breaking majors available. npm audit fix cleared 5 transitive advisories (fast-uri, hono, ip-address, express-rate-limit) → 0 vulnerabilities.
  • ⚠️ MSRV raised 1.75 → 1.95 (sysinfo 0.39 requirement). rust-version = "1.95" added to Cargo.toml; README updated. The DGX Spark deploy box needs rustup update (≥1.95) before the next deploy.sh run.

Test plan

  • cargo fmt --all -- --check
  • cargo clippy --all-targets --locked -- -D warnings
  • cargo test --locked — 96 passed
  • cd frontend && npm run build && npm test -- --run — 126 passed
  • Manual: auth-gated vLLM via dev env — /v1/models 200 once then cached; no per-second 401s; falls back to launch-command name without a key
  • CI ubuntu-24.04-arm job confirms linux-only crates (bollard 0.21, nvml-wrapper, procfs) compile — not verifiable on macOS locally

Refs #25

Stop hitting /v1/models every poll tick. Model identity is static for a
running engine, so resolve it once and reuse the cached ModelInfo:
re-resolve only on engine restart or every 10 minutes as a safety net.
This drops ~99.8% of /v1/models traffic and ends the per-second 401
retry storm against auth-gated deployments (issue #25).
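
A minimal sketch of the caching decision described above, not the code from this PR. The names `ModelCache`, `needs_resolve`, and the `engine_started` marker are hypothetical, the `ModelInfo` struct is a stand-in, and the 60 s cooldown is an assumed value; only the 10 minute refresh comes from the text.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for the real ModelInfo; only the name is modelled.
#[derive(Clone, Debug, PartialEq)]
struct ModelInfo {
    name: String,
}

/// The 10 min refresh is stated in the PR; the cooldown length is an assumption.
const REFRESH_AFTER: Duration = Duration::from_secs(600);
const UNRESOLVED_COOLDOWN: Duration = Duration::from_secs(60);

#[derive(Default)]
struct ModelCache {
    cached: Option<(ModelInfo, Instant)>, // resolved info and when it was resolved
    last_failure: Option<Instant>,        // last failed resolve, e.g. a 401
    engine_started: Option<u64>,          // marker that changes when the engine restarts
}

impl ModelCache {
    /// Returns true when the poll loop should call /v1/models on this tick.
    fn needs_resolve(&mut self, engine_started: u64, now: Instant) -> bool {
        // A changed marker means the engine restarted: drop all cached state.
        if self.engine_started != Some(engine_started) {
            self.engine_started = Some(engine_started);
            self.cached = None;
            self.last_failure = None;
            return true;
        }
        // After a failure, wait out the cooldown instead of retrying every tick.
        if let Some(failed) = self.last_failure {
            if now.duration_since(failed) < UNRESOLVED_COOLDOWN {
                return false;
            }
        }
        match &self.cached {
            // Cached: re-resolve only once the safety-net interval has elapsed.
            Some((_, at)) => now.duration_since(*at) >= REFRESH_AFTER,
            None => true,
        }
    }

    fn record_success(&mut self, info: ModelInfo, now: Instant) {
        self.cached = Some((info, now));
        self.last_failure = None;
    }

    fn record_failure(&mut self, now: Instant) {
        self.last_failure = Some(now);
    }

    /// What the poll loop reads on every tick without touching the network.
    fn get(&self) -> Option<&ModelInfo> {
        self.cached.as_ref().map(|(info, _)| info)
    }
}
```

Passing `now` in as a parameter keeps the decision logic pure, so it can be unit-tested without sleeping.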

- ApiKeyResolver: per-endpoint key (--engine-api-key, index-paired with
  --engine-url) with a global --provider-api-key /
  SPARK_DASHBOARD_PROVIDER_API_KEY fallback covering auto-detected engines.
  EngineOverride.api_key has a redacting Debug so keys never hit logs.
- VllmAdapter sends the bearer token on /health, /v1/models, /metrics;
  HF requests stay unauthenticated.
- EngineState caches ModelInfo with restart invalidation + unresolved
  retry cooldown; poll loop uses the cache, /metrics still polled live.
- dev.sh forwards SPARK_DASHBOARD_PROVIDER_API_KEY into the remote
  launch via `nohup env VAR=val`; documented in dev/README, .env.example.
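
The key lookup and log redaction in the first two bullets could look roughly like the sketch below. The struct layout, the `resolve` and `bearer_header` names, and the use of `None` for an endpoint without its own key are assumptions for illustration; the PR only states that keys are index-paired with --engine-url, fall back to a global key, and are redacted in Debug output.

```rust
use std::fmt;

/// Wrapper whose Debug output never contains the key itself.
#[derive(Clone, PartialEq)]
struct ApiKey(String);

impl fmt::Debug for ApiKey {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        f.write_str("ApiKey(<redacted>)")
    }
}

struct ApiKeyResolver {
    per_endpoint: Vec<Option<ApiKey>>, // index-paired with the --engine-url list
    global: Option<ApiKey>,            // --provider-api-key or the env var
}

impl ApiKeyResolver {
    /// `index` is Some(i) for the i-th --engine-url, None for auto-detected engines.
    fn resolve(&self, index: Option<usize>) -> Option<&ApiKey> {
        index
            .and_then(|i| self.per_endpoint.get(i))
            .and_then(|k| k.as_ref())
            .or(self.global.as_ref())
    }
}

/// Value for the Authorization header, or None to send the request without one.
fn bearer_header(key: Option<&ApiKey>) -> Option<String> {
    key.map(|k| format!("Bearer {}", k.0))
}
```

In this sketch the same header value would be attached to /health, /v1/models, and /metrics requests, and simply not computed for HF requests.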

ModelInfo serde shape unchanged — no frontend/metrics-contract changes.
Tests: 11 new unit tests for cache logic + key resolution (mod.rs).

Add a prefixCacheHit history buffer mirroring the existing kvCache
pipeline and render it as a second line on the engine Cache chart
(retitled from "KV Cache" to "Cache"). Both series share the 0-100%
domain. The static Prefix Hit tile is unchanged.

Surface vLLM's time_per_output_token histogram as TPOT, the average
gap between generating each output token, excluding TTFT. Full
treatment matching TTFT/ITL/E2E: average, p50/p95/p99 percentiles,
tunable SLO goodput, raw buckets, history buffers, latency-card tile
honoring the avg/p50/p95/p99 mode toggle, and a line on the Latency
time-series chart.
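
For readers unfamiliar with how an average, percentiles, and SLO goodput fall out of a Prometheus-style histogram, here is a sketch under stated assumptions. The `Bucket` type and function names are hypothetical, bucket bounds are assumed to be in seconds and cumulative, and the percentile uses linear interpolation inside the crossing bucket (the approach PromQL's `histogram_quantile` takes). The extraction code in vllm.rs may differ.

```rust
/// One cumulative histogram bucket: observations with value <= `le` seconds.
#[derive(Clone, Copy)]
struct Bucket {
    le: f64,
    cumulative: f64,
}

/// Mean in milliseconds from the histogram's sum and count samples.
/// Returns None while nothing has been observed yet (a simple warmup guard).
fn average_ms(sum_secs: f64, count: f64) -> Option<f64> {
    if count > 0.0 { Some(sum_secs / count * 1000.0) } else { None }
}

/// Estimate quantile `q` (0..=1) in milliseconds by linear interpolation
/// inside the first bucket whose cumulative count reaches the target rank.
fn quantile_ms(buckets: &[Bucket], q: f64) -> Option<f64> {
    let total = buckets.last()?.cumulative;
    if total <= 0.0 {
        return None;
    }
    let rank = q * total;
    let mut lower = 0.0; // upper bound of the previous bucket
    let mut below = 0.0; // cumulative count of the previous bucket
    for b in buckets {
        if b.cumulative >= rank {
            if b.le.is_infinite() {
                // No finite upper bound: report the last finite bound.
                return Some(lower * 1000.0);
            }
            let in_bucket = b.cumulative - below;
            let frac = if in_bucket > 0.0 { (rank - below) / in_bucket } else { 0.0 };
            return Some((lower + (b.le - lower) * frac) * 1000.0);
        }
        lower = b.le;
        below = b.cumulative;
    }
    None
}

/// Share of observations (0..=1) at or under the SLO.
/// Exact only when the SLO sits on a bucket boundary.
fn goodput(buckets: &[Bucket], slo_ms: f64) -> Option<f64> {
    let total = buckets.last()?.cumulative;
    if total <= 0.0 {
        return None;
    }
    let within = buckets
        .iter()
        .filter(|b| b.le * 1000.0 <= slo_ms)
        .map(|b| b.cumulative)
        .fold(0.0, f64::max);
    Some(within / total)
}
```

Percentiles derived this way are only as precise as the bucket layout, which is one reason to keep the raw buckets available alongside the summary values.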

- Backend: TPOT_SLO_MS constant + 4 EngineMetrics fields; extraction
  in vllm.rs with v1/v0.6 metric-name fallback and warmup guard.
- Frontend: types, SLO defaults/settings (backward-compatible
  parseStored backfills tpotMs), history wiring, EngineCard tiles
  (TPOT in Latency card; TPOT goodput keeps the 2x2 grid under
  Combined), chart series.
- combinedGoodput intentionally stays TTFT/ITL/E2E (TPOT correlates
  with ITL; folding it in would double-weight decode latency).
- Tests added/updated on both sides.

Surface lifetime processed (prompt) and generated (decode) token counts
on the Prompt Processing and Token Generation cards, per-model and summed
on the overview. The vLLM counters were already scraped for rate math but
never exposed; add total_prompt_tokens / total_generation_tokens to
EngineMetrics and pass them through.

Frontend shows them next to the Live tile via a new LiveWithTotal
primitive, abbreviated K/M/B/T (formatCompactTokens) and counting up with
an ease-out AnimatedCounter that honors prefers-reduced-motion and snaps
on counter resets. Aggregate view sums them across running engines.
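
To illustrate the K/M/B/T abbreviation, here is a hypothetical Rust rendering of the idea. The real formatCompactTokens lives in the frontend, and its rounding and number of decimals are not shown in this PR; this sketch assumes one decimal place and truncation rather than rounding, so a value never rolls over into the next unit early.

```rust
/// Abbreviate a token count with K/M/B/T suffixes (illustrative port).
fn format_compact_tokens(n: u64) -> String {
    // Largest unit first so the first match is the right one.
    const UNITS: [(u64, &str); 4] = [
        (1_000_000_000_000, "T"),
        (1_000_000_000, "B"),
        (1_000_000, "M"),
        (1_000, "K"),
    ];
    for &(scale, suffix) in UNITS.iter() {
        if n >= scale {
            let whole = n / scale;
            // Integer math: one truncated decimal digit, no float rounding.
            let tenths = (n % scale) / (scale / 10);
            return format!("{}.{}{}", whole, tenths, suffix);
        }
    }
    // Below 1000 the raw count is short enough to show as-is.
    n.to_string()
}
```

Truncating keeps 999,999 at "999.9K" instead of displaying "1000.0K", which is the kind of boundary case the formatCompactTokens tests mentioned below presumably cover.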

Tests: Rust counter-parse test, formatCompactTokens boundaries,
AnimatedCounter behavior, aggregate sum.

Expose vLLM's prefix_cache_queries_total (already scraped for hit-rate
math) on EngineMetrics as a pass-through lifetime counter, ungated by
warmup like the token totals. Show it as an animated, abbreviated
second-row tile in the Cache card per-engine and summed on the global
card. Aggregation sums across running engines via sumOrNull.

collect_network_metrics picked the interface with the most cumulative
traffic and summed rx/tx across every interface, so loopback (heavy
local engine/websocket traffic) won the name and inflated throughput.

Split extraction from a pure select_network_metrics: classify virtual
devices (loopback, container, VPN, bridge) by name prefix, prefer
non-virtual interfaces with a globally routable IP, break ties by
traffic, and scope rx/tx totals to real interfaces. Falls back to the
legacy all-interfaces behavior only when no real interface exists.
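
The selection rules above can be sketched as a pure function. Everything here is illustrative: the `Iface` struct, the prefix list, and the `has_routable_ip` flag (assumed to be computed by the caller) are not taken from the PR, and the real select_network_metrics signature may differ.

```rust
#[derive(Clone, Debug, PartialEq)]
struct Iface {
    name: String,
    rx: u64,
    tx: u64,
    has_routable_ip: bool, // assumed to be determined before selection
}

/// Illustrative prefixes only; the actual list is not shown in the PR.
const VIRTUAL_PREFIXES: [&str; 6] = ["lo", "docker", "veth", "br-", "tun", "wg"];

/// Classify loopback, container, VPN, and bridge devices by name prefix.
fn is_virtual(name: &str) -> bool {
    VIRTUAL_PREFIXES.iter().any(|p| name.starts_with(*p))
}

/// Returns (chosen interface name, rx total, tx total), or None with no interfaces.
fn select_network_metrics(ifaces: &[Iface]) -> Option<(String, u64, u64)> {
    let real: Vec<&Iface> = ifaces.iter().filter(|i| !is_virtual(&i.name)).collect();
    // Fall back to every interface only when no real one exists.
    let pool: Vec<&Iface> = if real.is_empty() {
        ifaces.iter().collect()
    } else {
        real
    };
    // A routable IP wins first; cumulative traffic breaks ties.
    let best = pool.iter().max_by_key(|i| (i.has_routable_ip, i.rx + i.tx))?;
    // Totals are scoped to the same pool, so loopback no longer inflates them.
    let rx: u64 = pool.iter().map(|i| i.rx).sum();
    let tx: u64 = pool.iter().map(|i| i.tx).sum();
    Some((best.name.clone(), rx, tx))
}
```

Because the function takes plain data and does no I/O, the selection and classification cases can be tested without real network devices.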

Metrics contract: serde shape unchanged (name/rx/tx) — frontend types,
formatters, vitest, and components are N/A; only emitted values change.
Rust unit tests added for selection, classification, and IP detection.

Prefill/Decode throughput charts now follow the latency convention:
unit in the title after a middle dot (Prefill Throughput · tok/s),
short legend labels (Live/Avg/Per-req), and a meaningful hover
header (Tokens / sec) via a new optional TimeSeriesChart tooltipLabel
prop. Other charts are unaffected.

Move the unit from after the dot into brackets on engine chart titles
(e.g. Prefill Throughput (tok/s), Latency (ms) · <stat>). Hide the
tooltip header on the Latency, E2E Latency, Requests, and Cache charts,
and label the E2E single-line series as 'E2E Latency' instead of the
bare unit.

Rust:
- sysinfo 0.38 -> 0.39 (now requires rustc 1.95)
- bollard 0.20 -> 0.21
- refresh Cargo.lock for compatible transitive updates
- add rust-version = "1.95" to reflect raised MSRV

Frontend (all semver-minor/patch, no breaking majors available):
- react/react-dom 19.2.6, vite 8.0.13, vitest 4.1.6, eslint 10.4.0,
  tailwindcss/@tailwindcss/vite 4.3.0, lucide-react 1.16.0, and others
- npm audit fix resolved 5 transitive advisories (fast-uri, hono,
  ip-address, express-rate-limit) with no breaking changes

MSRV raised 1.75 -> 1.95 (sysinfo 0.39 requirement); README updated.
No source changes required; all Rust + frontend gates green.

niklasfrick marked this pull request as ready for review on May 19, 2026, 10:24
niklasfrick merged commit 2e596b4 into main on May 19, 2026 (5 checks passed)