feat: cache resolved model info and support per-endpoint engine API keys #26
Merged
Stop hitting /v1/models every poll tick. Model identity is static for a running engine, so resolve it once and reuse the cached ModelInfo: re-resolve only on engine restart or every 10 minutes as a safety net. This drops ~99.8% of /v1/models traffic and ends the per-second 401 retry storm against auth-gated deployments (issue #25).

- ApiKeyResolver: per-endpoint key (--engine-api-key, index-paired with --engine-url) with a global --provider-api-key / SPARK_DASHBOARD_PROVIDER_API_KEY fallback covering auto-detected engines. EngineOverride.api_key has a redacting Debug so keys never hit logs.
- VllmAdapter sends the bearer token on /health, /v1/models, /metrics; HF requests stay unauthenticated.
- EngineState caches ModelInfo with restart invalidation + unresolved retry cooldown; poll loop uses the cache, /metrics still polled live.
- dev.sh forwards SPARK_DASHBOARD_PROVIDER_API_KEY into the remote launch via `nohup env VAR=val`; documented in dev/README, .env.example.

ModelInfo serde shape unchanged: no frontend/metrics-contract changes.

Tests: 11 new unit tests for cache logic + key resolution (mod.rs).
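The ~99.8% figure follows from the defaults quoted above: with a 1s tick, one resolution per 10 minutes replaces 600 calls, and 1 - 1/600 ≈ 99.83%. The caching rule itself (restart invalidation, 10-minute refresh, cooldown while unresolved) can be sketched as below. This is a minimal illustration, not the project's EngineState: the `ModelCache` type, the `engine_epoch` restart marker, and the 30s `RETRY_COOLDOWN` are all assumptions made for the example.

```rust
use std::time::{Duration, Instant};

/// Illustrative stand-in for the resolved model identity.
#[derive(Clone, Debug, PartialEq)]
pub struct ModelInfo {
    pub id: String,
}

/// Safety-net refresh interval named in the commit message (10 minutes).
const REFRESH_AFTER: Duration = Duration::from_secs(600);
/// Hypothetical cooldown between attempts while resolution keeps failing.
const RETRY_COOLDOWN: Duration = Duration::from_secs(30);

#[derive(Default)]
pub struct ModelCache {
    info: Option<ModelInfo>,
    last_attempt: Option<Instant>,
    /// Assumed restart marker (e.g. a process start time or PID).
    engine_epoch: Option<u64>,
}

impl ModelCache {
    /// True when the poll loop should call /v1/models on this tick.
    pub fn needs_resolve(&self, engine_epoch: u64, now: Instant) -> bool {
        // A restart invalidates the entry regardless of timers.
        if self.engine_epoch != Some(engine_epoch) {
            return true;
        }
        let Some(last) = self.last_attempt else { return true };
        // Resolved entries wait for the safety-net refresh; unresolved ones
        // retry after the shorter cooldown instead of every tick.
        let wait = if self.info.is_some() { REFRESH_AFTER } else { RETRY_COOLDOWN };
        now.duration_since(last) >= wait
    }

    /// Records an attempt; `None` means resolution failed (e.g. a 401).
    pub fn record(&mut self, engine_epoch: u64, result: Option<ModelInfo>, now: Instant) {
        if self.engine_epoch != Some(engine_epoch) {
            self.info = None; // drop the previous process's identity
        }
        self.engine_epoch = Some(engine_epoch);
        self.last_attempt = Some(now);
        if result.is_some() {
            self.info = result; // a failed refresh keeps the old value
        }
    }

    pub fn cached(&self) -> Option<&ModelInfo> {
        self.info.as_ref()
    }
}
```

A poll loop would call `needs_resolve` each tick, hit /v1/models only when it returns true, and read `cached()` otherwise, while /metrics continues to be scraped on every tick.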
Add a prefixCacheHit history buffer mirroring the existing kvCache pipeline and render it as a second line on the engine Cache chart (retitled from "KV Cache" to "Cache"). Both series share the 0-100% domain. The static Prefix Hit tile is unchanged.
Surface vLLM's time_per_output_token histogram as TPOT, the average gap between generating each output token, excluding TTFT. Full treatment matching TTFT/ITL/E2E: average, p50/p95/p99 percentiles, tunable SLO goodput, raw buckets, history buffers, latency-card tile honoring the avg/p50/p95/p99 mode toggle, and a line on the Latency time-series chart.

- Backend: TPOT_SLO_MS constant + 4 EngineMetrics fields; extraction in vllm.rs with v1/v0.6 metric-name fallback and warmup guard.
- Frontend: types, SLO defaults/settings (backward-compatible parseStored backfills tpotMs), history wiring, EngineCard tiles (TPOT in Latency card; TPOT goodput keeps the 2x2 grid under Combined), chart series.
- combinedGoodput intentionally stays TTFT/ITL/E2E (TPOT correlates with ITL; folding it in would double-weight decode latency).
- Tests added/updated on both sides.
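To make the average and percentile figures concrete, here is a rough sketch of deriving them from a cumulative Prometheus-style histogram. It is not the code in vllm.rs: the `Histogram` struct and both function names are hypothetical, and the percentile uses linear interpolation within a bucket (the convention Prometheus's `histogram_quantile` follows), which may differ from what the project does.

```rust
/// Cumulative histogram snapshot: (upper_bound_seconds, cumulative_count),
/// sorted by bound; the last bound may be f64::INFINITY.
pub struct Histogram {
    pub buckets: Vec<(f64, f64)>,
    pub sum: f64,
    pub count: f64,
}

/// Mean observation over the interval between two scrapes, in milliseconds.
/// Returns None when no new observations arrived (warmup) or the counter reset.
pub fn interval_avg_ms(prev: &Histogram, cur: &Histogram) -> Option<f64> {
    let new_observations = cur.count - prev.count;
    if new_observations <= 0.0 {
        return None;
    }
    Some((cur.sum - prev.sum) / new_observations * 1000.0)
}

/// Quantile estimate in milliseconds, interpolating linearly inside the
/// bucket that contains the target rank.
pub fn quantile_ms(h: &Histogram, q: f64) -> Option<f64> {
    let total = h.buckets.last()?.1;
    if total <= 0.0 {
        return None;
    }
    let rank = q * total;
    let mut lower = 0.0; // lower bound of the current bucket
    let mut below = 0.0; // cumulative count below the current bucket
    for &(upper, cum) in &h.buckets {
        if cum >= rank {
            if upper.is_infinite() {
                // No finite upper edge: report the highest finite bound.
                return Some(lower * 1000.0);
            }
            let in_bucket = cum - below;
            let frac = if in_bucket > 0.0 { (rank - below) / in_bucket } else { 0.0 };
            return Some((lower + (upper - lower) * frac) * 1000.0);
        }
        lower = upper;
        below = cum;
    }
    None
}
```

Returning `None` when no new observations arrived is one way to express a warmup guard: the tile shows no value instead of dividing by zero.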
Surface lifetime processed (prompt) and generated (decode) token counts on the Prompt Processing and Token Generation cards, per-model and summed on the overview. The vLLM counters were already scraped for rate math but never exposed; add total_prompt_tokens / total_generation_tokens to EngineMetrics and pass them through. Frontend shows them next to the Live tile via a new LiveWithTotal primitive, abbreviated K/M/B/T (formatCompactTokens) and counting up with an ease-out AnimatedCounter that honors prefers-reduced-motion and snaps on counter resets. Aggregate view sums them across running engines. Tests: Rust counter-parse test, formatCompactTokens boundaries, AnimatedCounter behavior, aggregate sum.
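The K/M/B/T abbreviation can be illustrated with a small Rust sketch. The real formatCompactTokens lives in the frontend and its exact rounding and boundary behavior is not shown here, so the one-decimal, truncating style below is an assumption, and `format_compact_tokens` is a hypothetical name.

```rust
/// Abbreviates a lifetime token count with K/M/B/T suffixes.
/// Uses integer math and truncates to one decimal, so 999_999 renders as
/// "999.9K" and never rounds up to "1000.0K".
pub fn format_compact_tokens(n: u64) -> String {
    const UNITS: [(u64, &str); 4] = [
        (1_000_000_000_000, "T"),
        (1_000_000_000, "B"),
        (1_000_000, "M"),
        (1_000, "K"),
    ];
    for (scale, suffix) in UNITS {
        if n >= scale {
            let whole = n / scale;
            let tenth = (n % scale) / (scale / 10);
            return format!("{whole}.{tenth}{suffix}");
        }
    }
    // Below 1_000 the raw count is short enough to show as-is.
    n.to_string()
}
```

Integer arithmetic avoids the floating-point edge cases that tend to appear right at the unit boundaries, which is where the commit's boundary tests focus.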
Expose vLLM's prefix_cache_queries_total (already scraped for hit-rate math) on EngineMetrics as a pass-through lifetime counter, ungated by warmup like the token totals. Show it as an animated, abbreviated second-row tile in the Cache card per-engine and summed on the global card. Aggregation sums across running engines via sumOrNull.
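A possible shape for the aggregation, written in Rust for illustration (sumOrNull itself is a frontend helper). The semantics here are an assumption: engines that report no value are skipped, and the result is `None` only when no engine reports one. The name `sum_or_none` is hypothetical.

```rust
/// Sums per-engine lifetime counters, skipping engines that report nothing.
/// Returns None when no engine has a value, so the UI can show a placeholder
/// instead of a misleading zero.
pub fn sum_or_none<I>(values: I) -> Option<u64>
where
    I: IntoIterator<Item = Option<u64>>,
{
    values
        .into_iter()
        .flatten() // drop the None entries
        .fold(None, |acc: Option<u64>, v| Some(acc.unwrap_or(0).saturating_add(v)))
}
```

Distinguishing "no data" from "zero queries" matters for a lifetime counter: an engine that has served no prefix-cache queries yet should read 0, while an engine that is not reporting should not contribute at all.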
collect_network_metrics picked the interface with the most cumulative traffic and summed rx/tx across every interface, so loopback (heavy local engine/websocket traffic) won the name and inflated throughput. Split extraction from a pure select_network_metrics: classify virtual devices (loopback, container, VPN, bridge) by name prefix, prefer non-virtual interfaces with a globally routable IP, break ties by traffic, and scope rx/tx totals to real interfaces. Falls back to the legacy all-interfaces behavior only when no real interface exists. Metrics contract: serde shape unchanged (name/rx/tx) — frontend types, formatters, vitest, and components are N/A; only emitted values change. Rust unit tests added for selection, classification, and IP detection.
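The selection rule described above might look roughly like this as a pure function. It is a sketch under stated assumptions rather than the project's select_network_metrics: the `Iface` input type, the prefix list, and the `is_usable` address check are hypothetical, and `is_usable` is deliberately looser than "globally routable" (it only rejects loopback, unspecified, and link-local addresses, so private LAN addresses count).

```rust
use std::net::IpAddr;

#[derive(Clone, Debug)]
pub struct Iface {
    pub name: String,
    pub rx: u64,
    pub tx: u64,
    pub ips: Vec<IpAddr>,
}

#[derive(Debug, PartialEq)]
pub struct NetworkMetrics {
    pub name: String,
    pub rx: u64,
    pub tx: u64,
}

/// Hypothetical prefixes for loopback, container, bridge, and VPN devices.
const VIRTUAL_PREFIXES: &[&str] =
    &["lo", "docker", "veth", "br-", "virbr", "tun", "tap", "wg", "tailscale"];

fn is_virtual(name: &str) -> bool {
    VIRTUAL_PREFIXES.iter().any(|p| name.starts_with(p))
}

/// Loose stand-in for the routability check (see the assumptions above).
fn is_usable(ip: &IpAddr) -> bool {
    match ip {
        IpAddr::V4(v4) => !(v4.is_loopback() || v4.is_unspecified() || v4.is_link_local()),
        IpAddr::V6(v6) => {
            let link_local = (v6.segments()[0] & 0xffc0) == 0xfe80; // fe80::/10
            !(v6.is_loopback() || v6.is_unspecified() || link_local)
        }
    }
}

pub fn select_network_metrics(ifaces: &[Iface]) -> Option<NetworkMetrics> {
    let real: Vec<&Iface> = ifaces.iter().filter(|i| !is_virtual(&i.name)).collect();
    // Legacy fallback: use every interface only when no real one exists.
    let pool: Vec<&Iface> = if real.is_empty() { ifaces.iter().collect() } else { real };
    // Prefer an interface with a usable IP, then break ties by traffic.
    let best = pool
        .iter()
        .max_by_key(|i| (i.ips.iter().any(is_usable), i.rx.saturating_add(i.tx)))?;
    Some(NetworkMetrics {
        name: best.name.clone(),
        // Totals are scoped to the same pool, so loopback no longer inflates them.
        rx: pool.iter().map(|i| i.rx).sum(),
        tx: pool.iter().map(|i| i.tx).sum(),
    })
}
```

Keeping selection pure (no system calls) is what makes it unit-testable: tests can build `Iface` values by hand and check which name wins.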
Prefill/Decode throughput charts now follow the latency convention: unit in the title after a middle dot (Prefill Throughput · tok/s), short legend labels (Live/Avg/Per-req), and a meaningful hover header (Tokens / sec) via a new optional TimeSeriesChart tooltipLabel prop. Other charts are unaffected.
Move the unit from after the dot into brackets on engine chart titles (e.g. Prefill Throughput (tok/s), Latency (ms) · <stat>). Hide the tooltip header on the Latency, E2E Latency, Requests, and Cache charts, and label the E2E single-line series as 'E2E Latency' instead of the bare unit.
Rust:
- sysinfo 0.38 -> 0.39 (now requires rustc 1.95)
- bollard 0.20 -> 0.21
- refresh Cargo.lock for compatible transitive updates
- add rust-version = "1.95" to reflect raised MSRV

Frontend (all semver-minor/patch, no breaking majors available):
- react/react-dom 19.2.6, vite 8.0.13, vitest 4.1.6, eslint 10.4.0, tailwindcss/@tailwindcss/vite 4.3.0, lucide-react 1.16.0, and others
- npm audit fix resolved 5 transitive advisories (fast-uri, hono, ip-address, express-rate-limit) with no breaking changes

MSRV raised 1.75 -> 1.95 (sysinfo 0.39 requirement); README updated. No source changes required; all Rust + frontend gates green.
Addresses #25 plus follow-on API/metrics work and a full dependency refresh.
Summary
Closes the root cause behind #25: spark-dashboard hit `GET /v1/models` on every poll tick (default 1s), flooding auth-gated vLLM logs with `401 Unauthorized` and wasting traffic even on open deployments.

- Resolve `/v1/models` once, reuse the cached `ModelInfo`; re-resolve only on engine restart or every 10 min as a safety net. ~99.8% fewer `/v1/models` calls. `/metrics` is still polled live every tick.
- `--engine-api-key` (index-paired with `--engine-url`) plus a global `--provider-api-key` / `SPARK_DASHBOARD_PROVIDER_API_KEY` fallback that also covers auto-detected engines. Bearer token applied to `/health`, `/v1/models`, `/metrics`; HF requests stay unauthenticated. Keys are redacted from logs.
- `dev.sh` forwards `SPARK_DASHBOARD_PROVIDER_API_KEY` into the remote backend launch; documented in `dev/README.md` and `.env.example`.

Additional metrics/UI work on this branch: TPOT on the engine latency card, prefix cache hit-rate plotting, cumulative token totals + tok/s in throughput titles, physical/Wi-Fi network interface selection, and engine chart tooltip cleanup.
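For reviewers, the key-resolution order and log redaction summarized above can be sketched as follows. This is an illustration only, not the project's ApiKeyResolver: the `ApiKey` wrapper, the constructor signature, and the treatment of an empty string as "no key for this endpoint" are assumptions.

```rust
use std::fmt;

/// Secret wrapper whose Debug output never reveals the key.
#[derive(Clone, PartialEq)]
pub struct ApiKey(String);

impl fmt::Debug for ApiKey {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        f.write_str("ApiKey(<redacted>)")
    }
}

impl ApiKey {
    /// Value for the Authorization header on /health, /v1/models, /metrics.
    pub fn bearer_header(&self) -> String {
        format!("Bearer {}", self.0)
    }
}

pub struct ApiKeyResolver {
    /// Index-paired with the --engine-url list.
    per_endpoint: Vec<Option<ApiKey>>,
    /// Global fallback, also used for auto-detected engines.
    global: Option<ApiKey>,
}

impl ApiKeyResolver {
    pub fn new(engine_keys: Vec<String>, global: Option<String>) -> Self {
        Self {
            // Assumption: an empty string means "no key for this endpoint".
            per_endpoint: engine_keys
                .into_iter()
                .map(|k| (!k.is_empty()).then(|| ApiKey(k)))
                .collect(),
            global: global.filter(|k| !k.is_empty()).map(ApiKey),
        }
    }

    /// `index` is Some(i) for the i-th --engine-url and None for an
    /// auto-detected engine. Per-endpoint keys win over the global one.
    pub fn resolve(&self, index: Option<usize>) -> Option<&ApiKey> {
        index
            .and_then(|i| self.per_endpoint.get(i))
            .and_then(|k| k.as_ref())
            .or(self.global.as_ref())
    }
}
```

Because the redaction lives on the key type itself, any struct that derives `Debug` and holds an `ApiKey` inherits it, which is one way to keep keys out of logs without auditing every log call.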
`ModelInfo` serde shape is unchanged: no frontend / metrics-contract changes required.

Dependencies
Full dependency refresh to latest stable (`chore(deps)` commit, no release bump):

- `sysinfo` 0.38→0.39, `bollard` 0.20→0.21, `Cargo.lock` refreshed for compatible transitive updates. No source changes required.
- `npm audit fix` cleared 5 transitive advisories (fast-uri, hono, ip-address, express-rate-limit) → 0 vulnerabilities.
- MSRV raised 1.75→1.95 (`sysinfo` 0.39 requirement). `rust-version = "1.95"` added to `Cargo.toml`; README updated. The DGX Spark deploy box needs `rustup update` (≥1.95) before the next `deploy.sh` run.

Test plan
- `cargo fmt --all -- --check`
- `cargo clippy --all-targets --locked -- -D warnings`
- `cargo test --locked`: 96 passed
- `cd frontend && npm run build && npm test -- --run`: 126 passed
- `/v1/models` 200 once then cached; no per-second 401s; falls back to launch-command name without a key
- `ubuntu-24.04-arm` job confirms linux-only crates (`bollard` 0.21, `nvml-wrapper`, `procfs`) compile; not verifiable on macOS locally

Refs #25