Real-time hardware and LLM inference monitoring for Linux systems with NVIDIA GPUs. Developed and tested on the NVIDIA DGX Spark, but works on any Linux host with NVIDIA drivers — discrete-GPU workstations, DGX boxes, cloud VMs. A Rust backend collects GPU, CPU, memory, disk, and network metrics alongside vLLM engine statistics and streams them over WebSocket to a React frontend.
Run as your normal user on any Linux host with NVIDIA drivers (requires Rust 1.95+):
cargo install spark-dashboard
sudo ~/.cargo/bin/spark-dashboard service install
systemctl status spark-dashboardThe dashboard is now served on port 3000. See Install on your Linux host for the full guide, config overrides, and uninstall.
git clone https://github.com/niklasfrick/spark-dashboard.git
cd spark-dashboard
cp .env.example .env # edit with your remote host's user/host
./dev/dev.shOpen http://localhost:5173 in your browser. See dev/README.md
for details on what each script does.
Hardware Monitoring (1s polling via NVML, sysinfo, procfs)
- GPU utilization, temperature, power draw, clock frequencies, fan speed
- GPU event detection — thermal throttling, hardware slowdown, power brake
- CPU aggregate and per-core utilization with heatmap
- Memory breakdown — CPU RAM and GPU VRAM separately on discrete-GPU hosts, or a single unified pool on systems where CPU and GPU share memory (e.g. DGX Spark GB10, GH200)
- Disk and network I/O throughput
LLM Engine Monitoring (vLLM via Prometheus metrics)
- Tokens per second (generation + prompt)
- Time to first token, inter-token latency, end-to-end latency, queue time
- Active/queued requests, batch size
- KV cache utilization, prefix cache hit rate
- Automatic engine discovery via process scan and Docker API
- SLO Goodput
Multi-Engine Support
- Run and monitor any number of inference engines side by side — each vLLM process or container is detected automatically and gets its own tab
- All Engines overview tab with running-engine count, summed throughput, and weighted-mean latencies across every detected engine
- Per-engine drill-down tabs with provider chips showing the engine type (vLLM) and source (Docker / host process)
- Auto-rotating tab carousel with a configurable interval (default 3 s); rotation pauses automatically the moment you interact with a tab so you can focus on a single engine without fighting the UI
Dashboard
- Arc gauges, time-series charts, sparklines, per-core heatmap
- 15-minute rolling history with circular buffers
- Connection status badge, staleness detection, auto-reconnect
┌──────────────────────┐ WebSocket (JSON) ┌────────────────────┐
│ Rust Backend │ ──────────────────────────────▶ │ React Frontend │
│ │ │ │
│ Tokio tasks: │ │ useMetrics hook │
│ ├─ metrics_collector│ broadcast channel (capacity 16) │ ├─ WebSocket conn │
│ │ (GPU/CPU/mem/…) ─┼──▶ tx ──▶ ws_handler ──▶ client │ ├─ batch flush 2s │
│ └─ engine_collector │ │ └─ circular bufs │
│ (vLLM/Docker) │ │ │
│ │ Static files (rust-embed) │ Recharts, Tailwind│
│ Axum router │ ◀──── production only ───────── │ shadcn/ui │
└──────────────────────┘ └────────────────────┘
Linux host (e.g. DGX Spark) Browser
Two independent Tokio tasks run in parallel — one for hardware metrics (NVML,
sysinfo, procfs) and one for engine detection/polling. Both feed into a
broadcast channel that fans out to all connected WebSocket clients. In
production the frontend is embedded in the binary via rust-embed; in
development, Vite serves the frontend locally and proxies API/WebSocket
traffic to the remote backend.
All operator config lives in a repo-root .env file. Copy the template and
edit:
cp .env.example .env| Variable | Purpose |
|---|---|
DEPLOY_USER |
SSH user on the remote host (required) |
DEPLOY_HOST |
Hostname or IP of the remote host (required) |
DEPLOY_DIR |
Project path on the remote host, relative to remote home (default spark-dashboard) |
VITE_BACKEND_URL |
Where Vite proxies /ws and /api (default http://localhost:3000) |
Legacy SPARK_USER / SPARK_HOST / SPARK_DIR are still accepted as a
fallback when DEPLOY_* are unset — dev.sh prints a one-line deprecation
note. The scripts in dev/ source this file; Vite picks up VITE_* variables
automatically. .env is gitignored — never commit it.
The dashboard runs as a supervised systemd service. Two install paths; both
build from source on the host.
# On the host. Requires Rust 1.95+, NVIDIA drivers, and internet access.
cargo install spark-dashboard
sudo ~/.cargo/bin/spark-dashboard service install
systemctl status spark-dashboardcargo install pulls the crate from crates.io
and compiles it locally. service install copies the binary to
/usr/local/bin, creates a locked-down spark-dashboard system user (added
to video, render, docker groups for NVML and Docker access), writes the
systemd unit, and enables it.
Why the explicit
~/.cargo/bin/path?cargo installdrops the binary in~/.cargo/bin, which isn't onsudo's sanitizedsecure_pathand isn't always on the user's interactive PATH either (depends on how Rust was installed). Passing the absolute path makes the command work regardless. Afterservice installcopies the binary to/usr/local/bin, subsequentsudo spark-dashboard …calls (e.g.service status,service uninstall) resolve normally.
Use this when you want to install without crates.io (audit the source, air-gapped install, or deploy an unreleased commit).
# On the host. Run as your normal user — the script escalates to sudo
# only for the systemd wiring step.
git clone https://github.com/niklasfrick/spark-dashboard.git
cd spark-dashboard
./packaging/install.shThis builds the frontend (npm run build) and the Rust binary
(cargo build --release), then hands off to the same service install
logic as Option A. You'll be prompted for your sudo password once, when
the service is installed.
sudo systemctl {start|stop|restart} spark-dashboard
journalctl -u spark-dashboard -f # follow logs
sudo spark-dashboard service status # same as `systemctl status`Optional overrides live in /etc/spark-dashboard/config.env — set
SPARK_DASHBOARD_PORT, SPARK_DASHBOARD_BIND, SPARK_DASHBOARD_POLL_INTERVAL,
SPARK_DASHBOARD_GPU_INDEX, SPARK_DASHBOARD_PROVIDER_API_KEY, or RUST_LOG, then
sudo systemctl restart spark-dashboard.
# Option A
cargo install --force spark-dashboard && sudo ~/.cargo/bin/spark-dashboard service install
# Option B
cd spark-dashboard && git pull && ./packaging/install.shRe-running service install is idempotent: it stops the service, swaps the
binary, and starts it again, preserving /etc/spark-dashboard/config.env.
sudo spark-dashboard service uninstall # keep /etc/spark-dashboard
sudo spark-dashboard service uninstall --purge # remove everythingspark-dashboard [OPTIONS] run the server (default)
spark-dashboard service install [--prefix /usr/local]
spark-dashboard service uninstall [--purge]
spark-dashboard service status
-p, --port <PORT> Listen port [default: 3000] [env: SPARK_DASHBOARD_PORT]
-b, --bind <BIND> Bind address [default: 0.0.0.0] [env: SPARK_DASHBOARD_BIND]
--poll-interval <MS> Polling interval ms [default: 1000] [env: SPARK_DASHBOARD_POLL_INTERVAL]
--gpu-index <IDX> NVML GPU index to monitor [default: 0] [env: SPARK_DASHBOARD_GPU_INDEX]
--engine <TYPE> Manual engine type (e.g. vllm)
--engine-url <URL> Manual engine endpoint (requires --engine)
--engine-api-key <KEY> API key for an endpoint, paired by index with --engine-url
--provider-api-key <KEY> Fallback API key for any endpoint [env: SPARK_DASHBOARD_PROVIDER_API_KEY]
On multi-GPU hosts use --gpu-index to select which device the dashboard
monitors. Engines are auto-detected via process scan and Docker API. Use
--engine and --engine-url to override when auto-detection doesn't work.
For auth-gated deployments (e.g. vLLM started with --api-key), pass
--engine-api-key (index-paired with --engine-url) or set
SPARK_DASHBOARD_PROVIDER_API_KEY as a global fallback covering auto-detected
engines too. Model info is resolved from /v1/models once and cached —
re-resolved only on engine restart or every 10 minutes — so an auth-gated
engine is no longer hit on every poll tick.
- Local machine (macOS or Linux): Node.js 20+, npm, rsync, ssh
- Remote host: Linux + NVIDIA drivers, Rust 1.95+, SSH access with key-based auth (no password prompts)
- Optional:
brew install fswatchfor instant file-change detection (the watcher falls back to 2s polling without it)
./dev/dev.shThe script handles everything:
- Syncs the full project to the remote host via rsync
- Builds the Rust backend on the remote host (
cargo build --release) - Starts the backend on the remote host (port 3000)
- Starts the Vite dev server locally (port 5173)
- Watches
src/andCargo.tomlfor Rust changes — auto-syncs and rebuilds on the remote host
| What you edit | What happens |
|---|---|
Frontend files (frontend/src/) |
Vite hot-reloads instantly in the browser |
Backend files (src/, Cargo.toml) |
Auto-detected → rsync to remote host → rebuild → restart (~compile time) |
Useful while dev.sh is running:
# Watch backend logs in another terminal
ssh "${DEPLOY_USER}@${DEPLOY_HOST}" tail -f /tmp/spark-dashboard.log
# Press Ctrl+C in the dev.sh terminal to stop everything (cleans up the remote process too)By default, Vite proxies /ws and /api to localhost:3000 — this works out
of the box with any SSH tunnel that maps the remote host's port 3000 to your
local machine.
Browser → localhost:5173/ws → Vite proxy → localhost:3000/ws (forwarded to remote)
Browser → localhost:5173/api → Vite proxy → localhost:3000/api (forwarded to remote)
To connect directly over the network instead, set in .env:
VITE_BACKEND_URL=http://${DEPLOY_HOST}:3000The frontend connects to the WebSocket using window.location.host, so the
proxy is transparent — no code changes between dev and production.
Releases are cut from main via release-please —
conventional commits drive the version bump, merging the release PR tags
vX.Y.Z and triggers cargo publish to crates.io. main always reflects
the latest stable version; see CHANGELOG.md for release notes.
# Frontend
cd frontend && npm test
# Backend (on Linux)
cargo testBackend tests include platform-aware stubs — GPU and memory tests validate real NVML/procfs parsing on Linux, with compile-time stubs on other platforms.
├── src/
│ ├── main.rs CLI args, task spawning, server startup
│ ├── server.rs Axum router, static file serving
│ ├── ws.rs WebSocket handler
│ ├── metrics/
│ │ ├── mod.rs MetricsSnapshot, collector loop
│ │ ├── gpu.rs NVML GPU metrics + event detection
│ │ ├── cpu.rs CPU aggregate + per-core
│ │ ├── memory.rs System RAM + GPU VRAM + unified-memory detection
│ │ ├── disk.rs Disk I/O rates
│ │ └── network.rs Network I/O rates
│ └── engines/
│ ├── mod.rs Engine trait, state machine, collector
│ ├── detector.rs Process scan + Docker discovery
│ ├── vllm.rs vLLM adapter (Prometheus parsing)
│ └── prometheus.rs Prometheus text-format parser
├── frontend/
│ └── src/
│ ├── hooks/ useMetrics, useMetricsHistory
│ ├── components/
│ │ ├── views/ Dashboard, GlanceableView, DetailedView
│ │ ├── engines/ EngineSection, EngineCard
│ │ ├── charts/ TimeSeriesChart, Sparkline, CoreHeatmap
│ │ └── gauges/ ArcGauge
│ ├── types/ TypeScript type definitions
│ └── lib/ Circular buffer, formatting, theme
├── dev/
│ ├── dev.sh Dev loop (local frontend + remote backend)
│ └── README.md Operator docs
├── .env.example Configuration template
├── LICENSE MIT
├── CONTRIBUTING.md
└── Cargo.toml
See CONTRIBUTING.md.
MIT — see LICENSE.
