Merged
23 changes: 13 additions & 10 deletions .env.example
@@ -1,27 +1,30 @@
# Copy this file to `.env` and edit with your own values.
# The dev/ scripts source this file; Vite auto-loads VITE_* variables.

# --- Connection to the DGX Spark running the backend --------------------------
# SSH user on the Spark. Must have key-based auth configured (no passwords).
SPARK_USER=your-user
# --- Connection to the remote Linux host running the backend -----------------
# SSH user on the remote host. Must have key-based auth configured (no passwords).
DEPLOY_USER=your-user

# Hostname or IP address of the Spark.
SPARK_HOST=192.168.1.100
# Hostname or IP address of the remote host.
DEPLOY_HOST=192.168.1.100

# Path on the Spark where the project is synced.
# Path on the remote host where the project is synced.
# Relative paths (no leading slash) are resolved against the remote user's
# home directory. Use an absolute path like /opt/spark-dashboard if you want
# to sync somewhere else.
# Do NOT use a leading `~/` — that would be expanded to your *local* home.
SPARK_DIR=spark-dashboard
DEPLOY_DIR=spark-dashboard

# Legacy aliases: SPARK_USER / SPARK_HOST / SPARK_DIR still work as a fallback
# if DEPLOY_* are unset. You'll see a one-line deprecation note at dev.sh startup.

# --- Frontend (Vite dev server) ----------------------------------------------
# Where the Vite dev server proxies `/ws` and `/api` calls.
#
# Default `http://localhost:3000` works when you use an SSH tunnel that maps
# the Spark's port 3000 to your local machine (e.g. NVIDIA Sync port
# forwarding or `ssh -L 3000:localhost:3000 ...`).
# the remote host's port 3000 to your local machine
# (e.g. `ssh -L 3000:localhost:3000 ...`).
#
# To connect directly over the network, set this to:
# VITE_BACKEND_URL=http://${SPARK_HOST}:3000
# VITE_BACKEND_URL=http://${DEPLOY_HOST}:3000
VITE_BACKEND_URL=http://localhost:3000
6 changes: 3 additions & 3 deletions CONTRIBUTING.md
@@ -8,7 +8,7 @@ bug reports, clear reproductions, and targeted PRs are all welcome.
```bash
git clone https://github.com/niklasfrick/spark-dashboard.git
cd spark-dashboard
cp .env.example .env # edit with your Spark's user/host
cp .env.example .env # edit with your remote host's user/host
./dev/dev.sh
```

@@ -21,7 +21,7 @@ environment variables are required.
# Frontend (runs on any OS)
cd frontend && npm test

# Backend (must run on Linux / the DGX Spark — depends on NVML, procfs)
# Backend (must run on Linux with NVIDIA drivers — depends on NVML, procfs)
cargo test
```

@@ -75,6 +75,6 @@ Do **not** hand-edit version numbers — `release-please` owns them.
When filing a bug, please include:

- What you expected vs. what happened
- DGX Spark OS / driver / CUDA versions (`nvidia-smi`)
- Host OS / NVIDIA driver / CUDA versions (`nvidia-smi`)
- Which engine adapter was involved (vLLM, etc.), if any
- A snippet from `/tmp/spark-dashboard.log` around the failure
2 changes: 1 addition & 1 deletion Cargo.toml
@@ -2,7 +2,7 @@
name = "spark-dashboard"
version = "0.2.0"
edition = "2021"
description = "Real-time hardware and LLM inference monitoring for the NVIDIA DGX Spark"
description = "Real-time hardware and LLM inference monitoring for Linux hosts with NVIDIA GPUs"
license = "MIT"
repository = "https://github.com/niklasfrick/spark-dashboard"
readme = "README.md"
78 changes: 43 additions & 35 deletions README.md
@@ -1,7 +1,9 @@
# Spark Dashboard

Real-time hardware and LLM inference monitoring for the NVIDIA DGX Spark. A
Rust backend collects GPU, CPU, memory, disk, and network metrics alongside
Real-time hardware and LLM inference monitoring for Linux systems with NVIDIA
GPUs. Developed and tested on the NVIDIA DGX Spark, but works on any Linux
host with NVIDIA drivers — discrete-GPU workstations, DGX boxes, cloud VMs.
A Rust backend collects GPU, CPU, memory, disk, and network metrics alongside
vLLM engine statistics and streams them over WebSocket to a React frontend.

![Stack](https://img.shields.io/badge/Rust-Axum-orange) ![Stack](https://img.shields.io/badge/React_19-TypeScript-blue) ![Stack](https://img.shields.io/badge/Tailwind_CSS_4-06B6D4) ![Stack](https://img.shields.io/badge/Vite_8-646CFF) ![License](https://img.shields.io/badge/license-MIT-green)
@@ -10,25 +12,25 @@ vLLM engine statistics and streams them over WebSocket to a React frontend.

## Quick Start

### Install on the Spark
### Install on your Linux host

Run as your normal user on the DGX Spark (requires Rust 1.75+):
Run as your normal user on any Linux host with NVIDIA drivers (requires Rust 1.75+):

```bash
cargo install spark-dashboard
sudo spark-dashboard service install
systemctl status spark-dashboard
```

The dashboard is now served on port 3000. See [Install on the DGX Spark](#install-on-the-dgx-spark)
The dashboard is now served on port 3000. See [Install on your Linux host](#install-on-your-linux-host-1)
for the full guide, config overrides, and uninstall.

### Develop locally

```bash
git clone https://github.com/niklasfrick/spark-dashboard.git
cd spark-dashboard
cp .env.example .env # edit with your Spark's user/host
cp .env.example .env # edit with your remote host's user/host
./dev/dev.sh
```

@@ -41,7 +43,9 @@ for details on what each script does.
- GPU utilization, temperature, power draw, clock frequencies, fan speed
- GPU event detection — thermal throttling, hardware slowdown, power brake
- CPU aggregate and per-core utilization with heatmap
- Unified memory breakdown (GPU / CPU / cached / free)
- Memory breakdown — CPU RAM and GPU VRAM separately on discrete-GPU hosts,
or a single unified pool on systems where CPU and GPU share memory
(e.g. DGX Spark GB10, GH200)
- Disk and network I/O throughput

**LLM Engine Monitoring** (vLLM via Prometheus metrics)
@@ -71,7 +75,7 @@ for details on what each script does.
│ │ Static files (rust-embed) │ Recharts, Tailwind│
│ Axum router │ ◀──── production only ───────── │ shadcn/ui │
└──────────────────────┘ └────────────────────┘
DGX Spark Browser
Linux host (e.g. DGX Spark) Browser
```

Two independent Tokio tasks run in parallel — one for hardware metrics (NVML,
@@ -92,23 +96,25 @@ cp .env.example .env

| Variable | Purpose |
|--------------------|--------------------------------------------------------------|
| `SPARK_USER` | SSH user on the Spark (required) |
| `SPARK_HOST` | Hostname or IP of the Spark (required) |
| `SPARK_DIR` | Project path on the Spark, relative to remote home (default `spark-dashboard`) |
| `DEPLOY_USER` | SSH user on the remote host (required) |
| `DEPLOY_HOST` | Hostname or IP of the remote host (required) |
| `DEPLOY_DIR` | Project path on the remote host, relative to remote home (default `spark-dashboard`) |
| `VITE_BACKEND_URL` | Where Vite proxies `/ws` and `/api` (default `http://localhost:3000`) |

The scripts in `dev/` source this file; Vite picks up `VITE_*` variables
Legacy `SPARK_USER` / `SPARK_HOST` / `SPARK_DIR` are still accepted as a
fallback when `DEPLOY_*` are unset — `dev.sh` prints a one-line deprecation
note. The scripts in `dev/` source this file; Vite picks up `VITE_*` variables
automatically. `.env` is gitignored — never commit it.

## Install on the DGX Spark
## Install on your Linux host

The dashboard runs as a supervised `systemd` service on the Spark. Two install
paths; both build from source on the Spark.
The dashboard runs as a supervised `systemd` service. Two install paths; both
build from source on the host.

### Option A — via cargo (recommended)

```bash
# On the Spark. Requires Rust 1.75+ and internet access.
# On the host. Requires Rust 1.75+, NVIDIA drivers, and internet access.
cargo install spark-dashboard
sudo spark-dashboard service install
systemctl status spark-dashboard
@@ -126,7 +132,7 @@ Use this when you want to install without crates.io (audit the source,
air-gapped install, or deploy an unreleased commit).

```bash
# On the Spark. Run as your normal user — the script escalates to sudo
# On the host. Run as your normal user — the script escalates to sudo
# only for the systemd wiring step.
git clone https://github.com/niklasfrick/spark-dashboard.git
cd spark-dashboard
@@ -148,7 +154,8 @@ sudo spark-dashboard service status # same as `systemctl status`

Optional overrides live in `/etc/spark-dashboard/config.env` — set
`SPARK_DASHBOARD_PORT`, `SPARK_DASHBOARD_BIND`, `SPARK_DASHBOARD_POLL_INTERVAL`,
or `RUST_LOG`, then `sudo systemctl restart spark-dashboard`.
`SPARK_DASHBOARD_GPU_INDEX`, or `RUST_LOG`, then
`sudo systemctl restart spark-dashboard`.
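As an illustration of the overrides above, a `/etc/spark-dashboard/config.env` might look like the following. This is a sketch only: the variable names come from the text, but every value is an assumption chosen for the example, and the plain `KEY=value` layout is assumed from the file's name rather than confirmed by the source.

```bash
# /etc/spark-dashboard/config.env (illustrative values, not project defaults)
SPARK_DASHBOARD_PORT=8080            # listen port (default: 3000)
SPARK_DASHBOARD_BIND=127.0.0.1       # bind address (default: 0.0.0.0)
SPARK_DASHBOARD_POLL_INTERVAL=2000   # polling interval in ms (default: 1000)
SPARK_DASHBOARD_GPU_INDEX=1          # NVML GPU index to monitor (default: 0)
RUST_LOG=info                        # log filter; value assumed for illustration
```

After editing the file, `sudo systemctl restart spark-dashboard` applies the new values.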

### Upgrade

@@ -181,19 +188,21 @@ spark-dashboard service status
-p, --port <PORT> Listen port [default: 3000] [env: SPARK_DASHBOARD_PORT]
-b, --bind <BIND> Bind address [default: 0.0.0.0] [env: SPARK_DASHBOARD_BIND]
--poll-interval <MS> Polling interval ms [default: 1000] [env: SPARK_DASHBOARD_POLL_INTERVAL]
--gpu-index <IDX> NVML GPU index to monitor [default: 0] [env: SPARK_DASHBOARD_GPU_INDEX]
--engine <TYPE> Manual engine type (e.g. vllm)
--engine-url <URL> Manual engine endpoint (requires --engine)
```

Engines are auto-detected via process scan and Docker API. Use `--engine` and
`--engine-url` to override when auto-detection doesn't work.
On multi-GPU hosts use `--gpu-index` to select which device the dashboard
monitors. Engines are auto-detected via process scan and Docker API. Use
`--engine` and `--engine-url` to override when auto-detection doesn't work.

## Development

### Prerequisites

- **Local machine** (macOS or Linux): Node.js 20+, npm, rsync, ssh
- **DGX Spark**: Rust 1.75+, SSH access with key-based auth (no password prompts)
- **Remote host**: Linux + NVIDIA drivers, Rust 1.75+, SSH access with key-based auth (no password prompts)
- Optional: `brew install fswatch` for instant file-change detection (the
watcher falls back to 2s polling without it)

@@ -205,41 +214,41 @@ Engines are auto-detected via process scan and Docker API. Use `--engine` and

The script handles everything:

1. **Syncs** the full project to the Spark via rsync
2. **Builds** the Rust backend on the Spark (`cargo build --release`)
3. **Starts** the backend on the Spark (port 3000)
1. **Syncs** the full project to the remote host via rsync
2. **Builds** the Rust backend on the remote host (`cargo build --release`)
3. **Starts** the backend on the remote host (port 3000)
4. **Starts** the Vite dev server locally (port 5173)
5. **Watches** `src/` and `Cargo.toml` for Rust changes — auto-syncs and rebuilds on the Spark
5. **Watches** `src/` and `Cargo.toml` for Rust changes — auto-syncs and rebuilds on the remote host

| What you edit | What happens |
|------------------------------------|-------------------------------------------------------------------|
| Frontend files (`frontend/src/`) | Vite hot-reloads instantly in the browser |
| Backend files (`src/`, `Cargo.toml`) | Auto-detected → rsync to Spark → rebuild → restart (~compile time) |
| Backend files (`src/`, `Cargo.toml`) | Auto-detected → rsync to remote host → rebuild → restart (~compile time) |

Useful while `dev.sh` is running:

```bash
# Watch backend logs in another terminal
ssh "${SPARK_USER}@${SPARK_HOST}" tail -f /tmp/spark-dashboard.log
ssh "${DEPLOY_USER}@${DEPLOY_HOST}" tail -f /tmp/spark-dashboard.log

# Press Ctrl+C in the dev.sh terminal to stop everything (cleans up the remote process too)
```

### How the proxy works

By default, Vite proxies `/ws` and `/api` to `localhost:3000` — this works out
of the box with NVIDIA Sync port forwarding (or any SSH tunnel that maps
the Spark's port 3000 to your local machine).
of the box with any SSH tunnel that maps the remote host's port 3000 to your
local machine.

```
Browser → localhost:5173/ws → Vite proxy → localhost:3000/ws (forwarded to Spark)
Browser → localhost:5173/api → Vite proxy → localhost:3000/api (forwarded to Spark)
Browser → localhost:5173/ws → Vite proxy → localhost:3000/ws (forwarded to remote)
Browser → localhost:5173/api → Vite proxy → localhost:3000/api (forwarded to remote)
```

To connect directly over the network instead, set in `.env`:

```bash
VITE_BACKEND_URL=http://${SPARK_HOST}:3000
VITE_BACKEND_URL=http://${DEPLOY_HOST}:3000
```

The frontend connects to the WebSocket using `window.location.host`, so the
@@ -258,7 +267,7 @@ the latest stable version; see [CHANGELOG.md](./CHANGELOG.md) for release notes.
# Frontend
cd frontend && npm test

# Backend (on Linux / DGX Spark)
# Backend (on Linux)
cargo test
```

@@ -276,7 +285,7 @@ real NVML/procfs parsing on Linux, with compile-time stubs on other platforms.
│ │ ├── mod.rs MetricsSnapshot, collector loop
│ │ ├── gpu.rs NVML GPU metrics + event detection
│ │ ├── cpu.rs CPU aggregate + per-core
│ │ ├── memory.rs Unified memory via /proc/meminfo
│ │ ├── memory.rs System RAM + GPU VRAM + unified-memory detection
│ │ ├── disk.rs Disk I/O rates
│ │ └── network.rs Network I/O rates
│ └── engines/
@@ -296,7 +305,6 @@ real NVML/procfs parsing on Linux, with compile-time stubs on other platforms.
│ └── lib/ Circular buffer, formatting, theme
├── dev/
│ ├── dev.sh Dev loop (local frontend + remote backend)
│ ├── deploy.sh Production deploy to Spark
│ └── README.md Operator docs
├── .env.example Configuration template
├── LICENSE MIT
24 changes: 14 additions & 10 deletions dev/README.md
@@ -4,39 +4,43 @@ Development-only scripts for spark-dashboard. Configuration is read from a
repo-root `.env` file — copy `.env.example` to `.env` and edit before running.

For **production installs**, use `cargo install spark-dashboard` or
`packaging/install.sh`. See the repo [README](../README.md#install-on-the-dgx-spark).
`packaging/install.sh`. See the repo [README](../README.md#install-on-your-linux-host-1).

## Scripts

### `./dev/dev.sh` — development loop

Runs the full dev environment:

1. rsyncs the project to `${SPARK_USER}@${SPARK_HOST}:${SPARK_DIR}`
2. builds and starts the Rust backend on the Spark (`cargo build --release`)
1. rsyncs the project to `${DEPLOY_USER}@${DEPLOY_HOST}:${DEPLOY_DIR}`
2. builds and starts the Rust backend on the remote host (`cargo build --release`)
3. starts the Vite dev server locally on port 5173 with a proxy to the backend
4. streams remote backend logs from `/tmp/spark-dashboard.log`
5. watches `src/` and `Cargo.toml` — on change, re-syncs and rebuilds the backend

Frontend edits hot-reload in the browser via Vite. Backend edits trigger a
remote rebuild (takes about as long as `cargo build --release` does on your
Spark).
remote host).

## Required environment variables

| Variable | Purpose |
|--------------------|--------------------------------------------------------------|
| `SPARK_USER` | SSH user on the Spark (required) |
| `SPARK_HOST` | Hostname or IP of the Spark (required) |
| `SPARK_DIR` | Project path on the Spark, relative to remote home (default `spark-dashboard`) |
| `DEPLOY_USER` | SSH user on the remote host (required) |
| `DEPLOY_HOST` | Hostname or IP of the remote host (required) |
| `DEPLOY_DIR` | Project path on the remote host, relative to remote home (default `spark-dashboard`) |
| `VITE_BACKEND_URL` | Where Vite proxies `/ws` and `/api` (default `http://localhost:3000`) |

Missing `SPARK_USER` or `SPARK_HOST` causes the script to exit immediately with
a clear message.
Missing `DEPLOY_USER` or `DEPLOY_HOST` causes the script to exit immediately
with a clear message.

Legacy `SPARK_USER` / `SPARK_HOST` / `SPARK_DIR` are still accepted as a
fallback when `DEPLOY_*` are unset; you'll see a one-line deprecation note on
startup.
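The fallback and the required-variable check described above could be sketched roughly as follows. This is not copied from `dev.sh`: the function name `resolve_deploy_vars` and the message wording are hypothetical, and only the variable names and the `spark-dashboard` default come from the text.

```bash
# Hypothetical sketch of the DEPLOY_* / SPARK_* handling; not the real dev.sh.
resolve_deploy_vars() {
  # Print one deprecation note if any value has to come from a legacy SPARK_* name.
  if { [ -z "${DEPLOY_USER:-}" ] && [ -n "${SPARK_USER:-}" ]; } ||
     { [ -z "${DEPLOY_HOST:-}" ] && [ -n "${SPARK_HOST:-}" ]; } ||
     { [ -z "${DEPLOY_DIR:-}" ] && [ -n "${SPARK_DIR:-}" ]; }; then
    echo "note: SPARK_* variables are deprecated; use DEPLOY_* instead" >&2
  fi

  # Prefer DEPLOY_*, then SPARK_*, then the documented default directory.
  DEPLOY_USER="${DEPLOY_USER:-${SPARK_USER:-}}"
  DEPLOY_HOST="${DEPLOY_HOST:-${SPARK_HOST:-}}"
  DEPLOY_DIR="${DEPLOY_DIR:-${SPARK_DIR:-spark-dashboard}}"

  # User and host are required; fail early with a readable message.
  if [ -z "$DEPLOY_USER" ] || [ -z "$DEPLOY_HOST" ]; then
    echo "error: set DEPLOY_USER and DEPLOY_HOST in .env" >&2
    return 1
  fi
}
```

A script would call `resolve_deploy_vars || exit 1` right after sourcing `.env`, so every later SSH and rsync call can rely on `DEPLOY_*` alone.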

## Prerequisites

- **Local machine**: Node.js 20+, npm, rsync, ssh
- **DGX Spark**: Rust 1.75+ in `~/.cargo/env`, reachable over SSH
- **Remote host**: Linux with NVIDIA drivers, Rust 1.75+ in `~/.cargo/env`, reachable over SSH
- **SSH key auth** configured — the scripts make many non-interactive SSH calls and will block on password prompts
- Optional: `brew install fswatch` for instant change detection (otherwise the watcher polls every 2s)
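
The `fswatch`-or-polling behaviour mentioned in the last item could look something like the sketch below. It is an assumption about the general technique, not the watcher in `dev.sh`: `fingerprint`, `on_change`, and `watch_loop` are hypothetical names, and the real script may detect changes differently.

```bash
# Hypothetical sketch of a watcher with a 2s polling fallback; not the real dev.sh.

# Checksum every file under the given paths and hash the sorted list,
# so any content change yields a different fingerprint.
fingerprint() {
  find "$@" -type f -exec cksum {} + 2>/dev/null | sort | cksum
}

# Placeholder for the real work (re-sync and rebuild on the remote host).
on_change() {
  echo "change detected"
}

watch_loop() {
  if command -v fswatch >/dev/null 2>&1; then
    # fswatch -o prints one line per batch of change events.
    fswatch -o "$@" | while read -r _event; do on_change; done
  else
    last=$(fingerprint "$@")
    while sleep 2; do
      now=$(fingerprint "$@")
      if [ "$now" != "$last" ]; then
        last=$now
        on_change
      fi
    done
  fi
}
```

Calling `watch_loop src Cargo.toml` would then block and invoke `on_change` whenever a watched file changes.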