# Slurm Exporter

Prometheus collector and exporter for metrics extracted from the Slurm resource scheduling system.
> **Note**
> Looking for a next-generation Slurm exporter with native OpenMetrics support (Slurm 25.11+)? Check out my new project: sckyzo/slurm_prometheus_exporter
>
> ✨ Features: Native OpenMetrics · Multiple endpoints · Basic Auth & TLS · Global labels · YAML config · Clean Architecture
## Table of Contents

- ✨ Features
- 📦 Installation
- ⚙️ Configuration (flags, collectors, Prometheus)
- 📊 Metrics Reference (all 14 collectors)
- 🛠️ Development (build, test, lint)
- 📈 Grafana Dashboards
- 📸 Screenshots
- 📜 License
## ✨ Features

- ✅ Exports a wide range of metrics from Slurm, including nodes, partitions, jobs, CPUs, and GPUs.
- ✅ All metric collectors are optional and can be enabled/disabled via flags.
- ✅ Supports TLS and Basic Authentication for secure connections.
- ✅ OpenMetrics format supported (exemplars, newer Prometheus features).
- ✅ Per-collector health metrics (`slurm_exporter_collector_success`, `slurm_exporter_collector_duration_seconds`).
- ✅ Liveness probe at `/healthz` for orchestrators (Kubernetes, systemd).
- ✅ GPU metrics per account and user (`slurm_account_gpus_running`, `slurm_user_gpus_running`).
- ✅ Per-reservation node state metrics (`slurm_reservation_nodes_*`).
- ✅ Ready-to-use Grafana dashboard.
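Once the exporter is running, Prometheus can scrape it with a standard scrape job. A minimal sketch (the target `slurm-node:8080`, the job name, and the interval are assumptions; match them to the address and port your exporter actually listens on):

```yaml
scrape_configs:
  - job_name: "slurm"
    scrape_interval: 30s              # Slurm state changes slowly; a longer interval limits load on slurmctld
    static_configs:
      - targets: ["slurm-node:8080"]  # host:port where slurm_exporter listens (assumption)
```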
## 📦 Installation

There are two recommended ways to install the Slurm Exporter.
### Option 1: Pre-built binary (recommended)

This is the easiest method for most users.
1. Download the latest release for your OS and architecture from the GitHub Releases page. 📥

2. Place the `slurm_exporter` binary in a suitable location on a node with Slurm CLI access, such as `/usr/local/bin/`.

3. Ensure the binary is executable:

   ```shell
   chmod +x /usr/local/bin/slurm_exporter
   ```

4. (Optional) To run the exporter as a service, adapt the example systemd unit file provided in this repository at `systemd/slurm_exporter.service`:
   - Copy it to `/etc/systemd/system/slurm_exporter.service` and customize it for your environment (especially the `ExecStart` path).
   - Reload the systemd daemon, then enable and start the service:

     ```shell
     sudo systemctl daemon-reload
     sudo systemctl enable slurm_exporter
     sudo systemctl start slurm_exporter
     ```
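Once the service is up, you can spot-check it with the `/healthz` probe and the per-collector health metrics mentioned in the Features section. A quick sketch (`localhost:8080` is an assumption; use the address and port your exporter actually listens on):

```shell
# Liveness probe: exits non-zero if the exporter is not responding
curl -fsS http://localhost:8080/healthz

# Per-collector health: each enabled collector should report success = 1
curl -fsS http://localhost:8080/metrics | grep '^slurm_exporter_collector_success'
```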
### Option 2: Build from source

If you want to build the exporter yourself, you can do so using the provided Makefile. 👩‍💻

1. Clone the repository:

   ```shell
   git clone https://github.com/sckyzo/slurm_exporter.git
   cd slurm_exporter
   ```

2. Build the binary:

   ```shell
   make build
   ```

3. The new binary will be available at `bin/slurm_exporter`. You can then copy it to a location like `/usr/local/bin/` and set up the systemd service as described in the section above.
## 📈 Grafana Dashboards

Ten ready-to-use Grafana dashboards are provided in the `dashboards_grafana/` directory. All dashboards use a `$datasource` template variable and are compatible with Grafana 12+.
| # | Dashboard | UID | Description |
|---|---|---|---|
| 01 | Cluster Overview | `slurm-overview` | Global cluster health: CPU/GPU utilization, node states, job totals, partition summary |
| 02 | Jobs & Queue | `slurm-jobs` | Job queue details by user, account, and partition: pending reasons, top users |
| 03 | Node Detail | `slurm-nodes` | Per-node CPU & memory table (filtered by partition), scalable to 100k+ nodes |
| 04 | Cluster Usage Statistics | `slurm-usage` | CPU/GPU utilization gauges, fairshare per account, top users by CPU |
| 05 | Scheduler | `slurm-scheduler` | slurmctld internals: cycle time, backfill, RPC statistics |
| 06 | Reservations & Licenses | `slurm-reservations` | Active reservations, node states per reservation, license usage |
| 07 | Accounting | `slurm-accounting` | User/account consumption, FairShare analysis, top consumers, priority diagnostics |
| 08 | Exporter Health | `slurm-health` | Collector OK/FAIL status, scrape duration history, Slurm binary versions |
| 09 | Exporter Performance | `slurm-exporter-perf` | Command durations, cache freshness, error rates, scrape health (new in v1.8.0) |
| 10 | All Metrics Reference | `slurm-all-metrics` | Exhaustive reference panel for every exported metric |
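The Exporter Health dashboard surfaces the per-collector health metrics listed in the Features section; the same metric can drive an alert directly. A sketch of a Prometheus alerting rule (the rule name, group name, and `for` duration are illustrative assumptions):

```yaml
groups:
  - name: slurm_exporter
    rules:
      - alert: SlurmCollectorFailing
        expr: slurm_exporter_collector_success == 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "A slurm_exporter collector has been failing for 10 minutes"
```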
**Option 1:** copy the JSON files to your Grafana provisioning directory:

```shell
cp dashboards_grafana/*.json /etc/grafana/provisioning/dashboards/
```

**Option 2:** import via the Grafana API:

```shell
for f in dashboards_grafana/*.json; do
  curl -s -X POST http://admin:password@grafana-host:3000/api/dashboards/db \
    -H "Content-Type: application/json" \
    -d "{\"dashboard\": $(cat "$f"), \"overwrite\": true, \"folderId\": 0}"
done
```

> **Scale note (Node Detail dashboard):** The per-node table is filtered by the `$partition` variable. On clusters with 100k+ nodes, always select a specific partition to avoid loading excessive data. The partition summary and problem-nodes panels remain scalable regardless of cluster size.
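Note for Option 1: Grafana only loads JSON dashboards from disk when a provisioning *provider* file points at the directory containing them. A minimal provider sketch (the file location, provider name, and folder are assumptions; adjust to your Grafana setup):

```yaml
# e.g. /etc/grafana/provisioning/dashboards/slurm.yaml (assumed path)
apiVersion: 1
providers:
  - name: slurm-dashboards
    folder: Slurm            # Grafana folder the dashboards appear under
    type: file
    options:
      path: /etc/grafana/provisioning/dashboards   # directory holding the copied JSON files
```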
## 📸 Screenshots

Screenshots were taken on a 20-node test cluster (alice/bob/carol/dave/eve/frank, multiple accounts and partitions). Click any thumbnail to open the full-size image. See `dashboards_grafana/README.md` for the full dashboard documentation.
## 📜 License

This project is licensed under the GNU General Public License, version 3 or later.
This project is a fork of cea-hpc/slurm_exporter, which itself is a fork of vpenso/prometheus-slurm-exporter (now apparently unmaintained).
Feel free to contribute or open issues!
