Ten ready-to-use Grafana dashboards for the Prometheus Slurm Exporter.
All dashboards are compatible with Grafana 12+ and use a $datasource
template variable for portability.
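To verify the variable is present before importing, each JSON file can be checked with jq (a sketch; assumes the files sit in dashboards_grafana/ as in the install commands below — in the dashboard JSON the variable is named "datasource", without the $):

```bash
for f in dashboards_grafana/*.json; do
  # -e makes jq exit non-zero when no matching variable is found
  jq -e '.templating.list[]? | select(.name == "datasource")' "$f" >/dev/null \
    || echo "missing \$datasource variable: $f"
done
```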
Option 1 — Grafana UI: go to Dashboards → Import, then upload a JSON file or paste its contents.
Option 2 — Provisioning (recommended for a permanent setup):

```bash
cp dashboards_grafana/*.json /etc/grafana/provisioning/dashboards/
# Reload Grafana or wait for the provisioning interval (default: 30s)
```
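Note that provisioned JSON is only loaded if a dashboard file provider points at that directory. A minimal provider sketch (provider name, folder, and path here are illustrative; adjust them to your installation):

```bash
# Hypothetical file provider; Grafana scans options.path for dashboard JSON
cat > /etc/grafana/provisioning/dashboards/slurm.yaml <<'EOF'
apiVersion: 1
providers:
  - name: slurm
    folder: Slurm
    type: file
    options:
      path: /etc/grafana/provisioning/dashboards
EOF
```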
Option 3 — API (batch import):

```bash
for f in dashboards_grafana/*.json; do
  curl -s -X POST http://admin:password@grafana-host:3000/api/dashboards/db \
    -H "Content-Type: application/json" \
    -d "{\"dashboard\": $(cat "$f"), \"overwrite\": true, \"folderId\": 0}"
done
```

File: 01-slurm-overview.json | UID: slurm-overview
Global cluster snapshot: node states, CPU/GPU utilization gauges, job totals, per-partition table (CPU allocation, running/pending jobs), and job state history.
| Panel | Type | Description |
|---|---|---|
| Total Nodes / Active / Running / Pending / CPU % / Down+Drain | stat | Quick health indicators |
| CPU Utilization | gauge | slurm_cpus_alloc / slurm_cpus_total * 100 |
| Node Utilization | gauge | (alloc + mix) / total * 100 |
| Jobs Over Time | timeseries | Running / Pending / Completing trend |
| Node State Breakdown | bargauge | Idle / Mixed / Alloc / Drain / Down / Maint |
| Node States Over Time | timeseries | Stacked node state history |
| CPUs Over Time | timeseries | Allocated / Idle / Total |
| Partitions | table | CPU Alloc / Total / % / Running Jobs / Pending Jobs per partition |
| Job States Over Time | timeseries | Running / Pending / Completing / Failed / Suspended |
| Cores Requested Over Time | timeseries | Cores running vs pending |
| Collector Status | stat | OK/FAIL per collector |
| Collector Duration | timeseries | Scrape duration history |
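The gauge expressions in this table are plain PromQL, so they can be spot-checked outside Grafana; a sketch against the Prometheus HTTP API (adjust the host to your setup):

```bash
# Same expression as the CPU Utilization gauge
curl -sG http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=slurm_cpus_alloc / slurm_cpus_total * 100'
```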
File: 02-slurm-jobs.json | UID: slurm-jobs
Detailed job queue monitoring: global totals, top users, pending reasons, per-partition breakdown, and job state timeline.
| Panel | Type | Description |
|---|---|---|
| Running / Pending / Cores / Failed / Timeout | stat | Cluster-wide instant counters |
| All Job States | timeseries | All slurm_jobs_* globals over time |
| Cores Running & Pending | timeseries | CPU core demand trend |
| Top Users — Running Jobs | bargauge | topk(15, slurm_user_jobs_running) |
| Top Users — Running CPUs | bargauge | topk(15, slurm_user_cpus_running) |
| Top Accounts — Running CPUs | bargauge | topk(15, slurm_account_cpus_running) |
| Pending Jobs by Reason | bargauge | sum by(reason) (slurm_queue_pending) |
| Queue State by Partition | timeseries | Running / Pending per partition |
| Running vs Pending Timeline | state-timeline | Visual job activity history |
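After an import, each dashboard can be fetched back by UID to confirm it landed; a sketch reusing the credentials from Option 3:

```bash
# Prints the dashboard title if the import succeeded
curl -s http://admin:password@grafana-host:3000/api/dashboards/uid/slurm-jobs \
  | jq -r .dashboard.title
```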
File: 03-slurm-nodes.json | UID: slurm-nodes
Per-node CPU and memory utilization. Designed to scale from small clusters
to 100k+ nodes via a $partition filter variable.
| Panel | Type | Description |
|---|---|---|
| Total / Idle / Mixed / Alloc / Down / Drain | stat | Cluster-wide node counts |
| Node States per Partition | barchart | Stacked distribution by partition |
| Partition Summary | table | Alloc / Idle / Mix / Down / Drain per partition |
| Down & Drain Nodes | table | Always scalable — only shows degraded nodes |
| Node CPU & Memory Detail | table | Per-node: CPU alloc/total/%, Mem alloc/total/% — filtered by $partition |
| CPU Allocated per Node (top 10) | timeseries | Most loaded nodes over time |
| Memory Allocated per Node (top 10) | timeseries | Top memory consumers |
Scale note: the per-node detail table is filtered by the $partition variable. On clusters with 100k+ nodes, always select a specific partition to limit results. The partition summary and "Down & Drain Nodes" panels are always O(partitions), regardless of node count.
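The $partition filter translates to a plain label matcher; a sketch of the kind of instant query the detail table issues (the metric name slurm_node_cpus_alloc and the partition value "gpu" are illustrative — check the exporter's /metrics output for the exact names):

```bash
# Per-node allocation, restricted to one partition (hypothetical metric name)
curl -sG http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=slurm_node_cpus_alloc{partition="gpu"}'
```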
File: 04-slurm-usage.json | UID: slurm-usage
Comprehensive utilization metrics: CPU/memory/GPU gauges, per-user and per-account breakdowns, fairshare, scheduler health, and trend timeseries.
| Panel | Type | Description |
|---|---|---|
| CPU / Node / Memory / GPU Utilization | gauge | Current utilization percentages |
| Avg CPU Util (period) | stat | avg_over_time(cpu_util[1h:5m]) |
| Max CPU Util (period) | stat | max_over_time(cpu_util[1h:5m]) |
| Total CPUs / Nodes / GPUs / Running Jobs | stat | Cluster capacity snapshot |
| CPU Utilization % Over Time | timeseries | Instantaneous + rolling average |
| Memory Utilization % Over Time | timeseries | sum(node_mem_alloc) / sum(node_mem_total) |
| GPU Utilization & Allocation % | timeseries | GPU util and alloc trend |
| Node States Over Time | timeseries | Idle / Mix / Alloc / Down / Drain |
| FairShare per Account | timeseries | Fairshare factor by account |
| Running / Pending Jobs per User & Account | timeseries | Per-user and per-account breakdown |
| CPUs Allocated per User & Account | timeseries | CPU demand by user/account |
| GPU Running per Account & User | timeseries | GPU job distribution |
| CPU Allocation (total) | timeseries | Alloc / Idle / Total CPUs |
| CPUs Allocated per Partition | timeseries | Per-partition CPU usage |
| GPU States Over Time | timeseries | Total / Alloc / Idle GPU trend |
| Scheduler Stats | stat + timeseries | Threads, queue size, cycle time, backfill |
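The Avg/Max CPU Util stats are PromQL subqueries over a computed ratio; the same value can be reproduced directly (a sketch, reusing the slurm_cpus_* metrics from the overview dashboard):

```bash
# 1h rolling average of CPU utilization %, sampled every 5m
curl -sG http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=avg_over_time((slurm_cpus_alloc / slurm_cpus_total * 100)[1h:5m])'
```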
File: 05-slurm-scheduler.json | UID: slurm-scheduler
Deep-dive into slurmctld internals: main scheduler and backfill cycle times,
RPC statistics, queue sizes, and thread counts.
| Panel | Type | Description |
|---|---|---|
| Scheduler Threads | stat | Active scheduler threads |
| Queue Size | stat | Jobs in scheduler queue |
| Mean Cycle / Last Cycle | stat | Scheduler cycle times (µs) |
| DBD Queue | gauge | Slurm accounting daemon queue depth |
| Cycles/min | stat | Scheduler frequency |
| Scheduler Cycles (µs) | timeseries | Last vs mean cycle time |
| Backfill Cycles (µs) | timeseries | Backfill last vs mean |
| RPC Stats — Count | bargauge | Top RPCs by count |
| RPC Stats — Time | bargauge | Top RPCs by total time |
| RPC Calls Over Time | timeseries | RPC call frequency |
| User RPC Stats | table | Per-user RPC counts |
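Cycle-time regressions make a good alerting target; a minimal Prometheus rule sketch (the metric name slurm_scheduler_mean_cycle and the 5 s threshold are assumptions; match them to your exporter and your cluster's baseline):

```bash
# Hypothetical alert: sustained slow scheduler cycles (threshold in µs)
cat > /etc/prometheus/rules/slurm-scheduler.yml <<'EOF'
groups:
  - name: slurm-scheduler
    rules:
      - alert: SlurmSchedulerSlowCycles
        expr: slurm_scheduler_mean_cycle > 5000000   # 5 s, expressed in µs
        for: 15m
        labels:
          severity: warning
EOF
```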
File: 06-slurm-reservations.json | UID: slurm-reservations
Active Slurm reservations, per-reservation node states, and license usage. License panels show "No data" when no licenses are configured — this is expected.
| Panel | Type | Description |
|---|---|---|
| Active Reservations | stat | Count of current reservations |
| Reservation Info | table | Name / State / Users / Nodes / Partition / Flags |
| Reservation Timeline | timeseries | Reservation start/end times plotted over time |
| Nodes per Reservation | bargauge | Node counts by reservation |
| Node States per Reservation | timeseries | Alloc / Idle / Mix / Down / Drain per reservation |
| Reservation Nodes Healthy % | gauge | (alloc+idle+mix+planned) / total_in_reservation |
| License Usage | timeseries | Total / Used / Free / Reserved per license |
| License Utilization % | bargauge | used / total * 100 per license |
Note: license panels (slurm_license_*) only show data when Slurm licenses are configured (scontrol show licenses).
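When licenses are configured, the utilization panel's expression can be checked from the CLI; a sketch (the _used/_total suffixes follow the slurm_license_* family named above, but verify the exact names against /metrics):

```bash
# Per-license utilization percentage
promtool query instant http://prometheus:9090 \
  'slurm_license_used / slurm_license_total * 100'
```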
File: 07-slurm-accounting.json | UID: slurm-accounting
Dedicated HPC accounting dashboard. Answers the key question:
"Why is this user's priority low?" — by exposing FairShare components
(NormUsage, NormShares) and their ratio directly in Grafana.
Requires --collector.fairshare.user-metrics=true (default) for user-level panels.
Filter by $account and $user template variables.
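The "Usage/Shares ratio" column is simply a division of the two exported FairShare components; a sketch of the underlying query (the metric names slurm_user_fairshare_norm_usage and slurm_user_fairshare_norm_shares are illustrative — take the real ones from the exporter):

```bash
# Ratio > 1 means the user is consuming more than their configured share
curl -sG http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=slurm_user_fairshare_norm_usage / slurm_user_fairshare_norm_shares'
```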
| Panel | Type | Description |
|---|---|---|
| Running / Pending / CPUs / Active Users & Accounts | stat | Current cluster snapshot |
| Top Users — Running Jobs / CPUs | bargauge | topk(15, slurm_user_jobs_running/cpus_running) |
| Top Accounts — Running CPUs | bargauge | topk(10, slurm_account_cpus_running) |
| User FairShare Summary | table | user · account · FairShare · NormShares · NormUsage · Usage/Shares ratio · CPU-seconds |
| Users by FairShare (ascending) | bargauge | Lowest priority users at the top |
| Account Summary | table | FairShare · NormShares · NormUsage · CPU-seconds per account |
| FairShare per Account Over Time | timeseries | Account priority trend |
| Running Jobs / CPUs per User | timeseries | User activity history |
| Running Jobs per Account | timeseries | Account activity history |
| User FairShare Over Time | timeseries | Tracks priority evolution — declining = overusing share |
File: 08-slurm-health.json | UID: slurm-health
Monitors the health of the exporter itself: collector success/failure, scrape duration history, and Slurm binary availability.
| Panel | Type | Description |
|---|---|---|
| Collector Status | stat | OK / FAIL per collector — background color alert |
| Collectors Healthy % | gauge | sum(success) / count(success) * 100 |
| Collector Health Timeline | state-timeline | Visual OK/FAIL history per collector |
| Scrape Duration Status History | status-history | Duration colored by threshold (<1s green, 1-5s yellow, >5s red) |
| Last Scrape Duration | bargauge | Current duration per collector, sorted slowest-first |
| Duration Over Time | timeseries | Collector duration trend for detecting degradation |
| Slurm Version | stat | slurm_info{type="general"} — active Slurm version |
| Slurm Binaries | table | Availability check of each Slurm binary (sacct, sbatch, sinfo, …) |
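The "Collectors Healthy %" gauge is a simple aggregation; a command-line equivalent (a sketch assuming a per-collector success metric named slurm_collector_success, where 1 = OK):

```bash
# Percentage of collectors currently reporting success
promtool query instant http://prometheus:9090 \
  'sum(slurm_collector_success) / count(slurm_collector_success) * 100'
```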
File: 09-slurm-exporter-perf.json | UID: slurm-exporter-perf
Internal performance dashboard for the exporter itself. Use this to validate that optimisations work and to detect slowdowns before they cause scrape failures.
| Panel | Type | Description |
|---|---|---|
| Collectors OK / FAIL / Total Errors | stat | Scrape health indicators |
| scontrol Cache Age | stat | Should oscillate 0-25s — flat at TTL = problem |
| sacct Last Refresh | stat | Age since last sacct_efficiency refresh |
| Avg Duration per Command | bargauge | Which commands are slowest? |
| Call Count per Command | bargauge | sinfo should be called ~once per scrape after the Axe 2 optimisation |
| Command Duration p99 + avg | timeseries | Latency spikes = slurmctld overload |
| Command Error Rate | timeseries | Any > 0 = CLI failures |
| Collector Scrape Duration | timeseries | Alert if approaching scrape_timeout |
| scontrol Cache Age Over Time | timeseries | Cache hit/miss pattern |
| sacct Refresh Age | timeseries | sacct_efficiency staleness |
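Any sustained command errors warrant investigation; a sketch of the error-rate check (the counter name slurm_command_errors_total is an assumption; substitute the exporter's real metric):

```bash
# Returns only commands whose error rate is non-zero over the last 5 minutes
promtool query instant http://prometheus:9090 \
  'rate(slurm_command_errors_total[5m]) > 0'
```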
Screenshots are generated using Playwright in a Docker container:

```bash
# Prerequisite: Grafana running on localhost:3000
./scripts/take_screenshots.sh /tmp/screenshots
```

Or manually:

```bash
GSESSION=$(curl -s -D - -X POST http://localhost:3000/login \
  -H "Content-Type: application/json" \
  -d '{"user":"admin","password":"admin"}' \
  | grep 'grafana_session=' | grep -v expiry \
  | sed 's/.*grafana_session=\([^;]*\).*/\1/' | tr -d '\r\n')

docker run --rm --network slurm_slurm-network \
  -e "GSESSION=$GSESSION" -e "LANG=en_US.UTF-8" \
  -v /tmp/screenshots:/screenshots -w /work \
  mcr.microsoft.com/playwright:v1.51.1-noble \
  bash -c 'npm i -y playwright@1.51.1 > /dev/null && node script.js'
```

See scripts/take_screenshots.sh for the full script.
File: 10-slurm-all-metrics.json | UID: slurm-all-metrics
An exhaustive reference dashboard showing every metric exported by the Slurm Exporter, organized by collector. Useful for:
- Discovering available metrics
- Debugging queries
- Validating that all collectors are working
115 panels covering all 14 collectors (+ new v1.8 metrics):
accounts, cpus, fairshare, gpus, info, licenses, node, nodes,
partitions, queue, reservation_nodes, reservations, scheduler, users
This dashboard is intentionally dense. Use the other focused dashboards for daily monitoring. This one is a reference/debug tool.
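To cross-check this reference dashboard against what the exporter actually exposes, the full list of slurm_* series names can be pulled from Prometheus (a sketch; requires jq):

```bash
# Every metric name currently known to Prometheus that starts with slurm_
curl -s http://prometheus:9090/api/v1/label/__name__/values \
  | jq -r '.data[] | select(startswith("slurm_"))'
```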