Grafana Dashboards

Ten ready-to-use Grafana dashboards for the Prometheus Slurm Exporter. All dashboards are compatible with Grafana 12+ and use a $datasource template variable for portability.

Import

Option 1 — Grafana UI: Go to Dashboards → Import, then paste the dashboard JSON or upload the file directly.

Option 2 — Provisioning (recommended for permanent setup):

cp dashboards_grafana/*.json /etc/grafana/provisioning/dashboards/
# Reload Grafana or wait for the provisioning interval (default: 30s)
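
Note: Grafana only loads dashboard JSON from a directory that a dashboard provider points at. If no provider exists yet, a minimal one can be created alongside the JSON files; the provider name and refresh interval below are just examples:

# Minimal dashboard provider so Grafana picks up the copied JSON files
sudo tee /etc/grafana/provisioning/dashboards/slurm-provider.yaml > /dev/null <<'EOF'
apiVersion: 1
providers:
  - name: slurm-dashboards        # example name
    type: file
    updateIntervalSeconds: 30
    options:
      path: /etc/grafana/provisioning/dashboards
EOF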

Option 3 — API (batch import):

for f in dashboards_grafana/*.json; do
  curl -s -X POST http://admin:password@grafana-host:3000/api/dashboards/db \
    -H "Content-Type: application/json" \
    -d "{\"dashboard\": $(cat "$f"), \"overwrite\": true, \"folderId\": 0}"
done
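
To confirm the batch import worked, the Grafana search API can be queried for the imported dashboards (same host and credentials as the loop above):

# List all dashboards and check that the slurm-* UIDs and titles appear
curl -s "http://admin:password@grafana-host:3000/api/search?type=dash-db" \
  | python3 -m json.tool | grep -E '"uid"|"title"'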

Dashboards

1. Cluster Overview

File: 01-slurm-overview.json | UID: slurm-overview

Global cluster snapshot: node states, CPU/GPU utilization gauges, job totals, per-partition table (CPU allocation, running/pending jobs), and job state history.

| Panel | Type | Description |
| --- | --- | --- |
| Total Nodes / Active / Running / Pending / CPU % / Down+Drain | stat | Quick health indicators |
| CPU Utilization | gauge | slurm_cpus_alloc / slurm_cpus_total * 100 |
| Node Utilization | gauge | (alloc + mix) / total * 100 |
| Jobs Over Time | timeseries | Running / Pending / Completing trend |
| Node State Breakdown | bargauge | Idle / Mixed / Alloc / Drain / Down / Maint |
| Node States Over Time | timeseries | Stacked node state history |
| CPUs Over Time | timeseries | Allocated / Idle / Total |
| Partitions | table | CPU Alloc / Total / % / Running Jobs / Pending Jobs per partition |
| Job States Over Time | timeseries | Running / Pending / Completing / Failed / Suspended |
| Cores Requested Over Time | timeseries | Cores running vs pending |
| Collector Status | stat | OK/FAIL per collector |
| Collector Duration | timeseries | Scrape duration history |
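
The gauge expressions above can be checked outside Grafana by running them against the Prometheus HTTP API; the Prometheus URL below is an example:

# Evaluate the CPU Utilization gauge expression directly against Prometheus
curl -s -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=slurm_cpus_alloc / slurm_cpus_total * 100'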

Screenshots: Cluster Overview — top · node states · jobs & partitions


2. Jobs & Queue

File: 02-slurm-jobs.json | UID: slurm-jobs

Detailed job queue monitoring: global totals, top users, pending reasons, per-partition breakdown, and job state timeline.

| Panel | Type | Description |
| --- | --- | --- |
| Running / Pending / Cores / Failed / Timeout | stat | Cluster-wide instant counters |
| All Job States | timeseries | All slurm_jobs_* globals over time |
| Cores Running & Pending | timeseries | CPU core demand trend |
| Top Users — Running Jobs | bargauge | topk(15, slurm_user_jobs_running) |
| Top Users — Running CPUs | bargauge | topk(15, slurm_user_cpus_running) |
| Top Accounts — Running CPUs | bargauge | topk(15, slurm_account_cpus_running) |
| Pending Jobs by Reason | bargauge | sum by(reason) (slurm_queue_pending) |
| Queue State by Partition | timeseries | Running / Pending per partition |
| Running vs Pending Timeline | state-timeline | Visual job activity history |
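
Panels such as "Pending Jobs by Reason" can be sanity-checked against Slurm itself; for example, the following squeue command produces comparable per-reason pending counts (assuming shell access to a node with Slurm client tools):

# Count pending jobs per reason, comparable to sum by(reason) (slurm_queue_pending)
squeue --state=PENDING --noheader -o "%r" | sort | uniq -c | sort -rn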

Screenshots: Jobs & Queue — top · by user · timeline


3. Node Detail

File: 03-slurm-nodes.json | UID: slurm-nodes

Per-node CPU and memory utilization. Designed to scale from small clusters to 100k+ nodes via a $partition filter variable.

| Panel | Type | Description |
| --- | --- | --- |
| Total / Idle / Mixed / Alloc / Down / Drain | stat | Cluster-wide node counts |
| Node States per Partition | barchart | Stacked distribution by partition |
| Partition Summary | table | Alloc / Idle / Mix / Down / Drain per partition |
| Down & Drain Nodes | table | Always scalable — only shows degraded nodes |
| Node CPU & Memory Detail | table | Per-node: CPU alloc/total/%, Mem alloc/total/% — filtered by $partition |
| CPU Allocated per Node (top 10) | timeseries | Most loaded nodes over time |
| Memory Allocated per Node (top 10) | timeseries | Top memory consumers |

Scale note: The per-node detail table is filtered by the $partition variable. On clusters with 100k+ nodes, always select a specific partition to limit results. The partition summary and "Down & Drain Nodes" panels are always O(partitions).
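
If the $partition dropdown looks empty or incomplete, the available values can be listed straight from Prometheus; the URL is an example and the label name "partition" is inferred from the per-partition panels above:

# Enumerate the values Prometheus knows for the "partition" label
curl -s 'http://prometheus:9090/api/v1/label/partition/values'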

Screenshots: Node Detail — summary & distribution · problem nodes & per-node table · timeseries


4. Cluster Usage Statistics

File: 04-slurm-usage.json | UID: slurm-usage

Comprehensive utilization metrics: CPU/memory/GPU gauges, per-user and per-account breakdowns, fairshare, scheduler health, and trend timeseries.

| Panel | Type | Description |
| --- | --- | --- |
| CPU / Node / Memory / GPU Utilization | gauge | Current utilization percentages |
| Avg CPU Util (period) | stat | avg_over_time(cpu_util[1h:5m]) |
| Max CPU Util (period) | stat | max_over_time(cpu_util[1h:5m]) |
| Total CPUs / Nodes / GPUs / Running Jobs | stat | Cluster capacity snapshot |
| CPU Utilization % Over Time | timeseries | Instantaneous + rolling average |
| Memory Utilization % Over Time | timeseries | sum(node_mem_alloc) / sum(node_mem_total) |
| GPU Utilization & Allocation % | timeseries | GPU util and alloc trend |
| Node States Over Time | timeseries | Idle / Mix / Alloc / Down / Drain |
| FairShare per Account | timeseries | Fairshare factor by account |
| Running / Pending Jobs per User & Account | timeseries | Per-user and per-account breakdown |
| CPUs Allocated per User & Account | timeseries | CPU demand by user/account |
| GPU Running per Account & User | timeseries | GPU job distribution |
| CPU Allocation (total) | timeseries | Alloc / Idle / Total CPUs |
| CPUs Allocated per Partition | timeseries | Per-partition CPU usage |
| GPU States Over Time | timeseries | Total / Alloc / Idle GPU trend |
| Scheduler Stats | stat + timeseries | Threads, queue size, cycle time, backfill |
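
cpu_util in the Avg/Max panels is shorthand; assuming it expands to the same ratio as the overview gauge (slurm_cpus_alloc / slurm_cpus_total * 100), the 1-hour rolling average can be reproduced with a PromQL subquery (Prometheus URL is an example):

# Rolling 1h average of CPU utilization, sampled every 5 minutes ([1h:5m] subquery)
curl -s -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=avg_over_time((slurm_cpus_alloc / slurm_cpus_total * 100)[1h:5m])'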

Screenshots: Cluster Usage — gauges & stats · timeseries utilization · per-user breakdown


5. Scheduler

File: 05-slurm-scheduler.json | UID: slurm-scheduler

Deep-dive into slurmctld internals: main scheduler and backfill cycle times, RPC statistics, queue sizes, and thread counts.

| Panel | Type | Description |
| --- | --- | --- |
| Scheduler Threads | stat | Active scheduler threads |
| Queue Size | stat | Jobs in scheduler queue |
| Mean Cycle / Last Cycle | stat | Scheduler cycle times (µs) |
| DBD Queue | gauge | Slurm accounting daemon queue depth |
| Cycles/min | stat | Scheduler frequency |
| Scheduler Cycles (µs) | timeseries | Last vs mean cycle time |
| Backfill Cycles (µs) | timeseries | Backfill last vs mean |
| RPC Stats — Count | bargauge | Top RPCs by count |
| RPC Stats — Time | bargauge | Top RPCs by total time |
| RPC Calls Over Time | timeseries | RPC call frequency |
| User RPC Stats | table | Per-user RPC counts |
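
These panels can be cross-checked against sdiag on the controller, which reports the same scheduler, backfill, and RPC statistics (assuming shell access to a host with Slurm client tools):

# Main scheduler and backfill cycle statistics as reported by slurmctld
sdiag | grep -Ei 'cycle|backfill|queue'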

Screenshots: Scheduler — overview · RPC stats


6. Reservations & Licenses

File: 06-slurm-reservations.json | UID: slurm-reservations

Active Slurm reservations, per-reservation node states, and license usage. License panels show "No data" when no licenses are configured — this is expected.

| Panel | Type | Description |
| --- | --- | --- |
| Active Reservations | stat | Count of current reservations |
| Reservation Info | table | Name / State / Users / Nodes / Partition / Flags |
| Reservation Timeline | timeseries | Start/end times as gauge |
| Nodes per Reservation | bargauge | Node counts by reservation |
| Node States per Reservation | timeseries | Alloc / Idle / Mix / Down / Drain per reservation |
| Reservation Nodes Healthy % | gauge | (alloc+idle+mix+planned) / total_in_reservation |
| License Usage | timeseries | Total / Used / Free / Reserved per license |
| License Utilization % | bargauge | used / total * 100 per license |

Note: License panels (slurm_license_*) only show data when Slurm licenses are configured (scontrol show licenses).
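
For a quick test on a cluster without licenses, a local license can be declared in slurm.conf; the name and count below are only examples:

# In slurm.conf (illustrative values):
#   Licenses=matlab:10
scontrol reconfigure       # apply the change
scontrol show licenses     # the exporter should then report slurm_license_* for it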

Screenshot: Reservations & Licenses


7. Accounting (new in v1.7.0)

File: 07-slurm-accounting.json | UID: slurm-accounting

Dedicated HPC accounting dashboard. It answers the key question "Why is this user's priority low?" by exposing the FairShare components (NormUsage, NormShares) and their ratio directly in Grafana.

Requires --collector.fairshare.user-metrics=true (the default) for the user-level panels. Filter with the $account and $user template variables.

| Panel | Type | Description |
| --- | --- | --- |
| Running / Pending / CPUs / Active Users & Accounts | stat | Current cluster snapshot |
| Top Users — Running Jobs / CPUs | bargauge | topk(15, slurm_user_jobs_running/cpus_running) |
| Top Accounts — Running CPUs | bargauge | topk(10, slurm_account_cpus_running) |
| User FairShare Summary | table | user · account · FairShare · NormShares · NormUsage · Usage/Shares ratio · CPU-seconds |
| Users by FairShare (ascending) | bargauge | Lowest priority users at the top |
| Account Summary | table | FairShare · NormShares · NormUsage · CPU-seconds per account |
| FairShare per Account Over Time | timeseries | Account priority trend |
| Running Jobs / CPUs per User | timeseries | User activity history |
| Running Jobs per Account | timeseries | Account activity history |
| User FairShare Over Time | timeseries | Tracks priority evolution — declining = overusing share |
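
The FairShare columns can be cross-checked against Slurm's own accounting view; sshare's long output includes NormShares, NormUsage, and the FairShare factor (the account and user names below are placeholders):

# Compare dashboard values with Slurm directly (placeholders: myaccount, myuser)
sshare -l -A myaccount -u myuser
# Or all associations at once:
sshare -la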

Screenshots: Accounting — top · FairShare table


8. Exporter Health

File: 08-slurm-health.json | UID: slurm-health

Monitors the health of the exporter itself: collector success/failure, scrape duration history, and Slurm binary availability.

| Panel | Type | Description |
| --- | --- | --- |
| Collector Status | stat | OK / FAIL per collector — background color alert |
| Collectors Healthy % | gauge | sum(success) / count(success) * 100 |
| Collector Health Timeline | state-timeline | Visual OK/FAIL history per collector |
| Scrape Duration Status History | status-history | Duration colored by threshold (<1s green, 1-5s yellow, >5s red) |
| Last Scrape Duration | bargauge | Current duration per collector, sorted slowest-first |
| Duration Over Time | timeseries | Collector duration trend for detecting degradation |
| Slurm Version | stat | slurm_info{type="general"} — active Slurm version |
| Slurm Binaries | table | Availability check of each Slurm binary (sacct, sbatch, sinfo, …) |
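
The same signals can be checked without Grafana by scraping the exporter endpoint directly; the listen address below is an example and the exact metric names may differ:

# Grep the raw /metrics output for collector status and Slurm info
curl -s http://localhost:8080/metrics | grep -Ei 'collector|slurm_info'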

Screenshots: Exporter Health — status & timeline · duration & binaries


9. Exporter Performance (new in v1.8.0)

File: 09-slurm-exporter-perf.json | UID: slurm-exporter-perf

Internal performance dashboard for the exporter itself. Use it to validate that optimizations work and to catch slowdowns before they cause scrape failures.

| Panel | Type | Description |
| --- | --- | --- |
| Collectors OK / FAIL / Total Errors | stat | Scrape health indicators |
| scontrol Cache Age | stat | Should oscillate 0-25s — flat at TTL = problem |
| sacct Last Refresh | stat | Age since last sacct_efficiency refresh |
| Avg Duration per Command | bargauge | Which commands are slowest? |
| Call Count per Command | bargauge | sinfo should be ~1 call per scrape after Axe 2 |
| Command Duration p99 + avg | timeseries | Latency spikes = slurmctld overload |
| Command Error Rate | timeseries | Any > 0 = CLI failures |
| Collector Scrape Duration | timeseries | Alert if approaching scrape_timeout |
| scontrol Cache Age Over Time | timeseries | Cache hit/miss pattern |
| sacct Refresh Age | timeseries | sacct_efficiency staleness |

Generating Screenshots

Screenshots are generated using Playwright in a Docker container:

# Prerequisite: Grafana running on localhost:3000
./scripts/take_screenshots.sh /tmp/screenshots

Or manually:

GSESSION=$(curl -s -D - -X POST http://localhost:3000/login \
  -H "Content-Type: application/json" \
  -d '{"user":"admin","password":"admin"}' \
  | grep 'grafana_session=' | grep -v expiry \
  | sed 's/.*grafana_session=\([^;]*\).*/\1/' | tr -d '\r\n')

docker run --rm --network slurm_slurm-network \
  -e "GSESSION=$GSESSION" -e "LANG=en_US.UTF-8" \
  -v /tmp/screenshots:/screenshots -w /work \
  mcr.microsoft.com/playwright:v1.51.1-noble \
  bash -c 'npm i -y playwright@1.51.1 > /dev/null && node script.js'

See scripts/take_screenshots.sh for the full script.

10. All Metrics Reference

File: 10-slurm-all-metrics.json | UID: slurm-all-metrics

An exhaustive reference dashboard showing every metric exported by the Slurm Exporter, organized by collector. Useful for:

  • Discovering available metrics
  • Debugging queries
  • Validating that all collectors are working

115 panels covering all 14 collectors (+ new v1.8 metrics): accounts, cpus, fairshare, gpus, info, licenses, node, nodes, partitions, queue, reservation_nodes, reservations, scheduler, users

This dashboard is intentionally dense. Use the other focused dashboards for daily monitoring. This one is a reference/debug tool.
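
For the same kind of discovery at the source, every metric name the exporter currently exposes can be listed from its /metrics endpoint (the listen address is an example):

# Unique metric names, one per line
curl -s http://localhost:8080/metrics | grep -v '^#' | cut -d'{' -f1 | awk '{print $1}' | sort -u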