Configuration Reference

⚙️ Usage

The exporter can be configured using command-line flags.

Basic execution:

./slurm_exporter --web.listen-address=":9341"

Using a configuration file for web settings (TLS/Basic Auth):

./slurm_exporter --web.config.file=/path/to/web-config.yml

For details on the web-config.yml format, see the Exporter Toolkit documentation.
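
As a rough sketch of what such a file can look like, the snippet below writes a minimal example that enables TLS and one Basic Auth user. The `tls_server_config` and `basic_auth_users` keys come from the Exporter Toolkit web configuration format. The output file name, certificate paths, user name, and hash placeholder are assumptions to replace with your own values.

```shell
# Write an illustrative web config. All paths, the user name, and the hash
# placeholder are hypothetical; substitute real values before use.
cat > web-config.example.yml <<'EOF'
tls_server_config:
  cert_file: /path/to/server.crt
  key_file: /path/to/server.key
basic_auth_users:
  # The value must be a bcrypt hash of the password, not the plain-text password.
  prometheus: "<bcrypt-hash-of-password>"
EOF
```

Once the placeholders are filled in, pass the file to the exporter with `--web.config.file`.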

View help and all available options:

./slurm_exporter --help

Command-Line Options

| Flag | Description | Default |
|------|-------------|---------|
| `--web.listen-address` | Address to listen on for web interface and telemetry | `:9341` |
| `--web.config.file` | Path to configuration file for TLS/Basic Auth | (none) |
| `--command.timeout` | Timeout for executing Slurm commands | `5s` |
| `--log.level` | Log level: `debug`, `info`, `warn`, `error` | `info` |
| `--log.format` | Log format: `json`, `text` | `text` |
| `--[no-]collector.<name>` | Enable or disable a collector (kingpin boolean flag). Most collectors default to enabled; `sacct_efficiency` defaults to disabled. | see below |
| `--collector.nodes.feature-set` | Include `active_feature_set` label in `slurm_nodes_*` metrics | `true` |
| `--collector.fairshare.user-metrics` | Collect per-user fairshare metrics (`slurm_user_fairshare_*`). Disable on clusters with many users to reduce cardinality. | `true` |
| `--collector.queue.user-label` | Include `user` label in `slurm_queue_*` metrics. Disable on clusters with many users to reduce cardinality. | `true` |
| `--collector.sacct_efficiency` | Enable the `sacct_efficiency` collector (disabled by default because it queries SlurmDBD). | `false` |
| `--collector.sacct.interval` | Background refresh interval for `sacct_efficiency`. | `5m` |
| `--collector.sacct.lookback` | Time window for `sacct_efficiency` queries. | `1h` |
| `--slurm.bin-path` | Directory containing Slurm binaries. When empty, binaries are resolved via `$PATH`. Required when running in containers with host-mounted binaries. | (empty) |
| `--web.disable-exporter-metrics` | Exclude Go runtime and process metrics from `/metrics` | `false` |

Available collectors:

| Collector | Default | Description |
|-----------|---------|-------------|
| `accounts` | enabled | Job stats by Slurm account |
| `cpus` | enabled | Cluster-wide CPU states |
| `drain_reason` | enabled | Node drain/down reason and timestamp |
| `fairshare` | enabled | Fairshare factor per account and user |
| `gpus` | enabled | Cluster-wide GPU states |
| `info` | enabled | Slurm binary versions |
| `licenses` | enabled | License counts |
| `node` | enabled | Per-node CPU and memory detail |
| `nodes` | enabled | Aggregated node states by partition |
| `partitions` | enabled | CPU states and jobs per partition |
| `queue` | enabled | Job states and core counts by user/partition |
| `reservation_nodes` | enabled | Node states per reservation |
| `reservations` | enabled | Active reservation details |
| `sacct_efficiency` | disabled | CPU/mem job efficiency via sacct (queries SlurmDBD) |
| `scheduler` | enabled | slurmctld internals and RPC stats |
| `users` | enabled | Job stats by user |

Enabling and Disabling Collectors

Most collectors are enabled by default. The `sacct_efficiency` collector is disabled by default because it queries SlurmDBD and can be expensive; enable it explicitly with `--collector.sacct_efficiency`.

Use --[no-]collector.<name> (kingpin boolean syntax) to enable or disable individual collectors.

Example: Disable the scheduler and partitions collectors

./slurm_exporter --no-collector.scheduler --no-collector.partitions

Example: Disable the gpus collector

./slurm_exporter --no-collector.gpus

Example: Run only the nodes and cpus collectors

This requires disabling every other default-enabled collector individually (sacct_efficiency is already disabled by default).

./slurm_exporter \
  --no-collector.accounts \
  --no-collector.drain_reason \
  --no-collector.fairshare \
  --no-collector.gpus \
  --no-collector.info \
  --no-collector.licenses \
  --no-collector.node \
  --no-collector.partitions \
  --no-collector.queue \
  --no-collector.reservation_nodes \
  --no-collector.reservations \
  --no-collector.scheduler \
  --no-collector.users
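
Because there is no single "enable only these" flag, the list of `--no-collector.<name>` flags can be generated instead of typed by hand. The following is a minimal sketch, assuming a POSIX shell; the `ALL_COLLECTORS` variable and the `disable_flags` helper are hypothetical names, not part of the exporter, and the collector list is copied from the Available collectors table, so it needs updating if that table changes.

```shell
# Names of the default-enabled collectors, copied from the "Available collectors"
# table. sacct_efficiency is left out because it is already disabled by default.
ALL_COLLECTORS="accounts cpus drain_reason fairshare gpus info licenses node nodes partitions queue reservation_nodes reservations scheduler users"

# disable_flags KEEP...
# Prints a --no-collector.<name> flag for every collector not listed in KEEP.
disable_flags() {
  for name in $ALL_COLLECTORS; do
    keep=no
    for wanted in "$@"; do
      if [ "$name" = "$wanted" ]; then
        keep=yes
      fi
    done
    if [ "$keep" = no ]; then
      printf '%s ' "--no-collector.$name"
    fi
  done
  printf '\n'
}

# Preview the flags that would be passed when keeping only nodes and cpus.
disable_flags nodes cpus
```

To start the exporter with the generated flags, the command substitution is left unquoted so each flag becomes a separate argument: `./slurm_exporter $(disable_flags nodes cpus)`.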

Example: Custom timeout and logging

./slurm_exporter \
  --command.timeout=10s \
  --log.level=debug \
  --log.format=json

📡 Prometheus Configuration

scrape_configs:
  - job_name: 'slurm_exporter'
    scrape_interval: 30s
    scrape_timeout: 30s
    static_configs:
      - targets: ['slurm_host.fqdn:9341']

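`scrape_interval`: A 30s interval is recommended to avoid overloading the Slurm controller with frequent command executions.
`scrape_timeout`: Must be less than or equal to `scrape_interval`; Prometheus rejects a configuration in which the timeout exceeds the interval. Set it long enough for the slowest collector to finish, otherwise scrapes fail with "context deadline exceeded" errors.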

Check config:

promtool check config prometheus.yml

Internal Exporter Metrics

Each collector emits two self-monitoring metrics:

| Metric | Description | Labels |
|--------|-------------|--------|
| `slurm_exporter_collector_success` | `1` if last scrape succeeded, `0` if the collector panicked | `collector` |
| `slurm_exporter_collector_duration_seconds` | Wall time of the last `Collect()` call | `collector` |

These allow per-collector alerting independently of the target-level `up` metric that Prometheus records for each scrape.
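
One way to use these metrics is a Prometheus alerting rule that fires when any collector keeps reporting failure. The snippet below is a sketch under assumptions: the rule file name, alert name, `for` duration, and `severity` label are illustrative choices, and only the metric name and its `collector` label come from the table above.

```shell
# Write an illustrative alerting rule file. The file name, alert name, "for"
# duration, and severity label are hypothetical choices.
cat > slurm_exporter_rules.example.yml <<'EOF'
groups:
  - name: slurm_exporter
    rules:
      - alert: SlurmExporterCollectorFailed
        expr: slurm_exporter_collector_success == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "slurm_exporter collector {{ $labels.collector }} is failing on {{ $labels.instance }}"
EOF

# Validate the rule file when promtool is installed; skip quietly otherwise.
if command -v promtool >/dev/null 2>&1; then
  promtool check rules slurm_exporter_rules.example.yml
fi
```

The quoted heredoc delimiter keeps the shell from expanding `$labels`, so the template reaches Prometheus unchanged. The file still has to be referenced from `rule_files` in `prometheus.yml` before the alert is evaluated.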

Performance Considerations

Command Timeout: The default timeout is 5 seconds. Increase it if Slurm commands take longer in your environment:
```
./slurm_exporter --command.timeout=10s
```
Scrape Interval: Use at least 30 seconds to avoid overloading the Slurm controller with frequent command executions.
Collector Selection: Disable unused collectors to reduce load and improve performance:
```
./slurm_exporter --no-collector.fairshare --no-collector.reservations
```
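
To decide which collectors are worth disabling, or whether the command timeout needs raising, it can help to rank collectors by their last reported duration. The sketch below assumes a shell with `grep`, `sed -E`, and a `sort` that supports `-g` (GNU and BSD versions do), and that each `slurm_exporter_collector_duration_seconds` sample carries only the `collector` label. The `slowest_collectors` helper is a hypothetical name, not part of the exporter.

```shell
# slowest_collectors
# Reads Prometheus text exposition on stdin and prints "<seconds> <collector>"
# pairs, slowest collector first.
slowest_collectors() {
  grep '^slurm_exporter_collector_duration_seconds{' \
    | sed -E 's/^[^{]*\{collector="([^"]*)"\} +([^ ]+).*$/\2 \1/' \
    | sort -rg
}

# Demonstration with two made-up samples; real input would come from /metrics.
printf '%s\n' \
  'slurm_exporter_collector_duration_seconds{collector="nodes"} 0.25' \
  'slurm_exporter_collector_duration_seconds{collector="queue"} 1.5' \
  | slowest_collectors
```

Against a running exporter, the same helper can be fed from the metrics endpoint, for example `curl -s http://slurm_host.fqdn:9341/metrics | slowest_collectors`; adjust the scheme and credentials if TLS or Basic Auth is enabled.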
