SlurmGenie

AI Copilot for Slurm GPU Clusters

A fully offline, privacy-first CLI tool that helps ML engineers and HPC admins diagnose failed Slurm jobs, monitor GPU utilization, and optimize sbatch scripts. By default, nothing is sent to the cloud.

Why SlurmGenie?

Running ML workloads on HPC clusters is painful. When a job fails, you're left grepping through thousands of log lines to find a CUDA OOM error that could have been caught in seconds. When a job runs, you have no idea if it's actually using the GPUs you reserved.

SlurmGenie fixes this:

| Problem | SlurmGenie Solution |
| --- | --- |
| Job failed, no idea why | `slurmgenie diagnose job <ID>` detects 50+ error patterns in seconds |
| GPU reserved but idle | `slurmgenie monitor --idle` finds wasted allocations instantly |
| Bad sbatch config | `slurmgenie suggest script job.sbatch` gives cluster-aware optimization advice |
| Can't use cloud AI on HPC | Runs a local 400MB LLM on the login node, fully offline |
| Fear of data leaks | Zero cloud calls. All processing stays on your cluster. |

Features

| Feature | Description |
| --- | --- |
| 🔍 Job Diagnosis | Automatically analyze failed Slurm jobs. Pinpoints the exact line and explains the root cause. No LLM required. |
| 🧠 50+ Error Patterns | Recognizes CUDA OOM, NCCL timeouts, disk quota, node failures, GPU Xid errors, and more. |
| 📊 GPU Monitoring | Live GPU utilization, memory, temperature, and power via `nvidia-smi`. Detects idle allocations. |
| ⚡ sbatch Advisor | Rules-based + context-aware suggestions: AST analysis, job history profiling, module validation. Instant, no LLM. |
| 🤖 AI Explain | Add `--explain` to any diagnosis to get a human-readable root cause narrative from the LLM. |
| ✨ AI Script Fix | Add `--ai` to `suggest` to generate a corrected sbatch script. |
| 📝 Script Generation | `slurmgenie generate "description"` turns natural language into an sbatch script, validated by the rules engine. |
| 🔔 Slack Alerts | Block Kit-formatted failure notifications with root cause summaries. |
| 🔒 Privacy-First | Local-first by default. Cloud APIs (OpenAI, Anthropic) are opt-in only. |
| 🚀 GPU-Accelerated LLM | Auto-detects NVIDIA GPUs and offloads inference for 10-50x speedup. |
| 📄 Output to File | Save diagnosis results with the `--output` / `-o` flag. |

Quick Start

# Install
pip install slurmgenie

# Check your cluster
slurmgenie status

# Diagnose a failed job
slurmgenie diagnose job 12345

# Diagnose with AI explanation
slurmgenie diagnose job 12345 --explain

# Save diagnosis output to file
slurmgenie diagnose job 12345 --output result.json

# Get optimization advice for your script
slurmgenie suggest script train.sbatch

# AI-generate a corrected script
slurmgenie suggest script train.sbatch --ai

# Generate a script from natural language
slurmgenie generate "Train a ViT with DDP on 2 nodes, 4 GPUs each, 24 hours"

Installation

Standard Installation

Requirements: Python 3.10+, Slurm 21+

# Core features only — diagnosis, monitoring, advisor, Slack alerts (~5MB)
pip install slurmgenie

# With local AI explanations (adds llama-cpp-python + huggingface-hub)
# Downloads a ~400MB GGUF model on first use of the --explain flag
pip install "slurmgenie[llm]"

# With Slack SDK (only if you need advanced Slack features beyond webhooks)
pip install "slurmgenie[slack]"

# Everything
pip install "slurmgenie[llm,slack]"

# Development install from source
git clone https://github.com/OriAlpha/SlurmGenie.git
cd SlurmGenie
pip install -e ".[dev,llm,slack]"

What's included in each install option:

| Package | `pip install slurmgenie` | `[llm]` | `[slack]` | `[dev]` |
| --- | --- | --- | --- | --- |
| `typer`, `rich`, `pydantic`, `requests`, `pyyaml` | ✅ | ✅ | ✅ | ✅ |
| `llama-cpp-python` | ❌ | ✅ | ❌ | ❌ |
| `huggingface-hub` | ❌ | ✅ | ❌ | ❌ |
| `slack-sdk` | ❌ | ❌ | ✅ | ❌ |
| `pytest`, `ruff`, `pytest-cov` | ❌ | ❌ | ❌ | ✅ |

Note

The Slack integration in the core install uses incoming webhooks (requests only — no Slack SDK needed). Install [slack] only if you need the full Slack SDK (e.g., bot tokens, socket mode).

Air-Gapped / Offline Installation

Many HPC clusters have no internet access on login nodes. SlurmGenie is designed for this:

Step 1: Build a wheel on an internet-connected machine

git clone https://github.com/OriAlpha/SlurmGenie.git
cd SlurmGenie
pip install build
python -m build --wheel
# Produces: dist/slurmgenie-X.Y.Z-py3-none-any.whl

Step 2: Transfer to your cluster

scp dist/slurmgenie-*.whl user@cluster.hpc.edu:~/

Step 3: Install on the login node

pip install slurmgenie-*.whl
# Or with LLM support (bundle llama-cpp-python separately).
# A quoted glob is not expanded by the shell, so resolve the filename first:
pip install "$(ls slurmgenie-*.whl)[llm]"

Tip

For the local LLM feature on air-gapped clusters, download the GGUF model file manually from HuggingFace on an internet-connected machine and set SLURMGENIE_LLM_MODEL_FILE to its local path.
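As a rough illustration of how such an override could be resolved, the sketch below prefers the path in SLURMGENIE_LLM_MODEL_FILE and only falls back to a HuggingFace download when the variable is unset. The function name `resolve_model_path` is hypothetical and not taken from SlurmGenie's source; the default repo and filename are the ones listed under Model Selection, and `hf_hub_download` comes from the optional `huggingface-hub` package.

```python
import os
from pathlib import Path

# Defaults as documented in the Model Selection table.
DEFAULT_REPO = "Qwen/Qwen2.5-0.5B-Instruct-GGUF"
DEFAULT_FILE = "qwen2.5-0.5b-instruct-q4_k_m.gguf"


def resolve_model_path(env=None):
    """Return a local GGUF path, preferring the offline override (hypothetical helper)."""
    env = os.environ if env is None else env
    override = env.get("SLURMGENIE_LLM_MODEL_FILE")
    if override:
        path = Path(override).expanduser()
        if not path.is_file():
            raise FileNotFoundError(
                f"SLURMGENIE_LLM_MODEL_FILE points to a missing file: {path}"
            )
        return path
    # Online fallback: imported lazily so air-gapped installs never need it.
    from huggingface_hub import hf_hub_download

    return Path(hf_hub_download(repo_id=DEFAULT_REPO, filename=DEFAULT_FILE))
```

Failing loudly on a missing file, rather than silently falling back to a download, is a deliberate choice for air-gapped nodes where the download could never succeed.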

Usage

Diagnosing Failed Jobs

# Diagnose by Slurm Job ID (auto-discovers log via scontrol)
slurmgenie diagnose job 12345

# Diagnose with AI explanation (requires [llm] install)
slurmgenie diagnose job 12345 --explain

# Diagnose the last 5 failed jobs
slurmgenie diagnose recent --count 5

# Bulk explain recent failures (slow — triggers LLM per job)
slurmgenie diagnose recent --count 3 --explain

# Diagnose a log file directly (works without Slurm in PATH)
slurmgenie diagnose log /path/to/slurm-12345.out

Example output:

╭──────────────────────────── SlurmGenie Diagnosis ─────────────────────────────╮
│  Job ID: 12345  │  Name: train_llama  │  State: FAILED  │  13 lines analyzed  │
│                                                                                │
│  ISSUES FOUND   CRIT 3 critical   WARN 0 warnings                              │
╰────────────────────────────────────────────────────────────────────────────────╯

1. CUDA Out of Memory (line 12)
   GPU memory exhausted during tensor allocation.
   → Reduce batch size (e.g., halve it and retry)
   → Enable gradient checkpointing: model.gradient_checkpointing_enable()
   → Use mixed precision: --fp16 or torch.cuda.amp

Note

SlurmGenie first queries scontrol show job <ID> to find the StdOut log path automatically. No need to specify --log in most cases.
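The log discovery step can be pictured roughly as follows. This is a minimal sketch, not SlurmGenie's actual code: `parse_stdout_path` and `find_job_log` are hypothetical names, and the sample text only imitates the `StdOut=` line that `scontrol show job` prints.

```python
import re
import subprocess


def parse_stdout_path(scontrol_output):
    """Extract the StdOut= value from `scontrol show job` text, or None."""
    match = re.search(r"^\s*StdOut=(.+?)\s*$", scontrol_output, flags=re.MULTILINE)
    return match.group(1) if match else None


def find_job_log(job_id, scontrol="scontrol"):
    """Run scontrol with an argument list (shell=False) and return the log path."""
    # Accept plain IDs and array task IDs such as 12345_1; reject anything else.
    if not re.fullmatch(r"\d+(_\d+)?", str(job_id)):
        raise ValueError(f"unexpected job id: {job_id!r}")
    result = subprocess.run(
        [scontrol, "show", "job", str(job_id)],
        capture_output=True, text=True, check=True, shell=False,
    )
    return parse_stdout_path(result.stdout)


# Imitation of scontrol output, for illustration only.
sample = (
    "JobId=12345 JobName=train_llama\n"
    "   StdErr=/scratch/user/logs/slurm-12345.out\n"
    "   StdOut=/scratch/user/logs/slurm-12345.out\n"
)
print(parse_stdout_path(sample))
```

Keeping the parser separate from the subprocess call means it can be tested on saved output, without Slurm in PATH.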

Model Selection (Local LLM)

By default, slurmgenie diagnose --explain uses Qwen2.5-0.5B-Instruct (compact, fast, works on CPU). You can switch to any GGUF model from HuggingFace:

# Use a different model (auto-downloads on first use)
slurmgenie diagnose job 12345 --explain \
  --llm-repo "bartowski/Llama-3.2-1B-Instruct-GGUF" \
  --llm-model "Llama-3.2-1B-Instruct-Q4_K_M.gguf"

# Or just change the model file within the same repo
slurmgenie diagnose job 12345 --explain -m "qwen2.5-1.5b-instruct-q4_k_m.gguf"

| Flag | Default | Description |
| --- | --- | --- |
| `--llm-repo` | `Qwen/Qwen2.5-0.5B-Instruct-GGUF` | HuggingFace repo ID containing GGUF files |
| `--llm-model`, `-m` | `qwen2.5-0.5b-instruct-q4_k_m.gguf` | Specific GGUF filename in the repo |

Models are cached in ~/.cache/huggingface/hub/ after first download.

Monitoring GPU Utilization

# Live snapshot of all GPUs on the current node
slurmgenie monitor

# Show only idle / wasted GPU allocations
slurmgenie monitor --idle

# Continuously refresh every 5 seconds (like htop)
slurmgenie monitor --watch --interval 5

# Monitor resource usage for a specific running job
slurmgenie monitor --job 12345
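
For readers curious how idle detection might work, here is a small sketch that parses `nvidia-smi` CSV output and flags GPUs below a utilization threshold. It assumes the query shown in `QUERY_ARGS` (standard `nvidia-smi` options) and the 10.0 default of SLURMGENIE_GPU_IDLE_THRESHOLD from the Configuration section; the function name `find_idle_gpus` is hypothetical, and SlurmGenie's real logic may weigh memory or power as well.

```python
import csv
import io

# Arguments one could pass to nvidia-smi to get machine-readable output.
QUERY_ARGS = [
    "--query-gpu=index,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def find_idle_gpus(csv_text, idle_threshold=10.0):
    """Return indices of GPUs whose utilization (percent) is below the threshold."""
    idle = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) < 2:
            continue  # blank or malformed line
        try:
            index, utilization = int(row[0]), float(row[1])
        except ValueError:
            continue  # e.g. "[N/A]" on devices that do not report utilization
        if utilization < idle_threshold:
            idle.append(index)
    return idle


# Illustrative output for a 3-GPU node: GPU 0 busy, GPUs 1 and 2 nearly idle.
sample = "0, 97, 40536, 81920\n1, 0, 3, 81920\n2, 4, 512, 81920\n"
print(find_idle_gpus(sample))
```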

Optimizing sbatch Scripts

slurmgenie suggest script train.sbatch

The advisor performs 4 layers of analysis in under a second:

1. Rules Engine: Checks CPU-to-GPU ratio, missing time limits, pip install in scripts, etc.
2. Dynamic Topology: Queries your cluster's actual node hardware to give partition-specific ratios.
3. Python AST Analysis: Reads your Python script statically. Detects DistributedDataParallel or FSDP and warns if --ntasks-per-node is missing.
4. Historical Profiling: Checks your last 5 jobs with the same name in sacct. If you requested 200GB but historically use 42GB, it tells you.
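
The AST layer described above can be approximated with Python's standard `ast` module. The sketch below is an illustration under stated assumptions, not the advisor's real implementation: `uses_distributed_training` and `check_ntasks` are hypothetical names, and the set of wrapper names checked is a guess at what "DistributedDataParallel or FSDP" covers.

```python
import ast

# Names that suggest multi-process training (assumed list).
DISTRIBUTED_NAMES = {"DistributedDataParallel", "FullyShardedDataParallel", "FSDP"}


def uses_distributed_training(source):
    """Return the distributed wrapper names referenced anywhere in the source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name) and node.id in DISTRIBUTED_NAMES:
            found.add(node.id)
        elif isinstance(node, ast.Attribute) and node.attr in DISTRIBUTED_NAMES:
            found.add(node.attr)
        elif isinstance(node, ast.ImportFrom):
            found.update(a.name for a in node.names if a.name in DISTRIBUTED_NAMES)
    return found


def check_ntasks(train_source, sbatch_text):
    """Warn when distributed training is used without --ntasks-per-node."""
    names = uses_distributed_training(train_source)
    if names and "--ntasks-per-node" not in sbatch_text:
        return f"{', '.join(sorted(names))} detected but --ntasks-per-node is missing"
    return None


train = (
    "import torch\n"
    "from torch.nn.parallel import DistributedDataParallel as DDP\n"
    "model = DDP(torch.nn.Linear(4, 4))\n"
)
print(check_ntasks(train, "#SBATCH --gres=gpu:4\n"))
```

Because the script is only parsed and never imported or executed, the check works on login nodes that have no GPU or PyTorch install.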

Example output:

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Issue                     ┃ Recommendation                                        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ [CRITICAL] DDP detected   │ Add: #SBATCH --ntasks-per-node=4                      │
│ [CRITICAL] Module missing │ 'cuda/11.8' not found. Use: module spider cuda        │
│ [INFO] Memory history     │ Historically uses ~42GB. Consider --mem=63G           │
└───────────────────────────┴───────────────────────────────────────────────────────┘
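
The memory recommendation in the last row can be reproduced with a short calculation. This sketch assumes a 1.5x headroom factor, which is inferred from the example (42GB observed, 63G suggested) and is not confirmed by the source; `maxrss_to_gb` and `recommend_mem` are hypothetical helpers, and treating a value without a suffix as bytes is likewise an assumption.

```python
import math

# Conversion factors from sacct-style unit suffixes to GB.
_UNITS_TO_GB = {"K": 1 / 1024**2, "M": 1 / 1024, "G": 1.0, "T": 1024.0}


def maxrss_to_gb(value):
    """Convert a MaxRSS string such as '42G' or '44040192K' to GB."""
    value = value.strip()
    if not value:
        return 0.0
    suffix = value[-1].upper()
    if suffix in _UNITS_TO_GB:
        return float(value[:-1]) * _UNITS_TO_GB[suffix]
    return float(value) / 1024**3  # assumption: no suffix means bytes


def recommend_mem(maxrss_values, requested_gb, headroom=1.5):
    """Suggest a smaller --mem when past peaks sit well below the request."""
    peak = max((maxrss_to_gb(v) for v in maxrss_values), default=0.0)
    suggested = math.ceil(peak * headroom)
    if peak and suggested < requested_gb:
        return f"Historically uses ~{peak:.0f}GB. Consider --mem={suggested}G"
    return None


print(recommend_mem(["42G", "38G", "44040192K"], requested_gb=200))
```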

Slack Notifications

# Test your Slack integration
slurmgenie notify test

# Watch for job failures and send alerts automatically (realtime mode)
slurmgenie notify watch

# Generate a daily digest of failures + at-risk jobs, then send to Slack
slurmgenie notify digest --days 1

Daily Digest (Recommended for Teams)

The notify digest command is designed for cron — it runs once, collects everything from the last N days, and posts a single compact summary to Slack:

# Run once in the morning
crontab -e
# Add:  0 8 * * * /usr/local/bin/slurmgenie notify digest --days 1

# Options:
slurmgenie notify digest \
  --days 1 \
  --running-threshold 80 \
  --min-severity CRITICAL \
  --user webel1

# Preview locally without sending to Slack
slurmgenie notify digest --days 1 --dry-run

# Save to file in addition to Slack
slurmgenie notify digest --days 1 --output /tmp/slurmgenie-digest.txt

| Flag | Default | Description |
| --- | --- | --- |
| `--days` | `1` | Look back N days for failures |
| `--running-threshold` | `80` | Flag RUNNING jobs exceeding this % of time limit |
| `--min-severity` | `WARN` | Only include failures with ≥ this severity (CRITICAL/WARNING/INFO) |
| `--user` | all | Filter to specific user |
| `--dry-run` | `False` | Preview locally; do NOT send to Slack |
| `--output` | none | Save digest text to a file |

What the digest includes:

Failed jobs table: jobID, name, user, state, elapsed time, top detected error
At-risk jobs table: RUNNING jobs near time limit or under memory pressure
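
To make the `--running-threshold` idea concrete, the following sketch converts Slurm's `[DD-]HH:MM:SS` output format to seconds and checks what share of the time limit a job has used. Both function names are hypothetical. The parser targets the format Slurm prints, so a bare number is read as seconds, unlike sbatch input where it would mean minutes.

```python
def slurm_time_to_seconds(text):
    """Parse Slurm's [DD-]HH:MM:SS (or MM:SS) output into seconds; None if no limit."""
    text = text.strip()
    if not text or text.upper() in {"UNLIMITED", "INVALID", "NOT_SET", "PARTITION_LIMIT"}:
        return None
    days, _, clock = text.rpartition("-")
    parts = [int(p) for p in clock.split(":")]
    if days:
        parts += [0] * (3 - len(parts))          # D-HH[:MM[:SS]]
    else:
        parts = [0] * (3 - len(parts)) + parts   # [[HH:]MM:]SS
    hours, minutes, seconds = parts
    return int(days or 0) * 86400 + hours * 3600 + minutes * 60 + seconds


def is_at_risk(elapsed, time_limit, threshold_pct=80):
    """True when a running job has used more than threshold_pct of its limit."""
    used = slurm_time_to_seconds(elapsed)
    limit = slurm_time_to_seconds(time_limit)
    if not used or not limit:
        return False
    return 100 * used / limit > threshold_pct


print(is_at_risk("20:00:00", "1-00:00:00"))  # 20 of 24 hours used
```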

Configure your webhook:

export SLURMGENIE_SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T.../B.../...
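
For orientation, a Block Kit failure alert sent through an incoming webhook could look roughly like the sketch below. The message layout and both function names are illustrative assumptions, not SlurmGenie's actual format. The README says the core install posts with `requests`; this sketch uses the standard library's `urllib` only so that it has no dependencies.

```python
import json
import urllib.request


def build_failure_payload(job_id, job_name, root_cause):
    """Build a Block Kit message for a failed job (hypothetical layout)."""
    return {
        # Plain-text fallback shown in notifications.
        "text": f"Job {job_id} ({job_name}) failed: {root_cause}",
        "blocks": [
            {
                "type": "header",
                "text": {"type": "plain_text", "text": f"Slurm job {job_id} failed"},
            },
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": f"*Job:* {job_name}\n*Root cause:* {root_cause}",
                },
            },
        ],
    }


def post_to_slack(webhook_url, payload, timeout=10):
    """POST the payload as JSON; returns True on HTTP 200."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status == 200


payload = build_failure_payload(12345, "train_llama", "CUDA Out of Memory")
print(json.dumps(payload, indent=2))
```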

AI-Powered Features

All AI features work with 4 interchangeable backends:

| Backend | Setup | Best For |
| --- | --- | --- |
| `local` (default) | `pip install "slurmgenie[llm]"` | Air-gapped clusters, zero config |
| `ollama` | Install Ollama, `ollama pull llama3.1:8b` | Best local quality, many model choices |
| `openai` | `export SLURMGENIE_OPENAI_API_KEY=sk-...` | Best reasoning quality |
| `anthropic` | `export SLURMGENIE_ANTHROPIC_API_KEY=sk-ant-...` | Best reasoning quality |

# Switch backends at any time
export SLURMGENIE_LLM_BACKEND=ollama    # or: local, openai, anthropic
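
One plausible way to resolve the backend from the environment is sketched below. The variable names are the ones documented in this README; the function `select_backend` and its validation behavior are assumptions for illustration.

```python
import os

# Backend name -> API key variable it needs (None for local backends).
REQUIRED_KEYS = {
    "local": None,
    "ollama": None,
    "openai": "SLURMGENIE_OPENAI_API_KEY",
    "anthropic": "SLURMGENIE_ANTHROPIC_API_KEY",
}


def select_backend(env=None):
    """Return the configured backend name, defaulting to 'local' (hypothetical helper)."""
    env = os.environ if env is None else env
    backend = env.get("SLURMGENIE_LLM_BACKEND", "local").strip().lower()
    if backend not in REQUIRED_KEYS:
        raise ValueError(f"unknown backend {backend!r}; expected one of {sorted(REQUIRED_KEYS)}")
    key = REQUIRED_KEYS[backend]
    if key and not env.get(key):
        raise ValueError(f"backend {backend!r} requires {key} to be set")
    return backend


print(select_backend({"SLURMGENIE_LLM_BACKEND": "ollama"}))
```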

Diagnose with AI Explanation (`--explain`)

# Pattern matching + AI synthesis in one command
slurmgenie diagnose job 12345 --explain
slurmgenie diagnose log slurm-12345.out --explain

The --explain flag feeds the structured diagnosis (matched errors, resource usage, job metadata) into the LLM, which synthesizes a human-readable narrative: "Your job OOM'd because the batch size exceeds A100 memory. Try reducing from 64 to 32."
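
To illustrate what "feeding the structured diagnosis into the LLM" might involve, here is a minimal prompt builder. The dictionary keys, the prompt wording, and the name `build_explain_prompt` are all hypothetical; the real prompt used by SlurmGenie is not shown in this README.

```python
def build_explain_prompt(diagnosis):
    """Turn a structured diagnosis into a compact prompt (hypothetical schema)."""
    lines = [
        "You are an HPC support assistant. Explain the root cause of this failed",
        "Slurm job in two or three sentences and suggest one concrete fix.",
        "",
        f"Job: {diagnosis['job_id']} ({diagnosis['name']}), state {diagnosis['state']}",
        f"Peak memory: {diagnosis.get('max_rss', 'unknown')}",
        "Matched errors:",
    ]
    lines += [f"- line {e['line']}: {e['pattern']}" for e in diagnosis["errors"]]
    return "\n".join(lines)


diagnosis = {
    "job_id": 12345,
    "name": "train_llama",
    "state": "FAILED",
    "max_rss": "42G",
    "errors": [{"line": 12, "pattern": "CUDA Out of Memory"}],
}
print(build_explain_prompt(diagnosis))
```

Sending only matched errors and metadata, rather than the full log, keeps the prompt within a small context window such as the 2048-token default listed under Configuration.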

AI Script Correction (`--ai`)

slurmgenie suggest script train.sbatch --ai

The rules engine finds issues → the LLM generates a corrected script → you get a clean, ready-to-use sbatch file.
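
The first stage of that pipeline, the rules engine, could be sketched along these lines. Only two of the checks named earlier (missing time limit, pip install in the script) are shown; `check_script`, the severity labels, and the messages are illustrative assumptions. The sketch only recognizes the long `--time` form and a bare `pip install` line, so `-t` and `python -m pip install` would need extra handling.

```python
import re

# Matches "#SBATCH --option=value", "#SBATCH --option value", or "#SBATCH --flag".
SBATCH_RE = re.compile(r"^#SBATCH[ \t]+(--[\w-]+)(?:[= \t]+(\S+))?", re.MULTILINE)
PIP_RE = re.compile(r"^\s*pip3?\s+install\b", re.MULTILINE)


def check_script(script_text):
    """Return a list of (severity, message) findings for an sbatch script."""
    options = dict(SBATCH_RE.findall(script_text))
    findings = []
    if "--time" not in options:
        findings.append(("WARN", "No --time limit set; the partition default applies"))
    if PIP_RE.search(script_text):
        findings.append(("WARN", "pip install inside the job script; build the environment beforehand"))
    return findings


script = (
    "#!/bin/bash\n"
    "#SBATCH --gres=gpu:4\n"
    "#SBATCH --exclusive\n"
    "#SBATCH --mem=200G\n"
    "pip install torch\n"
    "python train.py\n"
)
for severity, message in check_script(script):
    print(f"[{severity}] {message}")
```

Findings in this shape could then be passed to the LLM together with the original script when `--ai` is used.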

Script Generation (`slurmgenie generate`)

# Natural language → sbatch script, validated by the rules engine
slurmgenie generate "Fine-tune LLaMA 3 with LoRA on 1 GPU, 48GB memory, 12 hours"
slurmgenie generate "DDP training on 4 nodes, 8 GPUs each" -o train.sbatch

Configuration

Copy .env.example to .env in your working directory, or set environment variables directly:

# --- Slurm binary paths ---
export SLURMGENIE_SACCT_PATH=/usr/local/bin/sacct
export SLURMGENIE_SCONTROL_PATH=/usr/local/bin/scontrol
export SLURMGENIE_SQUEUE_PATH=/usr/local/bin/squeue
export SLURMGENIE_SSTAT_PATH=/usr/local/bin/sstat
export SLURMGENIE_NVIDIA_SMI_PATH=/usr/bin/nvidia-smi

# --- AI Backend (choose one) ---
export SLURMGENIE_LLM_BACKEND=local          # local | ollama | openai | anthropic
export SLURMGENIE_LLM_MAX_TOKENS=512
export SLURMGENIE_LLM_CONTEXT_LENGTH=2048

# --- Ollama (if using ollama backend) ---
export SLURMGENIE_OLLAMA_BASE_URL=http://localhost:11434
export SLURMGENIE_OLLAMA_MODEL=llama3.1:8b

# --- OpenAI (if using openai backend) ---
export SLURMGENIE_OPENAI_API_KEY=sk-...
export SLURMGENIE_OPENAI_MODEL=gpt-4o-mini
# export SLURMGENIE_OPENAI_BASE_URL=...     # for Azure OpenAI

# --- Anthropic (if using anthropic backend) ---
export SLURMGENIE_ANTHROPIC_API_KEY=sk-ant-...
export SLURMGENIE_ANTHROPIC_MODEL=claude-sonnet-4-20250514

# --- Slack ---
export SLURMGENIE_SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...

# --- GPU Monitoring ---
export SLURMGENIE_GPU_IDLE_THRESHOLD=10.0

# --- LLM Performance (local backend) ---
export SLURMGENIE_LLM_GPU_LAYERS=32        # 0 = CPU only, 32 = full GPU offload (auto-detected)
export SLURMGENIE_LLM_IDLE_TIMEOUT=300      # Auto-unload model after N seconds of idle

Local LLM Performance

The local LLM backend (llama-cpp-python) includes several optimizations:

| Optimization | Default | Impact |
| --- | --- | --- |
| GPU offloading | Auto-detected via `nvidia-smi` | 10-50x speedup when GPU available |
| Thread count | Full CPU count | ~2x CPU inference speedup |
| Batch size | `min(n_ctx, 512)` | Better throughput |
| Provider caching | Singleton per config | No model reinitialization between calls |

Override GPU layers manually:

# Force CPU only (useful for debugging)
export SLURMGENIE_LLM_GPU_LAYERS=0

# Force full GPU offload
export SLURMGENIE_LLM_GPU_LAYERS=32

All SLURMGENIE_* variables can also be set in a .env file.
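
The defaults in the table above map naturally onto constructor arguments of `llama_cpp.Llama` (`model_path`, `n_ctx`, `n_batch`, `n_threads`, `n_gpu_layers`). The sketch below shows one way they might be assembled; `llama_kwargs` is a hypothetical helper, and treating "nvidia-smi is on PATH" as proof of a usable GPU is a simplifying assumption.

```python
import os
import shutil


def llama_kwargs(model_path, n_ctx=2048, env=None):
    """Assemble keyword arguments for llama_cpp.Llama from the documented defaults."""
    env = os.environ if env is None else env
    override = env.get("SLURMGENIE_LLM_GPU_LAYERS")
    if override is not None:
        gpu_layers = int(override)  # explicit setting wins over auto-detection
    else:
        nvidia_smi = env.get("SLURMGENIE_NVIDIA_SMI_PATH", "nvidia-smi")
        # Assumption: an nvidia-smi binary on PATH implies a usable GPU.
        gpu_layers = 32 if shutil.which(nvidia_smi) else 0
    return {
        "model_path": str(model_path),
        "n_ctx": n_ctx,
        "n_batch": min(n_ctx, 512),
        "n_threads": os.cpu_count() or 1,
        "n_gpu_layers": gpu_layers,
    }


print(llama_kwargs("model.gguf", env={"SLURMGENIE_LLM_GPU_LAYERS": "0"}))
```

With the `[llm]` extra installed, the result could be used as `Llama(**llama_kwargs(path))`.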

How It Works

slurmgenie diagnose job 12345
        │
        ├─ scontrol show job 12345  ──▶  StdOut=/scratch/user/logs/slurm-12345.out
        │                                (no --log flag needed)
        ├─ sacct -j 12345           ──▶  State=FAILED, ExitCode=1:0, MaxRSS=42G
        │
        ├─ Read log file            ──▶  Regex pattern matching (50+ patterns)
        │
        ├─ PatternMatcher           ──▶  CUDA OOM detected on line 12
        │
        └─ Rich CLI output          ──▶  Root cause + actionable suggestions

All subprocess calls use shell=False to prevent injection. The tool never modifies cluster state — it is purely read-only.
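
The pattern matching step in the diagram can be sketched as a table of compiled regexes scanned line by line. The three patterns below match well-known messages from PyTorch, the OS, and slurmstepd; they are illustrative stand-ins, not SlurmGenie's actual pattern set, and the `Pattern` class and `match_log` function are hypothetical names.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    name: str
    regex: re.Pattern
    suggestion: str


# A tiny illustrative subset; the real tool advertises 50+ patterns.
PATTERNS = [
    Pattern("CUDA Out of Memory", re.compile(r"CUDA out of memory", re.I),
            "Reduce batch size or enable gradient checkpointing"),
    Pattern("Disk quota exceeded", re.compile(r"Disk quota exceeded", re.I),
            "Free space or write checkpoints to scratch"),
    Pattern("Time limit reached", re.compile(r"DUE TO TIME LIMIT"),
            "Request more time or checkpoint and resume"),
]


def match_log(lines):
    """Return (line_number, pattern_name, suggestion) for every match."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        for pattern in PATTERNS:
            if pattern.regex.search(line):
                hits.append((lineno, pattern.name, pattern.suggestion))
    return hits


log = [
    "epoch 1 step 100 loss 2.31",
    "RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB",
    "OSError: [Errno 122] Disk quota exceeded",
]
for lineno, name, suggestion in match_log(log):
    print(f"{name} (line {lineno}): {suggestion}")
```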

Development

See CONTRIBUTING.md for the full development guide including setup, testing, linting, and contribution workflows.

Quick Start

git clone https://github.com/OriAlpha/SlurmGenie.git
cd SlurmGenie
pip install -e ".[dev,llm]"
pytest
ruff check src/

Pre-commit Hooks

pip install pre-commit && pre-commit install

LLM Benchmarks

python benchmarks/llm_benchmark.py
python benchmarks/llm_benchmark.py --gpu-layers 32

Contributing

See CONTRIBUTING.md for the full guide. Quick summary:

1. Fork and create a feature branch
2. Write tests for new functionality
3. Run pytest and ruff check src/
4. Open a PR with a clear description

Security

No shell injection: All subprocess calls use shell=False. Module names are validated with a strict allowlist regex before being passed to bash -c.
No network calls by default: The tool makes zero outbound connections unless you explicitly configure a Slack webhook, select a cloud backend (openai or anthropic), or use --explain with the local backend, which downloads the GGUF model from HuggingFace on first use.
Read-only: SlurmGenie never submits, modifies, or cancels jobs.
No credential storage: API keys/webhooks are loaded from environment variables or .env files only — never written to disk by SlurmGenie.
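
As an illustration of the allowlist idea in the first point, a module name check might look like the sketch below. The exact regex SlurmGenie uses is not given in this README, so the character set, the length cap, and the name `is_safe_module_name` are assumptions.

```python
import re

# Allow names like "cuda/11.8" or "gcc/12.2.0": alphanumeric segments with
# . _ + - inside, separated by single slashes. Everything else is rejected.
MODULE_NAME_RE = re.compile(
    r"[A-Za-z0-9][A-Za-z0-9._+-]*(/[A-Za-z0-9][A-Za-z0-9._+-]*)*"
)


def is_safe_module_name(name):
    """True only if the whole string matches the allowlist pattern."""
    return len(name) <= 128 and MODULE_NAME_RE.fullmatch(name) is not None


for candidate in ["cuda/11.8", "gcc/12.2.0", "cuda; rm -rf ~", "$(whoami)"]:
    print(f"{candidate!r}: {is_safe_module_name(candidate)}")
```

Using `fullmatch` rather than `search` matters here: a name is accepted only if every character is on the allowlist, so shell metacharacters cannot ride along after a valid prefix.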

Found a security issue? Please report it privately via GitHub's Security Advisory rather than opening a public issue.

License

Apache License 2.0 — see LICENSE for details.

Developed by [Suhas Goravale Siddaramu](https://github.com/OriAlpha)
