AI Copilot for Slurm GPU Clusters
A fully-offline, privacy-first CLI tool that helps ML engineers and HPC admins diagnose failed Slurm jobs, monitor GPU utilization, and optimize sbatch scripts — all without sending a single byte to the cloud.
- Why SlurmGenie?
- Features
- Quick Start
- Installation
- Usage
- Configuration
- How It Works
- Development
- Contributing
- Security
- License
Running ML workloads on HPC clusters is painful. When a job fails, you're left grepping through thousands of log lines to find a CUDA OOM error that could have been caught in seconds. When a job runs, you have no idea if it's actually using the GPUs you reserved.
SlurmGenie fixes this:
| Problem | SlurmGenie Solution |
|---|---|
| Job failed, no idea why | slurmgenie diagnose job <ID> — detects 50+ error patterns in seconds |
| GPU reserved but idle | slurmgenie monitor --idle — finds wasted allocations instantly |
| Bad sbatch config | slurmgenie suggest script job.sbatch — cluster-aware optimization |
| Can't use cloud AI on HPC | Runs a local 400MB LLM on the login node, fully offline |
| Fear of data leaks | Zero cloud calls. All processing stays on your cluster. |
| Feature | Description |
|---|---|
| 🔍 Job Diagnosis | Automatically analyze failed Slurm jobs. Pinpoints the exact line and explains the root cause. No LLM required. |
| 🧠 50+ Error Patterns | Recognizes CUDA OOM, NCCL timeouts, disk quota, node failures, GPU Xid errors, and more. |
| 📊 GPU Monitoring | Live GPU utilization, memory, temperature, and power via nvidia-smi. Detects idle allocations. |
| ⚡ sbatch Advisor | Rules-based + context-aware suggestions: AST analysis, job history profiling, module validation. Instant, no LLM. |
| 🤖 AI Explain | Add --explain to any diagnosis to get a human-readable root cause narrative from the LLM. |
| ✨ AI Script Fix | Add --ai to suggest to generate a corrected sbatch script. |
| 📝 Script Generation | slurmgenie generate "description" — natural language to sbatch script, validated by the rules engine. |
| 🔔 Slack Alerts | Block Kit-formatted failure notifications with root cause summaries. |
| 🔒 Privacy-First | Local-first by default. Cloud APIs (OpenAI, Anthropic) are opt-in only. |
| 🚀 GPU-Accelerated LLM | Auto-detects NVIDIA GPUs and offloads inference for 10-50x speedup. |
| 📄 Output to File | Save diagnosis results with --output/-o flag. |
# Install
pip install slurmgenie
# Check your cluster
slurmgenie status
# Diagnose a failed job
slurmgenie diagnose job 12345
# Diagnose with AI explanation
slurmgenie diagnose job 12345 --explain
# Save diagnosis output to file
slurmgenie diagnose job 12345 --output result.json
# Get optimization advice for your script
slurmgenie suggest script train.sbatch
# AI-generate a corrected script
slurmgenie suggest script train.sbatch --ai
# Generate a script from natural language
slurmgenie generate "Train a ViT with DDP on 2 nodes, 4 GPUs each, 24 hours"
Requirements: Python 3.10+, Slurm 21+
# Core features only — diagnosis, monitoring, advisor, Slack alerts (~5MB)
pip install slurmgenie
# With local AI explanations (adds llama-cpp-python + huggingface-hub)
# Downloads ~400MB GGUF model on first use of 'slurmgenie explain'
pip install "slurmgenie[llm]"
# With Slack SDK (only if you need advanced Slack features beyond webhooks)
pip install "slurmgenie[slack]"
# Everything
pip install "slurmgenie[llm,slack]"
# Development install from source
git clone https://github.com/OriAlpha/SlurmGenie.git
cd SlurmGenie
pip install -e ".[dev,llm,slack]"
What's included in each install option:
| Package | pip install slurmgenie | [llm] | [slack] | [dev] |
|---|---|---|---|---|
| typer, rich, pydantic, requests, pyyaml | ✅ | ✅ | ✅ | ✅ |
| llama-cpp-python | ❌ | ✅ | ❌ | ❌ |
| huggingface-hub | ❌ | ✅ | ❌ | ❌ |
| slack-sdk | ❌ | ❌ | ✅ | ❌ |
| pytest, ruff, pytest-cov | ❌ | ❌ | ❌ | ✅ |
Note
The Slack integration in the core install uses incoming webhooks (requests only — no Slack SDK needed).
Install [slack] only if you need the full Slack SDK (e.g., bot tokens, socket mode).
Many HPC clusters have no internet access on login nodes. SlurmGenie is designed for this:
Step 1: Build a wheel on an internet-connected machine
git clone https://github.com/OriAlpha/SlurmGenie.git
cd SlurmGenie
pip install build
python -m build --wheel
# Produces: dist/slurmgenie-X.Y.Z-py3-none-any.whl
Step 2: Transfer to your cluster
scp dist/slurmgenie-*.whl user@cluster.hpc.edu:~/
Step 3: Install on the login node
pip install slurmgenie-*.whl
# Or with LLM support (bundle llama-cpp-python separately).
# The shell does not expand a glob inside quotes, so resolve the filename first:
pip install "$(ls slurmgenie-*.whl)[llm]"
Tip
For the local LLM feature on air-gapped clusters, download the GGUF model file manually from HuggingFace on an internet-connected machine and set SLURMGENIE_LLM_MODEL_FILE to its local path.
# Diagnose by Slurm Job ID (auto-discovers log via scontrol)
slurmgenie diagnose job 12345
# Diagnose with AI explanation (requires [llm] install)
slurmgenie diagnose job 12345 --explain
# Diagnose the last 5 failed jobs
slurmgenie diagnose recent --count 5
# Bulk explain recent failures (slow — triggers LLM per job)
slurmgenie diagnose recent --count 3 --explain
# Diagnose a log file directly (works without Slurm in PATH)
slurmgenie diagnose log /path/to/slurm-12345.out
Example output:
╭──────────────────────────── SlurmGenie Diagnosis ─────────────────────────────╮
│ Job ID: 12345 │ Name: train_llama │ State: FAILED │ 13 lines analyzed │
│ │
│ ISSUES FOUND CRIT 3 critical WARN 0 warnings │
╰────────────────────────────────────────────────────────────────────────────────╯
1. CUDA Out of Memory (line 12)
GPU memory exhausted during tensor allocation.
→ Reduce batch size (e.g., halve it and retry)
→ Enable gradient checkpointing: model.gradient_checkpointing_enable()
→ Use mixed precision: --fp16 or torch.cuda.amp
Note
SlurmGenie first queries scontrol show job <ID> to find the StdOut log path automatically. No need to specify --log in most cases.
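As a rough illustration of the log auto-discovery described in the note above, the sketch below pulls the StdOut path out of `scontrol show job` text. The sample text and the function name `find_stdout_path` are assumptions made for this example; SlurmGenie's actual parser may differ.

```python
import re
from typing import Optional

# Hypothetical excerpt of `scontrol show job 12345` output (Key=Value pairs).
SAMPLE_SCONTROL_OUTPUT = """\
JobId=12345 JobName=train_llama
   JobState=FAILED Reason=NonZeroExitCode
   StdErr=/scratch/user/logs/slurm-12345.out
   StdIn=/dev/null
   StdOut=/scratch/user/logs/slurm-12345.out
"""


def find_stdout_path(scontrol_text: str) -> Optional[str]:
    """Return the StdOut path from scontrol output, or None if it is absent."""
    # StdOut sits on its own line, so capture everything up to the line end.
    match = re.search(r"^\s*StdOut=(.+)$", scontrol_text, flags=re.MULTILINE)
    return match.group(1).strip() if match else None


print(find_stdout_path(SAMPLE_SCONTROL_OUTPUT))
```

When the job is no longer known to scontrol (for example, after it has aged out of the controller's memory), a function like this returns None, which is the case where passing the log path explicitly is needed.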
By default, slurmgenie diagnose --explain uses Qwen2.5-0.5B-Instruct (compact, fast, works on CPU).
You can switch to any GGUF model from HuggingFace:
# Use a different model (auto-downloads on first use)
slurmgenie diagnose job 12345 --explain \
--llm-repo "bartowski/Llama-3.2-1B-Instruct-GGUF" \
--llm-model "Llama-3.2-1B-Instruct-Q4_K_M.gguf"
# Or just change the model file within the same repo
slurmgenie diagnose job 12345 --explain -m "qwen2.5-1.5b-instruct-q4_k_m.gguf"
| Flag | Default | Description |
|---|---|---|
| --llm-repo | Qwen/Qwen2.5-0.5B-Instruct-GGUF | HuggingFace repo ID containing GGUF files |
| --llm-model, -m | qwen2.5-0.5b-instruct-q4_k_m.gguf | Specific GGUF filename in the repo |
Models are cached in ~/.cache/huggingface/hub/ after first download.
# Live snapshot of all GPUs on the current node
slurmgenie monitor
# Show only idle / wasted GPU allocations
slurmgenie monitor --idle
# Continuously refresh every 5 seconds (like htop)
slurmgenie monitor --watch --interval 5
# Monitor resource usage for a specific running job
slurmgenie monitor --job 12345
slurmgenie suggest script train.sbatch
The advisor performs four layers of analysis in under a second:
- Rules Engine: Checks CPU-to-GPU ratio, missing time limits, pip install in scripts, etc.
- Dynamic Topology: Queries your cluster's actual node hardware to give partition-specific ratios.
- Python AST Analysis: Reads your Python script statically. Detects DistributedDataParallel or FSDP and warns if --ntasks-per-node is missing.
- Historical Profiling: Checks your last 5 jobs with the same name in sacct. If you requested 200GB but historically use 42GB, it tells you.
Example output:
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Issue ┃ Recommendation ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ [CRITICAL] DDP detected │ Add: #SBATCH --ntasks-per-node=4 │
│ [CRITICAL] Module missing │ 'cuda/11.8' not found. Use: module spider cuda │
│ [INFO] Memory history │ Historically uses ~42GB. Consider --mem=63G │
└───────────────────────────┴───────────────────────────────────────────────────────┘
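The Python AST layer can be approximated with the standard `ast` module. This is a minimal sketch under stated assumptions, not SlurmGenie's implementation: the helper names and the set of class names to look for (including `FullyShardedDataParallel`, the full PyTorch name behind FSDP) are chosen for illustration.

```python
import ast

# Class names that suggest multi-process training (assumed list for this sketch).
DISTRIBUTED_NAMES = {"DistributedDataParallel", "FSDP", "FullyShardedDataParallel"}


def uses_distributed_training(source: str) -> bool:
    """Statically check whether a training script references DDP or FSDP."""
    for node in ast.walk(ast.parse(source)):
        # Bare references such as DistributedDataParallel(model)
        if isinstance(node, ast.Name) and node.id in DISTRIBUTED_NAMES:
            return True
        # Dotted references such as torch.nn.parallel.DistributedDataParallel
        if isinstance(node, ast.Attribute) and node.attr in DISTRIBUTED_NAMES:
            return True
        # Imports, including aliased ones: from ... import DistributedDataParallel as DDP
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            if any(a.name.split(".")[-1] in DISTRIBUTED_NAMES for a in node.names):
                return True
    return False


def missing_ntasks_per_node(sbatch_text: str) -> bool:
    """True when no #SBATCH directive sets --ntasks-per-node."""
    return not any(
        line.strip().startswith("#SBATCH") and "--ntasks-per-node" in line
        for line in sbatch_text.splitlines()
    )


train_py = "from torch.nn.parallel import DistributedDataParallel as DDP\nmodel = DDP(model)\n"
sbatch = "#!/bin/bash\n#SBATCH --gres=gpu:4\npython train.py\n"
if uses_distributed_training(train_py) and missing_ntasks_per_node(sbatch):
    print("[CRITICAL] DDP detected: add #SBATCH --ntasks-per-node")
```

Because the script is only parsed and never imported or executed, the check works on a login node where PyTorch is not installed.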
# Test your Slack integration
slurmgenie notify test
# Watch for job failures and send alerts automatically (realtime mode)
slurmgenie notify watch
# Generate a daily digest of failures + at-risk jobs, then send to Slack
slurmgenie notify digest --days 1
The notify digest command is designed for cron: it runs once, collects everything from the last N days, and posts a single compact summary to Slack:
# Run once in the morning
crontab -e
# Add: 0 8 * * * /usr/local/bin/slurmgenie notify digest --days 1
# Options:
slurmgenie notify digest \
--days 1 \
--running-threshold 80 \
--min-severity CRITICAL \
--user webel1
# Preview locally without sending to Slack
slurmgenie notify digest --days 1 --dry-run
# Save to file in addition to Slack
slurmgenie notify digest --days 1 --output /tmp/slurmgenie-digest.txt
| Flag | Default | Description |
|---|---|---|
| --days | 1 | Look back N days for failures |
| --running-threshold | 80 | Flag RUNNING jobs exceeding this % of time limit |
| --min-severity | WARN | Only include failures with ≥ this severity (CRITICAL/WARNING/INFO) |
| --user | all | Filter to specific user |
| --dry-run | False | Preview locally; do NOT send to Slack |
| --output | none | Save digest text to a file |
What the digest includes:
- Failed jobs table: jobID, name, user, state, elapsed time, top detected error
- At-risk jobs table: RUNNING jobs near time limit or under memory pressure
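To make the --running-threshold check concrete, here is a small sketch of how "near time limit" could be computed from the elapsed and limit strings that squeue and sacct display (`[DD-]HH:MM:SS` or `MM:SS`). The function names are hypothetical and the logic is an assumption about the approach, not the tool's own code.

```python
def slurm_time_to_seconds(value: str) -> int:
    """Convert '1-02:03:04', '02:03:04', or '03:04' to seconds.

    Non-numeric limits such as UNLIMITED raise ValueError.
    """
    days, _, clock = value.rpartition("-")  # days is "" when there is no day part
    parts = [int(p) for p in clock.split(":")]
    if len(parts) == 2:  # MM:SS
        parts = [0] + parts
    if len(parts) != 3:
        raise ValueError(f"unsupported Slurm time format: {value!r}")
    hours, minutes, seconds = parts
    return int(days or 0) * 86400 + hours * 3600 + minutes * 60 + seconds


def is_at_risk(elapsed: str, time_limit: str, threshold_pct: float = 80.0) -> bool:
    """True when the job has used more than threshold_pct of its time limit."""
    limit = slurm_time_to_seconds(time_limit)
    return limit > 0 and 100.0 * slurm_time_to_seconds(elapsed) / limit > threshold_pct


# A job 20 hours into a 1-day limit has used about 83% of it.
print(is_at_risk("20:00:00", "1-00:00:00"))
```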
Configure your webhook:
export SLURMGENIE_SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T.../B.../...
All AI features work with four interchangeable backends:
| Backend | Setup | Best For |
|---|---|---|
| local (default) | pip install "slurmgenie[llm]" | Air-gapped clusters, zero config |
| ollama | Install Ollama, ollama pull llama3.1:8b | Best local quality, many model choices |
| openai | export SLURMGENIE_OPENAI_API_KEY=sk-... | Best reasoning quality |
| anthropic | export SLURMGENIE_ANTHROPIC_API_KEY=sk-ant-... | Best reasoning quality |
# Switch backends at any time
export SLURMGENIE_LLM_BACKEND=ollama # or: local, openai, anthropic
# Pattern matching + AI synthesis in one command
slurmgenie diagnose job 12345 --explain
slurmgenie diagnose log slurm-12345.out --explain
The --explain flag feeds the structured diagnosis (matched errors, resource usage, job metadata) into the LLM, which synthesizes a human-readable narrative: "Your job OOM'd because the batch size exceeds A100 memory. Try reducing from 64 to 32."
slurmgenie suggest script train.sbatch --ai
The rules engine finds issues → the LLM generates a corrected script → you get a clean, ready-to-use sbatch file.
# Natural language → sbatch script, validated by the rules engine
slurmgenie generate "Fine-tune LLaMA 3 with LoRA on 1 GPU, 48GB memory, 12 hours"
slurmgenie generate "DDP training on 4 nodes, 8 GPUs each" -o train.sbatch
Copy .env.example to .env in your working directory, or set environment variables directly:
# --- Slurm binary paths ---
export SLURMGENIE_SACCT_PATH=/usr/local/bin/sacct
export SLURMGENIE_SCONTROL_PATH=/usr/local/bin/scontrol
export SLURMGENIE_SQUEUE_PATH=/usr/local/bin/squeue
export SLURMGENIE_SSTAT_PATH=/usr/local/bin/sstat
export SLURMGENIE_NVIDIA_SMI_PATH=/usr/bin/nvidia-smi
# --- AI Backend (choose one) ---
export SLURMGENIE_LLM_BACKEND=local # local | ollama | openai | anthropic
export SLURMGENIE_LLM_MAX_TOKENS=512
export SLURMGENIE_LLM_CONTEXT_LENGTH=2048
# --- Ollama (if using ollama backend) ---
export SLURMGENIE_OLLAMA_BASE_URL=http://localhost:11434
export SLURMGENIE_OLLAMA_MODEL=llama3.1:8b
# --- OpenAI (if using openai backend) ---
export SLURMGENIE_OPENAI_API_KEY=sk-...
export SLURMGENIE_OPENAI_MODEL=gpt-4o-mini
# export SLURMGENIE_OPENAI_BASE_URL=... # for Azure OpenAI
# --- Anthropic (if using anthropic backend) ---
export SLURMGENIE_ANTHROPIC_API_KEY=sk-ant-...
export SLURMGENIE_ANTHROPIC_MODEL=claude-sonnet-4-20250514
# --- Slack ---
export SLURMGENIE_SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...
# --- GPU Monitoring ---
export SLURMGENIE_GPU_IDLE_THRESHOLD=10.0
# --- LLM Performance (local backend) ---
export SLURMGENIE_LLM_GPU_LAYERS=32 # 0 = CPU only, 32 = full GPU offload (auto-detected)
export SLURMGENIE_LLM_IDLE_TIMEOUT=300 # Auto-unload model after N seconds of idle time
The local LLM backend (llama-cpp-python) includes several optimizations:
| Optimization | Default | Impact |
|---|---|---|
| GPU offloading | Auto-detected via nvidia-smi | 10-50x speedup when GPU available |
| Thread count | Full CPU count | ~2x CPU inference speedup |
| Batch size | min(n_ctx, 512) | Better throughput |
| Provider caching | Singleton per config | No model reinitialization between calls |
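The defaults in the table can be read as a small parameter-selection step. The sketch below is an assumption about how such defaults might be derived; it uses the presence of nvidia-smi on PATH as a stand-in for GPU detection, and the function name `local_llm_params` is hypothetical. The returned keys match keyword arguments accepted by `llama_cpp.Llama`.

```python
import os
import shutil
from typing import Optional


def local_llm_params(
    n_ctx: int = 2048,
    gpu_layers: Optional[int] = None,
    has_gpu: Optional[bool] = None,
) -> dict:
    """Pick llama-cpp-python settings following the defaults in the table."""
    if has_gpu is None:
        # Assumption: treat an nvidia-smi binary on PATH as "a GPU is available".
        has_gpu = shutil.which("nvidia-smi") is not None
    if gpu_layers is None:
        # 32 = full offload, 0 = CPU only; an explicit value always wins.
        gpu_layers = 32 if has_gpu else 0
    return {
        "n_ctx": n_ctx,
        "n_gpu_layers": gpu_layers,
        "n_threads": os.cpu_count() or 1,  # full CPU count
        "n_batch": min(n_ctx, 512),
    }


print(local_llm_params(n_ctx=2048, has_gpu=False))
```

On a shared login node, using every core may be unwelcome, so a real deployment might cap the thread count below the full CPU count.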
Override GPU layers manually:
# Force CPU only (useful for debugging)
export SLURMGENIE_LLM_GPU_LAYERS=0
# Force full GPU offload
export SLURMGENIE_LLM_GPU_LAYERS=32
All SLURMGENIE_* variables can also be set in a .env file.
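For readers wondering how environment variables and a .env file might combine, the following sketch parses simple `KEY=value` lines and layers the real environment on top. The precedence rule (environment over .env), the fallback values, and the function names are assumptions for illustration; the project lists pydantic among its dependencies and may handle this differently.

```python
import os
from typing import Mapping

VALID_BACKENDS = ("local", "ollama", "openai", "anthropic")


def parse_dotenv(text: str) -> dict:
    """Parse KEY=value lines; supports an 'export ' prefix and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("export "):
            line = line[len("export "):]
        key, sep, value = line.partition("=")
        if sep:
            # Drop a trailing inline comment and surrounding quotes.
            values[key.strip()] = value.split(" #", 1)[0].strip().strip("'\"")
    return values


def read_settings(dotenv_text: str = "", env: Mapping[str, str] = os.environ) -> dict:
    """Merge .env values with the environment (environment wins, by assumption)."""
    merged = {**parse_dotenv(dotenv_text), **env}
    backend = merged.get("SLURMGENIE_LLM_BACKEND", "local").lower()
    if backend not in VALID_BACKENDS:
        raise ValueError(f"unknown backend {backend!r}; expected one of {VALID_BACKENDS}")
    return {
        "llm_backend": backend,
        # Fallbacks mirror the example values shown above, not confirmed defaults.
        "llm_max_tokens": int(merged.get("SLURMGENIE_LLM_MAX_TOKENS", "512")),
        "gpu_idle_threshold": float(merged.get("SLURMGENIE_GPU_IDLE_THRESHOLD", "10.0")),
    }


print(read_settings("export SLURMGENIE_LLM_BACKEND=ollama  # local model server\n", env={}))
```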
slurmgenie diagnose job 12345
│
├─ scontrol show job 12345 ──▶ StdOut=/scratch/user/logs/slurm-12345.out
│ (no --log flag needed)
├─ sacct -j 12345 ──▶ State=FAILED, ExitCode=1:0, MaxRSS=42G
│
├─ Read log file ──▶ Regex pattern matching (50+ patterns)
│
├─ PatternMatcher ──▶ CUDA OOM detected on line 12
│
└─ Rich CLI output ──▶ Root cause + actionable suggestions
All subprocess calls use shell=False to prevent injection. The tool never modifies cluster state — it is purely read-only.
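The log-scanning step in the flow above amounts to running a table of regular expressions over each line and recording where they hit. The sketch below shows the idea with two patterns; the pattern table, the `Finding` class, and `scan_log` are illustrative names, not the project's PatternMatcher.

```python
import re
from dataclasses import dataclass


@dataclass
class Finding:
    name: str
    line_number: int
    text: str


# Two example patterns; the real tool advertises 50+.
PATTERNS = {
    "CUDA Out of Memory": re.compile(r"CUDA out of memory", re.IGNORECASE),
    "Disk quota exceeded": re.compile(r"Disk quota exceeded", re.IGNORECASE),
}


def scan_log(lines) -> list:
    """Return one Finding per (line, pattern) match, with 1-based line numbers."""
    findings = []
    for number, line in enumerate(lines, start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append(Finding(name, number, line.rstrip()))
    return findings


sample_log = [
    "epoch 1 step 10 loss 2.31",
    "torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB",
    "OSError: [Errno 122] Disk quota exceeded",
]
for finding in scan_log(sample_log):
    print(f"{finding.name} (line {finding.line_number})")
```

Since nothing here calls an LLM, this kind of matching stays fast and deterministic, which fits the "No LLM required" claim for plain diagnosis.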
See CONTRIBUTING.md for the full development guide including setup, testing, linting, and contribution workflows.
git clone https://github.com/OriAlpha/SlurmGenie.git
cd SlurmGenie
pip install -e ".[dev,llm]"
pytest
ruff check src/
# Install the pre-commit hooks
pip install pre-commit && pre-commit install
# Run the LLM benchmarks
python benchmarks/llm_benchmark.py
python benchmarks/llm_benchmark.py --gpu-layers 32
See CONTRIBUTING.md for the full guide. Quick summary:
- Fork and create a feature branch
- Write tests for new functionality
- Run pytest and ruff check src/
- Open a PR with a clear description
- No shell injection: All subprocess calls use shell=False. Module names are validated with a strict allowlist regex before being passed to bash -c.
- No network calls: The tool makes zero outbound connections unless you explicitly configure a Slack webhook or invoke slurmgenie explain, which downloads the GGUF model from HuggingFace.
- Read-only: SlurmGenie never submits, modifies, or cancels jobs.
- No credential storage: API keys/webhooks are loaded from environment variables or .env files only, and are never written to disk by SlurmGenie.
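The first point can be illustrated with a short sketch: commands are passed as an argument list with shell=False, and any module name is checked against an allowlist pattern before use. The exact regex SlurmGenie applies is not shown in this README, so the pattern and function names below are assumptions.

```python
import re
import subprocess
import sys

# Assumed allowlist: name/version segments of letters, digits, and . _ + -
MODULE_NAME_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9._+\-]*(/[A-Za-z0-9][A-Za-z0-9._+\-]*)*")


def is_safe_module_name(name: str) -> bool:
    """Accept names like 'cuda/11.8'; reject anything with shell metacharacters."""
    return MODULE_NAME_RE.fullmatch(name) is not None


def run_readonly(argv: list[str]) -> str:
    """Run a command from an argument list; no shell ever parses the arguments."""
    result = subprocess.run(argv, shell=False, capture_output=True, text=True, check=True)
    return result.stdout


print(is_safe_module_name("cuda/11.8"))        # True
print(is_safe_module_name("cuda; rm -rf ~"))   # False
```

Requiring the first character of each segment to be alphanumeric also rejects values that start with a dash, which could otherwise be mistaken for command-line options.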
Found a security issue? Please report it privately via GitHub's Security Advisory rather than opening a public issue.
Apache License 2.0 — see LICENSE for details.