SlurmGenie

AI Copilot for Slurm GPU Clusters

A fully offline, privacy-first CLI tool that helps ML engineers and HPC admins diagnose failed Slurm jobs, monitor GPU utilization, and optimize sbatch scripts, all without sending a single byte to the cloud.


Why SlurmGenie?

Running ML workloads on HPC clusters is painful. When a job fails, you're left grepping through thousands of log lines to find a CUDA OOM error that could have been caught in seconds. When a job runs, you have no idea if it's actually using the GPUs you reserved.

SlurmGenie fixes this:

| Problem | SlurmGenie Solution |
| --- | --- |
| Job failed, no idea why | slurmgenie diagnose job <ID>: detects 50+ error patterns in seconds |
| GPU reserved but idle | slurmgenie monitor --idle: finds wasted allocations instantly |
| Bad sbatch config | slurmgenie suggest script job.sbatch: cluster-aware optimization |
| Can't use cloud AI on HPC | Runs a local 400MB LLM on the login node, fully offline |
| Fear of data leaks | Zero cloud calls by default. All processing stays on your cluster. |

Features

| Feature | Description |
| --- | --- |
| 🔍 Job Diagnosis | Automatically analyzes failed Slurm jobs, pinpoints the exact line, and explains the root cause. No LLM required. |
| 🧠 50+ Error Patterns | Recognizes CUDA OOM, NCCL timeouts, disk quota, node failures, GPU Xid errors, and more. |
| 📊 GPU Monitoring | Live GPU utilization, memory, temperature, and power via nvidia-smi. Detects idle allocations. |
| sbatch Advisor | Rules-based and context-aware suggestions: AST analysis, job history profiling, module validation. Instant, no LLM. |
| 🤖 AI Explain | Add --explain to any diagnosis to get a human-readable root cause narrative from the LLM. |
| AI Script Fix | Add --ai to the suggest command to generate a corrected sbatch script. |
| 📝 Script Generation | slurmgenie generate "description": natural language to sbatch script, validated by the rules engine. |
| 🔔 Slack Alerts | Block Kit-formatted failure notifications with root cause summaries. |
| 🔒 Privacy-First | Local-first by default. Cloud APIs (OpenAI, Anthropic) are opt-in only. |
| 🚀 GPU-Accelerated LLM | Auto-detects NVIDIA GPUs and offloads inference for 10-50x speedup. |
| 📄 Output to File | Save diagnosis results with the --output/-o flag. |

Quick Start

# Install
pip install slurmgenie

# Check your cluster
slurmgenie status

# Diagnose a failed job
slurmgenie diagnose job 12345

# Diagnose with AI explanation
slurmgenie diagnose job 12345 --explain

# Save diagnosis output to file
slurmgenie diagnose job 12345 --output result.json

# Get optimization advice for your script
slurmgenie suggest script train.sbatch

# AI-generate a corrected script
slurmgenie suggest script train.sbatch --ai

# Generate a script from natural language
slurmgenie generate "Train a ViT with DDP on 2 nodes, 4 GPUs each, 24 hours"

Installation

Standard Installation

Requirements: Python 3.10+, Slurm 21+

# Core features only — diagnosis, monitoring, advisor, Slack alerts (~5MB)
pip install slurmgenie

# With local AI explanations (adds llama-cpp-python + huggingface-hub)
# Downloads ~400MB GGUF model on first use of --explain
pip install "slurmgenie[llm]"

# With Slack SDK (only if you need advanced Slack features beyond webhooks)
pip install "slurmgenie[slack]"

# Everything
pip install "slurmgenie[llm,slack]"

# Development install from source
git clone https://github.com/OriAlpha/SlurmGenie.git
cd SlurmGenie
pip install -e ".[dev,llm,slack]"

What's included in each install option:

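| Package | pip install slurmgenie | [llm] | [slack] | [dev] |
| --- | --- | --- | --- | --- |
| typer, rich, pydantic, requests, pyyaml | ✓ | ✓ | ✓ | ✓ |
| llama-cpp-python | | ✓ | | |
| huggingface-hub | | ✓ | | |
| slack-sdk | | | ✓ | |
| pytest, ruff, pytest-cov | | | | ✓ |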

Note

The Slack integration in the core install uses incoming webhooks (requests only — no Slack SDK needed). Install [slack] only if you need the full Slack SDK (e.g., bot tokens, socket mode).

Air-Gapped / Offline Installation

Many HPC clusters have no internet access on login nodes. SlurmGenie is designed for this:

Step 1: Build a wheel on an internet-connected machine

git clone https://github.com/OriAlpha/SlurmGenie.git
cd SlurmGenie
pip install build
python -m build --wheel
# Produces: dist/slurmgenie-X.Y.Z-py3-none-any.whl

Step 2: Transfer to your cluster

scp dist/slurmgenie-*.whl user@cluster.hpc.edu:~/

Step 3: Install on the login node

pip install slurmgenie-*.whl
# Or with LLM support (bundle llama-cpp-python separately).
# Spell out the filename here: the shell does not expand a glob inside quotes.
pip install "./slurmgenie-X.Y.Z-py3-none-any.whl[llm]"

Tip

For the local LLM feature on air-gapped clusters, download the GGUF model file manually from HuggingFace on an internet-connected machine and set SLURMGENIE_LLM_MODEL_FILE to its local path.


Usage

Diagnosing Failed Jobs

# Diagnose by Slurm Job ID (auto-discovers log via scontrol)
slurmgenie diagnose job 12345

# Diagnose with AI explanation (requires [llm] install)
slurmgenie diagnose job 12345 --explain

# Diagnose the last 5 failed jobs
slurmgenie diagnose recent --count 5

# Bulk explain recent failures (slow — triggers LLM per job)
slurmgenie diagnose recent --count 3 --explain

# Diagnose a log file directly (works without Slurm in PATH)
slurmgenie diagnose log /path/to/slurm-12345.out

Example output:

╭──────────────────────────── SlurmGenie Diagnosis ─────────────────────────────╮
│  Job ID: 12345  │  Name: train_llama  │  State: FAILED  │  13 lines analyzed  │
│                                                                                │
│  ISSUES FOUND   CRIT 3 critical   WARN 0 warnings                              │
╰────────────────────────────────────────────────────────────────────────────────╯

1. CUDA Out of Memory (line 12)
   GPU memory exhausted during tensor allocation.
   → Reduce batch size (e.g., halve it and retry)
   → Enable gradient checkpointing: model.gradient_checkpointing_enable()
   → Use mixed precision: --fp16 or torch.cuda.amp

Note

SlurmGenie first queries scontrol show job <ID> to find the StdOut log path automatically. No need to specify --log in most cases.

Model Selection (Local LLM)

By default, slurmgenie diagnose --explain uses Qwen2.5-0.5B-Instruct (compact, fast, works on CPU). You can switch to any GGUF model from HuggingFace:

# Use a different model (auto-downloads on first use)
slurmgenie diagnose job 12345 --explain \
  --llm-repo "bartowski/Llama-3.2-1B-Instruct-GGUF" \
  --llm-model "Llama-3.2-1B-Instruct-Q4_K_M.gguf"

# Or just change the model file within the same repo
slurmgenie diagnose job 12345 --explain -m "qwen2.5-1.5b-instruct-q4_k_m.gguf"

| Flag | Default | Description |
| --- | --- | --- |
| --llm-repo | Qwen/Qwen2.5-0.5B-Instruct-GGUF | HuggingFace repo ID containing GGUF files |
| --llm-model, -m | qwen2.5-0.5b-instruct-q4_k_m.gguf | Specific GGUF filename in the repo |

Models are cached in ~/.cache/huggingface/hub/ after first download.

Monitoring GPU Utilization

# Live snapshot of all GPUs on the current node
slurmgenie monitor

# Show only idle / wasted GPU allocations
slurmgenie monitor --idle

# Continuously refresh every 5 seconds (like htop)
slurmgenie monitor --watch --interval 5

# Monitor resource usage for a specific running job
slurmgenie monitor --job 12345
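
As a rough illustration of idle detection, the sketch below parses CSV text of the kind produced by nvidia-smi --query-gpu=index,utilization.gpu,memory.used,memory.total --format=csv,noheader,nounits. The function name idle_gpus and the strict "below threshold" comparison are assumptions; the 10.0 default mirrors SLURMGENIE_GPU_IDLE_THRESHOLD from the Configuration section.

```python
import csv
import io

# Field order must match the --query-gpu list passed to nvidia-smi.
FIELDS = ["index", "utilization.gpu", "memory.used", "memory.total"]


def idle_gpus(csv_text: str, threshold: float = 10.0) -> list[int]:
    """Return indices of GPUs whose utilization (%) is below the threshold."""
    idle = []
    for row in csv.reader(io.StringIO(csv_text), skipinitialspace=True):
        if len(row) != len(FIELDS):
            continue  # skip blank or malformed lines
        gpu = dict(zip(FIELDS, row))
        if float(gpu["utilization.gpu"]) < threshold:
            idle.append(int(gpu["index"]))
    return idle


# Illustrative sample: GPU 0 is busy, GPU 1 is allocated but idle.
sample = "0, 97, 71234, 81920\n1, 0, 3, 81920\n"
print(idle_gpus(sample))  # → [1]
```

A production version would also need to handle fields that nvidia-smi reports as unavailable, which this sketch does not.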

Optimizing sbatch Scripts

slurmgenie suggest script train.sbatch

The advisor performs 4 layers of analysis in under a second:

  1. Rules Engine — Checks CPU-to-GPU ratio, missing time limits, pip install in scripts, etc.
  2. Dynamic Topology — Queries your cluster's actual node hardware to give partition-specific ratios.
  3. Python AST Analysis — Reads your Python script statically. Detects DistributedDataParallel or FSDP and warns if --ntasks-per-node is missing.
  4. Historical Profiling — Checks your last 5 jobs with the same name in sacct. If you requested 200GB but historically use 42GB, it tells you.
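
The static check in layer 3 can be sketched with Python's standard ast module. This is a simplified illustration under assumed names (uses_distributed_training, has_ntasks_per_node), not the advisor's real implementation; it only looks for the wrapper class names and never imports or runs the training script.

```python
import ast

# Class names treated as evidence of multi-process training (assumed list).
DISTRIBUTED_NAMES = {"DistributedDataParallel", "FullyShardedDataParallel", "FSDP"}


def uses_distributed_training(source: str) -> bool:
    """Return True if the source refers to a DDP/FSDP wrapper by name."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name) and node.id in DISTRIBUTED_NAMES:
            return True
        if isinstance(node, ast.Attribute) and node.attr in DISTRIBUTED_NAMES:
            return True
        if isinstance(node, ast.ImportFrom) and any(
            alias.name in DISTRIBUTED_NAMES for alias in node.names
        ):
            return True
    return False


def has_ntasks_per_node(sbatch_text: str) -> bool:
    """Return True if any #SBATCH line sets --ntasks-per-node."""
    return any(
        line.strip().startswith("#SBATCH") and "--ntasks-per-node" in line
        for line in sbatch_text.splitlines()
    )


train_py = (
    "from torch.nn.parallel import DistributedDataParallel as DDP\n"
    "model = DDP(model)\n"
)
script = "#!/bin/bash\n#SBATCH --gres=gpu:4\nsrun python train.py\n"

if uses_distributed_training(train_py) and not has_ntasks_per_node(script):
    print("DDP detected but --ntasks-per-node is missing")
```

Because ast.parse only parses, the check works on a login node where torch is not installed.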

Example output:

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Issue                     ┃ Recommendation                                        ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ [CRITICAL] DDP detected   │ Add: #SBATCH --ntasks-per-node=4                      │
│ [CRITICAL] Module missing │ 'cuda/11.8' not found. Use: module spider cuda        │
│ [INFO] Memory history     │ Historically uses ~42GB. Consider --mem=63G           │
└───────────────────────────┴───────────────────────────────────────────────────────┘
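
The memory recommendation in the example is consistent with taking the historical peak and adding 50% headroom (42GB × 1.5 = 63GB). That factor is inferred from the example output, not documented, so treat the sketch below and its suggest_mem_gb helper as an assumption.

```python
import math


def suggest_mem_gb(peak_usage_gb: list[float], headroom: float = 1.5) -> int:
    """Suggest a --mem value in GB from the peak usage of past runs."""
    if not peak_usage_gb:
        raise ValueError("need at least one historical job")
    # Use the worst observed peak, then add headroom and round up.
    return math.ceil(max(peak_usage_gb) * headroom)


# Illustrative MaxRSS values (GB) for the last five jobs with the same name.
history = [38.0, 41.5, 42.0, 40.2, 39.9]
print(f"--mem={suggest_mem_gb(history)}G")  # → --mem=63G
```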

Slack Notifications

# Test your Slack integration
slurmgenie notify test

# Watch for job failures and send alerts automatically (realtime mode)
slurmgenie notify watch

# Generate a daily digest of failures + at-risk jobs, then send to Slack
slurmgenie notify digest --days 1

Daily Digest (Recommended for Teams)

The notify digest command is designed for cron — it runs once, collects everything from the last N days, and posts a single compact summary to Slack:

# Run once in the morning
crontab -e
# Add:  0 8 * * * /usr/local/bin/slurmgenie notify digest --days 1

# Options:
slurmgenie notify digest \
  --days 1 \
  --running-threshold 80 \
  --min-severity CRITICAL \
  --user webel1

# Preview locally without sending to Slack
slurmgenie notify digest --days 1 --dry-run

# Save to file in addition to Slack
slurmgenie notify digest --days 1 --output /tmp/slurmgenie-digest.txt

| Flag | Default | Description |
| --- | --- | --- |
| --days | 1 | Look back N days for failures |
| --running-threshold | 80 | Flag RUNNING jobs exceeding this % of time limit |
| --min-severity | WARN | Only include failures with ≥ this severity (CRITICAL/WARNING/INFO) |
| --user | all | Filter to specific user |
| --dry-run | False | Preview locally; do NOT send to Slack |
| --output | | Save digest text to a file |

What the digest includes:

  • Failed jobs table: jobID, name, user, state, elapsed time, top detected error
  • At-risk jobs table: RUNNING jobs near time limit or under memory pressure
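
The time-limit half of the at-risk check can be sketched as follows, using Slurm's [D-]HH:MM:SS time format. The helper names and the choice to skip UNLIMITED jobs are assumptions for illustration; the 80 default mirrors --running-threshold.

```python
def slurm_time_to_seconds(text: str) -> int:
    """Convert Slurm's [D-]HH:MM:SS (or MM:SS) notation to seconds."""
    days, _, clock = text.rpartition("-")
    parts = [int(p) for p in clock.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0)  # pad missing hours (and minutes) with zero
    hours, minutes, seconds = parts
    return (int(days) if days else 0) * 86400 + hours * 3600 + minutes * 60 + seconds


def is_at_risk(elapsed: str, time_limit: str, threshold_pct: float = 80.0) -> bool:
    """Return True if a running job has used more than threshold_pct of its limit."""
    if time_limit.upper() == "UNLIMITED":
        return False
    limit = slurm_time_to_seconds(time_limit)
    return limit > 0 and 100.0 * slurm_time_to_seconds(elapsed) / limit > threshold_pct


print(is_at_risk("20:00:00", "1-00:00:00"))  # 20h of a 24h limit
print(is_at_risk("10:00:00", "1-00:00:00"))  # 10h of a 24h limit
```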

Configure your webhook:

export SLURMGENIE_SLACK_WEBHOOK_URL=https://hooks.slack.com/services/T.../B.../...
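
For orientation, a Block Kit message sent through an incoming webhook is a JSON body with a fallback text and a list of blocks. The sketch below uses the standard library's urllib in place of requests to stay dependency-free; build_failure_payload and post_to_slack are hypothetical names, and the layout is an assumption rather than SlurmGenie's actual message format.

```python
import json
import urllib.request


def build_failure_payload(job_id: str, job_name: str, root_cause: str) -> dict:
    """Build a small Block Kit payload describing one failed job."""
    return {
        # Plain-text fallback shown in notifications.
        "text": f"Slurm job {job_id} ({job_name}) failed: {root_cause}",
        "blocks": [
            {"type": "header",
             "text": {"type": "plain_text", "text": f"Job {job_id} failed"}},
            {"type": "section",
             "text": {"type": "mrkdwn", "text": f"*{job_name}*\n{root_cause}"}},
        ],
    }


def post_to_slack(webhook_url: str, payload: dict) -> int:
    """POST the payload to an incoming webhook and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


payload = build_failure_payload("12345", "train_llama", "CUDA out of memory")
print(json.dumps(payload, indent=2))
```

Only the payload is built here; post_to_slack would be called with the URL from SLURMGENIE_SLACK_WEBHOOK_URL.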

AI-Powered Features

All AI features work with 4 interchangeable backends:

| Backend | Setup | Best For |
| --- | --- | --- |
| local (default) | pip install "slurmgenie[llm]" | Air-gapped clusters, zero config |
| ollama | Install Ollama, then ollama pull llama3.1:8b | Best local quality, many model choices |
| openai | export SLURMGENIE_OPENAI_API_KEY=sk-... | Best reasoning quality |
| anthropic | export SLURMGENIE_ANTHROPIC_API_KEY=sk-ant-... | Best reasoning quality |

# Switch backends at any time
export SLURMGENIE_LLM_BACKEND=ollama    # or: local, openai, anthropic

Diagnose with AI Explanation (--explain)

# Pattern matching + AI synthesis in one command
slurmgenie diagnose job 12345 --explain
slurmgenie diagnose log slurm-12345.out --explain

The --explain flag feeds the structured diagnosis (matched errors, resource usage, job metadata) into the LLM, which synthesizes a human-readable narrative: "Your job OOM'd because the batch size exceeds A100 memory. Try reducing from 64 to 32."

AI Script Correction (--ai)

slurmgenie suggest script train.sbatch --ai

The rules engine finds issues → the LLM generates a corrected script → you get a clean, ready-to-use sbatch file.
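
The rules-engine step that precedes the LLM can be pictured as a set of cheap text checks on the script. The sketch below covers two of the checks named in this README (missing time limit, pip install inside the job); the function name, regexes, and message wording are illustrative assumptions.

```python
import re

# Matches --time=..., --time ..., -t 1:00:00 or -t1:00:00 inside a directive.
TIME_RE = re.compile(r"(?:^|\s)(?:--time[=\s]|-t\s*\d)")
PIP_RE = re.compile(r"^\s*(?:python3?\s+-m\s+)?pip3?\s+install\b")


def check_sbatch(script: str) -> list[str]:
    """Return human-readable findings for an sbatch script."""
    lines = script.splitlines()
    directives = [
        line.strip()[len("#SBATCH"):].strip()
        for line in lines
        if line.strip().startswith("#SBATCH")
    ]
    findings = []
    if not any(TIME_RE.search(d) for d in directives):
        findings.append("missing time limit: add #SBATCH --time=<limit>")
    for number, line in enumerate(lines, start=1):
        if PIP_RE.match(line):
            findings.append(f"line {number}: pip install inside job script")
    return findings


script = "#!/bin/bash\n#SBATCH --gres=gpu:4\npip install torch\nsrun python train.py\n"
for finding in check_sbatch(script):
    print(finding)
```

Findings like these are what would be handed to the LLM as context for the corrected script.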

Script Generation (slurmgenie generate)

# Natural language → sbatch script, validated by the rules engine
slurmgenie generate "Fine-tune LLaMA 3 with LoRA on 1 GPU, 48GB memory, 12 hours"
slurmgenie generate "DDP training on 4 nodes, 8 GPUs each" -o train.sbatch

Configuration

Copy .env.example to .env in your working directory, or set environment variables directly:

# --- Slurm binary paths ---
export SLURMGENIE_SACCT_PATH=/usr/local/bin/sacct
export SLURMGENIE_SCONTROL_PATH=/usr/local/bin/scontrol
export SLURMGENIE_SQUEUE_PATH=/usr/local/bin/squeue
export SLURMGENIE_SSTAT_PATH=/usr/local/bin/sstat
export SLURMGENIE_NVIDIA_SMI_PATH=/usr/bin/nvidia-smi

# --- AI Backend (choose one) ---
export SLURMGENIE_LLM_BACKEND=local          # local | ollama | openai | anthropic
export SLURMGENIE_LLM_MAX_TOKENS=512
export SLURMGENIE_LLM_CONTEXT_LENGTH=2048

# --- Ollama (if using ollama backend) ---
export SLURMGENIE_OLLAMA_BASE_URL=http://localhost:11434
export SLURMGENIE_OLLAMA_MODEL=llama3.1:8b

# --- OpenAI (if using openai backend) ---
export SLURMGENIE_OPENAI_API_KEY=sk-...
export SLURMGENIE_OPENAI_MODEL=gpt-4o-mini
# export SLURMGENIE_OPENAI_BASE_URL=...     # for Azure OpenAI

# --- Anthropic (if using anthropic backend) ---
export SLURMGENIE_ANTHROPIC_API_KEY=sk-ant-...
export SLURMGENIE_ANTHROPIC_MODEL=claude-sonnet-4-20250514

# --- Slack ---
export SLURMGENIE_SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...

# --- GPU Monitoring ---
export SLURMGENIE_GPU_IDLE_THRESHOLD=10.0

# --- LLM Performance (local backend) ---
export SLURMGENIE_LLM_GPU_LAYERS=32        # 0 = CPU only, 32 = full GPU offload (auto-detected)
export SLURMGENIE_LLM_IDLE_TIMEOUT=300      # Auto-unload model after N seconds of idle

Local LLM Performance

The local LLM backend (llama-cpp-python) includes several optimizations:

| Optimization | Default | Impact |
| --- | --- | --- |
| GPU offloading | Auto-detected via nvidia-smi | 10-50x speedup when GPU available |
| Thread count | Full CPU count | ~2x CPU inference speedup |
| Batch size | min(n_ctx, 512) | Better throughput |
| Provider caching | Singleton per config | No model reinitialization between calls |

Override GPU layers manually:

# Force CPU only (useful for debugging)
export SLURMGENIE_LLM_GPU_LAYERS=0

# Force full GPU offload
export SLURMGENIE_LLM_GPU_LAYERS=32

All SLURMGENIE_* variables can also be set in a .env file.


How It Works

slurmgenie diagnose job 12345
        │
        ├─ scontrol show job 12345  ──▶  StdOut=/scratch/user/logs/slurm-12345.out
        │                                (no --log flag needed)
        ├─ sacct -j 12345           ──▶  State=FAILED, ExitCode=1:0, MaxRSS=42G
        │
        ├─ Read log file            ──▶  Regex pattern matching (50+ patterns)
        │
        ├─ PatternMatcher           ──▶  CUDA OOM detected on line 12
        │
        └─ Rich CLI output          ──▶  Root cause + actionable suggestions

All subprocess calls use shell=False to prevent injection. The tool never modifies cluster state — it is purely read-only.
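
The pattern-matching stage of the pipeline above can be sketched in a few lines. The two regexes, the Finding dataclass, and match_log are illustrative assumptions standing in for the real PatternMatcher and its 50+ patterns.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    name: str
    line_number: int
    line: str


# Two illustrative patterns; the real catalogue is much larger.
PATTERNS = {
    "CUDA Out of Memory": re.compile(r"CUDA out of memory", re.IGNORECASE),
    "Disk Quota Exceeded": re.compile(r"Disk quota exceeded", re.IGNORECASE),
}


def match_log(text: str) -> list[Finding]:
    """Scan a log line by line and record every pattern hit with its line number."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append(Finding(name, number, line.strip()))
    return findings


log = (
    "Epoch 1/10\n"
    "loss=2.31\n"
    "torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB\n"
)
for finding in match_log(log):
    print(f"{finding.name} (line {finding.line_number})")
```

Keeping the line number with each hit is what lets the report point at the exact line, as in the example diagnosis.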


Development

See CONTRIBUTING.md for the full development guide including setup, testing, linting, and contribution workflows.

Quick Start

git clone https://github.com/OriAlpha/SlurmGenie.git
cd SlurmGenie
pip install -e ".[dev,llm]"
pytest
ruff check src/

Pre-commit Hooks

pip install pre-commit && pre-commit install

LLM Benchmarks

python benchmarks/llm_benchmark.py
python benchmarks/llm_benchmark.py --gpu-layers 32

Contributing

See CONTRIBUTING.md for the full guide. Quick summary:

  1. Fork and create a feature branch
  2. Write tests for new functionality
  3. Run pytest and ruff check src/
  4. Open a PR with a clear description

Security

  • No shell injection: All subprocess calls use shell=False. Module names are validated with a strict allowlist regex before being passed to bash -c.
  • No network calls by default: The tool makes no outbound connections unless you configure a Slack webhook, select a cloud backend (openai or anthropic), or use --explain with the local backend, which downloads the GGUF model from HuggingFace on first use.
  • Read-only: SlurmGenie never submits, modifies, or cancels jobs.
  • No credential storage: API keys/webhooks are loaded from environment variables or .env files only — never written to disk by SlurmGenie.
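
To illustrate the first point, here is a hedged sketch of allowlist validation combined with an argument-list subprocess call. The regex, validate_module_name, and run_read_only are assumptions and not the project's actual code; the point is that untrusted text is checked in full before use and that no shell ever parses the command line.

```python
import re
import subprocess
import sys

# Allow names like "cuda" or "cuda/11.8"; reject spaces, quotes, ';', '$', etc.
MODULE_NAME_RE = re.compile(
    r"[A-Za-z0-9][A-Za-z0-9._+-]*(?:/[A-Za-z0-9][A-Za-z0-9._+-]*)?"
)


def validate_module_name(name: str) -> str:
    """Return the name unchanged if it passes the allowlist, else raise."""
    if not MODULE_NAME_RE.fullmatch(name):
        raise ValueError(f"rejected module name: {name!r}")
    return name


def run_read_only(argv: list[str]) -> str:
    """Run a command from an argument list, never through a shell."""
    result = subprocess.run(
        argv, shell=False, capture_output=True, text=True, timeout=30, check=False
    )
    return result.stdout


print(validate_module_name("cuda/11.8"))
# The Python interpreter stands in for a Slurm binary so the example runs anywhere.
print(run_read_only([sys.executable, "-c", "print('ok')"]).strip())
```

Using fullmatch rather than search matters here: a partial match would let a valid prefix smuggle in trailing shell syntax.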

Found a security issue? Please report it privately via GitHub's Security Advisory rather than opening a public issue.


License

Apache License 2.0 — see LICENSE for details.


Developed by [Suhas Goravale Siddaramu](https://github.com/OriAlpha)
