A minimal Python tool to run parameter-grid experiments on a Slurm cluster with persistent SSH, log streaming, and simple YAML configs.
pip install slurmster
- CLI with subcommands: submit, monitor, status, fetch, cancel, gui
- YAML config (explicitly provided via --config)
- Persistent SSH connection for low latency
- Per-run working directories on the remote side
- Automatic log redirection to stdout.log inside each run directory
- Live log streaming (and re-attach later)
- Local workspace to track runs and "fetched" state
- Cancel jobs from local machine
- Web-based GUI for easy management
All commands follow this pattern:
slurmster --config <config.yaml> --user <username> --host <hostname> [options] <command>
Submit experiments:
slurmster --config config.yaml --user myuser --host myhost submit
Monitor logs:
# Monitor by job ID:
slurmster --config config.yaml --user myuser --host myhost monitor --job 1234567
Check status:
slurmster --config config.yaml --user myuser --host myhost status
Fetch completed runs:
# Fetch all completed runs:
slurmster --config config.yaml --user myuser --host myhost fetch
# Or fetch a specific job:
slurmster --config config.yaml --user myuser --host myhost fetch --job 1234567
Cancel jobs:
# Cancel specific job:
slurmster --config config.yaml --user myuser --host myhost cancel --job 1234567
# or cancel all:
slurmster --config config.yaml --user myuser --host myhost cancel --all
SSH connection options:
--password-env ENV_VAR: Use password from environment variable
--key /path/to/key: Use SSH key file instead of password
--port 22: Specify SSH port (default: 22)
For submit:
--no-monitor: Don't automatically start monitoring after submission
For monitor:
--from-start: Stream from the beginning instead of the last 100 lines
--lines N: Number of trailing lines when attaching (default: 100)
For status:
--all: Show all runs (default: only non-fetched)
For fetch:
--job <job_id>: Only fetch a specific job by ID
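When slurmster is driven from a script, it can help to assemble the argument list in one place. The following is a minimal sketch, not part of slurmster itself: `build_command` is a hypothetical helper, and it assumes the connection options go before the subcommand and the subcommand's own options after it, as the pattern and examples above suggest.

```python
import shlex


def build_command(config, user, host, command, *command_args,
                  key=None, port=None, password_env=None):
    """Assemble a slurmster argument list following the documented pattern."""
    argv = ["slurmster", "--config", config, "--user", user, "--host", host]
    # Connection options are placed before the subcommand (assumed ordering).
    if key is not None:
        argv += ["--key", key]
    if port is not None:
        argv += ["--port", str(port)]
    if password_env is not None:
        argv += ["--password-env", password_env]
    # Subcommand first, then its own options (e.g. "--job", "1234567").
    argv.append(command)
    argv += [str(a) for a in command_args]
    return argv


argv = build_command("config.yaml", "myuser", "myhost",
                     "monitor", "--job", 1234567, "--lines", 50,
                     key="/path/to/key")
print(shlex.join(argv))
```

The resulting list can be passed to `subprocess.run(argv, check=True)`; building a list rather than a string avoids shell quoting issues with paths.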
Create a YAML config file (see example/config.yaml):
remote:
  base_dir: ~/experiments            # remote working root
files:
  push:
    - example/train.py               # any code/data files you need on remote
  fetch:
    - "model.pth"                    # optional; if omitted we fetch the entire run dir
    - "log.txt"
slurm:
  directives: |                      # SBATCH lines; placeholders allowed
    #SBATCH --job-name={base_dir}
    #SBATCH --partition=gpu
    #SBATCH --time=00:10:00
    #SBATCH --cpus-per-gpu=40
    #SBATCH --nodes=1
    #SBATCH --gres=gpu:1
    #SBATCH --mem=32G
run:
  command: |                         # your run command; placeholders allowed
    source venv/bin/activate
    python example/train.py --lr {lr} --epochs {epochs} --save_model "{run_dir}/model.pth" --log_file "{run_dir}/log.txt"
  # ONE of the following:
  grid:
    lr: [0.1, 0.01, 0.001]
    epochs: [1, 2, 5, 10]
  # experiments:
  #   - { lr: 0.1, epochs: 1 }
  #   - { lr: 0.001, epochs: 10 }

Placeholders available in the directives and the run command:
- {base_dir}: resolved remote base directory (e.g. /home/you/experiments)
- Any run parameter placeholder, e.g. {lr}, {epochs}
- {remote_dir}: the configured remote.base_dir
- {run_dir}: the per-run directory (under remote.base_dir/runs/{exp_name})
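To make the grid and placeholder mechanics concrete, here is a small sketch of how a grid like the one above could expand into individual runs and how a command template could be filled in. It illustrates the idea only and is not slurmster's actual implementation: `expand_grid` and `exp_name` are hypothetical helpers, and the naming scheme used for `{exp_name}` is an assumption.

```python
from itertools import product

grid = {"lr": [0.1, 0.01, 0.001], "epochs": [1, 2, 5, 10]}
command_template = (
    'python example/train.py --lr {lr} --epochs {epochs} '
    '--save_model "{run_dir}/model.pth" --log_file "{run_dir}/log.txt"'
)
remote_base = "/home/you/experiments"  # stands in for the resolved {base_dir}


def expand_grid(grid):
    """Return one parameter dict per combination of grid values."""
    names = list(grid)
    return [dict(zip(names, values))
            for values in product(*(grid[n] for n in names))]


def exp_name(params):
    # Hypothetical naming scheme; slurmster's real exp_name may differ.
    return "_".join(f"{k}{v}" for k, v in params.items())


runs = expand_grid(grid)
print(len(runs))  # 3 learning rates x 4 epoch counts = 12

first = runs[0]
run_dir = f"{remote_base}/runs/{exp_name(first)}"
rendered = command_template.format(run_dir=run_dir, **first)
print(rendered)
```

Each of the 12 parameter combinations would get its own run directory and its own rendered command; an explicit `experiments` list would skip the expansion step and supply the parameter dicts directly.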
Under the .slurmster directory next to your config.yaml (<config-dir>/.slurmster/<user>@<host>/<sanitized-remote-base>), we store:
- runs.json: run registry (job id, exp name, fetched flag, etc.)
- results/<exp_name>_<job_id>/...: fetched run directories
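The workspace layout above can be sketched in a few lines of Python, for example to locate the registry from your own scripts. This is a rough illustration under stated assumptions: the sanitization rule for the remote base path and the field names in the sample records (`job_id`, `exp_name`, `fetched`) are guesses, so inspect your own runs.json before relying on them.

```python
import json
import re
from pathlib import Path


def workspace_dir(config_path, user, host, remote_base):
    """Build <config-dir>/.slurmster/<user>@<host>/<sanitized-remote-base>."""
    # Assumed sanitization: collapse anything outside [A-Za-z0-9._-] to "_".
    sanitized = re.sub(r"[^A-Za-z0-9._-]+", "_", remote_base).strip("_")
    return Path(config_path).parent / ".slurmster" / f"{user}@{host}" / sanitized


def unfetched(runs):
    """Keep runs whose (hypothetical) 'fetched' flag is false or missing."""
    return [r for r in runs if not r.get("fetched", False)]


ws = workspace_dir("project/config.yaml", "myuser", "myhost",
                   "/home/you/experiments")
print(ws.as_posix())

# Hypothetical registry contents, shaped as a list of run records.
sample = json.loads(
    '[{"job_id": "1234567", "exp_name": "lr0.1_epochs1", "fetched": true},'
    ' {"job_id": "1234568", "exp_name": "lr0.01_epochs1", "fetched": false}]'
)
pending = [r["job_id"] for r in unfetched(sample)]
print(pending)
```

Filtering on the fetched flag mirrors what the status command shows by default (only non-fetched runs).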
For a more user-friendly experience, you can use the web-based GUI:
slurmster --config config.yaml --user myuser --host myhost gui
Additional GUI options:
--gui-port 8000: Set the HTTP port (default: 8000)
--gui-bind 0.0.0.0: Set the bind interface (default: 0.0.0.0)
--no-browser: Don't automatically open browser
The GUI provides:
Configuration Management:
- View and edit your current configuration
- See resolved placeholders and SLURM directives
- Modify files to push/fetch and run commands
Job Submission:
- Submit single jobs with custom parameters
- Submit grid jobs with parameter combinations
- Real-time parameter validation
Job Monitoring:
- View all jobs with their current status
- Monitor and browse job outputs in real-time
- Access job logs directly in the browser
Bulk Operations:
- Fetch all completed jobs at once
- Cancel multiple jobs
- Track job progress and completion status
The GUI automatically opens in your browser at http://localhost:8000 (or your specified port) and provides an intuitive interface for all slurmster functionality.
MIT — see LICENSE.
