Development

This page is for people working on SML itself. If you just want to use SML, see Getting Started.

Setting up the dev environment

git clone https://github.com/swiss-ai/model-launch.git
cd model-launch
make install-dev
source .venv/bin/activate

make install-dev creates a virtualenv at .venv/, installs SML in editable mode, and sets up pre-commit hooks.

A handful of lint tools live outside the venv and need a one-time install:

Tool	Why	Install (macOS)
`taplo`	TOML formatter, used by `make format` / `make tomlfmt` and the pre-commit hook	`brew install taplo`
`npx` (Node)	Runs `prettier` and `markdownlint-cli2` on demand	`brew install node`

Pin: CI installs taplo v0.9.3 — match it locally if you hit format-drift between your machine and CI.

Test environment

Integration tests need real cluster credentials. Create .test.sh at the repo root:

export SML_CSCS_API_KEY=<your-api-key>
export SML_FIRECREST_CLIENT_ID=<your-client-id>
export SML_FIRECREST_CLIENT_SECRET=<your-client-secret>
export SML_FIRECREST_SYSTEM=clariden
export SML_FIRECREST_TOKEN_URI=<your-token-uri>
export SML_FIRECREST_URL=<your-firecrest-url>
export SML_PARTITION=normal
export SML_RESERVATION=<your-reservation>

.test.sh is gitignored; the test targets source it automatically.

Common make targets

Target	What it does
`make format`	Format Python (`ruff`)
`make shellcheck`	Lint shell scripts
`make markdownlint`	Lint Markdown
`make test-lightweight`	Auto-CI subset of integration tests
`make test-comprehensive`	Full integration test suite
`make clean-cache`	Remove cache files
`make clean-dev`	Remove the venv and cache

Debugging

Set SML_DEBUG=1 to include local variables in crash tracebacks:

export SML_DEBUG=1

Warning: SML_DEBUG=1 may expose secrets (CSCS API key, FirecREST credentials) in crash output. Don't share terminal output captured with this flag.

By default, locals are stripped from crash reports.

Adding a new model recipe

The lowest-friction contribution. Drop a shell script under examples/<system>/cli/<vendor>/. Use the adding-new-model-to-sml issue template as a checklist; existing scripts (e.g. examples/clariden/cli/swiss-ai/Apertus-8B-Instruct-2509-sglang.sh) are good templates.

For models that should appear in the sml interactive catalog (not just sml advanced), the recipe also needs an entry in the model catalog — see existing entries under src/swiss_ai_model_launch/assets/models.json.

Try it yourself first

The SML team can't take a "please add my model" request for every checkpoint that lands on Hugging Face. Before filing an issue, work the checklist:

Find the closest existing example under examples/<system>/cli/<vendor>/ — same framework (sglang/vllm), similar size class, same architecture if possible. Copy it.
Swap in your model path via --framework-args "--model-path /capstor/store/.../<your-model>" (and --served-model-name <something-unique>).
Try it with sml advanced. If it serves, you're done — the script is the recipe; PR it.
If it doesn't serve, narrow the failure before opening an issue:
- Does the same model work with the framework directly (no SML)? If not, it's a framework issue, not an SML issue — report upstream.
- Does it OOM? See Sizing — you may need bigger TP, more nodes, or quantization.
- Does it fail to load? Architecture not supported by the framework version in the environment toml — try the other framework, or a newer image.
Only if you've gotten through 1-4 and are still stuck, file an issue with the failing command, the trailing 50 lines of logs, and what you've already ruled out.

Modifying the SLURM submission script

The SLURM script is rendered from Python at submit time — there is no static script.sh or template.jinja to edit. The renderer is in src/swiss_ai_model_launch/launchers/framework.py.

What gets rendered

A single master.sh (visible via --output-script — see usage) containing in order:

Telemetry POST
Arch detection — sets OCF_BIN, SP_NCCL_SO_PATH, metrics_agent_bin per aarch64 / x86_64
Node mapping — mapfile -t nodes < <(scontrol show hostnames ...)
Self-extracting rank scripts — single-quoted cat-heredocs that lay down head.sh, optionally follower.sh, optionally router.sh under $HOME/.sml/job-${SLURM_JOB_ID}/
Per-replica head IP discovery — one hostname -i srun per replica
Per-rank srun calls — one block per (replica, rank). Each binds the rank dir into the pyxis container via --container-mounts="$RANKS_DIR:$RANKS_DIR" and invokes bash $RANKS_DIR/<role>.sh
vmagent (optional) — metrics scraper on the batch node
Router (optional) — sglang_router on nodes[0] when replicas > 1 && --use-router
Footer — connect/cancel hints, wait, "Script finished"

Where to make changes

If you want to change…	Edit…
What runs inside the container per rank	`_render_sglang_head`, `_render_sglang_follower`, `_render_vllm_head`, `_render_vllm_follower`
Framework env exports (NCCL flags, `no_proxy`, JIT DeepGEMM toggle, …)	`Sglang.env_exports` / `Vllm.env_exports`
Add a new inference framework	Subclass `Framework`, register in `_FRAMEWORKS`, write per-shape renderers
The OCF wrap	`_ocf_wrap`
The router rank script	`_render_router`
Arch detection / node mapping / vmagent / footer	The matching `_render_<section>` functions
What gets bind-mounted into the container per srun	The `--container-mounts` line in `_render_replica_launches` / `_render_router_launch`
The toml mount list itself (per env: sglang, vllm, …)	The files under `src/swiss_ai_model_launch/assets/envs/`
Total nodes / partition / time / SBATCH directives	`to_sbatch_args` on `LaunchArgs` (or `render_sbatch_header` for the firecrest path)
New CLI flag flowing into `LaunchArgs`	Add to `LaunchArgs` (pydantic), wire through `build_launch_args_from_advanced` in `cli/main.py`

Preview your change

sml advanced ... --output-script /tmp/before    # current behaviour
# edit framework.py
sml advanced ... --output-script /tmp/after     # new behaviour
diff -r /tmp/before /tmp/after                  # per-file diff across master + ranks

For full coverage, the test matrix at tests/unit/test_framework.py renders 96 configurations (framework × replicas × nodes_per_replica × use_router × disable_ocf × telemetry) and runs bash -n + shellcheck against each. If your change leaves any of those broken, the test will catch it before submit time:

uv run pytest tests/unit/test_framework.py -q

tests/unit/test_examples.py also renders six real example scripts through the production CLI parser, so adding a flag that breaks one of those will fail there.

CI / CD

See CI/CD for the pipeline structure. PRs run static checks → image build → integration tests; each stage gates the next.

Filing issues / PRs

Bugs: use the bug report template. Include the failing command and the trailing chunk of TUI logs.
New models: use the adding-new-model template.
PRs: keep them focused; pre-commit hooks must pass; integration tests must pass on at least one partition.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Development

Setting up the dev environment

Test environment

Common make targets

Debugging

Adding a new model recipe

Try it yourself first

Modifying the SLURM submission script

What gets rendered

Where to make changes

Preview your change

CI / CD

Filing issues / PRs

FilesExpand file tree

development.md

Latest commit

History

development.md

File metadata and controls

Development

Setting up the dev environment

Test environment

Common make targets

Debugging

Adding a new model recipe

Try it yourself first

Modifying the SLURM submission script

What gets rendered

Where to make changes

Preview your change

CI / CD

Filing issues / PRs