Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion docs/glossary.md
Original file line number Diff line number Diff line change
Expand Up @@ -50,7 +50,7 @@ A SLURM concept — a named subset of cluster nodes with its own queue, time lim

## Replica

One independent copy of the model (a [DP](sizing.md#parallelism-dp-tp-pp-ep-and-why-dp-is-replicas) unit). Set via `--slurm-replicas`. More replicas = more throughput. Distinct from `--slurm-nodes-per-replica`, which sets how many nodes one replica spans.
One independent copy of the model (a [DP](sizing.md#parallelism-dp-tp-pp-ep-and-why-dp-is-replicas) unit), called a *worker* in the CLI. Set via `--slurm-workers`. More workers = more throughput. Distinct from `--slurm-nodes-per-worker`, which sets how many nodes one worker spans.

## Reservation

Expand Down
36 changes: 18 additions & 18 deletions docs/sizing.md
Original file line number Diff line number Diff line change
Expand Up @@ -41,10 +41,10 @@ CSCS GH200 nodes have 4 GPUs at ~96 GB each (~384 GB per node).

| Model size (BF16) | Fits where | Layout |
| ----------------- | ----------------------------------------- | ----------------------------------------------------------------------------- |
| ≤ 30 B | 1 GPU | `--slurm-replicas N --slurm-nodes-per-replica 1`, set framework `--tp-size 1` |
| 30–80 B | 1 node (4-way TP) | 1 replica per node, framework `--tp-size 4` |
| 80–250 B | 1 node (4-way TP) at FP8, or 2 nodes BF16 | quantize, or `--slurm-nodes-per-replica 2` + matching TP |
| 250 B+ | Multiple nodes | `--slurm-nodes-per-replica 2+`, expect tensor + pipeline parallelism |
| ≤ 30 B | 1 GPU | `--slurm-workers N --slurm-nodes-per-worker 1`, set framework `--tp-size 1` |
| 30–80 B | 1 node (4-way TP) | 1 worker per node, framework `--tp-size 4` |
| 80–250 B | 1 node (4-way TP) at FP8, or 2 nodes BF16 | quantize, or `--slurm-nodes-per-worker 2` + matching TP |
| 250 B+ | Multiple nodes | `--slurm-nodes-per-worker 2+`, expect tensor + pipeline parallelism |

## Parallelism: DP / TP / PP / EP — and why DP is replicas

Expand All @@ -53,17 +53,17 @@ Four flavors of parallelism show up when serving large models:
| Term | What it splits across GPUs | Where SML expresses it |
| ----------------------------- | ----------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------- |
| **TP** (tensor parallelism) | A single matmul, sharded across GPUs within a layer | Framework flag (e.g. sglang/vLLM `--tp-size`) inside `--framework-args`. Stays inside one replica. |
| **PP** (pipeline parallelism) | Layers, sharded across GPUs (or nodes) end-to-end | Framework flag (e.g. `--pp-size`) inside `--framework-args`. Spans nodes within one replica when `--slurm-nodes-per-replica > 1`. |
| **EP** (expert parallelism) | MoE experts, sharded across GPUs — only meaningful for MoE models | Framework flag (e.g. vLLM/sglang `--ep-size` or `--enable-expert-parallel`) inside `--framework-args`. Stays inside one replica. |
| **DP** (data parallelism) | Independent copies serving different requests in parallel | **`--slurm-replicas N`** — N copies of the model, optionally fronted by `--use-router`. |
| **PP** (pipeline parallelism) | Layers, sharded across GPUs (or nodes) end-to-end | Framework flag (e.g. `--pp-size`) inside `--framework-args`. Spans nodes within one worker when `--slurm-nodes-per-worker > 1`. |
| **EP** (expert parallelism) | MoE experts, sharded across GPUs — only meaningful for MoE models | Framework flag (e.g. vLLM/sglang `--ep-size` or `--enable-expert-parallel`) inside `--framework-args`. Stays inside one worker. |
| **DP** (data parallelism) | Independent copies serving different requests in parallel | **`--slurm-workers N`** — N copies of the model, optionally fronted by `--use-router`. |

In short: **a "replica" in SML is a DP unit.** TP, PP, and EP are framework-internal — they affect how one replica is laid out across its allocated GPUs/nodes. DP is just "how many replicas".
In short: **a worker in SML is a DP unit (one replica).** TP, PP, and EP are framework-internal — they affect how one worker is laid out across its allocated GPUs/nodes. DP is just "how many workers".

### A note on dense models in Kubernetes

For dense models (one weight matrix per layer, no MoE routing), DP isn't usually expressed inside the inference framework — you don't tell the framework "give me 4 data-parallel copies on these 4 GPUs". You just request a single GPU per replica and let the **autoscaler** add more replicas when load grows. The orchestrator (k8s, or here, SLURM + `--slurm-replicas`) provides DP naturally; the framework only handles TP (and PP when needed).
For dense models (one weight matrix per layer, no MoE routing), DP isn't usually expressed inside the inference framework — you don't tell the framework "give me 4 data-parallel copies on these 4 GPUs". You just request a single GPU per worker and let the **autoscaler** add more workers when load grows. The orchestrator (k8s, or here, SLURM + `--slurm-workers`) provides DP naturally; the framework only handles TP (and PP when needed).

This shapes the rule below: bump `--slurm-replicas` for throughput, not the framework's DP flags.
This shapes the rule below: bump `--slurm-workers` for throughput, not the framework's DP flags.

### MoE models change the picture

Expand All @@ -74,21 +74,21 @@ For Mixture-of-Experts models (Mixtral, DeepSeek-V3, GLM-4.5/5, Qwen-MoE, …),

Rule of thumb: for MoE models with many experts and modest expert size, **prefer EP over TP within a replica** — it's typically faster on multi-GPU nodes. Use TP for the dense (attention) parts and EP for the MoE feed-forward parts when the framework supports it (most modern serving stacks do).

DP across replicas still applies the same way for throughput: more concurrent requests → bump `--slurm-replicas`.
DP across workers still applies the same way for throughput: more concurrent requests → bump `--slurm-workers`.

## Step 4 — replicas vs. nodes-per-replica
## Step 4 — workers vs. nodes-per-worker

These two flags set very different things:

- **`--slurm-replicas N`** — N independent copies of the model. Use for **throughput**: more concurrent requests, optionally fronted by `--use-router` for load balancing.
- **`--slurm-nodes-per-replica K`** — each replica spans K nodes. Use when **one replica doesn't fit on a single node** (large models, long context, more KV cache).
- **`--slurm-workers N`** — N independent copies of the model (replicas). Use for **throughput**: more concurrent requests, optionally fronted by `--use-router` for load balancing.
- **`--slurm-nodes-per-worker K`** — each worker spans K nodes. Use when **one worker doesn't fit on a single node** (large models, long context, more KV cache).

Total nodes = `replicas × nodes-per-replica`.
Total nodes = `workers × nodes-per-worker`.

Rule of thumb:

- Model fits on 1 node, want more throughput? → bump `--slurm-replicas`.
- Model doesn't fit on 1 node? → bump `--slurm-nodes-per-replica` first, then add replicas if you still need throughput.
- Model fits on 1 node, want more throughput? → bump `--slurm-workers`.
- Model doesn't fit on 1 node? → bump `--slurm-nodes-per-worker` first, then add workers if you still need throughput.

## Step 5 — sanity-check before submitting

Expand Down Expand Up @@ -119,7 +119,7 @@ Use this when **you have a lot of work to push through** — batch eval, dataset

| Knob | Recommended for high throughput |
| --- | --- |
| Replicas | **More.** Bump `--slurm-replicas` until you hit a partition or budget cap. DP scales linearly. |
| Workers | **More.** Bump `--slurm-workers` until you hit a partition or budget cap. DP scales linearly. |
| Router | **On** (`--use-router`). Spreads load across replicas; without it you have to load-balance yourself. |
| Framework batching | Crank `--max-num-seqs` (e.g. 256+) so the framework can group requests into fat batches. |
| KV cache headroom | Leave more VRAM for the cache. Bigger cache = more concurrent sequences = more batching opportunity. |
Expand Down
16 changes: 8 additions & 8 deletions docs/usage-advanced.md
Original file line number Diff line number Diff line change
Expand Up @@ -14,17 +14,17 @@ For the guided flow with a curated catalog, use [`sml`](usage-sml.md).
| --------------------------- | ----------------------- | ----------------------------------------------------------------- |
| `--firecrest-system` | `SML_FIRECREST_SYSTEM` | Target HPC system |
| `--partition` | `SML_PARTITION` | SLURM partition |
| `--slurm-reservation` | `SML_RESERVATION` | SLURM reservation (optional) |
| `--slurm-reservation` | | SLURM reservation (optional) |
| `--serving-framework` | | Inference framework (`sglang`, `vllm`) — **required** |
| `--slurm-environment` | | Local path to the environment `.toml` file — **required** |
| `--framework-args` | | Arguments forwarded to the inference framework |
| `--slurm-nodes` | | Total nodes (default: `replicas × nodes-per-replica`) |
| `--slurm-replicas` | | Number of replicas (default: `1`) |
| `--slurm-nodes-per-replica` | | Nodes per replica (default: `1`) |
| `--slurm-nodes` | | Total nodes (default: `workers × nodes-per-worker`) |
| `--slurm-workers` | | Number of workers / replicas (default: `1`) |
| `--slurm-nodes-per-worker` | | Nodes per worker (default: `1`) |
| `--slurm-time` | | Job time limit `HH:MM:SS` (default: `00:05:00`) |
| `--served-model-name` | | Name under which the model is served (auto-generated if omitted) |
| `--replica-port` | | Port used by replicas (default: `5000`) |
| `--use-router` | | Enable router to load-balance across replicas |
| `--worker-port` | | Port used by workers (default: `5000`) |
| `--use-router` | | Enable router to load-balance across workers |
| `--router-args` | | Arguments forwarded to the router |
| `--disable-ocf` | | Disable OCF wrapper |
| `--pre-launch-cmds` | | Shell commands to run before the framework starts |
Expand All @@ -35,8 +35,8 @@ For the guided flow with a curated catalog, use [`sml`](usage-sml.md).
sml advanced \
--firecrest-system clariden \
--partition normal \
--slurm-replicas 1 \
--slurm-nodes-per-replica 1 \
--slurm-workers 1 \
--slurm-nodes-per-worker 1 \
--serving-framework sglang \
--slurm-environment src/swiss_ai_model_launch/assets/envs/sglang.toml \
--framework-args "--model-path /capstor/store/cscs/swissai/infra01/hf_models/models/swiss-ai/Apertus-8B-Instruct-2509 \
Expand Down
Loading
Loading