swiss-ai · AryanAhadinia · Apr 29, 2026
diff --git a/docs/glossary.md b/docs/glossary.md
@@ -50,7 +50,7 @@ A SLURM concept — a named subset of cluster nodes with its own queue, time lim
 
 ## Replica
 
-One independent copy of the model (a [DP](sizing.md#parallelism-dp-tp-pp-ep-and-why-dp-is-replicas) unit). Set via `--slurm-replicas`. More replicas = more throughput. Distinct from `--slurm-nodes-per-replica`, which sets how many nodes one replica spans.
+One independent copy of the model (a [DP](sizing.md#parallelism-dp-tp-pp-ep-and-why-dp-is-replicas) unit), called a *worker* in the CLI. Set via `--slurm-workers`. More workers = more throughput. Distinct from `--slurm-nodes-per-worker`, which sets how many nodes one worker spans.
 
 ## Reservation
 

diff --git a/docs/sizing.md b/docs/sizing.md
@@ -41,10 +41,10 @@ CSCS GH200 nodes have 4 GPUs at ~96 GB each (~384 GB per node).
 
 | Model size (BF16) | Fits where                                | Layout                                                                        |
 | ----------------- | ----------------------------------------- | ----------------------------------------------------------------------------- |
-| ≤ 30 B            | 1 GPU                                     | `--slurm-replicas N --slurm-nodes-per-replica 1`, set framework `--tp-size 1` |
-| 30–80 B           | 1 node (4-way TP)                         | 1 replica per node, framework `--tp-size 4`                                   |
-| 80–250 B          | 1 node (4-way TP) at FP8, or 2 nodes BF16 | quantize, or `--slurm-nodes-per-replica 2` + matching TP                      |
-| 250 B+            | Multiple nodes                            | `--slurm-nodes-per-replica 2+`, expect tensor + pipeline parallelism          |
+| ≤ 30 B            | 1 GPU                                     | `--slurm-workers N --slurm-nodes-per-worker 1`, set framework `--tp-size 1`   |
+| 30–80 B           | 1 node (4-way TP)                         | 1 worker per node, framework `--tp-size 4`                                    |
+| 80–250 B          | 1 node (4-way TP) at FP8, or 2 nodes BF16 | quantize, or `--slurm-nodes-per-worker 2` + matching TP                       |
+| 250 B+            | Multiple nodes                            | `--slurm-nodes-per-worker 2+`, expect tensor + pipeline parallelism           |
 
 ## Parallelism: DP / TP / PP / EP — and why DP is replicas
 
@@ -53,17 +53,17 @@ Four flavors of parallelism show up when serving large models:
 | Term                          | What it splits across GPUs                                        | Where SML expresses it                                                                                                            |
 | ----------------------------- | ----------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------- |
 | **TP** (tensor parallelism)   | A single matmul, sharded across GPUs within a layer               | Framework flag (e.g. sglang/vLLM `--tp-size`) inside `--framework-args`. Stays inside one replica.                                |
-| **PP** (pipeline parallelism) | Layers, sharded across GPUs (or nodes) end-to-end                 | Framework flag (e.g. `--pp-size`) inside `--framework-args`. Spans nodes within one replica when `--slurm-nodes-per-replica > 1`. |
-| **EP** (expert parallelism)   | MoE experts, sharded across GPUs — only meaningful for MoE models | Framework flag (e.g. vLLM/sglang `--ep-size` or `--enable-expert-parallel`) inside `--framework-args`. Stays inside one replica.  |
-| **DP** (data parallelism)     | Independent copies serving different requests in parallel         | **`--slurm-replicas N`** — N copies of the model, optionally fronted by `--use-router`.                                           |
+| **PP** (pipeline parallelism) | Layers, sharded across GPUs (or nodes) end-to-end                 | Framework flag (e.g. `--pp-size`) inside `--framework-args`. Spans nodes within one worker when `--slurm-nodes-per-worker > 1`.   |
+| **EP** (expert parallelism)   | MoE experts, sharded across GPUs — only meaningful for MoE models | Framework flag (e.g. vLLM/sglang `--ep-size` or `--enable-expert-parallel`) inside `--framework-args`. Stays inside one worker.   |
+| **DP** (data parallelism)     | Independent copies serving different requests in parallel         | **`--slurm-workers N`** — N copies of the model, optionally fronted by `--use-router`.                                            |
 
-In short: **a "replica" in SML is a DP unit.** TP, PP, and EP are framework-internal — they affect how one replica is laid out across its allocated GPUs/nodes. DP is just "how many replicas".
+In short: **a worker in SML is a DP unit (one replica).** TP, PP, and EP are framework-internal — they affect how one worker is laid out across its allocated GPUs/nodes. DP is just "how many workers".
 
 ### A note on dense models in Kubernetes
 
-For dense models (one weight matrix per layer, no MoE routing), DP isn't usually expressed inside the inference framework — you don't tell the framework "give me 4 data-parallel copies on these 4 GPUs". You just request a single GPU per replica and let the **autoscaler** add more replicas when load grows. The orchestrator (k8s, or here, SLURM + `--slurm-replicas`) provides DP naturally; the framework only handles TP (and PP when needed).
+For dense models (one weight matrix per layer, no MoE routing), DP isn't usually expressed inside the inference framework — you don't tell the framework "give me 4 data-parallel copies on these 4 GPUs". You just request a single GPU per worker and let the **autoscaler** add more workers when load grows. The orchestrator (k8s, or here, SLURM + `--slurm-workers`) provides DP naturally; the framework only handles TP (and PP when needed).
 
-This shapes the rule below: bump `--slurm-replicas` for throughput, not the framework's DP flags.
+This shapes the rule below: bump `--slurm-workers` for throughput, not the framework's DP flags.
 
 ### MoE models change the picture
 
@@ -74,21 +74,21 @@ For Mixture-of-Experts models (Mixtral, DeepSeek-V3, GLM-4.5/5, Qwen-MoE, …),
 
 Rule of thumb: for MoE models with many experts and modest expert size, **prefer EP over TP within a replica** — it's typically faster on multi-GPU nodes. Use TP for the dense (attention) parts and EP for the MoE feed-forward parts when the framework supports it (most modern serving stacks do).
 
-DP across replicas still applies the same way for throughput: more concurrent requests → bump `--slurm-replicas`.
+DP across workers still applies the same way for throughput: more concurrent requests → bump `--slurm-workers`.
 
-## Step 4 — replicas vs. nodes-per-replica
+## Step 4 — workers vs. nodes-per-worker
 
 These two flags set very different things:
 
-- **`--slurm-replicas N`** — N independent copies of the model. Use for **throughput**: more concurrent requests, optionally fronted by `--use-router` for load balancing.
-- **`--slurm-nodes-per-replica K`** — each replica spans K nodes. Use when **one replica doesn't fit on a single node** (large models, long context, more KV cache).
+- **`--slurm-workers N`** — N independent copies of the model (replicas). Use for **throughput**: more concurrent requests, optionally fronted by `--use-router` for load balancing.
+- **`--slurm-nodes-per-worker K`** — each worker spans K nodes. Use when **one worker doesn't fit on a single node** (large models, long context, more KV cache).
 
-Total nodes = `replicas × nodes-per-replica`.
+Total nodes = `workers × nodes-per-worker`.
 
 Rule of thumb:
 
-- Model fits on 1 node, want more throughput? → bump `--slurm-replicas`.
-- Model doesn't fit on 1 node? → bump `--slurm-nodes-per-replica` first, then add replicas if you still need throughput.
+- Model fits on 1 node, want more throughput? → bump `--slurm-workers`.
+- Model doesn't fit on 1 node? → bump `--slurm-nodes-per-worker` first, then add workers if you still need throughput.
 
 ## Step 5 — sanity-check before submitting
 
@@ -119,7 +119,7 @@ Use this when **you have a lot of work to push through** — batch eval, dataset
 
 | Knob | Recommended for high throughput |
 | --- | --- |
-| Replicas | **More.** Bump `--slurm-replicas` until you hit a partition or budget cap. DP scales linearly. |
+| Workers | **More.** Bump `--slurm-workers` until you hit a partition or budget cap. DP scales linearly. |
 | Router | **On** (`--use-router`). Spreads load across replicas; without it you have to load-balance yourself. |
 | Framework batching | Crank `--max-num-seqs` (e.g. 256+) so the framework can group requests into fat batches. |
 | KV cache headroom | Leave more VRAM for the cache. Bigger cache = more concurrent sequences = more batching opportunity. |

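Reviewer note on the `docs/sizing.md` change above: the renamed Step 4 rule (total nodes = `workers × nodes-per-worker`) can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not part of SML; the function names `total_nodes` and `total_gpus` are hypothetical, and the 4-GPUs-per-node figure is taken from the sizing doc's description of CSCS GH200 nodes.

```python
# Illustrative helper, not part of the SML codebase.
# Assumption: 4 GPUs per node, as stated in docs/sizing.md for CSCS GH200 nodes.
GPUS_PER_NODE = 4


def total_nodes(workers: int, nodes_per_worker: int = 1) -> int:
    """Nodes requested for `--slurm-workers` x `--slurm-nodes-per-worker`."""
    if workers < 1 or nodes_per_worker < 1:
        raise ValueError("workers and nodes_per_worker must both be >= 1")
    return workers * nodes_per_worker


def total_gpus(workers: int, nodes_per_worker: int = 1) -> int:
    """GPUs covered by the allocation, assuming whole nodes are allocated."""
    return total_nodes(workers, nodes_per_worker) * GPUS_PER_NODE


if __name__ == "__main__":
    # 4 workers, each spanning 2 nodes
    print(total_nodes(4, 2))  # 8
    print(total_gpus(4, 2))   # 32
```

Both defaults mirror the documented CLI defaults of `1`, so a single-worker, single-node launch resolves to one node.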
diff --git a/docs/usage-advanced.md b/docs/usage-advanced.md
@@ -14,17 +14,17 @@ For the guided flow with a curated catalog, use [`sml`](usage-sml.md).
 | --------------------------- | ----------------------- | ----------------------------------------------------------------- |
 | `--firecrest-system`        | `SML_FIRECREST_SYSTEM`  | Target HPC system                                                 |
 | `--partition`               | `SML_PARTITION`         | SLURM partition                                                   |
-| `--slurm-reservation`       | `SML_RESERVATION`       | SLURM reservation (optional)                                      |
+| `--slurm-reservation`       |                         | SLURM reservation (optional)                                      |
 | `--serving-framework`       |                         | Inference framework (`sglang`, `vllm`) — **required**             |
 | `--slurm-environment`       |                         | Local path to the environment `.toml` file — **required**         |
 | `--framework-args`          |                         | Arguments forwarded to the inference framework                    |
-| `--slurm-nodes`             |                         | Total nodes (default: `replicas × nodes-per-replica`)             |
-| `--slurm-replicas`          |                         | Number of replicas (default: `1`)                                 |
-| `--slurm-nodes-per-replica` |                         | Nodes per replica (default: `1`)                                  |
+| `--slurm-nodes`             |                         | Total nodes (default: `workers × nodes-per-worker`)               |
+| `--slurm-workers`           |                         | Number of workers / replicas (default: `1`)                       |
+| `--slurm-nodes-per-worker`  |                         | Nodes per worker (default: `1`)                                   |
 | `--slurm-time`              |                         | Job time limit `HH:MM:SS` (default: `00:05:00`)                   |
 | `--served-model-name`       |                         | Name under which the model is served (auto-generated if omitted)  |
-| `--replica-port`            |                         | Port used by replicas (default: `5000`)                           |
-| `--use-router`              |                         | Enable router to load-balance across replicas                     |
+| `--worker-port`             |                         | Port used by workers (default: `5000`)                            |
+| `--use-router`              |                         | Enable router to load-balance across workers                      |
 | `--router-args`             |                         | Arguments forwarded to the router                                 |
 | `--disable-ocf`             |                         | Disable OCF wrapper                                               |
 | `--pre-launch-cmds`         |                         | Shell commands to run before the framework starts                 |
@@ -35,8 +35,8 @@ For the guided flow with a curated catalog, use [`sml`](usage-sml.md).
 sml advanced \
   --firecrest-system clariden \
   --partition normal \
-  --slurm-replicas 1 \
-  --slurm-nodes-per-replica 1 \
+  --slurm-workers 1 \
+  --slurm-nodes-per-worker 1 \
   --serving-framework sglang \
   --slurm-environment src/swiss_ai_model_launch/assets/envs/sglang.toml \
   --framework-args "--model-path /capstor/store/cscs/swissai/infra01/hf_models/models/swiss-ai/Apertus-8B-Instruct-2509 \