Complete reference for job configuration YAML files.
- Overview
- Cluster Config Discovery
- name
- model
- resources
- slurm
- frontend
- backend
- benchmark
- dynamo
- profiling
- output
- health_check
- infra
- sweep
- Config Overrides
- FormattablePath Template System
- container_mounts
- environment
- extra_mount
- sbatch_directives
- srun_options
- setup_script
- enable_config_dump
- Complete Examples
name: "my-benchmark" # Required: job name
model: # Required: model settings
path: "deepseek-r1"
container: "latest"
precision: "fp8"
resources: # Required: GPU allocation
gpu_type: "gb200"
prefill_nodes: 1
decode_nodes: 2
slurm: # Optional: SLURM overrides
time_limit: "02:00:00"
frontend: # Optional: router/frontend config
type: dynamo
backend: # Optional: worker config
type: sglang
sglang_config:
prefill: {}
decode: {}
benchmark: # Optional: benchmark config
type: "sa-bench"
isl: 1024
osl: 1024
dynamo: # Optional: dynamo version
version: "0.8.0"
profiling: # Optional: profiling config
type: "none"
output: # Optional: output paths
log_dir: "./outputs/{job_id}/logs"
health_check: # Optional: health check settings
max_attempts: 180
interval_seconds: 10
setup_script: "my-setup.sh" # Optional: custom setup scriptsrtctl looks for srtslurm.yaml (cluster-wide settings) in this order:
- `SRTSLURM_CONFIG` environment variable (if set) - explicit path to config file
- Current working directory
- Parent directory (1 level up)
- Grandparent directory (2 levels up)
For users working in deep directory structures (e.g., study directories), set SRTSLURM_CONFIG in your shell profile:
# Add to ~/.bashrc or ~/.zshrc
export SRTSLURM_CONFIG="/path/to/srt-slurm/srtslurm.yaml"

This allows you to run `srtctl apply -f config.yaml` from anywhere without needing srtslurm.yaml nearby.
The srtslurm.yaml file can contain the following fields:
| Field | Type | Description |
|---|---|---|
| `default_account` | string | Default SLURM account |
| `default_partition` | string | Default SLURM partition |
| `default_time_limit` | string | Default job time limit |
| `gpus_per_node` | int | Default GPUs per node |
| `network_interface` | string | Network interface for NCCL |
| `srtctl_root` | string | Root directory for srtctl |
| `output_dir` | string | Custom output directory (overrides srtctl_root/outputs) |
| `model_paths` | dict | Model path aliases |
| `containers` | dict | Container image aliases |
| `default_mounts` | dict | Cluster-wide container mounts |
`output_dir`: When set, job logs are written to `output_dir/{job_id}/logs` instead of `srtctl_root/outputs/{job_id}/logs`. Useful for CI/CD and ephemeral environments.
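Putting these fields together, a hypothetical `srtslurm.yaml` might look like the following; all paths, aliases, and values here are illustrative.

```yaml
# srtslurm.yaml (illustrative values)
default_account: "my-account"
default_partition: "batch"
default_time_limit: "04:00:00"
gpus_per_node: 4
network_interface: "eth0"          # interface used for NCCL
srtctl_root: "/home/user/srt-slurm"
output_dir: "/scratch/ci-outputs"  # optional: overrides srtctl_root/outputs
model_paths:
  deepseek-r1: "/models/deepseek-r1"
containers:
  latest: "/containers/sglang-latest.sqsh"
default_mounts:
  "/cluster/special/libs": "/opt/libs"
```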
| Field | Type | Required | Description |
|---|---|---|---|
| `name` | string | Yes | Job name, used for identification and log prefixes |

name: "deepseek-r1-benchmark"

Model and container configuration.
model:
path: "deepseek-r1" # Alias from srtslurm.yaml or full path
container: "latest" # Container alias from srtslurm.yaml
precision: "fp8" # fp8, fp4, bf16, etc.

| Field | Type | Required | Description |
|---|---|---|---|
| `path` | string | Yes | Model path alias (from srtslurm.yaml) or absolute path |
| `container` | string | Yes | Container alias (from srtslurm.yaml) or .sqsh path |
| `precision` | string | Yes | Model precision (informational: fp4, fp8, fp16, bf16) |
GPU allocation and worker topology.
resources:
gpu_type: "gb200"
gpus_per_node: 4 # GPUs per node (default: from srtslurm.yaml)
prefill_nodes: 2 # Nodes for prefill workers
prefill_workers: 4 # Number of prefill workers
decode_nodes: 4 # Nodes for decode workers
decode_workers: 8 # Number of decode workers

resources:
gpu_type: "h100"
gpus_per_node: 8
agg_nodes: 2 # Nodes for aggregated workers
agg_workers: 4 # Number of aggregated workers

| Field | Type | Default | Description |
|---|---|---|---|
| `gpu_type` | string | - | GPU type: "gb200", "gb300", or "h100" |
| `gpus_per_node` | int | 4 | GPUs per node |
| `prefill_nodes` | int | null | Nodes dedicated to prefill |
| `decode_nodes` | int | null | Nodes dedicated to decode |
| `prefill_workers` | int | null | Number of prefill workers |
| `decode_workers` | int | null | Number of decode workers |
| `agg_nodes` | int | null | Nodes for aggregated mode |
| `agg_workers` | int | null | Number of aggregated workers |
| `gpus_per_prefill` | int | computed | Explicit GPUs per prefill worker |
| `gpus_per_decode` | int | computed | Explicit GPUs per decode worker |
| `gpus_per_agg` | int | computed | Explicit GPUs per aggregated worker |
Notes:
- Set `decode_nodes: 0` to have decode workers share nodes with prefill workers.
- Either use disaggregated mode (prefill_nodes/decode_nodes) OR aggregated mode (agg_nodes), not both.
- GPUs per worker are computed automatically: `(nodes * gpus_per_node) / workers` (see the worked example below).
- Use `gpus_per_prefill`, `gpus_per_decode`, `gpus_per_agg` to explicitly override the computed values.
The ResourceConfig provides several computed properties:
- `is_disaggregated`: True if using prefill/decode mode
- `total_nodes`: Total nodes allocated (prefill + decode or agg)
- `num_prefill`, `num_decode`, `num_agg`: Worker counts for each role
- `gpus_per_prefill`, `gpus_per_decode`, `gpus_per_agg`: GPUs allocated per worker
- `prefill_gpus`, `decode_gpus`: Total GPUs for each role
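As a worked example of the automatic computation (reusing the disaggregated snippet above), the per-worker GPU counts come out as follows:

```yaml
resources:
  gpu_type: "gb200"
  gpus_per_node: 4
  prefill_nodes: 2      # 2 nodes * 4 GPUs = 8 prefill GPUs total
  prefill_workers: 4    # gpus_per_prefill = (2 * 4) / 4 = 2
  decode_nodes: 4       # 4 nodes * 4 GPUs = 16 decode GPUs total
  decode_workers: 8     # gpus_per_decode = (4 * 4) / 8 = 2
```

Here `total_nodes` is 6, `prefill_gpus` is 8, and `decode_gpus` is 16.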
SLURM job settings.
slurm:
time_limit: "04:00:00" # Job time limit
account: "my-account" # SLURM account (overrides srtslurm.yaml)
partition: "batch" # SLURM partition (overrides srtslurm.yaml)

| Field | Type | Default | Description |
|---|---|---|---|
| `time_limit` | string | from srtslurm.yaml | Job time limit (HH:MM:SS) |
| `account` | string | from srtslurm.yaml | SLURM account |
| `partition` | string | from srtslurm.yaml | SLURM partition |
Frontend/router configuration.
frontend:
# Frontend type: "dynamo" (default) or "sglang"
type: dynamo
# Scaling
enable_multiple_frontends: true # Enable nginx + multiple routers
num_additional_frontends: 9 # Additional routers (total = 1 + this)
# CLI args passed to the frontend/router
args:
router-mode: "kv" # dynamo: router-mode
policy: "cache_aware" # sglang: policy
no-kv-events: true # boolean flags
# Environment variables for frontend processes
env:
MY_VAR: "value"

| Field | Type | Default | Description |
|---|---|---|---|
| `type` | str | dynamo | Frontend type: "dynamo" or "sglang" |
| `enable_multiple_frontends` | bool | true | Scale with nginx + multiple routers |
| `num_additional_frontends` | int | 9 | Additional routers beyond master |
| `nginx_container` | str | nginx:1.27.4 | Custom nginx container image |
| `args` | dict | null | CLI args for the frontend |
| `env` | dict | null | Env vars for frontend processes |
See SGLang Router for detailed architecture.
Worker configuration and SGLang settings.
backend:
type: sglang # Backend type (currently only sglang)
# Per-mode environment variables
prefill_environment:
TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
decode_environment:
TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
aggregated_environment: {}
# SGLang CLI config per mode
sglang_config:
prefill:
tensor-parallel-size: 4
mem-fraction-static: 0.84
kv-cache-dtype: "fp8_e4m3"
disaggregation-mode: "prefill"
# ... any sglang CLI flag
decode:
tensor-parallel-size: 8
mem-fraction-static: 0.83
data-parallel-size: 8
enable-dp-attention: true
aggregated:
# ... for aggregated mode
# KV events (for kv-aware routing)
kv_events_config:
prefill: true # Enable for prefill workers
decode: true # Enable for decode workers

| Field | Type | Default | Description |
|---|---|---|---|
| `type` | string | sglang | Backend type: "sglang" or "trtllm" |
| `gpu_type` | string | null | GPU type override |
| `prefill_environment` | dict | {} | Environment variables for prefill |
| `decode_environment` | dict | {} | Environment variables for decode |
| `aggregated_environment` | dict | {} | Environment variables for aggregated |
| `sglang_config` | object | null | SGLang CLI configuration per mode |
| `kv_events_config` | bool/dict | null | KV events configuration |
Per-mode SGLang server configuration. Any SGLang CLI flag can be specified (use kebab-case or snake_case):
| Common Flags | Type | Description |
|---|---|---|
| `tensor-parallel-size` | int | Tensor parallelism degree |
| `data-parallel-size` | int | Data parallelism degree |
| `expert-parallel-size` | int | Expert parallelism (MoE models) |
| `mem-fraction-static` | float | GPU memory fraction (0.0-1.0) |
| `kv-cache-dtype` | string | KV cache precision (fp8_e4m3, etc.) |
| `context-length` | int | Max context length |
| `chunked-prefill-size` | int | Chunked prefill batch size |
| `enable-dp-attention` | bool | Enable DP attention |
| `disaggregation-mode` | string | "prefill" or "decode" |
| `disaggregation-transfer-backend` | string | Transfer backend ("nixl" or other) |
| `served-model-name` | string | Model name for API |
| `grpc-mode` | bool | Enable gRPC mode |
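Since either naming style is accepted, the two spellings below should be equivalent (a minimal sketch; values are illustrative):

```yaml
sglang_config:
  decode:
    tensor-parallel-size: 8     # kebab-case
    mem_fraction_static: 0.83   # snake_case, same flag as mem-fraction-static
```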
Note: KV events is a Dynamo frontend feature for kv-aware routing. It allows workers to publish cache/scheduling information over ZMQ for the Dynamo router to make intelligent routing decisions.
Enables --kv-events-config for workers with auto-allocated ZMQ ports.
# Enable with defaults
kv_events_config: true # prefill+decode with publisher=zmq, topic=kv-events
# Per-mode control
kv_events_config:
prefill: true
decode: true
aggregated: true # Enable for aggregated workers
# Custom settings
kv_events_config:
prefill:
publisher: "zmq"
topic: "prefill-events"
decode:
topic: "decode-events" # publisher defaults to "zmq"
aggregated: true # Enable for aggregated mode

Each worker leader gets a globally unique port starting at 5550:
| Worker | Port |
|---|---|
| prefill_0 | 5550 |
| prefill_1 | 5551 |
| decode_0 | 5552 |
| decode_1 | 5553 |
When using type: trtllm, the backend uses TRTLLM with MPI-style launching:
backend:
type: trtllm
# Per-mode environment variables
prefill_environment:
CUDA_LAUNCH_BLOCKING: "1"
decode_environment:
CUDA_LAUNCH_BLOCKING: "1"
# TRTLLM CLI config per mode
trtllm_config:
prefill:
mem-fraction-static: 0.8
chunked-prefill-size: 8192
decode:
mem-fraction-static: 0.9

| Field | Type | Default | Description |
|---|---|---|---|
| `type` | string | - | Must be "trtllm" |
| `prefill_environment` | dict | {} | Environment variables for prefill |
| `decode_environment` | dict | {} | Environment variables for decode |
| `trtllm_config` | object | null | TRTLLM CLI configuration per mode |
Key differences from SGLang backend:
- No aggregated mode support (prefill/decode only)
- Uses MPI-style launching (one srun per endpoint with all nodes)
- Uses `trtllm-llmapi-launch` for distributed launching
- Automatically sets `TRTLLM_EPLB_SHM_NAME` with a unique UUID per endpoint
Benchmark configuration. The type field determines which benchmark runner is used and what additional fields are available.
| Type | Description |
|---|---|
| `manual` | No benchmark (default), manual testing mode |
| `sa-bench` | Throughput/latency serving benchmark |
| `sglang-bench` | SGLang bench_serving benchmark |
| `mmlu` | MMLU accuracy evaluation |
| `longbenchv2` | Long-context evaluation benchmark |
| `router` | Router performance with prefix caching |
| `mooncake-router` | KV-aware routing with Mooncake trace |
No benchmark is run. Use for manual testing and debugging.
benchmark:
type: "manual"

Throughput and latency benchmark at various concurrency levels.
benchmark:
type: "sa-bench"
isl: 1024 # Required: Input sequence length
osl: 1024 # Required: Output sequence length
concurrencies: [256, 512] # Required: Concurrency levels to test
req_rate: "inf" # Optional: Request rate (default: "inf")

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `isl` | int | Yes | - | Input sequence length |
| `osl` | int | Yes | - | Output sequence length |
| `concurrencies` | list/string | Yes | - | Concurrency levels (list or "NxM" format) |
| `req_rate` | string/int | No | "inf" | Request rate |
Concurrencies format: Can be a list [128, 256, 512] or x-separated string "128x256x512".
SGLang bench_serving benchmark at various concurrency levels.
benchmark:
type: "sglang-bench"
isl: 1024 # Required: Input sequence length
osl: 1024 # Required: Output sequence length
concurrencies: [256, 512] # Required: Concurrency levels to test
req_rate: "inf" # Optional: Request rate (default: "inf")

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `isl` | int | Yes | - | Input sequence length |
| `osl` | int | Yes | - | Output sequence length |
| `concurrencies` | list/string | Yes | - | Concurrency levels (list or "NxM" format) |
| `req_rate` | string/int | No | "inf" | Request rate |
Concurrencies format: Can be a list [128, 256, 512] or x-separated string "128x256x512".
MMLU accuracy evaluation using sglang.test.run_eval.
benchmark:
type: "mmlu"
num_examples: 200 # Optional: Number of examples
max_tokens: 2048 # Optional: Max tokens per response
repeat: 8 # Optional: Number of repeats
num_threads: 512 # Optional: Concurrent threads

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `num_examples` | int | No | 200 | Number of examples to run |
| `max_tokens` | int | No | 2048 | Max tokens per response |
| `repeat` | int | No | 8 | Number of repeats |
| `num_threads` | int | No | 512 | Concurrent threads |
Long-context evaluation benchmark.
benchmark:
type: "longbenchv2"
max_context_length: 128000 # Optional: Max context length
num_threads: 16 # Optional: Concurrent threads
max_tokens: 16384 # Optional: Max tokens
num_examples: null # Optional: Number of examples (all if null)
categories: # Optional: Task categories
- "multi_doc_qa"
- "single_doc_qa"

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `max_context_length` | int | No | 128000 | Max context length |
| `num_threads` | int | No | 16 | Concurrent threads |
| `max_tokens` | int | No | 16384 | Max tokens |
| `num_examples` | int | No | all | Number of examples |
| `categories` | list[str] | No | all | Task categories to run |
Router performance benchmark with prefix caching. Requires frontend.type: sglang.
benchmark:
type: "router"
isl: 14000 # Optional: Input sequence length
osl: 200 # Optional: Output sequence length
num_requests: 200 # Optional: Number of requests
concurrency: 20 # Optional: Concurrency level
prefix_ratios: [0.1, 0.3, 0.5, 0.7, 0.9] # Optional: Prefix ratios to test

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `isl` | int | No | 14000 | Input sequence length |
| `osl` | int | No | 200 | Output sequence length |
| `num_requests` | int | No | 200 | Number of requests |
| `concurrency` | int | No | 20 | Concurrency level |
| `prefix_ratios` | list/string | No | "0.1 0.3 0.5 0.7 0.9" | Prefix ratios to test |
KV-aware routing benchmark using Mooncake conversation trace.
benchmark:
type: "mooncake-router"
mooncake_workload: "conversation" # Optional: Trace type
ttft_threshold_ms: 2000 # Optional: Goodput TTFT threshold
itl_threshold_ms: 25 # Optional: Goodput ITL threshold

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `mooncake_workload` | string | No | "conversation" | Trace type (see options below) |
| `ttft_threshold_ms` | int | No | 2000 | Goodput TTFT threshold in ms |
| `itl_threshold_ms` | int | No | 25 | Goodput ITL threshold in ms |
Workload options: "mooncake", "conversation", "synthetic", "toolagent"
Dataset characteristics (conversation trace):
- 12,031 requests over ~59 minutes (3.4 req/s)
- Avg input: 12,035 tokens, Avg output: 343 tokens
- 36.64% cache efficiency potential
Dynamo installation configuration.
dynamo:
version: "0.8.0" # Install from PyPI
# OR
hash: "abc123" # Install from git commit
# OR
top_of_tree: true # Install from main branch

| Field | Type | Default | Description |
|---|---|---|---|
| `install` | bool | true | Whether to install dynamo (set false if pre-installed) |
| `version` | string | "0.8.0" | PyPI version |
| `hash` | string | null | Git commit hash (source install) |
| `top_of_tree` | bool | false | Install from main branch |
Notes:
- Set `install: false` if your container already has dynamo pre-installed.
- Only one of `version`, `hash`, or `top_of_tree` should be specified. `hash` and `top_of_tree` are mutually exclusive.
- When `hash` or `top_of_tree` is set, `version` is automatically cleared.
- Source installs (`hash` or `top_of_tree`) clone the repo and build with maturin.
Profiling configuration for nsys or torch profiler.
profiling:
type: "nsys" # "none", "nsys", or "torch"
# Phase-specific profiling step configs
prefill:
start_step: 10 # Step to start profiling
stop_step: 20 # Step to stop profiling
decode:
start_step: 10
stop_step: 20
# OR for aggregated mode:
aggregated:
start_step: 10
stop_step: 20

| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `type` | string | No | "none" | Profiling type: "none", "nsys", "torch" |
| `prefill` | object | Disaggregated | null | Prefill phase config |
| `decode` | object | Disaggregated | null | Decode phase config |
| `aggregated` | object | Aggregated | null | Aggregated phase config |
Each phase config has:
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| `start_step` | int | No | null | Step to start profiling |
| `stop_step` | int | No | null | Step to stop profiling |
- nsys: NVIDIA Nsight Systems profiling. Wraps worker command with `nsys profile`.
- torch: PyTorch profiler. Sets `SGLANG_TORCH_PROFILER_DIR` environment variable.

- Disaggregated mode requires both `prefill` and `decode` phase configs when profiling is enabled.
- Aggregated mode requires `aggregated` phase config when profiling is enabled.
resources:
gpu_type: "h100"
prefill_nodes: 1
prefill_workers: 1
decode_nodes: 1
decode_workers: 1
profiling:
type: "torch"
prefill:
start_step: 5
stop_step: 15
decode:
start_step: 5
stop_step: 15

resources:
gpu_type: "h100"
agg_nodes: 1
agg_workers: 1
profiling:
type: "nsys"
aggregated:
start_step: 10
stop_step: 25

Output configuration with formattable paths.
output:
log_dir: "./outputs/{job_id}/logs"

| Field | Type | Default | Description |
|---|---|---|---|
| `log_dir` | FormattablePath | "./outputs/{job_id}/logs" | Directory for log files |
The log_dir supports FormattablePath templating. See FormattablePath Template System.
Health check configuration for worker readiness.
health_check:
max_attempts: 180
interval_seconds: 10

| Field | Type | Default | Description |
|---|---|---|---|
| `max_attempts` | int | 180 | Maximum health check attempts (180 = 30 minutes) |
| `interval_seconds` | int | 10 | Seconds between health check attempts |
Notes:
- Default of 180 attempts at 10 second intervals = 30 minutes total wait time.
- Large models (e.g., 70B+ parameters) may require the full 30 minutes to load.
- Reduce `max_attempts` for smaller models or faster testing.
Infrastructure configuration for etcd/nats placement.
infra:
etcd_nats_dedicated_node: true

| Field | Type | Default | Description |
|---|---|---|---|
| `etcd_nats_dedicated_node` | bool | false | Reserve first node for infrastructure services |
Notes:
- When `etcd_nats_dedicated_node: true`, the first allocated node is reserved exclusively for etcd and nats services.
- This can improve stability for large-scale deployments by isolating infrastructure services.
- The reserved node is not used for worker processes.
Parameter sweep configuration for running multiple benchmark variations.
sweep:
mode: "zip" # "zip" or "grid"
parameters:
isl: [512, 1024, 2048]
osl: [128, 256, 512]

| Field | Type | Default | Description |
|---|---|---|---|
| `mode` | string | "zip" | Sweep mode: "zip" or "grid" |
| `parameters` | dict | {} | Parameter name to list of values mapping |
- `zip`: Pairs up parameters at matching indices. Parameters must have equal lengths.
  - Example: `isl=[512, 1024], osl=[128, 256]` produces 2 combinations: `{isl: 512, osl: 128}`, `{isl: 1024, osl: 256}`
- `grid`: Cartesian product of all parameter values.
  - Example: `isl=[512, 1024], osl=[128, 256]` produces 4 combinations: `{isl: 512, osl: 128}`, `{isl: 512, osl: 256}`, `{isl: 1024, osl: 128}`, `{isl: 1024, osl: 256}`
Reference sweep parameters in your config using {placeholder} syntax:
benchmark:
type: "sa-bench"
isl: "{isl}" # Replaced by sweep value
osl: "{osl}" # Replaced by sweep value
concurrencies: [128, 256]
sweep:
mode: "grid"
parameters:
isl: [512, 1024, 2048, 4096]
osl: [128, 256, 512]

Config overrides let you define a base config plus multiple variants in a single YAML file. Each variant deep-merges a small set of changes onto the base, and is submitted as an independent SLURM job. This eliminates the need to duplicate entire config files when testing different parameter combinations.
base:
name: "my-benchmark"
resources:
decode_nodes: 8
backend:
sglang_config:
decode:
tp-size: 32
benchmark:
concurrencies: [8192, 10240]
override_tp64:
backend:
sglang_config:
decode:
tp-size: 64
override_small:
resources:
decode_nodes: 4
benchmark:
concurrencies: [4096]

| Key | Description |
|---|---|
| `base` | Required. A complete, valid config (same structure as a normal recipe). |
| `override_<suffix>` | Optional. Partial config merged onto base. `<suffix>` is appended to the job name. |
Override job names are auto-generated: {base.name}_{suffix}.
The example above produces three jobs: my-benchmark, my-benchmark_tp64, and my-benchmark_small.
| Type | Behavior | Example |
|---|---|---|
| Scalar (str/int/bool) | Override replaces base | tp-size: 32 → tp-size: 64 |
| Dict | Recursive merge — only specified keys change | Override sglang_config.decode.tp-size: 64 leaves other decode keys untouched |
| List | Full replacement (no append) | concurrencies: [4096] replaces [8192, 10240] |
| New key | Added to base | Override adds fields base doesn't have |
| `null` value | Deletes the key from base | `extra_mount: null` removes it |
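As an illustration, deep-merging `override_small` from the example above onto `base` yields roughly the following effective config for the job `my-benchmark_small` (a sketch; unchanged fields omitted):

```yaml
name: "my-benchmark_small"   # suffix appended automatically
resources:
  decode_nodes: 4            # scalar replaced by the override
backend:
  sglang_config:
    decode:
      tp-size: 32            # untouched: not specified in the override
benchmark:
  concurrencies: [4096]      # list fully replaced, not appended
```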
Overrides and sweeps can coexist in the same file. Override expansion happens first, then each variant with a sweep: section is expanded via Cartesian product.
base:
name: "combined"
sweep:
chunked_prefill_size: [4096, 8192]
backend:
sglang_config:
prefill:
chunked-prefill-size: "{chunked_prefill_size}"
override_big:
resources:
decode_nodes: 16

This produces 4 jobs: base × 2 sweep + override_big × 2 sweep.
Files without a base top-level key are treated as normal configs — no behavior change.
FormattablePath is a powerful templating system for paths that supports runtime placeholders and environment variable expansion.
FormattablePath ensures that configuration values with placeholders are always explicitly formatted before use, preventing accidental use of unformatted templates.
# Example usage in config
output:
log_dir: "$HOME/logs/{job_id}/{run_name}"
container_mounts:
"$HOME/data": "/data"
"$HOME/logs/{job_id}": "/logs"

| Placeholder | Type | Description | Example |
|---|---|---|---|
| `{job_id}` | string | SLURM job ID | "12345" |
| `{run_name}` | string | Job name + job ID | "my-benchmark_12345" |
| `{head_node_ip}` | string | IP address of head node | "10.0.0.1" |
| `{log_dir}` | string | Resolved log directory path | "/home/user/outputs/12345/logs" |
| `{model_path}` | string | Resolved model path | "/models/deepseek-r1" |
| `{container_image}` | string | Resolved container image path | "/containers/sglang.sqsh" |
| `{gpus_per_node}` | int | GPUs per node | 8 |
FormattablePath also expands environment variables using $VAR or ${VAR} syntax:
output:
log_dir: "$HOME/outputs/{job_id}/logs"
# Expands to: /home/username/outputs/12345/logs

Common environment variables:
- `$HOME` - User home directory
- `$USER` - Username
- `$SLURM_JOB_ID` - SLURM job ID (also available as `{job_id}`)
Some contexts support additional placeholders:
| Placeholder | Context | Description |
|---|---|---|
| `{nginx_url}` | Frontend config | Nginx URL for load balancing |
| `{frontend_url}` | Frontend config | Frontend/router URL |
| `{index}` | Worker config | Worker index |
| `{host}` | Worker config | Worker host |
| `{port}` | Worker config | Worker port |
# Log directory with job ID
output:
log_dir: "./outputs/{job_id}/logs"
# Mount user data into container
container_mounts:
"$HOME/datasets": "/datasets"
"./outputs/{job_id}": "/outputs"
# Custom paths with environment variables
extra_mount:
- "$SCRATCH/cache:/cache"
- "${DATA_DIR}/models:/models:ro"

Custom container mount mappings with FormattablePath support.
container_mounts:
"$HOME/datasets": "/datasets"
"$HOME/outputs/{job_id}": "/outputs"
"/shared/cache": "/cache"

| Key (Host Path) | Value (Container Path) | Description |
|---|---|---|
| FormattablePath | FormattablePath | Host path -> Container mount path |
Both keys and values support FormattablePath templating with placeholders and environment variables.
The following mounts are always added automatically:
| Host Path | Container Path | Description |
|---|---|---|
| Model path | `/model` | Resolved model directory |
| Log directory | `/logs` | Log output directory |
| `configs/` directory | `/configs` | NATS, etcd binaries |
| Benchmark scripts | `/srtctl-benchmarks` | Bundled benchmark scripts |
You can also define cluster-wide mounts in srtslurm.yaml using the default_mounts field. These are applied to all jobs on the cluster, after the built-in defaults but before job-level mounts.
# In srtslurm.yaml
default_mounts:
"/cluster/special/libs": "/opt/libs"
"$SCRATCH": "/scratch"

Environment variables (e.g., $SCRATCH, $HOME) are expanded. This is useful for mounting cluster-specific paths that are required by certain images without adding them to every job config.
Mounts have the following priority (highest to lowest):
- Job-level `container_mounts` - FormattablePath dict (highest priority)
- Job-level `extra_mount` - simple `host:container` strings
- Cluster-level - `default_mounts` from `srtslurm.yaml`
- Built-in defaults - model, logs, configs, benchmark scripts (lowest priority)
Job-level mounts always take precedence over cluster-level and built-in defaults.
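For example, if a cluster-level mount and a job-level mount both target `/cache`, the job-level entry wins per the priority order above (paths are illustrative):

```yaml
# srtslurm.yaml (cluster-level)
default_mounts:
  "/cluster/cache": "/cache"

# job config: container_mounts takes precedence for /cache
container_mounts:
  "$SCRATCH/my-cache": "/cache"
```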
Global environment variables for all worker processes.
environment:
MY_VAR: "value"
CUDA_LAUNCH_BLOCKING: "1"
NCCL_DEBUG: "INFO"

| Key | Value | Description |
|---|---|---|
| string | string | Environment variable name=value |

Environment variable values support per-worker templating with these placeholders:

| Placeholder | Description | Example |
|---|---|---|
| `{node}` | Hostname of the node where the worker runs | "gpu-01" |
| `{node_id}` | Numeric index of the node in worker list (0-based) | 0, 1, 2 |
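For instance, a value can combine both placeholders (the variable name below is hypothetical):

```yaml
environment:
  WORKER_LABEL: "{node}-{node_id}"   # e.g. "gpu-01-0" on the first worker node
```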
Note: For per-worker-mode environment variables, use backend.prefill_environment, backend.decode_environment, or backend.aggregated_environment.
Additional container mounts as a list of mount specifications.
extra_mount:
- "/local/path:/container/path"
- "/data:/data:ro"
- "$HOME/cache:/cache"

| Format | Description |
|---|---|
| `host_path:container_path` | Read-write mount |
| `host_path:container_path:ro` | Read-only mount |
Note: Unlike container_mounts, extra_mount uses simple string format, not FormattablePath. Environment variables are still expanded.
Additional SLURM sbatch directives.
sbatch_directives:
mail-user: "user@example.com"
mail-type: "END,FAIL"
comment: "Benchmark run for paper"
reservation: "my-reservation"
constraint: "volta"
exclusive: "" # Flag without value
gres: "gpu:8"

| Directive | Example Value | Description |
|---|---|---|
| `mail-user` | "user@example.com" | Email for notifications |
| `mail-type` | "END,FAIL" | When to send email (BEGIN,END,FAIL) |
| `comment` | "My job description" | Job comment for tracking |
| `reservation` | "my-reservation" | Use a specific reservation |
| `constraint` | "volta" | Node feature constraint |
| `exclusive` | "" | Exclusive node access (flag) |
| `gres` | "gpu:8" | Generic resource specification |
| `dependency` | "afterok:12345" | Job dependency |
| `qos` | "high" | Quality of service |
Format: Each directive becomes #SBATCH --{key}={value} or #SBATCH --{key} if value is empty.
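So the directives from the example above would render along these lines (illustrative):

```bash
#SBATCH --mail-user=user@example.com
#SBATCH --mail-type=END,FAIL
#SBATCH --exclusive
#SBATCH --gres=gpu:8
```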
Additional srun options for worker processes.
srun_options:
cpu-bind: "none"
mpi: "pmix"
overlap: "" # Flag without value
ntasks-per-node: "1"

| Option | Example Value | Description |
|---|---|---|
| `cpu-bind` | "none" | CPU binding mode (none, cores, sockets) |
| `mpi` | "pmix" | MPI implementation |
| `overlap` | "" | Allow step overlap (flag) |
| `ntasks-per-node` | "1" | Tasks per node |
| `gpus-per-task` | "1" | GPUs per task |
| `mem` | "0" | Memory per node |
Format: Each option becomes --{key}={value} or --{key} if value is empty.
Run a custom script before dynamo install and worker startup.
setup_script: "install-custom-deps.sh"

| Field | Type | Default | Description |
|---|---|---|---|
| `setup_script` | string | null | Script filename (must be in configs/) |
Notes:
- Script must be located in the `configs/` directory.
- Script runs inside the container before dynamo installation.
- Useful for installing custom SGLang versions, additional dependencies, or patches.
Example setup script (configs/install-sglang-main.sh):
#!/bin/bash
pip install --quiet git+https://github.com/sgl-project/sglang.git

Enable dumping worker configuration to JSON for debugging.
enable_config_dump: true

| Field | Type | Default | Description |
|---|---|---|---|
| `enable_config_dump` | bool | true | Dump config JSON for debugging |
When enabled, worker startup commands include --dump-config-to which writes the resolved configuration to a JSON file.
name: "deepseek-r1-disagg"
model:
path: "deepseek-r1"
container: "0.5.6"
precision: "fp8"
resources:
gpu_type: "gb200"
gpus_per_node: 4
prefill_nodes: 2
prefill_workers: 4
decode_nodes: 4
decode_workers: 8
slurm:
time_limit: "04:00:00"
frontend:
type: dynamo
enable_multiple_frontends: true
args:
router-mode: "kv"
backend:
type: sglang
kv_events_config:
prefill: true
prefill_environment:
TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
decode_environment:
TORCH_DISTRIBUTED_DEFAULT_TIMEOUT: "1800"
sglang_config:
prefill:
tensor-parallel-size: 4
mem-fraction-static: 0.84
kv-cache-dtype: "fp8_e4m3"
decode:
tensor-parallel-size: 8
mem-fraction-static: 0.83
data-parallel-size: 8
benchmark:
type: "sa-bench"
isl: 1024
osl: 1024
concurrencies: [128, 256, 512]
health_check:
max_attempts: 180
interval_seconds: 10
dynamo:
version: "0.8.0"

name: "qwen-agg-router"
model:
path: "qwen3-32b"
container: "latest"
precision: "bf16"
resources:
gpu_type: "h100"
gpus_per_node: 8
agg_nodes: 4
agg_workers: 8
slurm:
time_limit: "02:00:00"
frontend:
type: sglang
enable_multiple_frontends: false
args:
policy: "cache_aware"
backend:
type: sglang
sglang_config:
aggregated:
tensor-parallel-size: 4
mem-fraction-static: 0.9
enable-dp-attention: true
benchmark:
type: "router"
isl: 14000
osl: 200
num_requests: 200
prefix_ratios: [0.1, 0.3, 0.5, 0.7, 0.9]

name: "profile-decode"
model:
path: "llama-70b"
container: "latest"
precision: "fp8"
resources:
gpu_type: "h100"
gpus_per_node: 8
prefill_nodes: 1
prefill_workers: 1
decode_nodes: 1
decode_workers: 1
slurm:
time_limit: "01:00:00"
profiling:
type: "torch"
prefill:
start_step: 5
stop_step: 15
decode:
start_step: 5
stop_step: 15
backend:
type: sglang
sglang_config:
prefill:
tensor-parallel-size: 8
decode:
tensor-parallel-size: 8
benchmark:
type: "sa-bench"
isl: 2048
osl: 256
concurrencies: "32x64"
req_rate: "inf"

name: "sweep-throughput"
model:
path: "deepseek-r1"
container: "latest"
precision: "fp8"
resources:
gpu_type: "gb200"
gpus_per_node: 4
prefill_nodes: 1
prefill_workers: 2
decode_nodes: 2
decode_workers: 4
benchmark:
type: "sa-bench"
isl: "{isl}"
osl: "{osl}"
concurrencies: [64, 128, 256]
sweep:
mode: "grid"
parameters:
isl: [512, 1024, 2048, 4096]
osl: [128, 256, 512, 1024]

base:
name: "disagg-fp8-benchmark"
model:
path: "deepseek-r1"
container: "latest"
precision: "fp8"
resources:
gpu_type: "h100"
gpus_per_node: 8
prefill_nodes: 2
prefill_workers: 2
decode_nodes: 8
decode_workers: 8
backend:
sglang_config:
prefill:
tp-size: 8
decode:
tp-size: 8
benchmark:
type: "sa-bench"
isl: 1024
osl: 8192
concurrencies: [8192, 10240]
# Use TP=64 for both prefill and decode
override_tp64:
backend:
sglang_config:
prefill:
tp-size: 64
decode:
tp-size: 64
# Smaller cluster with fewer decode nodes
override_small:
resources:
decode_nodes: 4
decode_workers: 4
benchmark:
concurrencies: [4096]

name: "custom-setup"
model:
path: "$MODELS_DIR/my-model"
container: "$CONTAINERS_DIR/custom.sqsh"
precision: "fp8"
resources:
gpu_type: "h100"
gpus_per_node: 8
agg_nodes: 2
agg_workers: 4
setup_script: "install-custom-sglang.sh"
environment:
CUSTOM_VAR: "value"
NCCL_DEBUG: "INFO"
container_mounts:
"$HOME/datasets": "/datasets"
"$SCRATCH/cache": "/cache"
extra_mount:
- "/shared/data:/data:ro"
sbatch_directives:
mail-user: "user@example.com"
mail-type: "END,FAIL"
reservation: "gpu-cluster"
srun_options:
cpu-bind: "none"
output:
log_dir: "$HOME/experiments/{job_id}/logs"
health_check:
max_attempts: 120
interval_seconds: 15