Development guide for working on this codebase.
# Run lint + tests (recommended)
make check
# Just lint
make lint
# Just tests
make test
# Run single test file
uv run pytest tests/test_e2e.py -v
# Run single test
uv run pytest tests/test_e2e.py::TestH100Cluster::test_endpoint_allocation -v
# Auto-fix lint issues
uv run ruff check --fix src/srtctl/
uv run ruff format src/srtctl/

- Python 3.10+ - use modern syntax (| unions, match statements)
- Ruff for linting and formatting (config in pyproject.toml)
- Type hints everywhere - use ty for type checking
- Frozen dataclasses for configs (@dataclass(frozen=True))
- Line length: 120 characters
Follow these patterns when extending the codebase:
- Frozen dataclasses for config - Use @dataclass(frozen=True) for all configuration objects. Immutability prevents accidental mutation and makes code easier to reason about.
- Protocol over ABC - Prefer typing.Protocol for interface definitions (see BackendProtocol). This enables duck typing without inheritance coupling.
- marshmallow_dataclass for validation - Combine dataclasses with marshmallow schemas for type-safe config loading with validation. Custom fields (e.g., BackendConfigField) handle polymorphic deserialization.
- Factory classmethods - Use a @classmethod named from_* for construction (e.g., RuntimeContext.from_config(), RunMetadata.from_json()). Keep __init__ simple.
- TYPE_CHECKING guard - Import type-only dependencies under if TYPE_CHECKING: to avoid circular imports. Use string annotations for forward refs.
- Computed properties - Use @property for derived values instead of storing computed state. See ResourceConfig.gpus_per_prefill, RunMetadata.topology_label.
- Registry pattern - Use decorators for extensible registration (@register_benchmark("sa-bench")). New implementations just decorate and import.
- TypedDict for external data - Use TypedDict for typing dicts from JSON/external sources where you can't control the structure.
- Single source of truth - Create context objects (like RuntimeContext) that compute all derived paths/values once at startup rather than recomputing.
- Testing - Add a new test with every significant feature change.
Single source of truth for computed paths. Created once at job start:
runtime = RuntimeContext.from_config(config, job_id)
runtime.log_dir # /path/to/logs/12345_1P_4D_...
runtime.head_node_ip # 10.0.0.1
runtime.container_mounts # List of mount strings

Maps logical workers to physical nodes/GPUs:
endpoints = allocate_endpoints(
    num_prefill=2, num_decode=4, num_agg=0,
    gpus_per_prefill=8, gpus_per_decode=4, gpus_per_agg=0,
    gpus_per_node=8,
    available_nodes=("node0", "node1", "node2"),
)
# Returns List[Endpoint] with node assignments and GPU indices

Two patterns for checking worker readiness:
# Dynamo backend
check_dynamo_health(response_json, expected_prefill=2, expected_decode=4)
# SGLang router
check_sglang_router_health(response_json, expected_prefill=2, expected_decode=4)

For aggregated mode, pass expected_prefill=0, expected_decode=num_agg.
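The general shape of such a readiness check might look like the sketch below. The payload structure, count_ready_workers, and is_ready are all hypothetical. The real Dynamo and SGLang router responses may be shaped quite differently, so treat this only as an illustration of comparing observed counts against expected counts.

```python
from typing import Any


def count_ready_workers(response_json: dict[str, Any]) -> dict[str, int]:
    """Count healthy workers per role.

    Assumed payload shape (hypothetical, not the real Dynamo or SGLang schema):
    {"workers": [{"role": "prefill", "healthy": true}, ...]}
    """
    counts = {"prefill": 0, "decode": 0}
    for worker in response_json.get("workers", []):
        role = worker.get("role")
        if worker.get("healthy") and role in counts:
            counts[role] += 1
    return counts


def is_ready(response_json: dict[str, Any], expected_prefill: int, expected_decode: int) -> bool:
    """Return True once at least the expected number of workers per role is healthy."""
    counts = count_ready_workers(response_json)
    return counts["prefill"] >= expected_prefill and counts["decode"] >= expected_decode
```

With expected_prefill=0 the prefill condition is always satisfied, which is why the aggregated case only needs the decode count to reach num_agg.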
Optional fire-and-forget HTTP status reporting to external APIs. Configure in srtslurm.yaml:
# Cluster-level config (srtslurm.yaml)
cluster: "bruh" # Cluster name for dashboard display
reporting:
  status:
    endpoint: "test-endpoint.com"

StatusReporter - Used in do_sweep.py to report job lifecycle:
from srtctl.core.status import StatusReporter, JobStatus, JobStage
reporter = StatusReporter.from_config(config.reporting, job_id)
reporter.report_started(runtime) # Job started with metadata
reporter.report(JobStatus.WORKERS_READY, JobStage.WORKERS, "All workers healthy")
reporter.report_completed(exit_code) # Final status

Status lifecycle:
submitted → starting → head_ready → workers_starting → workers_ready
→ frontend_starting → frontend_ready → benchmark → completed | failed
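One way to reason about this lifecycle in code is an ordered enum with a transition check, sketched below. LifecycleStatus and is_valid_transition are hypothetical names (the real enum is JobStatus), and the rules encoded here (stages may be skipped forward, failed is reachable from any non-terminal state) are assumptions rather than documented behavior.

```python
from enum import Enum


class LifecycleStatus(str, Enum):
    """Hypothetical mirror of the lifecycle above; the real enum is JobStatus."""

    SUBMITTED = "submitted"
    STARTING = "starting"
    HEAD_READY = "head_ready"
    WORKERS_STARTING = "workers_starting"
    WORKERS_READY = "workers_ready"
    FRONTEND_STARTING = "frontend_starting"
    FRONTEND_READY = "frontend_ready"
    BENCHMARK = "benchmark"
    COMPLETED = "completed"
    FAILED = "failed"


TERMINAL = frozenset({LifecycleStatus.COMPLETED, LifecycleStatus.FAILED})
ORDER = list(LifecycleStatus)  # Enum iteration follows definition order


def is_valid_transition(current: LifecycleStatus, new: LifecycleStatus) -> bool:
    """Allow only forward moves; terminal states accept no further transitions."""
    if current in TERMINAL:
        return False
    if new is LifecycleStatus.FAILED:
        return True  # Assumed: a job can fail from any non-terminal state
    return ORDER.index(new) > ORDER.index(current)
```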
create_job_record() - Standalone function for job submission:
from srtctl.core.status import create_job_record
# Called in submit.py after sbatch succeeds
create_job_record(
    reporting=config.reporting,
    job_id=job_id,
    job_name=config.name,
    cluster=get_srtslurm_setting("cluster"),
    recipe=str(config_path),
    metadata=metadata, # Tags go in metadata["tags"]
)

Key behaviors:
- All HTTP requests have 5-second timeout
- Failures are logged at DEBUG and silently ignored
- Job execution is never blocked by status reporting
- Tags are passed via metadata["tags"] (not a separate field)
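The behaviors listed above (fixed timeout, failures logged at DEBUG and swallowed, caller never blocked by an exception) can be sketched with the standard library alone. This is a minimal sketch, not the StatusReporter implementation: post_status, REPORT_TIMEOUT_SECONDS, and the injectable opener parameter are hypothetical, and urllib is used here only to keep the example self-contained.

```python
import json
import logging
import urllib.request
from typing import Any, Callable

logger = logging.getLogger(__name__)

REPORT_TIMEOUT_SECONDS = 5.0  # Mirrors the 5-second timeout described above


def post_status(
    url: str,
    payload: dict[str, Any],
    opener: Callable[..., Any] = urllib.request.urlopen,
) -> bool:
    """POST a JSON payload; return False instead of raising on any failure."""
    try:
        request = urllib.request.Request(
            url,
            data=json.dumps(payload).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with opener(request, timeout=REPORT_TIMEOUT_SECONDS) as response:
            return 200 <= response.status < 300
    except Exception as exc:  # Deliberately broad: reporting must never break the job
        logger.debug("Status report to %s failed: %s", url, exc)
        return False
```

Returning a boolean rather than raising keeps the call site to a single line, and the opener parameter lets tests substitute a fake transport without touching the network.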
Controls infrastructure placement (etcd/nats):
infra:
  etcd_nats_dedicated_node: true # Reserve first node for infra services

Supports explicit GPUs per worker (overrides computed values):
resources:
  gpu_type: "gb200"
  prefill_nodes: 2
  prefill_workers: 4
  decode_nodes: 4
  decode_workers: 8
  gpus_per_prefill: 4 # Optional: explicit override
  gpus_per_decode: 2 # Optional: explicit override

Tests are located in tests/. Run make check to run lint + all tests.
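import os
from unittest.mock import patch


class H100Rack:
    """Simulated 13-node H100 rack used as a test fixture."""

    NUM_NODES = 13
    GPUS_PER_NODE = 8

    @classmethod
    def slurm_env(cls):
        return {
            "SLURM_JOB_ID": "12345",
            "SLURM_NODELIST": "h100-[01-13]",
            # ... (remaining entries elided)
        }


# Patch the SLURM environment and scontrol calls for the duration of the test
with patch.dict(os.environ, H100Rack.slurm_env()):
    with patch("subprocess.run", H100Rack.mock_scontrol()):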
        # Test code here

- Create backends/mybackend.py with a dataclass implementing BackendProtocol
- Implement required methods:
  - get_srun_config() - MPI settings and launch strategy
  - get_config_for_mode(mode) - Mode-specific configuration
  - get_environment_for_mode(mode) - Environment variables
  - allocate_endpoints() - Logical worker allocation
  - endpoints_to_processes() - Physical process mapping
  - build_worker_command(process, runtime) - Command construction
- Export from backends/__init__.py
- Add polymorphic deserialization in BackendConfigField in schema.py
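The first two steps can be sketched as a Protocol plus a frozen dataclass that satisfies it structurally. BackendSketch is a reduced, hypothetical version of the interface covering only two of the methods listed above, and the signatures and return types shown are assumptions. Check BackendProtocol itself for the real ones.

```python
from dataclasses import dataclass, field
from typing import Protocol, runtime_checkable


@runtime_checkable
class BackendSketch(Protocol):
    """Reduced, hypothetical version of the interface; signatures are assumptions."""

    def get_config_for_mode(self, mode: str) -> dict[str, str]: ...

    def get_environment_for_mode(self, mode: str) -> dict[str, str]: ...


@dataclass(frozen=True)
class MyBackend:
    """Satisfies BackendSketch structurally, with no inheritance."""

    prefill_environment: dict[str, str] = field(default_factory=dict)
    decode_environment: dict[str, str] = field(default_factory=dict)

    def get_config_for_mode(self, mode: str) -> dict[str, str]:
        return {"mode": mode}

    def get_environment_for_mode(self, mode: str) -> dict[str, str]:
        # Return copies so callers cannot mutate the frozen config's dicts
        if mode == "prefill":
            return dict(self.prefill_environment)
        if mode == "decode":
            return dict(self.decode_environment)
        raise ValueError(f"unsupported mode: {mode}")
```

Note that isinstance checks against a runtime_checkable Protocol only verify that the methods exist, not that their signatures match, so a type checker is still needed to catch signature drift.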
Current backends:
- SGLang: Per-process srun launching, supports prefill/decode/aggregated modes
- TRTLLM: MPI-style launching (one srun per endpoint with all nodes), prefill/decode only
- Create benchmarks/mybench.py inheriting from BenchmarkRunner
- Implement the run(config, log_dir) method
- Add a bash script at benchmarks/scripts/mybench/bench.sh
- Register it in the benchmark type mapping
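The registration step can be pictured with a decorator-based registry like the one below. This is a self-contained sketch, not the srtctl implementation: the _BENCHMARKS dict, this version of register_benchmark, get_benchmark, and the MyBench class (which omits the real BenchmarkRunner base class) are all stand-ins written for illustration.

```python
from typing import Callable

# Hypothetical registry; the real mapping and BenchmarkRunner base class live in srtctl.
_BENCHMARKS: dict[str, type] = {}


def register_benchmark(name: str) -> Callable[[type], type]:
    """Return a class decorator that records the class under the given name."""

    def decorator(cls: type) -> type:
        if name in _BENCHMARKS:
            raise ValueError(f"benchmark {name!r} is already registered")
        _BENCHMARKS[name] = cls
        return cls

    return decorator


@register_benchmark("mybench")
class MyBench:
    def run(self, config: dict, log_dir: str) -> int:
        # A real runner would launch benchmarks/scripts/mybench/bench.sh here.
        return 0


def get_benchmark(name: str) -> type:
    """Look up a registered benchmark class by name."""
    try:
        return _BENCHMARKS[name]
    except KeyError:
        raise ValueError(f"unknown benchmark: {name}") from None
```

Registration happens at import time, so a new module only takes effect once something imports it.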
When adding new config fields that affect what gets passed to srun (environment variables, container mounts, srun options), you must also update:
- show_config_details() in src/srtctl/cli/submit.py - this renders all mounts/env/options in srtctl dry-run output so users can verify config before submitting
- tests/test_dry_run.py - add test cases verifying the new config appears in dry-run output
Config sources that feed into dry-run display:
- Mounts: config.extra_mount, config.container_mounts, default_mounts from srtslurm.yaml
- Env vars: config.environment (global), backend.prefill_environment, backend.decode_environment, backend.aggregated_environment
- srun options: config.srun_options
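To show how several sources might be combined into one labeled listing, here is a small sketch for the mounts case. collect_mounts is a hypothetical helper, not part of show_config_details(), and both the source ordering and the de-duplication are assumptions made for illustration.

```python
def collect_mounts(
    extra_mount: list[str],
    container_mounts: list[str],
    default_mounts: list[str],
) -> list[tuple[str, str]]:
    """Return (source_label, mount) pairs in display order, dropping repeated mounts."""
    sources = [
        ("config.extra_mount", extra_mount),
        ("config.container_mounts", container_mounts),
        ("srtslurm.yaml default_mounts", default_mounts),
    ]
    seen: set[str] = set()
    labeled: list[tuple[str, str]] = []
    for label, mounts in sources:
        for mount in mounts:
            if mount not in seen:  # First source to declare a mount keeps the label
                seen.add(mount)
                labeled.append((label, mount))
    return labeled
```

Keeping the source label next to each mount is what lets a dry-run reader trace a surprising entry back to the config field that produced it.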
srtctl dry-run shows the sbatch script, all container mounts (with source labels), environment variables (global and per-mode), and srun options:
srtctl dry-run -f config.yaml

The full srun command (with all mounts, env vars, and flags) is logged at INFO level in the sweep log:
tail -f outputs/<job_id>/logs/sweep_<job_id>.log | grep "srun command"

Per-worker env vars and commands are also logged individually (search for Env: and Command: lines).