feat: add --sglang-config YAML for engine group configuration by zhuzilin · Pull Request #1614 · THUDM/slime

zhuzilin · 2026-02-21T16:48:42Z

No description provided.

Add --sglang-config argument that accepts a YAML file to configure engine groups with explicit roles and GPU counts, including support for placeholder groups that reserve GPU slots without creating engines (useful for NUMA alignment or GPU gaps between model parts).

Configuration priority:
1. --sglang-config YAML with explicit engine_groups
2. --prefill-num-servers (legacy PD disaggregation)
3. Default single "regular" group

New dataclasses:
- EngineGroupConfig: per-group role + num_gpus + overrides
- SglangConfig: list of groups with factory methods + from_yaml

Key changes:
- start_rollout_server() uses config-driven loop with cumulative GPU offset for rank calculation, all groups start concurrently
- RolloutServer.active_engine_groups filters out placeholders
- Health monitors, engine aggregation, offload/onload skip placeholders
- _start_router() accepts has_pd_disaggregation kwarg
- Per-engine-group sglang ServerArgs overrides via YAML "overrides" dict threaded through EngineGroup -> SGLangEngine -> _compute_server_args
- Mutual exclusion: --sglang-config vs --rollout-external and --sglang-config vs --prefill-num-servers

Example YAML:

    engine_groups:
      - role: prefill
        num_gpus: 4
        overrides:
          mem_fraction_static: 0.9
      - role: decode
        num_gpus: 10
        overrides:
          mem_fraction_static: 0.7
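To make the cumulative-offset and placeholder behavior concrete, here is a minimal sketch of how the two dataclasses might look. Only the class names, the `role` / `num_gpus` / `overrides` keys, and the idea of a cumulative GPU offset come from the commit message; the `from_dict`, `gpu_offsets`, and `active_groups` helpers and the literal role string `"placeholder"` are assumptions made for illustration, not the PR's actual code. The sketch takes an already-parsed mapping so that it runs without a YAML library; the real `from_yaml` presumably loads the file first.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any


@dataclass
class EngineGroupConfig:
    role: str  # e.g. "regular", "prefill", "decode"; "placeholder" is an assumed spelling
    num_gpus: int
    overrides: dict[str, Any] = field(default_factory=dict)


@dataclass
class SglangConfig:
    engine_groups: list[EngineGroupConfig]

    @classmethod
    def from_dict(cls, raw: dict[str, Any]) -> "SglangConfig":
        # Hypothetical helper: builds the config from a parsed YAML mapping.
        groups = [
            EngineGroupConfig(
                role=g["role"],
                num_gpus=int(g["num_gpus"]),
                overrides=dict(g.get("overrides") or {}),
            )
            for g in raw["engine_groups"]
        ]
        return cls(engine_groups=groups)

    def gpu_offsets(self) -> list[int]:
        # Each group starts where the previous one ended, placeholders included.
        offsets, cursor = [], 0
        for group in self.engine_groups:
            offsets.append(cursor)
            cursor += group.num_gpus
        return offsets

    def active_groups(self) -> list[tuple[EngineGroupConfig, int]]:
        # Placeholders consume GPU slots but never get engines.
        return [
            (group, offset)
            for group, offset in zip(self.engine_groups, self.gpu_offsets())
            if group.role != "placeholder"
        ]


cfg = SglangConfig.from_dict(
    {
        "engine_groups": [
            {"role": "prefill", "num_gpus": 4, "overrides": {"mem_fraction_static": 0.9}},
            {"role": "placeholder", "num_gpus": 2},
            {"role": "decode", "num_gpus": 10, "overrides": {"mem_fraction_static": 0.7}},
        ]
    }
)
```

With a 2-GPU placeholder between the two groups, the decode group starts at GPU offset 6 rather than 4, which is the point of reserving slots without launching engines.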


…s_per_engine

Support serving multiple models via --sglang-config YAML, each with its own router and engine groups. Engine groups within a model may have different num_gpus_per_engine (e.g. PD disaggregation with prefill TP=2 and decode TP=4).

Key changes:
- rollout.py: New ModelConfig/SglangConfig with models list. EngineGroup now carries per-group num_gpus_per_engine, gpu_offset, router_ip/port. RolloutManager.servers is now a dict[str, RolloutServer]. Each model gets its own router via _start_router(force_new=True).
- sglang_engine.py: SGLangEngine accepts per-engine num_gpus_per_engine and per-engine router_ip/port, forwarded to _compute_server_args.
- Weight updates (Megatron and FSDP): engine_gpu_counts flows through actor -> weight_updater -> connect_rollout_engines_from_distributed, supporting heterogeneous NCCL world_size and rank offsets.
- Backward compatible: legacy engine_groups-only YAML still works as a single default model. Single-model callers use the .server property.
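The heterogeneous world size and rank offsets mentioned above can be illustrated with a small calculation. This is a sketch under stated assumptions, not the PR's implementation: the function name `nccl_layout` is hypothetical, and it assumes a single training-side source rank occupies rank 0 of the weight-update group, with each engine's ranks following in order.

```python
from itertools import accumulate


def nccl_layout(engine_gpu_counts: list[int], num_source_ranks: int = 1):
    """Return (world_size, rank_offsets) for a weight-update process group.

    Assumption: source (training) ranks come first, then each engine
    occupies a contiguous block of ranks sized by its own GPU count.
    """
    # Exclusive prefix sum: the starting GPU index of every engine.
    starts = [0, *accumulate(engine_gpu_counts)][:-1]
    rank_offsets = [num_source_ranks + start for start in starts]
    world_size = num_source_ranks + sum(engine_gpu_counts)
    return world_size, rank_offsets


# Two prefill engines at TP=2 and one decode engine at TP=4.
world_size, rank_offsets = nccl_layout([2, 2, 4])
```

With uniform engines the offsets reduce to a fixed stride; with mixed TP sizes only the prefix sum gives every engine a non-overlapping rank range.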

- Megatron UpdateWeightFromTensor: move Gloo group creation from __init__ to connect_rollout_engines where engine_gpu_counts is available; compute colocate_engine_nums via cumulative GPU budget; use cumulative offsets for engine mapping instead of uniform stride.
- FSDP UpdateWeightFromTensor: same cumulative offset fix for IPC groups.
- Remove stale 'assume gpu id same' comments.
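The difference between a uniform stride and cumulative offsets is easiest to see on a mixed layout. The sketch below is illustrative only: both function names are hypothetical, and it assumes "cumulative GPU budget" means counting how many engines fit, in order, within a given number of colocated GPUs.

```python
from bisect import bisect_right
from itertools import accumulate


def count_colocated_engines(engine_gpu_counts: list[int], gpu_budget: int) -> int:
    """Count leading engines whose cumulative GPU usage fits in the budget."""
    used = 0
    count = 0
    for gpus in engine_gpu_counts:
        if used + gpus > gpu_budget:
            break
        used += gpus
        count += 1
    return count


def engine_for_gpu(engine_gpu_counts: list[int], gpu_index: int) -> int:
    """Map a global GPU index to the engine that owns it."""
    ends = list(accumulate(engine_gpu_counts))  # exclusive end index per engine
    total = ends[-1] if ends else 0
    if not 0 <= gpu_index < total:
        raise IndexError(f"gpu_index {gpu_index} outside 0..{total - 1}")
    # The first engine whose end lies beyond gpu_index owns that GPU.
    return bisect_right(ends, gpu_index)


counts = [2, 2, 4]
mapping = [engine_for_gpu(counts, gpu) for gpu in range(sum(counts))]
```

A uniform stride of 2 would place GPU 6 in a non-existent fourth engine (`6 // 2 == 3`), whereas the cumulative mapping assigns it to the third engine along with GPUs 4, 5, and 7.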

CSWYF3634076 · 2026-03-03T03:09:37Z

Hello, is this PR for PD disaggregation?

zhuzilin added the run-ci-megatron label Feb 21, 2026

zhuzilin added 2 commits February 25, 2026 03:30

refactor: rename role to worker_type to align with sglang naming

2d18f92

zhuzilin force-pushed the zhuzilin/sglang-config-v2 branch from bac171c to 2d18f92 Compare February 25, 2026 03:31

fix: skip sglang --config to avoid conflict with --sglang-config, update sglang.patch

51e18dc

zhuzilin force-pushed the zhuzilin/sglang-config-v2 branch from 16e2c44 to 51e18dc Compare February 25, 2026 05:20

zhuzilin added 2 commits February 25, 2026 06:28

zhuzilin removed the run-ci-megatron label Feb 25, 2026

zhuzilin added 2 commits February 25, 2026 09:08

bugfix

4a1044c

add ci

ead3923

zhuzilin added the run-ci-changed label Feb 25, 2026

zhuzilin added 7 commits February 25, 2026 09:52

fix update weights with placeholder

52d02f1

bugfix

a6fd021

add tests

53211ef

add port cursors

597b08c

loosen the check

5aaf4f3

Merge branch 'main' into zhuzilin/sglang-config-v2

2d255a4

fix lint

2bba294

zhuzilin added run-ci-megatron and removed run-ci-megatron labels Feb 25, 2026

zhuzilin merged commit da7080f into main Feb 26, 2026
23 of 29 checks passed

zhuzilin deleted the zhuzilin/sglang-config-v2 branch February 26, 2026 00:47

zhuzilin merged 14 commits into main from zhuzilin/sglang-config-v2
