[sglang] Fix megatron support in sglang and add sglang_async support & CI tasks #1602


Merged
merged 14 commits into from
May 24, 2025

Conversation

SwordFaith
Collaborator

@SwordFaith SwordFaith commented May 20, 2025

Checklist Before Starting

  • Search for similar PR(s).

What does this PR do?


  • Fix SGLang Megatron support
  • Add sglang_async Megatron support
  • Add a CI task to protect the Megatron-SGLang implementation

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results such as training curve plots, evaluation results, etc.

https://wandb.ai/swordfaith/gsm8k_async_rl/runs/6h7apmbn?nw=nwuserswordfaith

Additional Info.

  • Issue Number: Fixes issue # or discussion # if any.
  • Training: Megatron
  • Inference: SGLang

Checklist Before Submitting

  • Read the Contribute Guide.
  • Apply pre-commit checks.
  • Add [BREAKING] to the PR title if it breaks any API.
  • Update the documentation about your changes in the docs.
  • Add CI test(s) if necessary.

@ETOgaosion
Collaborator

Great work! Thanks a lot for your efforts to help SGLang and Megatron support catch up with vLLM.

@@ -265,3 +265,229 @@ jobs:
- name: clean up
run: |
rm -rf checkpoints

e2e_ppo_trainer_megatron-qwen-sgl:
Collaborator
Now this CI file has become so complicated that it is hard to maintain both the vLLM and SGLang tests. My suggestion is to add both the vLLM and SGLang tests to a single shell script, https://github.com/volcengine/verl/blob/main/tests/e2e/run_ppo_trainer_megatron.sh, which we then only need to call once.

@@ -9,6 +9,8 @@ MODEL_ID=${MODEL_ID:-Qwen/Qwen2.5-0.5B}
MODEL_PATH=${MODEL_PATH:-${HOME}/models/${MODEL_ID}}
huggingface-cli download "${MODEL_ID}" --local-dir "${MODEL_PATH}"

ENGINE=${ENGINE:-vllm}
Collaborator
I mean we can also include the SGLang python invocations below the vLLM ones here, so that we can test the two systems with the same configuration.
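The consolidation suggested above could look roughly like the following; a minimal sketch assuming the `ENGINE`-style variable already shown in the diff, with the `run_trainer` helper and the commented-out trainer invocation being purely illustrative, not verl's actual CLI:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical sketch: run the same e2e configuration once per rollout engine,
# so vLLM and SGLang are exercised by a single script call.
run_trainer() {
    local engine=$1
    echo "running e2e PPO trainer with rollout engine: ${engine}"
    # python3 -m verl.trainer.main_ppo actor_rollout_ref.rollout.name="${engine}" ...
}

for engine in vllm sglang; do
    run_trainer "${engine}"
done
```

This way the CI workflow only needs one step that calls the script, instead of duplicating near-identical jobs per engine.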

Collaborator Author

OK, I will modify the configuration after the CI test finishes.

Collaborator Author
Ok, I'll modify this.

return data

@GPUMemoryLogger(role="megatron sglang sharding_manager", logger=logger)
def postprocess_data(self, data: DataProto) -> DataProto:
Collaborator

@ETOgaosion ETOgaosion May 20, 2025
I read somewhere that SGLang's TP rank output might differ from vLLM's. Since the dispatch method is now the same as FSDP's, can we just borrow fsdp_sglang.py's implementation to avoid misalignment?

Collaborator Author

It seems that we currently lack a unified weight conversion method for FSDP like the one Megatron Core has, so we cannot merge these two classes for now. Are there any better approaches to implement this?

Collaborator

@ETOgaosion ETOgaosion May 20, 2025

Oh, I don't mean merging these classes; we can borrow the implementation there, just like the vLLM implementation does.
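The misalignment concern discussed above can be illustrated with a toy model: if the engine returns rollout output only on TP rank 0, `postprocess_data` must replicate that output to every TP rank before dispatch. The sketch below simulates this with plain lists; `tp_broadcast` and the rank layout are hypothetical stand-ins, not verl's or SGLang's API:

```python
# Hypothetical sketch of the TP-rank alignment issue: only the source rank
# holds the engine's output, so we "broadcast" it to the other TP ranks.
# In the real code this would use a collective over the TP process group.

def tp_broadcast(per_rank_outputs, src=0):
    """Return a copy of the src rank's output for every TP rank."""
    src_output = per_rank_outputs[src]
    return [src_output for _ in per_rank_outputs]

# TP rank 0 produced tokens; the other ranks hold nothing yet.
outputs = [["tok_a", "tok_b"], None, None, None]
aligned = tp_broadcast(outputs)
assert all(o == ["tok_a", "tok_b"] for o in aligned)
```

Borrowing the fsdp_sglang.py postprocess logic would make both backends apply the same replication step, which is the point of the suggestion.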

@ETOgaosion
Collaborator

@SwordFaith Thanks for the contribution! Could you rebase onto main and add megatron_sglang support for training-side expert parallelism, as in #1467? You can also use the latest image to add some tests~

@SwordFaith SwordFaith force-pushed the fix/sgl_megatron_support branch from e557838 to 20897b0 Compare May 21, 2025 11:58
@SwordFaith
Collaborator Author

SwordFaith commented May 21, 2025

@SwordFaith Thanks for the contribution! Could you rebase onto main and add megatron_sglang support for training-side expert parallelism, as in #1467? You can also use the latest image to add some tests~

For sglang, it seems that support has already been added in utils/megatron_utils as part of #1467. It would be better to share the megatron_utils per_tensor_generator implementation between the Megatron vLLM and SGLang paths. However, I believe such a refactor would require end-to-end training verification, which could take days to modify and debug. For now, we can address SGLang support in the current PR and plan the refactor for the future.
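The sharing idea above can be sketched as follows: a single per-tensor generator that streams `(name, tensor)` pairs, consumed by either engine's load callback. `per_tensor_generator` does exist in verl's megatron_utils per the comment, but this signature and the consumer loop are illustrative assumptions, not the actual implementation:

```python
# Hypothetical sketch: one weight generator shared by both rollout engines.
# Yielding tensors one at a time bounds peak memory during weight sync.

def per_tensor_generator(state_dict):
    """Yield (name, tensor) pairs one at a time."""
    for name, tensor in state_dict.items():
        yield name, tensor

def load_into_engine(engine_load_fn, state_dict):
    # Both vLLM and SGLang would consume the same generator;
    # only the per-engine load callback differs.
    for name, tensor in per_tensor_generator(state_dict):
        engine_load_fn(name, tensor)
```

Under this design, adding a new inference backend only requires a new load callback, while the Megatron-side weight iteration stays in one place.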

@ETOgaosion ETOgaosion force-pushed the fix/sgl_megatron_support branch from 4f0a646 to 3d368eb Compare May 24, 2025 05:22
@ETOgaosion ETOgaosion force-pushed the fix/sgl_megatron_support branch from c62e5e1 to 15a94c3 Compare May 24, 2025 06:19
@vermouth1992
Collaborator

Shall we merge this? @ETOgaosion

@ETOgaosion
Collaborator

Yes, finally, and we can test larger models based on SGLang backend.

@ETOgaosion ETOgaosion merged commit cf731e8 into volcengine:main May 24, 2025
35 checks passed