[sglang] Fix megatron support in sglang and add sglang_async support & CI tasks #1602


Merged
merged 14 commits into from
May 24, 2025

Conversation

SwordFaith
Collaborator

@SwordFaith SwordFaith commented May 20, 2025

Checklist Before Starting

  • Search for similar PR(s).

What does this PR do?


  • Fix SGLang Megatron support
  • Add sglang_async Megatron support
  • Add a CI task to protect the Megatron-SGLang implementation

Test

For changes that cannot be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results such as training curve plots, evaluation results, etc.

https://wandb.ai/swordfaith/gsm8k_async_rl/runs/6h7apmbn?nw=nwuserswordfaith

Additional Info.

  • Issue Number: Fixes issue # or discussion # if any.
  • Training: Megatron
  • Inference: SGLang

Checklist Before Submitting

  • Read the Contribute Guide.
  • Apply pre-commit checks.
  • Add [BREAKING] to the PR title if it breaks any API.
  • Update the documentation about your changes in the docs.
  • Add CI test(s) if necessary.

@ETOgaosion
Collaborator

Great work! Thanks a lot for your efforts to help SGLang and Megatron support catch up with vLLM.

@@ -265,3 +265,229 @@ jobs:
- name: clean up
run: |
rm -rf checkpoints

e2e_ppo_trainer_megatron-qwen-sgl:
Collaborator
Now this CI file has become so complicated that it is hard to maintain both the vLLM and SGLang tests. My suggestion is to add both the vLLM and SGLang tests to a single shell script, https://github.com/volcengine/verl/blob/main/tests/e2e/run_ppo_trainer_megatron.sh, which we then only need to call once.

@@ -9,6 +9,8 @@ MODEL_ID=${MODEL_ID:-Qwen/Qwen2.5-0.5B}
MODEL_PATH=${MODEL_PATH:-${HOME}/models/${MODEL_ID}}
huggingface-cli download "${MODEL_ID}" --local-dir "${MODEL_PATH}"

ENGINE=${ENGINE:-vllm}
Collaborator
I mean we can also include the SGLang python invocations below the vLLM ones here, so that we can test the two systems with the same configuration.
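The consolidation suggested above could look roughly like the following; a minimal sketch assuming the `ENGINE`-style variable already shown in the diff, with the `run_trainer` helper and the commented-out trainer invocation being purely illustrative, not verl's actual CLI:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical sketch: run the same e2e configuration once per rollout engine,
# so vLLM and SGLang are exercised by a single script call.
run_trainer() {
    local engine=$1
    echo "running e2e PPO trainer with rollout engine: ${engine}"
    # python3 -m verl.trainer.main_ppo actor_rollout_ref.rollout.name="${engine}" ...
}

for engine in vllm sglang; do
    run_trainer "${engine}"
done
```

This way the CI workflow only needs one step that calls the script, instead of duplicating near-identical jobs per engine.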

Collaborator Author

OK, I will modify the configuration after the CI test finishes.

Collaborator Author
Ok, I'll modify this.

return data

@GPUMemoryLogger(role="megatron sglang sharding_manager", logger=logger)
def postprocess_data(self, data: DataProto) -> DataProto:
Collaborator

@ETOgaosion ETOgaosion May 20, 2025
I read somewhere that SGLang's TP rank output might differ from vLLM's. Since the dispatch method is now the same as FSDP's, can we just borrow fsdp_sglang.py's implementation to avoid misalignment?

Collaborator Author

It seems that we currently lack a unified weight conversion method for FSDP like the one Megatron Core has, so we cannot merge these two classes for now. Are there any better approaches to implement this?

Collaborator

@ETOgaosion ETOgaosion May 20, 2025

Oh, I don't mean merging these classes; we can borrow the implementation there, just like the vLLM implementation does.
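The misalignment concern discussed above can be illustrated with a toy model: if the engine returns rollout output only on TP rank 0, `postprocess_data` must replicate that output to every TP rank before dispatch. The sketch below simulates this with plain lists; `tp_broadcast` and the rank layout are hypothetical stand-ins, not verl's or SGLang's API:

```python
# Hypothetical sketch of the TP-rank alignment issue: only the source rank
# holds the engine's output, so we "broadcast" it to the other TP ranks.
# In the real code this would use a collective over the TP process group.

def tp_broadcast(per_rank_outputs, src=0):
    """Return a copy of the src rank's output for every TP rank."""
    src_output = per_rank_outputs[src]
    return [src_output for _ in per_rank_outputs]

# TP rank 0 produced tokens; the other ranks hold nothing yet.
outputs = [["tok_a", "tok_b"], None, None, None]
aligned = tp_broadcast(outputs)
assert all(o == ["tok_a", "tok_b"] for o in aligned)
```

Borrowing the fsdp_sglang.py postprocess logic would make both backends apply the same replication step, which is the point of the suggestion.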

@ETOgaosion
Collaborator

@SwordFaith Thanks for the contribution! Could you rebase onto main and add megatron_sglang support for training-side expert parallelism, as in #1467? You can also use the latest image to add some tests~

@SwordFaith SwordFaith force-pushed the fix/sgl_megatron_support branch from e557838 to 20897b0 Compare May 21, 2025 11:58
@SwordFaith
Collaborator Author

SwordFaith commented May 21, 2025

@SwordFaith Thanks for the contribution! Could you rebase onto main and add megatron_sglang support for training-side expert parallelism, as in #1467? You can also use the latest image to add some tests~

For sglang, it seems that support has already been added in utils/megatron_utils as part of #1467. It would be better to share the megatron_utils per_tensor_generator implementation between the Megatron vLLM and SGLang paths. However, I believe such a refactor would require end-to-end training verification, which could take days to modify and debug. For now, we can address SGLang support in the current PR and plan the refactor for the future.
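The sharing idea above can be sketched as follows: a single per-tensor generator that streams `(name, tensor)` pairs, consumed by either engine's load callback. `per_tensor_generator` does exist in verl's megatron_utils per the comment, but this signature and the consumer loop are illustrative assumptions, not the actual implementation:

```python
# Hypothetical sketch: one weight generator shared by both rollout engines.
# Yielding tensors one at a time bounds peak memory during weight sync.

def per_tensor_generator(state_dict):
    """Yield (name, tensor) pairs one at a time."""
    for name, tensor in state_dict.items():
        yield name, tensor

def load_into_engine(engine_load_fn, state_dict):
    # Both vLLM and SGLang would consume the same generator;
    # only the per-engine load callback differs.
    for name, tensor in per_tensor_generator(state_dict):
        engine_load_fn(name, tensor)
```

Under this design, adding a new inference backend only requires a new load callback, while the Megatron-side weight iteration stays in one place.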

@ETOgaosion ETOgaosion force-pushed the fix/sgl_megatron_support branch from 4f0a646 to 3d368eb Compare May 24, 2025 05:22
@ETOgaosion ETOgaosion force-pushed the fix/sgl_megatron_support branch from c62e5e1 to 15a94c3 Compare May 24, 2025 06:19
@vermouth1992
Collaborator

Shall we merge this? @ETOgaosion

@ETOgaosion
Collaborator

Yes, finally, and we can test larger models based on SGLang backend.

@ETOgaosion ETOgaosion merged commit cf731e8 into volcengine:main May 24, 2025
35 checks passed