[sglang] Fix megatron support in sglang and add sglang_async support & CI tasks #1602
Great work! Thanks a lot for your efforts to help SGLang and Megatron support catch up with vLLM.
@@ -265,3 +265,229 @@ jobs:
      - name: clean up
        run: |
          rm -rf checkpoints

  e2e_ppo_trainer_megatron-qwen-sgl:
This CI file has become so complicated that it is hard to maintain both the vLLM and SGLang tests. My suggestion is to put both the vLLM and SGLang tests in a single shell test script, https://github.com/volcengine/verl/blob/main/tests/e2e/run_ppo_trainer_megatron.sh, so that we only need to call this file once.
@@ -9,6 +9,8 @@ MODEL_ID=${MODEL_ID:-Qwen/Qwen2.5-0.5B}
MODEL_PATH=${MODEL_PATH:-${HOME}/models/${MODEL_ID}}
huggingface-cli download "${MODEL_ID}" --local-dir "${MODEL_PATH}"

ENGINE=${ENGINE:-vllm}
I mean that here we can also include the SGLang Python commands below the vLLM ones, so that we can test both systems with the same configuration.
OK, I will modify the configuration after the CI test finishes.
Ok, I'll modify this.
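A minimal sketch of the suggestion above, assuming the test script loops over both engines and builds each command from one shared set of options. The `ENGINES` and `DRY_RUN` variables and the `build_cmd` helper are hypothetical names introduced for illustration, and the trainer module, config name, and override keys are assumptions rather than the options this PR actually uses.

```shell
#!/bin/sh
# Sketch only: run the same PPO configuration once per rollout engine.
set -e

# ENGINES is a hypothetical variable; override it to test one engine only.
ENGINES=${ENGINES:-"vllm sglang"}
MODEL_ID=${MODEL_ID:-Qwen/Qwen2.5-0.5B}
MODEL_PATH=${MODEL_PATH:-${HOME:-/tmp}/models/${MODEL_ID}}

# Print the command line for one engine. Every option except the rollout
# engine name is shared, so both systems see the same configuration.
# The module, config name, and override keys below are assumptions.
build_cmd() {
    echo "python3 -m verl.trainer.main_ppo" \
        "--config-name=ppo_megatron_trainer.yaml" \
        "actor_rollout_ref.model.path=${MODEL_PATH}" \
        "actor_rollout_ref.rollout.name=$1"
}

for engine in ${ENGINES}; do
    cmd=$(build_cmd "${engine}")
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Default: only show what would be run.
        echo "[dry-run] ${cmd}"
    else
        # Relies on word splitting; fine here because no argument has spaces.
        ${cmd}
    fi
done
```

With this layout the CI workflow would call the script once, and adding another engine would mean extending `ENGINES` rather than duplicating a job.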
        return data

    @GPUMemoryLogger(role="megatron sglang sharding_manager", logger=logger)
    def postprocess_data(self, data: DataProto) -> DataProto:
I read somewhere that SGLang's TP rank output might differ from vLLM's. Since the dispatch method is now the same as for FSDP, can we just borrow the implementation in fsdp_sglang.py to avoid misalignment?
It seems that we currently lack a weight conversion method that is unified across FSDP and Megatron Core, so we cannot merge these two classes for now. Are there any better approaches to implementing this?
Oh, I don't mean merging these classes. We can just borrow the implementation from there, as this vLLM implementation does.
@SwordFaith Thanks for the contribution. Could you rebase onto main and add megatron_sglang support for training-side expert parallelism, as in #1467? You can also use the latest image to add some tests~
e557838 to 20897b0
For SGLang, it seems that support has been added in utils/megatron_utils as part of #1467. It would be better to share megatron_utils.
4f0a646 to 3d368eb
c62e5e1 to 15a94c3
Shall we merge this? @ETOgaosion
Yes, finally. Now we can test larger models with the SGLang backend.
Checklist Before Starting
What does this PR do?
Test
https://wandb.ai/swordfaith/gsm8k_async_rl/runs/6h7apmbn?nw=nwuserswordfaith