Commit ef072ac

Authored by ISEEKYAN and wuxibin89
[megatron, model] feat: qwen3.5 example (#5381)
### What does this PR do?

Thanks to @LiuXTao 's great work on ISEEKYAN/mbridge#83, mbridge now supports qwen3.5. This PR succeeded in running qwen3.5 SFT on verl based on the mbridge support for qwen3.5.

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

See `examples/sft/gsm8k/run_qwen3_5_megatron.sh` and `examples/grpo_trainer/run_qwen3_5-35b-megatron.sh`.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [x] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [x] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
- [x] If your PR is related to the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.

Co-authored-by: wuxibin <wuxibin@bytedance.com>
1 parent 016c1d5 commit ef072ac

File tree

17 files changed: +746, -74 lines


.github/workflows/cpu_unit_tests.yml

Lines changed: 1 addition & 1 deletion

@@ -95,7 +95,7 @@ jobs:
         run: |
           pip3 install -r requirements-test.txt
           pip3 install --no-deps -e .
-          pip3 install --upgrade "transformers<5.0.0"
+          pip3 install --upgrade "transformers>=5.0.0"
       - name: Download datasets
         run: |
           python3 examples/data_preprocess/gsm8k.py --local_dataset_path ${HOME}/models/hf_data/gsm8k
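The change above flips the CI's transformers pin from an upper bound to a floor of 5.0.0. A minimal sketch of how a script could verify such a floor locally, using `sort -V` for version ordering (the `installed` value here is a hard-coded stand-in, not a real pip query):

```shell
# Sketch only: "installed" is a placeholder; in practice it would come
# from something like `pip3 show transformers`.
installed="5.1.0"
floor="5.0.0"
# sort -V orders version strings numerically; if the floor sorts first,
# the installed version is >= the floor.
if [ "$(printf '%s\n%s\n' "$floor" "$installed" | sort -V | head -n1)" = "$floor" ]; then
  echo "transformers ${installed} satisfies >=${floor}"
else
  echo "transformers ${installed} is below ${floor}" >&2
fi
# → transformers 5.1.0 satisfies >=5.0.0
```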
examples/grpo_trainer/run_qwen3_5-35b-megatron.sh (new file)

Lines changed: 165 additions & 0 deletions
#!/usr/bin/env bash
# Qwen3.5-35B-A3B MoE GRPO RL with Megatron (single node, 8 GPUs, geo3k dataset)
#
# Notes on vllm:
# As of 2026-02-25, the latest vllm nightly does not support qwen3.5 rollout. To use this script, you need to either
# 1. wait until vllm supports qwen3.5 officially, and build a verl docker image with that version of vllm, or
# 2. build a verl docker image yourself with vllm built from source with qwen3.5 support (the main branch as of 2026-02-25 is OK).
# I succeeded in running this script with the main branch of vllm on 2026-02-25, yet there are still some minor issues
# with vllm's qwen3.5 initialization that need to be fixed. Also, cuda_graph is somehow not working and needs to be
# fixed, either by the verl team with support for vllm 0.16, or by the vllm team.
# Requirements:
#   - 8 GPUs (80GB each, e.g. 1x8 H100/H200)
#   - Additional packages on top of the base image:
#       pip install --upgrade transformers
#       pip install flash-linear-attention
#       pip install -U git+https://github.com/ISEEKYAN/mbridge.git
#   - Megatron-LM==0.16.0
#
# Qwen3.5 architecture notes:
#   Qwen3.5 uses Gated Delta Net (GDN) linear attention, which currently does
#   NOT support packed sequences (THD format) in Megatron-LM. Therefore:
#   - model.use_remove_padding=False (deprecated option, will be removed in the future; forces bshd compute format)
#   - actor.megatron.use_remove_padding=False (forces bshd compute format)
#   - actor.use_dynamic_bsz=False (required for bshd mode)
#
#   Once Megatron-LM adds THD support for Qwen3.5 GDN, use_remove_padding
#   can be set to True for better performance.
#
# Tested parallelism config (8 GPUs / 1 node):
#   TP=2 PP=1 CP=1 EP=8 ETP=1 GEN_TP=8
#

export CUDA_DEVICE_MAX_CONNECTIONS=1
export VLLM_USE_V1=1
export VLLM_ALLREDUCE_USE_SYMM_MEM=0

set -xeuo pipefail

########################### Quick Config ###########################

TP=${TP:-2}
PP=${PP:-1}
CP=${CP:-1}
EP=${EP:-8}
ETP=${ETP:-1}
GEN_TP=${GEN_TP:-8}

ALL_OFFLOAD=${ALL_OFFLOAD:-True}

rollout_name="vllm"
project_name='verl_grpo_qwen3_5_35b_geo3k'
exp_name='qwen3_5_35b_megatron'
adv_estimator=grpo

HF_MODEL_PATH=${HF_MODEL_PATH:-"Qwen3.5-35B-A3B"}
train_path=${train_path:-$HOME/data/geo3k/train.parquet}
test_path=${test_path:-$HOME/data/geo3k/test.parquet}

########################### Parameter Arrays ###########################

DATA=(
    data.train_files=${train_path}
    data.val_files=${test_path}
    data.train_batch_size=32
    data.max_prompt_length=1024
    data.max_response_length=2048
    data.truncation='error'
    data.filter_overlong_prompts=True
)

MODEL=(
    actor_rollout_ref.model.path=${HF_MODEL_PATH}
    actor_rollout_ref.model.trust_remote_code=True
    actor_rollout_ref.model.use_remove_padding=False
)

ACTOR=(
    actor_rollout_ref.actor.optim.lr=1e-6
    actor_rollout_ref.actor.ppo_mini_batch_size=32
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1
    actor_rollout_ref.actor.ppo_max_token_len_per_gpu=4096
    actor_rollout_ref.actor.use_dynamic_bsz=False
    actor_rollout_ref.actor.use_kl_loss=True
    actor_rollout_ref.actor.kl_loss_coef=0.01
    actor_rollout_ref.actor.kl_loss_type=low_var_kl
    actor_rollout_ref.actor.entropy_coeff=0
    actor_rollout_ref.actor.megatron.use_mbridge=True
    actor_rollout_ref.actor.megatron.vanilla_mbridge=True
    actor_rollout_ref.actor.megatron.use_remove_padding=False
    actor_rollout_ref.actor.megatron.tensor_model_parallel_size=${TP}
    actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=${PP}
    actor_rollout_ref.actor.megatron.context_parallel_size=${CP}
    actor_rollout_ref.actor.megatron.expert_model_parallel_size=${EP}
    actor_rollout_ref.actor.megatron.expert_tensor_parallel_size=${ETP}
    actor_rollout_ref.actor.megatron.param_offload=${ALL_OFFLOAD}
    actor_rollout_ref.actor.megatron.optimizer_offload=${ALL_OFFLOAD}
    actor_rollout_ref.actor.megatron.grad_offload=${ALL_OFFLOAD}
    actor_rollout_ref.actor.megatron.dtype=bfloat16
    ++actor_rollout_ref.actor.megatron.override_transformer_config.attention_backend=auto
    +actor_rollout_ref.actor.megatron.override_transformer_config.recompute_method=uniform
    +actor_rollout_ref.actor.megatron.override_transformer_config.recompute_granularity=full
    +actor_rollout_ref.actor.megatron.override_transformer_config.recompute_num_layers=1
    +actor_rollout_ref.actor.megatron.override_transformer_config.moe_aux_loss_coeff=0.01
    +actor_rollout_ref.actor.megatron.override_transformer_config.moe_z_loss_coeff=0.001
    +actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_offload_fraction=1
    +actor_rollout_ref.actor.optim.override_optimizer_config.overlap_cpu_optimizer_d2h_h2d=True
    +actor_rollout_ref.actor.optim.override_optimizer_config.use_precision_aware_optimizer=True
    +actor_rollout_ref.actor.optim.override_optimizer_config.optimizer_cpu_offload=True
)

ROLLOUT=(
    actor_rollout_ref.rollout.name=${rollout_name}
    actor_rollout_ref.rollout.tensor_model_parallel_size=${GEN_TP}
    actor_rollout_ref.rollout.gpu_memory_utilization=0.6
    actor_rollout_ref.rollout.n=5
    actor_rollout_ref.rollout.mode=async
    actor_rollout_ref.rollout.dtype=bfloat16
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1
    actor_rollout_ref.rollout.log_prob_use_dynamic_bsz=False
    actor_rollout_ref.rollout.log_prob_max_token_len_per_gpu=4096
)

REF=(
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1
    actor_rollout_ref.ref.log_prob_use_dynamic_bsz=False
    actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=4096
    actor_rollout_ref.ref.megatron.tensor_model_parallel_size=${TP}
    actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=${PP}
    actor_rollout_ref.ref.megatron.context_parallel_size=${CP}
    actor_rollout_ref.ref.megatron.expert_model_parallel_size=${EP}
    actor_rollout_ref.ref.megatron.expert_tensor_parallel_size=${ETP}
    actor_rollout_ref.ref.megatron.param_offload=${ALL_OFFLOAD}
)

ALGORITHM=(
    algorithm.adv_estimator=${adv_estimator}
    algorithm.use_kl_in_reward=False
)

TRAINER=(
    trainer.critic_warmup=0
    trainer.logger='["console","wandb"]'
    trainer.project_name=${project_name}
    trainer.experiment_name=${exp_name}
    trainer.n_gpus_per_node=8
    trainer.nnodes=1
    trainer.save_freq=20
    trainer.val_before_train=False
    trainer.test_freq=5
    trainer.total_epochs=15
)

########################### Launch ###########################

python3 -m verl.trainer.main_ppo \
    --config-path=config \
    --config-name='ppo_megatron_trainer.yaml' \
    "${DATA[@]}" \
    "${ALGORITHM[@]}" \
    "${MODEL[@]}" \
    "${ROLLOUT[@]}" \
    "${ACTOR[@]}" \
    "${REF[@]}" \
    "${TRAINER[@]}" \
    "$@"
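
Every knob in the Quick Config section uses bash's `${VAR:-default}` expansion, so callers can override any of them per invocation without editing the script. A minimal sketch of the pattern (the variable names mirror the script's):

```shell
# ${VAR:-default}: use the environment value if set, else the default.
TP=${TP:-2}
GEN_TP=${GEN_TP:-8}
echo "TP=${TP} GEN_TP=${GEN_TP}"
```

For example (script name per the PR description), `TP=4 GEN_TP=4 bash examples/grpo_trainer/run_qwen3_5-35b-megatron.sh trainer.total_epochs=1` overrides the parallelism defaults via the environment, while any trailing Hydra overrides flow through the `"$@"` at the end of the launch command.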
examples/sft/gsm8k/run_qwen3_5_megatron.sh (new file)

Lines changed: 142 additions & 0 deletions
#!/usr/bin/env bash
# Qwen3.5-397B-A17B SFT with Megatron backend + mbridge
#
# Requirements:
#   - 128+ GPUs (80GB each, e.g. 16x8 H100/H200)
#   - Docker: verlai/verl:vllm015 (or equivalent)
#   - Additional packages on top of the base image:
#       pip install --upgrade transformers
#       pip install flash-linear-attention
#       pip install -U git+https://github.com/ISEEKYAN/mbridge.git
#   - Megatron-LM==0.16.0
#
# Qwen3.5 architecture notes:
#   Qwen3.5 uses Gated Delta Net (GDN) linear attention, which currently does
#   NOT support packed sequences (THD format) in Megatron-LM. Therefore:
#   - engine.use_remove_padding=False (forces bshd compute format)
#   - data.use_dynamic_bsz=False (required for bshd mode)
#
#   Once https://github.com/NVIDIA/Megatron-LM/pull/2644 is merged, THD
#   format will be supported and engine.use_remove_padding can be set to True
#   for better performance.
#
# Tested parallelism config (128 GPUs / 16 nodes):
#   TP=2 PP=4 EP=32 CP=1

set -xeuo pipefail

# ============================================================
# Distributed
# ============================================================
NUM_GPUS=${NUM_GPUS:-8}
MASTER_ADDR=${MASTER_ADDR:-localhost}
MASTER_PORT=${MASTER_PORT:-29500}
NNODES=${NNODES:-16}
NODE_RANK=${NODE_RANK:-0}

# ============================================================
# Data
# ============================================================
DATASET_DIR=${DATASET_DIR:-~/dataset}
TRAIN_FILES=${TRAIN_FILES:-${DATASET_DIR}/train.parquet}

# ============================================================
# Model
# ============================================================
MODEL_PATH=${MODEL_PATH:-Qwen/Qwen3.5-397B-A17B}

# ============================================================
# Parallelism
# ============================================================
TP_SIZE=${TP_SIZE:-2}
PP_SIZE=${PP_SIZE:-4}
VPP_SIZE=${VPP_SIZE:-null}
CP_SIZE=${CP_SIZE:-1}
EP_SIZE=${EP_SIZE:-32}
ETP_SIZE=${ETP_SIZE:-1}

# ============================================================
# Training
# ============================================================
TRAIN_BATCH_SIZE=${TRAIN_BATCH_SIZE:-128}
MICRO_BATCH_SIZE=${MICRO_BATCH_SIZE:-2}
MAX_LENGTH=${MAX_LENGTH:-2048}
LR=${LR:-2e-5}
MIN_LR=${MIN_LR:-2e-6}
DTYPE=${DTYPE:-bfloat16}

BACKEND=megatron
RESUME_MODE=${RESUME_MODE:-disable}

project_name=verl_sft_qwen3_5
exp_name=qwen3_5-${BACKEND}-tp${TP_SIZE}-pp${PP_SIZE}-cp${CP_SIZE}-ep${EP_SIZE}
ckpts_home=${ckpts_home:-~/verl/checkpoints/${project_name}/${exp_name}}
mkdir -p "${ckpts_home}"

# ============================================================
# Engine config
# ============================================================
# Key Qwen3.5 settings:
#   engine.use_remove_padding=False - GDN requires bshd format (no THD)
#   engine.vanilla_mbridge=True     - use mbridge (not megatron-bridge)
ENGINE_CONFIG="\
    engine=${BACKEND} \
    optim=${BACKEND} \
    optim.lr=${LR} \
    optim.min_lr=${MIN_LR} \
    optim.lr_warmup_steps=10 \
    optim.weight_decay=0.1 \
    optim.betas='[0.9,0.95]' \
    optim.clip_grad=1.0 \
    optim.lr_warmup_init=0 \
    optim.lr_decay_style=cosine \
    +optim.override_optimizer_config.optimizer_offload_fraction=1 \
    +optim.override_optimizer_config.overlap_cpu_optimizer_d2h_h2d=True \
    +optim.override_optimizer_config.use_precision_aware_optimizer=True \
    +optim.override_optimizer_config.optimizer_cpu_offload=True \
    engine.tensor_model_parallel_size=${TP_SIZE} \
    engine.pipeline_model_parallel_size=${PP_SIZE} \
    engine.virtual_pipeline_model_parallel_size=${VPP_SIZE} \
    engine.context_parallel_size=${CP_SIZE} \
    engine.expert_model_parallel_size=${EP_SIZE} \
    engine.expert_tensor_parallel_size=${ETP_SIZE} \
    engine.use_mbridge=True \
    engine.vanilla_mbridge=True \
    engine.dtype=${DTYPE} \
    engine.use_remove_padding=False \
    engine.override_transformer_config.attention_backend=auto \
    +engine.override_transformer_config.recompute_method=uniform \
    +engine.override_transformer_config.recompute_granularity=full \
    +engine.override_transformer_config.recompute_num_layers=1"

# ============================================================
# Launch
# ============================================================
torchrun \
    --nproc_per_node=${NUM_GPUS} \
    --nnodes=${NNODES} \
    --node_rank=${NODE_RANK} \
    --master_addr=${MASTER_ADDR} \
    --master_port=${MASTER_PORT} \
    -m verl.trainer.sft_trainer \
    data.train_files="${TRAIN_FILES}" \
    data.train_batch_size=${TRAIN_BATCH_SIZE} \
    data.micro_batch_size_per_gpu=${MICRO_BATCH_SIZE} \
    data.max_length=${MAX_LENGTH} \
    data.pad_mode=no_padding \
    data.truncation=error \
    data.use_dynamic_bsz=False \
    data.max_token_len_per_gpu=${MAX_LENGTH} \
    data.messages_key=messages \
    model.path=${MODEL_PATH} \
    model.use_remove_padding=False \
    model.trust_remote_code=True \
    ${ENGINE_CONFIG} \
    trainer.test_freq=-1 \
    trainer.save_freq=500 \
    trainer.logger="['console']" \
    trainer.project_name="${project_name}" \
    trainer.experiment_name="${exp_name}" \
    trainer.total_epochs=1 \
    trainer.default_local_dir="${ckpts_home}" \
    trainer.resume_mode=${RESUME_MODE}
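
The torchrun launch above is run identically on every node, with only NODE_RANK differing per node and MASTER_ADDR pointing at the rank-0 host. A minimal sketch of how the per-node variables line up (the hostname and the loop are illustrative only; in practice each rank runs on its own machine):

```shell
# Print the launch line each node would run. "node-0" is a hypothetical
# hostname standing in for the real rank-0 host.
NNODES=2
MASTER_ADDR="node-0"
MASTER_PORT=29500
for NODE_RANK in $(seq 0 $((NNODES - 1))); do
  echo "rank ${NODE_RANK}: torchrun --nnodes=${NNODES} --node_rank=${NODE_RANK}" \
       "--master_addr=${MASTER_ADDR} --master_port=${MASTER_PORT} -m verl.trainer.sft_trainer ..."
done
```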

tests/checkpoint_engine/test_special_server_adapter.py

Lines changed: 2 additions & 2 deletions
@@ -106,7 +106,7 @@ async def _run_server_manager_without_resume(
     )

     # wait a while and update weights to interrupt the generation
-    await asyncio.sleep(3)
+    await asyncio.sleep(2)
     await checkpoint_manager.update_weights(global_steps=global_steps)

     outputs = await asyncio.gather(*tasks)
@@ -149,7 +149,7 @@ async def _run_server_manager_with_resume(
     # 2. trainer update weights to rollout multiple times
     for global_steps in range(initial_steps, initial_steps + train_steps):
         # wait a while and update weights to interrupt the generation
-        await asyncio.sleep(3)
+        await asyncio.sleep(2)
         await checkpoint_manager.update_weights(global_steps=global_steps)

         # 3. wait for rollout generate responses finished
