
Commit 53eff12

Merge branch 'main' into feat/perf_metric
2 parents 379f194 + 48328e8

13 files changed: +288 -105 lines

docs/en/advanced/speculative-decoding.md

Lines changed: 0 additions & 39 deletions

@@ -18,42 +18,3 @@ And for external draft model (e.g. draft models from [SpecForge](https://docs.sg
 ```
 
 For details on parameter meanings and configuration, see the [SGLang speculative decoding documentation](https://docs.sglang.ai/advanced_features/speculative_decoding.html).
-
-### Known Issues
-
-#### [SGLang issue #9888](https://github.com/sgl-project/sglang/issues/9888) or [SGLang issue #9521](https://github.com/sgl-project/sglang/issues/9521)
-
-* The error occurs during CUDA graph padding in the speculative decoding draft stage.
-* Workarounds:
-  1. Switch the inference backend to **fa3** or **Triton** (the bug only occurs in **flashInfer**).
-  2. Specify a broader range for `--sglang-cuda-graph-bs` to avoid batch sizes that trigger CUDA graph padding.
-  3. Disable CUDA graph (not recommended due to the significant performance loss).
-  4. **Notice:** Disabling CUDA graph padding with `--sglang-disable-cuda-graph-padding` is currently ineffective for speculative decoding. See [SGLang `cuda_graph_runner.py`](tbd).
-* For debugging, enable slime’s `--debug-rollout-only` flag to isolate rollout behavior from parameter updates and model offloading.
-
-```bash
-# If speculative decoding fails, this can help debug
---debug-rollout-only
-
-# If flashInfer causes issues with speculative decoding, use fa3 or triton instead
---sglang-attention-backend fa3
-
-# If CUDA graph fails due to padding, extend the CUDA graph batch sizes
---sglang-cuda-graph-bs $(seq 1 32) $(seq 40 8 64) $(seq 80 16 160)
-
-# Improve performance by enlarging the running batch size limit
---sglang-max-running-requests 128
-```
-
-#### [SGLang issue #9481](https://github.com/sgl-project/sglang/issues/9481)
-
-* Solution:
-  1. Apply the latest SGLang patch.
-  2. See [PR #9687](https://github.com/sgl-project/sglang/pull/9687) for the reference changes.
-
-#### [SGLang PR #9388](https://github.com/sgl-project/sglang/pull/9388)
-
-* If using an external draft model results in **illegal memory access**, it may be caused by a context length mismatch between the draft and target models.
-* Please update to **SGLang ≥ 0.5.1** (and update `sgl-kernel`) to apply this fix.
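The `--sglang-cuda-graph-bs $(seq 1 32) $(seq 40 8 64) $(seq 80 16 160)` workaround above relies on shell `seq` expansion to enumerate the captured batch sizes. As a quick sanity check, here is a small Python sketch of the list those three ranges expand to (the variable name is illustrative only):

```python
# Expand the three shell seq ranges used with --sglang-cuda-graph-bs:
#   $(seq 1 32)      -> 1..32 step 1
#   $(seq 40 8 64)   -> 40..64 step 8
#   $(seq 80 16 160) -> 80..160 step 16
cuda_graph_bs = (
    list(range(1, 33))
    + list(range(40, 65, 8))
    + list(range(80, 161, 16))
)
print(len(cuda_graph_bs), cuda_graph_bs[0], cuda_graph_bs[-1])  # 42 1 160
```

Any decode batch size outside this list would fall back to CUDA graph padding, which is exactly what the workaround tries to avoid.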
docs/en/get_started/quick_start.md

Lines changed: 1 addition & 0 deletions

@@ -93,6 +93,7 @@ PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
 ```
 
 For larger models, you can use `torchrun` to start the conversion script and convert with multiple GPUs or even multiple nodes.
+Note: When converting the kimi-k2 model weights, you need to open `config.json` in the model path and change `"model_type": "kimi_k2"` to `"model_type": "deepseek_v3"`.
 
 ### Convert from Megatron Format to Hugging Face Format
Lines changed: 0 additions & 32 deletions

@@ -1,8 +1,6 @@
 # Speculative Decoding
 
-Speculative decoding is an important optimization for making faster rollout during RL training. Currently slime only supports speculative decoding without training.
-
 Speculative decoding is an important optimization for accelerating rollout; slime currently supports speculative decoding that does not update the draft model through training.
 
 For models with MTP layer support (e.g., GLM-4.6, Deepseek-V3/R1), simply add:
@@ -21,33 +19,3 @@ Speculative decoding is an important optimization for making faster rollout duri
 ```
 
 For detailed parameter meanings and configuration, see the SGLang speculative decoding [documentation](https://docs.sglang.ai/advanced_features/speculative_decoding.html)
-
-### Known Issues
-
-[SGLang issue #9888](https://github.com/sgl-project/sglang/issues/9888) or [SGLang issue #9521](https://github.com/sgl-project/sglang/issues/9521)
-
-- The error occurs during CUDA graph padding in the speculative decoding draft stage.
-- Workarounds:
-  1. Switch the inference backend to fa3 or triton; the bug only occurs in flashInfer.
-  2. Cover a wider range with `--sglang-cuda-graph-bs` to avoid CUDA graph padding at certain batch sizes.
-  3. Disable CUDA graph (not recommended; the performance loss is too large).
-  4. Notice: disabling CUDA graph padding via `--sglang-disable-cuda-graph-padding` currently has no effect on speculative decoding. See [SGLang cuda_graph_runner.py](tbd).
-- For debugging, try enabling slime's `--debug-rollout-only` flag to rule out interference from parameter updates or model offloading.
-```bash
-# If speculative decoding has a bug, this can help debug
---debug-rollout-only
-
-# If flashInfer has a bug with speculative decoding, use fa3 or triton instead
---sglang-attention-backend fa3
-
-# If a bug occurs when CUDA graph does padding, extend the CUDA graph batch sizes
---sglang-cuda-graph-bs $(seq 1 32) $(seq 40 8 64) $(seq 80 16 160)
-
-# Improve performance by enlarging the running batch size limit
---sglang-max-running-requests 128
-```
-[SGLang issue #9481](https://github.com/sgl-project/sglang/issues/9481)
-- Solution:
-  1. Apply the latest SGLang patch.
-  2. Modify SGLang per this PR: https://github.com/sgl-project/sglang/pull/9687
-[SGLang PR #9388](https://github.com/sgl-project/sglang/pull/9388)
-- If an illegal memory access occurs when using an external draft model, it may be caused by a context length mismatch between the draft and target models.
-- Please update to SGLang >= 0.5.1 (and update `sgl-kernel`) to apply this PR.
docs/zh/get_started/quick_start.md

Lines changed: 1 addition & 0 deletions

@@ -93,6 +93,7 @@ PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
 ```
 
 For larger models, you can use `torchrun` to launch the conversion script and perform the weight conversion with multiple GPUs or even multiple nodes.
+Note: When converting the kimi-k2 model weights, open `config.json` in the model path and change `"model_type": "kimi_k2"` to `"model_type": "deepseek_v3"`.
 
 ### Converting from Megatron Format to Hugging Face Format

Lines changed: 2 additions & 2 deletions

@@ -1,6 +1,6 @@
 emoji
+immutabledict
 nltk
+numpy==1.26.4
 spacy==3.7.4
 syllapy
-numpy==1.26.4
-immutabledict

examples/train_infer_mismatch_helper/mis.py

Lines changed: 4 additions & 1 deletion

@@ -218,6 +218,7 @@ def compute_mis_weights(
 def compute_mis_weights_with_cp(
     args,
     *,
+    pg_loss: torch.Tensor,
     train_log_probs: list[torch.Tensor],
     rollout_log_probs: list[torch.Tensor],
     loss_masks: list[torch.Tensor],
@@ -274,7 +275,9 @@ def slice_cp_and_concat(
     values = slice_cp_and_concat(values, total_lengths, response_lengths)
     result_metrics[key_name] = values
 
-    return is_weights, result_metrics
+    pg_loss = pg_loss * is_weights
+
+    return pg_loss, result_metrics
 
 
 def add_ppl_metrics(
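The `mis.py` change above makes `compute_mis_weights_with_cp` apply the importance-sampling weights to the policy-gradient loss itself (`pg_loss = pg_loss * is_weights`) rather than returning the raw weights. A minimal sketch of that weighting step, using plain Python lists in place of torch tensors (the function and argument names are illustrative, not slime's API):

```python
import math

def apply_is_weights(pg_loss, train_log_probs, rollout_log_probs):
    """Scale each per-token loss by the train/rollout importance ratio
    exp(logp_train - logp_rollout); a ratio of 1.0 leaves the loss unchanged."""
    is_weights = [math.exp(t - r) for t, r in zip(train_log_probs, rollout_log_probs)]
    return [loss * w for loss, w in zip(pg_loss, is_weights)]
```

When the training-side and rollout-side log-probs agree, every weight is 1 and the loss passes through untouched; a train/infer mismatch (e.g. different kernels) rescales each token's contribution instead.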

scripts/models/kimi-k2.sh

Lines changed: 63 additions & 0 deletions

@@ -0,0 +1,63 @@
+NLAYERS=61
+FIRST_K_DENSE_REPLACE=1
+
+arr=()
+for ((i=0; i<NLAYERS; i++)); do
+    if (( i < FIRST_K_DENSE_REPLACE )); then
+        arr+=(0)
+    else
+        arr+=(1)
+    fi
+done
+
+printf -v MOE_LAYER_FREQ "[%s]" "$(IFS=', '; echo "${arr[*]}")"
+
+# kimi-k2
+MODEL_ARGS=(
+    --disable-bias-linear
+    --num-layers 61
+    --hidden-size 7168
+    --ffn-hidden-size 18432
+    --num-attention-heads 64
+    --kv-channels 64
+    --normalization RMSNorm
+    --position-embedding-type rope
+    --norm-epsilon 1e-6
+    --swiglu
+    --untie-embeddings-and-output-weights
+    --vocab-size 163840
+
+    --multi-latent-attention
+    --q-lora-rank 1536
+    --kv-lora-rank 512
+    --qk-head-dim 128
+    --qk-pos-emb-head-dim 64
+    --v-head-dim 128
+    --qk-layernorm
+    --rotary-scaling-factor 32.0
+    --rotary-base 50000
+    --mscale 1.0
+    --mscale-all-dim 1.0
+    --attention-softmax-in-fp32
+    --no-rope-fusion
+
+    # moe
+    --num-experts 384
+    --moe-layer-freq $MOE_LAYER_FREQ
+    --moe-ffn-hidden-size 2048
+    --moe-router-topk 8
+    --moe-shared-expert-intermediate-size 2048
+    --moe-router-pre-softmax
+    --moe-router-score-function sigmoid
+    --moe-router-enable-expert-bias
+    --moe-router-load-balancing-type seq_aux_loss
+    --moe-token-dispatcher-type alltoall
+    --moe-aux-loss-coeff 0
+    --moe-router-bias-update-rate 0
+    --moe-router-group-topk 1
+    --moe-router-num-groups 1
+    --moe-grouped-gemm
+    --moe-router-topk-scaling-factor 2.827
+    --moe-router-dtype fp32
+    --moe-permute-fusion
+)
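The bash loop at the top of `kimi-k2.sh` marks the first `FIRST_K_DENSE_REPLACE` layers as dense (0) and every remaining layer as MoE (1). Note that bash joins `${arr[*]}` using only the first character of `IFS`, so the resulting `MOE_LAYER_FREQ` string is comma-separated without spaces. An equivalent Python sketch (the function name is illustrative):

```python
def moe_layer_freq(nlayers: int, first_k_dense: int) -> str:
    """0 for the first `first_k_dense` dense layers, 1 for each remaining
    MoE layer, joined with bare commas like bash's "${arr[*]}" join."""
    flags = [0 if i < first_k_dense else 1 for i in range(nlayers)]
    return "[" + ",".join(str(f) for f in flags) + "]"
```

`moe_layer_freq(61, 1)` reproduces the value passed to `--moe-layer-freq`: one leading 0 followed by 60 ones.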

scripts/run_qwen3_4b_fsdp.py

Lines changed: 19 additions & 12 deletions

@@ -6,7 +6,7 @@
 
 import command_utils as U
 
-MODEL_NAME = os.environ.get("SLIME_SCRIPT_MODEL_NAME", "Qwen3-4B")
+MODEL_NAME = os.environ.get("SLIME_SCRIPT_MODEL_NAME", "Qwen3-4B-Instruct-2507")
 NUM_GPUS = 8
 
 EXTRA_ARGS = os.environ.get("SLIME_SCRIPT_EXTRA_ARGS", "")
@@ -28,6 +28,8 @@ def prepare():
 
 
 def execute():
+    run_id = U.create_run_id()
+
     ckpt_args = (
         f"--hf-checkpoint /root/models/{MODEL_NAME} "
         # "--ref-load /root/models/{MODEL_NAME} "
@@ -38,33 +40,37 @@ def execute():
         "--input-key prompt "
         "--label-key label "
         "--apply-chat-template "
+        # By default it is thinking mode
+        # """--apply-chat-template-kwargs '{"enable_thinking":false}' """
         "--rollout-shuffle "
         "--rm-type deepscaler "
         "--num-rollout 3000 "
         "--rollout-batch-size 32 "
         "--n-samples-per-prompt 8 "
-        f"--rollout-max-response-len {100 if MODE == 'debug_minimal' else 8192} "
+        f"--rollout-max-response-len {100 if MODE == 'debug_minimal' else 32768} "
         "--rollout-temperature 0.8 "
         "--global-batch-size 256 "
         "--balance-data "
     )
 
-    # when using tiny response len, cannot do dynamic sampling
-    if MODE != "debug_minimal":
-        rollout_args += (
-            "--over-sampling-batch-size 64 "
-            "--dynamic-sampling-filter-path slime.rollout.filter_hub.dynamic_sampling_filters.check_reward_nonzero_std "
-        )
+    # We disable dynamic sampling currently
+    # # when using tiny response len, cannot do dynamic sampling
+    # if MODE != "debug_minimal":
+    #     rollout_args += (
+    #         "--over-sampling-batch-size 64 "
+    #         "--dynamic-sampling-filter-path slime.rollout.filter_hub.dynamic_sampling_filters.check_reward_nonzero_std "
+    #     )
 
     # sometimes disable eval to speed up debugging
     eval_args = ""
     if (MODE != "debug_minimal") and bool(int(os.environ.get("SLIME_SCRIPT_ENABLE_EVAL", "1"))):
+        eval_max_response_len = 32768
         eval_args += "--eval-interval 20 "
         if MULTI_EVAL:
-            eval_config_text = """
+            eval_config_text = f"""
 eval:
   defaults:
-    max_response_len: 16384
+    max_response_len: {eval_max_response_len}
     top_p: 0.7
   datasets:
     - name: aime
@@ -85,7 +91,7 @@ def execute():
         eval_args += (
             "--eval-prompt-data aime /root/datasets/aime-2024/aime-2024.jsonl "
             "--n-samples-per-eval-prompt 16 "
-            "--eval-max-response-len 16384 "
+            f"--eval-max-response-len {eval_max_response_len} "
             "--eval-top-p 0.7 "
         )
 
@@ -132,6 +138,7 @@ def execute():
         "--offload-train-mode move "
         """--train-env-vars '{"PYTORCH_CUDA_ALLOC_CONF":"expandable_segments:True"}' """
         "--use-fault-tolerance "
+        f"--save-debug-rollout-data /root/shared_data/{run_id}/{{rollout_id}}.pt "
    )
 
     true_on_policy_args = ""
@@ -158,7 +165,7 @@ def execute():
         f"{rollout_args} "
         f"{optimizer_args} "
         f"{grpo_args} "
-        f"{U.get_default_wandb_args(__file__)} "
+        f"{U.get_default_wandb_args(__file__, run_id=run_id)} "
         f"{perf_args} "
         f"{eval_args} "
         f"{sglang_args} "

slime/backends/fsdp_utils/actor.py

Lines changed: 18 additions & 13 deletions

@@ -22,6 +22,7 @@
 from slime.utils.timer import Timer, timer
 from slime.utils.wandb_utils import init_wandb_secondary
 
+from . import checkpoint
 from .data_packing import pack_sequences, unpack_sequences
 from .fsdp_cpu_adam_wrapper import FSDPCPUAdamWrapper
 from .update_weight_utils import UpdateWeightFromDistributed, UpdateWeightFromTensor
@@ -67,6 +68,9 @@ def init(self, args: Namespace, role: str, wandb_run_id: str, with_ref: bool = F
         self.args = args
         torch.manual_seed(args.seed)
 
+        if getattr(self.args, "start_rollout_id", None) is None:
+            self.args.start_rollout_id = 0
+
         if args.record_memory_history:
             profile_utils.attach_oom_dump_memory_history(profile_utils.get_memory_snapshot_full_path(args))
 
@@ -91,6 +95,11 @@ def init(self, args: Namespace, role: str, wandb_run_id: str, with_ref: bool = F
         if args.gradient_checkpointing:
             model.gradient_checkpointing_enable()
 
+        checkpoint_payload = checkpoint.load(self)
+        if checkpoint_payload is not None and checkpoint_payload.get("model") is not None:
+            model.load_state_dict(checkpoint_payload["model"], strict=True)
+            checkpoint_payload["model"] = None
+
         # Create FSDP v2 model using FSDP
         self.model = apply_fsdp2(model)
 
@@ -120,8 +129,9 @@ def init(self, args: Namespace, role: str, wandb_run_id: str, with_ref: bool = F
                 f"Unsupported optimizer: {args.optimizer}. Supported options: 'adam', 'deepspeed_cpu_adam'"
             )
 
-        # TODO: load
-
+        self.global_step = 0
+        self.micro_step = 0
+        self._latest_checkpoint_iteration: int | None = None
         self.weights = {"actor": {}}
 
         self.ref_model = None
@@ -136,16 +146,16 @@ def init(self, args: Namespace, role: str, wandb_run_id: str, with_ref: bool = F
             else UpdateWeightFromDistributed(self.args, self.model, self.weights)
         )
 
+        checkpoint.finalize_load(self, checkpoint_payload)
+
         # Initialize data packing parameters
         self.max_tokens_per_gpu = args.max_tokens_per_gpu  # From main arguments
 
         if self.args.offload_train:
             self.sleep()
 
         Timer().start("train_wait")
-        self.global_step = 0
-        self.micro_step = 0
-        return 0
+        return int(getattr(self.args, "start_rollout_id", 0))
 
     def sleep(self) -> None:
         """Pause CUDA memory for all tracked tensors."""
@@ -204,16 +214,11 @@ def wake_up(self) -> None:
         print_memory("after wake_up model")
 
     def save_model(self, iteration: int) -> None:
-        """Save model state and optimizer state for the given iteration.
-
-        Parameters:
-            iteration: Global training step to associate with the checkpoint.
-
-        """
-        if self.args.debug_rollout_only:
+        """Delegate checkpoint saving to the shared checkpoint utilities."""
+        if self.args.debug_rollout_only or self.args.save is None:
             return
 
-        raise NotImplementedError()
+        checkpoint.save(self, iteration)
 
     def compute_log_prob(
         self,
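The `actor.py` changes above follow a two-phase restore: `checkpoint.load` runs before the model is sharded with FSDP (so the plain state dict can be loaded and then freed), and `checkpoint.finalize_load` runs once the optimizer and step counters exist. A minimal sketch of that ordering, with hypothetical stand-in callables (none of these names are slime's actual API):

```python
def init_with_checkpoint(model, build_optimizer, load_payload, finalize_load):
    """Two-phase checkpoint restore: weights before sharding, the rest after.

    All four arguments are hypothetical stand-ins for illustration only.
    """
    payload = load_payload()  # None on a fresh run
    if payload is not None and payload.get("model") is not None:
        model.load_state_dict(payload["model"])  # restore before the FSDP wrap
        payload["model"] = None  # drop the state dict to free host memory
    optimizer = build_optimizer(model)  # sharding/optimizer setup happens here
    finalize_load(payload, optimizer)  # optimizer state, step counters, etc.
    return optimizer
```

Loading the unsharded state dict first avoids reconciling full weights against an already-sharded module; clearing `payload["model"]` mirrors the diff's `checkpoint_payload["model"] = None` memory release.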

0 commit comments