And for an external draft model (e.g. draft models from [SpecForge](https://docs.sg…)), additional draft-model flags are required.

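
As a hedged sketch, an external draft model is typically wired up through SGLang's speculative-decoding arguments, forwarded via slime's `--sglang-` passthrough prefix; the values and the draft model path below are illustrative, not prescribed:

```bash
# Illustrative values; the draft model path is a placeholder.
--sglang-speculative-algorithm EAGLE
--sglang-speculative-draft-model-path /path/to/draft_model
--sglang-speculative-num-steps 3
--sglang-speculative-eagle-topk 1
--sglang-speculative-num-draft-tokens 4
```
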
For details on parameter meanings and configuration, see the [SGLang speculative decoding documentation](https://docs.sglang.ai/advanced_features/speculative_decoding.html).
### Known Issues
#### [SGLang issue #9888](https://github.com/sgl-project/sglang/issues/9888) or [SGLang issue #9521](https://github.com/sgl-project/sglang/issues/9521)
* An error occurs during CUDA graph padding in the speculative decoding draft stage.
* Workarounds:

1. Switch the attention backend to **fa3** or **Triton** (the bug only occurs in **FlashInfer**).
2. Specify a broader range for `--sglang-cuda-graph-bs` to avoid batch sizes that trigger CUDA graph padding.
3. Disable CUDA graph (not recommended due to the significant performance loss).
4. **Notice:** Disabling CUDA graph padding with `--sglang-disable-cuda-graph-padding` is currently ineffective for speculative decoding. See [SGLang `cuda_graph_runner.py`](tbd).

* For debugging, enable slime's `--debug-rollout-only` flag to isolate rollout behavior from parameter updates and model offloading.

```bash
# If speculative decoding fails, this can help debug
--debug-rollout-only

# If FlashInfer causes issues with speculative decoding, use fa3 or triton instead
--sglang-attention-backend fa3

# If CUDA graph fails due to padding, extend the CUDA graph batch sizes
--sglang-cuda-graph-bs <batch sizes>
```

* If using an external draft model results in an **illegal memory access**, it may be caused by a context length mismatch between the draft and target models.
* Please update to **SGLang ≥ 0.5.1** (and update `sgl-kernel`), which includes a fix for this issue.

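
One quick way to check for such a mismatch is to compare the context-length fields of the two models' Hugging Face `config.json` files. A minimal sketch (the key names cover common conventions; in a real run the paths would be your model directories):

```python
import json
import os
import tempfile

def context_length(config_path):
    """Read the maximum context length from a Hugging Face config.json."""
    with open(config_path) as f:
        cfg = json.load(f)
    # Different architectures name the field differently.
    for key in ("max_position_embeddings", "model_max_length", "seq_length"):
        if key in cfg:
            return cfg[key]
    return None

# Demo with synthetic configs; in practice, point at the real model directories.
root = tempfile.mkdtemp()
for name, length in [("target", 131072), ("draft", 4096)]:
    os.makedirs(os.path.join(root, name))
    with open(os.path.join(root, name, "config.json"), "w") as f:
        json.dump({"max_position_embeddings": length}, f)

target_len = context_length(os.path.join(root, "target", "config.json"))
draft_len = context_length(os.path.join(root, "draft", "config.json"))
if target_len != draft_len:
    print(f"context length mismatch: target={target_len}, draft={draft_len}")
```
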
For larger models, you can use `torchrun` to launch the conversion script across multiple GPUs or even multiple nodes.

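
A multi-node launch might look like the following sketch; the script name, rendezvous endpoint, and checkpoint paths are placeholders rather than slime's actual CLI:

```bash
# All names and paths below are placeholders; adjust to your setup.
torchrun --nnodes 2 --nproc-per-node 8 \
    --rdzv-backend c10d --rdzv-endpoint "${MASTER_ADDR}:29500" \
    convert_checkpoint.py \
    --input-dir /path/to/megatron_ckpt \
    --output-dir /path/to/hf_ckpt
```
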
Note: When converting the kimi-k2 model weights, you need to open `config.json` in the model path and change `"model_type": "kimi_k2"` to `"model_type": "deepseek_v3"`.

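
That one-line edit can also be scripted. A minimal sketch, assuming a standard Hugging Face `config.json` (`patch_model_type` is a hypothetical helper, not part of slime):

```python
import json
import os
import tempfile

def patch_model_type(config_path, old="kimi_k2", new="deepseek_v3"):
    """Rewrite the model_type field in a config.json if it matches `old`."""
    with open(config_path) as f:
        cfg = json.load(f)
    if cfg.get("model_type") == old:
        cfg["model_type"] = new
        with open(config_path, "w") as f:
            json.dump(cfg, f, indent=2)
    return cfg["model_type"]

# Demo on a synthetic config; in practice, pass <model_path>/config.json.
path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(path, "w") as f:
    json.dump({"model_type": "kimi_k2"}, f)

print(patch_model_type(path))  # prints: deepseek_v3
```
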
### Convert from Megatron Format to Hugging Face Format
Speculative decoding is an important optimization for faster rollout during RL training. Currently, slime only supports speculative decoding that does not update the draft model through training.

For models with MTP layer support (e.g., GLM-4.6, DeepSeek-V3/R1), you only need to add the corresponding speculative-decoding arguments.
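
For reference, a hedged sketch of what those arguments can look like in slime, using SGLang's speculative-decoding options behind the `--sglang-` prefix (the values are illustrative and should be tuned per model):

```bash
# Illustrative values; tune for your model and hardware.
--sglang-speculative-algorithm EAGLE
--sglang-speculative-num-steps 3
--sglang-speculative-eagle-topk 1
--sglang-speculative-num-draft-tokens 4
```
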