Commit fc51d25

[doc] move fp8 doc to qwen3-30B-a3B as qwen3-4B doesn't perform well on fp8 rollout (#952)
1 parent 2f3acd4 commit fc51d25

File tree

4 files changed, +38 −38 lines changed


docs/en/examples/qwen3-30B-A3B.md

Lines changed: 19 additions & 0 deletions
@@ -74,6 +74,25 @@ Here, we will briefly introduce the MoE-related parts in the [run-qwen3-30B-A3B.
--sglang-dp-size 8
```

### BF16 Training with FP8 Inference

slime also supports BF16 training with FP8 inference. For the Qwen3-30B-A3B model, you just need to download the following model:

```bash
huggingface-cli download Qwen/Qwen3-30B-A3B-FP8 --local-dir /root/Qwen3-30B-A3B-FP8
```

And replace `--hf-checkpoint` with:

```bash
#--hf-checkpoint /root/Qwen3-30B-A3B
--hf-checkpoint /root/Qwen3-30B-A3B-FP8
```

This will trigger FP8 inference. Currently, we directly cast the BF16 weights to FP8. In the future, we will gradually add more sophisticated quantization schemes that have less impact on precision.

⚠️ The Megatron checkpoint for training still needs to be the one that was originally converted from the BF16 Hugging Face model.
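To make the precision impact of a direct cast concrete, here is a minimal stdlib-only sketch that rounds a value to the nearest E4M3 value, the FP8 format commonly used for such checkpoints. `to_e4m3` is a hypothetical helper for illustration, not slime's actual cast code, and it ignores subnormals and NaN encodings:

```python
import math

def to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3 FP8 value (simplified: no subnormals/NaN)."""
    if x == 0.0:
        return 0.0
    sign = math.copysign(1.0, x)
    m, e = math.frexp(abs(x))      # abs(x) = m * 2**e, with m in [0.5, 1)
    # E4M3 keeps 4 significant bits (1 implicit + 3 explicit mantissa bits)
    q = round(m * 16) / 16
    y = math.ldexp(q, e)
    return sign * min(y, 448.0)    # clamp to the E4M3 max magnitude

# Direct casting snaps nearby BF16 values onto a coarse grid:
print(to_e4m3(0.3))  # 0.3125, a rounding error of about 4%
```

Casting every weight onto this coarse grid is cheap but lossy, which is why more careful quantization schemes are planned.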
### Multi-Node Support

For a multi-node environment, the following modifications are necessary:

docs/en/examples/qwen3-4B.md

Lines changed: 0 additions & 19 deletions
@@ -250,25 +250,6 @@ This means that each time, the data corresponding to the first `num_samples` pro
⚠️ The `sample.metadata` of each partial rollout sample stores the rollout ID from its initial generation, which can be used for data filtering.

### BF16 Training with FP8 Inference

slime also supports BF16 training with FP8 inference. For the Qwen3-4B model, you just need to download the following model:

```bash
huggingface-cli download Qwen/Qwen3-4B-FP8 --local-dir /root/Qwen3-4B-FP8
```

And replace `--hf-checkpoint` with:

```bash
#--hf-checkpoint /root/Qwen3-4B
--hf-checkpoint /root/Qwen3-4B-FP8
```

This will trigger FP8 inference. Currently, we directly cast the BF16 weights to FP8. In the future, we will gradually add more sophisticated quantization schemes that have less impact on precision.

⚠️ The Megatron checkpoint for training still needs to be the one that was originally converted from the BF16 Hugging Face model.

### Decoupled Training and Inference

In the original script, the resource configuration is as follows:

docs/zh/examples/qwen3-30B-A3B.md

Lines changed: 19 additions & 0 deletions
@@ -73,6 +73,25 @@ bash scripts/run-qwen3-30B-A3B.sh
--sglang-dp-size 8
```

### BF16 Training with FP8 Inference

slime also supports BF16 training with FP8 inference. For the Qwen3-30B-A3B model, you just need to download the following model:

```bash
huggingface-cli download Qwen/Qwen3-30B-A3B-FP8 --local-dir /root/Qwen3-30B-A3B-FP8
```

And replace `--hf-checkpoint` with:

```bash
#--hf-checkpoint /root/Qwen3-30B-A3B
--hf-checkpoint /root/Qwen3-30B-A3B-FP8
```

This will trigger FP8 inference. Currently, we directly cast the BF16 weights to FP8; going forward, we will gradually add quantization schemes with less impact on precision.

⚠️ The Megatron checkpoint for training still needs to be the one originally converted from the BF16 Hugging Face model.

### Multi-Node Support

For a multi-node environment, the following modifications are needed:

docs/zh/examples/qwen3-4B.md

Lines changed: 0 additions & 19 deletions
@@ -250,25 +250,6 @@ def pop_first(args, rollout_id, buffer: list[list[Sample]], num_samples: int) ->
⚠️ The `sample.metadata` of each partial rollout sample stores the rollout ID from its initial generation, which can be used for data filtering.

### BF16 Training with FP8 Inference

slime also supports BF16 training with FP8 inference. For the Qwen3-4B model, you just need to download the following model:

```bash
huggingface-cli download Qwen/Qwen3-4B-FP8 --local-dir /root/Qwen3-4B-FP8
```

And replace `--hf-checkpoint` with:

```bash
#--hf-checkpoint /root/Qwen3-4B
--hf-checkpoint /root/Qwen3-4B-FP8
```

This will trigger FP8 inference. Currently, we directly cast the BF16 weights to FP8; going forward, we will gradually add quantization schemes with less impact on precision.

⚠️ The Megatron checkpoint for training still needs to be the one originally converted from the BF16 Hugging Face model.

### Decoupled Training and Inference

In the original script, the resource configuration is as follows:
