
Commit e699b5d: [docs] move low precision example into main doc (#1432)
1 parent 2876bf8
File tree: 13 files changed, +177 −53 lines changed

Lines changed: 45 additions & 43 deletions
@@ -1,50 +1,61 @@
-# FP8 training examples
+# Low Precision Training
 
-This is an example of FP8 training and FP8 inference. Under FP8 training and inference, it can achieve more efficient inference throughput and lower training-inference mismatch, resulting in more stable training. More details can be found in [this blog](https://lmsys.org/blog/2025-11-25-fp8-rl/).
+- [FP8 rollout and BF16 training](#FP8-rollout-and-BF16-training)
+- [FP8 rollout and FP8 training](#FP8-rollout-and-FP8-training)
+- [INT4 QAT Training](#INT4-QAT-Training)
 
-## Files
+## FP8 rollout and BF16 training
 
-* `run-qwen3-4b-fp8.sh`: example launch script with Qwen3-4B in FP8.
+You can run FP8 rollout simply by setting `--hf-checkpoint` to a blockwise-quantized HuggingFace checkpoint, which can be converted by:
 
-* `run-qwen3-30b-a3b-fp8-two-nodes.sh`: example launch script for running Qwen3-30B-A3B in FP8 across two nodes.
+```bash
+python tools/convert_hf_to_fp8.py \
+    --model-dir $BF16_MODEL \
+    --save-dir $FP8_MODEL \
+    --strategy block --block-size 128 128 \
+    --max-workers 4
+```
+
+Please ensure that the `config.json` in the converted checkpoint directory contains the correct `quantization_config`, so that slime can automatically use FP8 quantization during weight updates.
+
+## FP8 rollout and FP8 training
+
+We also observed that FP8 training combined with FP8 inference achieves higher inference throughput and a smaller training-inference mismatch, resulting in more stable training. More details can be found in [this blog](https://lmsys.org/blog/2025-11-25-fp8-rl/).
 
-## Quick Start
+### Quick Start
+
+1. Convert your HuggingFace model weights to FP8 format using `tools/convert_hf_to_fp8.py` as shown above.
 
-1. Check if your training script is properly configured.
+2. Set up the launch script:
 
    For training tasks, we need to add these flags:
+
   ```bash
   --fp8-format e4m3
   --fp8-recipe blockwise
   # --fp8-param-gather # [optional] Currently incompatible with CPU Adam
   ```
+
   Then ensure the `NVTE_FP8_BLOCK_SCALING_FP32_SCALES` environment variable is enabled.
 
   Note that only `Linear` and `GroupLinear` layers in TransformerEngine use the FP8 format; `embedding` and `lm_head` remain in their original precision. If `--fp8-param-gather` is not enabled, weights in TransformerEngine remain in BF16 format, only being cast to FP8 during `GEMM` or `GroupGEMM` operations.
 
-2. Convert your HuggingFace model weights to FP8 format.
-
-   You can use `tools/convert_hf_to_fp8.py` to convert BF16 weights to FP8 format. Ensure that the `--hf-checkpoint` parameter points to a directory where the `config.json` contains the correct `quantization_config`. slime will automatically use FP8 quantization during weight updates.
+3. Start FP8 training with:
 
-3. Start FP8 training.
-
-   ```
-   cd slime
-
-   # Qwen3-4B FP8 training (single node)
-   bash examples/low_precision/run-qwen3-4b-fp8.sh
+   ```bash
+   # Qwen3-4B FP8 training
+   bash scripts/low_precision/run-qwen3-4b-fp8.sh
 
-   # Qwen3-30B-A3B FP8 training (two nodes)
-   bash examples/low_precision/run-qwen3-30b-a3b-fp8-two-nodes.sh
+   # Qwen3-30B-A3B FP8 training (2 nodes)
+   bash scripts/low_precision/run-qwen3-30b-a3b-fp8.sh
   ```
-   Following the above command will launch FP8 training.
 
 4. Use the saved checkpoint for evaluation.
 
   Note that TransformerEngine does not specifically save FP8 quantized weights; the saved torch dist remains in original precision (usually BF16). If you want to evaluate under FP8, you need to convert the checkpoint from `torch_dist` to HuggingFace format, then convert to FP8 HuggingFace format.
 
 
-## Quick Explanation
+### Quick Explanation
 
 Here's a quick explanation of how FP8 training is currently implemented in slime:
 
@@ -57,43 +68,34 @@ Here's a quick explanation of how FP8 training is currently implemented in slime
 4. Save checkpoint: Similar to weight updates, if checkpoints need to be saved from the training engine, they will also be dequantized back to bf16 and saved to `torch_dist` format checkpoints.
 
 
-## TODO
+### TODO
 
 Currently, FP8 is far from being a complete feature and still has the following bugs, for example:
 
 - FP8 weights (`--fp8-param-gather`) can provide memory-savings benefits, but currently FP8 weights must be used with TransformerEngine's FusedAdam, which conflicts with the commonly used Adam CPU offload technique in Megatron-LM.
 
-The slime team will continue to collaborate with the NVIDIA team to contribute more complete FP8 training infrastructure to the community.
-
-***
-
-## INT4 Training Examples
+## INT4 QAT Training
 
 This guide provides examples for INT4 STE (Straight-Through Estimator) training and INT4 inference. Utilizing INT4 inference significantly improves throughput, thereby accelerating the training pipeline (specifically during the rollout generation phase).
 
-### Files
-
-* `run-moonlight-16B-A3B-int4.sh`: Launch script for **Moonlight-16B-A3B** (INT4) on 4x H200 GPUs.
-* `run-qwen3-30B-A3B-int4.sh`: Launch script for **Qwen3-30B-A3B** (INT4) on 8x H200 GPUs.
-* `run-qwen3-235B-A22B-int4.sh`: Launch script for **Qwen3-235B-A22B** (INT4) on 64x H200 GPUs.
-* `run-kimi-k2-Thinking-int4.sh`: Launch script for **Kimi-k2-Thinking** (INT4) on 256x H200 GPUs.
-
 ### Quick Start
 
-#### 1. Convert HuggingFace Weights to INT4
+1. Convert HuggingFace Weights to INT4
 First, download the PTQ (Post-Training Quantization) calibration dataset from HuggingFace:
 [https://huggingface.co/datasets/Salesforce/wikitext/tree/main/wikitext-2-raw-v1](https://huggingface.co/datasets/Salesforce/wikitext/tree/main/wikitext-2-raw-v1)
 
-Next, use the `tools/convert_hf_to_hf_int4.py` script to convert BF16 weights to INT4 format. Ensure that the `--hf-checkpoint` parameter points to a directory where `config.json` contains the correct `quantization_config`. slime will automatically utilize INT4 quantization during weight updates.
+Next, use the `tools/convert_hf_to_int4.py` script to convert BF16 weights to INT4 format. Ensure that the `--hf-checkpoint` parameter points to a directory where `config.json` contains the correct `quantization_config`. slime will automatically utilize INT4 quantization during weight updates.
 
 ```bash
-python tools/convert_hf_to_hf_int4.py \
+python tools/convert_hf_to_int4.py \
     --input-dir /path/to/your/original/models \
     --output-dir /path/to/your/save/models \
     --data-dir /path/to/your/wikitext
 ```
 
-#### 2. Start INT4 Training
+Note: If you only want to run INT4 rollout, you only need to set `--hf-checkpoint` to the converted INT4 checkpoint.
+
+2. Start INT4 QAT Training
 
 You need to configure the specific environment variables for quantization settings.
 
@@ -120,16 +122,16 @@ RUNTIME_ENV_JSON="{
 
 ```bash
 # Moonlight-16B-A3B Int4 training
-bash examples/low_precision/run-moonlight-16B-A3B-int4.sh
+bash scripts/low_precision/run-moonlight-16B-A3B-int4.sh
 
 # Qwen3-30B-A3B Int4 training
-bash examples/low_precision/run-qwen3-30B-A3B-int4.sh
+bash scripts/low_precision/run-qwen3-30B-A3B-int4.sh
 
 # Qwen3-235B-A22B Int4 training (8 nodes)
-bash examples/low_precision/run-qwen3-235B-A22B-int4.sh
+bash scripts/low_precision/run-qwen3-235B-A22B-int4.sh
 
 # Kimi-k2-Thinking Int4 training (32 nodes)
-bash examples/low_precision/run-kimi-k2-Thinking-int4.sh
+bash scripts/low_precision/run-kimi-k2-Thinking-int4.sh
 ```
 
-- For multi-node environments, please start the Ray service according to your cluster configuration.
+- For multi-node environments, please start the Ray service according to your cluster configuration.
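The INT4 STE training described above works by fake-quantizing weights in the forward pass while letting gradients pass through the rounding unchanged. A toy numpy sketch, assuming symmetric per-group quantization with group size 128 (the real kernels and their symmetric/asymmetric choice may differ):

```python
import numpy as np

def fake_quant_int4(w, group_size=128):
    """Simulate symmetric per-group INT4 quantize-then-dequantize ("fake quant").

    The forward pass sees the INT4-rounded values; in STE training the backward
    pass treats the rounding as identity, so gradients flow straight through to
    the underlying full-precision weights. Assumes w.size % group_size == 0.
    """
    flat = w.reshape(-1, group_size)                        # one scale per group
    scale = np.abs(flat).max(axis=1, keepdims=True) / 7.0   # int4 range is [-8, 7]
    scale = np.maximum(scale, 1e-8)                         # guard all-zero groups
    q = np.clip(np.round(flat / scale), -8, 7)              # integer codes in [-8, 7]
    return (q * scale).reshape(w.shape)                     # dequantized forward value
```

Fake quantization is idempotent: quantizing an already-quantized tensor with the same group size returns it unchanged, which is what makes the QAT forward pass consistent with INT4 inference.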

docs/en/index.rst

Lines changed: 2 additions & 1 deletion

@@ -42,9 +42,10 @@ slime is the RL-framework behind GLM-4.7, GLM-4.6 and GLM-4.5. Apart from models
 
    _examples_synced/reproducibility/README.md
    advanced/speculative-decoding.md
+   advanced/low-precision.md
    advanced/fault-tolerance.md
-   advanced/arch-support-beyond-megatron.md
    advanced/pd-disaggregation.md
+   advanced/arch-support-beyond-megatron.md
 
 .. toctree::
    :maxdepth: 1

File renamed without changes.

docs/zh/advanced/low-precision.md

Lines changed: 120 additions & 0 deletions

@@ -0,0 +1,120 @@
+# Low Precision Training
+
+- [FP8 rollout and BF16 training](#fp8-rollout-and-bf16-training)
+- [FP8 rollout and FP8 training](#fp8-rollout-and-fp8-training)
+- [INT4 QAT training](#int4-qat-training)
+
+## FP8 rollout and BF16 training
+
+You can run FP8 rollout by setting `--hf-checkpoint` to a blockwise-quantized HuggingFace checkpoint. The conversion command is:
+
+```bash
+python tools/convert_hf_to_fp8.py \
+    --model-dir $BF16_MODEL \
+    --save-dir $FP8_MODEL \
+    --strategy block --block-size 128 128 \
+    --max-workers 4
+```
+
+Please make sure the `config.json` in the converted checkpoint directory contains the correct `quantization_config`, so that slime can automatically use FP8 quantization during weight updates.
+
+## FP8 rollout and FP8 training
+
+We observed that using FP8 in both training and inference yields higher inference throughput and a smaller training-inference mismatch, making training more stable. For details, see [this blog](https://lmsys.org/blog/2025-11-25-fp8-rl/).
+
+### Quick Start
+
+1. Convert your HuggingFace model weights to FP8 using `tools/convert_hf_to_fp8.py` as above.
+2. For training tasks, add the following flags:
+```bash
+--fp8-format e4m3
+--fp8-recipe blockwise
+# --fp8-param-gather # [optional] Currently incompatible with the CPU Adam optimizer
+```
+
+Also make sure the `NVTE_FP8_BLOCK_SCALING_FP32_SCALES` environment variable is enabled; we currently set it to `1` by default.
+
+Note: currently only the `Linear` and `GroupLinear` layers in TransformerEngine use the FP8 format; `embedding` and `lm_head` keep their original precision. If `--fp8-param-gather` is not enabled, weights in TransformerEngine are stored in BF16 and are only cast to FP8 temporarily during `GEMM` and `GroupGEMM` operations.
+
+3. Start training:
+
+```bash
+# Qwen3-4B FP8 training
+bash scripts/low_precision/run-qwen3-4b-fp8.sh
+
+# Qwen3-30B-A3B (2 nodes)
+bash scripts/low_precision/run-qwen3-30b-a3b-fp8.sh
+```
+
+4. Using the saved checkpoint: TransformerEngine does not specifically save FP8-quantized weights; the saved `torch_dist` checkpoint remains in the original precision (usually BF16). If you want to evaluate under FP8, first convert `torch_dist` to HuggingFace format, then convert that to an FP8 HuggingFace checkpoint.
+
+### How it works
+
+Here is how FP8 training is currently implemented in slime:
+
+1. **Initialization**: if the FP8 recipe is enabled, the relevant layers are built in an FP8 context.
+2. **Training**: during training, weights and activations are quantized online to the FP8 format, and `cuBLAS FP8 GEMM` is called in the forward and backward passes.
+3. **Weight updates**: during RL weight updates, Megatron first dequantizes the FP8 weights to BF16, and slime then requantizes these BF16 weights to FP8 and sends them to sglang. (This dequantize-and-requantize step is not elegant, but for framework compatibility the interface has not been changed yet.)
+4. **Saving checkpoints**: similar to weight updates, checkpoints saved from the training engine are dequantized back to BF16 and stored in `torch_dist` format.
+
+### TODO
+
+FP8 support is not yet fully mature; known issues include:
+
+* FP8 weight storage (`--fp8-param-gather`) saves GPU memory, but it currently must be used with TransformerEngine's `FusedAdam`, which conflicts with the CPU Adam technique in Megatron-LM.
+
+## INT4 QAT training
+
+This guide provides examples of INT4 STE (Straight-Through Estimator) training and INT4 inference. INT4 inference significantly improves throughput and thus accelerates the whole training pipeline (especially the rollout generation phase).
+
+### Quick Start
+
+1. **Convert HuggingFace weights to INT4**
+   First, download the PTQ (Post-Training Quantization) calibration dataset from HuggingFace:
+   [wikitext-2-raw-v1](https://huggingface.co/datasets/Salesforce/wikitext/tree/main/wikitext-2-raw-v1)
+   Then convert with the `tools/convert_hf_to_int4.py` script. Make sure the `config.json` in the directory that `--hf-checkpoint` points to contains the correct `quantization_config`.
+```bash
+python tools/convert_hf_to_int4.py \
+    --input-dir /path/to/your/original/models \
+    --output-dir /path/to/your/save/models \
+    --data-dir /path/to/your/wikitext
+```
+
+   **Tip**: if you only want to run INT4 rollout, just set `--hf-checkpoint` to the converted INT4 checkpoint path.
+2. **Start INT4 QAT training**
+   You need to configure specific environment variables for the quantization settings.
+   **Environment variables:**
+   * **`OPEN_TRAINING_INT4_FAKE_QAT_FLAG`**: enables the fake-quantization (Fake Quantization) ops for INT4 training.
+   * **`OPEN_TRAINING_INT4_GROUP_SIZE`**: sets the quantization group size for the model.
+     * `moonlight-16B-A3B`, `qwen3-30B-A3B`, and `qwen3-235B-A22B-int4` use **128**
+     * `kimi-k2-Thinking-int4` uses **32**
+
+   **Example configuration:**
+```json
+RUNTIME_ENV_JSON="{
+  \"env_vars\": {
+    ...
+    \"OPEN_TRAINING_INT4_FAKE_QAT_FLAG\": \"1\",
+    \"OPEN_TRAINING_INT4_GROUP_SIZE\": \"128\"
+  }
+}"
+```
+
+   **Launch commands:**
+```bash
+# Moonlight-16B-A3B Int4 training
+bash scripts/low_precision/run-moonlight-16B-A3B-int4.sh
+
+# Qwen3-30B-A3B Int4 training
+bash scripts/low_precision/run-qwen3-30B-A3B-int4.sh
+
+# Qwen3-235B-A22B Int4 training (8 nodes)
+bash scripts/low_precision/run-qwen3-235B-A22B-int4.sh
+
+# Kimi-k2-Thinking Int4 training (32 nodes)
+bash scripts/low_precision/run-kimi-k2-Thinking-int4.sh
+```
+
+*For multi-node environments, start the Ray service according to your cluster configuration.*
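Hand-escaping JSON inside the shell string for `RUNTIME_ENV_JSON` above is error-prone; an equivalent way to produce the same value from Python, shown only to clarify the structure (the two `env_vars` keys are the ones documented above; the `...` placeholder for other variables is omitted here):

```python
import json

runtime_env = {
    "env_vars": {
        # enable the INT4 fake-quantization (QAT) ops during training
        "OPEN_TRAINING_INT4_FAKE_QAT_FLAG": "1",
        # group size: 128 for Moonlight-16B-A3B / Qwen3-30B-A3B / Qwen3-235B-A22B,
        # 32 for Kimi-k2-Thinking
        "OPEN_TRAINING_INT4_GROUP_SIZE": "128",
    }
}

RUNTIME_ENV_JSON = json.dumps(runtime_env)
print(RUNTIME_ENV_JSON)
```

`json.dumps` handles quoting, so the values survive the shell round trip unmodified.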

docs/zh/index.rst

Lines changed: 3 additions & 2 deletions

@@ -42,9 +42,10 @@ slime is the RL training framework behind GLM-4.7, GLM-4.6, and GLM-4.5. Beyond
 
    _examples_synced/reproducibility/README.md
    advanced/speculative-decoding.md
-   advanced/fault-torlance.md
-   advanced/arch-support-beyond-megatron.md
+   advanced/low-precision.md
+   advanced/fault-tolerance.md
    advanced/pd-disaggregation.md
+   advanced/arch-support-beyond-megatron.md
 
 .. toctree::
    :maxdepth: 1

examples/low_precision/run-kimi-k2-Thinking-int4.sh renamed to scripts/low_precision/run-kimi-k2-Thinking-int4.sh

Lines changed: 1 addition & 1 deletion

@@ -24,7 +24,7 @@ fi
 echo "HAS_NVLINK: $HAS_NVLINK (detected $NVLINK_COUNT NVLink references)"
 
 SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
-source "${SCRIPT_DIR}/../../models/kimi-k2-thinking.sh"
+source "${SCRIPT_DIR}/../models/kimi-k2-thinking.sh"
 
 CKPT_ARGS=(
     --hf-checkpoint /root/Kimi-K2-Thinking/

examples/low_precision/run-moonlight-16B-A3B-int4.sh renamed to scripts/low_precision/run-moonlight-16B-A3B-int4.sh

Lines changed: 1 addition & 1 deletion

@@ -25,7 +25,7 @@ fi
 echo "HAS_NVLINK: $HAS_NVLINK (detected $NVLINK_COUNT NVLink references)"
 
 SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
-source "${SCRIPT_DIR}/../../models/moonlight.sh"
+source "${SCRIPT_DIR}/../models/moonlight.sh"
 
 CKPT_ARGS=(
     --hf-checkpoint /root/Moonlight-16B-A3B-Instruct-INT4

examples/low_precision/run-qwen3-235B-A22B-int4.sh renamed to scripts/low_precision/run-qwen3-235B-A22B-int4.sh

Lines changed: 1 addition & 1 deletion

@@ -24,7 +24,7 @@ fi
 echo "HAS_NVLINK: $HAS_NVLINK (detected $NVLINK_COUNT NVLink references)"
 
 SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
-source "${SCRIPT_DIR}/../../models/qwen3-235B-A22B.sh"
+source "${SCRIPT_DIR}/../models/qwen3-235B-A22B.sh"
 
 CKPT_ARGS=(
     --hf-checkpoint /root/Qwen3-235B-A22B-INT4/

examples/low_precision/run-qwen3-30B-A3B-int4.sh renamed to scripts/low_precision/run-qwen3-30B-A3B-int4.sh

Lines changed: 1 addition & 1 deletion

@@ -24,7 +24,7 @@ fi
 echo "HAS_NVLINK: $HAS_NVLINK (detected $NVLINK_COUNT NVLink references)"
 
 SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
-source "${SCRIPT_DIR}/../../models/qwen3-30B-A3B.sh"
+source "${SCRIPT_DIR}/../models/qwen3-30B-A3B.sh"
 
 CKPT_ARGS=(
     --hf-checkpoint /root/Qwen3-30B-A3B-INT4/

examples/low_precision/run-qwen3-30b-a3b-fp8-two-nodes.sh renamed to scripts/low_precision/run-qwen3-30b-a3b-fp8.sh

Lines changed: 1 addition & 1 deletion

@@ -25,7 +25,7 @@ fi
 echo "HAS_NVLINK: $HAS_NVLINK (detected $NVLINK_COUNT NVLink references)"
 
 SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
-source "${SCRIPT_DIR}/../../scripts/models/qwen3-30B-A3B.sh"
+source "${SCRIPT_DIR}/../scripts/models/qwen3-30B-A3B.sh"
 
 # Base directory for checkpoints and related files (adjust if necessary)
 BASE_DIR="/root"
