
Commit e699b5d: [docs] move low precision example into main doc (#1432)
1 parent 2876bf8
File tree: 13 files changed, +177 −53 lines changed

Lines changed: 45 additions & 43 deletions
@@ -1,50 +1,61 @@
-# FP8 training examples
+# Low Precision Training
 
-This is an example of FP8 training and FP8 inference. Under FP8 training and inference, it can achieve more efficient inference throughput and lower training-inference mismatch, resulting in more stable training. More details can be found in [this blog](https://lmsys.org/blog/2025-11-25-fp8-rl/).
+- [FP8 rollout and BF16 training](#FP8-rollout-and-BF16-training)
+- [FP8 rollout and FP8 training](#FP8-rollout-and-FP8-training)
+- [INT4 QAT Training](#INT4-QAT-Training)
 
-## Files
+## FP8 rollout and BF16 training
 
-* `run-qwen3-4b-fp8.sh`: example launch script with Qwen3-4B in FP8.
+You can run FP8 rollout simply by setting `--hf-checkpoint` to a blockwise-quantized HuggingFace checkpoint, which can be converted by:
 
-* `run-qwen3-30b-a3b-fp8-two-nodes.sh`: example launch script for running Qwen3-30B-A3B in FP8 across two nodes.
+```bash
+python tools/convert_hf_to_fp8.py \
+    --model-dir $BF16_MODEL \
+    --save-dir $FP8_MODEL \
+    --strategy block --block-size 128 128 \
+    --max-workers 4
+```
+
+Please ensure that the `config.json` in the converted checkpoint directory contains the correct `quantization_config`, so that slime can automatically use FP8 quantization during weight updates.
+
+## FP8 rollout and FP8 training
+
+We also observed that FP8 training combined with FP8 inference achieves higher inference throughput and a smaller training-inference mismatch, resulting in more stable training. More details can be found in [this blog](https://lmsys.org/blog/2025-11-25-fp8-rl/).
 
-## Quick Start
+### Quick Start
+
+1. Convert your HuggingFace model weights to FP8 format using `tools/convert_hf_to_fp8.py` as shown above.
 
-1. Check if your training script is properly configured.
+2. Set up the launch script:
 
    For training tasks, we need to add these flags:
+
   ```bash
   --fp8-format e4m3
   --fp8-recipe blockwise
   # --fp8-param-gather # [optional] Currently incompatible with CPU Adam
   ```
+
   Then ensure the `NVTE_FP8_BLOCK_SCALING_FP32_SCALES` environment variable is enabled.
 
   Note that only `Linear` and `GroupLinear` layers in TransformerEngine use the FP8 format; `embedding` and `lm_head` remain in their original precision. If `--fp8-param-gather` is not enabled, weights in TransformerEngine remain in BF16 format, only being cast to FP8 during `GEMM` or `GroupGEMM` operations.
 
-2. Convert your HuggingFace model weights to FP8 format.
-
-   You can use `tools/convert_hf_to_fp8.py` to convert BF16 weights to FP8 format. Ensure that the `--hf-checkpoint` parameter points to a directory where the `config.json` contains the correct `quantization_config`. slime will automatically use FP8 quantization during weight updates.
+3. Start FP8 training with:
 
-3. Start FP8 training.
-
-   ```
-   cd slime
-
-   # Qwen3-4B FP8 training (single node)
-   bash examples/low_precision/run-qwen3-4b-fp8.sh
+   ```bash
+   # Qwen3-4B FP8 training
+   bash scripts/low_precision/run-qwen3-4b-fp8.sh
 
-   # Qwen3-30B-A3B FP8 training (two nodes)
-   bash examples/low_precision/run-qwen3-30b-a3b-fp8-two-nodes.sh
+   # Qwen3-30B-A3B FP8 training (2 nodes)
+   bash scripts/low_precision/run-qwen3-30b-a3b-fp8.sh
   ```
-   Following the above command will launch FP8 training.
 
 4. Use the saved checkpoint for evaluation.
 
   Note that TransformerEngine does not specifically save FP8 quantized weights; the saved torch dist remains in original precision (usually BF16). If you want to evaluate under FP8, you need to convert the checkpoint from `torch_dist` to HuggingFace format, then convert to FP8 HuggingFace format.
 
 
-## Quick Explanation
+### Quick Explanation
 
 Here's a quick explanation of how FP8 training is currently implemented in slime:
 
@@ -57,43 +68,34 @@ Here's a quick explanation of how FP8 training is currently implemented in slime
 4. Save checkpoint: Similar to weight updates, if checkpoints need to be saved from the training engine, they will also be dequantized back to bf16 and saved to `torch_dist` format checkpoints.
 
 
-## TODO
+### TODO
 
 Currently, FP8 is far from being a complete feature and still has the following bugs, for example:
 
 - FP8 weights (`--fp8-param-gather`) can provide memory-savings benefits, but currently FP8 weights must be used with TransformerEngine's FusedAdam, which conflicts with the commonly used Adam CPU offload technique in Megatron-LM.
 
-The slime team will continue to collaborate with the NVIDIA team to contribute more complete FP8 training infrastructure to the community.
-
-***
-
-## INT4 Training Examples
+## INT4 QAT Training
 
 This guide provides examples for INT4 STE (Straight-Through Estimator) training and INT4 inference. Utilizing INT4 inference significantly improves throughput, thereby accelerating the training pipeline (specifically during the rollout generation phase).
 
-### Files
-
-* `run-moonlight-16B-A3B-int4.sh`: Launch script for **Moonlight-16B-A3B** (INT4) on 4x H200 GPUs.
-* `run-qwen3-30B-A3B-int4.sh`: Launch script for **Qwen3-30B-A3B** (INT4) on 8x H200 GPUs.
-* `run-qwen3-235B-A22B-int4.sh`: Launch script for **Qwen3-235B-A22B** (INT4) on 64x H200 GPUs.
-* `run-kimi-k2-Thinking-int4.sh`: Launch script for **Kimi-k2-Thinking** (INT4) on 256x H200 GPUs.
-
 ### Quick Start
 
-#### 1. Convert HuggingFace Weights to INT4
+1. Convert HuggingFace Weights to INT4
 First, download the PTQ (Post-Training Quantization) calibration dataset from HuggingFace:
 [https://huggingface.co/datasets/Salesforce/wikitext/tree/main/wikitext-2-raw-v1](https://huggingface.co/datasets/Salesforce/wikitext/tree/main/wikitext-2-raw-v1)
 
-Next, use the `tools/convert_hf_to_hf_int4.py` script to convert BF16 weights to INT4 format. Ensure that the `--hf-checkpoint` parameter points to a directory where `config.json` contains the correct `quantization_config`. slime will automatically utilize INT4 quantization during weight updates.
+Next, use the `tools/convert_hf_to_int4.py` script to convert BF16 weights to INT4 format. Ensure that the `--hf-checkpoint` parameter points to a directory where `config.json` contains the correct `quantization_config`. slime will automatically utilize INT4 quantization during weight updates.
 
 ```bash
-python tools/convert_hf_to_hf_int4.py \
+python tools/convert_hf_to_int4.py \
     --input-dir /path/to/your/original/models \
     --output-dir /path/to/your/save/models \
     --data-dir /path/to/your/wikitext
 ```
 
-#### 2. Start INT4 Training
+Note: If you only want to run INT4 rollout, you only need to set `--hf-checkpoint` to the converted INT4 checkpoint.
+
+2. Start INT4 QAT Training
 
 You need to configure the specific environment variables for quantization settings.
 
@@ -120,16 +122,16 @@ RUNTIME_ENV_JSON="{
 
 ```bash
 # Moonlight-16B-A3B Int4 training
-bash examples/low_precision/run-moonlight-16B-A3B-int4.sh
+bash scripts/low_precision/run-moonlight-16B-A3B-int4.sh
 
 # Qwen3-30B-A3B Int4 training
-bash examples/low_precision/run-qwen3-30B-A3B-int4.sh
+bash scripts/low_precision/run-qwen3-30B-A3B-int4.sh
 
 # Qwen3-235B-A22B Int4 training (8 nodes)
-bash examples/low_precision/run-qwen3-235B-A22B-int4.sh
+bash scripts/low_precision/run-qwen3-235B-A22B-int4.sh
 
 # Kimi-k2-Thinking Int4 training (32 nodes)
-bash examples/low_precision/run-kimi-k2-Thinking-int4.sh
+bash scripts/low_precision/run-kimi-k2-Thinking-int4.sh
 ```
 
-- For multi-node environments, please start the Ray service according to your cluster configuration.
+- For multi-node environments, please start the Ray service according to your cluster configuration.
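The INT4 STE training described above works by fake-quantizing weights in the forward pass while letting gradients pass through the rounding unchanged. A toy numpy sketch, assuming symmetric per-group quantization with group size 128 (the real kernels and their symmetric/asymmetric choice may differ):

```python
import numpy as np

def fake_quant_int4(w, group_size=128):
    """Simulate symmetric per-group INT4 quantize-then-dequantize ("fake quant").

    The forward pass sees the INT4-rounded values; in STE training the backward
    pass treats the rounding as identity, so gradients flow straight through to
    the underlying full-precision weights. Assumes w.size % group_size == 0.
    """
    flat = w.reshape(-1, group_size)                        # one scale per group
    scale = np.abs(flat).max(axis=1, keepdims=True) / 7.0   # int4 range is [-8, 7]
    scale = np.maximum(scale, 1e-8)                         # guard all-zero groups
    q = np.clip(np.round(flat / scale), -8, 7)              # integer codes in [-8, 7]
    return (q * scale).reshape(w.shape)                     # dequantized forward value
```

Fake quantization is idempotent: quantizing an already-quantized tensor with the same group size returns it unchanged, which is what makes the QAT forward pass consistent with INT4 inference.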

docs/en/index.rst

Lines changed: 2 additions & 1 deletion

@@ -42,9 +42,10 @@ slime is the RL-framework behind GLM-4.7, GLM-4.6 and GLM-4.5. Apart from models
 
    _examples_synced/reproducibility/README.md
    advanced/speculative-decoding.md
+   advanced/low-precision.md
    advanced/fault-tolerance.md
-   advanced/arch-support-beyond-megatron.md
    advanced/pd-disaggregation.md
+   advanced/arch-support-beyond-megatron.md
 
 .. toctree::
    :maxdepth: 1

File renamed without changes.

docs/zh/advanced/low-precision.md

Lines changed: 120 additions & 0 deletions

@@ -0,0 +1,120 @@
+# Low Precision Training
+
+- [FP8 rollout and BF16 training](#fp8-rollout-and-bf16-training)
+- [FP8 rollout and FP8 training](#fp8-rollout-and-fp8-training)
+- [INT4 QAT training](#int4-qat-training)
+
+## FP8 rollout and BF16 training
+
+You can run FP8 rollout by setting `--hf-checkpoint` to a blockwise-quantized HuggingFace checkpoint. The conversion command is:
+
+```bash
+python tools/convert_hf_to_fp8.py \
+    --model-dir $BF16_MODEL \
+    --save-dir $FP8_MODEL \
+    --strategy block --block-size 128 128 \
+    --max-workers 4
+```
+
+Please make sure the `config.json` in the converted checkpoint directory contains the correct `quantization_config`, so that slime can automatically use FP8 quantization during weight updates.
+
+## FP8 rollout and FP8 training
+
+We observed that using FP8 in both training and inference yields higher inference throughput and a smaller training-inference mismatch, making training more stable. For details, see [this blog](https://lmsys.org/blog/2025-11-25-fp8-rl/).
+
+### Quick Start
+
+1. Convert your HuggingFace model weights to FP8 using `tools/convert_hf_to_fp8.py` as above.
+2. For training tasks, add the following flags:
+```bash
+--fp8-format e4m3
+--fp8-recipe blockwise
+# --fp8-param-gather # [optional] Currently incompatible with the CPU Adam optimizer
+```
+
+Also make sure the `NVTE_FP8_BLOCK_SCALING_FP32_SCALES` environment variable is enabled; we currently set it to `1` by default.
+
+Note: currently only the `Linear` and `GroupLinear` layers in TransformerEngine use the FP8 format; `embedding` and `lm_head` keep their original precision. If `--fp8-param-gather` is not enabled, weights in TransformerEngine are stored in BF16 and are only cast to FP8 temporarily during `GEMM` and `GroupGEMM` operations.
+
+3. Start training:
+
+```bash
+# Qwen3-4B FP8 training
+bash scripts/low_precision/run-qwen3-4b-fp8.sh
+
+# Qwen3-30B-A3B (2 nodes)
+bash scripts/low_precision/run-qwen3-30b-a3b-fp8.sh
+```
+
+4. Using the saved checkpoint: TransformerEngine does not specifically save FP8-quantized weights; the saved `torch_dist` checkpoint remains in the original precision (usually BF16). If you want to evaluate under FP8, first convert `torch_dist` to HuggingFace format, then convert that to an FP8 HuggingFace checkpoint.
+
+### How it works
+
+Here is how FP8 training is currently implemented in slime:
+
+1. **Initialization**: if the FP8 recipe is enabled, the relevant layers are built in an FP8 context.
+2. **Training**: during training, weights and activations are quantized online to the FP8 format, and `cuBLAS FP8 GEMM` is called in the forward and backward passes.
+3. **Weight updates**: during RL weight updates, Megatron first dequantizes the FP8 weights to BF16, and slime then requantizes these BF16 weights to FP8 and sends them to sglang. (This dequantize-and-requantize step is not elegant, but for framework compatibility the interface has not been changed yet.)
+4. **Saving checkpoints**: similar to weight updates, checkpoints saved from the training engine are dequantized back to BF16 and stored in `torch_dist` format.
+
+### TODO
+
+FP8 support is not yet fully mature; known issues include:
+
+* FP8 weight storage (`--fp8-param-gather`) saves GPU memory, but it currently must be used with TransformerEngine's `FusedAdam`, which conflicts with the CPU Adam technique in Megatron-LM.
+
+## INT4 QAT training
+
+This guide provides examples of INT4 STE (Straight-Through Estimator) training and INT4 inference. INT4 inference significantly improves throughput and thus accelerates the whole training pipeline (especially the rollout generation phase).
+
+### Quick Start
+
+1. **Convert HuggingFace weights to INT4**
+   First, download the PTQ (Post-Training Quantization) calibration dataset from HuggingFace:
+   [wikitext-2-raw-v1](https://huggingface.co/datasets/Salesforce/wikitext/tree/main/wikitext-2-raw-v1)
+   Then convert with the `tools/convert_hf_to_int4.py` script. Make sure the `config.json` in the directory that `--hf-checkpoint` points to contains the correct `quantization_config`.
+```bash
+python tools/convert_hf_to_int4.py \
+    --input-dir /path/to/your/original/models \
+    --output-dir /path/to/your/save/models \
+    --data-dir /path/to/your/wikitext
+```
+
+   **Tip**: if you only want to run INT4 rollout, just set `--hf-checkpoint` to the converted INT4 checkpoint path.
+2. **Start INT4 QAT training**
+   You need to configure specific environment variables for the quantization settings.
+   **Environment variables:**
+   * **`OPEN_TRAINING_INT4_FAKE_QAT_FLAG`**: enables the fake-quantization (Fake Quantization) ops for INT4 training.
+   * **`OPEN_TRAINING_INT4_GROUP_SIZE`**: sets the quantization group size for the model.
+     * `moonlight-16B-A3B`, `qwen3-30B-A3B`, and `qwen3-235B-A22B-int4` use **128**
+     * `kimi-k2-Thinking-int4` uses **32**
+
+   **Example configuration:**
+```json
+RUNTIME_ENV_JSON="{
+  \"env_vars\": {
+    ...
+    \"OPEN_TRAINING_INT4_FAKE_QAT_FLAG\": \"1\",
+    \"OPEN_TRAINING_INT4_GROUP_SIZE\": \"128\"
+  }
+}"
+```
+
+   **Launch commands:**
+```bash
+# Moonlight-16B-A3B Int4 training
+bash scripts/low_precision/run-moonlight-16B-A3B-int4.sh
+
+# Qwen3-30B-A3B Int4 training
+bash scripts/low_precision/run-qwen3-30B-A3B-int4.sh
+
+# Qwen3-235B-A22B Int4 training (8 nodes)
+bash scripts/low_precision/run-qwen3-235B-A22B-int4.sh
+
+# Kimi-k2-Thinking Int4 training (32 nodes)
+bash scripts/low_precision/run-kimi-k2-Thinking-int4.sh
+```
+
+*For multi-node environments, start the Ray service according to your cluster configuration.*
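Hand-escaping JSON inside the shell string for `RUNTIME_ENV_JSON` above is error-prone; an equivalent way to produce the same value from Python, shown only to clarify the structure (the two `env_vars` keys are the ones documented above; the `...` placeholder for other variables is omitted here):

```python
import json

runtime_env = {
    "env_vars": {
        # enable the INT4 fake-quantization (QAT) ops during training
        "OPEN_TRAINING_INT4_FAKE_QAT_FLAG": "1",
        # group size: 128 for Moonlight-16B-A3B / Qwen3-30B-A3B / Qwen3-235B-A22B,
        # 32 for Kimi-k2-Thinking
        "OPEN_TRAINING_INT4_GROUP_SIZE": "128",
    }
}

RUNTIME_ENV_JSON = json.dumps(runtime_env)
print(RUNTIME_ENV_JSON)
```

`json.dumps` handles quoting, so the values survive the shell round trip unmodified.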

docs/zh/index.rst

Lines changed: 3 additions & 2 deletions

@@ -42,9 +42,10 @@ slime is the RL training framework behind GLM-4.7, GLM-4.6, and GLM-4.5. Beyond
 
    _examples_synced/reproducibility/README.md
    advanced/speculative-decoding.md
-   advanced/fault-torlance.md
-   advanced/arch-support-beyond-megatron.md
+   advanced/low-precision.md
+   advanced/fault-tolerance.md
    advanced/pd-disaggregation.md
+   advanced/arch-support-beyond-megatron.md
 
 .. toctree::
    :maxdepth: 1

examples/low_precision/run-kimi-k2-Thinking-int4.sh renamed to scripts/low_precision/run-kimi-k2-Thinking-int4.sh

Lines changed: 1 addition & 1 deletion

@@ -24,7 +24,7 @@ fi
 echo "HAS_NVLINK: $HAS_NVLINK (detected $NVLINK_COUNT NVLink references)"
 
 SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
-source "${SCRIPT_DIR}/../../models/kimi-k2-thinking.sh"
+source "${SCRIPT_DIR}/../models/kimi-k2-thinking.sh"
 
 CKPT_ARGS=(
     --hf-checkpoint /root/Kimi-K2-Thinking/

examples/low_precision/run-moonlight-16B-A3B-int4.sh renamed to scripts/low_precision/run-moonlight-16B-A3B-int4.sh

Lines changed: 1 addition & 1 deletion

@@ -25,7 +25,7 @@ fi
 echo "HAS_NVLINK: $HAS_NVLINK (detected $NVLINK_COUNT NVLink references)"
 
 SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
-source "${SCRIPT_DIR}/../../models/moonlight.sh"
+source "${SCRIPT_DIR}/../models/moonlight.sh"
 
 CKPT_ARGS=(
     --hf-checkpoint /root/Moonlight-16B-A3B-Instruct-INT4

examples/low_precision/run-qwen3-235B-A22B-int4.sh renamed to scripts/low_precision/run-qwen3-235B-A22B-int4.sh

Lines changed: 1 addition & 1 deletion

@@ -24,7 +24,7 @@ fi
 echo "HAS_NVLINK: $HAS_NVLINK (detected $NVLINK_COUNT NVLink references)"
 
 SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
-source "${SCRIPT_DIR}/../../models/qwen3-235B-A22B.sh"
+source "${SCRIPT_DIR}/../models/qwen3-235B-A22B.sh"
 
 CKPT_ARGS=(
     --hf-checkpoint /root/Qwen3-235B-A22B-INT4/

examples/low_precision/run-qwen3-30B-A3B-int4.sh renamed to scripts/low_precision/run-qwen3-30B-A3B-int4.sh

Lines changed: 1 addition & 1 deletion

@@ -24,7 +24,7 @@ fi
 echo "HAS_NVLINK: $HAS_NVLINK (detected $NVLINK_COUNT NVLink references)"
 
 SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
-source "${SCRIPT_DIR}/../../models/qwen3-30B-A3B.sh"
+source "${SCRIPT_DIR}/../models/qwen3-30B-A3B.sh"
 
 CKPT_ARGS=(
     --hf-checkpoint /root/Qwen3-30B-A3B-INT4/

examples/low_precision/run-qwen3-30b-a3b-fp8-two-nodes.sh renamed to scripts/low_precision/run-qwen3-30b-a3b-fp8.sh

Lines changed: 1 addition & 1 deletion

@@ -25,7 +25,7 @@ fi
 echo "HAS_NVLINK: $HAS_NVLINK (detected $NVLINK_COUNT NVLink references)"
 
 SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
-source "${SCRIPT_DIR}/../../scripts/models/qwen3-30B-A3B.sh"
+source "${SCRIPT_DIR}/../scripts/models/qwen3-30B-A3B.sh"
 
 # Base directory for checkpoints and related files (adjust if necessary)
 BASE_DIR="/root"
