This is an example of FP8 training and FP8 inference. Running both training and inference in FP8 yields higher inference throughput and a smaller training-inference mismatch, which makes training more stable.
### Files
- `run-qwen3-4b-fp8.sh`: example launch script for Qwen3-4B in FP8.
### Quick Start
1. (Optional) Convert your HuggingFace weights to FP8 format. You can use `tools/convert_hf_to_fp8`, or directly write an FP8-format model config.
2. Start FP8 training
```
cd slime
bash examples/fp8/run-qwen3-4b-fp8.sh
```
Running the command above launches FP8 training. By slime's design, if the model under `--hf-checkpoint` is in FP8, slime automatically applies FP8 quantization during weight updates.
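The automatic behavior hinges on what the checkpoint's config declares. As a rough sketch (the function name is hypothetical and this is not slime's actual detection code), an FP8 HuggingFace checkpoint typically identifies itself via a `quantization_config` entry in `config.json`:

```python
import json
import pathlib

def looks_like_fp8_checkpoint(hf_dir: str) -> bool:
    # Heuristic sketch: FP8 HuggingFace checkpoints commonly carry a
    # `quantization_config` with `quant_method: "fp8"` in config.json.
    cfg = json.loads((pathlib.Path(hf_dir) / "config.json").read_text())
    return cfg.get("quantization_config", {}).get("quant_method") == "fp8"
```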
3. Use the saved checkpoint for evaluation
Note that TransformerEngine does not save FP8-quantized weights as such; the saved torch dist checkpoint remains in the original precision (usually bf16). If you want to evaluate in FP8, convert the checkpoint from `torch_dist` to HuggingFace format, then convert that to an FP8 HuggingFace checkpoint.
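Conceptually, the final bf16-to-FP8 conversion stores each weight alongside a per-tensor scale. A minimal numpy sketch of that idea (the `_scale_inv` naming is an assumption modeled on common FP8 checkpoints, not slime's exact layout, and a real converter would cast to an actual fp8 dtype):

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in FP8 e4m3

def to_fp8_state_dict(state_dict):
    """Pair each weight with a per-tensor scale so FP8 kernels can
    recover it as quantized_weight * scale. Illustrative only."""
    out = {}
    for name, w in state_dict.items():
        scale = np.abs(w).max() / E4M3_MAX
        # Scaled values fit the e4m3 range; a real tool would store fp8 bytes.
        out[name] = np.clip(w / scale, -E4M3_MAX, E4M3_MAX).astype(np.float32)
        out[name + "_scale_inv"] = np.float32(scale)
    return out
```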
### Quick Explanation
Here's a quick explanation of how FP8 training is currently implemented in slime:
1. Initialization: if an FP8 recipe is enabled, layers are built inside an FP8 context.
2. Training: during training, weights and activations are quantized online to the nvfp8 format, and cuBLAS FP8 GEMM kernels handle the GEMM computations in the forward and backward passes.
3. Update weight: during RL weight updates, the training engine exports its model weights. The exported weights are dequantized from FP8 to bf16, but since the config under `--hf-checkpoint` is FP8, slime re-quantizes these bf16 weights back to FP8 for the update.
4. Save checkpoint: similarly, checkpoints saved from the training engine are dequantized back to bf16 and written as `torch_dist` checkpoints.
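The scaled-GEMM idea in step 2 can be sketched in a few lines of numpy. This is a simplification under the assumption of per-tensor scaling: it omits the actual rounding to 8-bit precision and TransformerEngine's amax/delayed-scaling machinery, and only shows how the per-tensor scales are folded back into the output:

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite value representable in FP8 e4m3

def quantize(x):
    # Per-tensor scale chosen so the largest magnitude maps to the FP8 max.
    s = np.abs(x).max() / E4M3_MAX
    return np.clip(x / s, -E4M3_MAX, E4M3_MAX), s

def fp8_gemm(a, b):
    # cuBLAS-style scaled GEMM: multiply the quantized operands, then fold
    # both per-tensor scales back into the higher-precision accumulator.
    qa, sa = quantize(a)
    qb, sb = quantize(b)
    return (qa @ qb) * (sa * sb)
```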
### TODO
Currently, FP8 is far from a complete feature and still has limitations, for example:
- FP8 weights (`--fp8-param-gather`) save memory, but currently they must be used with TransformerEngine's FusedAdam, which conflicts with the Adam CPU-offload technique commonly used in Megatron-LM.
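For a sense of the memory saving at stake, the arithmetic is simple: bf16 stores 2 bytes per parameter while FP8 stores 1 (ignoring the small per-tensor scale overhead). For a 4B-parameter model like the one in this example:

```python
# Back-of-the-envelope parameter memory for a 4B-parameter model.
params = 4e9
bf16_gb = params * 2 / 1e9  # 2 bytes per bf16 parameter
fp8_gb = params * 1 / 1e9   # 1 byte per fp8 parameter
print(f"bf16: {bf16_gb:.0f} GB, fp8: {fp8_gb:.0f} GB")  # bf16: 8 GB, fp8: 4 GB
```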
The slime team will continue to collaborate with the NVIDIA team to contribute more complete FP8 training infrastructure to the community.