Commit 2476cae

Doc of true on-policy done 3 weeks ago (#711)
1 parent 66880d4 commit 2476cae


examples/true_on_policy/README.md

Lines changed: 43 additions & 0 deletions
# True On-Policy between Training and Inference

True on-policy ensures that the log probabilities generated by the inference engine (SGLang) are bitwise equal to those generated by the training engine.

## Examples

### Example 1

In this script, we provide a minimal example of using true on-policy.

```bash
python examples/true_on_policy/run_simple.py
```

### Example 2

This script supports more features for various use cases; the `--true-on-policy` flag enables the true on-policy feature.

```bash
python scripts/run_qwen3_4b_fsdp.py --true-on-policy
```

To quickly see the curve, you may use `--mode debug_minimal`, which skips evaluation and runs generation with a very short output sequence length (OSL). Since true on-policy is unrelated to OSL or answer correctness, this mode is suitable for quick experiments.

### Other Cases

To support true on-policy in other cases, please refer to the flags changed in the examples above.

### What You Should Observe

After running the training, you should see in wandb that the metric `train/train_rollout_logprob_abs_diff` is exactly `0`, indicating that there is no difference between the log probabilities from training and inference. Without the feature enabled, this value is typically nonzero.

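As an illustration of what this metric measures, here is a minimal sketch in plain Python (the function name is hypothetical, not slime's actual implementation, which operates on tensors inside the training loop):

```python
def rollout_logprob_abs_diff(train_logprobs, rollout_logprobs):
    # Mean absolute per-token difference between the log probabilities
    # computed by the training engine and those returned by the rollout
    # (inference) engine. It is exactly 0 only when every pair of values
    # is identical.
    diffs = [abs(t - r) for t, r in zip(train_logprobs, rollout_logprobs)]
    return sum(diffs) / len(diffs)

# With true on-policy enabled, the two engines agree exactly.
logprobs = [-0.25, -1.5, -0.75]
assert rollout_logprob_abs_diff(logprobs, list(logprobs)) == 0.0
```

Any mismatch in even a single token pushes this metric above zero, which is what makes it a sharp test for bitwise equality.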
## How it is Implemented

The core idea is to make each and every operation in training and inference bitwise identical. The main code is implemented in [#566](https://github.com/THUDM/slime/pull/566) and [SGLang#12058](https://github.com/sgl-project/sglang/pull/12058).

Briefly speaking, we aligned the following components:

* Attention: We use the FA3 backend for both training and inference, since it achieves bitwise equality between the prefill and decode operations.
* GEMM: We use DeepGEMM for fast matrix multiplication while preserving true on-policy, thanks to how it selects tensor core instructions ([SGLang#12142](https://github.com/sgl-project/sglang/pull/12142)).
* Other kernels: Besides using batch-invariant kernels as a prerequisite, we align the numeric details of operations between the two systems (e.g., op dtypes and the exact kernels used) for simplicity. Some operations can also be compiled for speedup ([#603](https://github.com/THUDM/slime/pull/603), [SGLang#12161](https://github.com/sgl-project/sglang/pull/12161)).

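To see why alignment must go down to the level of individual operations, note that floating-point arithmetic is not associative: summing the same values in a different order can change the result's bits. A toy example in plain Python:

```python
# Floating-point addition is not associative, so reduction order matters:
# the same three values summed left-to-right vs right-to-left produce
# different bit patterns. This is why kernels must perform reductions
# in exactly the same order in training and inference.
vals = [0.1, 0.2, 0.3]
left_to_right = (vals[0] + vals[1]) + vals[2]   # 0.6000000000000001
right_to_left = vals[0] + (vals[1] + vals[2])   # 0.6
assert left_to_right != right_to_left
```

Batch-invariant kernels guarantee that such reduction orders do not change with batch size, which is why they are a prerequisite rather than a full solution.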
To make it easier to align the two parts, we use SGLang's [dumper](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/debug_utils/dumper.py) tool for quick comparisons. ([#12622](https://github.com/sgl-project/sglang/pull/12622) and [#12623](https://github.com/sgl-project/sglang/pull/12623) are needed for the most convenient workflow.)
