
Commit cef6361

[docs] lora: fix lora image and add GRPO docs (#1788)
### Checklist Before Starting

- [ ] Search for similar PR(s).

### What does this PR do?

Fix image rendering
1 parent ab97d9b commit cef6361

File tree

10 files changed: +416 −32 lines changed


README.md

Lines changed: 1 addition & 1 deletion
@@ -81,7 +81,7 @@ verl is fast with:
 - LLM alignment recipes such as [Self-play preference optimization (SPPO)](https://github.com/volcengine/verl/tree/main/recipe/sppo)
 - Flash attention 2, [sequence packing](examples/ppo_trainer/run_qwen2-7b_seq_balance.sh), [sequence parallelism](examples/ppo_trainer/run_deepseek7b_llm_sp2.sh) support via DeepSpeed Ulysses, [LoRA](examples/sft/gsm8k/run_qwen_05_peft.sh), [Liger-kernel](examples/sft/gsm8k/run_qwen_05_sp2_liger.sh).
 - Scales up to 70B models and hundreds of GPUs.
-- Lora RL support to save memory.
+- Multi-gpu [LoRA RL](https://verl.readthedocs.io/en/latest/advance/ppo_lora.html) support to save memory.
 - Experiment tracking with wandb, swanlab, mlflow and tensorboard.

 ## Upcoming Features and Changes

docs/advance/ppo_lora.rst

Lines changed: 1 addition & 1 deletion
@@ -50,7 +50,7 @@ Best Practices and Notes
 - For a 32B model, with lora_rank=128, the training convergence speed and final performance are also almost identical to non-LoRA training.
 - More comprehensive reference results are coming soon.

-.. image:: https://github.com/eric-haibin-lin/verl-community/blob/f2b80b8b26829124dd393b7a795a0640eff11644/docs/lora.jpg
+.. image:: https://github.com/eric-haibin-lin/verl-community/blob/f2b80b8b26829124dd393b7a795a0640eff11644/docs/lora.jpg?raw=true


 3. Reference configuration for RL training with the Qwen2.5-72B model using 8 x 80GB GPUs (increase lora_rank if needed):

docs/algo/dapo.md

Lines changed: 161 additions & 0 deletions
@@ -0,0 +1,161 @@
# Recipe: Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO)

> Open-Source Algorithm Implementation & Experiment Running: [Yuxuan Tong](https://tongyx361.github.io/), [Guangming Sheng](https://hk.linkedin.com/in/guangming-sheng-b50640211)

🏠 [Homepage](https://dapo-sia.github.io/) | 📝 [Paper](https://dapo-sia.github.io/static/pdf/dapo_paper.pdf) | 🤗 [Datasets&Models@HF](https://huggingface.co/collections/BytedTsinghua-SIA/dapo-67d7f1517ee33c8aed059da0) | 🐱 [Code@GitHub](https://github.com/volcengine/verl/tree/gm-tyx/puffin/main/recipe/dapo) | 🐱 [Repo@GitHub](https://github.com/BytedTsinghua-SIA/DAPO)

> We propose the **D**ecoupled Clip and Dynamic s**A**mpling **P**olicy **O**ptimization (DAPO) algorithm. By making our work publicly available, we provide the broader research community and society with practical access to scalable reinforcement learning, enabling all to benefit from these advancements. Training the Qwen2.5-32B base model with DAPO outperforms the previous state-of-the-art DeepSeek-R1-Zero-Qwen-32B on AIME 2024, achieving **50%** accuracy with **50%** fewer training steps.
>
> ![dapo-main-result](https://dapo-sia.github.io/static/images/score.png)
## Quickstart

1. Prepare the datasets **on the Ray cluster**:

```bash
bash prepare_dapo_data.sh # This downloads the datasets to ${HOME}/verl/data by default
```

2. Submit the job to the Ray cluster **from any machine**:

```bash
cd verl # Repo root
export RAY_ADDRESS="http://${RAY_IP:-localhost}:8265" # The Ray cluster address to connect to
export WORKING_DIR="${PWD}" # The local directory to package to the Ray cluster
# Set the runtime environment like env vars and pip packages for the Ray cluster in yaml
export RUNTIME_ENV="./verl/trainer/runtime_env.yaml"
bash recipe/dapo/run_dapo_qwen2.5_32b.sh
```

## Reproduction Runs

| Setup                                        | AIME 2024 Acc. | Training Script                                                  | Training Record                                                                            |
| -------------------------------------------- | -------------- | ---------------------------------------------------------------- | ------------------------------------------------------------------------------------------ |
| DAPO w/o Token-level Loss & Dynamic Sampling | 44%            | [run_dapo_early_qwen2.5_32b.sh](./run_dapo_early_qwen2.5_32b.sh) | [W&B](https://wandb.ai/verl-org/DAPO%20Reproduction%20on%20verl/workspace?nw=wmb4qxfht0n)  |
| DAPO w/o Dynamic Sampling                    | 50%            | [run_dapo_wo_ds_qwen2.5_32b.sh](./run_dapo_wo_ds_qwen2.5_32b.sh) | [W&B](https://wandb.ai/verl-org/DAPO%20Reproduction%20on%20verl/workspace?nw=wmb4qxfht0n)  |
| DAPO                                         | 52%            | [run_dapo_qwen2.5_32b.sh](./run_dapo_qwen2.5_32b.sh)             | [W&B](https://wandb.ai/verl-org/DAPO%20Reproduction%20on%20verl/workspace?nw=wmb4qxfht0n)  |
## Configuration

> [!NOTE]
> Most experiments in the paper, including the best-performing one, are run without Overlong Filtering, because it largely overlaps with Overlong Reward Shaping in terms of properly learning from the longest outputs. So we don't implement it here.
### Separated Clip Epsilons (-> Clip-Higher)

An example configuration:

```yaml
actor_rollout_ref:
  actor:
    clip_ratio_low: 0.2
    clip_ratio_high: 0.28
```

`clip_ratio_low` and `clip_ratio_high` specify the $\varepsilon_\text{low}$ and $\varepsilon_\text{high}$ in the DAPO objective.

Core relevant code:

```python
pg_losses1 = -advantages * ratio
pg_losses2 = -advantages * torch.clamp(ratio, 1 - cliprange_low, 1 + cliprange_high)
pg_losses = torch.maximum(pg_losses1, pg_losses2)
```
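As a quick, self-contained illustration of the decoupled ranges (toy tensors, not verl code): with `clip_ratio_high > clip_ratio_low`, a token with positive advantage keeps receiving gradient up to ratio `1.28`, while clipping on the other side still kicks in at `0.8`.

```python
import torch

advantages = torch.tensor([1.0, 1.0, -1.0])
ratio = torch.tensor([1.25, 1.35, 0.70])  # new_prob / old_prob
cliprange_low, cliprange_high = 0.2, 0.28

pg_losses1 = -advantages * ratio
pg_losses2 = -advantages * torch.clamp(ratio, 1 - cliprange_low, 1 + cliprange_high)
pg_losses = torch.maximum(pg_losses1, pg_losses2)
print(pg_losses)  # tensor([-1.2500, -1.2800, 0.8000]): 1.25 passes, 1.35 clips at 1.28, 0.70 clips at 0.80
```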
### Dynamic Sampling (with Group Filtering)

An example configuration:

```yaml
data:
  gen_batch_size: 1536
  train_batch_size: 512
algorithm:
  filter_groups:
    enable: True
    metric: acc # score / seq_reward / seq_final_reward / ...
    max_num_gen_batches: 10 # Non-positive values mean no upper limit
```

Setting `filter_groups.enable` to `True` will filter out groups whose outputs' `metric` values are all identical, e.g., for `acc`, groups whose outputs are all correct (accuracy 1) or all wrong (accuracy 0).

The trainer repeatedly samples batches of `gen_batch_size` prompts until there are enough qualified groups to fill `train_batch_size`, or raises an error once the upper limit specified by `max_num_gen_batches` is reached.

Core relevant code:

```python
prompt_bsz = self.config.data.train_batch_size
if num_prompt_in_batch < prompt_bsz:
    print(f'{num_prompt_in_batch=} < {prompt_bsz=}')
    num_gen_batches += 1
    max_num_gen_batches = self.config.algorithm.filter_groups.max_num_gen_batches
    if max_num_gen_batches <= 0 or num_gen_batches < max_num_gen_batches:
        print(f'{num_gen_batches=} < {max_num_gen_batches=}. Keep generating...')
        continue
    else:
        raise ValueError(
            f'{num_gen_batches=} >= {max_num_gen_batches=}. Generated too many. Please check your data.'
        )
else:
    # Align the batch
    traj_bsz = self.config.data.train_batch_size * self.config.actor_rollout_ref.rollout.n
    batch = batch[:traj_bsz]
```
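A minimal sketch of the group-filtering criterion itself (tensor names and shapes are assumptions, not verl's internals): a group is kept only when its `metric` values are not all identical, i.e., it still provides a learning signal after group normalization.

```python
import torch

# acc[i, j]: accuracy of the j-th sampled response to the i-th prompt
acc = torch.tensor([[1.0, 1.0, 1.0, 1.0],   # all correct -> filtered out
                    [0.0, 1.0, 0.0, 1.0],   # mixed       -> kept
                    [0.0, 0.0, 0.0, 0.0]])  # all wrong   -> filtered out

keep = (acc != acc[:, :1]).any(dim=-1)  # True where the group's values are not all equal
print(keep)  # tensor([False,  True, False])
```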
### Flexible Loss Aggregation Mode (-> Token-level Loss)

An example configuration:

```yaml
actor_rollout_ref:
  actor:
    loss_agg_mode: "token-mean" # / "seq-mean-token-sum" / "seq-mean-token-mean"
    # NOTE: "token-mean" is the default behavior
```

Setting `loss_agg_mode` to `token-mean` averages the (policy gradient) loss across all the tokens of all the sequences in a mini-batch.

Core relevant code:

```python
if loss_agg_mode == "token-mean":
    loss = verl_F.masked_mean(loss_mat, loss_mask)
elif loss_agg_mode == "seq-mean-token-sum":
    seq_losses = torch.sum(loss_mat * loss_mask, dim=-1)  # token-sum
    loss = torch.mean(seq_losses)  # seq-mean
elif loss_agg_mode == "seq-mean-token-mean":
    seq_losses = torch.sum(loss_mat * loss_mask, dim=-1) / torch.sum(loss_mask, dim=-1)  # token-mean
    loss = torch.mean(seq_losses)  # seq-mean
else:
    raise ValueError(f"Invalid loss_agg_mode: {loss_agg_mode}")
```
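To see how the modes differ, consider one 4-token sequence with per-token loss 1.0 and one 1-token sequence with loss 4.0 (a toy illustration, not verl code): `token-mean` weights every token equally, while `seq-mean-token-mean` weights every sequence equally regardless of length.

```python
import torch

loss_mat = torch.tensor([[1.0, 1.0, 1.0, 1.0],
                         [4.0, 0.0, 0.0, 0.0]])
loss_mask = torch.tensor([[1.0, 1.0, 1.0, 1.0],
                          [1.0, 0.0, 0.0, 0.0]])

token_mean = (loss_mat * loss_mask).sum() / loss_mask.sum()     # 8 / 5 = 1.6
seq_means = (loss_mat * loss_mask).sum(-1) / loss_mask.sum(-1)  # [1.0, 4.0]
seq_mean_token_mean = seq_means.mean()                          # 2.5
print(token_mean.item(), seq_mean_token_mean.item())
```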
### Overlong Reward Shaping

An example configuration:

```yaml
data:
  max_response_length: 20480 # 16384 + 4096
reward_model:
  overlong_buffer:
    enable: True
    len: 4096
    penalty_factor: 1.0
```

Setting `overlong_buffer.enable` to `True` penalizes outputs whose lengths are overlong but still within the hard context limit.

Specifically, the penalty increases linearly from `0` to `overlong_buffer.penalty_factor` as the output length exceeds `max_response_length - overlong_buffer.len` by `0` to `overlong_buffer.len` tokens, i.e., until it reaches the hard limit `max_response_length`.

Core relevant code:

```python
if self.overlong_buffer_cfg.enable:
    overlong_buffer_len = self.overlong_buffer_cfg.len
    expected_len = self.max_resp_len - overlong_buffer_len
    exceed_len = valid_response_length - expected_len
    overlong_penalty_factor = self.overlong_buffer_cfg.penalty_factor
    overlong_reward = min(-exceed_len / overlong_buffer_len * overlong_penalty_factor, 0)
    reward += overlong_reward
```
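For a concrete feel of the shaping under the example configuration above (`max_response_length=20480`, `overlong_buffer.len=4096`, `penalty_factor=1.0`), the following toy calculation mirrors the snippet; it is an illustration, not verl code.

```python
max_resp_len, overlong_buffer_len, penalty_factor = 20480, 4096, 1.0

def overlong_reward(valid_response_length: int) -> float:
    expected_len = max_resp_len - overlong_buffer_len  # 16384
    exceed_len = valid_response_length - expected_len
    return min(-exceed_len / overlong_buffer_len * penalty_factor, 0)

print(overlong_reward(16000))  # 0    -> at or below the expected length, no penalty
print(overlong_reward(18432))  # -0.5 -> halfway into the overlong buffer
print(overlong_reward(20480))  # -1.0 -> at the hard limit
```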

docs/algo/grpo.md

Lines changed: 69 additions & 0 deletions
@@ -0,0 +1,69 @@
# Group Relative Policy Optimization (GRPO)

In reinforcement learning, classic algorithms like PPO rely on a "critic" model to estimate the value of actions, guiding the learning process. However, training this critic model can be resource-intensive.

GRPO simplifies this process by eliminating the need for a separate critic model. Instead, it operates as follows:
- Group Sampling: For a given problem, the model generates multiple possible solutions, forming a "group" of outputs.
- Reward Assignment: Each solution is evaluated and assigned a reward based on its correctness or quality.
- Baseline Calculation: The average reward of the group serves as a baseline.
- Policy Update: The model updates its parameters by comparing each solution's reward to the group baseline, reinforcing better-than-average solutions and discouraging worse-than-average ones.

This approach reduces computational overhead by avoiding the training of a separate value estimation model, making the learning process more efficient. For more details, refer to the original paper [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://arxiv.org/pdf/2402.03300).

## Key Components

- No Value Function (Critic-less): unlike PPO, GRPO does not train a separate value network (critic).
- Group Sampling (Grouped Rollouts): instead of evaluating one rollout per input, GRPO generates multiple completions (responses) from the current policy for each prompt. This set of completions is referred to as a group.
- Relative Rewards: within each group, completions are scored (e.g., based on correctness), and rewards are normalized relative to the group.
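The group-relative advantage at the heart of GRPO can be summarized in a few lines. This is an illustration only, not verl's internal API; the tensor layout and the `eps` constant are assumptions.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scalar reward of each sampled completion.

    Each completion's advantage is its reward minus the mean reward of its group,
    divided by the group's standard deviation (the group acts as the baseline).
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# One prompt, 4 sampled solutions scored 1 (correct) or 0 (incorrect):
print(grpo_advantages(torch.tensor([[1.0, 0.0, 0.0, 1.0]])))
```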
## Configuration

Note that all configs containing `micro_batch_size` only set the maximum sample or token count per forward or backward pass to avoid GPU OOM; their values should not change algorithmic/convergence behavior.

Although many configuration keys start with the `ppo_` prefix, they work across different RL algorithms in verl, as the GRPO training loop is similar to that of PPO (without the critic).

![image](https://github.com/user-attachments/assets/16aebad1-0da6-4eb3-806d-54a74e712c2d)

- `actor_rollout_ref.rollout.n`: For each prompt, sample n times. Defaults to 1. For GRPO, please set it to a value larger than 1 for group sampling.

- `data.train_batch_size`: The global batch size of prompts used to generate a set of sampled trajectories/rollouts. The number of responses/trajectories is `data.train_batch_size * actor_rollout_ref.rollout.n`.

- `actor_rollout_ref.actor.ppo_mini_batch_size`: The set of sampled trajectories is split into multiple mini-batches with batch_size=ppo_mini_batch_size for PPO actor updates. The ppo_mini_batch_size is a global size across all workers.

- `actor_rollout_ref.actor.ppo_epochs`: Number of epochs for GRPO updates on one set of sampled trajectories for the actor.

- `actor_rollout_ref.actor.clip_ratio`: The GRPO clip range. Defaults to 0.2.

- `algorithm.adv_estimator`: Defaults to gae. Please set it to grpo instead.

- `actor_rollout_ref.actor.loss_agg_mode`: Defaults to "token-mean". Options include "token-mean", "seq-mean-token-sum", "seq-mean-token-mean". The original GRPO paper takes the sample-level loss (seq-mean-token-mean), which may be unstable in long-CoT scenarios. All GRPO example scripts provided in verl use the default configuration "token-mean" for loss aggregation instead.

Instead of adding a KL penalty in the reward, GRPO regularizes by directly adding the KL divergence between the trained policy and the reference policy to the loss:

- `actor_rollout_ref.actor.use_kl_loss`: Whether to use a KL loss in the actor. When enabled, KL is not applied in the reward function. Defaults to False. Please set it to True for GRPO.

- `actor_rollout_ref.actor.kl_loss_coef`: The coefficient of the KL loss. Defaults to 0.001.

- `actor_rollout_ref.actor.kl_loss_type`: How to calculate the KL divergence between the actor and the reference policy. Supported values: kl (k1), abs, mse (k2), low_var_kl (k3), and full. See this blog post for a detailed analysis: http://joschu.net/blog/kl-approx.html
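Putting the options above together, a GRPO run typically passes these settings as command-line overrides to the trainer entry point. The values below are illustrative only; see the tested scripts under `examples/grpo_trainer/` for complete commands.

```bash
python3 -m verl.trainer.main_ppo \
    algorithm.adv_estimator=grpo \
    data.train_batch_size=1024 \
    actor_rollout_ref.rollout.n=5 \
    actor_rollout_ref.actor.ppo_mini_batch_size=256 \
    actor_rollout_ref.actor.clip_ratio=0.2 \
    actor_rollout_ref.actor.loss_agg_mode=token-mean \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.actor.kl_loss_coef=0.001 \
    actor_rollout_ref.actor.kl_loss_type=low_var_kl
```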
## Advanced Extensions

### DrGRPO
[Understanding R1-Zero-Like Training: A Critical Perspective](https://arxiv.org/pdf/2503.20783) claims there is an optimization bias in GRPO that leads to artificially longer responses, especially for incorrect outputs. This inefficiency stems from the way GRPO calculates advantages using group-based reward normalization. Instead, DrGRPO aggregates token-level losses by normalizing with a global constant to eliminate length bias.

Configure the following to enable DrGRPO, with all other parameters the same as GRPO's:

- `actor_rollout_ref.actor.loss_agg_mode`: "seq-mean-token-sum-norm", which turns off seq-dim averaging
- `actor_rollout_ref.actor.use_kl_loss`: Please set it to False for DrGRPO
- `algorithm.norm_adv_by_std_in_grpo`: False, which turns off standard deviation norm
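For intuition, the aggregation behind `seq-mean-token-sum-norm` can be sketched as summing token losses per sequence and then dividing by a global constant (here assumed to be the padded maximum response length) instead of each sequence's own token count. This is a sketch of the idea only; verl's exact implementation may differ.

```python
import torch

def drgrpo_loss_agg(loss_mat: torch.Tensor, loss_mask: torch.Tensor) -> torch.Tensor:
    # Token losses are summed per sequence (no per-sequence length normalization),
    # then the total is divided by a constant shared across all sequences,
    # which removes the length bias that per-sequence averaging introduces.
    seq_losses = torch.sum(loss_mat * loss_mask, dim=-1)  # token-sum per sequence
    return torch.sum(seq_losses) / loss_mask.shape[-1]    # divide by a global constant
```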
## Reference Example

Qwen2.5 GRPO training log and commands: [link](https://github.com/eric-haibin-lin/verl-data/blob/experiments/gsm8k/qwen2-7b-fsdp2.log)

```bash
bash examples/grpo_trainer/run_qwen3-8b.sh
```

For more reference performance, please see https://verl.readthedocs.io/en/latest/algo/baseline.html

docs/algo/spin.md

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-# SPIN: Self-Play Fine-Tuning
+# Recipe: Self-Play Fine-Tuning (SPIN)

 `verl` provides a recipe inspired by the paper **"Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models"** (SPIN). SPIN is a language model finetuning algorithm that enables iterative self-improvement through a self-play mechanism inspired by game theory.
docs/algo/sppo.md

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-# SPPO: Self-Play Preference Optimization
+# Recipe: Self-Play Preference Optimization (SPPO)

 verl provides a community recipe implementation for the paper [Self-Play Preference Optimization for Language Model Alignment](https://arxiv.org/abs/2405.00675). SPPO can significantly enhance the performance of an LLM without strong external signals such as responses or preferences from GPT-4. It can outperform the model trained with iterative direct preference optimization (DPO), among other methods. SPPO is theoretically grounded, ensuring that the LLM can converge to the von Neumann winner (i.e., Nash equilibrium) under general, potentially intransitive preference, and empirically validated through extensive evaluations on multiple datasets.

docs/index.rst

Lines changed: 10 additions & 8 deletions
@@ -25,7 +25,7 @@ verl is fast with:
 .. _Contents:

 .. toctree::
-   :maxdepth: 5
+   :maxdepth: 2
    :caption: Quickstart

    start/install
@@ -34,40 +34,42 @@ verl is fast with:
    start/ray_debug_tutorial

 .. toctree::
-   :maxdepth: 4
+   :maxdepth: 2
    :caption: Programming guide

    hybrid_flow
    single_controller

 .. toctree::
-   :maxdepth: 5
+   :maxdepth: 1
    :caption: Data Preparation

    preparation/prepare_data
    preparation/reward_function

 .. toctree::
-   :maxdepth: 5
+   :maxdepth: 2
    :caption: Configurations

    examples/config

 .. toctree::
-   :maxdepth: 2
+   :maxdepth: 1
    :caption: PPO Example

    examples/ppo_code_architecture
    examples/gsm8k_example
    examples/multi_modal_example

 .. toctree::
-   :maxdepth: 2
+   :maxdepth: 1
    :caption: Algorithms

    algo/ppo.md
-   algo/sppo.md
+   algo/grpo.md
+   algo/dapo.md
    algo/spin.md
+   algo/sppo.md
    algo/baseline.md

 .. toctree::
@@ -117,7 +119,7 @@ verl is fast with:


 .. toctree::
-   :maxdepth: 1
+   :maxdepth: 2
    :caption: FAQ

    faq/faq
