Commit 8955606: "modify readme.md" (parent: b37fa61)

3 files changed (+104, −150 lines)

3 files changed

+104
-150
lines changed

README.md

Lines changed: 46 additions & 69 deletions
# APRIL: Active Partial Rollouts in Reinforcement Learning to Tame Long-tail Generation

## About

### Background: Why the sampling-training loop of synchronous RL is dragged down by the "long tail"

In on-policy RLHF/GRPO training, the system enters an update phase only after collecting **N** rollout samples in a round. Because generated samples vary widely in length, the system must wait for a few **long-tail samples** to finish before the training phase can start, which lowers GPU utilization and throughput in the later stages of the rollout phase.

### What We Did: Active Partial Rollout (APRIL)

**Core Idea**: In each round, we **over-sample** (N' > N) and **actively interrupt** the remaining in-progress requests once the target of **N** completed samples is reached. The **unfinished responses** are stored in a **buffer** and **prioritized for continued rollout** in the next round, mitigating the efficiency loss caused by long-tail requests.

![scheduling](./imgs/partial_scheduling.png)

### Highlights

- **Over-sampling**: If the training phase requires `rollout_batch_size=32` complete samples per round, we actually launch a larger sampling request, e.g., `over_sampling_batch_size=64`.
- **Stop upon collection**: As soon as the number of collected complete sample groups reaches `rollout_batch_size`, an `abort` signal is immediately sent to the sglang router.
- **Collect and reuse**: Upon receiving the `abort` signal, sglang stops the in-flight generation tasks and returns their partially generated outputs (half-finished trajectories). This partial data is not discarded but stored in a buffer; when the next rollout round begins, these samples continue generating from where they left off alongside new prompts, achieving seamless reuse across iteration steps.
- **Elegant implementation**: slime's partial rollout provides a native, lightweight optimization that is minimally intrusive to the original pipeline. You can enable it out of the box simply by setting the `--partial-rollout` flag and specifying `--over-sampling-batch-size`.
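
The over-sample, abort, buffer, and resume cycle above can be sketched as a toy scheduling loop. This is a minimal illustration only, not slime's implementation: `Request`, `generate_step`, and `rollout_round` are hypothetical stand-ins, and real decoding is concurrent and stochastic.

```python
from collections import deque

class Request:
    """A rollout request: a prompt plus the response tokens generated so far."""
    def __init__(self, prompt, target_len):
        self.prompt = prompt
        self.tokens = []              # partial response, preserved across rounds
        self.target_len = target_len  # stand-in for the stochastic generation length
        self.done = False

def generate_step(req):
    """Stand-in for one decoding step of the inference engine."""
    req.tokens.append(len(req.tokens))
    if len(req.tokens) >= req.target_len:
        req.done = True

def rollout_round(prompts, buffer, rollout_batch_size, over_sampling_batch_size):
    """One APRIL round: resume buffered partials first, over-sample, abort stragglers."""
    active = []
    while buffer and len(active) < over_sampling_batch_size:
        active.append(buffer.popleft())      # resume unfinished samples first
    while prompts and len(active) < over_sampling_batch_size:
        active.append(Request(*prompts.pop(0)))

    completed = []
    while len(completed) < rollout_batch_size and active:
        for req in list(active):             # simulate concurrent decoding steps
            generate_step(req)
            if req.done:
                active.remove(req)
                completed.append(req)
            if len(completed) >= rollout_batch_size:
                break                        # target met: stop immediately ("abort")

    buffer.extend(active)                    # keep partial responses for next round
    return completed
```

A long request interrupted in round k thus resumes from its buffered tokens in round k+1 instead of blocking the round or being regenerated from scratch.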

## Three Steps to Get Started

### 1) Environment Setup (Requires an AMD GPU)

**Start docker**

```bash
docker run --rm --gpus all --ipc=host --shm-size=16g \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -it rlsys/slime:slime_ubuntu22.04_rocm6.3.4-patch-numa-patch_sglang0.4.9_megatron-patch_ray2.47.1_apex_torch-memory-saver0.0.8-patch-vim /bin/bash
```

### 2) Install APRIL

```bash
git clone https://github.com/RLsys-Foundation/APRIL.git
cd APRIL
pip install -e .
```

### 3) Run an Example

All scripts are in the `scripts/partial_rollout/` directory.

```bash
bash scripts/partial_rollout/qwen/grpo/run-qwen3-4B-dapo-partial.sh
```

### 4) Parameter Details

The core functionality of partial rollout is controlled by the following parameters:

```bash
# Enable the partial rollout feature:
# stop generation once the target count is reached and recycle
# unfinished samples into the buffer for the next round.
--partial-rollout

# The batch size for sampling, which controls the sampling granularity per round.
# If it is greater than rollout_batch_size, over-sampling is performed.
# If it is smaller, sampling continues at this granularity until
# rollout_batch_size complete samples have been collected.
--over-sampling-batch-size 16
```
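
For intuition, the interaction of the two batch sizes can be modeled as below. This is illustrative only; `sample_chunk` is a hypothetical stand-in for one sampling call to the inference backend.

```python
def collect_samples(rollout_batch_size, over_sampling_batch_size, sample_chunk):
    """Issue sampling requests of size over_sampling_batch_size until at least
    rollout_batch_size complete samples are in hand; the surplus corresponds to
    the requests APRIL aborts and buffers."""
    completed = []
    while len(completed) < rollout_batch_size:
        completed.extend(sample_chunk(over_sampling_batch_size))
    return completed[:rollout_batch_size]
```

With `over_sampling_batch_size=64` and `rollout_batch_size=32`, one chunk suffices and the extras are aborted; with `over_sampling_batch_size=16`, two chunks are issued before the target is met.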

For other parameters, please refer to the arguments in [arguments.py](./slime/utils/arguments.py). For more details, consult the upstream [slime](https://github.com/THUDM/slime) repository.

## Results and Comparison (Abridged)

| Dataset       | Model    | Metric             | APRIL vs. Baseline |
|---------------|----------|--------------------|--------------------|
| DAPO-Math-17k | Qwen3-4B | Rollout Throughput | **+17%**           |
| DeepScaleR    | Qwen3-4B | Rollout Throughput | **+21%**           |
| DeepMath-103K | Qwen3-4B | Rollout Throughput | **+35%**           |

![evaluation](./imgs/eval_dapo_qwen.png)

## Frequently Asked Questions (FAQ)

- **Q: Will APRIL affect policy purity and convergence?**
  - A: It does affect policy purity: the proportion of off-policy tokens in a round is about 40%. In both engineering practice and experiments, however, partial rollout has not introduced noticeable instability under the current settings. Tasks with a much larger `max_response_length` (e.g., agent or multi-turn tasks) still require further verification.

- **Q: Are changes to the decoding kernel required?**
  - A: No. APRIL operates at the **system scheduling layer** and does not conflict with inference acceleration techniques such as speculative decoding or continuous batching; they are complementary and can be stacked.
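
The off-policy token proportion mentioned above can be monitored with a simple ratio: tokens carried over from earlier rounds (the buffered prefixes of resumed samples) divided by all response tokens in the batch. A hypothetical helper, not part of slime or APRIL, might look like:

```python
def off_policy_token_ratio(samples):
    """Fraction of response tokens generated under an earlier policy version.
    samples: list of (stale_len, total_len) pairs, where stale_len counts the
    buffered prefix tokens of a resumed sample (0 for samples completed
    entirely within the current round)."""
    stale = sum(s for s, _ in samples)
    total = sum(t for _, t in samples)
    return stale / total if total else 0.0
```

Logging this ratio per round makes it easy to check whether it stays near the ~40% level reported here as `max_response_length` grows.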


## Directory Structure

```text
APRIL/
├── scripts/
│   └── partial_rollout/
│       ├── deepseek/            # Experiment code for deepseek-r1-distill-1.5B
│       └── qwen/                # Experiment code for Qwen3-4B
├── slime/
│   ├── backends/
│   ├── rollout/
│   │   └── sglang_example.py    # Core sampling code
│   ├── ray/                     # Core scheduling logic
│   │   └── buffer.py            # Buffer implementation
│   └── utils/
└── tools/                       # Megatron-format model conversion tools
```

## Paper

(TODO: arXiv link for the paper)
