# APRIL: Active Partial Rollouts in Reinforcement Learning to Tame Long-tail Generation

## About

### Background: Why the sampling-training loop of synchronous RL is dragged down by the "long tail"
7 | 4 |
|
8 | | -### Background: Why synchronous RL sampling–training loops suffer from “long tails” |
| 5 | +In on-policy RLHF/GR?O training, the system enters an update phase only after collecting **N** rollout samples in a "round." Due to the inconsistent lengths of generated samples, the system has to wait for a few **long-tail samples** to complete before starting the training phase. This leads to decreased GPU utilization and lower throughput in the later stages of the rollout phase. |
9 | 6 |
|
10 | | -- In on-policy RLHF / GRxO training, the system collects **N** rollout samples per round before applying one update. Because generation lengths, refusals/retries, routing queues, etc. are stochastic, the round time is dominated by **the slowest few samples** (a classic long-tail). GPUs idle while “waiting for the tail,” dragging down effective throughput. |
11 | | -- Common mitigations (larger timeouts/early truncation, higher concurrency, faster decoding kernels/continuous batching, or switching to asynchronous RL) all have trade-offs: they may hurt sample quality or policy consistency, complicate scheduling assumptions, or still fail to remove the root cause—**wasted time waiting on unfinished samples**. |
### What We Did: Active Partial Rollout (APRIL)

**Core Idea**: In each round, we **over-sample** (N' > N) and **actively interrupt** the remaining in-flight requests once the target of **N** completed samples is reached. The **unfinished responses** are stored in a **buffer** and are **prioritized for continued rollout** in the next round, mitigating the efficiency loss caused by long-tail requests.

### Highlights

- **Over-sampling**: If the training phase requires `rollout_batch_size=32` complete samples per round, we actually launch a larger sampling request, e.g., `over_sampling_batch_size=64`.
- **Stop upon collection**: As soon as the number of collected complete sample groups reaches `rollout_batch_size`, an `abort` signal is immediately sent to the sglang router.
- **Collect and reuse**: On receiving the `abort` signal, sglang stops the in-flight generation tasks and returns their partially generated trajectories. These partial samples are not discarded but stored in a buffer; when the next rollout round begins, they resume generation from where they left off alongside new prompts, achieving seamless reuse across iteration steps.
- **Lightweight integration**: slime's partial rollout is a native, low-intrusion optimization that barely touches the original pipeline. Enable it out of the box by setting the `--partial-rollout` flag and specifying `--over-sampling-batch-size`.
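The oversample → stop-at-target → buffer → resume loop above can be sketched in a few lines of Python. This is a toy, self-contained simulation of the scheduling idea only; all class and function names here are ours, not slime's or sglang's API, and real interruption happens at the request level inside the inference engine:

```python
import random

class PartialSample:
    """A trajectory that can be interrupted mid-generation and resumed later."""
    def __init__(self, prompt, target_len):
        self.prompt = prompt
        self.tokens = []                     # tokens generated so far
        self.target_len = target_len         # stand-in for "model emits EOS"

    @property
    def done(self):
        return len(self.tokens) >= self.target_len

def rollout_round(buffer, new_prompts, rollout_batch_size, over_sampling_batch_size):
    """One APRIL round: resume buffered partials first, top up with new prompts,
    stop as soon as rollout_batch_size samples finish, and re-buffer the rest."""
    active = list(buffer)[:over_sampling_batch_size]
    for p in new_prompts[: over_sampling_batch_size - len(active)]:
        active.append(PartialSample(p, target_len=random.randint(1, 50)))
    finished = []
    while active and len(finished) < rollout_batch_size:
        for s in list(active):
            s.tokens.append("tok")           # one decoding step for this request
            if s.done:
                active.remove(s)
                finished.append(s)
                if len(finished) == rollout_batch_size:
                    break                    # "abort": stop the remaining requests
    return finished, active                  # `active` is the cross-round buffer
```

Note how the buffer returned by one round is fed back into the next, so interrupted trajectories keep their already-generated tokens instead of being regenerated from scratch.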
26 | 18 |
|
27 | | -### 1) Environment |
| 19 | +## Three Steps to Get Started |
28 | 20 |
|
29 | | -**Recommended (Docker)** |
30 | | - |
31 | | -- **AMD** |
| 21 | +### 1) Environment Setup (Requires an AMD GPU) |
32 | 22 |
|
| 23 | +**Start docker** |
33 | 24 | ```bash |
34 | 25 | docker run --rm --gpus all --ipc=host --shm-size=16g \ |
35 | 26 | --ulimit memlock=-1 --ulimit stack=67108864 \ |
36 | 27 | -it rlsys/slime:slime_ubuntu22.04_rocm6.3.4-patch-numa-patch_sglang0.4.9_megatron-patch_ray2.47.1_apex_torch-memory-saver0.0.8-patch-vim /bin/bash |
37 | 28 | ``` |
41 | 29 | ### 2) Install APRIL |
42 | 30 |
|
43 | 31 | ```bash |
44 | | -git clone https://github.com/RLsys-Foundation/APRIL.git |
| 32 | +git clone [https://github.com/RLsys-Foundation/APRIL.git](https://github.com/RLsys-Foundation/APRIL.git) |
45 | 33 | cd APRIL |
46 | 34 | pip install -e . |
47 | 35 | ``` |
48 | 36 |
|
### 3) Run an Example

All scripts are in the `scripts/partial_rollout/` directory.

55 | 41 | ```bash |
56 | 42 | bash scripts/partial_rollout/qwen/grpo/run-qwen3-4B-dapo-partial.sh |
57 | 43 | ``` |

### 4) Parameter Details

The core behavior of partial rollout is controlled by the following parameters:

```bash
# Enable the partial rollout feature.
# Turns on "abort once the target count is reached" and recycles unfinished
# samples into the buffer.
--partial-rollout
67 | 51 |
|
68 | | -# Sampling batch size per shot. This controls the granularity of each sampling step. |
69 | | -# If this > rollout_batch_size, you will oversample. |
70 | | -# If this < rollout_batch_size, the system keeps sampling in chunks until the target is reached. |
| 52 | +# The batch size for sampling. This parameter controls the sampling granularity per round. |
| 53 | +# If this parameter > rollout_batch_size, over-sampling is performed. |
| 54 | +# If this parameter < rollout_batch_size, sampling will continue at this granularity until rollout_batch_size samples are collected. |
71 | 55 | --over-sampling-batch-size 16 |
72 | 56 | ``` |

For other parameters, see the arguments in [arguments.py](./slime/utils/arguments.py). For more details, consult the upstream [slime](https://github.com/THUDM/slime) repository.

## Results and Comparison (Abridged)

| Dataset       | Model    | Metric             | APRIL vs. Baseline |
|---------------|----------|--------------------|--------------------|
| DAPO-Math-17k | Qwen3-4B | Rollout Throughput | **+17%**           |
| DeepScaleR    | Qwen3-4B | Rollout Throughput | **+21%**           |
| DeepMath-103K | Qwen3-4B | Rollout Throughput | **+35%**           |
75 | 65 |
|
76 | | -## Results vs. Baselines (brief) |
| 66 | + |
77 | 67 |
|
## FAQ
|
85 | | - |
86 | | -## FAQ |
| 70 | +- **Q: Will APRIL affect policy purity and convergence?** |
| 71 | + - A: It will definitely have an impact on policy purity; the proportion of off-policy tokens in one round is about 40%. However, from both an engineering and experimental perspective, partial rollout has not introduced significant instability under the current settings. Further verification is needed for tasks with a much larger `max_response_length` (e.g., agent tasks, multi-turn tasks). |
87 | 72 |
|
88 | | -- **Q: Does APRIL hurt policy purity or convergence?** |
89 | | - **A:** We have not observed instability in engineering or experiments. Monitor the off-policy token ratio, and use a mild setting like `oversample ≈ 2× roll_batch`. |
90 | | - |
91 | | -- **Q: Do I need to modify decoding kernels?** |
92 | | - **A:** No. APRIL operates at the **scheduling layer** and composes with speculative decoding, continuous batching, and other inference-level accelerations. |
93 | | - |
94 | | -- **Q: Does it work on both NVIDIA and AMD?** |
95 | | - **A:** Yes; we reproduced gains on 8×H100 and 8×MI300. |
96 | | - |
| 73 | +- **Q: Are changes to the decoding kernel required?** |
| 74 | + - A: No. APRIL operates at the **system scheduling layer** and does not conflict with inference acceleration techniques like speculative decoding or continuous batching. Instead, they are complementary and can be stacked. |
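The ~40% off-policy figure mentioned above is easy to monitor. A minimal sketch of the metric, assuming per-sample bookkeeping of how many tokens were carried over from a previous policy version (the function and data layout are hypothetical, not slime's internal representation):

```python
def off_policy_token_ratio(samples):
    """Fraction of tokens in a training round that were generated under an
    earlier policy version, i.e., tokens carried over via resumed partial
    rollouts. `samples` is a list of (total_tokens, carried_over_tokens)
    pairs, one per trajectory in the round."""
    total = sum(t for t, _ in samples)
    carried = sum(c for _, c in samples)
    return carried / total if total else 0.0
```

Tracking this ratio per round is a cheap way to verify that the off-policy fraction stays in a range your algorithm tolerates.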
97 | 75 |
|
98 | | -## Repository Structure |
| 76 | +## Directory Structure |
99 | 77 |
|
100 | | -```text |
| 78 | +``` |
APRIL/
├── scripts/
│   └── partial_rollout/
│       ├── deepseek/            # Experiment scripts for DeepSeek-R1-Distill-1.5B
│       └── qwen/                # Experiment scripts for Qwen3-4B
├── slime/
│   ├── backends/
│   ├── rollout/
│   │   └── sglang_example.py    # Core sampling code
│   ├── ray/                     # Core scheduling logic
│   │   └── buffer.py            # Buffer implementation
│   └── utils/
└── tools/                       # Megatron-format model conversion tools
```

## Paper
117 | 95 |
|
118 | | -If APRIL is useful for your work, please cite the APRIL paper and star the repo. |
119 | | -(TODO: add arXiv link) |
| 96 | +(TODO: arXiv link for the paper) |