<div align="center">
  <img src="imgs/APRIL.png" alt="APRIL Logo" width="400">

  # APRIL: Active Partial Rollouts in Reinforcement Learning

  **Accelerating LLM Training by Taming Long-tail Generation**

  [License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)
  [Python](https://www.python.org/downloads/)
  [PyTorch](https://pytorch.org/)

</div>
## 🚀 Overview

**APRIL** (Active Partial Rollouts) is a compute-efficient method for accelerating rollout generation in reinforcement learning (RL) training of Large Language Models (LLMs). By addressing the critical "long-tail" problem in RL training, where a few samples with exceptionally long responses stall the entire batch, APRIL delivers:

- **20-35% improvement** in rollout throughput
- **2-5% higher** final model accuracy
- **Faster convergence** during training
- **Hardware agnostic**: supports both NVIDIA and AMD GPUs

### The Problem: Long-tail Generation Bottleneck

In on-policy RL training (RLHF/GRPO/DAPO), the rollout phase dominates runtime, typically accounting for **over 90%** of total training time. Because response lengths vary widely across samples, synchronous training paradigms suffer severe GPU underutilization: faster workers sit idle waiting for the longest-running generations to complete.
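To make the bottleneck concrete, here is a toy calculation (the token counts below are invented purely for illustration): in a synchronous round, every worker occupies the GPU until the longest sample finishes, so one long-tail response collapses utilization for the whole batch.

```python
# Hypothetical decode lengths (in tokens) for 8 rollouts; one long-tail sample.
lengths = [900, 1100, 1000, 950, 1200, 1050, 980, 8000]

# Each worker does useful work for its own length, but the synchronous round
# lasts until the longest sample finishes.
useful = sum(lengths)
occupied = max(lengths) * len(lengths)
utilization = useful / occupied

print(f"rollout-phase utilization: {utilization:.0%}")  # prints "rollout-phase utilization: 24%"
```

Seven of the eight workers finish within ~1,200 tokens, then idle for thousands of steps while the 8,000-token sample decodes alone.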

### Our Solution: Active Partial Rollouts

APRIL improves rollout efficiency with a simple scheduling mechanism:

1. **Over-provisioning**: Deliberately initiate more rollout requests than needed (N' > N)
2. **Active interruption**: Once the target batch size is reached, actively stop the remaining unfinished rollouts
3. **Intelligent recycling**: Store partial results in a buffer and resume generation in the next iteration
4. **Seamless integration**: Works with existing RL frameworks without modifying inference kernels
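The four steps above can be sketched as a round-based loop. This is a toy, single-process simulation, not slime's actual implementation: `sorted()` stands in for "shorter requests finish first", and `budget_per_round` abstracts the decode steps a round can afford.

```python
from collections import deque

def april_rounds(prompts, target_n, over_n, budget_per_round):
    """Toy simulation of APRIL scheduling. Each request is (name, remaining_steps);
    a round advances every active request by up to budget_per_round steps."""
    buffer = deque()          # partial rollouts, resumed first in the next round
    pending = deque(prompts)  # fresh prompts
    rounds = []
    while pending or buffer:
        # 1) Over-provision: partial rollouts first, then fresh prompts, up to over_n.
        active = []
        while buffer and len(active) < over_n:
            active.append(buffer.popleft())
        while pending and len(active) < over_n:
            active.append(pending.popleft())
        # 2) Generate; the first target_n requests to finish complete this round.
        finished, interrupted = [], []
        for name, remaining in sorted(active, key=lambda r: r[1]):
            if remaining <= budget_per_round and len(finished) < target_n:
                finished.append(name)
            else:
                # 3) Active interruption: keep the partial trajectory, don't discard it.
                interrupted.append((name, max(remaining - budget_per_round, 0)))
        buffer.extend(interrupted)
        rounds.append(finished)
    return rounds
```

In a run like `april_rounds([("a", 3), ("b", 2), ("c", 10), ("d", 1)], target_n=2, over_n=4, budget_per_round=4)`, the long request `c` is interrupted and carried over across rounds, while each round still delivers up to `target_n` completed samples immediately.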

![Partial rollout scheduling](imgs/partial_scheduling.png)

## ✨ Key Features

- **🔥 Plug-and-play**: Enable with just two command-line flags (`--partial-rollout` and `--over-sampling-batch-size`)
- **🎯 Algorithm-agnostic**: Compatible with GRPO, DAPO, GSPO, and other popular RL algorithms
- **🏗️ Framework-ready**: Already integrated into the [slime](https://github.com/THUDM/slime) framework
- **⚡ System-level optimization**: Operates at the scheduling layer, complementary to kernel-level optimizations
- **🔧 Production-tested**: Evaluated on multiple LLMs, including DeepSeek-R1, Qwen3, and GLM-4

## 🛠️ Installation

### Quick Start with Docker

#### For AMD GPUs:

```bash
docker run --rm --gpus all --ipc=host --shm-size=16g \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -it rlsys/slime:slime_ubuntu22.04_rocm6.3.4-patch-numa-patch_sglang0.4.9_megatron-patch_ray2.47.1_apex_torch-memory-saver0.0.8-patch-vim /bin/bash
```

#### For NVIDIA GPUs:

See the [NVIDIA setup guide](./docs/en/build.md).

### Install APRIL

```bash
git clone https://github.com/RLsys-Foundation/APRIL.git
cd APRIL
pip install -e .
```

## 🚦 Quick Start

### Basic Usage

All training scripts live in the `scripts/partial_rollout/` directory. To run an example with APRIL enabled:

```bash
# Example: Qwen3-4B with DAPO
bash scripts/partial_rollout/qwen/grpo/run-qwen3-4B-dapo-partial.sh
```

### Key Parameters

```bash
# Enable APRIL: stop generation once the target count is reached and
# recycle unfinished samples in the next round
--partial-rollout

# Over-sampling batch size, i.e., the sampling granularity per round.
# Typically set above rollout-batch-size (e.g., 2x) to over-sample; if set
# below it, sampling simply continues at this granularity until
# rollout-batch-size complete samples are collected.
--over-sampling-batch-size 64

# Standard rollout batch size
--rollout-batch-size 32
```
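The two batch sizes interact in a simple way. The helper below is made up for illustration (it is not part of slime): each round, at most the difference between them can be interrupted and carried over as partial rollouts.

```python
def max_carryover(over_sampling_batch_size: int, rollout_batch_size: int) -> int:
    """Upper bound on the number of partial rollouts carried into the next round."""
    return max(over_sampling_batch_size - rollout_batch_size, 0)

# With the flags above (64 over-sampled, 32 required per round), up to 32
# in-flight requests can be aborted and parked in the buffer each round.
print(max_carryover(64, 32))  # prints 32
```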

### Advanced Configuration

For detailed parameter explanations, see [arguments.py](./slime/utils/arguments.py); for further background, consult the original [slime](https://github.com/THUDM/slime) repository.

## 📊 Performance Results

### Throughput Improvements

| Dataset       | Model         | Algorithm | Throughput Gain | Accuracy Improvement |
|---------------|---------------|-----------|-----------------|----------------------|
| DAPO-Math-17k | Qwen3-4B      | DAPO      | **+17%**        | +2.3%                |
| DeepScaleR    | Qwen3-4B      | GRPO      | **+21%**        | +3.1%                |
| DeepMath-103K | Qwen3-4B      | GSPO      | **+35%**        | +4.7%                |
| Agent Tasks   | DeepSeek-1.5B | GRPO      | **+23%**        | +2.8%                |

### Convergence Analysis

APRIL not only improves training efficiency but also achieves:
- **Faster convergence**: reaches target accuracy 15-20% faster
- **Higher final accuracy**: 2-5% improvement in final model performance
- **Stable training**: no additional instability despite partially off-policy samples

## 🏗️ Architecture

### System Design

```
┌─────────────────────────────────────────────────────┐
│                  Training Pipeline                  │
├─────────────────────────────────────────────────────┤
│                                                     │
│   ┌──────────────┐      ┌──────────────┐            │
│   │   Rollout    │─────▶│    Buffer    │            │
│   │    Engine    │      │  Management  │            │
│   │   (SGLang)   │      └──────────────┘            │
│   └──────────────┘             │                    │
│          ▲                     ▼                    │
│          │             ┌──────────────┐             │
│          │             │   Training   │             │
│          └─────────────│    Engine    │             │
│                        │Megatron/FSDP │             │
│                        └──────────────┘             │
└─────────────────────────────────────────────────────┘
```

### Core Components

| Component            | Path                              | Description                                     |
|----------------------|-----------------------------------|-------------------------------------------------|
| **Rollout Engine**   | `slime/rollout/sglang_example.py` | Manages generation with active interruption     |
| **Buffer System**    | `slime/ray/buffer.py`             | Stores and prioritizes partial rollouts         |
| **Scheduler**        | `slime/ray/rollout.py`            | Orchestrates over-sampling and batch management |
| **Training Backend** | `slime/backends/`                 | Supports both Megatron and FSDP                 |
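The buffer's contract can be sketched in a few lines. This is a hypothetical API for illustration only; the real logic lives in `slime/ray/buffer.py`:

```python
from collections import deque

class PartialRolloutBuffer:
    """Toy sketch: aborted rollouts are stored with their partial output and
    handed out before fresh prompts, so they resume first in the next round."""

    def __init__(self):
        self._partials = deque()

    def put(self, prompt, partial_tokens):
        # Called when a rollout is aborted mid-generation.
        self._partials.append((prompt, partial_tokens))

    def next_request(self, fresh_prompts):
        # Partial rollouts are prioritized over new prompts.
        if self._partials:
            return self._partials.popleft()      # resume where it left off
        if fresh_prompts:
            return fresh_prompts.popleft(), []   # fresh prompt, empty prefix
        return None                              # nothing left to schedule
```

The key design choice is that a resumed request carries its already-generated prefix, so interrupted decoding continues instead of restarting from the prompt.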

## ❓ FAQ

### Q: Does APRIL affect training stability?

While APRIL introduces ~40% off-policy tokens per iteration, extensive experiments show:
- No significant training instability
- Improved final model accuracy
- Consistent convergence patterns

> **Note**: For extremely long sequences (e.g., multi-turn agent tasks), additional validation may be needed.

### Q: Is APRIL compatible with other optimizations?

Yes! APRIL operates at the **system scheduling layer** and is fully compatible with:
- Kernel optimizations (FlashAttention, continuous batching)
- Inference engines (vLLM, SGLang, TensorRT-LLM)
- Speculative decoding techniques
- Model parallelism strategies

### Q: What hardware is supported?

APRIL is hardware-agnostic and tested on:
- **NVIDIA GPUs**: H100
- **AMD GPUs**: MI300X

## 📁 Repository Structure

```
APRIL/
├── imgs/                         # Documentation images
│   ├── APRIL.png                 # Project logo
│   └── partial_scheduling.png    # Architecture diagrams
├── scripts/
│   └── partial_rollout/          # Training scripts
│       ├── deepseek/             # DeepSeek model experiments
│       ├── qwen/                 # Qwen model experiments
│       └── README.md             # Script documentation
├── slime/                        # Core framework
│   ├── backends/                 # Training backends
│   │   ├── fsdp_utils/           # FSDP implementation
│   │   └── megatron_utils/       # Megatron-LM support
│   ├── rollout/
│   │   ├── sglang_example.py     # Core rollout implementation
│   │   └── rm_hub/               # Reward model integrations
│   ├── ray/                      # Distributed orchestration
│   │   ├── buffer.py             # Partial rollout buffer
│   │   └── rollout.py            # Rollout scheduling
│   └── utils/                    # Utilities and helpers
├── docs/                         # Documentation
│   ├── en/                       # English docs
│   └── zh/                       # Chinese docs
└── tools/                        # Model conversion utilities
```

## 🔬 Technical Details

### How APRIL Works

1. **Over-provisioning**: Request N' = αN rollouts (α typically 1.5-2.0)
2. **Active monitoring**: Track completion status across all workers
3. **Intelligent interruption**: Send an abort signal once N samples complete
4. **Buffer management**: Store partial results with their generation state
5. **Seamless resumption**: Continue partial rollouts in the next iteration
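Steps 2-5 can be sketched with plain `asyncio` as a toy stand-in for the engine's abort path (in slime the interruption goes through SGLang; the names and structure below are invented for illustration):

```python
import asyncio

async def fake_rollout(name, steps, partials):
    """Toy decode loop: appends one 'token' per step, yielding between steps."""
    partials[name] = []
    for t in range(steps):
        await asyncio.sleep(0)          # one "decode step"
        partials[name].append(t)
    return name

async def april_round(requests, target_n):
    """Race over-provisioned requests; keep the first target_n to finish,
    abort the rest, and return their partial outputs for the next round."""
    partials = {}
    tasks = [asyncio.ensure_future(fake_rollout(n, s, partials)) for n, s in requests]
    done, pending = [], set(tasks)
    while len(done) < target_n and pending:
        finished, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
        done.extend(t.result() for t in finished)
    for t in pending:                   # active interruption ("abort")
        t.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    interrupted = {n: toks for n, toks in partials.items() if n not in done}
    return done[:target_n], interrupted

# Three requests, only two needed: the 10-step request is aborted mid-flight
# and its partial prefix is kept for resumption.
completed, carried_over = asyncio.run(april_round([("a", 2), ("b", 3), ("c", 10)], target_n=2))
```

Here `completed` holds the two short requests, while `carried_over["c"]` keeps the long request's partial prefix rather than discarding the work already done.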

### Integration with Existing Frameworks

APRIL is designed as a drop-in enhancement for existing RL training pipelines:
- **Minimal code changes**: enabled with command-line flags
- **Framework agnostic**: works with OpenRLHF, verl, AReaL, and slime
- **Automatic optimization**: self-tuning based on workload characteristics

## 📚 Citation

If you use APRIL in your research, please cite our paper:

```bibtex
@article{april2025,
  title={APRIL: Active Partial Rollouts in Reinforcement Learning to Tame Long-tail Generation},
  author={RLsys Foundation Team},
  journal={arXiv preprint},
  year={2025}
}
```

## 🤝 Contributing

We welcome contributions! Please see our [Contributing Guide](CONTRIBUTING.md) for details.

## 📄 License

This project is licensed under the Apache License 2.0; see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

APRIL builds upon the excellent work of:
- [slime](https://github.com/THUDM/slime) - the base RL training framework
- [SGLang](https://github.com/sgl-project/sglang) - high-performance inference backend
- [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) - distributed training backend

## 📬 Contact

For questions and support, open an issue on [GitHub](https://github.com/RLsys-Foundation/APRIL/issues).

---

<div align="center">
  <sub>Built with ❤️ by the RLsys Foundation Team</sub>
</div>