Commit a72b585

Merge pull request #3 from RLsys-Foundation/try_clean
Clean codebase for APRIL and add doc
2 parents b372afe + a6d134a commit a72b585

25 files changed

+126
-2922
lines changed

README.md

Lines changed: 62 additions & 168 deletions
Original file line number | Diff line number | Diff line change
@@ -1,202 +1,96 @@
1-
# slime
1+
# APRIL: Active Partial Rollouts in Reinforcement Learning to Tame Long-tail Generation
2+
## About
3+
### Background: Why the sampling-training loop of synchronous RL is dragged down by the "long tail"
24

3-
[中文版](./README_zh.md)
5+
In on-policy RLHF/GRPO training, the system enters the update phase only after collecting **N** rollout samples in a round. Because generated samples vary in length, the system must wait for a few **long-tail samples** to finish before training can start. This lowers GPU utilization and throughput in the later stages of the rollout phase.
46

5-
**slime** is an LLM post-training framework for RL scaling, providing two core capabilities:
7+
### What We Did: Active Partial Rollout (APRIL)
68

7-
1. **High-Performance Training**: Supports efficient training in various modes by connecting Megatron with SGLang;
8-
2. **Flexible Data Generation**: Enables arbitrary training data generation workflows through custom data generation interfaces and server-based engines.
9+
**Core Idea**: In each round, we **over-sample** (N' > N) and **actively interrupt** the remaining in-progress requests once the target of **N** completed samples is reached. The **unfinished responses** are stored in a **buffer** and are **prioritized for continued rollout** in the next round, thereby mitigating the efficiency degradation caused by long-tail requests.
910

10-
## Table of Contents
11+
![scheduling](./imgs/partial_scheduling.png)
12+
### Highlights
1113

12-
- [Architecture Overview](#architecture-overview)
13-
- [Quick Start](#quick-start)
14-
- [Environment Setup](#environment-setup)
15-
- [Examples](#examples)
16-
- [Dense Model Examples: GLM-4-9B and Qwen3-4B](#Dense-Model-Examples-GLM-4-9B-and-Qwen3-4B)
17-
- [MoE Model Example: Qwen3-30B-A3B](#MoE-Model-Example-Qwen3-30B-A3B)
18-
- [Multi-Turn + Tool Calling Example: Search-R1 lite](#Multi-Turn--Tool-Calling-Example-Search-R1-lite)
19-
- [SFT Example: Qwen3-4B-Base with OpenHermes-2.5](#SFT-Example-Qwen3-4B-Base-with-OpenHermes-25)
20-
- [Checkpoint Format Conversion](#checkpoint-format-conversion)
21-
- [Starting the Training Process](#starting-the-training-process)
22-
- [Argument Descriptions](#argument-descriptions)
23-
- [Developer Guide](#developer-guide)
24-
- [FAQ & Acknowledgements](#faq--acknowledgements)
14+
- **Over-sampling**: Assuming the training phase requires `rollout_batch_size=32` complete samples per round, we actually initiate a larger sampling request, i.e., `over_sampling_batch_size=64`.
15+
- **Stop upon collection**: As soon as the number of collected complete sample groups reaches `rollout_batch_size`, an `abort` signal is immediately sent to the sglang router.
16+
- **Collect and reuse**: Upon receiving the `abort` signal, sglang stops the ongoing generation tasks and returns their partially generated portions (half-completed trajectories). These partial trajectories are not discarded but stored in a buffer. When the next rollout round begins, they resume generation from where they left off, alongside new prompts, so work is reused seamlessly across iteration steps.
17+
- **Elegant implementation**: slime's partial rollout is a native, lightweight optimization that requires little change to the original pipeline. You can enable it out of the box by setting the `--partial-rollout` flag and specifying `--over-sampling-batch-size`.
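
The over-sample, abort, and reuse cycle described in the highlights can be sketched as a toy simulation. This is a minimal illustration under stated assumptions, not the code in this repository: `Request`, `run_round`, and the one-token-per-step decoding loop are hypothetical stand-ins, and no real sglang call is made.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical stand-in for one rollout request."""
    prompt_id: int
    target_len: int     # tokens needed to finish (unknown in a real system)
    generated: int = 0  # tokens produced so far; > 0 means a resumed partial


def run_round(buffer, new_prompts, rollout_batch_size, over_sampling_batch_size):
    """Run one simulated rollout round and return the completed requests.

    Buffered partial requests are scheduled first, then new prompts fill the
    remaining over-sampling slots. Decoding stops as soon as
    rollout_batch_size requests have completed; everything still running is
    "aborted" and pushed back into the buffer with its progress kept.
    """
    in_flight = []
    while buffer and len(in_flight) < over_sampling_batch_size:
        in_flight.append(buffer.popleft())
    while new_prompts and len(in_flight) < over_sampling_batch_size:
        in_flight.append(new_prompts.popleft())

    completed = []
    while in_flight and len(completed) < rollout_batch_size:
        still_running = []
        for req in in_flight:
            if len(completed) >= rollout_batch_size:
                still_running.append(req)  # aborted before this decode step
                continue
            req.generated += 1  # one simulated decode step
            if req.generated >= req.target_len:
                completed.append(req)
            else:
                still_running.append(req)
        in_flight = still_running

    buffer.extend(in_flight)  # unfinished responses are kept, not discarded
    return completed
```

With `rollout_batch_size=32` and `over_sampling_batch_size=64`, each round launches 64 requests, returns the 32 that finish first, and carries the other 32 into the next round with their partial progress intact.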
2518

26-
## Architecture Overview
19+
## Three Steps to Get Started
2720

28-
![arch](./imgs/arch.png)
29-
30-
**Module Descriptions**:
31-
32-
- **training (Megatron)**: Responsible for the main training process, reads data from the Data Buffer, and synchronizes parameters to the rollout module after training.
33-
- **rollout (SGLang + router)**: Generates new data (including rewards/verifier outputs) and stores it in the Data Buffer.
34-
- **data buffer**: A bridge module that manages prompt initialization, custom data, and rollout generation methods.
35-
36-
## Quick Start
37-
38-
### Environment Setup
39-
40-
Based on the `zhuzilin/slime:latest` image (pre-installed with SGLang 0.4.7 and Megatron):
21+
### 1) Environment Setup (Requires an AMD GPU)
4122

23+
**Start docker**
4224
```bash
4325
docker run --rm --gpus all --ipc=host --shm-size=16g \
4426
--ulimit memlock=-1 --ulimit stack=67108864 \
45-
-it zhuzilin/slime:latest /bin/bash
46-
47-
git clone https://github.com/THUDM/slime.git
48-
cd slime
49-
pip install -e .
27+
-it rlsys/slime:slime_ubuntu22.04_rocm6.3.4-patch-numa-patch_sglang0.4.9_megatron-patch_ray2.47.1_apex_torch-memory-saver0.0.8-patch-vim /bin/bash
5028
```
51-
52-
- If you prefer not to use Docker, or if it's inconvenient, please refer to [Setting up the Environment from Scratch](./docs/en/build.md).
53-
- For AMD support, please refer to [AMD Tutorial](./docs/en/amd_tutorial.md).
54-
55-
### Examples
56-
57-
#### Dense Model Examples: GLM-4-9B and Qwen3-4B
58-
59-
We provide examples to use [GLM-4-9B](https://huggingface.co/THUDM/GLM-Z1-9B-0414) and [Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B), please refer to:
60-
61-
- [Example: GLM-4-9B Model](docs/en/models/glm4-9B.md).
62-
- [Example: Qwen3-4B Model](docs/en/models/qwen3-4B.md).
63-
64-
#### MoE Model Example: Qwen3-30B-A3B
65-
66-
For MoE example, please refer to:
67-
68-
- [Example: Qwen3-30B-A3B Model](docs/en/models/qwen3-30B-A3B.md).
69-
70-
#### Multi-Turn + Tool Calling Example: Search-R1 lite
71-
72-
For multi-turn and tool calling, we also provide a minimal reimplementation of Search-R1, please refer to:
73-
74-
- [Example: Search-R1 lite](examples/search-r1/README.md).
75-
76-
#### SFT Example: Qwen3-4B-Base with OpenHermes-2.5
77-
78-
slime is not just a RL framework, we support a diverse set of post-training setups. For an SFT example, please refer to:
79-
80-
- [Example: Qwen3-4B-Base with OpenHermes-2.5](docs/en/sft.md).
81-
82-
### Checkpoint Format Conversion
83-
84-
Since slime uses Megatron, and Megatron does not support loading Hugging Face checkpoints directly, we need to convert the model to the `torch_dist` format that Megatron supports.
85-
86-
#### HF → Megatron torch\_dist ckpt
87-
88-
We recommend using [Pai-Megatron-Patch](https://github.com/alibaba/Pai-Megatron-Patch) for mcore checkpoint conversion.
89-
90-
If the model you are using is not supported by Pai-Megatron-Patch, you could use [mbridge](https://github.com/ISEEKYAN/mbridge.git) for conversion:
29+
### 2) Install APRIL
9130

9231
```bash
93-
cd slime/
94-
PYTHONPATH=/root/Megatron-LM python tools/convert_hf_to_torch_dist.py \
95-
--hf-checkpoint /root/GLM-Z1-9B-0414 \
96-
--save /root/GLM-Z1-9B-0414_torch_dist
32+
git clone https://github.com/RLsys-Foundation/APRIL.git
33+
cd APRIL
34+
pip install -e .
9735
```
9836

99-
⚠️ If you encounter an issue where slime cannot be found, please run `pip install -e .` in the slime directory.
100-
101-
#### Megatron torch\_dist → HF ckpt
37+
### 3) Run an Example
10238

103-
To convert a `torch_dist` checkpoint saved during training back to a Hugging Face checkpoint:
39+
All scripts are in the `scripts/partial_rollout/` directory.
10440

10541
```bash
106-
cd slime/
107-
PYTHONPATH=/root/Megatron-LM python tools/convert_torch_dist_to_hf.py \
108-
--input-dir /path/to/torch_dist_ckpt/iter_xxx/ \
109-
--output-dir /root/GLM-Z1-9B-0414-iter_xxx \
110-
--origin-hf-dir /root/GLM-Z1-9B-0414
42+
bash scripts/partial_rollout/qwen/grpo/run-qwen3-4B-dapo-partial.sh
11143
```
44+
### 4) Parameter Details
11245

113-
⚠️ Since the `torch_dist` checkpoint converted by mbridge does not currently save args, you cannot convert the checkpoint from the previous step back to HF format.
114-
115-
#### Any Megatron ckpt → HF
116-
117-
Applicable for custom save formats (e.g., `--ckpt-format torch`).
118-
119-
The principle behind this conversion method is to reuse the function that updates parameters from Megatron to SGLang during training. This means reusing the training script and changing the original command from:
120-
46+
The core functionality of partial rollout is controlled by the following parameters:
12147
```bash
122-
ray job submit --address="http://127.0.0.1:8265" \
123-
--runtime-env-json='{
124-
"env_vars": { ...}
125-
}' \
126-
-- python3 train.py \
127-
... # Other training args
48+
# Enable the partial rollout feature
49+
# When set, generation stops once the target sample count is reached, and unfinished samples are recycled into the next round
50+
--partial-rollout
51+
52+
# The batch size for sampling. This parameter controls the sampling granularity per round.
53+
# If this parameter > rollout_batch_size, over-sampling is performed.
54+
# If this parameter < rollout_batch_size, sampling will continue at this granularity until rollout_batch_size samples are collected.
55+
--over-sampling-batch-size 16
12856
```
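
To make the relationship between the two batch sizes concrete, here is a small sketch of how the flags might be interpreted. It is an illustration only: the parser below is a toy, the `--rollout-batch-size` spelling and the default values are assumptions, and `describe_sampling` is a hypothetical helper rather than a function from `arguments.py`.

```python
import argparse
import math


def build_parser() -> argparse.ArgumentParser:
    """Toy parser mirroring the flags discussed above (defaults are assumed)."""
    parser = argparse.ArgumentParser(description="partial rollout flag sketch")
    parser.add_argument("--rollout-batch-size", type=int, default=32)
    parser.add_argument("--partial-rollout", action="store_true")
    parser.add_argument("--over-sampling-batch-size", type=int, default=None)
    return parser


def describe_sampling(args: argparse.Namespace) -> str:
    """Summarize which sampling regime the two batch sizes select."""
    n = args.rollout_batch_size
    b = args.over_sampling_batch_size or n  # fall back to n when unset
    if b > n:
        return f"over-sampling: launch {b}, stop once {n} complete"
    if b < n:
        waves = math.ceil(n / b)  # lower bound, if every launched sample completes
        return f"incremental: at least {waves} waves of {b} to collect {n}"
    return f"plain: launch exactly {n}"
```

For example, parsing `--partial-rollout --over-sampling-batch-size 16` with the assumed default of 32 selects the incremental regime, which needs at least two waves of 16 to collect 32 samples.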
57+
For other parameters, please refer to the arguments in [arguments.py](./slime/utils/arguments.py). For more details, you can consult the original [slime](https://github.com/THUDM/slime) repository.
58+
## Results and Comparison (Abridged)
12959

130-
To:
131-
132-
```bash
133-
torchrun --nproc_per_node ${NUM_GPU} tools/convert_to_hf.py \
134-
--load /your/saved/megatron_ckpt \
135-
--output-dir /your/converted/hf_ckpt \
136-
... # Other training args
137-
```
60+
| Dataset | Model | Metric | APRIL vs. Baseline |
61+
|---------------|----------|------------------|-----------------------|
62+
| DAPO‑Math‑17k | Qwen3‑4B | Rollout Throughput | **+17%** |
63+
| DeepScaleR | Qwen3‑4B | Rollout Throughput | **+21%** |
64+
| DeepMath‑103K | Qwen3‑4B | Rollout Throughput | **+35%** |
13865

139-
That is, keep all other arguments the same, and:
66+
![evaluation](./imgs/eval_dapo_qwen.png)
14067

141-
1. Change the task launcher from `ray` to `torchrun`. Set the number of GPUs to the minimum required for Megatron's parallelism without data parallelism (DP). For example, if you are using `tp4`, set it to 4.
142-
2. Make sure to change `--load` to the path of the checkpoint you want to load.
143-
3. Add the `--output-dir` argument to specify where the converted Hugging Face checkpoint should be saved.
68+
## Frequently Asked Questions (FAQ)
14469

145-
## Starting the Training Process
70+
- **Q: Will APRIL affect policy purity and convergence?**
71+
- A: Yes, it affects policy purity: about 40% of the tokens in a round are off-policy. However, in both engineering practice and experiments, partial rollout has not introduced significant instability under the current settings. Further verification is needed for tasks with a much larger `max_response_length` (e.g., agent tasks, multi-turn tasks).
14672

147-
The entire program needs to be launched using Ray. First, you need to start a Ray cluster. On node 0, run:
73+
- **Q: Are changes to the decoding kernel required?**
74+
- A: No. APRIL operates at the **system scheduling layer** and does not conflict with inference acceleration techniques like speculative decoding or continuous batching. Instead, they are complementary and can be stacked.
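
As a rough illustration of what an off-policy token proportion means here, the sketch below counts tokens that were generated by an older policy version than the one being updated. The `Segment` record and `off_policy_fraction` are hypothetical names introduced for this example; how the repository actually tracks this statistic is not shown in this README.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """Hypothetical record: a run of tokens produced by one policy version."""
    policy_version: int
    num_tokens: int


def off_policy_fraction(samples, current_version):
    """Return the share of tokens not generated by current_version.

    `samples` is a list of samples, each a list of Segment objects. A sample
    resumed from the buffer has an older segment followed by a newer one.
    """
    total = sum(seg.num_tokens for sample in samples for seg in sample)
    if total == 0:
        return 0.0
    stale = sum(
        seg.num_tokens
        for sample in samples
        for seg in sample
        if seg.policy_version != current_version
    )
    return stale / total
```

A sample generated entirely in the current round contributes no stale tokens, while a resumed sample contributes the tokens it produced before the last weight update.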
14875

149-
```bash
150-
# Node0 (HEAD)
151-
ray start --head --node-ip-address ${MASTER_ADDR} \
152-
--num-gpus 8 --disable-usage-stats
76+
## Directory Structure
15377

154-
# Other Nodes
155-
ray start --address=${MASTER_ADDR}:6379 --num-gpus 8
15678
```
79+
APRIL/
80+
├── scripts/
81+
│ └── partial_rollout/
82+
│ ├── deepseek/ # Experiment code for deepseek-r1-distill-1.5B
83+
│ └── qwen/ # Experiment code for qwen3-4B
84+
├── slime/
85+
│ ├── backends/
86+
│ ├── rollout/
87+
│ │ └── sglang_example.py # Core sampling code
88+
│ ├── ray/ # Core scheduling logic
89+
│ │ └── buffer.py # Buffer implementation code
90+
│ └── utils/
91+
└── tools/ # Megatron format conversion tools
15792
158-
After the Ray cluster has started, you can submit a job from node 0, for example:
159-
160-
```bash
161-
ray job submit --address="http://127.0.0.1:8265" \
162-
--runtime-env-json='{
163-
"env_vars": {
164-
"PYTHONPATH": "/root/Megatron-LM/",
165-
... # e.g., no_proxy, API variables, etc.
166-
}
167-
}' \
168-
-- python3 train.py \
169-
--... # Other Megatron/SGLang/slime arguments
17093
```
94+
## Paper
17195

172-
### Argument Descriptions
173-
174-
Arguments are divided into three categories:
175-
176-
1. **Megatron arguments**: slime reads all arguments set in Megatron via `PYTHONPATH`. You can configure Megatron by passing arguments like `--tensor-model-parallel-size 2`.
177-
2. **SGLang arguments**: All arguments for the installed SGLang are supported. These arguments must be prefixed with `--sglang-`. For example, `--mem-fraction-static` should be passed as `--sglang-mem-fraction-static`.
178-
3. **slime-specific arguments**: Please refer to: [slime/utils/arguments.py](slime/utils/arguments.py)
179-
180-
For complete usage instructions, please refer to the [Usage Documentation](docs/en/usage.md).
181-
182-
## Developer Guide
183-
184-
- **Contributions are welcome\!** If you have suggestions for new features, performance tuning, or feedback on user experience, feel free to submit an Issue or PR 😊
185-
186-
- Use [pre-commit](https://pre-commit.com/) to ensure code style consistency for your commits:
187-
188-
```bash
189-
apt install pre-commit -y
190-
pre-commit install
191-
```
192-
193-
- For debugging tips, please refer to the [Debugging Guide](docs/en/debug.md)
194-
195-
## Hardware Support
196-
- Nvidia: refer to this repo README
197-
- AMD: refer to the [tutorial](docs/en/amd_tutorial.md)
198-
199-
## FAQ & Acknowledgements
200-
201-
- For frequently asked questions, please see the [Q\&A](docs/en/qa.md)
202-
- Special thanks to the following projects & communities: SGLang, Megatron‑LM, mbridge, OpenRLHF, veRL, and others.
96+
(TODO: arXiv link for the paper)
