
Commit 8f07a40

Merge pull request #2 from RLsys-Foundation/clean
Clean code and add readme
2 parents 055088b + 8955606 commit 8f07a40

28 files changed: +136 −3156 lines

README.md

Lines changed: 62 additions & 169 deletions
# APRIL: Active Partial Rollouts in Reinforcement Learning to Tame Long-tail Generation

## About

### Background: Why the sampling-training loop of synchronous RL is dragged down by the "long tail"

In on-policy RLHF/GRPO training, the system enters the update phase only after collecting **N** rollout samples in a "round." Because generated samples vary widely in length, the system must wait for a few **long-tail samples** to finish before the training phase can start, which lowers GPU utilization and throughput in the later stages of the rollout phase.
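
The effect can be seen with a toy calculation (a rough model for illustration, not code from this repository): in a fully synchronous round, the rollout phase lasts as long as its longest sample, so average worker utilization is roughly the mean length divided by the max length.

```python
import random

def round_utilization(lengths):
    """In a synchronous round, every worker waits for the longest sample,
    so the average fraction of rollout time spent generating is
    mean(lengths) / max(lengths)."""
    return sum(lengths) / (len(lengths) * max(lengths))

random.seed(0)
# A long-tail length distribution: mostly short responses, a few huge ones.
lengths = [random.choice([200, 300, 400]) for _ in range(30)] + [4000, 6000]
print(f"round time = {max(lengths)} tokens, "
      f"avg utilization = {round_utilization(lengths):.0%}")
```

A couple of very long samples drag average utilization far below 100%; that idle tail is exactly the gap APRIL targets.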

### What We Did: Active Partial Rollout (APRIL)

**Core Idea**: In each round, we **over-sample** (N' > N) and **actively interrupt** the remaining in-flight requests once the target of **N** completed samples is reached. The **unfinished responses** are stored in a **buffer** and **prioritized for continued rollout** in the next round, thereby mitigating the efficiency loss caused by long-tail requests.

![scheduling](./imgs/partial_scheduling.png)

### Highlights

- **Over-sampling**: If the training phase requires `rollout_batch_size=32` complete samples per round, we actually launch a larger sampling request, e.g., `over_sampling_batch_size=64`.
- **Stop upon collection**: As soon as the number of completed sample groups reaches `rollout_batch_size`, an `abort` signal is immediately sent to the sglang router.
- **Collect and reuse**: Upon receiving the `abort` signal, sglang stops the in-flight generation tasks and returns their partially generated portions (half-completed trajectories). This partial data is not discarded but stored in a buffer; when the next rollout round begins, these trajectories continue generating from where they left off alongside new prompts, achieving seamless reuse across iteration steps.
- **Elegant implementation**: slime's partial rollout is a native, lightweight optimization that is minimally intrusive to the original pipeline. You can enable it out of the box simply by setting the `--partial-rollout` flag and specifying `--over-sampling-batch-size`.
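
The highlights above can be sketched as a single scheduling loop. The sketch below is illustrative only: names such as `engine`, `buffer`, and `prompt_source` are hypothetical stand-ins, not slime's actual API.

```python
def april_round(engine, prompt_source, buffer, rollout_batch_size,
                over_sampling_batch_size):
    """One APRIL round: over-sample, stop upon collection, recycle the rest."""
    # 1) Over-sample: resume buffered partial trajectories first, then top up
    #    with fresh prompts until over_sampling_batch_size requests are in flight.
    requests = buffer.drain()
    while len(requests) < over_sampling_batch_size:
        requests.append(prompt_source.next_prompt())
    engine.start_generation(requests)

    # 2) Stop upon collection: abort the moment the target count is met.
    finished = []
    while len(finished) < rollout_batch_size:
        finished.extend(engine.poll_completed())
    engine.abort()  # analogous to sending `abort` to the sglang router

    # 3) Collect and reuse: half-finished trajectories go back to the buffer
    #    and will continue from where they left off in the next round.
    for partial in engine.collect_partial():
        buffer.add(partial)
    return finished[:rollout_batch_size]
```

Because step 1 drains the buffer before drawing new prompts, partial trajectories are always prioritized over fresh work, matching the scheduling diagram above.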

## Three Steps to Get Started

### 1) Environment Setup (Requires an AMD GPU)

**Start docker**
```bash
docker run --rm --gpus all --ipc=host --shm-size=16g \
  --ulimit memlock=-1 --ulimit stack=67108864 \
  -it rlsys/slime:slime_ubuntu22.04_rocm6.3.4-patch-numa-patch_sglang0.4.9_megatron-patch_ray2.47.1_apex_torch-memory-saver0.0.8-patch-vim /bin/bash
```

### 2) Install APRIL

```bash
git clone https://github.com/RLsys-Foundation/APRIL.git
cd APRIL
pip install -e .
```

### 3) Run an Example

All scripts are in the `scripts/partial_rollout/` directory.

```bash
bash scripts/partial_rollout/qwen/grpo/run-qwen3-4B-dapo-partial.sh
```

### Parameter Details

The core functionality of partial rollout is controlled by the following parameters:

```bash
# Enable the partial rollout mechanism: stop generation once the target
# count is reached and recycle unfinished samples for the next round
--partial-rollout

# The batch size for sampling, which controls the sampling granularity per round.
# If it is greater than rollout_batch_size, over-sampling is performed.
# If it is smaller, sampling continues at this granularity until
# rollout_batch_size samples have been collected.
--over-sampling-batch-size 16
```
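
As a rough model of the second parameter's behavior (illustrative only, not the actual implementation), the number of sampling waves needed per round depends on how the two batch sizes compare:

```python
import math

def sampling_waves(rollout_batch_size, over_sampling_batch_size,
                   completion_rate=1.0):
    """Rough number of sampling waves per round.

    completion_rate is the fraction of launched requests that finish
    before the wave is aborted (1.0 = everything completes).
    """
    completed_per_wave = over_sampling_batch_size * completion_rate
    return math.ceil(rollout_batch_size / completed_per_wave)

print(sampling_waves(32, 64))  # over-sampling: a single wave suffices
print(sampling_waves(32, 16))  # under-sampling: multiple waves per round
```

In the over-sampling case the surplus requests are aborted and recycled; in the under-sampling case the round simply repeats at the smaller granularity until enough samples are collected.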
For other parameters, refer to the arguments in [arguments.py](./slime/utils/arguments.py); for more details, consult the original [slime](https://github.com/THUDM/slime) repository.

## Results and Comparison (Abridged)

| Dataset | Model | Metric | APRIL vs. Baseline |
|---------------|----------|--------------------|--------------------|
| DAPO‑Math‑17k | Qwen3‑4B | Rollout Throughput | **+17%** |
| DeepScaleR | Qwen3‑4B | Rollout Throughput | **+21%** |
| DeepMath‑103K | Qwen3‑4B | Rollout Throughput | **+35%** |

140-
That is, keep all other arguments the same, and:
66+
![evaluation](./imgs/eval_dapo_qwen.png)
14167

142-
1. Change the task launcher from `ray` to `torchrun`. Set the number of GPUs to the minimum required for Megatron's parallelism without data parallelism (DP). For example, if you are using `tp4`, set it to 4.
143-
2. Make sure to change `--load` to the path of the checkpoint you want to load.
144-
3. Add the `--output-dir` argument to specify where the converted Hugging Face checkpoint should be saved.
68+
## Frequently Asked Questions (FAQ)
14569

146-
## Starting the Training Process
70+
- **Q: Will APRIL affect policy purity and convergence?**
71+
- A: It will definitely have an impact on policy purity; the proportion of off-policy tokens in one round is about 40%. However, from both an engineering and experimental perspective, partial rollout has not introduced significant instability under the current settings. Further verification is needed for tasks with a much larger `max_response_length` (e.g., agent tasks, multi-turn tasks).
14772

148-
The entire program needs to be launched using Ray. First, you need to start a Ray cluster. On node 0, run:
73+
- **Q: Are changes to the decoding kernel required?**
74+
- A: No. APRIL operates at the **system scheduling layer** and does not conflict with inference acceleration techniques like speculative decoding or continuous batching. Instead, they are complementary and can be stacked.
14975
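
The ~40% figure above can be measured per round as the share of batch tokens generated under an earlier policy version, i.e., the prefix carried over from an aborted rollout. A minimal sketch (hypothetical helper, not from this codebase):

```python
def off_policy_token_fraction(samples):
    """samples: list of (carried_over_tokens, fresh_tokens) pairs, where the
    first element counts tokens generated before the last policy update."""
    carried = sum(old for old, _ in samples)
    total = sum(old + new for old, new in samples)
    return carried / total if total else 0.0

# Example: half the trajectories resume a 500-token prefix from last round.
batch = [(500, 700)] * 16 + [(0, 900)] * 16
print(f"{off_policy_token_fraction(batch):.0%} off-policy tokens")
```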

## Directory Structure

```
APRIL/
├── scripts/
│   └── partial_rollout/
│       ├── deepseek/          # Experiment code for deepseek-r1-distill-1.5B
│       └── qwen/              # Experiment code for qwen3-4B
├── slime/
│   ├── backends/
│   ├── rollout/
│   │   └── sglang_example.py  # Core sampling code
│   ├── ray/                   # Core scheduling logic
│   │   └── buffer.py          # Buffer implementation
│   └── utils/
└── tools/                     # Megatron format conversion tools
```

## Paper

(TODO: arXiv link for the paper)
