Webscale-RL is an automated data pipeline that scales RL datasets for LLM training to web scale. It converts pretraining data into reinforcement learning (RL) datasets in question-answer format, which makes it possible to build pretraining-scale RL datasets while preserving the diversity and quality of the original pretraining data.
Note: The dataset we released was generated using GPT and should not be used to develop models that compete with OpenAI.
The Webscale-RL pipeline employs a multi-stage approach to transform diverse pretraining materials into structured QA pairs suitable for RL training:
- Filter: Pre-processes and filters raw materials for quality
- Identifier: Identifies domain classification and target persona
- Generator: Creates question-answer pairs based on identified personas
- Checker: Validates generated content for quality and correctness
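The four stages above can be sketched as a single function that threads one document through each step. This is a minimal illustration, not the repository's implementation: `QAPair`, `run_pipeline`, and the stage callables are hypothetical names, and the lambdas in the usage example are stubs standing in for the LLM calls the real pipeline makes.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class QAPair:
    """One RL training sample (hypothetical container for illustration)."""
    question: str
    answer: str
    domain: str
    persona: str


def run_pipeline(
    document: str,
    passes_filter: Callable[[str], bool],
    identify: Callable[[str], Tuple[str, str]],
    generate: Callable[[str, str, str], Tuple[str, str]],
    check: Callable[[str, str, str], bool],
) -> Optional[QAPair]:
    """Run one document through the four stages; return None if it is rejected."""
    if not passes_filter(document):                          # Filter
        return None
    domain, persona = identify(document)                     # Identifier
    question, answer = generate(document, domain, persona)   # Generator
    if not check(document, question, answer):                # Checker
        return None
    return QAPair(question, answer, domain, persona)


# Stub stages standing in for LLM calls (assumptions, for illustration only).
pair = run_pipeline(
    "Water boils at 100 degrees Celsius at sea level.",
    passes_filter=lambda doc: len(doc.split()) >= 5,
    identify=lambda doc: ("Natural Science", "Student"),
    generate=lambda doc, domain, persona: (
        "At what temperature does water boil at sea level?",
        "100 degrees Celsius",
    ),
    check=lambda doc, question, answer: answer in doc,
)
print(pair)
```

Passing the stages in as callables keeps the control flow visible and lets each stage be swapped or tested on its own.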
- Multi-stage Pipeline: Robust 4-stage processing ensuring high-quality output
- Persona-based Generation: Creates diverse QA pairs from different persona perspectives
- Diverse Domains: Supports 10+ domains, going beyond typical post-training datasets that focus on specific domains such as math and coding
- Python 3.8 or higher
- OpenAI API key
- Clone the repository:
git clone https://github.com/SalesforceAIResearch/PretrainRL-pipeline
cd PretrainRL-pipeline
- Install the package:
pip install -r requirements.txt
pip install -e .
- Set up your OpenAI API key:
export OPENAI_API_KEY=your_api_key
For standard processing with individual API calls, first change the data-loading logic in main.py so that it loads your own pretraining data, then run:
python main.py \
--model gpt-4.1 \
--seed_dataset_dir [your_pretrain_data_dir] \
--RL_dataset_save_dir data/RL_datasets \
--RL_dataset_filename webscale_rl.jsonl \
--failure_log_filename failure_log.jsonl \
--workers 10

| Parameter | Type | Default | Description |
|---|---|---|---|
| `--model` | str | `gpt-4.1` | OpenAI model to use |
| `--seed_dataset_dir` | str | `""` | The directory of the pretrain data |
| `--RL_dataset_save_dir` | str | `data/RL_datasets` | Output directory of RL datasets |
| `--RL_dataset_filename` | str | `webscale_rl.jsonl` | Output filename of RL datasets |
| `--failure_log_filename` | str | `failure_log.jsonl` | Failure log filename |
| `--temperature` | float | `1.0` | Model temperature |
| `--max-tokens` | int | `4096` | Maximum output tokens per request |
| `--num-fewshot` | int | `2` | Number of few-shot examples |
| `--seed` | int | `42` | Random seed |
| `--workers` | int | `10` | Number of parallel workers |
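For reference, the parameters listed above could be declared with `argparse` roughly as follows. This is a reconstruction from the documented flags and defaults, not a copy of the parser in main.py; the function name `build_parser` is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(description="Webscale-RL pipeline options")
    parser.add_argument("--model", type=str, default="gpt-4.1",
                        help="OpenAI model to use")
    parser.add_argument("--seed_dataset_dir", type=str, default="",
                        help="Directory of the pretrain data")
    parser.add_argument("--RL_dataset_save_dir", type=str, default="data/RL_datasets",
                        help="Output directory of RL datasets")
    parser.add_argument("--RL_dataset_filename", type=str, default="webscale_rl.jsonl",
                        help="Output filename of RL datasets")
    parser.add_argument("--failure_log_filename", type=str, default="failure_log.jsonl",
                        help="Failure log filename")
    parser.add_argument("--temperature", type=float, default=1.0,
                        help="Model temperature")
    parser.add_argument("--max-tokens", type=int, default=4096,
                        help="Maximum output tokens per request")
    parser.add_argument("--num-fewshot", type=int, default=2,
                        help="Number of few-shot examples")
    parser.add_argument("--seed", type=int, default=42,
                        help="Random seed")
    parser.add_argument("--workers", type=int, default=10,
                        help="Number of parallel workers")
    return parser


# Parse an explicit argument list so the example runs without a command line.
args = build_parser().parse_args(["--seed_dataset_dir", "my_corpus", "--workers", "4"])
print(args.model, args.seed_dataset_dir, args.workers, args.max_tokens)
```

Note that `argparse` converts hyphens in flag names to underscores, so `--max-tokens` is read back as `args.max_tokens`.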
The pipeline supports configurable model settings for each stage:
- Filter Stage: Pre-processes raw materials
- Identifier Stage: Domain and persona identification
- Generator Stage: QA pair generation
- Checker Stage: Quality validation
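One way to picture per-stage model settings is a small config object per stage. The sketch below is an assumption about how such settings might be organized, not the repository's configuration format: `StageConfig`, `stage_configs`, and the lower temperatures chosen for the non-generative stages are all hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class StageConfig:
    """Model settings for a single stage (hypothetical structure)."""
    model: str = "gpt-4.1"
    temperature: float = 1.0
    max_tokens: int = 4096


base = StageConfig()

# Illustrative choice: deterministic settings for classification-like stages,
# default sampling temperature for generation.
stage_configs = {
    "filter": replace(base, temperature=0.0),
    "identifier": replace(base, temperature=0.0),
    "generator": base,
    "checker": replace(base, temperature=0.0),
}

for stage, config in stage_configs.items():
    print(stage, config)
```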
The pipeline generates JSONL files with the following structure:
{
  "original_text": "...",
  "domain": "Technology & Engineering",
  "persona": "Software Developer",
  "question": "What is the primary advantage of using microservices architecture?",
  "answer": "Microservices architecture allows for better scalability, maintainability, and independent deployment of services.",
  "metadata": "..."
}

Custom Domain Configuration
To add custom domains, you need to:
- modify the `ALL_DOMAINS` list in `webscale_rl/behavior_template/identifier.py`
- add the domain-specific few-shot examples in `webscale_rl/domain_specific_library/`
- modify the `FEW_SHOT_LIBRARY_PATH` in `webscale_rl/agent/agent.py` and ensure that all domains are added to the `FEW_SHOT_LIBRARY_PATH`
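After making the three changes above, a quick consistency check can catch a domain that was added to the list but not to the few-shot library. The sketch below assumes `FEW_SHOT_LIBRARY_PATH` is a dictionary mapping domain names to file paths; the actual structure in the repository may differ, and the domain names, file names, and the helper `domains_missing_few_shots` are hypothetical.

```python
# Hypothetical stand-ins for the values defined in the repository.
ALL_DOMAINS = ["Technology & Engineering", "Mathematics", "Marine Biology"]

FEW_SHOT_LIBRARY_PATH = {
    "Technology & Engineering": "webscale_rl/domain_specific_library/technology_engineering.json",
    "Mathematics": "webscale_rl/domain_specific_library/mathematics.json",
}


def domains_missing_few_shots(all_domains, library_paths):
    """Return the domains that have no few-shot library entry, in list order."""
    return [domain for domain in all_domains if domain not in library_paths]


missing = domains_missing_few_shots(ALL_DOMAINS, FEW_SHOT_LIBRARY_PATH)
print(missing)  # the newly added domain still needs a few-shot file
```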
Custom Persona Generation
The system automatically identifies appropriate personas based on content. You can customize persona selection in the identifier template.
Quality Control
The pipeline includes multiple quality control mechanisms:
- Pre-filtering to remove low-quality pretraining materials
- Semantic validation for answer correctness
- Information leakage detection to ensure that the question does not reveal the answer
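As an illustration of the leakage check, a cheap lexical test can flag questions that contain the answer verbatim. This is only a sketch of the idea under a simple assumption (exact word-sequence match after normalization); it is not the pipeline's checker, and `normalize` and `leaks_answer` are hypothetical helpers.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9\s]", " ", text.lower()).split())


def leaks_answer(question: str, answer: str) -> bool:
    """Return True if the normalized answer appears as whole words in the question."""
    q, a = normalize(question), normalize(answer)
    if not a:
        return False
    # Pad with spaces so "4" does not match inside "42".
    return f" {a} " in f" {q} "


print(leaks_answer("Why is the Seine the river that flows through Paris?", "The Seine"))
print(leaks_answer("Which river flows through Paris?", "The Seine"))
```

A lexical test like this misses paraphrased leaks, so it can serve at most as a pre-check before a semantic validation step.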
We constructed the Webscale-RL dataset with this pipeline. It contains ~1.2M samples with the following distribution (we did not release the data converted from Stack-v2 due to license issues):
While this dataset contains ~1.2M samples, the pipeline can readily scale the dataset further to pretraining level.
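To inspect the domain distribution of a generated file, the JSONL output can be tallied line by line. The sketch below assumes each record has a `domain` field as in the output structure shown earlier; `domain_distribution` is a hypothetical helper, and the sample records are made up for illustration.

```python
import json
from collections import Counter
from typing import Iterable


def domain_distribution(lines: Iterable[str]) -> Counter:
    """Count records per domain from an iterable of JSONL lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:  # skip blank lines
            continue
        counts[json.loads(line)["domain"]] += 1
    return counts


# Made-up sample records in the documented output structure.
sample = [
    '{"domain": "Technology & Engineering", "question": "...", "answer": "..."}',
    '{"domain": "Mathematics", "question": "...", "answer": "..."}',
    '{"domain": "Technology & Engineering", "question": "...", "answer": "..."}',
    "",
]

counts = domain_distribution(sample)
for domain, n in counts.most_common():
    print(f"{domain}: {n}")
```

Because the function accepts any iterable of lines, an open file handle (for example one opened on `data/RL_datasets/webscale_rl.jsonl`) can be passed in directly, so large files are streamed rather than loaded into memory.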
If you find this work useful, please consider citing:
@article{cen2025webscalerl,
title={Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels},
author={Zhepeng Cen and Haolin Chen and Shiyu Wang and Zuxin Liu and Zhiwei Liu and Ding Zhao and Silvio Savarese and Caiming Xiong and Huan Wang and Weiran Yao},
journal={arXiv preprint arXiv:2510.06499},
year={2025},
}
