Webscale-RL Data Pipeline

Webscale-RL is an automated data pipeline that scales RL datasets for LLM training up to web scale. It converts pretraining data into reinforcement learning (RL) datasets in question-answer format, enabling the creation of pretraining-scale RL datasets while maintaining the diversity and quality of the original pretraining data.

Note: The dataset we released was generated using GPT and should not be used to develop models that compete with OpenAI.

🎯 Overview

The Webscale-RL pipeline employs a multi-stage approach to transform diverse pretraining materials into structured QA pairs suitable for RL training:

Filter: Pre-processes and filters raw materials for quality
Identifier: Classifies the domain and identifies the target persona
Generator: Creates question-answer pairs based on identified personas
Checker: Validates generated content for quality and correctness
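The four stages above can be sketched as a chain of functions. This is a minimal illustration, not the repository's implementation: `QAPair`, `run_pipeline`, and the four stage callables are hypothetical names, and the stub stages in the example stand in for the model calls the real pipeline makes. The domain and persona values are placeholders.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class QAPair:
    """One pipeline output (hypothetical container for illustration)."""
    original_text: str
    domain: str
    persona: str
    question: str
    answer: str


def run_pipeline(
    text: str,
    filter_fn: Callable[[str], bool],
    identify_fn: Callable[[str], Tuple[str, str]],
    generate_fn: Callable[[str, str, str], Tuple[str, str]],
    check_fn: Callable[[QAPair], bool],
) -> Optional[QAPair]:
    """Run Filter -> Identifier -> Generator -> Checker on one document."""
    # Filter: drop low-quality source material early.
    if not filter_fn(text):
        return None
    # Identifier: pick a domain and a target persona.
    domain, persona = identify_fn(text)
    # Generator: write a question-answer pair for that persona.
    question, answer = generate_fn(text, domain, persona)
    pair = QAPair(text, domain, persona, question, answer)
    # Checker: keep the pair only if it passes validation.
    return pair if check_fn(pair) else None


# Stub stages; in the real pipeline each stage would call a model.
example = run_pipeline(
    "Water boils at 100 degrees Celsius at sea level.",
    filter_fn=lambda t: len(t.split()) >= 5,
    identify_fn=lambda t: ("Science", "Student"),
    generate_fn=lambda t, d, p: (
        "At what temperature does water boil at sea level?",
        "100 degrees Celsius",
    ),
    check_fn=lambda pair: pair.answer.lower() not in pair.question.lower(),
)
print(example)
```

A document rejected by the filter or the checker yields `None`, which is one simple way to route it to a failure log.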

Key Features

🔄 Multi-stage Pipeline: Robust 4-stage processing ensuring high-quality output
🎭 Persona-based Generation: Creates diverse QA pairs from different persona perspectives
🏷️ Diverse Domains: Supports 10+ domains, going beyond typical post-training datasets that focus on specific domains such as math and coding

🚀 Installation

Prerequisites

Python 3.8 or higher
OpenAI API key

Setup

Clone the repository:

git clone https://github.com/SalesforceAIResearch/PretrainRL-pipeline
cd PretrainRL-pipeline

Install the package:

pip install -r requirements.txt
pip install -e .

Set up your OpenAI API key:

export OPENAI_API_KEY=your_api_key
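Before launching a long run, it can help to confirm that the key is actually visible to Python. The helper below is a small sketch; `require_api_key` is a hypothetical name and is not part of the repository.

```python
import os
from typing import Mapping


def require_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return the OpenAI API key from the environment, or fail with a clear message."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; export it before running the pipeline.")
    return key


# Demonstration with an explicit mapping so the example does not depend on the shell.
print(require_api_key({"OPENAI_API_KEY": "your_api_key"}))
```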

📖 Usage

Basic Usage (Single API Calls)

For standard processing with individual API calls, run the command below. Before running it, replace the data-loading logic in main.py with code that loads your own pretraining data.

python main.py \
    --model gpt-4.1 \
    --seed_dataset_dir [your_pretrain_data_dir] \
    --RL_dataset_save_dir data/RL_datasets \
    --RL_dataset_filename webscale_rl.jsonl \
    --failure_log_filename failure_log.jsonl \
    --workers 10
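As a rough illustration of the kind of loading logic you might substitute in main.py, the sketch below reads documents from a directory of JSONL files. Everything here is an assumption about your data, not about the repository: the function name `iter_seed_documents`, the `*.jsonl` layout, and the `text` field are hypothetical.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_seed_documents(seed_dataset_dir: str, text_field: str = "text") -> Iterator[str]:
    """Yield non-empty document texts from every *.jsonl file in a directory."""
    for path in sorted(Path(seed_dataset_dir).glob("*.jsonl")):
        with path.open(encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                record = json.loads(line)
                text = record.get(text_field)
                # Keep only records that carry a usable text string.
                if isinstance(text, str) and text.strip():
                    yield text
```

Because this is a generator, documents are streamed one at a time, which matters when the seed corpus is too large to hold in memory.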

⚙️ Configuration

Command Line Arguments

Parameter	Type	Default	Description
`--model`	str	`gpt-4.1`	OpenAI model to use
`--seed_dataset_dir`	str	`""`	The directory of the pretrain data
`--RL_dataset_save_dir`	str	`data/RL_datasets`	Output directory of RL datasets
`--RL_dataset_filename`	str	`webscale_rl.jsonl`	Output filename of RL datasets
`--failure_log_filename`	str	`failure_log.jsonl`	Failure log filename
`--temperature`	float	`1.0`	Model temperature
`--max-tokens`	int	`4096`	Maximum output tokens per request
`--num-fewshot`	int	`2`	Number of few-shot examples
`--seed`	int	`42`	Random seed
`--workers`	int	`10`	Number of parallel workers
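For reference, the table above could be expressed with `argparse` roughly as follows. This is a sketch that mirrors the listed flags and defaults as written; the actual parser in main.py may differ, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented flags (assumed, not copied from main.py)."""
    p = argparse.ArgumentParser(description="Webscale-RL data pipeline (sketch)")
    p.add_argument("--model", type=str, default="gpt-4.1", help="OpenAI model to use")
    p.add_argument("--seed_dataset_dir", type=str, default="", help="Directory of the pretrain data")
    p.add_argument("--RL_dataset_save_dir", type=str, default="data/RL_datasets", help="Output directory")
    p.add_argument("--RL_dataset_filename", type=str, default="webscale_rl.jsonl", help="Output filename")
    p.add_argument("--failure_log_filename", type=str, default="failure_log.jsonl", help="Failure log filename")
    p.add_argument("--temperature", type=float, default=1.0, help="Model temperature")
    p.add_argument("--max-tokens", type=int, default=4096, help="Maximum output tokens per request")
    p.add_argument("--num-fewshot", type=int, default=2, help="Number of few-shot examples")
    p.add_argument("--seed", type=int, default=42, help="Random seed")
    p.add_argument("--workers", type=int, default=10, help="Number of parallel workers")
    return p


# Parse an explicit argument list; unspecified flags fall back to their defaults.
args = build_parser().parse_args(["--workers", "4"])
print(args.workers, args.max_tokens, args.RL_dataset_save_dir)
```

Note that `argparse` exposes hyphenated flags with underscores, so `--max-tokens` is read as `args.max_tokens`.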

Model Configuration

The pipeline supports configurable model settings for each stage:

Filter Stage: Pre-processes raw materials
Identifier Stage: Domain and persona identification
Generator Stage: QA pair generation
Checker Stage: Quality validation

📊 Output Format

The pipeline generates JSONL files with the following structure:

{
  "original_text": "...",
  "domain": "Technology & Engineering",
  "persona": "Software Developer",
  "question": "What is the primary advantage of using microservices architecture?",
  "answer": "Microservices architecture allows for better scalability, maintainability, and independent deployment of services.",
  "metadata": "..."
}
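A short sketch of how the output might be consumed: it reads a JSONL file with the fields shown above and counts records per domain. The helper names `load_records` and `domain_counts` are hypothetical, and the `"unknown"` fallback is an assumption for records that lack a `domain` field.

```python
import json
from collections import Counter
from typing import Dict, Iterable, List


def load_records(path: str) -> List[Dict]:
    """Read one JSON object per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                records.append(json.loads(line))
    return records


def domain_counts(records: Iterable[Dict]) -> Counter:
    """Count records per domain, using 'unknown' when the field is missing."""
    return Counter(r.get("domain", "unknown") for r in records)
```

For example, `domain_counts(load_records("data/RL_datasets/webscale_rl.jsonl")).most_common(5)` would list the five most frequent domains in a file produced with the default output settings.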

🔧 Advanced Usage

Custom Domain Configuration

To add custom domains, you need to:

Modify the ALL_DOMAINS list in webscale_rl/behavior_template/identifier.py.
Add the domain-specific few-shot examples in webscale_rl/domain_specific_library/.
Update FEW_SHOT_LIBRARY_PATH in webscale_rl/agent/agent.py and ensure that every domain is included in it.
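After editing these files, a quick consistency check can catch a domain that was added to one place but not the other. The sketch below assumes, without having confirmed it against the repository, that FEW_SHOT_LIBRARY_PATH is a collection keyed by domain name (for example a dict from domain to file path). The function name and the sample values are hypothetical.

```python
from typing import Container, Iterable, List


def missing_few_shot_domains(all_domains: Iterable[str], few_shot_library: Container[str]) -> List[str]:
    """Return the domains that have no entry in the few-shot library."""
    return [d for d in all_domains if d not in few_shot_library]


# Placeholder values for illustration only; the real lists live in the repository.
all_domains = ["Technology & Engineering", "My New Domain"]
few_shot_library = {"Technology & Engineering": "path/to/examples.json"}
print(missing_few_shot_domains(all_domains, few_shot_library))
```

An empty result means every listed domain has a few-shot entry.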

Custom Persona Generation

The system automatically identifies appropriate personas based on content. You can customize persona selection in the identifier template.

Quality Control

The pipeline includes multiple quality control mechanisms:

Pre-filtering to remove low-quality pretraining materials
Semantic validation for answer correctness
Information leakage detection to prevent the question from revealing the answer
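To make the leakage check concrete, here is a deliberately simple string-level heuristic: it flags a pair when the normalized answer appears as a whole phrase inside the question. This is not the pipeline's checker, which this README describes only at a high level; `leaks_answer` is a hypothetical helper and will miss paraphrased leaks.

```python
import re
from typing import List


def _tokens(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).split()


def leaks_answer(question: str, answer: str) -> bool:
    """Return True if the answer appears verbatim (as whole tokens) in the question."""
    q = " ".join(_tokens(question))
    a = " ".join(_tokens(answer))
    # Pad with spaces so that only whole-token matches count.
    return bool(a) and f" {a} " in f" {q} "


print(leaks_answer("Is the capital of France Paris?", "Paris"))
print(leaks_answer("What is the capital of France?", "Paris"))
```

The first call reports a leak because the answer is stated in the question; the second does not.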

📊 Webscale-RL Dataset

We constructed the Webscale-RL dataset with this pipeline. It contains ~1.2M samples with the following distribution (we did not release the data converted from Stack-v2 due to license issues):

[Figure: distribution of the Webscale-RL dataset]

Although this dataset contains ~1M samples, the pipeline can readily scale it further, up to pretraining level.

📝 Citation

If you find this work useful, please consider citing:

@article{cen2025webscalerl,
  title={Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels},
  author={Zhepeng Cen and Haolin Chen and Shiyu Wang and Zuxin Liu and Zhiwei Liu and Ding Zhao and Silvio Savarese and Caiming Xiong and Huan Wang and Weiran Yao},
  journal={arXiv preprint arXiv:2510.06499},
  year={2025},
}
