RLHF-Lab is a unified laboratory for experimenting with RLHF-style methods under a single, consistent pipeline.
It provides toy but faithful implementations (for small models and tiny datasets) of:
- SFT (Supervised Fine-Tuning)
- PPO-style RLHF (InstructGPT-like, simplified)
- DPO (Direct Preference Optimization, simplified no-ref variant)
- IPO (margin-based preference optimization, simplified)
- ORPO (simplified variant: SFT loss plus a KL penalty to a reference model)
- RLAIF (AI feedback; implemented as DPO with AI-labeled preferences)
- Active Preference Learning (uncertainty-based pair selection, simplified)
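As a rough illustration of what "simplified" means for the preference-based methods above, here is a minimal sketch of a reference-free DPO-style loss in PyTorch. It is illustrative only, not the code in `unirlhf/train/dpo_trainer.py`; the function name and the `beta` default are assumptions.

```python
# Minimal sketch of a reference-free (no-ref) DPO-style loss.
# Illustrative only; NOT the actual implementation in unirlhf/train/dpo_trainer.py.
import torch
import torch.nn.functional as F

def dpo_loss_no_ref(chosen_logps: torch.Tensor,
                    rejected_logps: torch.Tensor,
                    beta: float = 0.1) -> torch.Tensor:
    """chosen_logps / rejected_logps: summed log-probs the policy assigns to the
    chosen / rejected responses for each prompt (shape: [batch])."""
    margin = beta * (chosen_logps - rejected_logps)
    # Push the policy to prefer the chosen response over the rejected one.
    return -F.logsigmoid(margin).mean()

# Toy usage with made-up log-probabilities:
loss = dpo_loss_no_ref(torch.tensor([-12.3, -8.0]), torch.tensor([-15.1, -9.2]))
print(f"toy DPO loss: {loss.item():.4f}")
```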
On top of these, RLHF-Lab provides a shared evaluation toolkit:
- LM Basic Quality — Perplexity (PPL), BERTScore
- Preference Fit — Win Rate, Bradley–Terry score
- Robustness — Self-consistency entropy, injection success rate
- Reward Consistency — KL to SFT, average reward-model score
- Compute Efficiency — Latency, tokens/sec, approx FLOPs/token (placeholder)
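To make the first metric group concrete, the snippet below computes perplexity with HuggingFace Transformers on the same tiny model used later in this README. This wiring is a minimal sketch, not RLHF-Lab's actual evaluator code; `unirlhf/eval/lm_basic.py` may differ.

```python
# Minimal perplexity (PPL) sketch using HuggingFace Transformers.
# Illustrative only; the real metric lives in unirlhf/eval/lm_basic.py.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "sshleifer/tiny-gpt2"  # tiny model also used by the examples
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def perplexity(text: str) -> float:
    """Return exp(mean token-level cross-entropy) of `text` under the model."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc, labels=enc["input_ids"])
    return math.exp(out.loss.item())

print(perplexity("RLHF-Lab evaluates small language models."))
```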
⚠️ Scope & Intended Use
- Designed for research prototypes, teaching, and benchmarking on tiny models
- Not intended as a production RLHF library or large-scale training framework
- Uses only free/open-source dependencies (PyTorch, HuggingFace, BERTScore, NumPy, etc.)
```bash
git clone https://github.com/REICHIYAN/rlhf_lab.git
cd rlhf_lab
pip install -e .
```

or, inside the project root:

```bash
pip install .
```

Core dependencies are:

- torch
- transformers
- bert-score
- numpy
- pandas

These are declared in `pyproject.toml` and `requirements.txt`.
```
rlhf_lab/
├── pyproject.toml
├── README.md
├── requirements.txt
├── LICENSE
├── unirlhf/
│   ├── __init__.py
│   ├── data/
│   │   ├── __init__.py
│   │   └── schemas.py
│   ├── models/
│   │   ├── __init__.py
│   │   ├── interfaces.py
│   │   └── dummy.py
│   ├── eval/
│   │   ├── __init__.py
│   │   ├── lm_basic.py
│   │   ├── preference.py
│   │   ├── robustness.py
│   │   ├── reward_consistency.py
│   │   ├── compute_efficiency.py
│   │   └── runner.py
│   └── train/
│       ├── __init__.py
│       ├── datasets.py
│       ├── sft_trainer.py
│       ├── ppo_trainer.py
│       ├── dpo_trainer.py
│       ├── ipo_trainer.py
│       ├── orpo_trainer.py
│       ├── rlaif_trainer.py
│       └── active_pl_trainer.py
├── examples/
│   ├── run_dummy_evaluation.py
│   └── run_all_methods_tiny_gpt2.py
├── test_data/
│   ├── prompts.jsonl
│   ├── injection_base_prompts.jsonl
│   ├── comparisons.jsonl
│   ├── sft_train.jsonl
│   └── pref_train.jsonl
└── tests/
    ├── unit/
    │   └── test_basic_flow.py
    └── integration/
        └── test_train_smoke.py
```
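The `test_data/*.jsonl` files are line-delimited JSON. Below is a hypothetical loading snippet; the exact record fields are defined in `unirlhf/data/schemas.py`, so the shape shown in the comment is an assumption, not the actual schema.

```python
# Hypothetical sketch of reading a preference dataset from test_data/.
# The real field names come from unirlhf/data/schemas.py and may differ.
import json

with open("test_data/pref_train.jsonl") as f:
    pairs = [json.loads(line) for line in f if line.strip()]

# Assumed record shape (illustrative only):
#   {"prompt": "...", "chosen": "...", "rejected": "..."}
print(f"loaded {len(pairs)} preference records")
```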
- Upload the zip (`rlhf_lab.zip`) and unzip:

```bash
!unzip rlhf_lab.zip -d .
%cd rlhf_lab
!pip install -e .
```

- Run all methods training + evaluation (tiny GPT-2):

```bash
python -m examples.run_all_methods_tiny_gpt2
```

This will:

- Download a tiny causal LM (`sshleifer/tiny-gpt2`)
- Train small models for:
  - SFT
  - PPO-style RLHF
  - DPO / IPO / ORPO
  - RLAIF
  - Active Preference Learning
- Run the unified evaluator (`UnifiedEvaluator`) on these models
- Print a comparison table over all 5 metric groups

- To test the evaluation pipeline alone (no HF / internet needed):

```bash
python -m examples.run_dummy_evaluation
```

Unit tests (no external downloads):

```bash
pytest tests/unit
```

Integration tests (downloads a tiny HF model):

```bash
pytest tests/integration
```

This project is distributed under the MIT License. See LICENSE for details.
If you use RLHF-Lab in academic work, you might cite it informally as:
RLHF-Lab: A Unified Laboratory for RLHF-style Methods
R. Taguchi, 2025.
https://github.com/REICHIYAN/rlhf_lab
Adjust the author / URL as appropriate.