This repository demonstrates how to improve language models using Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO). It is designed for a hands-on tutorial with a `student` branch (fill-in-the-blank TODOs) and the full solution on the `main` branch.
- DPO: align response style and tone using preference pairs.
- GRPO: improve reasoning with reward-based practice.
- Compare a base model vs a tuned model in a live chat UI.
- `main` branch: `main.ipynb` is the full solution notebook.
- `student` branch: `student.ipynb` is the fill-in-the-blanks notebook.
- `chat_app.py` is the Gradio comparison UI.
- `utils.py` contains the LoRA merge helper.
- `presentation.pdf` and `presentation.pptx` are the slides.
- Install dependencies: `pip install -r requirements.txt`
- Log in to Hugging Face: `huggingface-cli login`
- Set `WANDB_API_KEY`, `WANDB_PROJECT`, and `WANDB_ENTITY` in your environment (copy `.env.example` to `.env` and export it in your shell; a quick check is sketched after this list).
- Open `main.ipynb` (solution) or switch to the `student` branch for the TODO version in `student.ipynb`.
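If you want to confirm the variables are actually visible to Python, here is a minimal check (a sketch; it assumes you exported the variables in your shell rather than relying on extra dotenv tooling):

```python
import os

# Verify the W&B variables are set before launching training.
for var in ("WANDB_API_KEY", "WANDB_PROJECT", "WANDB_ENTITY"):
    print(f"{var}: {'set' if os.getenv(var) else 'MISSING'}")
```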
Recommended: open the notebook directly using the Colab badges above.
If you prefer to clone manually:
- Clone the repo: `!git clone https://github.com/BounharAbdelaziz/RLHF.git`
- Open `student.ipynb` (student branch) or `main.ipynb` (solution) from the Files sidebar.
- Run the setup cells at the top of the notebook.
```text
.
├── main.ipynb        # DPO & GRPO training, merging, testing (solution)
├── student.ipynb     # TODO version (student branch)
├── chat_app.py       # Gradio chat app for model comparison
├── utils.py          # Utilities (merging, testing)
├── docs/             # Student & instructor guides
├── presentation.*    # Slides
└── requirements.txt  # Python dependencies
```
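`utils.py` holds the LoRA merge helper. The snippet below is a sketch of what such a merge typically looks like with `peft`, not the repo's exact code; the model id and paths are placeholders:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder base model id
adapter_dir = "dpo-output"              # placeholder path to the trained LoRA adapter

# Load the base model, attach the LoRA adapter, then fold the adapter weights
# into the base weights so the result is a plain transformers checkpoint.
base = AutoModelForCausalLM.from_pretrained(base_id)
merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()

merged.save_pretrained("merged-model")
AutoTokenizer.from_pretrained(base_id).save_pretrained("merged-model")
```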
Suggested run-of-show (90 minutes)
- Intro and goals.
- LLM training in practice, DPO & GRPO concepts, and when to use each (slides 2-8).
- DPO hands-on. Students complete the DPO TODOs and run training (a minimal DPO sketch follows this list).
- GRPO hands-on. Students complete the GRPO TODOs and run training (a GRPO sketch follows this list).
- (Optional) Compare models in the chat app and discuss results.
- Risks and safety. Use slides 9 to 12 for discussion.
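The DPO hands-on follows TRL's `DPOTrainer` pattern: train directly on preference pairs, with no explicit reward model. This is a minimal sketch, not the notebook's exact code; the model id and dataset are placeholders, and some argument names differ across TRL versions:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder; the notebook picks its own base model
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Preference pairs: each row carries a prompt plus a "chosen" and a "rejected" response.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")  # placeholder dataset

args = DPOConfig(output_dir="dpo-output", report_to="wandb")
trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,  # named `tokenizer` in older TRL releases
)
trainer.train()
```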
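The GRPO hands-on swaps preference labels for reward functions: the trainer samples a group of completions per prompt and reinforces the ones that score above the group average. Below is a minimal sketch in the style of TRL's `GRPOTrainer` quickstart, with a toy length-based reward and placeholder model/dataset ids (the notebook's rewards are task-specific):

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# A reward function receives the sampled completions and returns one score per completion.
def reward_len(completions, **kwargs):
    # Toy reward: prefer completions close to 200 characters.
    return [-abs(len(completion) - 200) / 200.0 for completion in completions]

dataset = load_dataset("trl-lib/tldr", split="train")  # placeholder prompt dataset

args = GRPOConfig(output_dir="grpo-output", report_to="wandb")
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder model id
    reward_funcs=reward_len,
    args=args,
    train_dataset=dataset,
)
trainer.train()
```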
Common pitfalls to watch
- Missing `WANDB_API_KEY` or Hugging Face login.
- GPU not available or incompatible PyTorch build (see the quick check after this list).
- Running out of VRAM when using larger models.
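A quick pre-flight check for the GPU-related pitfalls (a sketch to run in the first notebook cell):

```python
import torch

# Confirm a CUDA-capable GPU is visible and report roughly how much VRAM it has.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{torch.cuda.get_device_name(0)}: {props.total_memory / 1e9:.1f} GB VRAM")
```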
Discussion questions
- Why does DPO work without an explicit reward model?
- When would GRPO be worth the extra compute cost?
- What kinds of reward hacking could appear here?
- This workshop assumes an NVIDIA GPU. Install PyTorch for your CUDA version from pytorch.org.
- Use W&B for experiment tracking, or disable it in the notebook configs.
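If you would rather run without W&B, one way to disable it without touching the training code (a sketch; `WANDB_MODE` is read by the wandb client):

```python
import os

# Turn off all W&B logging for this process.
os.environ["WANDB_MODE"] = "disabled"

# Alternatively, pass report_to="none" when building the training config.
```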
Happy fine-tuning.