This repository demonstrates how to improve language models using Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO). It is designed for a hands-on tutorial with a `student` branch (fill-in-the-blank TODOs) and the full solution on the `main` branch.
- DPO: align response style and tone using preference pairs.
- GRPO: improve reasoning with reward-based practice.
- Compare a base model vs a tuned model in a live chat UI.
- `main` branch: `main.ipynb` is the full solution notebook.
- `student` branch: `student.ipynb` is the fill-in-the-blanks notebook.
- `chat_app.py` is the Gradio comparison UI.
- `utils.py` contains the LoRA merge helper.
- `presentation.pdf` and `presentation.pptx` are the slides.
- Install dependencies: `pip install -r requirements.txt`
- Log in to Hugging Face: `huggingface-cli login`
- Set `WANDB_API_KEY`, `WANDB_PROJECT`, and `WANDB_ENTITY` in your environment (copy `.env.example` to `.env` and export it in your shell; a quick check is sketched after this list).
- Open `main.ipynb` (solution) or switch to the `student` branch for the TODO version in `student.ipynb`.
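If you want to confirm the variables are actually visible to Python, here is a minimal check (a sketch; it assumes you exported the variables in your shell rather than relying on extra dotenv tooling):

```python
import os

# Verify the W&B variables are set before launching training.
for var in ("WANDB_API_KEY", "WANDB_PROJECT", "WANDB_ENTITY"):
    print(f"{var}: {'set' if os.getenv(var) else 'MISSING'}")
```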
Recommended: open the notebook directly using the Colab badges above.
If you prefer to clone manually:
- Clone the repo: `!git clone https://github.com/BounharAbdelaziz/RLHF.git`
- Open `student.ipynb` (student branch) or `main.ipynb` (solution) from the Files sidebar.
- Run the setup cells at the top of the notebook.
```text
.
├── main.ipynb        # DPO & GRPO training, merging, testing (solution)
├── student.ipynb     # TODO version (student branch)
├── chat_app.py       # Gradio chat app for model comparison
├── utils.py          # Utilities (merging, testing)
├── docs/             # Student & instructor guides
├── presentation.*    # Slides
└── requirements.txt  # Python dependencies
```
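`utils.py` holds the LoRA merge helper. The snippet below is a sketch of what such a merge typically looks like with `peft`, not the repo's exact code; the model id and paths are placeholders:

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder base model id
adapter_dir = "dpo-output"              # placeholder path to the trained LoRA adapter

# Load the base model, attach the LoRA adapter, then fold the adapter weights
# into the base weights so the result is a plain transformers checkpoint.
base = AutoModelForCausalLM.from_pretrained(base_id)
merged = PeftModel.from_pretrained(base, adapter_dir).merge_and_unload()

merged.save_pretrained("merged-model")
AutoTokenizer.from_pretrained(base_id).save_pretrained("merged-model")
```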
Suggested run-of-show (90 minutes)
- Intro and goals.
- LLM training in practice, DPO & GRPO concepts, and when to use each (slides 2-8).
- DPO hands-on. Students complete the DPO TODOs and run training (a minimal DPO sketch follows this list).
- GRPO hands-on. Students complete the GRPO TODOs and run training (a GRPO sketch follows this list).
- (Optional) Compare models in the chat app and discuss results.
- Risks and safety. Use slides 9 to 12 for discussion.
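The DPO hands-on follows TRL's `DPOTrainer` pattern: train directly on preference pairs, with no explicit reward model. This is a minimal sketch, not the notebook's exact code; the model id and dataset are placeholders, and some argument names differ across TRL versions:

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "Qwen/Qwen2.5-0.5B-Instruct"  # placeholder; the notebook picks its own base model
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Preference pairs: each row carries a prompt plus a "chosen" and a "rejected" response.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")  # placeholder dataset

args = DPOConfig(output_dir="dpo-output", report_to="wandb")
trainer = DPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    processing_class=tokenizer,  # named `tokenizer` in older TRL releases
)
trainer.train()
```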
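The GRPO hands-on swaps preference labels for reward functions: the trainer samples a group of completions per prompt and reinforces the ones that score above the group average. Below is a minimal sketch in the style of TRL's `GRPOTrainer` quickstart, with a toy length-based reward and placeholder model/dataset ids (the notebook's rewards are task-specific):

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# A reward function receives the sampled completions and returns one score per completion.
def reward_len(completions, **kwargs):
    # Toy reward: prefer completions close to 200 characters.
    return [-abs(len(completion) - 200) / 200.0 for completion in completions]

dataset = load_dataset("trl-lib/tldr", split="train")  # placeholder prompt dataset

args = GRPOConfig(output_dir="grpo-output", report_to="wandb")
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # placeholder model id
    reward_funcs=reward_len,
    args=args,
    train_dataset=dataset,
)
trainer.train()
```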
Common pitfalls to watch
- Missing `WANDB_API_KEY` or Hugging Face login.
- GPU not available or incompatible PyTorch build (see the quick check after this list).
- Running out of VRAM when using larger models.
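A quick pre-flight check for the GPU-related pitfalls (a sketch to run in the first notebook cell):

```python
import torch

# Confirm a CUDA-capable GPU is visible and report roughly how much VRAM it has.
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{torch.cuda.get_device_name(0)}: {props.total_memory / 1e9:.1f} GB VRAM")
```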
Discussion questions
- Why does DPO work without an explicit reward model?
- When would GRPO be worth the extra compute cost?
- What kinds of reward hacking could appear here?
- This workshop assumes an NVIDIA GPU. Install PyTorch for your CUDA version from pytorch.org.
- Use W&B for experiment tracking, or disable it in the notebook configs.
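If you would rather run without W&B, one way to disable it without touching the training code (a sketch; `WANDB_MODE` is read by the wandb client):

```python
import os

# Turn off all W&B logging for this process.
os.environ["WANDB_MODE"] = "disabled"

# Alternatively, pass report_to="none" when building the training config.
```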
Happy fine-tuning.