🤖 RLHF & RLVR Training Workshop (DPO + GRPO)


This repository demonstrates how to improve language models with Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO). It is designed as a hands-on tutorial with a student branch (notebooks with TODOs) and a solution branch (main).

🧠 What You’ll Learn

  • DPO: align response style and tone using preference pairs.
  • GRPO: improve reasoning with reward-based practice (a minimal data sketch follows this list).
  • Compare a base model vs a tuned model in a live chat UI.
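
To make the first two bullets concrete, here is an illustrative sketch of the data each method consumes. The field names follow the common TRL convention (prompt/chosen/rejected); the reward function is a toy example, not the one used in the notebook.

    # Illustrative only: the shapes DPO and GRPO typically operate on.
    # Field names follow the common TRL convention; the notebook's datasets may differ.

    # DPO consumes preference pairs: one prompt with a preferred and a rejected answer.
    dpo_example = {
        "prompt": "Explain overfitting in one sentence.",
        "chosen": "Overfitting is when a model memorizes the training data and fails to generalize.",
        "rejected": "Overfitting is good because the training loss gets very low.",
    }

    # GRPO samples a *group* of completions per prompt and scores each one with a reward function.
    def reward_exact_answer(completion: str, expected: str) -> float:
        """Toy verifiable reward: 1.0 if the expected answer appears in the completion."""
        return 1.0 if expected in completion else 0.0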

📦 Repo Layout

  • main branch: main.ipynb is the full solution notebook.
  • student branch: student.ipynb is the fill-in-the-blanks notebook.
  • chat_app.py is the Gradio comparison UI.
  • utils.py contains the LoRA merge helper (sketched below).
  • presentation.pdf and presentation.pptx are the slides.
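
For reference, a LoRA merge helper typically looks like the following. This is a minimal sketch using peft's merge_and_unload; the actual utils.py may differ in details such as dtype handling or where the tokenizer is saved.

    # Minimal sketch of a LoRA merge helper (the actual utils.py may differ).
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from peft import PeftModel

    def merge_lora(base_model_id: str, adapter_path: str, output_dir: str) -> None:
        base = AutoModelForCausalLM.from_pretrained(base_model_id)
        model = PeftModel.from_pretrained(base, adapter_path)  # attach the trained LoRA adapter
        merged = model.merge_and_unload()                      # fold adapter weights into the base weights
        merged.save_pretrained(output_dir)
        AutoTokenizer.from_pretrained(base_model_id).save_pretrained(output_dir)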

🚀 Quickstart (Local)

  1. Install dependencies:
    pip install -r requirements.txt
  2. Login to Hugging Face:
    huggingface-cli login
  3. Set WANDB_API_KEY, WANDB_PROJECT, and WANDB_ENTITY in your environment (copy .env.example to .env and export it in your shell); a quick sanity check is sketched after these steps.
  4. Open main.ipynb (solution) or switch to the student branch for the TODO version in student.ipynb.
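
If you want to verify the W&B variables before launching the notebook, a snippet along these lines works. It assumes python-dotenv is installed; adjust if your requirements.txt does not include it.

    # Quick sanity check for the W&B variables (assumes python-dotenv is available).
    import os
    from dotenv import load_dotenv

    load_dotenv()  # reads a local .env file if present

    for key in ("WANDB_API_KEY", "WANDB_PROJECT", "WANDB_ENTITY"):
        if not os.environ.get(key):
            print(f"Missing {key}: W&B logging will likely fail until it is set.")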

📓 Colab Links

  • Teacher (Solution Notebook): Open In Colab
  • Student (TODO Notebook): Open In Colab

🚀 Quickstart (Colab)

Recommended: open the notebook directly using the Colab badges above.

If you prefer to clone manually:

  1. Clone the repo:
    !git clone https://github.com/BounharAbdelaziz/RLHF.git
  2. Open student.ipynb (student branch) or main.ipynb (solution) from the Files sidebar.
  3. Run the setup cells at the top of the notebook.

🏗️ Project Structure (High Level)

.
├── main.ipynb         # DPO & GRPO training, merging, testing (solution)
├── student.ipynb      # TODO version (student branch)
├── chat_app.py        # Gradio chat app for model comparison
├── utils.py           # Utilities (merging, testing)
├── docs/              # Student & instructor guides
├── presentation.*     # Slides
└── requirements.txt   # Python dependencies
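
chat_app.py is the comparison UI; the core idea is simply to send the same prompt to the base and the tuned model and show both answers side by side. A stripped-down sketch of that idea (not the actual app; generate_with is a placeholder you would replace with real model inference) could look like:

    # Stripped-down sketch of the side-by-side comparison idea (the real chat_app.py is richer).
    import gradio as gr

    def generate_with(model_name: str, prompt: str) -> str:
        # Placeholder: swap in real model inference (e.g. a transformers pipeline) here.
        return f"[{model_name}] reply to: {prompt}"

    def compare(prompt: str) -> tuple[str, str]:
        # The same prompt goes to both models so the answers can be compared directly.
        return generate_with("base-model", prompt), generate_with("tuned-model", prompt)

    demo = gr.Interface(fn=compare, inputs="text", outputs=["text", "text"],
                        title="Base vs. tuned model")

    if __name__ == "__main__":
        demo.launch()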

Instructor Guide

Suggested run-of-show (90 minutes)

  1. Intro and goals.
  2. LLM training in practice: DPO & GRPO concepts and when to use each (slides 2-8).
  3. DPO hands-on. Students complete the DPO TODOs and run training.
  4. GRPO hands-on. Students complete the GRPO TODOs and run training.
  5. (Optional) Compare models in the chat app and discuss results.
  6. Risks and safety. Use slides 9 to 12 for discussion.

Common pitfalls to watch

  • Missing WANDB_API_KEY or Hugging Face login.
  • GPU not available or incompatible PyTorch build.
  • Running out of VRAM when using larger models.

Discussion questions

  • Why does DPO work without an explicit reward model?
  • When would GRPO be worth the extra compute cost?
  • What kinds of reward hacking could appear here?
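
To make the reward-hacking question concrete, here is an illustrative reward (not the notebook's): a length-plus-keyword heuristic that a policy can game by padding its answers with the rewarded token.

    # Illustrative reward that invites hacking (not the reward used in the notebook).
    def naive_reward(completion: str) -> float:
        score = 0.0
        score += 0.5 if "therefore" in completion.lower() else 0.0  # rewards a "reasoning" keyword
        score += min(len(completion) / 1000, 0.5)                   # rewards longer answers
        return score

    # A policy can max this out by repeating "therefore" and padding text,
    # without ever producing a correct answer -- classic reward hacking.
    print(naive_reward("therefore " * 200))  # ~1.0 despite saying nothing useful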

✅ Notes

  • This workshop assumes an NVIDIA GPU. Install PyTorch for your CUDA version from pytorch.org (a quick environment check is sketched below).
  • Use W&B for experiment tracking, or disable it in the notebook configs.
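
A quick way to confirm the GPU/PyTorch setup before running either notebook (a convenience snippet, not part of the repo):

    # Quick GPU / PyTorch sanity check (convenience snippet, not part of the repo).
    import torch

    print("PyTorch:", torch.__version__)
    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("Device:", torch.cuda.get_device_name(0))
        free, total = torch.cuda.mem_get_info()  # bytes
        print(f"VRAM free/total: {free / 1e9:.1f} / {total / 1e9:.1f} GB")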

Happy fine-tuning.
