This repository implements an end-to-end pipeline for bootstrapping high-quality Direct Preference Optimization (DPO) datasets.
Instead of relying on expensive human annotation, the pipeline uses a "Teacher-Student" approach:
- Teacher (Gemini): Generates synthetic topics, subtopics, questions, and paired responses.
- Judge (Gemma-2B Reward Model): A fine-tuned reward model scores the pairs to distinguish high-quality answers.
- Filter: The pipeline selects the higher-scored response as "chosen" and the lower-scored one as "rejected" to form a DPO-ready dataset.
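The filter step above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the field names (`prompt`, `response_a`, `response_b`, `score_a`, `score_b`) and the `to_dpo_record` helper are assumptions about the schema.

```python
# Hypothetical sketch: converting a reward-scored response pair into a DPO record.
# All field names here are illustrative assumptions, not the repo's real schema.

def to_dpo_record(example: dict) -> dict:
    """Pick the higher-scored response as 'chosen', the lower as 'rejected'."""
    if example["score_a"] >= example["score_b"]:
        chosen, rejected = example["response_a"], example["response_b"]
    else:
        chosen, rejected = example["response_b"], example["response_a"]
    return {"prompt": example["prompt"], "chosen": chosen, "rejected": rejected}

example = {
    "prompt": "Explain gradient descent.",
    "response_a": "Gradient descent iteratively updates parameters along the negative gradient.",
    "response_b": "It is a thing.",
    "score_a": 0.91,
    "score_b": 0.23,
}
record = to_dpo_record(example)
print(record["chosen"][:20], "|", record["rejected"])
```

The `{"prompt", "chosen", "rejected"}` triple is the standard input format expected by common DPO trainers (e.g. TRL's `DPOTrainer`), which is presumably why the pipeline emits it.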
```mermaid
graph LR
    A[Topic Generation] --> B[Question Generation]
    B --> C[Paired Responses A/B]
    C --> D[Reward Model Scoring]
    D --> E{Threshold Filter}
    E -->|Pass| F[Final DPO Dataset]
    E -->|Fail| G[Discard]
```
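The threshold filter (node E in the diagram) can be sketched as a margin check on the reward scores. The margin value and field names below are illustrative assumptions, not the pipeline's actual configuration.

```python
# Hypothetical sketch of the threshold filter: keep a pair only when the
# reward gap between chosen and rejected is a clear preference signal.
MIN_MARGIN = 0.5  # assumed value; the real threshold is pipeline-specific

def passes_filter(score_chosen: float, score_rejected: float,
                  min_margin: float = MIN_MARGIN) -> bool:
    """Return True when the reward margin is large enough to keep the pair."""
    return (score_chosen - score_rejected) >= min_margin

scored_pairs = [
    {"prompt": "q1", "score_chosen": 0.9, "score_rejected": 0.2},  # margin 0.7 -> pass
    {"prompt": "q2", "score_chosen": 0.6, "score_rejected": 0.4},  # margin 0.2 -> discard
]
kept = [p for p in scored_pairs
        if passes_filter(p["score_chosen"], p["score_rejected"])]
print([p["prompt"] for p in kept])  # → ['q1']
```

Filtering on the score *margin* rather than an absolute score is a common choice here: a small gap means the reward model barely distinguishes the two responses, so the pair carries a weak (and possibly noisy) preference signal.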