
# Synthetic DPO Data Pipeline

## Overview

This repository implements an end-to-end pipeline for bootstrapping high-quality Direct Preference Optimization (DPO) datasets.

Instead of relying on expensive human annotation, this project uses a "Teacher-Student" approach:

  1. Teacher (Gemini): Generates synthetic topics, subtopics, questions, and paired responses.
  2. Judge (Gemma-2B Reward Model): A fine-tuned reward model scores the pairs to distinguish high-quality answers.
  3. Filter: The pipeline selects the best response as "chosen" and the worst as "rejected" to form a DPO-ready dataset.
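As a minimal sketch of step 3, the pairing logic can be expressed as follows. The function name and the `prompt`/`chosen`/`rejected` field names are illustrative assumptions (the latter follow the common DPO dataset convention); they are not taken from this repository's code.

```python
# Hypothetical sketch: turn two reward-model scores into a DPO-ready record.
# Field names follow the usual DPO convention: prompt / chosen / rejected.

def build_dpo_record(question, response_a, response_b, score_a, score_b):
    """Label the higher-scoring response 'chosen' and the other 'rejected'."""
    if score_a >= score_b:
        chosen, rejected = response_a, response_b
    else:
        chosen, rejected = response_b, response_a
    return {"prompt": question, "chosen": chosen, "rejected": rejected}
```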

## Architecture

```mermaid
graph LR
    A[Topic Generation] --> B[Question Generation]
    B --> C[Paired Responses A/B]
    C --> D[Reward Model Scoring]
    D --> E{Threshold Filter}
    E -->|Pass| F[Final DPO Dataset]
    E -->|Fail| G[Discard]
```
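The threshold-filter node in the diagram can be sketched as a score-margin check. The margin value and function names below are illustrative assumptions, not values from this repository.

```python
# Hypothetical sketch of the Threshold Filter node: keep a pair only when
# the reward-score gap between chosen and rejected is large enough to
# trust the preference label. The 0.5 margin is an assumed example value.

MARGIN_THRESHOLD = 0.5

def passes_filter(score_chosen, score_rejected, threshold=MARGIN_THRESHOLD):
    """Pass pairs whose score gap clearly separates the two responses."""
    return (score_chosen - score_rejected) >= threshold

def filter_pairs(scored_pairs, threshold=MARGIN_THRESHOLD):
    """scored_pairs: iterable of (record, score_chosen, score_rejected)."""
    return [rec for rec, sc, sr in scored_pairs
            if passes_filter(sc, sr, threshold)]
```

Pairs that fail the margin check are discarded rather than kept with a weak label, which trades dataset size for cleaner preference signal.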