This repository implements an end-to-end pipeline for bootstrapping high-quality Direct Preference Optimization (DPO) datasets.
Instead of relying on expensive human annotation, the pipeline uses a "Teacher-Student" approach:
- Teacher (Gemini): Generates synthetic topics, subtopics, questions, and paired responses.
- Judge (Gemma-2B Reward Model): A fine-tuned reward model scores the pairs to distinguish high-quality answers.
- Filter: The pipeline selects the higher-scored response as "chosen" and the lower-scored one as "rejected" to form a DPO-ready dataset.
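The filter step above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the field names (`prompt`, `response_a`, `response_b`, `score_a`, `score_b`) and the `to_dpo_record` helper are assumptions about the schema.

```python
# Hypothetical sketch: converting a reward-scored response pair into a DPO record.
# All field names here are illustrative assumptions, not the repo's real schema.

def to_dpo_record(example: dict) -> dict:
    """Pick the higher-scored response as 'chosen', the lower as 'rejected'."""
    if example["score_a"] >= example["score_b"]:
        chosen, rejected = example["response_a"], example["response_b"]
    else:
        chosen, rejected = example["response_b"], example["response_a"]
    return {"prompt": example["prompt"], "chosen": chosen, "rejected": rejected}

example = {
    "prompt": "Explain gradient descent.",
    "response_a": "Gradient descent iteratively updates parameters along the negative gradient.",
    "response_b": "It is a thing.",
    "score_a": 0.91,
    "score_b": 0.23,
}
record = to_dpo_record(example)
print(record["chosen"][:20], "|", record["rejected"])
```

The `{"prompt", "chosen", "rejected"}` triple is the standard input format expected by common DPO trainers (e.g. TRL's `DPOTrainer`), which is presumably why the pipeline emits it.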
```mermaid
graph LR
    A[Topic Generation] --> B[Question Generation]
    B --> C[Paired Responses A/B]
    C --> D[Reward Model Scoring]
    D --> E{Threshold Filter}
    E -->|Pass| F[Final DPO Dataset]
    E -->|Fail| G[Discard]
```
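The threshold filter (node E in the diagram) can be sketched as a margin check on the reward scores. The margin value and field names below are illustrative assumptions, not the pipeline's actual configuration.

```python
# Hypothetical sketch of the threshold filter: keep a pair only when the
# reward gap between chosen and rejected is a clear preference signal.
MIN_MARGIN = 0.5  # assumed value; the real threshold is pipeline-specific

def passes_filter(score_chosen: float, score_rejected: float,
                  min_margin: float = MIN_MARGIN) -> bool:
    """Return True when the reward margin is large enough to keep the pair."""
    return (score_chosen - score_rejected) >= min_margin

scored_pairs = [
    {"prompt": "q1", "score_chosen": 0.9, "score_rejected": 0.2},  # margin 0.7 -> pass
    {"prompt": "q2", "score_chosen": 0.6, "score_rejected": 0.4},  # margin 0.2 -> discard
]
kept = [p for p in scored_pairs
        if passes_filter(p["score_chosen"], p["score_rejected"])]
print([p["prompt"] for p in kept])  # → ['q1']
```

Filtering on the score *margin* rather than an absolute score is a common choice here: a small gap means the reward model barely distinguishes the two responses, so the pair carries a weak (and possibly noisy) preference signal.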