Kshitiz-Khandel/DPO-Dataset

# Synthetic DPO Data Pipeline

## Overview

This repository implements an end-to-end pipeline for bootstrapping high-quality Direct Preference Optimization (DPO) datasets.

Instead of relying on expensive human annotation, this project uses a "Teacher-Student" approach:

  1. Teacher (Gemini): Generates synthetic topics, subtopics, questions, and paired responses.
  2. Judge (Gemma-2B Reward Model): A fine-tuned reward model scores the pairs to distinguish high-quality answers.
  3. Filter: The pipeline keeps the higher-scoring response as "chosen" and the lower-scoring one as "rejected", producing a DPO-ready dataset.
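The Filter step above can be sketched as follows. This is an illustrative reconstruction, not code from the repository: the `ScoredPair` dataclass, `build_dpo_record` function, and the `margin` threshold are all assumed names, and the score-gap filter is one plausible way to discard pairs the reward model cannot confidently rank.

```python
# Hypothetical sketch of the Filter step: given reward-model scores for a
# response pair, keep the pair only if the score gap clears a margin, then
# emit a DPO-style {prompt, chosen, rejected} record.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ScoredPair:
    prompt: str
    response_a: str
    response_b: str
    score_a: float   # reward-model score for response_a
    score_b: float   # reward-model score for response_b

def build_dpo_record(pair: ScoredPair, margin: float = 0.5) -> Optional[dict]:
    """Return a DPO record, or None if the score gap is below the margin."""
    if abs(pair.score_a - pair.score_b) < margin:
        return None  # scores too close: the preference is not trustworthy
    if pair.score_a > pair.score_b:
        chosen, rejected = pair.response_a, pair.response_b
    else:
        chosen, rejected = pair.response_b, pair.response_a
    return {"prompt": pair.prompt, "chosen": chosen, "rejected": rejected}
```

A pair whose scores differ by less than the margin is dropped entirely, which trades dataset size for label reliability.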

## Architecture

```mermaid
graph LR
    A[Topic Generation] --> B[Question Generation]
    B --> C[Paired Responses A/B]
    C --> D[Reward Model Scoring]
    D --> E{Threshold Filter}
    E -->|Pass| F[Final DPO Dataset]
    E -->|Fail| G[Discard]
```
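The final DPO dataset produced by this flow is typically serialized as JSONL, one `{prompt, chosen, rejected}` object per line, which is the format common DPO trainers (e.g. TRL's `DPOTrainer`) consume. The record content below is illustrative only.

```python
# Minimal sketch of one line of the final DPO dataset (JSONL).
# The example strings are made up; only the field names matter.
import json

record = {
    "prompt": "Explain gradient descent in one paragraph.",
    "chosen": "Gradient descent iteratively updates parameters against the gradient of the loss...",
    "rejected": "Gradient descent is when the model goes down.",
}

# Each record is written as a single JSON line and parses back losslessly.
line = json.dumps(record)
```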