Added option for loading pre-saved bootstrapped training data for fine-tuning #8262

Open · wants to merge 1 commit into main

Conversation

asad-aali

For the bootstrap fine-tuning feature, I observed that DSPy re-runs bootstrapping on every run, which is time-consuming and expensive. Caching did not solve the problem because my dataset is modified at random in every run.

This PR adds an optional bootstrapped_data_path argument to dspy.BootstrapFinetune(). It can point to a .jsonl file saved from a previous (or another) fine-tuning run. When bootstrapped_data_path is provided, BootstrapFinetune (a FinetuneTeleprompter subclass) loads that data and skips the bootstrap_trace_data step, avoiding repeated effort.
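A minimal usage sketch of the proposed option. This is a hedged illustration, not the final API: the LM, program, metric, trainset, and file path below are placeholders I made up; only the bootstrapped_data_path argument comes from this PR.

```python
import dspy

# Assumes an LM is configured; the model name is illustrative.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# Toy program, metric, and trainset as stand-ins for a real pipeline.
program = dspy.ChainOfThought("question -> answer")
trainset = [dspy.Example(question="2 + 2?", answer="4").with_inputs("question")]

def exact_match(example, pred, trace=None):
    return example.answer == pred.answer

# Reuse traces saved from an earlier run so the bootstrap_trace_data
# step is skipped (the .jsonl path here is hypothetical).
optimizer = dspy.BootstrapFinetune(
    metric=exact_match,
    bootstrapped_data_path="bootstrapped_traces.jsonl",
)
finetuned = optimizer.compile(program, trainset=trainset)
```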

…apped training data (.jsonl) from a local path, for fine-tuning
@asad-aali asad-aali changed the title Added option for loading pre-saved bootstrapped training data (.jsonl) for fine-tuning Added option for loading pre-saved bootstrapped training data for fine-tuning May 22, 2025
okhat (Collaborator) commented May 22, 2025

Thank you @asad-aali! This complicates the logic a bit, in light of the new optimizers that do fine-tuning in DSPy...

Quick question: can you handle this by making the randomization on your end deterministic, i.e., randomize via a hash of the input so the randomness is fixed per example on every run?
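A minimal sketch of that idea, assuming the per-example seed is derived from the input text (the function name and the SHA-256 choice are illustrative, not from the thread):

```python
import hashlib
import random

def per_example_rng(example_text: str) -> random.Random:
    """Return an RNG seeded by a hash of the input, so the 'random'
    modification applied to an example is identical across runs."""
    seed = int.from_bytes(hashlib.sha256(example_text.encode("utf-8")).digest()[:8], "big")
    return random.Random(seed)

# The same input always yields the same perturbation, so cache keys stay stable.
rng = per_example_rng("What is the capital of France?")
print(rng.random())  # same value on every run for this input
```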

asad-aali (Author)

Thanks for the feedback @okhat! You're right that with fully deterministic inputs, DSPy's caching can prevent redundant bootstrapping. Still, the option to reuse the same saved data can support reproducibility across machines and pipelines (for more apples-to-apples analyses), especially when the teacher traces are expensive to regenerate (e.g., GPT-4).

Happy to defer the decision to you!
