
[feature request] Saving / Loading packed dataset #1149

Open
ScottHoang opened this issue Jul 8, 2024 · 6 comments
Labels: enhancement, help wanted

Comments

@ScottHoang

Hi team,
Can we save and load a packed dataset? I have a use case where I must train multiple models on the same packed dataset with identical sequence lengths. We could speed up the packing process by saving the packed dataset once and loading it on subsequent runs.
What do you think?
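For reference, a minimal sketch of the kind of caching being requested: pack once, write the result to disk keyed by the packing parameters, and reload it on later runs. The `pack_fn` callback and the cache layout here are hypothetical illustrations, not torchtune APIs:

```python
import hashlib
import json
import pickle
from pathlib import Path


def cache_key(dataset_name: str, max_seq_len: int, tokenizer_name: str) -> str:
    """Derive a stable cache filename from the parameters that affect packing."""
    spec = json.dumps(
        {"dataset": dataset_name, "max_seq_len": max_seq_len, "tokenizer": tokenizer_name},
        sort_keys=True,
    )
    return hashlib.sha256(spec.encode()).hexdigest()[:16]


def load_or_pack(samples, max_seq_len, cache_dir, dataset_name, tokenizer_name, pack_fn):
    """Return cached packed samples if present; otherwise pack and cache them."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{cache_key(dataset_name, max_seq_len, tokenizer_name)}.pkl"
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    packed = pack_fn(samples, max_seq_len)  # the expensive step we want to amortize
    with path.open("wb") as f:
        pickle.dump(packed, f)
    return packed
```

With this shape, a second model trained with the same dataset, tokenizer, and sequence length hits the cache and skips packing entirely.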

@RdoubleA
Contributor

RdoubleA commented Jul 8, 2024

Hi @ScottHoang,

This would be an excellent feature to have, and it has been on our minds since we added packed datasets, but we haven't been able to get around to it yet. If you are open to contributing, we would love to see an initial PR on this, and we can help review it and get it into good shape.

@RdoubleA added the help wanted label Jul 8, 2024
@felipemello1 added the enhancement label Jul 8, 2024
@ScottHoang
Author

Sure! I would love to contribute back. Let me see what I can do

@joecummings
Contributor

CC @winglian for his thoughts on caching the packed datasets. I believe axolotl already does something like this?

@pbontrager
Contributor

pbontrager commented Jul 17, 2024

@RdoubleA If our plan is to move to online packing anyway, does this make sense as a direction to go in?

@joecummings
Contributor

> @RdoubleA If our plan is to move to online packing anyway, does this make sense as a direction to go in?

This wouldn't be fully online though, right? And we could still cache an iterable packed dataset?

@pbontrager
Contributor

What’s the benefit? For streamed datasets the packing would be based on the order of random shards, so you wouldn’t reuse the packing. And even if you downloaded the full dataset, how much time is online packing costing you? I doubt it would even impact total training time.
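For context, "online" packing here means greedily concatenating samples into fixed-capacity sequences as they stream in, rather than in an offline preprocessing pass; a minimal sketch (the function name and signature are illustrative, not torchtune APIs):

```python
from typing import Iterable, Iterator, List


def pack_online(token_streams: Iterable[List[int]], max_seq_len: int) -> Iterator[List[int]]:
    """Greedily concatenate incoming token lists into packs of at most max_seq_len.

    A pack is emitted as soon as the next sample would overflow it, so the
    packing cost is amortized over iteration instead of paid up front.
    """
    buffer: List[int] = []
    for tokens in token_streams:
        if buffer and len(buffer) + len(tokens) > max_seq_len:
            yield buffer
            buffer = []
        buffer.extend(tokens[:max_seq_len])  # truncate oversized samples
    if buffer:
        yield buffer
```

Because each pack depends on the arrival order of samples, caching the output of this kind of iterator only pays off when the upstream order is deterministic, which is the crux of the trade-off discussed above.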
