[feature request] Saving / Loading packed dataset #1149
Comments
Hi @ScottHoang, this would be an excellent feature to have and has been on our minds since we added packed datasets, but we haven't gotten around to it yet. If you are open to contributing, we would love to see an initial PR on this, and we can help review it and get it in good shape.
Sure! I would love to contribute back. Let me see what I can do.
CC @winglian for his thoughts on caching the packed datasets. I believe axolotl already does something like this?
@RdoubleA If our plan is to move to online packing anyway, does this make sense as a direction to go in?
This wouldn't be fully online though, right? And we could still cache an iterable packed dataset?
What's the benefit? For streamed datasets the packing would be based on the order of random shards, so you wouldn't reuse the packing. And even if you downloaded the full dataset, how much time is online packing costing you? I doubt it would even impact total training time.
Hi team,
Can we save and load a packed dataset? I have a use case where I must train multiple models on the same packed dataset with identical sequence lengths. We could speed up the packing process by saving the packed dataset once and loading it the next time.
What do you think?
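The idea above can be sketched as a small cache wrapper. This is a hypothetical illustration, not torchtune's actual API: `pack_examples` is a stand-in greedy packer, and `load_or_pack` simply pickles the packed result keyed by a caller-supplied cache path, so a second run with the same data and sequence length skips the packing step entirely.

```python
# Hypothetical sketch: cache a packed dataset to disk so repeated runs
# with the same max sequence length can skip re-packing. The names
# pack_examples / load_or_pack are illustrative, not a real API.
import os
import pickle


def pack_examples(examples, max_seq_len):
    """Greedy first-fit packing: concatenate token lists into packs of
    at most max_seq_len tokens each (stand-in for real packing logic)."""
    packs, current = [], []
    for tokens in examples:
        if current and len(current) + len(tokens) > max_seq_len:
            packs.append(current)
            current = []
        current.extend(tokens)
    if current:
        packs.append(current)
    return packs


def load_or_pack(examples, max_seq_len, cache_path):
    """Return cached packs if cache_path exists; otherwise pack and save."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    packs = pack_examples(examples, max_seq_len)
    with open(cache_path, "wb") as f:
        pickle.dump(packs, f)
    return packs
```

In practice the cache key would also need to include the tokenizer, dataset version, and packing settings, since a stale cache with a different sequence length would silently produce wrong batches.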