
[feature request] Saving / Loading packed dataset #1149

Open
ScottHoang opened this issue Jul 8, 2024 · 6 comments
Labels: enhancement, help wanted

Comments

@ScottHoang

Hi team,
Can we save and load a packed dataset? I have a use case where I must train multiple models on the same packed dataset with identical sequence lengths. We could speed up the packing process by saving the packed dataset once and loading it on subsequent runs.
What do you think?
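For reference, a minimal sketch of the kind of caching being requested: pack once, write the result to disk keyed by the packing parameters, and reload it on later runs. The `pack_fn` callback and the cache layout here are hypothetical illustrations, not torchtune APIs:

```python
import hashlib
import json
import pickle
from pathlib import Path


def cache_key(dataset_name: str, max_seq_len: int, tokenizer_name: str) -> str:
    """Derive a stable cache filename from the parameters that affect packing."""
    spec = json.dumps(
        {"dataset": dataset_name, "max_seq_len": max_seq_len, "tokenizer": tokenizer_name},
        sort_keys=True,
    )
    return hashlib.sha256(spec.encode()).hexdigest()[:16]


def load_or_pack(samples, max_seq_len, cache_dir, dataset_name, tokenizer_name, pack_fn):
    """Return cached packed samples if present; otherwise pack and cache them."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{cache_key(dataset_name, max_seq_len, tokenizer_name)}.pkl"
    if path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    packed = pack_fn(samples, max_seq_len)  # the expensive step we want to amortize
    with path.open("wb") as f:
        pickle.dump(packed, f)
    return packed
```

With this shape, a second model trained with the same dataset, tokenizer, and sequence length hits the cache and skips packing entirely.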

@RdoubleA
Contributor

RdoubleA commented Jul 8, 2024

Hi @ScottHoang,

This would be an excellent feature to have, and it has been on our minds since we added packed datasets, but we haven't been able to get around to it yet. If you are open to contributing, we would love to see an initial PR on this, and we can help review it and get it into good shape.

@RdoubleA added the help wanted label Jul 8, 2024
@felipemello1 added the enhancement label Jul 8, 2024
@ScottHoang
Author

Sure! I would love to contribute back. Let me see what I can do

@joecummings
Contributor

CC @winglian for his thoughts on caching the packed datasets. I believe axolotl already does something like this?

@pbontrager
Contributor

pbontrager commented Jul 17, 2024

@RdoubleA If our plan is to move to online packing anyway, does this make sense as a direction to go in?

@joecummings
Contributor

> @RdoubleA If our plan is to move to online packing anyway, does this make sense as a direction to go in?

This wouldn't be fully online though, right? And we could still cache an iterable packed dataset?

@pbontrager
Contributor

What’s the benefit? For streamed datasets the packing would be based on the order of random shards, so you wouldn’t reuse the packing. And even if you downloaded the full dataset, how much time is online packing costing you? I doubt it would even impact total training time.
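For context, "online" packing here means greedily concatenating samples into fixed-capacity sequences as they stream in, rather than in an offline preprocessing pass; a minimal sketch (the function name and signature are illustrative, not torchtune APIs):

```python
from typing import Iterable, Iterator, List


def pack_online(token_streams: Iterable[List[int]], max_seq_len: int) -> Iterator[List[int]]:
    """Greedily concatenate incoming token lists into packs of at most max_seq_len.

    A pack is emitted as soon as the next sample would overflow it, so the
    packing cost is amortized over iteration instead of paid up front.
    """
    buffer: List[int] = []
    for tokens in token_streams:
        if buffer and len(buffer) + len(tokens) > max_seq_len:
            yield buffer
            buffer = []
        buffer.extend(tokens[:max_seq_len])  # truncate oversized samples
    if buffer:
        yield buffer
```

Because each pack depends on the arrival order of samples, caching the output of this kind of iterator only pays off when the upstream order is deterministic, which is the crux of the trade-off discussed above.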
