Feature request
Add a custom (batch index) sampler that automatically determines the batch size so that each batch contains a fixed target number of tokens.
Motivation
I'm keen to try out DataCollatorWithFlattening, but I'm unsure how to set the batch size: since no padding is added, the total number of tokens per batch is dynamic.
I'm also uncertain whether fixing the total number of tokens is itself optimal. Does optimal memory allocation require accounting for the amount of attention masking that will be applied to the batch?
Is there any recommendation on how to handle this currently?
(Edit: it seems a near-optimal solution for map-style datasets is provided by https://github.com/imoneoi/multipack_sampler/tree/master, which, as far as I can tell, tries to ensure all batches are as full as possible given some maximum number of tokens. It would be nice to support similar functionality for iterable datasets. Optimal packing isn't possible there, since the sampler can't look ahead over the whole dataset, but adjusting the batch size to adapt to the number of tokens in the incoming examples should be.)
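To make the iterable-dataset idea concrete, here is a minimal sketch of a greedy token-budget batcher. It assumes examples are dicts with an "input_ids" list (as produced by a tokenizer); the function name `token_budget_batches` and the `max_tokens` parameter are hypothetical, not an existing API. Unlike multipack_sampler, it makes a single streaming pass and never reorders examples, so batches can be underfull, but it works without knowing the dataset length.

```python
from typing import Iterable, Iterator, List


def token_budget_batches(
    examples: Iterable[dict], max_tokens: int
) -> Iterator[List[dict]]:
    """Greedily group examples so each batch stays within a token budget.

    Streams over `examples` once, which suits iterable datasets; this is
    not optimal packing, just adaptive batch sizing.
    """
    batch: List[dict] = []
    batch_tokens = 0
    for example in examples:
        n = len(example["input_ids"])
        # Start a new batch if adding this example would exceed the budget.
        if batch and batch_tokens + n > max_tokens:
            yield batch
            batch, batch_tokens = [], 0
        batch.append(example)
        batch_tokens += n
    if batch:
        yield batch
```

A batcher like this could feed DataCollatorWithFlattening directly, since the collator concatenates whatever variable-sized batch it receives. An example longer than `max_tokens` still becomes its own batch here; a real implementation would probably want to truncate or skip such examples.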
Your contribution
I may be able to try implementing something for iterable datasets if this is feasible.