Feature request
Add a custom (batch index) sampler that automatically determines the batch size so that each batch contains a fixed target number of tokens.
Motivation
I'm keen to try out DataCollatorWithFlattening, but I'm unsure how to set the batch size: since no padding is added, the total number of tokens per batch is dynamic.
I'm also uncertain whether fixing the total number of tokens is itself optimal. Does optimal memory allocation require accounting for the amount of attention masking that will be applied to the batch?
Is there any recommendation on how to handle this currently?
(Edit: it seems a near-optimal solution for map-style datasets is provided by https://github.com/imoneoi/multipack_sampler/tree/master, which, as far as I can tell, tries to ensure all batches are as full as possible given some maximum number of tokens. It would be nice to support similar functionality for iterable datasets. Optimal packing isn't possible there, since the sampler can't look ahead over the whole dataset, but adjusting the batch size to adapt to the number of tokens in the incoming examples should be.)
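To make the iterable-dataset idea concrete, here is a minimal sketch of a greedy token-budget batcher. It assumes examples are dicts with an "input_ids" list (as produced by a tokenizer); the function name `token_budget_batches` and the `max_tokens` parameter are hypothetical, not an existing API. Unlike multipack_sampler, it makes a single streaming pass and never reorders examples, so batches can be underfull, but it works without knowing the dataset length.

```python
from typing import Iterable, Iterator, List


def token_budget_batches(
    examples: Iterable[dict], max_tokens: int
) -> Iterator[List[dict]]:
    """Greedily group examples so each batch stays within a token budget.

    Streams over `examples` once, which suits iterable datasets; this is
    not optimal packing, just adaptive batch sizing.
    """
    batch: List[dict] = []
    batch_tokens = 0
    for example in examples:
        n = len(example["input_ids"])
        # Start a new batch if adding this example would exceed the budget.
        if batch and batch_tokens + n > max_tokens:
            yield batch
            batch, batch_tokens = [], 0
        batch.append(example)
        batch_tokens += n
    if batch:
        yield batch
```

A batcher like this could feed DataCollatorWithFlattening directly, since the collator concatenates whatever variable-sized batch it receives. An example longer than `max_tokens` still becomes its own batch here; a real implementation would probably want to truncate or skip such examples.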
Your contribution
I may be able to try implementing something for iterable datasets if this is feasible.