Automatic dynamic batch size selection for DataCollatorWithFlattening #33945

Open
alex-hh opened this issue Oct 4, 2024 · 0 comments
Labels
Feature request (Request for a new feature), Usage (General questions about the library)


alex-hh commented Oct 4, 2024

Feature request

Add a custom batch sampler (one that yields batch indices) that automatically chooses the batch size so that each batch stays within a fixed target number of tokens.
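To make the request concrete, here is a minimal sketch of what such a sampler could look like for a map-style dataset. It is not an existing `transformers` or PyTorch class: `TokenBudgetBatchSampler`, `lengths`, and `max_tokens` are hypothetical names, and the sketch assumes the tokenized length of every example is known up front. It greedily fills batches in dataset order rather than packing optimally.

```python
from typing import Iterator, List, Sequence


class TokenBudgetBatchSampler:
    """Hypothetical sampler: yields lists of dataset indices whose
    combined token count stays within `max_tokens`."""

    def __init__(self, lengths: Sequence[int], max_tokens: int):
        self.lengths = lengths        # assumed: token count per example
        self.max_tokens = max_tokens  # assumed: target token budget per batch

    def __iter__(self) -> Iterator[List[int]]:
        batch: List[int] = []
        batch_tokens = 0
        for idx, length in enumerate(self.lengths):
            # Close the current batch if this example would overflow the budget.
            if batch and batch_tokens + length > self.max_tokens:
                yield batch
                batch, batch_tokens = [], 0
            batch.append(idx)
            batch_tokens += length
        # Emit the final, possibly partially filled, batch.
        if batch:
            yield batch


lengths = [5, 3, 7, 2, 9, 1]
batches = list(TokenBudgetBatchSampler(lengths, max_tokens=10))
print(batches)
```

Since `torch.utils.data.DataLoader` accepts any iterable of index lists as `batch_sampler`, an object like this could presumably be passed there alongside `DataCollatorWithFlattening` as the `collate_fn`. Note that an example longer than `max_tokens` ends up alone in an over-budget batch in this sketch; truncating or skipping it would be a separate design choice.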

Motivation

I'm keen to try out DataCollatorWithFlattening, but I'm unsure how to set the batch size: since no padding is added, the total number of tokens per batch is dynamic.

I'm also uncertain whether fixing the total number of tokens is itself optimal. Does optimal memory allocation require accounting for the amount of attention masking that will be applied to the batch?

Is there any recommendation on how to handle this currently?

(Edit: it seems a near-optimal solution for map-style datasets is provided by https://github.com/imoneoi/multipack_sampler/tree/master, which presumably just tries to ensure that every batch is as full as possible given some maximum number of tokens. It would be nice to support similar functionality for iterable datasets: not optimal packing, but adjusting the batch size to the number of tokens in the examples should be possible.)
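For the iterable case, a rough sketch of the idea might look like the following. It is an illustration only, not part of `transformers` or `multipack_sampler`: the function name `batch_by_token_budget` and the `max_tokens` parameter are hypothetical, and it assumes each streamed example is a dict with an `input_ids` list. Because the stream cannot be reordered, it simply closes a batch whenever the next example would exceed the budget.

```python
from typing import Dict, Iterable, Iterator, List

Example = Dict[str, List[int]]


def batch_by_token_budget(
    examples: Iterable[Example], max_tokens: int
) -> Iterator[List[Example]]:
    """Hypothetical helper: group streamed examples into variable-size
    batches holding at most `max_tokens` tokens each."""
    batch: List[Example] = []
    batch_tokens = 0
    for example in examples:
        n = len(example["input_ids"])
        # Start a new batch if adding this example would exceed the budget.
        if batch and batch_tokens + n > max_tokens:
            yield batch
            batch, batch_tokens = [], 0
        batch.append(example)
        batch_tokens += n
    # Emit whatever is left once the stream is exhausted.
    if batch:
        yield batch


def stream() -> Iterator[Example]:
    # Stand-in for an iterable dataset of tokenized examples.
    for n in [4, 4, 6, 3, 8]:
        yield {"input_ids": list(range(n))}


sizes = [
    [len(ex["input_ids"]) for ex in batch]
    for batch in batch_by_token_budget(stream(), max_tokens=8)
]
print(sizes)
```

Each yielded list of examples could then be handed to a collator such as DataCollatorWithFlattening. Batches are less full than with a packing sampler because examples are consumed strictly in arrival order; buffering a window of examples before grouping would be one way to trade latency for better fill.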

Your contribution

I may be able to implement something for iterable datasets, if this is possible.

@alex-hh alex-hh added the Feature request Request for a new feature label Oct 4, 2024
@ArthurZucker ArthurZucker added the Usage General questions about the library label Oct 5, 2024