use variable size chunks instead of always using max_seq_len #17

Open
wants to merge 3 commits into
base: main

@sreeprasannar sreeprasannar commented Jan 31, 2025

Additions to PretokDataset, related to #15:

Keyword arguments:

  • chunk_ratios: list[float] -- if set, batching uses variable chunk sizes. Each ratio is multiplied by max_seq_len to get a chunk size, so every ratio must be at most 1.0. Chunks shorter than max_seq_len leave room to pack more sequences into a single batch, and the batch size grows in inverse proportion to the chunk size. For example, if the chunk size is half of max_seq_len, a batch holds twice as many sequences.

  • batch_size_for_max_seq_len: int -- the batch size used for chunks of size max_seq_len. If chunk_ratios is specified, the batch size may be adjusted upward for smaller chunk sizes.

  • max_shards -- useful for quick debugging. The dataset stops yielding once max_shards shards have been reached.
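
To make the arithmetic concrete, here is a minimal sketch of how the three arguments could interact. It is not the code from this PR: the helper names plan_chunks and limit_shards are hypothetical, and the rounding (floor via int and //) is an assumption, since the description does not say how fractional sizes are handled. The idea is to keep the number of tokens per batch roughly constant.

```python
from itertools import islice


def plan_chunks(max_seq_len, batch_size_for_max_seq_len, chunk_ratios):
    """Return (chunk_size, batch_size) pairs, one per ratio.

    Hypothetical helper, not part of PretokDataset. Assumes floor rounding.
    """
    # Token budget of one full-length batch; smaller chunks share the same budget.
    tokens_per_batch = batch_size_for_max_seq_len * max_seq_len
    plan = []
    for ratio in chunk_ratios:
        if not 0.0 < ratio <= 1.0:
            raise ValueError(f"chunk ratio must be in (0, 1], got {ratio}")
        chunk_size = max(1, int(max_seq_len * ratio))
        # Batch size is inversely proportional to chunk size.
        batch_size = tokens_per_batch // chunk_size
        plan.append((chunk_size, batch_size))
    return plan


def limit_shards(shards, max_shards=None):
    """Yield at most max_shards shards; None means no limit (assumed default)."""
    return islice(shards, max_shards)


plan = plan_chunks(max_seq_len=1024, batch_size_for_max_seq_len=8,
                   chunk_ratios=[1.0, 0.5, 0.25])
for chunk_size, batch_size in plan:
    print(chunk_size, batch_size)
```

With max_seq_len=1024 and batch_size_for_max_seq_len=8, halving the chunk size to 512 doubles the batch size to 16, matching the "half the length, twice the sequences" rule described above.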
