use variable size chunks instead of always using max_seq_len
#17
Additions to the `PretokDataset` in relation to #15:

Keyword arguments:

- `chunk_ratios: list[float]` -- if set, batching is done with variable chunk sizes. Each chunk size is calculated by multiplying one of these ratios by `max_seq_len`, so the largest allowed ratio is `1.0`. When `chunk_ratios` is specified, some chunks will be shorter than `max_seq_len`, which means more sequences can be packed into a single batch. How many more scales inversely with the chunk size: if the chunk size is half of `max_seq_len`, the batch can hold twice as many sequences.
- `batch_size_for_max_seq_len: int` -- the batch size for chunks of size `max_seq_len`. If `chunk_ratios` is specified, the batch size may be adjusted based on the chunk size (smaller chunks are packed into larger batches).
- `max_shards` -- useful for quick debugging. Stop yielding once `max_shards` shards have been reached.
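
The relationship between chunk ratio, chunk size, and batch size can be sketched as below. This is only an illustration of the arithmetic described above, not the code in this PR: the helper name `chunk_and_batch_size` is hypothetical, and holding the number of tokens per batch roughly constant is an assumption about how the adjustment is done.

```python
def chunk_and_batch_size(
    max_seq_len: int,
    batch_size_for_max_seq_len: int,
    ratio: float,
) -> tuple[int, int]:
    """Return (chunk_size, batch_size) for one entry of chunk_ratios.

    Hypothetical helper: assumes the token budget per batch
    (batch_size * chunk_size) should not exceed the budget used
    at full length (batch_size_for_max_seq_len * max_seq_len).
    """
    if not 0.0 < ratio <= 1.0:
        raise ValueError(f"ratio must be in (0, 1], got {ratio}")

    # Chunk size is the ratio applied to the maximum sequence length.
    chunk_size = max(1, int(max_seq_len * ratio))

    # Smaller chunks leave room for proportionally more sequences.
    token_budget = batch_size_for_max_seq_len * max_seq_len
    batch_size = token_budget // chunk_size
    return chunk_size, batch_size


# Halving the chunk size doubles the batch size.
full = chunk_and_batch_size(512, 8, 1.0)   # (512, 8)
half = chunk_and_batch_size(512, 8, 0.5)   # (256, 16)
```

Integer division keeps the batch within the original token budget when the ratio does not divide `max_seq_len` evenly.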