What happened
Training was randomly crashing with `IndexError: index -1 is out of bounds for dimension 1 with size 0` at trainer.py:798. Tracked it down - the dataset was returning empty sequences, literally `torch.Size([1, 0])` with zero tokens.
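A minimal repro of the failure mode (the `[:, -1]` access is my guess at what trainer.py does at that line):

```python
import torch

batch = torch.empty(1, 0, dtype=torch.long)  # what the dataset was handing back
print(batch.shape)  # torch.Size([1, 0])
batch[:, -1]  # IndexError: index -1 is out of bounds for dimension 1 with size 0
```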
The problem

The math that broke: the consolidation step produced one more sample boundary than the token count supports, so the tail of the sample-index array pointed past the end of the data.
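With toy numbers (names and values are illustrative; the real shard was far larger):

```python
seqlen = 4096
num_tokens = 10_000  # toy stand-in for the real ~107B-token shard

actual = num_tokens // seqlen        # 2 -- full samples the data supports
claimed = -(-num_tokens // seqlen)   # 3 -- ceil also counts the partial tail

# Sample IDs run 0..claimed-1, so ID 2 is a phantom: the data behind it
# is incomplete, and anything indexed past it is simply absent.
```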
When the sampler picked one of those phantom indices, the runtime sliced the token buffer from an offset at or past the last token. But tokens only go up to 107,374,182,400, so numpy returns an empty array (`[]`).
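numpy doesn't raise on out-of-range slices, which is why this surfaced as an empty batch downstream instead of a loud failure at load time:

```python
import numpy as np

tokens = np.arange(10_000, dtype=np.uint32)  # toy token buffer

chunk = tokens[12_288:16_384]  # phantom sample starting past the last token
print(chunk, chunk.shape)      # [] (0,) -- silently empty, no IndexError here
```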
Why it happened
Bug in preprocessing (02_consolidate_shards.py): the boundary computation created one extra boundary. If the shard didn't divide evenly by seqlen, we'd get phantom sample IDs for the incomplete chunk at the end.
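A hypothetical reconstruction of that computation - the actual code in 02_consolidate_shards.py may have looked different, but the off-by-one has this shape:

```python
import numpy as np

num_tokens, seqlen = 10_000, 4096  # toy shard that doesn't divide evenly

# Over-padding the endpoint adds a boundary for the incomplete trailing chunk
boundaries = np.arange(0, num_tokens + seqlen, seqlen)
print(boundaries)            # [    0  4096  8192 12288]
print(len(boundaries) - 1)   # 3 sample IDs...
print(num_tokens // seqlen)  # ...but only 2 full samples actually exist
```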
Runtime trusted the wrong thing: the dataset sized itself from whatever sample count preprocessing wrote out. Should've been checking against the actual token count.
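Roughly (names hypothetical):

```python
import numpy as np

seqlen = 4096
tokens = np.arange(10_000, dtype=np.uint32)  # toy token buffer
meta = {"num_samples": 3}                    # over-counted by preprocessing

num_samples = meta["num_samples"]      # buggy: trusts the metadata blindly
num_samples = tokens.size // seqlen    # safe: derived from the data itself (2)
```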
The fix
Runtime (sharded_dataset.py): now logs a warning if there's a mismatch and uses the safe count.
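A sketch of what that guard can look like - function and variable names are mine, not the actual sharded_dataset.py code:

```python
import logging

import numpy as np

logger = logging.getLogger(__name__)

def usable_samples(tokens: np.ndarray, seqlen: int, claimed: int) -> int:
    """Clamp the claimed sample count to what the tokens actually support."""
    safe = tokens.size // seqlen
    if claimed != safe:
        logger.warning(
            "sample-count mismatch: metadata says %d, data supports %d; using %d",
            claimed, safe, min(claimed, safe),
        )
    return min(claimed, safe)
```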
Preprocessing (02_consolidate_shards.py): boundaries are now derived from the number of full seqlen-sized chunks, so the incomplete tail no longer gets a sample ID. Also fixed the dtype loading - it now properly uses np.load for .npy files instead of reinterpreting uint16 as uint32.
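A minimal version of the corrected boundary math (again with hypothetical names):

```python
import numpy as np

def sample_boundaries(num_tokens: int, seqlen: int) -> np.ndarray:
    """Fenceposts for full seqlen-sized samples only; the partial tail is dropped."""
    num_samples = num_tokens // seqlen          # floor, not ceil
    return np.arange(num_samples + 1) * seqlen  # num_samples + 1 fenceposts

print(sample_boundaries(10_000, 4096))  # [   0 4096 8192] -> exactly 2 samples
```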
Impact
Notes
The preprocessing script also had a dtype issue: it was loading files as uint16 and then viewing them as uint32. Since step 01 saves tokens as uint32 in .npy format, we should just load them with np.load, which respects the embedded dtype.
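A toy demonstration of why the raw-byte path corrupts values while np.load does not (file name illustrative):

```python
import numpy as np

np.save("tokens.npy", np.arange(4, dtype=np.uint32))  # what step 01 writes

# Wrong: np.fromfile ignores the .npy header, so header bytes get
# interpreted as token data, and the uint16 -> uint32 view regroups bytes
bad = np.fromfile("tokens.npy", dtype=np.uint16).view(np.uint32)

# Right: np.load parses the header and returns the embedded dtype
good = np.load("tokens.npy")
print(good.dtype, good)  # uint32 [0 1 2 3]
```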