fix(datasets): use all DataLoader workers in StreamingLeRobotDataset#3781
Open
kohankhaki wants to merge 1 commit into
Open
fix(datasets): use all DataLoader workers in StreamingLeRobotDataset#3781kohankhaki wants to merge 1 commit into
kohankhaki wants to merge 1 commit into
Conversation
9c93473 to
180a4a2
Compare
Member
|
Hi thanks for the PR, we are in process of refactoring Streaming data loader to be more performant in general, will take your pr into account! |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary / Motivation
StreamingLeRobotDatasetdoes not scale with DataLoader workers: withnum_workers > 1,all data is read by worker 0 while the remaining workers sit idle, and the log is spammed with
Too many dataloader workers: 2 (max is dataset.num_shards=1). Stopping 1 dataloader workers.Root cause: two levels of sharding conflict.
__iter__slices the HF dataset intoself.num_shardssub-datasets (viasafe_shard) to interleave reads for shuffling, so eachsub-dataset only holds
num_files / num_shardsparquet files. When a sub-dataset is iteratedinside a DataLoader worker,
datasets.IterableDatasetadditionally splits its files acrossworkers: a sub-dataset with fewer files than workers can only feed the first worker(s).
Measured on an 8-file dataset with
num_workers=2: worker 0 yielded all 2400 samples,worker 1 yielded 0.
Related issues
What changed
__iter__caps the number of sub-shards so each keeps at leastnum_workersfiles:num_shards = max(1, min(self.num_shards, self.hf_dataset.num_shards // num_workers)).HF's internal per-worker file split then assigns every worker a disjoint, non-empty set of files.
num_workers0/1; no API change.no warnings. Data was never duplicated or lost before this change. This is purely a
throughput/parallelism fix.
How was this tested (or how to run locally)
test_dataloader_workers_complete_and_balanced(
uv run pytest tests/datasets/test_streaming.py -k workers): multi-file local dataset,asserts every sample is yielded exactly once through a 2-worker DataLoader.
over 8- and 12-file datasets.
Checklist (required before merge)
pre-commit run -a)pytest)Reviewer notes
(
get_hf_dataset_size_in_mbfloors to MB), so the wide motor features in the test toforce a multi-file dataset (single-file datasets make any worker test pass trivially).