fix(datasets): use all DataLoader workers in StreamingLeRobotDataset #3781

Open
kohankhaki wants to merge 1 commit into huggingface:main from kohankhaki:fix-streaming-dataloader-workers

@kohankhaki kohankhaki commented Jun 12, 2026


Summary / Motivation

StreamingLeRobotDataset does not scale with DataLoader workers: with num_workers > 1,
all data is read by worker 0 while the remaining workers sit idle, and the log is spammed with
Too many dataloader workers: 2 (max is dataset.num_shards=1). Stopping 1 dataloader workers.

Root cause: two levels of sharding conflict. __iter__ slices the HF dataset into
self.num_shards sub-datasets (via safe_shard) to interleave reads for shuffling, so each
sub-dataset only holds num_files / num_shards parquet files. When a sub-dataset is iterated
inside a DataLoader worker, datasets.IterableDataset additionally splits its files across
workers: a sub-dataset with fewer files than workers can only feed the first worker(s).
Measured on an 8-file dataset with num_workers=2: worker 0 yielded all 2400 samples,
worker 1 yielded 0.
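
The conflict can be reproduced with a small file-assignment model that needs no dataset at all. This is a sketch under stated assumptions, not the actual `datasets` or LeRobot code: `split_files` and `files_per_worker` are hypothetical helpers, sub-datasets are assumed to take contiguous chunks of files, each worker is assumed to take every `num_workers`-th file of a sub-dataset, and the sub-shard count of 8 is picked only to mirror the measured 8-file case.

```python
def split_files(files, num_parts):
    """Split `files` into `num_parts` contiguous chunks (simplified model of sub-sharding)."""
    base, extra = divmod(len(files), num_parts)
    chunks, start = [], 0
    for i in range(num_parts):
        size = base + (1 if i < extra else 0)
        chunks.append(files[start:start + size])
        start += size
    return chunks


def files_per_worker(files, num_sub_shards, num_workers):
    """Model the second sharding level: each sub-dataset splits its own files across workers."""
    per_worker = {w: [] for w in range(num_workers)}
    for sub_dataset in split_files(files, num_sub_shards):
        for w in range(num_workers):
            # Assumed round-robin split: worker w takes files w, w + num_workers, ...
            per_worker[w].extend(sub_dataset[w::num_workers])
    return per_worker


files = list(range(8))  # stand-ins for 8 parquet files

# 8 sub-datasets of 1 file each: only worker 0 ever receives a file.
print(files_per_worker(files, num_sub_shards=8, num_workers=2))

# 4 sub-datasets of 2 files each: both workers receive files.
print(files_per_worker(files, num_sub_shards=4, num_workers=2))
```

In this model, one file per sub-dataset leaves worker 1 with nothing to read, which matches the measured 2400/0 split.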

Related issues

  • None filed; happy to open one if preferred.

What changed

  • __iter__ caps the number of sub-shards so each keeps at least num_workers files:
    num_shards = max(1, min(self.num_shards, self.hf_dataset.num_shards // num_workers)).
    HF's internal per-worker file split then assigns every worker a disjoint, non-empty set of files.
  • No behavior change for num_workers 0/1; no API change.
  • After the fix (same setup): 1227/1173 samples per worker, every sample yielded exactly once,
    no warnings. Data was never duplicated or lost before this change. This is purely a
    throughput/parallelism fix.
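
The cap can be checked in isolation with plain integers. The sketch below is illustrative only: `capped_num_shards` is a hypothetical standalone function, the requested shard count of 8 is an assumed value rather than a documented default, and treating `num_workers=0` as a single worker is an assumption about how the main-process case is handled.

```python
def capped_num_shards(requested_shards: int, num_files: int, num_workers: int) -> int:
    """Cap the sub-shard count so each sub-shard keeps at least `num_workers` files."""
    workers = max(1, num_workers)  # assumption: no worker processes behaves like one worker
    return max(1, min(requested_shards, num_files // workers))


# Configurations matching the manual checks: 2 and 4 workers over 8 and 12 files.
for num_files, num_workers in [(8, 2), (12, 2), (8, 4), (12, 4)]:
    shards = capped_num_shards(8, num_files, num_workers)
    print(f"files={num_files} workers={num_workers} -> sub-shards={shards}, "
          f"min files per sub-shard={num_files // shards}")
```

When there are fewer files than workers, the expression falls back to a single sub-shard, so some workers still receive no files; the cap cannot create parallelism that the file count does not allow.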

How was this tested (or how to run locally)

  • New test: test_dataloader_workers_complete_and_balanced
    (uv run pytest tests/datasets/test_streaming.py -k workers): multi-file local dataset,
    asserts every sample is yielded exactly once through a 2-worker DataLoader.
  • Manually verified worker load distribution and absence of the HF warning for 2 and 4 workers
    over 8- and 12-file datasets.
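
The exactly-once property can be expressed as a small check over the sample indices collected from the loader. This is a sketch of the idea, not the test added in this PR: `coverage_report` is a hypothetical helper, and it assumes every sample carries a unique integer index in `range(expected_total)`.

```python
from collections import Counter


def coverage_report(seen_indices, expected_total):
    """Report indices that were yielded more than once or never yielded."""
    counts = Counter(seen_indices)
    duplicates = sorted(i for i, n in counts.items() if n > 1)
    missing = sorted(set(range(expected_total)) - set(counts))
    return {"duplicates": duplicates, "missing": missing}


# Order does not matter, since workers interleave their batches.
print(coverage_report([0, 2, 1, 3], expected_total=4))

# A duplicated index 0 and a missing index 1 are both reported.
print(coverage_report([0, 0, 2], expected_total=3))
```

A test built on this would collect the indices from every batch of a multi-worker `DataLoader` and assert that both lists are empty.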

Checklist (required before merge)

Reviewer notes

  • The test fixture only rolls over to a new parquet file once the current one exceeds 1 integer MB
    (get_hf_dataset_size_in_mb floors to MB), so the test uses wide motor features to
    force a multi-file dataset (single-file datasets make any worker test pass trivially).

@github-actions github-actions bot added the dataset and tests labels Jun 12, 2026
@kohankhaki kohankhaki force-pushed the fix-streaming-dataloader-workers branch from 9c93473 to 180a4a2 Compare June 12, 2026 02:11
pkooij commented Jun 12, 2026

Member

Hi, thanks for the PR. We are in the process of refactoring the streaming data loader to be more performant in general, and will take your PR into account!
