fix(datasets): use all DataLoader workers in StreamingLeRobotDataset by kohankhaki · Pull Request #3781 · huggingface/lerobot

kohankhaki · 2026-06-12T02:05:35Z

Summary / Motivation

StreamingLeRobotDataset does not scale with DataLoader workers: with num_workers > 1,
all data is read by worker 0 while the remaining workers sit idle, and the log is spammed with
Too many dataloader workers: 2 (max is dataset.num_shards=1). Stopping 1 dataloader workers.

Root cause: two levels of sharding conflict. __iter__ slices the HF dataset into
self.num_shards sub-datasets (via safe_shard) to interleave reads for shuffling, so each
sub-dataset only holds num_files / num_shards parquet files. When a sub-dataset is iterated
inside a DataLoader worker, datasets.IterableDataset additionally splits its files across
workers: a sub-dataset with fewer files than workers can only feed the first worker(s).
Measured on an 8-file dataset with num_workers=2: worker 0 yielded all 2400 samples,
worker 1 yielded 0.
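
The conflict can be reproduced with a small pure-Python simulation. This is a minimal sketch, not the lerobot or datasets implementation. It assumes sub-shards are contiguous slices of the file list, that each worker takes every num_workers-th file of a sub-shard starting at its own id, and that the run above used 8 sub-shards (consistent with dataset.num_shards=1 in the warning). All function names are hypothetical.

```python
def split_contiguous(items, n):
    """Split items into n contiguous, near-equal chunks (assumed shard layout)."""
    base, extra = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def files_per_worker(num_files, num_sub_shards, num_workers):
    """Return the file ids each worker ends up reading under two-level sharding."""
    files = list(range(num_files))
    per_worker = [[] for _ in range(num_workers)]
    # Level 1: the file list is sliced into sub-datasets.
    for sub_files in split_contiguous(files, num_sub_shards):
        # Level 2: each sub-dataset's files are split across workers (assumed strided).
        for worker_id in range(num_workers):
            per_worker[worker_id].extend(sub_files[worker_id::num_workers])
    return per_worker


# 8 files, 8 sub-shards of one file each: worker 1 never receives a file.
starved = files_per_worker(num_files=8, num_sub_shards=8, num_workers=2)
print(starved)  # [[0, 1, 2, 3, 4, 5, 6, 7], []]

# 8 files, 4 sub-shards of two files each: both workers receive files.
balanced = files_per_worker(num_files=8, num_sub_shards=4, num_workers=2)
print(balanced)  # [[0, 2, 4, 6], [1, 3, 5, 7]]
```

With one file per sub-shard, the second-level split has nothing left to hand to worker 1, which matches the 2400/0 measurement above.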

Related issues

None filed; happy to open one if preferred.

What changed

__iter__ caps the number of sub-shards so each keeps at least num_workers files:
num_shards = max(1, min(self.num_shards, self.hf_dataset.num_shards // num_workers)).
HF's internal per-worker file split then assigns every worker a disjoint, non-empty set of files.
No behavior change for num_workers 0/1; no API change.
After the fix (same setup): 1227/1173 samples per worker, every sample yielded exactly once,
no warnings. Data was never duplicated or lost before this change. This is purely a
throughput/parallelism fix.
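
The cap is plain integer arithmetic and can be checked in isolation. The sketch below mirrors the formula quoted above. The helper name, the standalone signature, and the clamp of num_workers to at least 1 (so the no-worker case does not divide by zero) are assumptions for illustration, not the PR's code.

```python
def capped_num_shards(requested_shards, num_files, num_workers):
    """Cap sub-shards so each keeps at least num_workers files."""
    # Assumption: outside a DataLoader worker, treat the worker count as 1.
    workers = max(1, num_workers)
    return max(1, min(requested_shards, num_files // workers))


# 8 files, 2 workers: at most 4 sub-shards, so each holds 2 files.
print(capped_num_shards(8, 8, 2))   # 4
# 12 files, 4 workers: at most 3 sub-shards, so each holds 4 files.
print(capped_num_shards(8, 12, 4))  # 3
# One worker (or none): the requested value is kept.
print(capped_num_shards(8, 8, 1))   # 8
# Fewer files than workers: falls back to a single sub-shard.
print(capped_num_shards(8, 1, 2))   # 1
```

The outer max(1, ...) matters only when there are fewer files than workers; in that case some workers still stay idle, since there are not enough files to go around.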

How was this tested (or how to run locally)

New test: test_dataloader_workers_complete_and_balanced
(uv run pytest tests/datasets/test_streaming.py -k workers): multi-file local dataset,
asserts every sample is yielded exactly once through a 2-worker DataLoader.
Manually verified worker load distribution and absence of the HF warning for 2 and 4 workers
over 8- and 12-file datasets.
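
The "exactly once" property reduces to a counting check over the sample indices collected from each worker. The helper below is a hypothetical sketch of that check, not the body of test_dataloader_workers_complete_and_balanced. It assumes samples carry integer indices in the range 0 to expected_total - 1.

```python
from collections import Counter


def check_exactly_once(per_worker_indices, expected_total):
    """Summarize missing, duplicated, and unexpected sample indices."""
    counts = Counter(i for worker in per_worker_indices for i in worker)
    return {
        "missing": [i for i in range(expected_total) if counts[i] == 0],
        "duplicated": sorted(i for i, c in counts.items() if c > 1),
        "unexpected": sorted(i for i in counts if not 0 <= i < expected_total),
        "per_worker": [len(worker) for worker in per_worker_indices],
    }


# Two workers that together cover indices 0..5 exactly once.
report = check_exactly_once([[0, 2, 4], [1, 3, 5]], expected_total=6)
assert report["missing"] == [] and report["duplicated"] == []
print(report["per_worker"])  # [3, 3]
```

An idle worker shows up as a 0 in per_worker even when nothing is missing, which is why a completeness check alone would not have caught the original behavior.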

Checklist (required before merge)

Linting/formatting run (pre-commit run -a)
All tests pass locally (pytest)
Documentation updated
CI is green
Community Review: I have reviewed another contributor's open PR and linked it here: Add vllm policy (remote vLLM OpenPI inference) #3725

Reviewer notes

The test fixture only rolls over to a new parquet file above 1 integer-MB
(get_hf_dataset_size_in_mb floors to MB), so the test uses wide motor features to
force a multi-file dataset (single-file datasets make any worker test pass trivially).

pkooij · 2026-06-12T08:48:56Z

Hi, thanks for the PR. We are in the process of refactoring the streaming data loader to be more performant in general, and will take your PR into account!

github-actions bot added the dataset and tests labels on Jun 12, 2026

fix(datasets): use all DataLoader workers in StreamingLeRobotDataset

180a4a2

kohankhaki force-pushed the fix-streaming-dataloader-workers branch from 9c93473 to 180a4a2 on June 12, 2026 02:11

kohankhaki wants to merge 1 commit into huggingface:main from kohankhaki:fix-streaming-dataloader-workers
