Support for Streaming Dataset #703

@akoumpa

Description

Is your feature request related to a problem? Please describe.
Supporting Streaming Datasets (Iterable) offers many benefits compared to Mapped Datasets. In particular, it enables immediate consumption of data batches, without having to download or preprocess the whole dataset beforehand. This greatly improves the user experience, since it eliminates a time-consuming preprocessing step.

With great power comes great responsibility: while Streaming Datasets enable no-wait workflows, distributed sampling becomes trickier than with Mapped Datasets, mainly because they lack a __len__ method. As a result, special attention is needed to avoid redundant data replication and an uneven last batch across ranks.
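To make the two failure modes concrete, here is a minimal sketch (not a proposed implementation) of sharding a stream of unknown length across ranks. The name `shard_stream` is hypothetical. It assumes every rank iterates the same stream in the same order, and it drops an incomplete final round so that all ranks yield the same number of samples.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def shard_stream(stream: Iterable[T], rank: int, world_size: int) -> Iterator[T]:
    """Yield this rank's share of a stream whose length is not known up front.

    Hypothetical helper for illustration only. Samples are consumed in rounds
    of `world_size`; each rank keeps one sample per round, so no sample is
    seen by two ranks. An incomplete final round is dropped, which keeps the
    per-rank sample counts equal without needing __len__.
    """
    if world_size < 1 or not 0 <= rank < world_size:
        raise ValueError(f"invalid rank {rank} for world_size {world_size}")
    it = iter(stream)
    while True:
        # Pull the next round of up to `world_size` samples.
        current_round = list(islice(it, world_size))
        if len(current_round) < world_size:
            # Stream exhausted mid-round: stop so all ranks end together.
            return
        yield current_round[rank]


if __name__ == "__main__":
    # 10 samples over 4 ranks: two full rounds are used, samples 8 and 9 are dropped.
    for r in range(4):
        print(r, list(shard_stream(range(10), rank=r, world_size=4)))
```

The trade-off in this sketch is that every rank still reads the full stream and discards the samples it does not own. Sharding at the file or shard level avoids that redundant I/O, but then equal per-rank counts are no longer guaranteed.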

Open source libraries such as Hugging Face datasets provide a relevant function, split_dataset_by_node, that might be useful here.

Describe the solution you'd like
Support for streaming datasets, if possible without resorting to torch.distributed calls inside dataloaders :)
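One possible way to avoid calling torch.distributed from data-loading code is to read the rank information from the environment. The sketch below assumes the job is started by a launcher that sets the `RANK` and `WORLD_SIZE` environment variables (torchrun does this). The function name `rank_and_world_size` is hypothetical, and falling back to a single-process setup when the variables are absent is a design assumption, not something the issue specifies.

```python
import os
from typing import Mapping, Tuple


def rank_and_world_size(env: Mapping[str, str] = os.environ) -> Tuple[int, int]:
    """Return (rank, world_size) without importing torch.distributed.

    Hypothetical helper for illustration only. Assumes the launcher exports
    RANK and WORLD_SIZE; if they are missing, a single-process run
    (rank 0 of 1) is assumed.
    """
    rank = int(env.get("RANK", "0"))
    world_size = int(env.get("WORLD_SIZE", "1"))
    if world_size < 1 or not 0 <= rank < world_size:
        raise ValueError(f"inconsistent RANK={rank}, WORLD_SIZE={world_size}")
    return rank, world_size


if __name__ == "__main__":
    rank, world_size = rank_and_world_size()
    print(f"rank {rank} of {world_size}")
```

The resulting pair can be passed to whatever sharding logic the dataset uses, so the dataset itself never needs an initialized process group.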

Describe alternatives you've considered
Mapped Datasets

Additional context
This is useful for launching jobs on cloud providers, since it saves time by amortizing the data preprocessing cost over the training process.

Metadata

Labels

enhancement: New feature or request
