Support for Streaming Dataset

**Is your feature request related to a problem? Please describe.**
Supporting Streaming Datasets (Iterable) offers many benefits compared to Mapped Datasets. In particular, it enables immediate consumption of data batches, without having to download or preprocess the whole dataset beforehand. This greatly improves the user experience, since it eliminates a time-consuming preprocessing step.

With great power comes great responsibility: while Streaming Datasets are great for enabling no-wait workflows, distributed sampling becomes trickier than with Mapped Datasets, mainly because there is no `__len__` method. As a result, special attention is needed to avoid redundant data replication and an uneven last batch across ranks.
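To make the two pitfalls concrete, here is a minimal sketch in plain Python, not tied to any library. The helper names `shard_stream` and `batched_even` are hypothetical and exist only for illustration. It assumes every rank iterates the same stream in the same order, and that dropping the incomplete tail is acceptable.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def shard_stream(stream: Iterable[T], rank: int, world_size: int) -> Iterator[T]:
    """Yield every world_size-th example, starting at offset `rank`.

    No length is needed, and no example is seen by two ranks.
    """
    return islice(stream, rank, None, world_size)


def batched_even(
    stream: Iterable[T], rank: int, world_size: int, batch_size: int
) -> Iterator[List[T]]:
    """Yield this rank's batches, only while every rank can get a full batch.

    Examples are pulled in groups of batch_size * world_size; an incomplete
    final group is dropped, so all ranks produce the same number of batches.
    """
    it = iter(stream)
    group_size = batch_size * world_size
    while True:
        group = list(islice(it, group_size))
        if len(group) < group_size:
            return  # incomplete tail: drop it on every rank
        yield group[rank::world_size]
```

Note that in this sketch each rank still reads the whole stream and discards the examples owned by other ranks, so it avoids duplicated training data but not duplicated I/O. Sharding at the file or shard level would be needed to avoid that.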

Open-source libraries such as `datasets` provide relevant functions, e.g. [split_dataset_by_node](https://huggingface.co/docs/datasets/en/package_reference/main_classes#datasets.distributed.split_dataset_by_node), that might be useful here.

**Describe the solution you'd like**
Support for streaming datasets, if possible without resorting to `torch.distributed` inside dataloaders :)

**Describe alternatives you've considered**
Mapped Datasets

**Additional context**
This is useful for launching jobs on cloud providers, since it saves time by amortizing the data preprocessing cost over the training process.

Support for Streaming Dataset #703
