Description
Is your feature request related to a problem? Please describe.
Supporting Streaming Datasets (Iterable) offers many benefits compared to Mapped Datasets. In particular, it enables immediate consumption of data batches, without having to download or preprocess the whole dataset beforehand. This greatly improves the user experience, since it eliminates a time-consuming preprocessing step.
With great power comes great responsibility: while Streaming Datasets enable no-wait workflows, distributed sampling becomes trickier than with Mapped Datasets, mainly because an iterable dataset has no __len__ method. Without a known length, special care is needed to avoid replicating the same data on several ranks and to avoid an uneven last batch across ranks.
Open source libraries such as datasets provide a relevant function, split_dataset_by_node, that might be useful here.
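To make the two pitfalls concrete, here is a minimal pure-Python sketch of one way to shard a stream of unknown length across ranks. It is only an illustration, not how datasets or this project implements it: `shard_stream` is a hypothetical helper, and it assumes every rank iterates the same stream in the same order and already knows its `rank` and `world_size`. Items are read in groups of `world_size`, each rank keeps one item per group, and an incomplete final group is dropped so that all ranks see the same number of items.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def shard_stream(stream: Iterable[T], rank: int, world_size: int) -> Iterator[T]:
    """Yield this rank's share of a stream without knowing its length.

    Hypothetical helper for illustration only. Assumes all ranks iterate
    the same stream in the same order.
    """
    if world_size < 1 or not 0 <= rank < world_size:
        raise ValueError("rank must satisfy 0 <= rank < world_size")
    it = iter(stream)
    while True:
        # Read one item per rank; every rank sees the same group.
        group = list(islice(it, world_size))
        if len(group) < world_size:
            # Incomplete tail: drop it so no rank gets an extra item.
            return
        # Each rank keeps a distinct item, so nothing is duplicated.
        yield group[rank]


# Example: 10 items over 3 ranks; the 10th item is dropped on every rank.
shards = [list(shard_stream(range(10), r, 3)) for r in range(3)]
```

The trade-off in this sketch is that every rank still reads the whole stream and discards the items it does not own, and up to `world_size - 1` trailing items are lost. It needs no communication between ranks, though, so it does not depend on torch.distributed as long as the rank and world size are available some other way (for example from the launcher's environment).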
Describe the solution you'd like
Support for streaming datasets, if possible without relying on torch.distributed in dataloaders :)
Describe alternatives you've considered
Mapped Datasets
Additional context
This is useful for launching jobs on cloud providers, since it saves time by amortizing the data preprocessing cost over the training process.