
Commit d5bd6a5

Authored by: andimarafioti, burtenshaw, lhoestq, merveenoyan, pcuenca
Blog to promote the improved streaming (#3142)
* Blog to promote the improved streaming
* add entry to _blog.yml
* Apply suggestions from code review from Ben and Quentin
* more suggestions from ben
* change dataset name
* adding more CTA at the end
* fix date
* changing title and banner per pedro's review
* Update streaming-datasets.md
* Apply suggestions from code review (several rounds, from Merve, Pedro, and Quentin)
* comments from merve
* adding merve as an author :)
* pedro's comment

---------

Co-authored-by: burtenshaw <[email protected]>
Co-authored-by: Quentin Lhoest <[email protected]>
Co-authored-by: Merve Noyan <[email protected]>
Co-authored-by: Pedro Cuenca <[email protected]>
1 parent bb72c7a commit d5bd6a5

3 files changed: +157 -1 lines changed

_blog.yml

Lines changed: 16 additions & 1 deletion
```diff
@@ -6885,4 +6885,19 @@
   - python
   - announcement
   - open-source
-  - hub
+  - hub
+
+- local: streaming-datasets
+  title: "Streaming datasets at scale"
+  author: andito
+  thumbnail: /blog/assets/streaming_datasets/streaming_datasets.png
+  date: Oct 27, 2025
+  tags:
+  - datasets
+  - xet
+  - hub
+  - parquet
+  - streaming
+  - scale
+  - dataloaders
+  - storage
```
assets/streaming_datasets/streaming_datasets.png (658 KB, new binary image; preview not shown)

streaming-datasets.md

Lines changed: 141 additions & 0 deletions
@@ -0,0 +1,141 @@
---
title: "Streaming datasets: 100x More Efficient"
thumbnail: /blog/assets/streaming_datasets/streaming_datasets.png
authors:
- user: andito
- user: lhoestq
- user: burtenshaw
- user: pcuenq
- user: merve
---

## TLDR

> We boosted `load_dataset('dataset', streaming=True)`: stream datasets without downloading them, with a single line of code!
>
> Start training on multi-TB datasets immediately, with no complex setup, no downloads, no "disk out of space", and no 429 "stop requesting!" errors.
> It's super fast! It even outruns our local SSDs when training on 64xH100 with 256 workers downloading data.
> We've improved streaming to make 100x fewer requests, resolve data files 10x faster, deliver 2x more samples/sec, and hit 0 worker crashes at 256 concurrent workers.

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/streaming-dark.gif" width="800" height="auto" alt="Visualization of a dataset being streamed">

## Streaming datasets: 100x More Efficient
Loading data, especially at the terabyte scale, is a major pain in any machine learning workflow. We suffered this while training [SmolLM3](https://huggingface.co/blog/smollm3): at one point, we had to wait 3 hours before each run just to download enough data.

Streaming has always been possible in the `datasets` library, but large-scale training with massive datasets remained a challenge. That changes today 🔥. We spent a few months improving the backend, focusing on making dataset streaming faster and more efficient.

What did we do exactly? ⤵️
## Streaming: The Same Easy API
First things first: our changes are backwards compatible. You can still stream any dataset from the Hub with the same simple `streaming=True` flag. It's as easy as ever. 🚀
```python
from datasets import load_dataset

# Stream a dataset instead of downloading it
dataset = load_dataset("HuggingFaceM4/FineVisionMax", split="train", streaming=True)
# Get the first example
print(next(iter(dataset)))
```
Thousands of AI developers around the world use `datasets` daily; they should just get improved performance with zero extra work.
## The Challenge: Streaming at Scale
Streaming was a lifesaver for quickly exploring a dataset, but to train models, people usually downloaded the data locally or used a cloud storage service such as S3. That's what we did when training [SmolVLM](https://huggingface.co/blog/smolvlm2): we had all of our data on S3 and streamed directly from it.

We wanted to change that, so we decided to stream from the Hub while developing [nanoVLM](https://github.com/huggingface/nanoVLM). We soon found a big issue: our test run generated over 100,000 requests in under a minute, which got our IP blocked by the Hub! 😅 This happened because every DataLoader worker was initializing the dataset independently, creating a storm of redundant requests, many of them unnecessary. Our changes ultimately reduced startup requests by a factor of 100. In total, our improvements delivered:

- Data files resolution time: 10x faster
- Startup requests: Up to 100x more efficient
- Streaming speed: Up to 2x faster
- In-flight requests: Up to 2x more efficient
## Under the Hood: What We Improved
So, what changed? We focused on two phases: startup and streaming.

**1. Startup ⚡️**
The initial resolution of data files was creating a ton of requests. We made two major changes:
- Persistent Data Files Cache: We now cache the list of data files across all DataLoader workers. The first worker resolves the file list from the Hub; all other workers read directly from this local cache, virtually eliminating startup requests and slashing resolution time. No more request storms! (See the sketch below.)
- Optimized Resolution Logic: We also minimized the number of API calls required for that initial worker to fetch the file list. We now bundle the necessary requests as efficiently as possible, reducing latency even further.
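
To make this concrete, here is a minimal sketch (our own illustration, not code from the post) of the setup that used to cause the request storm: a streamed dataset handed to a PyTorch `DataLoader` with many workers. With the persistent cache, only the first worker resolves the data files list against the Hub; the others reuse it.

```python
from datasets import load_dataset
from torch.utils.data import DataLoader

# Stream the dataset: nothing is downloaded up front
dataset = load_dataset("HuggingFaceM4/FineVisionMax", split="train", streaming=True)

# Each worker iterates over its own subset of shards. Previously every worker
# re-resolved the data files from the Hub; now the first worker's resolution
# is cached locally and reused by the others.
loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=8,
    collate_fn=list,  # keep raw dict examples; plug in your own collator for training
)

for batch in loader:
    break  # batch is a list of 32 examples
```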
**2. Streaming 🏎️**
To improve throughput during streaming itself, we've introduced two new features:
- Prefetching for Parquet: We enabled prefetching for Parquet datasets. While your model is processing the current chunk of data, the `datasets` library is already fetching the next chunk in the background. This keeps the data pipeline full and ensures your GPU is never left waiting for data.
- Configurable Buffering: Advanced users can now fine-tune streaming performance for their specific hardware and network setup. We've exposed options to configure the buffer's block size and the prefetch volume, giving you maximum control to optimize I/O.

For example, this is how you can increase the minimum request size when streaming from 32 MiB (the default) to 128 MiB and configure prefetching:
```python
import pyarrow
import pyarrow.dataset
from datasets import load_dataset

fragment_scan_options = pyarrow.dataset.ParquetFragmentScanOptions(
    cache_options=pyarrow.CacheOptions(
        prefetch_limit=1,            # prefetch the next range in the background
        range_size_limit=128 << 20,  # read blocks of up to 128 MiB
    ),
)
# parquet_dataset_id is a placeholder: the repo id of a Parquet dataset on the Hub
ds = load_dataset(parquet_dataset_id, streaming=True, fragment_scan_options=fragment_scan_options)
```
Together, these improvements can double your data throughput, allowing you to train faster and more efficiently.
## How are we faster than plain S3? Xet
Hugging Face uses Xet: a deduplication-based storage system that enables fast, deduplicated uploads and downloads. Unlike traditional remote storage, data transfers are faster with Xet because duplicated data is only transferred once. For example, uploading a large-scale dataset to Hugging Face leverages Xet to accelerate the upload, and once the dataset is uploaded, it can be streamed right away.

Deduplication for Parquet is enabled through [Parquet Content Defined Chunking (CDC)](https://huggingface.co/blog/parquet-cdc). Thanks to Parquet CDC and Xet deduplication, uploading datasets to Hugging Face is faster than to any traditional remote storage.

This is supported by our `pyspark_huggingface` package, a Spark Data Source to read and write Hugging Face datasets. It includes Parquet CDC and Xet support, dramatically accelerating data transfers to and from the Hub.
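
As a rough sketch of what that looks like (the repo ids are placeholders, and the exact options for your setup may differ; see the `pyspark_huggingface` documentation for authoritative usage):

```python
import pyspark_huggingface  # registers the "huggingface" data source
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hf-demo").getOrCreate()

# Read a Hugging Face dataset into a Spark DataFrame (placeholder repo id)
df = spark.read.format("huggingface").load("username/my-dataset")

# Write a DataFrame back to the Hub; Parquet CDC + Xet deduplicate the upload
df.write.format("huggingface").mode("overwrite").save("username/my-processed-dataset")
```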
## Need a custom streaming pipeline?
Some data file formats are not supported in `datasets`, and sometimes you need more control, so we made it easy to build custom streaming pipelines. This approach has been battle-tested in the LeRobot library to sample video frames, and in the `WebDataset` library to stream TAR archives.

We improved the [HfFileSystem](https://huggingface.co/docs/huggingface_hub/guides/hf_file_system) in the `huggingface_hub` library to efficiently read files from remote Hugging Face dataset repositories and stream data:
```python
from huggingface_hub import HfFileSystem

# dataset_id and path_in_repo are placeholders for your dataset repo and file
path = f"hf://datasets/{dataset_id}/{path_in_repo}"
with HfFileSystem().open(path) as f:
    # loop with .read() or .readline() to stream data
    # or do random access with .seek()
    chunk = f.read(8192)  # e.g. read the first 8 KiB
```
Passing an `HfFileSystem` to a torch `DataLoader` reuses the cached results from `.ls()` and `.glob()`, which eliminates the need for additional requests when listing data files.
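
For example, here is a minimal, hypothetical pipeline (our own sketch; the repo id and file pattern are placeholders) that streams JSON Lines files through an `HfFileSystem` inside a torch `IterableDataset`, listing the files once and sharding them across DataLoader workers:

```python
import json

from huggingface_hub import HfFileSystem
from torch.utils.data import DataLoader, IterableDataset, get_worker_info


class StreamedJsonl(IterableDataset):
    """Stream JSON Lines files from a Hugging Face dataset repo."""

    def __init__(self, fs: HfFileSystem, pattern: str):
        self.fs = fs
        # List the files once; the cached listing is reused by the workers
        self.files = sorted(fs.glob(pattern))

    def __iter__(self):
        info = get_worker_info()
        # Shard files across DataLoader workers
        files = self.files if info is None else self.files[info.id :: info.num_workers]
        for file in files:
            with self.fs.open(file) as f:
                for line in f:
                    yield json.loads(line)


fs = HfFileSystem()
ds = StreamedJsonl(fs, "hf://datasets/username/my-dataset/**/*.jsonl")  # placeholder repo
loader = DataLoader(ds, batch_size=None, num_workers=4)
```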
## Push streaming to the limit
We're now using these streaming enhancements in nanoVLM to train the next generation of SmolVLMs. With these tweaks, we get better performance from streaming than from our cluster's hierarchical hard-disk storage. In fact, streaming is now as fast as reading the data from local SSDs! Previously, transferring data to those local SSDs was the step that delayed our trainings by three hours. For more details, check out [nanoVLM](https://github.com/huggingface/nanoVLM) on GitHub.
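
As an illustration (a sketch under our own assumptions, not the actual nanoVLM training code), a large-scale job can combine streaming with shard shuffling and per-node splitting, so that every node and every DataLoader worker streams its own slice of the data:

```python
import os

from datasets import load_dataset
from datasets.distributed import split_dataset_by_node
from torch.utils.data import DataLoader

# Stream the dataset and shuffle shards/samples with a buffer
dataset = load_dataset("HuggingFaceM4/FineVisionMax", split="train", streaming=True)
dataset = dataset.shuffle(seed=42, buffer_size=1_000)

# Give each training node its own shard of the stream (RANK/WORLD_SIZE as set by torchrun)
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))
dataset = split_dataset_by_node(dataset, rank=rank, world_size=world_size)

# Each worker streams its own subset of files in the background
loader = DataLoader(dataset, batch_size=8, num_workers=4, collate_fn=list)
```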
## Get Started and See the Difference
These powerful new features landed in the `datasets` and `huggingface_hub` libraries. To take advantage of them, simply update your libraries and check out [the documentation](https://huggingface.co/docs/datasets/stream):
```bash
pip install --upgrade datasets huggingface_hub
```
To celebrate this, we preconcatenated and shuffled all the data sources in FineVision into [FineVisionMax](https://huggingface.co/datasets/HuggingFaceM4/FineVisionMax). You can use this single combined dataset to train your VLM, with no need to handle multiple datasets manually!
```python
from datasets import load_dataset

# Stream a dataset instead of downloading it
dataset = load_dataset("HuggingFaceM4/FineVisionMax", split="train", streaming=True)
# Get the first example
print(next(iter(dataset)))
```
And you can see how we do it at scale in [nanoVLM](https://github.com/huggingface/nanoVLM)!

Happy streaming! 🤗
