
Conversation

@andimarafioti
Member

  • Add an entry to _blog.yml.
  • Add a thumbnail. There are no requirements here, but there is a template if it's helpful.
  • Check you use a short title and blog path.
  • Upload any additional assets (such as images) to the Documentation Images repo. This is to reduce bloat in the GitHub base repo when cloning and pulling. Try to have small images to avoid a slow or expensive user experience.
  • Add metadata (such as authors) to your md file. You can also specify guest or org for the authors.
  • Ensure the publication date is correct.
  • Preview the content. A quick way is to paste the markdown content in https://huggingface.co/new-blog. Do not click publish; this is just a way to do an early check.

Here is an example of a complete PR: #2382

Getting a Review

Please make sure to get a review from someone on your team or a co-author.
Once this is done and once all the steps above are completed, you should be able to merge.
There is no need for additional reviews if you and your co-authors are happy and meet all of the above.

Feel free to add @pcuenca as a reviewer if you want a final check. Keep in mind he'll be biased toward light reviews
(e.g., check for proper metadata) rather than content reviews unless explicitly asked.

Member

@lhoestq lhoestq left a comment


lgtm! a small suggestion on the xet part + some links:

Collaborator

@burtenshaw burtenshaw left a comment


A useful blog post. Nice work! I just left a few readability/consistency comments and one typo.


Together, these improvements can double your data throughput, allowing you to train faster and more efficiently.

## How are we faster than plain S3: Xet
Collaborator


I feel like this is a big deal that could be mentioned in the tldr?

Member Author


It is a huge deal, but for streaming it's not very relevant (see quentin's changes below). We can still mention it in the TLDR, but we need to get the wording right :)

Member


Yes, speaking of tldr, the super short current one is nice, but we could maybe add an intro paragraph with the main "contributions":

  • File resolution
  • Streaming optimizations
  • Xet
  • ...

Member Author

@andimarafioti andimarafioti left a comment


Went through @burtenshaw and @lhoestq reviews :)

@andimarafioti andimarafioti force-pushed the streaming-boost branch 2 times, most recently from 114b828 to e08e0e2 on October 24, 2025 09:55
Contributor

@merveenoyan merveenoyan left a comment


TIL a ton of cool stuff 🙌🏻💗


1. Startup⚡️
The initial resolution of data files was creating a ton of requests. We made two major changes:
- Persistent Data Files Cache: We are now caching the list of data files across all DataLoader workers. The first worker resolves the file list from the Hub. All other workers read directly from this local cache, virtually eliminating startup requests and slashing resolution time. No more request storms!
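The cross-worker cache described above can be sketched roughly like this. This is a simplified, hypothetical stand-in (the function and file names are made up, and the real `datasets` implementation differs): the first worker resolves the file list and writes it to a shared local cache file, and every later worker reads that file instead of issuing its own Hub requests.

```python
import json
import os
import tempfile

# Hypothetical stand-in for a Hub request that resolves a dataset's data files.
def resolve_data_files_from_hub(dataset):
    resolve_data_files_from_hub.calls += 1
    return [f"{dataset}/shard-{i:05d}.parquet" for i in range(3)]

resolve_data_files_from_hub.calls = 0

def get_data_files(dataset, cache_dir):
    # First worker resolves from the Hub and writes the cache;
    # every other worker reads the local cache instead.
    cache_path = os.path.join(cache_dir, dataset.replace("/", "--") + ".json")
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            return json.load(f)
    files = resolve_data_files_from_hub(dataset)
    with open(cache_path, "w") as f:
        json.dump(files, f)
    return files

cache_dir = tempfile.mkdtemp()
# Simulate 4 DataLoader workers asking for the same dataset's file list.
lists = [get_data_files("org/my-dataset", cache_dir) for _ in range(4)]
print(resolve_data_files_from_hub.calls)  # 1: only the first worker hit the "Hub"
```

With N workers and M datasets, this pattern turns N×M resolution calls into M, which is where the "request storm" savings come from.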
Contributor


I think for people who don't know how streaming works in `datasets`, or how to load streamed datasets with a DataLoader themselves, this is a bit confusing. Perhaps before this section you could elaborate a bit more on how things work internally, so the improvements sit better.

Member


I think it'd suffice to explain that each worker needs to know the list of remote files, which involves requests to the Hub. And maybe link to a Files view from one of the datasets. And maybe state that multiple datasets are usually combined for a realistic training job.

Member

@pcuenca pcuenca left a comment


🔥

# or do random access with .seek()
```

Passing a `HfFileSystem` to a torch `DataLoader` reuses the cached results from `.ls()` and `.glob()`, which eliminates the need for additional requests when listing data files.
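The benefit of the cached `.ls()`/`.glob()` results can be illustrated with a toy filesystem (illustrative only; this is not `HfFileSystem`'s actual code, and the class and path names are made up): each unique path is listed remotely at most once, so repeated listing calls from DataLoader workers cost nothing extra.

```python
# Toy filesystem: each unique path is listed remotely at most once,
# so repeated .ls() calls from DataLoader workers are served from cache.
class CachedListingFS:
    def __init__(self, remote_listings):
        self.remote_listings = remote_listings  # stand-in for Hub API responses
        self.remote_calls = 0
        self._cache = {}

    def ls(self, path):
        if path not in self._cache:
            self.remote_calls += 1              # one real request per unique path
            self._cache[path] = sorted(self.remote_listings[path])
        return self._cache[path]

fs = CachedListingFS({"datasets/org/my-dataset": ["b.parquet", "a.parquet"]})
for _ in range(8):  # eight worker requests for the same path
    files = fs.ls("datasets/org/my-dataset")
print(fs.remote_calls, files)  # 1 ['a.parquet', 'b.parquet']
```

The same idea applies whether the listing comes from `.ls()` or a `.glob()` pattern: once resolved, the result is shared rather than re-fetched per worker.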
Member


so is stat() being cached?

@andimarafioti andimarafioti merged commit d5bd6a5 into main Oct 27, 2025
1 of 2 checks passed
@andimarafioti andimarafioti deleted the streaming-boost branch October 27, 2025 15:07