
Conversation

@andimarafioti
Member

  • Add an entry to _blog.yml.
  • Add a thumbnail. There are no requirements here, but there is a template if it's helpful.
  • Check you use a short title and blog path.
  • Upload any additional assets (such as images) to the Documentation Images repo. This is to reduce bloat in the GitHub base repo when cloning and pulling. Try to have small images to avoid a slow or expensive user experience.
  • Add metadata (such as authors) to your md file. You can also specify guest or org for the authors.
  • Ensure the publication date is correct.
  • Preview the content. A quick way is to paste the markdown content in https://huggingface.co/new-blog. Do not click publish; this is just a way to do an early check.

Here is an example of a complete PR: #2382

Getting a Review

Please make sure to get a review from someone on your team or a co-author.
Once this is done and once all the steps above are completed, you should be able to merge.
There is no need for additional reviews if you and your co-authors are happy and meet all of the above.

Feel free to add @pcuenca as a reviewer if you want a final check. Keep in mind he'll be biased toward light reviews
(e.g., check for proper metadata) rather than content reviews unless explicitly asked.

Member

@lhoestq lhoestq left a comment


lgtm! a small suggestion on the xet part + some links:

Collaborator

@burtenshaw burtenshaw left a comment


A useful blog post. Nice work! I just left a few readability/consistency comments and one typo.


Together, these improvements can double your data throughput, allowing you to train faster and more efficiently.

## How are we faster than plain S3: Xet
Collaborator


I feel like this is a big deal that could be mentioned in the tldr?

Member Author


It is a huge deal, but for streaming it's not very relevant (see quentin's changes below). We can still mention it in the TLDR, but we need to get the wording right :)

Member


Yes, speaking of tldr, the super short current one is nice, but we could maybe add an intro paragraph with the main "contributions":

  • File resolution
  • Streaming optimizations
  • Xet
  • ...

Member Author

@andimarafioti andimarafioti left a comment


Went through @burtenshaw and @lhoestq reviews :)

@andimarafioti andimarafioti force-pushed the streaming-boost branch 2 times, most recently from 114b828 to e08e0e2 on October 24, 2025 09:55
Contributor

@merveenoyan merveenoyan left a comment


TIL a ton of cool stuff 🙌🏻💗


1. Startup⚡️
The initial resolution of data files was creating a ton of requests. We made two major changes:
- Persistent Data Files Cache: We are now caching the list of data files across all DataLoader workers. The first worker resolves the file list from the Hub. All other workers read directly from this local cache, virtually eliminating startup requests and slashing resolution time. No more request storms!
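The cross-worker cache described above can be sketched roughly like this. This is a simplified, hypothetical stand-in (the function and file names are made up, and the real `datasets` implementation differs): the first worker resolves the file list and writes it to a shared local cache file, and every later worker reads that file instead of issuing its own Hub requests.

```python
import json
import os
import tempfile

# Hypothetical stand-in for a Hub request that resolves a dataset's data files.
def resolve_data_files_from_hub(dataset):
    resolve_data_files_from_hub.calls += 1
    return [f"{dataset}/shard-{i:05d}.parquet" for i in range(3)]

resolve_data_files_from_hub.calls = 0

def get_data_files(dataset, cache_dir):
    # First worker resolves from the Hub and writes the cache;
    # every other worker reads the local cache instead.
    cache_path = os.path.join(cache_dir, dataset.replace("/", "--") + ".json")
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            return json.load(f)
    files = resolve_data_files_from_hub(dataset)
    with open(cache_path, "w") as f:
        json.dump(files, f)
    return files

cache_dir = tempfile.mkdtemp()
# Simulate 4 DataLoader workers asking for the same dataset's file list.
lists = [get_data_files("org/my-dataset", cache_dir) for _ in range(4)]
print(resolve_data_files_from_hub.calls)  # 1: only the first worker hit the "Hub"
```

With N workers and M datasets, this pattern turns N×M resolution calls into M, which is where the "request storm" savings come from.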
Contributor


I think for people who don't know how streaming works in `datasets`, or how to load streamed datasets with a DataLoader themselves, this is a bit confusing. Perhaps before this section you could elaborate a bit more on how things work internally, so the improvements sit better.

Member


I think it'd suffice to explain that each worker needs to know the list of remote files, which involves requests to the Hub. And maybe link to a Files view from one of the datasets. And maybe state that multiple datasets are usually combined for a realistic training job.

Member

@pcuenca pcuenca left a comment


🔥

# or do random access with .seek()
```

Passing a `HfFileSystem` to a torch `DataLoader` reuses the cached results from `.ls()` and `.glob()`, which eliminates the need for additional requests when listing data files.
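The benefit of the cached `.ls()`/`.glob()` results can be illustrated with a toy filesystem (illustrative only; this is not `HfFileSystem`'s actual code, and the class and path names are made up): each unique path is listed remotely at most once, so repeated listing calls from DataLoader workers cost nothing extra.

```python
# Toy filesystem: each unique path is listed remotely at most once,
# so repeated .ls() calls from DataLoader workers are served from cache.
class CachedListingFS:
    def __init__(self, remote_listings):
        self.remote_listings = remote_listings  # stand-in for Hub API responses
        self.remote_calls = 0
        self._cache = {}

    def ls(self, path):
        if path not in self._cache:
            self.remote_calls += 1              # one real request per unique path
            self._cache[path] = sorted(self.remote_listings[path])
        return self._cache[path]

fs = CachedListingFS({"datasets/org/my-dataset": ["b.parquet", "a.parquet"]})
for _ in range(8):  # eight worker requests for the same path
    files = fs.ls("datasets/org/my-dataset")
print(fs.remote_calls, files)  # 1 ['a.parquet', 'b.parquet']
```

The same idea applies whether the listing comes from `.ls()` or a `.glob()` pattern: once resolved, the result is shared rather than re-fetched per worker.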
Member


so is stat() being cached?

@andimarafioti andimarafioti merged commit d5bd6a5 into main Oct 27, 2025
1 of 2 checks passed
@andimarafioti andimarafioti deleted the streaming-boost branch October 27, 2025 15:07