-
Notifications
You must be signed in to change notification settings - Fork 932
Blog to promote the improved streaming #3142
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
lhoestq
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm ! a small siggestion on the xet part + some links:
burtenshaw
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A useful blog post. Nice work! I just left a few readability/consistency comments and one typo.
|
|
||
| Together, these improvements can double your data throughput, allowing you to train faster and more efficiently. | ||
|
|
||
| ## How are we faster than plain S3: Xet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this is a big deal that could be mentioned in the tldr?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a huge deal, but for streaming it's not very relevant (see quentin's changes below). We can still mention it in the TLDR, but we need get the wording right :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, speaking of tldr, the super short current one is nice, but we could maybe add an intro paragraph with the main "contributions":
- File resolution
- Streaming optimizations
- Xet
- ...
andimarafioti
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Went through @burtenshaw and @lhoestq reviews :)
114b828 to
e08e0e2
Compare
merveenoyan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL a ton of cool stuff 🙌🏻💗
|
|
||
| 1. Startup⚡️ | ||
| The initial resolution of data files was creating a ton of requests. We made two major changes: | ||
| - Persistent Data Files Cache: We are now caching the list of data files across all DataLoader workers. The first worker resolves the file list from the Hub. All others workers read directly from this local cache, virtually eliminating startup requests and slashing resolution time. No more request storms! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for people who don't know how streaming datasets with datasets and loading streamed datasets with DataLoader yourself, it's a bit confusing. perhaps before this section, you could elaborate a bit more on how things work internally so it sits better with improvements made
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it'd suffice with explaining that each worker needs to know the list of remote files, which involves requests to the Hub. And maybe link to a Files view from one of the datasets. And maybe state that multiple datasets are usually combined for a realistic training job.
pcuenca
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔥
| # or do random access with .seek() | ||
| ``` | ||
|
|
||
| Passing a `HfFileSystem` to a torch `DataLoader` reuses the cached results from `.ls()` and `.glob()` which eliminates the need for additional requests when listing data files. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so is stat() being cached?
e08e0e2 to
af31464
Compare
Co-authored-by: burtenshaw <[email protected]> Co-authored-by: Quentin Lhoest <[email protected]>
d060171 to
c707115
Compare
Co-authored-by: Merve Noyan <[email protected]>
Co-authored-by: Pedro Cuenca <[email protected]> Co-authored-by: Merve Noyan <[email protected]>
Co-authored-by: Pedro Cuenca <[email protected]>
Co-authored-by: Quentin Lhoest <[email protected]> Co-authored-by: Pedro Cuenca <[email protected]>
Co-authored-by: Merve Noyan <[email protected]> Co-authored-by: Pedro Cuenca <[email protected]>

mdfile. You can also specifyguestororgfor the authors.Here is an example of a complete PR: #2382
Getting a Review
Please make sure to get a review from someone on your team or a co-author.
Once this is done and once all the steps above are completed, you should be able to merge.
There is no need for additional reviews if you and your co-authors are happy and meet all of the above.
Feel free to add @pcuenca as a reviewer if you want a final check. Keep in mind he'll be biased toward light reviews
(e.g., check for proper metadata) rather than content reviews unless explicitly asked.