Keep hffs cache in workers when streaming #7820

lhoestq · 2025-10-15T15:51:28Z

(and also reorder the hffs args to improve caching)

When using DataLoader(iterable_dataset, num_workers=...), the dataset is pickled and passed to each worker. Previously, however, the unpickled dataset ended up in a process with an empty hffs cache. By keeping the cache attached to IterableDataset, the cached hffs instances are pickled with the dataset and repopulate the cache in the DataLoader workers.

This requires huggingface/huggingface_hub#3443 to work effectively, though; otherwise the unpickled hffs cache would start empty.
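To illustrate the idea, here is a minimal sketch of a process-local instance cache that travels with a pickled object. It is not the code from this PR: `FakeFS`, `StreamingDataset`, and the `listing` attribute are hypothetical stand-ins for HfFileSystem, IterableDataset, and whatever state the real filesystem caches.

```python
import pickle


class FakeFS:
    """Hypothetical stand-in for a filesystem class with an instance cache."""

    _cache = {}  # process-local: cache key -> instance

    def __init__(self, key):
        self.key = key
        self.listing = {}  # pretend this holds expensive directory listings

    @classmethod
    def get(cls, key):
        # Return the cached instance, creating it on first use.
        if key not in cls._cache:
            cls._cache[key] = cls(key)
        return cls._cache[key]


class StreamingDataset:
    """Hypothetical dataset that keeps the cached instances attached to itself."""

    def __init__(self):
        # Snapshot of the cache, so the instances are pickled with the dataset.
        self.fs_instances = dict(FakeFS._cache)

    def __setstate__(self, state):
        self.__dict__.update(state)
        # On unpickling (e.g. in a worker), put the instances back in the cache.
        for key, fs in self.fs_instances.items():
            FakeFS._cache.setdefault(key, fs)


fs = FakeFS.get("hf://datasets/user/repo")
fs.listing["data/"] = ["train-00000.parquet"]
dataset = StreamingDataset()
payload = pickle.dumps(dataset)

FakeFS._cache.clear()  # simulate a fresh worker process with an empty cache
restored = pickle.loads(payload)
worker_fs = FakeFS.get("hf://datasets/user/repo")
print(worker_fs.listing)
```

In this sketch the worker gets back the instance that was pickled, including its listings, instead of building a new one. If the instances dropped their internal state when pickled, the cache would be repopulated with empty instances, which is the gap the companion huggingface_hub PR is meant to close.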

cc @andimarafioti @LTMeyer

HuggingFaceDocBuilderDev · 2025-10-15T16:01:48Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

andimarafioti

I'll try this :)

andimarafioti · 2025-10-16T14:57:32Z

src/datasets/download/download_config.py

        if name == "token" and getattr(self, "storage_options", None) is not None:
            if "hf" not in self.storage_options:
-                self.storage_options["hf"] = {"token": value, "endpoint": config.HF_ENDPOINT}
+                self.storage_options["hf"] = {"endpoint": config.HF_ENDPOINT, "token": value}


Did you need to make this change? Seems weird, since dicts aren't ordered in Python.

lhoestq · 2025-10-16

Because the fsspec cache, which maps filesystem arguments to the cached instance, is sensitive to argument order ^^'

With those changes, every instance of HfFileSystem in datasets uses the same order.
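A small sketch of why the order can matter, assuming the cache key is a hash of the printed arguments (my understanding of how fsspec tokenizes them, but treat it as an assumption). `cache_token` is a hypothetical helper, not an fsspec function, and the token and endpoint values are placeholders.

```python
from hashlib import md5


def cache_token(**kwargs):
    """Simplified model of an argument-based cache key: hash the printed kwargs."""
    return md5(str(kwargs).encode()).hexdigest()


# Same arguments, different insertion order.
token_first = cache_token(token="hf_xxx", endpoint="https://huggingface.co")
endpoint_first = cache_token(endpoint="https://huggingface.co", token="hf_xxx")

# Dict equality ignores order, but the printed form (and so the hash) does not.
same_dict = {"token": "hf_xxx", "endpoint": "https://huggingface.co"} == {
    "endpoint": "https://huggingface.co",
    "token": "hf_xxx",
}

print(same_dict, token_first == endpoint_first)
```

Python dicts have preserved insertion order since 3.7, so two equal dicts can still print differently. Under a key scheme like this one, they would map to two separate cache entries, and using one consistent order everywhere avoids that.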

andimarafioti · 2025-10-16T14:58:17Z

src/datasets/utils/file_utils.py

        storage_options = {
-            "token": token,
            "endpoint": config.HF_ENDPOINT,
+            "token": token,


same, weird that you need this :S

lhoestq added 2 commits October 15, 2025 17:49

keep hffs cache in workers when streaming

2c17758

bonus: reorder hffs args to improve caching

ad4d847

lhoestq mentioned this pull request Oct 15, 2025

[HfFileSystem] Keep cache on pickle huggingface/huggingface_hub#3443

Open

andimarafioti reviewed Oct 16, 2025

lhoestq merged commit 0b2a4c2 into main Oct 17, 2025
10 of 15 checks passed

lhoestq deleted the keep-hffs-cache-in-workers-when-streaming branch October 17, 2025 09:59