
Unify datasets cache path from references with regular PyTorch cache? #6727

Open · pmeier opened this issue Oct 10, 2022 · 4 comments

pmeier (Contributor) commented Oct 10, 2022

In the classification and video_classification references, we cache here:
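Roughly, the pattern there looks like the sketch below (helper name and exact layout are illustrative, not verbatim):

import hashlib
import os

def _get_cache_path(filepath):
    # Illustrative sketch of the caching helper in the reference scripts:
    # the dataset directory is hashed and the pre-processed dataset is
    # cached under ~/.torch/vision/..., i.e. *not* under ~/.cache/torch.
    h = hashlib.sha1(filepath.encode()).hexdigest()
    return os.path.expanduser(
        os.path.join("~", ".torch", "vision", "datasets", h[:10] + ".pt")
    )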

However, this directory is not used by PyTorch core, which uses ~/.cache/torch instead. For example, torch.hub caches in ~/.cache/torch/hub. The datasets v2 use the same root folder and will, by default, store datasets in

_HOME = os.path.join(_get_torch_home(), "datasets", "vision")

which expands to ~/.cache/torch/datasets/vision.
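
For reference, the default root can be inspected programmatically via the (private) torch.hub helper and moved with $TORCH_HOME:

import os
from torch.hub import _get_torch_home  # private helper in torch.hub

print(_get_torch_home())  # typically ~/.cache/torch
print(os.path.join(_get_torch_home(), "datasets", "vision"))  # the datasets v2 default

# Setting $TORCH_HOME moves the whole root, e.g.
#   TORCH_HOME=/tmp/torch python script.py  ->  _get_torch_home() returns "/tmp/torch"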

Maybe we could use ~/.cache/torch/cached_datasets or something similar as the cache path in the references?

cc @datumbox @vfdev-5

datumbox (Contributor) commented

Thanks for reporting @pmeier. Ideally we would like to move away from needing to pre-read the dataset and cache it. This is currently necessary because of the way the video clipping class works, but it causes issues with streamed datasets. @YosuaMichael is looking into fixing this.

pmeier (Contributor, Author) commented Oct 10, 2022

@YosuaMichael if we are not going to support caching in the future, feel free to close this issue.

YosuaMichael (Contributor) commented Oct 10, 2022

@datumbox In the case of VideoClipping, we do cache the dataset, because we pre-compute the start and end of all non-sampled clips. However, it seems this caching concept is not specific to video datasets but applies to datasets in general (classification too).

Also, I am not yet sure we will get rid of the cache (for performance reasons) even if we change the clip sampler design, so I think this issue should stay open for now.
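
For concreteness, the cache-or-build pattern being discussed looks roughly like this (a sketch; the cache path and dataset arguments are illustrative):

import os
import torch
import torchvision

cache_path = os.path.expanduser("~/.torch/vision/datasets/kinetics/abc123.pt")
if os.path.exists(cache_path):
    # Reuse the pre-computed clip start/end metadata instead of re-scanning all videos.
    dataset, _ = torch.load(cache_path)
else:
    # Building the dataset scans every video to pre-compute the clips -- expensive, hence the cache.
    dataset = torchvision.datasets.Kinetics400("data/kinetics/train", frames_per_clip=16)
    os.makedirs(os.path.dirname(cache_path), exist_ok=True)
    torch.save((dataset, "data/kinetics/train"), cache_path)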

NicolasHug (Member) commented

> The datasets v2 use the same root folder and will, by default, store datasets in
>
> _HOME = os.path.join(_get_torch_home(), "datasets", "vision")
>
> which expands to ~/.cache/torch/datasets/vision.

This will more likely be ~/.cache/torch/vision/datasets to keep the domains properly separated. FYI, @mthrok, @parmeet, and I had agreed on the following API for setting/getting asset folders, as well as their default paths (at the time we didn't consider a "dataset cache", but it's just another asset type):

def set_home(root, asset="all"):
    # asset can be "all", "datasets", "models", "tutorials", etc.
    # This is placed in the main namespace, e.g. torchvision.set_home() or torchtext.set_home().
    # Note: a value set via set_home(root=...) doesn't persist across Python executions.
    pass

def get_home(asset):
    # Priority (highest = 0):
    # 0. whatever was set earlier in the program through `set_home(root=root, asset=asset)`
    # 1. asset-specific env variable, e.g. $TORCHTEXT_DATASETS_HOME
    # 2. domain-wide env variable + asset name, e.g. $TORCHTEXT_HOME / datasets
    # 3. default, which corresponds to torch.hub._get_torch_home() / DOMAIN_NAME / ASSET_NAME,
    #    typically ~/.cache/torch/vision/datasets
    #                ^^^^^^^^^^^^
    #    This part is returned by _get_torch_home() and can be overridden
    #    with the $TORCH_HOME variable as well.
    pass
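
A minimal sketch of how that priority order could resolve, assuming a module-level override store filled by set_home() (the _HOME_OVERRIDES name and the exact env-variable spelling are assumptions, not a settled API):

import os
from torch.hub import _get_torch_home

_DOMAIN = "vision"    # would be "text", "audio", ... in the other domain libraries
_HOME_OVERRIDES = {}  # hypothetical in-process store, filled by set_home()

def set_home(root, asset="all"):
    # Only affects the current process; does not persist across Python executions.
    _HOME_OVERRIDES[asset] = root

def get_home(asset):
    # 0. whatever was set earlier in the program through set_home()
    override = _HOME_OVERRIDES.get(asset, _HOME_OVERRIDES.get("all"))
    if override is not None:
        return override
    # 1. asset-specific env variable, e.g. $TORCHVISION_DATASETS_HOME
    specific = os.environ.get(f"TORCH{_DOMAIN.upper()}_{asset.upper()}_HOME")
    if specific is not None:
        return specific
    # 2. domain-wide env variable + asset name, e.g. $TORCHVISION_HOME/datasets
    domain_wide = os.environ.get(f"TORCH{_DOMAIN.upper()}_HOME")
    if domain_wide is not None:
        return os.path.join(domain_wide, asset)
    # 3. default: torch home / domain / asset, typically ~/.cache/torch/vision/datasets
    return os.path.join(_get_torch_home(), _DOMAIN, asset)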

So perhaps we'll want to go with ~/.cache/torch/vision/cached_datasets. The difference between "cached_datasets" and "datasets" isn't obvious, but I don't have a much better suggestion.
