
Unify datasets cache path from references with regular PyTorch cache? #6727

Open · pmeier opened this issue Oct 10, 2022 · 4 comments

pmeier (Contributor) commented Oct 10, 2022

In the classification and video_classification references, we cache here:
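Roughly, the pattern there looks like the sketch below (helper name and exact layout are illustrative, not verbatim):

import hashlib
import os

def _get_cache_path(filepath):
    # Illustrative sketch of the caching helper in the reference scripts:
    # the dataset directory is hashed and the pre-processed dataset is
    # cached under ~/.torch/vision/..., i.e. *not* under ~/.cache/torch.
    h = hashlib.sha1(filepath.encode()).hexdigest()
    return os.path.expanduser(
        os.path.join("~", ".torch", "vision", "datasets", h[:10] + ".pt")
    )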

However, this directory is not used by PyTorch core, which uses ~/.cache/torch instead. For example, torch.hub caches in ~/.cache/torch/hub. The datasets v2 use the same root folder and will, by default, store datasets in

_HOME = os.path.join(_get_torch_home(), "datasets", "vision")

which expands to ~/.cache/torch/datasets/vision.
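
For reference, the default root can be inspected programmatically via the (private) torch.hub helper and moved with $TORCH_HOME:

import os
from torch.hub import _get_torch_home  # private helper in torch.hub

print(_get_torch_home())  # typically ~/.cache/torch
print(os.path.join(_get_torch_home(), "datasets", "vision"))  # the datasets v2 default

# Setting $TORCH_HOME moves the whole root, e.g.
#   TORCH_HOME=/tmp/torch python script.py  ->  _get_torch_home() returns "/tmp/torch"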

Maybe we could use ~/.cache/torch/cached_datasets or something similar as the cache path in the references?

cc @datumbox @vfdev-5

datumbox (Contributor) commented

Thanks for reporting @pmeier. Ideally we would like to move away from needing to pre-read the dataset and cache it. This is currently necessary because of the way the video clipping class works, but it causes issues with streamed datasets. @YosuaMichael is looking into fixing this.

pmeier (Contributor, Author) commented Oct 10, 2022

@YosuaMichael if we are not going to support caching in the future, feel free to close this issue.

YosuaMichael (Contributor) commented Oct 10, 2022

@datumbox In the case of VideoClipping, we do cache the dataset, because we pre-compute the start and end of all non-sampled clips. However, it seems this caching concept is not specific to video datasets but applies to datasets in general (classification too).

Also, I am not yet sure we will get rid of the cache (for performance reasons) even if we change the clip sampler design, so I think this issue should stay open for now.
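
For concreteness, the cache-or-build pattern being discussed looks roughly like this (a sketch; the cache path and dataset arguments are illustrative):

import os
import torch
import torchvision

cache_path = os.path.expanduser("~/.torch/vision/datasets/kinetics/abc123.pt")
if os.path.exists(cache_path):
    # Reuse the pre-computed clip start/end metadata instead of re-scanning all videos.
    dataset, _ = torch.load(cache_path)
else:
    # Building the dataset scans every video to pre-compute the clips -- expensive, hence the cache.
    dataset = torchvision.datasets.Kinetics400("data/kinetics/train", frames_per_clip=16)
    os.makedirs(os.path.dirname(cache_path), exist_ok=True)
    torch.save((dataset, "data/kinetics/train"), cache_path)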

NicolasHug (Member) commented

> The datasets v2 use the same root folder and will, by default, store datasets in
>
> _HOME = os.path.join(_get_torch_home(), "datasets", "vision")
>
> which expands to ~/.cache/torch/datasets/vision.

This will more likely be ~/.cache/torch/vision/datasets to keep the domains properly separated. FYI, @mthrok, @parmeet, and I had agreed on the following API for setting/getting asset folders, as well as their default paths (at the time we didn't consider a "dataset cache", but it's just another asset type):

def set_home(root, asset="all"):
    # asset can be "all", "datasets", "models", "tutorials", etc.
    # This is placed in the main namespace, e.g. torchvision.set_home() or torchtext.set_home().
    # Note: a value set via set_home(root=...) doesn't persist across Python executions.
    pass

def get_home(asset):
    # Priority (highest = 0):
    # 0. whatever was set earlier in the program through `set_home(root=root, asset=asset)`
    # 1. asset-specific env variable, e.g. $TORCHTEXT_DATASETS_HOME
    # 2. domain-wide env variable + asset name, e.g. $TORCHTEXT_HOME / datasets
    # 3. default, which corresponds to torch.hub._get_torch_home() / DOMAIN_NAME / ASSET_NAME,
    #    typically ~/.cache/torch/vision/datasets
    #                ^^^^^^^^^^^^
    #    This part is returned by _get_torch_home() and can be overridden
    #    with the $TORCH_HOME variable as well.
    pass
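
A minimal sketch of how that priority order could resolve, assuming a module-level override store filled by set_home() (the _HOME_OVERRIDES name and the exact env-variable spelling are assumptions, not a settled API):

import os
from torch.hub import _get_torch_home

_DOMAIN = "vision"    # would be "text", "audio", ... in the other domain libraries
_HOME_OVERRIDES = {}  # hypothetical in-process store, filled by set_home()

def set_home(root, asset="all"):
    # Only affects the current process; does not persist across Python executions.
    _HOME_OVERRIDES[asset] = root

def get_home(asset):
    # 0. whatever was set earlier in the program through set_home()
    override = _HOME_OVERRIDES.get(asset, _HOME_OVERRIDES.get("all"))
    if override is not None:
        return override
    # 1. asset-specific env variable, e.g. $TORCHVISION_DATASETS_HOME
    specific = os.environ.get(f"TORCH{_DOMAIN.upper()}_{asset.upper()}_HOME")
    if specific is not None:
        return specific
    # 2. domain-wide env variable + asset name, e.g. $TORCHVISION_HOME/datasets
    domain_wide = os.environ.get(f"TORCH{_DOMAIN.upper()}_HOME")
    if domain_wide is not None:
        return os.path.join(domain_wide, asset)
    # 3. default: torch home / domain / asset, typically ~/.cache/torch/vision/datasets
    return os.path.join(_get_torch_home(), _DOMAIN, asset)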

So perhaps we'll want to go with ~/.cache/torch/vision/cached_datasets. The difference between "cached_datasets" and "datasets" isn't obvious, but I don't have a much better suggestion.
