
Optimize import-url on directories, do not download existing files again #4889

Open
@shcheklein

Description


The use case is described here:

https://discord.com/channels/485586884165107732/485596304961962003/773882224248225812
https://discord.com/channels/485586884165107732/485596304961962003/776205388654968892


I have a bucket with images that is updated regularly (new images, etc.):

gs://bucket/dataset/images

I don't want to use `dvc add`, primarily because other projects and websites use this bucket directly.
I still want to use these images in the DVC project: download them, fetch new ones, etc. (basically a sync operation):

```
dvc run -n download_images \
        --outs-persist-no-cache dataset/images \
        "gsutil -m cp -n -r gs://bucket/dataset/images dataset/"
# -m -> parallel (multi-threaded/multi-processed) copy
# -n -> skip files that already exist locally; I wouldn't mind re-downloading
#       (I pay for access anyway and the traffic is free), but skipping is faster
```
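
For reference, the `dvc run` call above should produce roughly this stage in `dvc.yaml` (`--outs-persist-no-cache` maps to `persist: true` and `cache: false` on the output):

```yaml
stages:
  download_images:
    cmd: gsutil -m cp -n -r gs://bucket/dataset/images dataset/
    outs:
      - dataset/images:
          cache: false
          persist: true
```

Because `persist: true` keeps the output directory between runs, `gsutil cp -n` can skip files that are already on disk, which is what makes this workaround behave like an incremental sync.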

This can be done with `dvc import-url`; this is exactly what import-url was made for:

```
dvc import-url gs://bucket/dataset/images
```

The problem is that import-url re-downloads all the images whenever anything in the remote directory changes.
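
For context, `import-url` records the remote location as a single dependency in a `.dvc` file, along these lines (exact field names vary by remote type; the hashes below are made up for illustration):

```yaml
frozen: true
deps:
- md5: 8c2f5e1b9a7d4c3e2f1a0b9c8d7e6f5a.dir   # hypothetical hash covering the whole remote directory
  path: gs://bucket/dataset/images
outs:
- md5: 1f69c66ca19d3dbbdff44ac6e9b9ae52.dir   # hypothetical hash of the local directory manifest
  path: images
```

Since the dependency is tracked as one unit, any change under `gs://bucket/dataset/images` invalidates the whole import, and the update re-downloads every file rather than only the changed ones.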

We should use what the local cache already knows to skip images that are already present and download only the new or changed ones.
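
A minimal sketch of the idea, assuming we have per-file hashes for the remote directory and a content-addressed cache laid out as `<first two hash chars>/<rest>` (the function names and layout here are illustrative, not DVC's actual internals):

```python
from pathlib import Path

def plan_downloads(remote_listing, cache_dir: Path):
    """Return only the remote entries whose content is missing from the cache.

    `remote_listing` is assumed to be an iterable of (relpath, md5) pairs,
    e.g. the per-file hashes from the imported directory's manifest.
    An object with hash "abcdef..." is assumed to live at cache_dir/ab/cdef...
    """
    to_fetch = []
    for relpath, md5 in remote_listing:
        cached = cache_dir / md5[:2] / md5[2:]
        if not cached.exists():  # missing from the cache -> needs a download
            to_fetch.append((relpath, md5))
    return to_fetch

# Example: only files absent from .dvc/cache would be scheduled for download.
listing = [("images/a.jpg", "d41d8cd98f00b204e9800998ecf8427e"),
           ("images/b.jpg", "9e107d9d372bb6826bd81d3542a419d6")]
print(plan_downloads(listing, Path(".dvc/cache")))
```

With a plan like this, only the changed or new objects are transferred, and everything else is linked out of the cache as usual.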


Labels

help wanted, optimize, p2-medium, performance
