Description
The use case is described here:
https://discord.com/channels/485586884165107732/485596304961962003/773882224248225812
https://discord.com/channels/485586884165107732/485596304961962003/776205388654968892
I have a bucket with images that is updated regularly (new images, etc.):
gs://bucket/dataset/images
I don't want to use dvc add, primarily because I have other projects and websites that use this bucket directly.
I still want to use these images in the DVC project: download them, download new ones, etc. (basically do a sync operation):
dvc run -n download_images \
    --outs-persist-no-cache dataset/images \
    "gsutil -m cp -n -r gs://bucket/dataset/images dataset/"
# -m -> multi-threaded/multi-process copy
# -n -> skip files that already exist locally (I wouldn't really care, since I pay for the access anyway and the traffic is free, but it's faster)
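Once this stage exists, the sync could presumably be re-run on demand (a sketch, assuming the stage name above; --force is needed because the stage has no dependencies DVC could detect as changed):
# Re-execute the download stage even though DVC sees nothing as changed
dvc repro --force download_images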
This can be done with import-url. This is what import-url was made for:
dvc import-url gs://bucket/dataset/images
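Data imported this way can be refreshed explicitly; assuming import-url generated an images.dvc file in the workspace, that would look like:
# Re-check the source URL and re-download if the remote data changed
dvc update images.dvc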
The problem is that import-url downloads all the images again on every change in the remote directory. We should somehow use the knowledge DVC already has in the local cache to avoid re-downloading existing images and download only the changes.
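For comparison, gsutil itself already offers an incremental sync that transfers only new or changed objects; the ask here is essentially for import-url to behave the same way, but driven by the hashes DVC already keeps in the local cache (the destination path below is just illustrative):
# Copy only the objects that differ between the bucket and the local directory
gsutil -m rsync -r gs://bucket/dataset/images dataset/images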