Description
Bug Report
import/pull: hangs when pulling many files from GCS and one (or a few) fails
Description
I have a directory in a data registry that is tracked with DVC which contains ~1.6k files. I pushed it to a Google Cloud Storage remote, and now I am trying to either dvc import
or dvc pull
the data from the GCS remote to a new machine. This works well for ~99% of the files, but sometimes a few seem to simply fail silently and never actually download.
When this happens, the entire process hangs once all successful downloads have completed, leaving only the hung/failed/paused (not sure) downloads remaining. They never seem to resume and I am forced to Ctrl+C
out of the process.
This is less of an issue when using dvc pull
as I can simply re-run the command and it will only download the missing files, but with dvc import
I am forced to re-run the process.
Reproduce
dvc init
(data registry project)dvc add <large directory containing many moderately-sized files (each around 30MB)>
dvc remote add <google cloud storage bucket>
dvc push <google cloud storage>
git commit -am "commit msg" && git push
- (new machine or repository)
git init <new-project>
dvc import https://github.com/<data-registry-repo> path/to/large-dir
Most files finish downloading, but some fail (or hang) silently and prevent the entire dvc import
from completing:
I eventually have to Ctrl-C
to exit the process or else it hangs for hours. Here is the full output before and after the Ctrl-C
interrupt:
$ (traits) ➜ cit-sci-traits git:(wip) ✗ dvc import https://github.com/GeoSense-Freiburg/panops-data-registry data/gbif/all_tracheophyta_non-cult_2024-04-10 -o data/raw --remote gcs --force --verbose
2024-05-07 22:49:38,950 DEBUG: v3.50.1 (pip), CPython 3.10.12 on Linux-6.5.0-1020-oem-x86_64-with-glibc2.35
2024-05-07 22:49:38,950 DEBUG: command: /home/daniel/.local/bin/dvc import https://github.com/GeoSense-Freiburg/panops-data-registry data/gbif/all_tracheophyta_non-cult_2024-04-10 -o data/raw --remote
gcs --force --verbose
2024-05-07 22:49:39,321 DEBUG: Removing output 'data/raw/all_tracheophyta_non-cult_2024-04-10' of stage: 'data/raw/all_tracheophyta_non-cult_2024-04-10.dvc'.
2024-05-07 22:49:39,322 DEBUG: Removing '/home/daniel/projects/ufr/cit-sci-traits/data/raw/all_tracheophyta_non-cult_2024-04-10'
Importing 'data/gbif/all_tracheophyta_non-cult_2024-04-10 (https://github.com/GeoSense-Freiburg/panops-data-registry)' -> 'data/raw/all_tracheophyta_non-cult_2024-04-10'
2024-05-07 22:49:39,406 DEBUG: Computed stage: 'data/raw/all_tracheophyta_non-cult_2024-04-10.dvc' md5: '4f2d2baf24cfc67c5da509212106b2af'
2024-05-07 22:49:39,406 DEBUG: 'md5' of stage: 'data/raw/all_tracheophyta_non-cult_2024-04-10.dvc' changed.
2024-05-07 22:49:39,407 DEBUG: Creating external repo https://github.com/GeoSense-Freiburg/panops-data-registry@None
2024-05-07 22:49:39,407 DEBUG: erepo: git clone 'https://github.com/GeoSense-Freiburg/panops-data-registry' to a temporary dir
2024-05-07 22:49:40,906 DEBUG: Lockfile '../../../../../dvc.lock' needs to be updated.
2024-05-07 22:49:41,101 DEBUG: failed to load ('data', 'gbif', 'all_tracheophyta_non-cult_2024-04-10') from storage local (/home/daniel/projects/ufr/cit-sci-traits/.dvc/cache/files/md5) - [Errno 2] No
such file or directory: '/home/daniel/projects/ufr/cit-sci-traits/.dvc/cache/files/md5/ad/a860692d425bd047bd09fd4531fdbf.dir'
Traceback (most recent call last):
File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc_data/index/index.py", line 611, in _load_from_storage
_load_from_object_storage(trie, entry, storage)
File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc_data/index/index.py", line 547, in _load_from_object_storage
obj = Tree.load(storage.odb, root_entry.hash_info, hash_name=storage.odb.hash_name)
File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc_data/hashfile/tree.py", line 193, in load
with obj.fs.open(obj.path, "r") as fobj:
File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc_objects/fs/base.py", line 324, in open
return self.fs.open(path, mode=mode, **kwargs)
File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc_objects/fs/local.py", line 131, in open
return open(path, mode=mode, encoding=encoding) # noqa: SIM115
FileNotFoundError: [Errno 2] No such file or directory: '/home/daniel/projects/ufr/cit-sci-traits/.dvc/cache/files/md5/ad/a860692d425bd047bd09fd4531fdbf.dir'
98% Downloading all_tracheophyta_non-cult_2024-04-10|████████████████████████████████████████████████████████████████████████████████████████████████████████▎ |1.54k/1.58k [54:11<00:53, 1.37s/files]
2024-05-07 23:47:02,580 ERROR: interrupted by the user
Traceback (most recent call last):
File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc_objects/fs/generic.py", line 264, in _get
list(callback.wrap(it))
File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/fsspec/callbacks.py", line 157, in wrap
for item in iterable:
File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc_objects/executors.py", line 37, in imap_unordered
done, tasks = futures.wait(tasks, return_when=futures.FIRST_COMPLETED)phyta_non-cult_2024-04-10.parquet/000000 14.2M/72.8M [13:40<07:01, 146kB/s]
File "/usr/lib/python3.10/concurrent/futures/_base.py", line 307, in wait
waiter.event.wait(timeout)
File "/usr/lib/python3.10/threading.py", line 607, in wait
signaled = self._cond.wait(timeout)
File "/usr/lib/python3.10/threading.py", line 320, in wait
waiter.acquire()
KeyboardInterrupt
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc/cli/__init__.py", line 211, in main
ret = cmd.do_run()
File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc/cli/command.py", line 27, in do_run
return self.run()
File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc/commands/imp.py", line 15, in run
self.repo.imp(
File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc/repo/imp.py", line 44, in imp
return self.imp_url(path, out=out, erepo=erepo, frozen=True, **kwargs)
File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc/repo/__init__.py", line 58, in wrapper
return f(repo, *args, **kwargs)
File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc/repo/scm_context.py", line 143, in run
return method(repo, *args, **kw)
File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc/repo/imp_url.py", line 86, in imp_url
stage.run(jobs=jobs, no_download=no_download)
File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/funcy/decorators.py", line 47, in wrapper
return deco(call, *dargs, **dkwargs)
File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc/stage/decorators.py", line 44, in rwlocked
return call()
File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/funcy/decorators.py", line 68, in __call__
return self._func(*self._args, **self._kwargs)
File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc/stage/__init__.py", line 603, in run
self._sync_import(dry, force, kwargs.get("jobs", None), no_download)
File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/funcy/decorators.py", line 47, in wrapper
return deco(call, *dargs, **dkwargs)
File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc/stage/decorators.py", line 44, in rwlocked
return call() [4/1937]
File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/funcy/decorators.py", line 68, in __call__
return self._func(*self._args, **self._kwargs)
File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc/stage/__init__.py", line 640, in _sync_import
sync_import(self, dry, force, jobs, no_download)
File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc/stage/imports.py", line 61, in sync_import
stage.deps[0].download(stage.outs[0], jobs=jobs)
File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc/dependency/base.py", line 48, in download
fs_download(self.fs, self.fs_path, to.fs_path, jobs=jobs)
File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc/fs/__init__.py", line 75, in download
generic.copy(fs, from_infos, localfs, to_infos, callback=cb, batch_size=jobs)
File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc_objects/fs/generic.py", line 97, in copy
return _get(
File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc_objects/fs/generic.py", line 262, in _get
with executor:
File "/home/daniel/.local/pipx/venvs/dvc/lib/python3.10/site-packages/dvc_objects/executors.py", line 44, in __exit__
self.shutdown(wait=True, cancel_futures=cancel_futures)
File "/usr/lib/python3.10/concurrent/futures/thread.py", line 235, in shutdown
t.join()
File "/usr/lib/python3.10/threading.py", line 1096, in join
self._wait_for_tstate_lock()
File "/usr/lib/python3.10/threading.py", line 1116, in _wait_for_tstate_lock
if lock.acquire(block, timeout):
KeyboardInterrupt
2024-05-07 23:47:02,586 DEBUG: Analytics is enabled.
2024-05-07 23:47:02,644 DEBUG: Trying to spawn ['daemon', 'analytics', '/tmp/tmp50h17rzm', '-v']
2024-05-07 23:47:02,678 DEBUG: Spawned ['daemon', 'analytics', '/tmp/tmp50h17rzm', '-v'] with pid 314508
2024-05-07 23:47:02,682 DEBUG: Removing '/tmp/tmp2elzelv6dvc-clone'
2024-05-07 23:47:02,688 DEBUG: Removing '/tmp/tmp47j66_8ldvc-cache'
1%| |data/gbif/all_tracheophyta_non-cult_2024-04-10/all_tracheophyta_non-cult_2024-04-10.parquet/000048 687k/57.1M [57:19<1:08:10, 14.5kB/s]
4%|▎ |data/gbif/all_tracheophyta_non-cult_2024-04-10/all_tracheophyta_non-cult_2024-04-10.parquet/000023 482k/12.9M [57:19<3:50:00, 947B/s]
15%|█▍ |data/gbif/all_tracheophyta_non-cult_2024-04-10/all_tracheophyta_non-cult_2024-04-10.parquet/000025 5.72M/38.4M [57:20<03:16, 175kB/s]
1%| |data/gbif/all_tracheophyta_non-cult_2024-04-10/all_tracheophyta_non-cult_2024-04-10.parquet/000058 460k/39.3M [57:19<106:54:50, 106B/s]
0%| |data/gbif/all_tracheophyta_non-cult_2024-04-10/all_tracheophyta_non-cult_2024-04-10.parquet/000007 276k/79.7M [57:20<14:38:23, 1.58kB/s]
16%|█▋ |data/gbif/all_tracheophyta_non-cult_2024-04-10/all_tracheophyta_non-cult_2024-04-10.parquet/000005 4.48M/27.2M [57:19<6:26:13, 1.03kB/s]
34%|███▍ |data/gbif/all_tracheophyta_non-cult_2024-04-10/all_tracheophyta_non-cult_2024-04-10.parquet/000010 11.5M/33.8M [57:19<01:55, 202kB/s]
37%|███▋ |data/gbif/all_tracheophyta_non-cult_2024-04-10/all_tracheophyta_non-cult_2024-04-10.parquet/000008 13.5M/36.7M [57:19<03:14, 125kB/s]
20%|██ |data/gbif/all_tracheophyta_non-cult_2024-04-10/all_tracheophyta_non-cult_2024-04-10.parquet/000000 14.9M/72.8M [57:20<06:56, 146kB/s]
3%|▎ |data/gbif/all_tracheophyta_non-cult_2024-04-10/all_tracheophyta_non-cult_2024-04-10.parquet/000055 1.08M/39.3M [57:19<15:07:16, 736B/s]
0%| |data/gbif/all_tracheophyta_non-cult_2024-04-10/all_tracheophyta_non-cult_2024-04-10.parquet/000013 266k/67.4M [57:20<165:16:44, 118B/s]
3%|▎ |data/gbif/all_tracheophyta_non-cult_2024-04-10/all_tracheophyta_non-cult_2024-04-10.parquet/000001 1.96M/60.1M [57:20<22:28:28, 754B/s]
1%| |data/gbif/all_tracheophyta_non-cult_2024-04-10/all_tracheophyta_non-cult_2024-04-10.parquet/000059 251k/39.4M [57:19<9:11:19, 1.24kB/s]
Expected
I expect dvc import
to detect download timeouts and either restart the hanging download or to complete the process with a warning and to then allow for a partial dvc import
to attempt to gather only the missing files instead of requiring a complete re-download of the data.
Environment information
Output of dvc doctor
:
$ dvc doctor
DVC version: 3.50.1 (pip)
-------------------------
Platform: Python 3.10.12 on Linux-6.5.0-1020-oem-x86_64-with-glibc2.35
Subprojects:
dvc_data = 3.15.1
dvc_objects = 5.1.0
dvc_render = 1.0.2
dvc_task = 0.4.0
scmrepo = 3.3.2
Supports:
gs (gcsfs = 2024.3.1),
http (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
https (aiohttp = 3.9.5, aiohttp-retry = 2.8.3),
ssh (sshfs = 2024.4.1)
Config:
Global: /home/daniel/.config/dvc
System: /etc/xdg/xdg-cinnamon/dvc
Cache types: hardlink, symlink
Cache directory: ext4 on /dev/nvme0n1p5
Caches: local
Remotes: local, gs, gs
Workspace directory: ext4 on /dev/nvme0n1p5
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/1545d141359143110c3029289a778e59
Additional Information (if any):