Bug Report
Description
We have about a thousand small files tracked in DVC. We're using the Python API, though the CLI has the same issue. We often need to add or pull a single new file, so we use something like:
from dvc.repo import Repo
repo = Repo("repo_root")
repo.pull("my_file.csv.dvc")
This takes almost 10 seconds, because DVC internally loads all stages before pulling that single file. I'd expect this to be almost instant. Why does it have to go through all the other .dvc files? (My .dvcignore ignores as many files as possible, but the bottleneck is loading the .dvc files anyway.)
Thanks!
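To show where the time goes, here is a minimal profiling sketch using only the standard library's `cProfile`. `profile_call` and `profile_single_pull` are hypothetical helper names introduced for illustration, and the second one assumes DVC is installed and that `repo.pull` accepts the target as its first argument, as in the snippet above.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=15, **kwargs):
    """Run func under cProfile and return (result, report_text).

    The report lists the `top` entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()


def profile_single_pull(repo_root, target):
    """Profile pulling a single target (assumes DVC is installed)."""
    # Imported lazily so profile_call stays usable without DVC.
    from dvc.repo import Repo

    repo = Repo(repo_root)
    return profile_call(repo.pull, target)
```

Calling `profile_single_pull("repo_root", "my_file.csv.dvc")` and printing the second element of the returned tuple should list the functions with the highest cumulative time, which would confirm whether stage loading dominates the run.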
Environment information
Output of dvc doctor:
DVC version: 2.38.1 (pip)
---------------------------------
Platform: Python 3.9.14 on macOS-12.5-x86_64-i386-64bit
Subprojects:
dvc_data = 0.28.4
dvc_objects = 0.14.0
dvc_render = 0.0.15
dvc_task = 0.1.8
dvclive = 1.2.2
scmrepo = 0.1.4
Supports:
http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
s3 (s3fs = 2022.11.0, boto3 = 1.24.59)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk1s5s1
Caches: local
Remotes: s3, https, s3
Workspace directory: apfs on /dev/disk1s5s1
Repo: dvc, git
Additional Information (if any):