pull: pulling single file is really slow when there's hundreds other .dvc files #8768

Open
@Marigold

Bug Report

Description

We have about a thousand small files tracked in DVC. We use the Python API, though the CLI has the same issue. We often need to add or pull a single new file, so we use something like:

from dvc.repo import Repo
repo = Repo("repo_root")
repo.pull("my_file.csv.dvc")

This takes almost 10 seconds because DVC internally loads all stages before pulling that single file. I'd expect this to be almost instant. Why does it have to go through all the other .dvc files? (My .dvcignore ignores as many files as possible, but the bottleneck is loading the .dvc files anyway.)
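
For anyone who wants to confirm where the time goes, here is a minimal profiling sketch using only the standard library. The helper name `profile_call` is hypothetical and not part of DVC; it wraps any callable in `cProfile` and returns a report sorted by cumulative time, which should show whether stage loading dominates the run.

```python
import cProfile
import io
import pstats


def profile_call(fn, *args, top=15, **kwargs):
    """Run fn(*args, **kwargs) under cProfile.

    Returns (result, report), where report is a text table of the
    `top` entries sorted by cumulative time.
    """
    profiler = cProfile.Profile()
    result = profiler.runcall(fn, *args, **kwargs)

    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats(top)
    return result, buf.getvalue()


# Usage with the snippet above (assumes DVC is installed and
# "repo_root" is a DVC repository):
#
#   from dvc.repo import Repo
#   repo = Repo("repo_root")
#   _, report = profile_call(repo.pull, "my_file.csv.dvc")
#   print(report)
```

If the report is dominated by functions that load and parse stage files rather than by network transfer, that would support the observation above.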

Thanks!

Environment information

Output of dvc doctor:

DVC version: 2.38.1 (pip)
---------------------------------
Platform: Python 3.9.14 on macOS-12.5-x86_64-i386-64bit
Subprojects:
	dvc_data = 0.28.4
	dvc_objects = 0.14.0
	dvc_render = 0.0.15
	dvc_task = 0.1.8
	dvclive = 1.2.2
	scmrepo = 0.1.4
Supports:
	http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
	s3 (s3fs = 2022.11.0, boto3 = 1.24.59)
Cache types: reflink, hardlink, symlink
Cache directory: apfs on /dev/disk1s5s1
Caches: local
Remotes: s3, https, s3
Workspace directory: apfs on /dev/disk1s5s1
Repo: dvc, git

Additional Information (if any):

Labels: bug (Did we break something?), performance (improvement over resource / time consuming tasks)
