Bug Report / Feature Request
I have a project with a data directory (150 GB) containing 11 files. I have added the entire directory using dvc add data.
In my workflow, each experiment I want to run depends on only a single file in the data directory.
Running an experiment through the dvc queue will dvc checkout the entire data directory.
It would be much faster if the command ran dvc checkout only for the data files that are actually required by the workflow, as defined in dvc.yaml.
Expected
The data directory in .dvc/tmp/exp/... would contain only the files specified as explicit dependencies in the dvc.yaml workflow file.
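To make the expected behaviour concrete, here is a minimal Python sketch of the selection I have in mind. It is not DVC code and does not use any DVC API: `required_files` is a hypothetical helper, `example_stages` stands in for the already-parsed `stages` section of a dvc.yaml, and the file names are made up. It also assumes dependency paths are written relative to the repository root (no `wdir`).

```python
from pathlib import PurePosixPath


def required_files(stages: dict, tracked_dir: str) -> set[str]:
    """Collect the stage dependencies that live inside tracked_dir.

    `stages` is assumed to mirror the parsed `stages` mapping of a dvc.yaml.
    A dependency on the directory itself is returned as the directory path,
    meaning the whole directory is needed.
    """
    root = PurePosixPath(tracked_dir)
    needed = set()
    for stage in stages.values():
        for dep in stage.get("deps", []):
            # This sketch only handles plain path strings.
            if not isinstance(dep, str):
                continue
            path = PurePosixPath(dep)
            if path == root or root in path.parents:
                needed.add(str(path))
    return needed


# Hypothetical stage definition: one experiment that reads a single data file.
example_stages = {
    "train": {
        "cmd": "python train.py data/file_03.h5",
        "deps": ["data/file_03.h5", "train.py"],
        "outs": ["model.pt"],
    },
}

print(sorted(required_files(example_stages, "data")))
```

With a selection like this, only the listed paths would need to be checked out into the experiment's temporary workspace instead of all 11 files.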
Environment information
Output of dvc doctor:
DVC version: 3.43.1 (pip)
-------------------------
Platform: Python 3.11.7 on Linux-6.5.0-15-generic-x86_64-with-glibc2.35
Subprojects:
    dvc_data = 3.9.0
    dvc_objects = 3.0.6
    dvc_render = 1.0.1
    dvc_task = 0.3.0
    scmrepo = 2.0.2
Supports:
    http (aiohttp = 3.9.1, aiohttp-retry = 2.8.3),
    https (aiohttp = 3.9.1, aiohttp-retry = 2.8.3),
    s3 (s3fs = 2024.2.0, boto3 = 1.34.34)
Config:
    Global: /tikhome/fzills/.config/dvc
    System: /etc/xdg/dvc
Cache types: hardlink, symlink
Cache directory: nfs on 129.69.120.13:/share/work_icp/fzills
Caches: local
Remotes: None
Workspace directory: nfs on 129.69.120.13:/share/work_icp/fzills
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/240bb452ebd33bc5c31f30d78040c7d2