dvc queue start checks out more files than required #10293

Open
@PythonFZ

Description

Bug Report / Feature Request

I have a project with a data directory (150 GB) containing 11 files. I added the entire directory with dvc add data.
In my workflow, each experiment depends on only a single file in the data directory.
Running an experiment through the dvc queue nevertheless runs dvc checkout on the entire data directory.
It would be much faster if the command checked out only the data files that the workflow actually requires, as defined in dvc.yaml.

Expected

The data directory in .dvc/tmp/exp/... would contain only the files specified as explicit dependencies in the dvc.yaml workflow file.
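
To make the expected behaviour concrete, here is a rough sketch of the file selection in plain Python. It is not DVC's implementation and uses no DVC API. The stage name, the file names, and the helper required_files are all made up for illustration; the stages dict only mimics the shape of the stages mapping in a dvc.yaml.

```python
from pathlib import PurePosixPath

# Hypothetical stage definitions, shaped like the "stages" mapping of a dvc.yaml.
# Stage and file names are invented for this example.
stages = {
    "train": {
        "cmd": "python train.py data/file_03.h5",
        "deps": ["train.py", "data/file_03.h5"],
    },
}

# Hypothetical contents of the DVC-tracked directory (11 files, as in the report).
tracked_files = [f"data/file_{i:02d}.h5" for i in range(1, 12)]


def required_files(stages, tracked_files):
    """Return the tracked files that some stage lists as a dependency.

    A file counts as required if a stage depends on the file itself or on
    any directory that contains it (for example the whole "data" directory).
    """
    deps = {
        PurePosixPath(dep)
        for stage in stages.values()
        for dep in stage.get("deps", [])
    }
    needed = []
    for name in tracked_files:
        path = PurePosixPath(name)
        if path in deps or any(parent in deps for parent in path.parents):
            needed.append(name)
    return needed


print(required_files(stages, tracked_files))
```

With the stage above, only data/file_03.h5 is selected. A stage that lists data itself as a dependency would still select all 11 files, so the current behaviour would remain correct for workflows that really need the whole directory.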

Environment information

Output of dvc doctor:

DVC version: 3.43.1 (pip)
-------------------------
Platform: Python 3.11.7 on Linux-6.5.0-15-generic-x86_64-with-glibc2.35
Subprojects:
        dvc_data = 3.9.0
        dvc_objects = 3.0.6
        dvc_render = 1.0.1
        dvc_task = 0.3.0
        scmrepo = 2.0.2
Supports:
        http (aiohttp = 3.9.1, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.9.1, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2024.2.0, boto3 = 1.34.34)
Config:
        Global: /tikhome/fzills/.config/dvc
        System: /etc/xdg/dvc
Cache types: hardlink, symlink
Cache directory: nfs on 129.69.120.13:/share/work_icp/fzills
Caches: local
Remotes: None
Workspace directory: nfs on 129.69.120.13:/share/work_icp/fzills
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/240bb452ebd33bc5c31f30d78040c7d2

Labels

A: experiments (Related to dvc exp)
p2-medium (Medium priority, should be done, but less important)
performance (improvement over resource / time consuming tasks)
