DVC does not cache pipeline output properly #10549

Open
@imyhxy

Bug Report

repro: doesn't cache output properly with reflink setup.

Description

I have 4 pipelines that transform the same input dataset for different tasks. The images are processed the same way in each pipeline, and cache.type is set to reflink. So, according to the documentation, there should be only one copy of the output images on disk. That is not what happens: after repro, the pipeline outputs in the workspace are not reflinked to the cached files.

If I run dvc checkout -R --relink after the pipelines have executed, disk usage returns to normal.

The output of btrfs fi du -s . right after repro:

     Total   Exclusive  Set shared  Filename
  90.50GiB    31.21GiB    29.64GiB  .

The output of btrfs fi du -s . right after dvc checkout -R --relink:

     Total   Exclusive  Set shared  Filename
  90.50GiB     1.07GiB    29.83GiB  .
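
To make the difference between the two reports above easier to read, here is a minimal Python sketch that parses the summary rows and compares the "Exclusive" column. The helper names `parse_size` and `parse_du_summary` are hypothetical and written only for this illustration. They are not part of DVC or btrfs-progs, and the sketch assumes the sizes are printed with binary unit suffixes as in the rows above.

```python
import re

# Binary unit multipliers, matching suffixes such as "90.50GiB" in the rows above.
UNITS = {"B": 1, "KiB": 1024, "MiB": 1024**2, "GiB": 1024**3, "TiB": 1024**4}

SIZE_RE = re.compile(r"^([0-9.]+)([KMGT]iB|B)$")


def parse_size(text):
    """Convert a size such as '31.21GiB' to a float number of bytes."""
    match = SIZE_RE.match(text)
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    value, unit = match.groups()
    return float(value) * UNITS[unit]


def parse_du_summary(line):
    """Split one data row of `btrfs fi du -s` into its three size columns."""
    total, exclusive, shared, _filename = line.split(None, 3)
    return {
        "total": parse_size(total),
        "exclusive": parse_size(exclusive),
        "set_shared": parse_size(shared),
    }


# Rows copied from the two reports above.
after_repro = parse_du_summary("  90.50GiB    31.21GiB    29.64GiB  .")
after_checkout = parse_du_summary("  90.50GiB     1.07GiB    29.83GiB  .")

# Exclusive data is stored only once and not shared with any other file,
# so a drop here means the workspace copies now share extents with the cache.
reclaimed = after_repro["exclusive"] - after_checkout["exclusive"]
print(f"exclusive data after repro:    {after_repro['exclusive'] / 1024**3:.2f} GiB")
print(f"exclusive data after checkout: {after_checkout['exclusive'] / 1024**3:.2f} GiB")
print(f"difference:                    {reclaimed / 1024**3:.2f} GiB")
```

The total is identical in both reports, so only the exclusive figure changes: roughly 30 GiB of data that was stored separately after repro becomes shared after the checkout.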

Reproduce

Expected

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 3.55.2 (deb)
-------------------------
Platform: Python 3.10.8 on Linux-6.7.10-060710-generic-x86_64-with-glibc2.39
Subprojects:
	
Supports:
	azure (adlfs = 2024.7.0, knack = 0.12.0, azure-identity = 1.17.1),
	gdrive (pydrive2 = 1.20.0),
	gs (gcsfs = 2024.6.1),
	hdfs (fsspec = 2024.6.1, pyarrow = 17.0.0),
	http (aiohttp = 3.10.5, aiohttp-retry = 2.8.3),
	https (aiohttp = 3.10.5, aiohttp-retry = 2.8.3),
	oss (ossfs = 2023.12.0),
	s3 (s3fs = 2024.6.1, boto3 = 1.35.7),
	ssh (sshfs = 2024.6.0),
	webdav (webdav4 = 0.10.0),
	webdavs (webdav4 = 0.10.0),
	webhdfs (fsspec = 2024.6.1)
Config:
	Global: /home/fkwong/.config/dvc
	System: /etc/xdg/xdg-ubuntu/dvc
Cache types: reflink, hardlink, symlink
Cache directory: btrfs on /dev/sda1
Caches: local
Remotes: None
Workspace directory: btrfs on /dev/sda1
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/055b5579042ae6f272efc40fc232cbdd

Additional Information (if any):

Metadata

    Labels

    A: data-management (Related to dvc add/checkout/commit/move/remove)
    A: pipelines (Related to the pipelines feature)
    optimize (Optimizes DVC)
