Bug Report
repro: doesn't cache output properly with reflink setup
Description
I have 4 pipelines that transform the same input dataset for different tasks. The images are processed the same way in each pipeline, and cache.type is set to reflink. So, according to the documentation, there should be only one copy of the output images on disk. But that is not what happens: after repro, none of the pipeline outputs are reflinked to the cached files.
If I run dvc checkout -R --relink after the pipelines have executed, the disk usage behaves normally.
The output of btrfs fi du -s . right after repro:
     Total   Exclusive  Set shared  Filename
  90.50GiB    31.21GiB    29.64GiB  .
The output of btrfs fi du -s . right after dvc checkout -R --relink:
     Total   Exclusive  Set shared  Filename
  90.50GiB     1.07GiB    29.83GiB  .
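For comparing the two measurements in a script, here is a minimal sketch that pulls the Exclusive column out of the btrfs fi du -s summary. The helper name exclusive_of is hypothetical (it is not part of DVC or btrfs-progs), and it assumes the two-line layout shown above: a header row followed by one summary row.

```shell
# Hypothetical helper: read `btrfs fi du -s` output on stdin and print
# the Exclusive column (second field) of the summary row (second line).
exclusive_of() {
    awk 'NR == 2 { print $2 }'
}

# Fed with the two summary rows quoted in this report:
printf '%s\n' 'Total Exclusive Set shared Filename' '90.50GiB 31.21GiB 29.64GiB .' | exclusive_of   # prints 31.21GiB
printf '%s\n' 'Total Exclusive Set shared Filename' '90.50GiB 1.07GiB 29.83GiB .' | exclusive_of    # prints 1.07GiB
```

On a live repository the same helper would be used as btrfs fi du -s . | exclusive_of. The roughly 30 GiB gap between the two values is the data that was copied instead of reflinked.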
Reproduce
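The sequence described above can be sketched as one shell function. This is an illustration, not necessarily the exact commands that were run: the function name is hypothetical, the cache.type setting is shown via dvc config, the pipeline stages themselves are not shown, and --relink is the dvc checkout option that recreates workspace links from the cache. It assumes it is called from the root of a DVC repository on a btrfs filesystem.

```shell
# Hypothetical wrapper around the steps in this report; defining it runs nothing.
repro_and_measure() {
    dvc config cache.type reflink   # ask DVC to link outputs to the cache via reflink
    dvc repro                       # run all pipeline stages
    btrfs filesystem du -s .        # in this report: Exclusive was 31.21GiB here
    dvc checkout -R --relink        # recreate workspace links from the cache
    btrfs filesystem du -s .        # in this report: Exclusive dropped to 1.07GiB
}
```

If reflinks were applied during repro, the two btrfs summaries would be expected to show a similar Exclusive value.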
Expected
Environment information
Output of dvc doctor:
$ dvc doctor
DVC version: 3.55.2 (deb)
-------------------------
Platform: Python 3.10.8 on Linux-6.7.10-060710-generic-x86_64-with-glibc2.39
Subprojects:
Supports:
azure (adlfs = 2024.7.0, knack = 0.12.0, azure-identity = 1.17.1),
gdrive (pydrive2 = 1.20.0),
gs (gcsfs = 2024.6.1),
hdfs (fsspec = 2024.6.1, pyarrow = 17.0.0),
http (aiohttp = 3.10.5, aiohttp-retry = 2.8.3),
https (aiohttp = 3.10.5, aiohttp-retry = 2.8.3),
oss (ossfs = 2023.12.0),
s3 (s3fs = 2024.6.1, boto3 = 1.35.7),
ssh (sshfs = 2024.6.0),
webdav (webdav4 = 0.10.0),
webdavs (webdav4 = 0.10.0),
webhdfs (fsspec = 2024.6.1)
Config:
Global: /home/fkwong/.config/dvc
System: /etc/xdg/xdg-ubuntu/dvc
Cache types: reflink, hardlink, symlink
Cache directory: btrfs on /dev/sda1
Caches: local
Remotes: None
Workspace directory: btrfs on /dev/sda1
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/055b5579042ae6f272efc40fc232cbdd
Additional Information (if any):