Bug Report
Description
I use git and dvc to manage my training datasets, which consist of thousands of jsonl files.
After I modify several jsonl files, I run dvc status && dvc commit. The dvc status operation completes quickly (I know dvc hashes a file only once until it is modified again; here only a few jsonl files were modified, so dvc status costs little). However, the dvc commit operation takes a long time.
While dvc commit is executing, I see lots of "Checking out xxx/xxx/xxx.jsonl" messages in the terminal, and I believe those jsonl files were not modified. Why does dvc need to check out files that are not modified?
Expected
Assume two files, a.jsonl and b.jsonl, are modified. I think dvc commit should be equivalent to dvc add a.jsonl b.jsonl. However, it seems that dvc commit checks out all files tracked by dvc.
I expect dvc commit to skip files that are not modified, so that it completes quickly.
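To make the expectation concrete, here is a minimal sketch of the behavior I have in mind. It is not how DVC is implemented; the names `file_state`, `modified_files`, and the `known_state` dictionary are hypothetical and only stand in for whatever state a tool keeps between runs. The idea is that only files whose cheap fingerprint (mtime and size) changed would need any further work.

```python
import os
import tempfile


def file_state(path):
    """Return a cheap fingerprint (mtime in ns, size in bytes) for a file."""
    st = os.stat(path)
    return (st.st_mtime_ns, st.st_size)


def modified_files(paths, known_state):
    """Return the paths whose fingerprint differs from the recorded one.

    known_state is a hypothetical dict mapping path -> (mtime_ns, size).
    Paths missing from known_state are treated as modified.
    """
    return [p for p in paths if known_state.get(p) != file_state(p)]


def demo():
    """Create three files, modify two, and report which ones changed."""
    with tempfile.TemporaryDirectory() as root:
        paths = []
        for name in ("a.jsonl", "b.jsonl", "c.jsonl"):
            path = os.path.join(root, name)
            with open(path, "w") as f:
                f.write('{"id": 1}\n')
            paths.append(path)

        # Record the state of every file, as if they had just been committed.
        known_state = {p: file_state(p) for p in paths}

        # Append a record to a.jsonl and b.jsonl only; c.jsonl stays untouched.
        for path in paths[:2]:
            with open(path, "a") as f:
                f.write('{"id": 2}\n')

        return [os.path.basename(p) for p in modified_files(paths, known_state)]


if __name__ == "__main__":
    print(demo())
```

In this sketch only a.jsonl and b.jsonl are reported, so any expensive step (hashing, linking, checking out) would be limited to those two files, which is what I would expect from dvc commit.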
Environment information
Output of dvc doctor:
$ dvc doctor
DVC version: 3.55.2 (pip)
-------------------------
Platform: Python 3.9.19 on Linux-5.14.0-284.25.1.el9_2.x86_64-x86_64-with-glibc2.31
Subprojects:
    dvc_data = 3.16.5
    dvc_objects = 5.1.0
    dvc_render = 1.0.2
    dvc_task = 0.4.0
    scmrepo = 3.3.7
Supports:
    http (aiohttp = 3.10.5, aiohttp-retry = 2.8.3),
    https (aiohttp = 3.10.5, aiohttp-retry = 2.8.3),
    s3 (s3fs = 2024.6.1, boto3 = 1.35.7)
Config:
    Global: /mnt/afs/jiangtan/.config/dvc
    System: /etc/xdg/dvc
Cache types: hardlink, symlink
Cache directory: fuse.quarkfs_client on quarkfs_client
Caches: local
Remotes: s3
Workspace directory: fuse.quarkfs_client on quarkfs_client
Repo: dvc, git
Repo.site_cache_dir: /path/to/repo/.dvc/site_cache_dir/repo/eddf3641719990517f0cfc808ea33376