Dvc commit operation is too slow #10653

Open
@jiangtann

Bug Report

Description

I use git and dvc to manage my training datasets, which consist of thousands of jsonl files.

After modifying several jsonl files, I run dvc status && dvc commit. The dvc status step completes quickly (I know dvc hashes a file only once until it is modified again, and only a few jsonl files were modified here, so dvc status costs little). However, the dvc commit step takes a long time.

While dvc commit is running, I see many "Checking out xxx/xxx/xxx.jsonl" messages in the terminal, and I believe those jsonl files were not modified. Why does dvc need to check out files that were not modified?

Expected

Assume two files, a.jsonl and b.jsonl, are modified. I think dvc commit should then be equivalent to dvc add a.jsonl b.jsonl. However, it seems that dvc commit checks out all files tracked by dvc.

I expect dvc commit to skip files that are not modified, so that it completes quickly.
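
To make the expectation concrete, here is a minimal Python sketch of the skipping behavior I have in mind. It is not DVC's actual implementation: the helper names (snapshot, modified_since, demo) are hypothetical, and comparing (mtime, size) pairs is only an assumed stand-in for whatever state DVC really keeps to decide whether a file needs re-hashing.

```python
import os
import tempfile
from pathlib import Path


def snapshot(paths):
    """Record (mtime_ns, size) per path; a stand-in for a hash-state cache."""
    state = {}
    for p in paths:
        st = os.stat(p)
        state[str(p)] = (st.st_mtime_ns, st.st_size)
    return state


def modified_since(paths, state):
    """Return only the paths whose (mtime_ns, size) differs from the snapshot."""
    changed = []
    for p in paths:
        st = os.stat(p)
        if state.get(str(p)) != (st.st_mtime_ns, st.st_size):
            changed.append(str(p))
    return changed


def demo():
    """Create many small jsonl files, modify two, and report which ones changed."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        files = [root / "a.jsonl", root / "b.jsonl"]
        files += [root / f"unchanged_{i}.jsonl" for i in range(100)]
        for f in files:
            f.write_text('{"x": 1}\n')

        state = snapshot(files)

        # Modify only a.jsonl and b.jsonl (the size changes, so the check
        # does not depend on the filesystem's mtime granularity).
        for f in files[:2]:
            f.write_text('{"x": 1}\n{"x": 2}\n')

        # Only these paths would need to be processed by the expensive step.
        return [Path(p).name for p in modified_since(files, state)]


if __name__ == "__main__":
    print(demo())
```

In this sketch only a.jsonl and b.jsonl are selected for further work, and the 100 unchanged files are skipped. That is the behavior I would expect from dvc commit as well.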

Environment information

Output of dvc doctor:

$ dvc doctor
DVC version: 3.55.2 (pip)
-------------------------
Platform: Python 3.9.19 on Linux-5.14.0-284.25.1.el9_2.x86_64-x86_64-with-glibc2.31
Subprojects:
        dvc_data = 3.16.5
        dvc_objects = 5.1.0
        dvc_render = 1.0.2
        dvc_task = 0.4.0
        scmrepo = 3.3.7
Supports:
        http (aiohttp = 3.10.5, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.10.5, aiohttp-retry = 2.8.3),
        s3 (s3fs = 2024.6.1, boto3 = 1.35.7)
Config:
        Global: /mnt/afs/jiangtan/.config/dvc
        System: /etc/xdg/dvc
Cache types: hardlink, symlink
Cache directory: fuse.quarkfs_client on quarkfs_client
Caches: local
Remotes: s3
Workspace directory: fuse.quarkfs_client on quarkfs_client
Repo: dvc, git
Repo.site_cache_dir: /path/to/repo/.dvc/site_cache_dir/repo/eddf3641719990517f0cfc808ea33376

Labels: performance (improvement over resource / time consuming tasks), triage (needs to be triaged)
