
DVCFileSystem.find very slow with many files #8895

Open
@grahameth

Bug Report

Calling find on DVCFileSystem with a repo of about 300,000 files takes about 70 seconds.

Description

Reproduce

This example retrieves all file paths and hashes in a given dataset directory:

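from dvc.api import DVCFileSystem

# repo_url and rev are placeholders for the repository URL and Git revision.
fs = DVCFileSystem(repo_url, rev=rev)
# With detail=True, find() returns a dict mapping each path to its info dict.
files = fs.find("/Data", detail=True, dvc_only=True)
remote_files = {}
for file, info in files.items():
    # The MD5 of each DVC-tracked file is stored under info["dvc_info"].
    remote_files[file] = info["dvc_info"]["md5"]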

Given that all of this information is stored in a single internal JSON file, I'd expect this to take about as long as downloading that file, plus a little housekeeping.
On a repo with 300,000 files, this takes about 70 seconds.
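To make timings like this reproducible, a small wall-clock helper can wrap either approach. This is only a sketch: `timed` is a hypothetical helper name (not part of DVC), and the workload below is a stand-in that would be replaced with something like `lambda: fs.find("/Data", detail=True, dvc_only=True)`.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[[], Any]) -> Tuple[Any, float]:
    """Run fn once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# Stand-in workload (assumption); swap in the DVCFileSystem call to measure it.
result, seconds = timed(lambda: sum(range(1000)))
print(f"took {seconds:.3f}s")
```

A single run includes network time, so repeating the measurement a few times gives a fairer comparison.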

Expected

This example produces the same result, but accesses internal data structures to make it faster:

import os

from dvc.api import DVCFileSystem
from dvc_data.hashfile.tree import Tree

fs = DVCFileSystem(repo_url, rev=rev)
# Note: the underscore-prefixed calls below are private internals and may change between versions.
key = fs._get_key_from_relative("Data")
_, dvc_fs, subkey = fs._get_subrepo_info(key)
entry = dvc_fs.fs.index._trie.get(subkey)
# Load the directory listing (a Tree) for this entry.
entry.obj = Tree.load(entry.remote, entry.hash_info)

remote_files = {}
for ikey, (_, hash_info) in entry.obj.iteritems():
    file = os.path.join(subkey[0], *ikey)
    remote_files[file] = hash_info.value

On a repo with 300,000 files, this takes about 10 seconds. Most of that time is spent downloading.
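To check that both approaches really yield the same listing, the two `remote_files` dicts can be compared after normalizing their keys. The sketch below assumes the keys may differ only in path separators and a leading slash (for example, `os.path.join` produces backslashes on Windows); `normalize` and `diff_listings` are hypothetical helper names, and the sample data is made up for illustration.

```python
from typing import Dict, Set, Tuple


def normalize(path: str) -> str:
    """Use forward slashes and drop any leading slash."""
    return path.replace("\\", "/").lstrip("/")


def diff_listings(
    a: Dict[str, str], b: Dict[str, str]
) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (paths only in a, paths only in b, paths whose hashes differ)."""
    na = {normalize(p): h for p, h in a.items()}
    nb = {normalize(p): h for p, h in b.items()}
    only_a = set(na) - set(nb)
    only_b = set(nb) - set(na)
    changed = {p for p in set(na) & set(nb) if na[p] != nb[p]}
    return only_a, only_b, changed


# Illustrative data only; real inputs would be the two remote_files dicts.
slow = {"/Data/a.txt": "111", "/Data/b.txt": "222"}
fast = {"Data\\a.txt": "111", "Data\\b.txt": "333"}
only_slow, only_fast, changed = diff_listings(slow, fast)
print(only_slow, only_fast, changed)
```

Three empty sets would indicate that the two listings agree.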

A factor of 7 seems like too much overhead to me. Hopefully, this can be improved.

DVC version: 2.38.1 (conda)
---------------------------------
Platform: Python 3.9.15 on Windows-10-10.0.19045-SP0
Subprojects:
        dvc_data = 0.28.4
        dvc_objects = 0.14.0
        dvc_render = 0.0.15
        dvc_task = 0.1.8
        dvclive = 1.3.0
        scmrepo = 0.1.4
Supports:
        http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
        ssh (sshfs = 0.0.0)

Labels: A: api (related to the dvc.api), performance (improvement over resource / time consuming tasks)
