Bug Report
Calling find on DVCFileSystem with a repo of about 300,000 files takes about 70 seconds.
Description
Reproduce
This example retrieves all file paths and hashes in a given dataset directory:
from dvc.api import DVCFileSystem
fs = DVCFileSystem(repo_url, rev=rev)
files = fs.find("/Data", detail=True, dvc_only=True)
remote_files = {}
for file, info in files.items():
    hash = info['dvc_info']['md5']
    remote_files[file] = hash
Since all of this information is stored in a single internal JSON file, I'd expect this to take about as long as downloading that file, plus a little housekeeping.
On a repo with 300,000 files, this takes about 70 seconds.
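For anyone who wants to reproduce the measurement, here is a minimal timing sketch. The `timed` helper is a hypothetical name introduced only for illustration, and the stand-in workload exists so that the snippet runs without a DVC repo. The commented lines show how it could wrap the `find` call above, assuming `repo_url` and `rev` are defined as in the example.

```python
import time
from typing import Any, Callable, Tuple


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


# Stand-in workload so the sketch runs without a DVC repo.
demo_result, demo_seconds = timed(sorted, [3, 1, 2])
print(f"sorted() took {demo_seconds:.6f}s")

# With a real repo (repo_url and rev as in the example above):
# fs = DVCFileSystem(repo_url, rev=rev)
# files, seconds = timed(fs.find, "/Data", detail=True, dvc_only=True)
# print(f"find() took {seconds:.1f}s for {len(files)} files")
```

Wall-clock time measured this way includes the download, so the number depends on the network as well as on DVC itself.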
Expected
This example does the same, but accesses internal data structures to make it faster:
import os

from dvc.api import DVCFileSystem
from dvc_data.hashfile.tree import Tree

fs = DVCFileSystem(repo_url, rev=rev)
# Look up the index entry for the "Data" directory (private APIs).
key = fs._get_key_from_relative("Data")
_, dvc_fs, subkey = fs._get_subrepo_info(key)
entry = dvc_fs.fs.index._trie.get(subkey)
# Load the directory's tree object, which lists every file and its hash.
entry.obj = Tree.load(entry.remote, entry.hash_info)
remote_files = {}
for ikey, (_, hash_info) in entry.obj.iteritems():
    file = os.path.join(subkey[0], *ikey)
    hash = hash_info.value
    remote_files[file] = hash
On a repo with 300,000 files, this takes about 10 seconds. Most of that time is spent downloading.
A factor of 7 seems like too much overhead to me. Hopefully, this can be improved.
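To narrow down where the extra time goes, one option is to run the slow call under the standard-library profiler. The sketch below is only a suggestion: `profile_call` and `demo_workload` are hypothetical names used for illustration, and the workload is a stand-in so that the snippet runs without a DVC repo.

```python
import cProfile
import io
import pstats
from typing import Any, Callable


def profile_call(fn: Callable[..., Any], *args: Any, top: int = 15, **kwargs: Any) -> str:
    """Run fn under cProfile and return a report of the top entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        fn(*args, **kwargs)
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


def demo_workload() -> int:
    """Stand-in for the slow call; sums squares so the profiler has something to record."""
    return sum(i * i for i in range(10_000))


report = profile_call(demo_workload)
print(report)

# With a real repo (fs as in the examples above):
# print(profile_call(fs.find, "/Data", detail=True, dvc_only=True))
```

Sorting by cumulative time puts the outermost expensive calls first, which should show whether the time is spent in per-file bookkeeping or in the download.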
DVC version: 2.38.1 (conda)
---------------------------------
Platform: Python 3.9.15 on Windows-10-10.0.19045-SP0
Subprojects:
    dvc_data = 0.28.4
    dvc_objects = 0.14.0
    dvc_render = 0.0.15
    dvc_task = 0.1.8
    dvclive = 1.3.0
    scmrepo = 0.1.4
Supports:
    http (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
    https (aiohttp = 3.8.3, aiohttp-retry = 2.8.3),
    ssh (sshfs = 0.0.0)