perf(ignore): don't search subdirs for git/ignore files if max depth is reached #2565

Open
@sigmaSd

When looking for git/ignore files (the ones controlled by the .ignore, .git_ignore and .git_exclude builder options), the crate should not search a subdirectory that is already at the max depth

example:

mkdir a
for ((i=1; i<=100; i++)); do mktemp --directory a/XXX; done

a.rs

use ignore::WalkBuilder;

fn main() {
    let arg = std::env::args().nth(1).unwrap();
    let files: Vec<_> = WalkBuilder::new(&arg)
        .hidden(false)
        .follow_links(false)
        .max_depth(Some(1)) // we only scan down to depth 1
        .build()
        .collect();
}
cargo b --release
strace target/release/a a

you can see that we make a lot of unneeded syscalls, because we look for and read git/ignore files even though we won't search inside those directories

openat(AT_FDCWD, "/home/mrcool/dev/rust/others/ripgrep/crates/ignore/a/Vwe", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 4
newfstatat(4, "", {st_mode=S_IFDIR|0700, st_size=4096, ...}, AT_EMPTY_PATH) = 0
statx(AT_FDCWD, "/home/mrcool/dev/rust/others/ripgrep/crates/ignore/a/Vwe/.git", AT_STATX_SYNC_AS_STAT, STATX_ALL, 0x7fffdb02f030) = -1 ENOENT (No such file or directory)
statx(AT_FDCWD, "/home/mrcool/dev/rust/others/ripgrep/crates/ignore/a/Vwe/.ignore", AT_STATX_SYNC_AS_STAT, STATX_ALL, 0x7fffdb02ed60) = -1 ENOENT (No such file or directory)
statx(AT_FDCWD, "/home/mrcool/dev/rust/others/ripgrep/crates/ignore/a/Vwe/.gitignore", AT_STATX_SYNC_AS_STAT, STATX_ALL, {stx_mask=STATX_ALL|STATX_MNT_ID, stx_attributes=0, stx_mode=S_IFREG|0644, stx_size=0, ...}) = 0
openat(AT_FDCWD, "/home/mrcool/dev/rust/others/ripgrep/crates/ignore/a/Vwe/.gitignore", O_RDONLY|O_CLOEXEC) = 5
close(5)                                = 0
statx(AT_FDCWD, "/home/mrcool/dev/rust/others/ripgrep/crates/ignore/a/Vwe/.git/info/exclude", AT_STATX_SYNC_AS_STAT, STATX_ALL, 0x7fffdb02ed60) = -1 ENOENT (No such file or directory)
close(4)                                = 0

this is the best case (empty dirs); in practice it's worse, because directories can contain those files, and the crate then has to issue read syscalls for them

without that search, it would be only 3 syscalls per directory
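
To put rough numbers on that, here is a back-of-the-envelope sketch. The per-directory counts (9 syscalls with the ignore-file probes, 3 without) are simply read off the single trace above, and `estimated_saving` is a hypothetical helper name; real counts depend on which of those files actually exist.

```rust
/// Syscalls seen for one subdirectory in the trace above.
/// Assumption: every subdirectory behaves like `a/Vwe`.
const WITH_IGNORE_LOOKUP: u64 = 9;

/// Syscalls left once the ignore-file probes are skipped
/// (presumably the `openat`, `newfstatat` and `close` on the directory itself).
const WITHOUT_IGNORE_LOOKUP: u64 = 3;

/// Hypothetical helper: estimated number of syscalls saved when `dirs`
/// subdirectories sit at the max depth and are no longer probed.
fn estimated_saving(dirs: u64) -> u64 {
    dirs * (WITH_IGNORE_LOOKUP - WITHOUT_IGNORE_LOOKUP)
}

fn main() {
    let dirs = 100; // the example above creates 100 directories
    println!(
        "with lookup: {}, without: {}, saved: {}",
        dirs * WITH_IGNORE_LOOKUP,
        dirs * WITHOUT_IGNORE_LOOKUP,
        estimated_saving(dirs)
    );
}
```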

suggestion:
I suggest not searching for those files inside subdirs once we reach the max depth
Unless I'm missing something, it should keep the exact same functionality while giving the full performance boost
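
A minimal sketch of the check I have in mind. This is not the crate's actual code: `needs_ignore_lookup` is a hypothetical name, and I'm assuming the walk root counts as depth 0 and that `max_depth` is the same `Option<usize>` passed to `WalkBuilder::max_depth`.

```rust
/// Hypothetical guard (not the `ignore` crate's real implementation):
/// should ignore files be looked up for a directory found at `depth`?
///
/// Assumptions: the walk root is depth 0, and `max_depth` is the value
/// given to `WalkBuilder::max_depth`.
fn needs_ignore_lookup(depth: usize, max_depth: Option<usize>) -> bool {
    match max_depth {
        // A directory at the depth limit is yielded but never descended
        // into, so the rules in its own ignore files can never apply to
        // anything the walker returns.
        Some(max) => depth < max,
        // No limit: every directory may be descended into.
        None => true,
    }
}

fn main() {
    // With max_depth(Some(1)): the root still needs its ignore files,
    // its direct subdirectories do not.
    assert!(needs_ignore_lookup(0, Some(1)));
    assert!(!needs_ignore_lookup(1, Some(1)));
    // Without a depth limit nothing changes.
    assert!(needs_ignore_lookup(5, None));
}
```

The root's ignore files are still read in this sketch, since they decide which depth-1 entries are yielded; only the directories sitting exactly at the limit are skipped.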

motivation:
you can check out helix-editor/helix#7715, where disabling this behavior reduced the walk from ~4000 syscalls to ~900 and from 20 seconds to 100 ms on my old PC with an HDD
so that's ~3000 wasted syscalls for ~300 files

Labels: help wanted (others are encouraged to work on this issue)
