Improve the performance of PyLinter._discover_files()'s use of the os.walk() method.

### Current problem

During the early execution of Pylint, the method `PyLinter._discover_files()` uses [os.walk()](https://docs.python.org/3/library/os.html#os.walk) to potentially traverse an entire project directory tree. While a freshly cloned repository may be only a few dozen or so directories, installing dependencies using package managers such as `pip` to produce a python virtual environment, or `npm` to install JavaScript/CSS dependencies can cause that number to explode. For examples, the Pylint repository itself (a rather large project) has 875 directories before even minimal dependencies are installed, and explodes to over 4600 afterwards. And for a project of my own, it goes from 61 to nearly 5600. The traversal of an entire directory tree such as this can take a non-trivial amount of time, even on a powerful machine, depending on load. And while a second in the execution time of Pylint may not seem like much, it adds up.

### Desired solution

When traversing a project such as is done when you execute `pylint .`, the `_discover_files()` method should not traverse down into directories given in at least the ignore list, if not also the ignore patterns and ignore paths.

Fortunately, `os.walk()` allows us to alter the behaviour of this most expensive traversal. By **properly** manipulating the `dirnames` list returned by `os.walk()`, combined with the use of `topdown=True`, we can cut out the traversal of entire directory trees such as `.venv`, `.git` or `node_packages`. Also fortunately, the call to `os.walk()` in `asteroid.modutils` is given small directory trees as returned by `_discover_files()`, and as such, does not need similar optimization.

### Additional context

I started a thread on discord, and discussed this with @Pierre-Sassoulas, who showed interest in this optimization. This is closely intertwined with the changes made for PR #10970, for Issue #10969. I have changes which are fully tested, which add this optimization, and actually test whether or not the subtrees are visited or not. This did require adding `pyfakefs` as a requirement for testing, as a regular mock would drive the traversal unless the mock was made overly complex. I have moved and revised the tests for that PR/Issue into a new test file, but left `test_recursive_ignore` where it was, since it uses fixtures common to the other tests.

I have also been sure to take into account #9192 raised by @cordery, and the associated [Cpython issue](https://github.com/python/cpython/issues/111321) he raised, and while coding it, verified that leaving off the `list()` wrapper still results in a false-negative, just as a FYI.

I will be submitting the changes for this as a PR, perhaps as a part of #10970, or perhaps separately, after final discussion on discord.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Improve the performance of PyLinter._discover_files()'s use of the os.walk() method. #11005

Current problem

Desired solution

Additional context

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Improve the performance of PyLinter._discover_files()'s use of the os.walk() method. #11005

Description

Current problem

Desired solution

Additional context

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions