Skip to content

Improve the performance of PyLinter._discover_files()'s use of the os.walk() method. #11005

@cinnion

Description

@cinnion

Current problem

During the early execution of Pylint, the method PyLinter._discover_files() uses os.walk() to potentially traverse an entire project directory tree. While a freshly cloned repository may be only a few dozen or so directories, installing dependencies using package managers such as pip to produce a python virtual environment, or npm to install JavaScript/CSS dependencies can cause that number to explode. For examples, the Pylint repository itself (a rather large project) has 875 directories before even minimal dependencies are installed, and explodes to over 4600 afterwards. And for a project of my own, it goes from 61 to nearly 5600. The traversal of an entire directory tree such as this can take a non-trivial amount of time, even on a powerful machine, depending on load. And while a second in the execution time of Pylint may not seem like much, it adds up.

Desired solution

When traversing a project such as is done when you execute pylint ., the _discover_files() method should not traverse down into directories given in at least the ignore list, if not also the ignore patterns and ignore paths.

Fortunately, os.walk() allows us to alter the behaviour of this most expensive traversal. By properly manipulating the dirnames list returned by os.walk(), combined with the use of topdown=True, we can cut out the traversal of entire directory trees such as .venv, .git or node_packages. Also fortunately, the call to os.walk() in asteroid.modutils is given small directory trees as returned by _discover_files(), and as such, does not need similar optimization.

Additional context

I started a thread on discord, and discussed this with @Pierre-Sassoulas, who showed interest in this optimization. This is closely intertwined with the changes made for PR #10970, for Issue #10969. I have changes which are fully tested, which add this optimization, and actually test whether or not the subtrees are visited or not. This did require adding pyfakefs as a requirement for testing, as a regular mock would drive the traversal unless the mock was made overly complex. I have moved and revised the tests for that PR/Issue into a new test file, but left test_recursive_ignore where it was, since it uses fixtures common to the other tests.

I have also been sure to take into account #9192 raised by @cordery, and the associated Cpython issue he raised, and while coding it, verified that leaving off the list() wrapper still results in a false-negative, just as a FYI.

I will be submitting the changes for this as a PR, perhaps as a part of #10970, or perhaps separately, after final discussion on discord.

Metadata

Metadata

Assignees

No one assigned

    Labels

    Needs PRThis issue is accepted, sufficiently specified and now needs an implementationperformance

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions