### What's the problem this feature will solve?

The 2020 resolver separated the resolution logic from the rest of pip, making the resolver much easier to read and extend. With `--use-feature=fast-deps`, we began to investigate improving pip performance by avoiding the download of dependencies until we reach a complete resolution. With `install --report`, we enabled pip users to read the output of the resolver without needing to download dependencies. However, a few blockers remain to achieving performance improvements, across multiple use cases:
#### Uncached resolves

When pip is executed entirely from scratch (without an existing `~/.cache/pip` directory), as is often the case in CI, we are unlikely to get much faster than we are now (and notably, it's extremely unlikely a non-pip tool could go faster in this case without relying on some sort of remote resolve index). However, there are a couple of improvements we can still make here:
- Make `--use-feature=fast-deps` the default, to cover for wheels without PEP 658 metadata backfilled yet.
  - Importantly, this is often the case for internal corporate CI indices, which are unlikely to have performed the process of backfilling PEP 658 metadata, and may even be a `--find-links` repo instead of serving the PyPI simple repository API.
  - `fast-deps` is not as fast as it could be, which will be addressed further below.
- Establish a separate metadata cache directory, so that CI runs in e.g. GitHub Actions can retain the result of metadata resolution run-over-run, without having to store large binary wheel output.
  - We will discuss metadata caching further below; it currently does not exist in pip.
  - This should achieve similar performance to partially-cached resolves, discussed next.
- Batch downloading of wheels all at once after a shallow metadata resolve.
  - Pip already has the infrastructure to do this; it's just not implemented yet!
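The batch-downloading idea can be sketched with a thread pool. This is illustrative only, not pip's implementation: the `fetch` function here is a hypothetical stand-in for the real HTTP download.

```python
# Sketch only: after a metadata-only resolve completes, download all
# resolved wheels in one batch instead of one at a time mid-resolution.
from concurrent.futures import ThreadPoolExecutor

def fetch(url: str) -> bytes:
    # Placeholder for an HTTP download; a real version would stream the
    # response body into the wheel cache.
    return f"contents of {url}".encode()

def batch_download(urls, max_workers=8):
    # Downloads are network-bound, so threads overlap them effectively:
    # total wall time approaches that of the largest single artifact.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(urls, pool.map(fetch, urls)))

artifacts = batch_download([
    "https://files.example/a-1.0-py3-none-any.whl",
    "https://files.example/b-2.0-py3-none-any.whl",
])
```

Because the resolve has already finished, the full set of URLs is known up front, which is what makes the pooled download possible at all.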
#### Partially-cached resolves with downloading

When pip is executed with a persistent `~/.cache/pip` directory, we can take advantage of much more caching, and this is the bulk of the work here. In e.g. #11111 and other work, we (mostly) separated metadata resolution from downloading, which has allowed us to consider caching not just downloaded artifacts, but parts of the resolution process itself. This is directly enabled by the clean separation and design of the 2020 resolver. We can cache the following:
- PEP 658 metadata for a particular wheel (this saves us a small download).
- `fast-deps` metadata for a particular wheel (this saves us a few HTTP range requests).
- Metadata extracted from an sdist (this saves us a medium-size download and a build process).
These alone may not seem like much, but over the course of an entire resolve, avoiding potentially multiple network requests per dependency and staying within pip's in-memory resolve logic adds up to a very significant performance improvement. These caches also reduce the number of requests we make against PyPI.
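Because the mapping from wheel hash to metadata is idempotent, such a cache needs no expiry or invalidation logic at all. A minimal sketch, assuming a flat on-disk layout keyed by content hash (illustrative; pip's actual cache layout differs):

```python
# Illustrative sketch: a metadata cache keyed by wheel content hash.
# Since the same hash always maps to the same METADATA bytes, entries
# never expire and never need revalidation.
import os
from typing import Optional

class MetadataCache:
    def __init__(self, root: str):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, wheel_sha256: str) -> str:
        return os.path.join(self.root, wheel_sha256 + ".metadata")

    def get(self, wheel_sha256: str) -> Optional[bytes]:
        try:
            with open(self._path(wheel_sha256), "rb") as f:
                return f.read()
        except FileNotFoundError:
            return None  # cache miss: caller fetches, then calls put()

    def put(self, wheel_sha256: str, metadata: bytes) -> None:
        with open(self._path(wheel_sha256), "wb") as f:
            f.write(metadata)
```

A cache like this is also small enough to persist across CI runs without the size concerns of a full wheel cache.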
But wait! We can go even faster! In addition to the metadata cache (which is idempotent and not time-varying: the same wheel hash always maps to the same metadata), we can also cache the result of querying the simple repository API for the list of dists available for a given dependency name. This additional caching requires handling HTTP caching headers to see whether a given page has changed, but it lets us cache:
- The raw content of a simple repository index page, if unchanged (this saves us a medium-size download).
- The result of parsing a simple repository index page into `Link`s, if unchanged (this saves us an HTML parser invocation).
- The result of filtering `Link`s by interpreter compatibility, if unchanged (this saves us having to calculate interpreter compatibility using the tags logic).
#### Resolves without downloading

With `install --report out.json --dry-run` and the metadata resolve+caching discussed above, we should be able to avoid downloading the files associated with the resolved dists, enabling users to download those dependencies in a later phase (as I achieved at Twitter with pantsbuild/pants#8793). However, we currently don't do so (see #11512), because of a mistake I made in previous implementation work (sorry!). So for this, we just need to:
- Avoid downloading dists when invoked from a command which only requests metadata (such as `install --dry-run`).
### Describe the solution you'd like

I have created several PRs which achieve all of the above:
#### Batch downloading [0/2]

For batch downloading of metadata-only dists, we have two phases:

- improve pooled progress output for BatchDownloader #12925 produces extremely rich progress output for batched downloads.
- execute batch downloads in parallel worker threads #12923 finally makes the `BatchDownloader` download and prepare metadata-only dists in parallel. This produces a drastic performance improvement.
#### `fast-deps` fixes [0/1]

- perform 1-3 HTTP requests for each wheel using fast-deps #12208 fixes the `fast-deps` implementation to achieve excellent performance against the current iteration of PyPI behind Fastly, as well as any other HTTP host supporting range requests.
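To illustrate why range requests suffice here: a wheel is a zip archive, and zip's central directory lives at the end of the file, so a file-like object that fetches byte ranges on demand lets `zipfile` read just `*.dist-info/METADATA`. This is a simplified sketch of the idea behind pip's lazy-wheel code, with `range_fetch` simulating HTTP `Range` requests:

```python
import io
import zipfile

class LazyRemoteFile(io.RawIOBase):
    """File-like object that reads byte ranges on demand, so zipfile can
    inspect a remote wheel without downloading the whole archive."""

    def __init__(self, size, range_fetch):
        self.size, self.range_fetch, self.pos = size, range_fetch, 0

    def seekable(self):
        return True

    def readable(self):
        return True

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            self.pos = offset
        elif whence == io.SEEK_CUR:
            self.pos += offset
        else:  # io.SEEK_END
            self.pos = self.size + offset
        return self.pos

    def tell(self):
        return self.pos

    def read(self, n=-1):
        if n < 0:
            n = self.size - self.pos
        # Each read maps to an HTTP "Range: bytes=start-end" request in a
        # real implementation (the actual fast-deps code coalesces them).
        data = self.range_fetch(self.pos, self.pos + n - 1)
        self.pos += len(data)
        return data

def remote_metadata(size, range_fetch):
    with zipfile.ZipFile(LazyRemoteFile(size, range_fetch)) as zf:
        name = next(n for n in zf.namelist()
                    if n.endswith(".dist-info/METADATA"))
        return zf.read(name)
```

For a multi-megabyte wheel this reads only a few kilobytes: the end-of-archive records, the central directory, and the METADATA member itself.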
#### Formalize "concrete" vs metadata-only dists [0/3]

To avoid downloading dists for metadata-only commands, we have several phases:

- cache "concrete" dists by Distribution instead of InstallRequirement #12863 introduces `.is_concrete` on our `Distribution` wrappers to codify the concept of "metadata-only" dists.
- refactor requirement preparer to remove duplicated code paths for metadata-only dists #12871 is a refactoring change to decouple the caching of metadata-only dists from the rest of the `RequirementPreparer` logic.
- pull preparer logic out of the resolver to consume metadata-only dists in commands #12186 finally fixes "use .metadata distribution info when possible" #11512, so `install --dry-run` doesn't download any dists.
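The distinction can be shown with a toy model. The `is_concrete` name follows the PR, but the classes and `prepare` function here are illustrative assumptions, not pip's internals:

```python
# Toy model: a dist is "concrete" once its archive exists on disk.
# Metadata-only commands (e.g. `install --dry-run`) never need a dist to
# become concrete, so they can skip the download entirely.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Dist:
    name: str
    metadata: dict
    local_path: Optional[str] = None

    @property
    def is_concrete(self) -> bool:
        return self.local_path is not None

def prepare(dist: Dist, need_artifact: bool, download) -> Dist:
    # `download` is only invoked by commands that actually install.
    if need_artifact and not dist.is_concrete:
        dist.local_path = download(dist.name)
    return dist
```

Putting the flag on the `Distribution` wrapper rather than the requirement means the preparer can decide per command whether concreteness is needed at all.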
#### Metadata caching [0/1]

- cache metadata lookups for sdists and lazy wheels #12256 introduces the first "metadata cache", which is separate from and much smaller than the cache used for downloaded or built wheels. This produces the most drastic performance improvement to the resolve process.
#### Caching index pages [0/2]

To optimize the process of obtaining `Link`s to resolve against, we have at least two phases:

- send HTTP caching headers for index pages to further reduce bandwidth usage #12257 attaches HTTP caching headers to requests against index pages, which allows our `CacheControl` integration to implicitly retrieve the cached response after a very fast `304 Not Modified` from PyPI.
- persistent cache for link parsing and interpreter compatibility #12258 checks whether the index page was updated (e.g. by a new upload) since it was last downloaded, and if not, retrieves the result of `Link` parsing and interpreter compatibility filtering from the metadata cache.
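The second phase amounts to memoizing the parse-and-filter pipeline on a key that captures both the page's identity and the interpreter. A sketch under assumed names and data shapes (not pip's actual structures):

```python
# Sketch: once we know (e.g. via ETag) that an index page is unchanged,
# the links parsed from it and filtered for this interpreter can be
# reused from a cache keyed by (page identity, supported tags).
def compatible_links(page_etag, supported_tags, cache, parse_page, page_body):
    key = (page_etag, tuple(sorted(supported_tags)))
    if key in cache:
        return cache[key]  # skip HTML parsing and the tags computation
    links = [link for link in parse_page(page_body)
             if link["tag"] in supported_tags]
    cache[key] = links
    return links
```

Keying on the tag set as well as the page means upgrading the interpreter naturally invalidates only the filtered results, not the raw pages.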
Each of these PRs demonstrates a nontrivial performance improvement in its description. All together, the result is quite significant, and never produces a slowdown.
### Alternative Solutions

- None of these changes modify any external APIs. They do introduce new subdirectories to `~/.cache/pip`, which can be tracked separately from the wheel cache to speed up resolution without ballooning in size.
- I expect the `Link` parsing and interpreter compatibility caching in persistent cache for link parsing and interpreter compatibility #12258 to involve more discussion, as it takes up more cache space than the idempotent metadata cache, produces less of a performance improvement, and is more complex to implement. However, nothing else depends on it to work, and it can safely be discussed later, after the preceding caching work is done.
### Additional context

In writing this, I realized we may be able to modify the approach of #12257 to work with `--find-links` repos as well. Those are expected to change much more frequently than index pages using the simple repository API, but this may be worth considering after the other work is done.
### Code of Conduct

- I agree to follow the PSF Code of Conduct.