metadata resolve workstream #12921

Open
@cosmicexplorer

What's the problem this feature will solve?

The 2020 resolver separated the resolution logic from the rest of pip, and made the resolver much easier to read and extend. With --use-feature=fast-deps, we began to investigate improved pip performance by avoiding the download of dependencies until we reach a complete resolution. With install --report, we enabled pip users to read the output of the resolver without needing to download dependencies. However, there remain a few blockers to achieving performance improvements, for multiple use cases:

Uncached resolves

When pip is executed entirely from scratch (without an existing ~/.cache/pip directory), as is often the case in CI, we are unlikely to get much faster than we are now (and notably, it's extremely unlikely a non-pip tool could go faster in this case without relying on some sort of remote resolve index). However, there are a couple of improvements we can still make here:

  • Make --use-feature=fast-deps the default, to cover wheels whose PEP 658 metadata has not been backfilled yet.
    • Importantly, this is often the case for internal corporate CI indices, which are unlikely to have performed the process of backfilling PEP 658 metadata, and may even be a --find-links repo instead of serving the PyPI simple repository API.
    • fast-deps is not as fast as it could be, which will be addressed further below.
  • Establish a separate metadata cache directory, so that CI runs in e.g. GitHub Actions can retain the result of metadata resolution run-over-run, without having to store large binary wheel output.
    • We will discuss metadata caching further below--this currently does not exist in pip.
    • This should achieve similar performance to partially-cached resolves, discussed next.
  • Batch downloading of wheels all at once after a shallow metadata resolve.
    • Pip already has the infrastructure to do this, it's just not implemented yet!
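To illustrate why fast-deps only needs a few range requests per wheel, here is a minimal sketch of reading just the METADATA member of a wheel-shaped zip. It is not pip's implementation: `CountingReader`, `build_fake_wheel`, and `read_metadata` are hypothetical names, and the "remote file" is simulated in memory, with each `read()` standing in for bytes a real client would fetch over HTTP.

```python
import io
import zipfile


class CountingReader(io.BytesIO):
    """In-memory stand-in for a remote file served via HTTP range requests.

    Every read() here corresponds to bytes a real client would have to fetch.
    """

    def __init__(self, data: bytes) -> None:
        super().__init__(data)
        self.bytes_read = 0

    def read(self, size: int = -1) -> bytes:
        chunk = super().read(size)
        self.bytes_read += len(chunk)
        return chunk


def build_fake_wheel() -> bytes:
    """Build a wheel-shaped zip: one large payload plus a small METADATA file."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
        zf.writestr("demo/big_module.bin", b"\0" * 1_000_000)
        zf.writestr(
            "demo-1.0.dist-info/METADATA",
            "Metadata-Version: 2.1\nName: demo\nVersion: 1.0\n"
            "Requires-Dist: requests\n",
        )
    return buf.getvalue()


def read_metadata(fileobj) -> str:
    """Read only the *.dist-info/METADATA member from a wheel-like zip."""
    with zipfile.ZipFile(fileobj) as zf:
        name = next(n for n in zf.namelist() if n.endswith(".dist-info/METADATA"))
        return zf.read(name).decode("utf-8")


wheel_bytes = build_fake_wheel()
reader = CountingReader(wheel_bytes)
metadata = read_metadata(reader)
print(f"read {reader.bytes_read} of {len(wheel_bytes)} bytes")
```

The zip central directory sits at the end of the file, so the reader only touches the tail of the archive plus the one small member, and never the large payload.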

Partially-cached resolves with downloading

When pip is executed with a persistent ~/.cache/pip directory, we can take advantage of much more caching, and this is the bulk of the work here. In e.g. #11111 and other work, we (mostly) separated metadata resolution from downloading, and this has allowed us to consider how to cache not just downloaded artifacts, but parts of the resolution process itself. This is directly enabled by the clean separation and design of the 2020 resolver. We can cache the following:

  • PEP 658 metadata for a particular wheel (this saves us a small download).
  • fast-deps metadata for a particular wheel (this saves us a few HTTP range requests).
  • Metadata extracted from an sdist (this saves us a medium-size download and a build process).

These alone may not seem like much, but over the course of an entire resolve, not having to make potentially multiple network requests per dependency and staying within our in-memory pip resolve logic adds up and produces a very significant performance improvement. These also reduce the number of requests we make against PyPI.
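As a rough sketch of the metadata cache idea, assuming entries are keyed by the wheel's sha256 and stored as plain text files: `MetadataCache`, its directory layout, and `expensive_fetch` are all hypothetical and do not reflect the layout pip actually uses.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable, List


class MetadataCache:
    """Hypothetical on-disk cache mapping a wheel's sha256 to its METADATA text."""

    def __init__(self, root: Path) -> None:
        self.root = root
        self.root.mkdir(parents=True, exist_ok=True)

    def _path(self, sha256: str) -> Path:
        # Shard by hash prefix so no single directory grows too large.
        return self.root / sha256[:2] / f"{sha256}.metadata"

    def get_or_fetch(self, sha256: str, fetch: Callable[[], str]) -> str:
        """Return cached metadata, calling fetch() only on a cache miss."""
        path = self._path(sha256)
        if path.exists():
            return path.read_text(encoding="utf-8")
        text = fetch()
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(text, encoding="utf-8")
        return text


calls: List[int] = []


def expensive_fetch() -> str:
    # Stand-in for a PEP 658 download, range requests, or an sdist build.
    calls.append(1)
    return "Name: demo\nVersion: 1.0\n"


with tempfile.TemporaryDirectory() as tmp:
    cache = MetadataCache(Path(tmp) / "metadata")
    digest = hashlib.sha256(b"pretend wheel contents").hexdigest()
    first = cache.get_or_fetch(digest, expensive_fetch)
    second = cache.get_or_fetch(digest, expensive_fetch)
```

Because the key is a content hash, an entry never needs invalidation: the second lookup is served from disk without calling `expensive_fetch` again.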

But wait! We can go even faster! Because in addition to the metadata cache (which is idempotent and not time-varying--the same wheel hash always maps to the same metadata), we can also cache the result of querying the simple repository API for the list of dists available for a given dependency name! This additional caching requires messing around with HTTP caching headers to see if a given page has changed, but it lets us cache:

  • The raw content of a simple repository index page, if unchanged (this saves us a medium-size download).
  • The result of parsing a simple repository index page into Links, if unchanged (this saves us an HTML parser invocation).
  • The result of filtering Links by interpreter compatibility, if unchanged (this saves us having to calculate interpreter compatibility using the tags logic).
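The "has this page changed?" check can be sketched with a conditional request using the standard `ETag` / `If-None-Match` headers and a `304 Not Modified` response. Everything else below is an assumption for illustration: `get_links`, `CachedPage`, the toy `parse_links`, the fake server, and the example URL are hypothetical, and no real network request is made.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# (status, response headers, body); a stand-in for a real HTTP client.
Response = Tuple[int, Dict[str, str], str]
HttpGet = Callable[[str, Dict[str, str]], Response]


@dataclass
class CachedPage:
    etag: str
    links: List[str]


def parse_links(body: str) -> List[str]:
    """Toy parser: one link per non-empty line (real index pages are HTML or JSON)."""
    return [line.strip() for line in body.splitlines() if line.strip()]


def get_links(url: str, cache: Dict[str, CachedPage], http_get: HttpGet) -> List[str]:
    """Return links for url, reusing the parsed result if the server says 304."""
    cached = cache.get(url)
    headers = {"If-None-Match": cached.etag} if cached else {}
    status, resp_headers, body = http_get(url, headers)
    if status == 304 and cached is not None:
        return cached.links  # unchanged: skip both the download and the parse
    links = parse_links(body)
    etag = resp_headers.get("ETag")
    if etag:
        cache[url] = CachedPage(etag, links)
    return links


def make_fake_server(etag: str, body: str) -> Tuple[HttpGet, List[int]]:
    """Hypothetical server that honors If-None-Match and logs status codes."""
    log: List[int] = []

    def http_get(url: str, headers: Dict[str, str]) -> Response:
        if headers.get("If-None-Match") == etag:
            log.append(304)
            return 304, {"ETag": etag}, ""
        log.append(200)
        return 200, {"ETag": etag}, body

    return http_get, log


http_get, log = make_fake_server('"v1"', "demo-1.0-py3-none-any.whl\ndemo-1.0.tar.gz\n")
cache: Dict[str, CachedPage] = {}
url = "https://index.example/simple/demo/"
first = get_links(url, cache, http_get)
second = get_links(url, cache, http_get)
```

The same cache entry could also hold the interpreter-compatible subset of links, so that a 304 skips the tag filtering as well.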

Resolves without downloading

With install --report out.json --dry-run and the metadata resolve+caching discussed above, we should be able to avoid downloading the files associated with the resolved dists, enabling users to download those dependencies in a later phase (as I achieved at Twitter with pantsbuild/pants#8793). However, we currently don't do so (see #11512), because of a mistake I made in previous implementation work (sorry!). So for this, we just need to:

  • Avoid downloading dists when invoked from a command which only requests metadata (such as install --dry-run).
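For the "download in a later phase" use case, a consumer of the report might look roughly like the sketch below. The sample JSON is made up for illustration (fake URLs and placeholder hashes), and the field names follow my understanding of pip's documented installation report format, so they should be checked against the pip version in use. `PlannedDownload` and `planned_downloads` are hypothetical names.

```python
import json
from typing import List, NamedTuple, Optional

# Hand-written sample, not real pip output.
SAMPLE_REPORT = """
{
  "version": "1",
  "pip_version": "23.3",
  "install": [
    {
      "download_info": {
        "url": "https://files.example/demo-1.0-py3-none-any.whl",
        "archive_info": {"hashes": {"sha256": "aaaa"}}
      },
      "is_direct": false,
      "requested": true,
      "metadata": {"name": "demo", "version": "1.0"}
    },
    {
      "download_info": {
        "url": "https://files.example/dep-2.3-py3-none-any.whl",
        "archive_info": {"hashes": {"sha256": "bbbb"}}
      },
      "is_direct": false,
      "requested": false,
      "metadata": {"name": "dep", "version": "2.3"}
    }
  ],
  "environment": {}
}
"""


class PlannedDownload(NamedTuple):
    name: str
    version: str
    url: str
    sha256: Optional[str]


def planned_downloads(report: dict) -> List[PlannedDownload]:
    """List what a later phase would need to download, with hashes if present."""
    plan = []
    for item in report["install"]:
        info = item["download_info"]
        # archive_info is absent for e.g. VCS or local-directory requirements.
        hashes = info.get("archive_info", {}).get("hashes", {})
        plan.append(
            PlannedDownload(
                name=item["metadata"]["name"],
                version=item["metadata"]["version"],
                url=info["url"],
                sha256=hashes.get("sha256"),
            )
        )
    return plan


plan = planned_downloads(json.loads(SAMPLE_REPORT))
for entry in plan:
    print(entry.name, entry.version, entry.url)
```

A later phase can fetch each URL and verify it against the recorded hash, which only pays off if producing the report did not already download the files.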

Describe the solution you'd like

I have created several PRs which achieve all of the above:

Batch downloading [0/2]

For batch downloading of metadata-only dists, we have two phases:

fast-deps fixes [0/1]

Formalize "concrete" vs metadata-only dists [0/3]

To avoid downloading dists for metadata-only commands, we have several phases:

Metadata caching [0/1]

Caching index pages [0/2]

To optimize the process of obtaining Links to resolve against, we have at least two phases:

Each of these PRs demonstrates some nontrivial performance improvement in its description. All together, the result is quite significant, and never produces a slowdown.

Alternative Solutions

  • None of these changes modify any external APIs. They do introduce new subdirectories to ~/.cache/pip, which can be tracked separately from the wheel cache to speed up resolution without ballooning in size.
  • I expect the Link parsing and interpreter compatibility caching in #12258 ("persistent cache for link parsing and interpreter compatibility") to involve more discussion, as they take up more cache space than the idempotent metadata cache, produce less of a performance improvement, and are more complex to implement. However, nothing else depends on them to work, and they can safely be discussed later after the preceding caching work is done.

Additional context

In writing this, I realized we may be able to modify the approach of #12257 to work with --find-links repos as well. Those are expected to change much more frequently than index pages using the simple repository API, but may be worth considering after the other work is done.
