Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix pip cache remove pattern matching. #13094

Open
wants to merge 22 commits into
base: main
Choose a base branch
from
Open

Conversation

jsirois
Copy link
Contributor

@jsirois jsirois commented Nov 27, 2024

Previously, glob patterns were not properly accounted for, which could
lead to pip cache remove wheel-0.45.1-py3-none-any.whl failing, for
example, when exactly that file was shown to exist in the cache via
pip cache list.

Additionally, non-glob patterns previously only literally matched wheel
project names; now they match the normalized name. For example,
pip cache remove pyfftw will now remove a cached
pyFFTW-0.15.0-cp313-cp313-linux_x86_64.whl wheel since the pyfftw
pattern matches the normalized project name.

Fixes #13086

Previously, glob patterns were not properly accounted for, which could
lead to `pip cache remove wheel-0.45.1-py3-none-any.whl` failing, for
example, when exactly that file was shown to exist in the cache via
`pip cache list`.

Additionally, non-glob patterns previously only literally matched wheel
project names; now they match the normalized name. For example,
`pip cache remove pyfftw` will now remove a cached
`pyFFTW-0.15.0-cp313-cp313-linux_x86_64.whl` wheel since the `pyfftw`
pattern matches the normalized project name.

Fixes pypa#13086
glob_pattern_char in pattern for glob_pattern_char in ("[", "*", "?")
)
if not uses_glob_patterns:
project_name, sep, rest = pattern.partition("-")
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

N.B. I documented punting on deciphering glob patterns with respect to name normalization. Someone braver than I might try that later. I assume forward progress here is better than no progress though.

I did not, however, call the shot on version normalization. Again, I was cowardly, but that problem still remains. If a wheel in the cache is foo-1.0-py3-none-any.whl, then, in my opinion, pip cache remove foo-1.0.0 should remove it, but does not today.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume forward progress here is better than no progress though.

Indeed!

I documented punting on deciphering glob patterns with respect to name normalization.

Hmm, how does this limitation come into play? The pattern is matched with case sensitivity disabled so that's not a problem.

I did not, however, call the shot on version normalization

The proper way to handle this would be performing a fully normalized version match. Extract the version from the pattern, convert it to packaging.version.Version, do the same for the cached wheels, and compare them. For wildcards, packaging.specifiers.Specifier could be used (although we'd have to be careful to only accept * and reject any other operator because pip cache remove should not support PEP 440 semantics).

Comment on lines 234 to +235
# PEP 427: https://www.python.org/dev/peps/pep-0427/
pattern = pattern + ("*.whl" if "-" in pattern else "-*.whl")

return filesystem.find_files(wheel_dir, pattern)
# And: https://packaging.python.org/specifications/binary-distribution-format/
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Pip / PyPA knowledgeable reviewers - what the heck is up with this by the way? I've always found it off-putting since the PEP banners have started appearing on some of the PyPA PEPs re-directing to these new standards pages. The fact that some PEPs get a new page and some don't and what process the ones with the new page go through to get updated is all mysterious feeling.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We’ve been pushing people toward the standard pages on packaging.python.org since those are the up-to-date standard documents, while PEP pages are no longer updated once they are done, which can be a problem when you want to track done how to do a thing that’s been revised a couple of times. I don’t have knowledge how the auto redirection is actually implemented though.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The procedure for modern packaging PEPs is that once accepted, the specification is transferred to the PUG (Packaging User Guide, which is the name for packaging.python.org). This is documented in PyPA procedure:

[Proposals for new interoperability specifications] must be accompanied by a pull request to the Python Packaging User Guide repository, against the PyPA Specifications section, that adds a new subsection defining the purpose of the new specification and the role it plays in the wider Python packaging ecosystem.

I'd imagine that this doesn't happen for all packaging PEPs though... (due to forgetting or not realizing this is a requirement).

Copy link
Contributor

@duckinator duckinator left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change makes the implementation more in-line with my intentions back in #6391.

The glob check in particular looks way more robust now. 🙂

@ichard26 ichard26 self-requested a review December 26, 2024 00:49
@notatallshaw
Copy link
Member

This is another place in pip's codebase where I wish we were relying on packaging instead of hand rolled normalization techniques.

I want to suggest using parse_wheel_filename from packaging.utils to normalize the name first and then apply the pattern on the normalized name, but there's some discussion about the purpose of that function going on right now: pypa/packaging#873

So maybe this is a better solution for now, at least until packaging offers a more complete filename API.

@ichard26
Copy link
Member

I want to suggest using parse_wheel_filename from packaging.utils to normalize the name first and then apply the pattern on the normalized name, but there's some discussion about the purpose of that function going on right now: pypa/packaging#873

I strongly agree with @notatallshaw. I'd like to not roll our custom logic for matching against cache entries if possible, but if packaging's utilties aren't ready for that yet, then an interim solution is fine (as long as it can easily be replaced by packaging w/o too much fuss).

I know that this PR is waiting for my review. It is next on my list to review, sorry for the wait!

@ichard26
Copy link
Member

Heads up, CI is going to fail due to a new setuptools release from ~6 hours ago. Setuptools now normalizes the filenames of the wheels it emits. This isn't your problem, just don't bother retrying CI.

Copy link
Member

@ichard26 ichard26 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heya, I'd like to apologise again for taking so long to review your PR.

While I do think it's worth using parse_wheel_filename, that can be dealt with separately. Either way, normalization is needed on both sides (the pattern AND the wheel filenames) to account for all of the potential textual mismatches.

I left more review comments than I was initially planning. Please don't take this as criticism of your work, but simply a reflection of how tricky this command really is! This is a welcomed step forward that makes pip cache remove (and friends) easier to use.

Thank you so much for the contribution and your patience. I look forward to getting this in the 25.1 release.

Comment on lines +239 to +241
uses_glob_patterns = any(
glob_pattern_char in pattern for glob_pattern_char in ("[", "*", "?")
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
uses_glob_patterns = any(
glob_pattern_char in pattern for glob_pattern_char in ("[", "*", "?")
)
uses_glob_patterns = any(char in "[]*?" for char in pattern)

Comment on lines 236 to 237
#
# PEP 427: https://www.python.org/dev/peps/pep-0427/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#
# PEP 427: https://www.python.org/dev/peps/pep-0427/

If we're linking to the canonical specification, the reference to the original PEP isn't needed anymore.

Comment on lines +256 to +263
regex = fnmatch.translate(pattern)

return [
os.path.join(root, filename)
for root, _, files in os.walk(wheel_dir)
for filename in files
if re.match(regex, filename, flags=re.IGNORECASE)
]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd prefer that filesystem.find_files() gained an optional case_insensitive keyword parameter that did this switch to regex dance under the hood.

Comment on lines +244 to +247
# N.B.: This only normalizes non-alphanumeric characters in the name, which
# is 1/2 the job. Stripping case sensitivity is the other 1/2 and is
# handled below in the conversion from a glob to a regex executed with the
# IGNORECASE flag active.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# N.B.: This only normalizes non-alphanumeric characters in the name, which
# is 1/2 the job. Stripping case sensitivity is the other 1/2 and is
# handled below in the conversion from a glob to a regex executed with the
# IGNORECASE flag active.
# NOTE: This only normalizes non-alphanumeric characters in the name.
# Stripping case sensitivity is handled below in the conversion from a glob
# to a regex executed with the IGNORECASE flag.

As someone who likes to write wordy comments, I totally get where you're coming from, but being concise is a good thing :)

glob_pattern_char in pattern for glob_pattern_char in ("[", "*", "?")
)
if not uses_glob_patterns:
project_name, sep, rest = pattern.partition("-")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume forward progress here is better than no progress though.

Indeed!

I documented punting on deciphering glob patterns with respect to name normalization.

Hmm, how does this limitation come into play? The pattern is matched with case sensitivity disabled so that's not a problem.

I did not, however, call the shot on version normalization

The proper way to handle this would be performing a fully normalized version match. Extract the version from the pattern, convert it to packaging.version.Version, do the same for the cached wheels, and compare them. For wildcards, packaging.specifiers.Specifier could be used (although we'd have to be careful to only accept * and reject any other operator because pip cache remove should not support PEP 440 semantics).

# is 1/2 the job. Stripping case sensitivity is the other 1/2 and is
# handled below in the conversion from a glob to a regex executed with the
# IGNORECASE flag active.
partially_normalized_project_name = re.sub(r"[^\w.]+", "_", project_name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
partially_normalized_project_name = re.sub(r"[^\w.]+", "_", project_name)
partially_normalized_project_name = re.sub(r"[-_.]+", "_", project_name)

It's probably safer to simply reuse the pattern from the specification.

@@ -0,0 +1 @@
Fix ``pip cache remove`` pattern matching.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As evident by all of the edge cases, this is moreso a step in the right direction than a total fix. It's a good step forward and certainly needed, but I don't want to claim everything will work w/o issues now :)

ALSO, this PR will affect pip cache list too. That's a good thing, we may as well mention it.

Comment on lines 215 to 217
# Additionally, non-alphanumeric values in the distribution are
# normalized to underscores (_), meaning hyphens can never occur
# before `-{version}`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, due to bugs in build backends (like pypa/setuptools#3777), this isn't guaranteed. Dashes are always normalized (as the wheel filename would be invalid otherwise) but the name may contain periods. For example, zope.sqlalchemy-3.1-py3-none-any.whl.

Normalization of the wheel filenames is going to be necessary to make this more robust. That can be saved for later, of course.


assert not remove_matches_wheel("yyy-1.2.3", result)
assert remove_matches_wheel("zzz-4.5.6", result)
assert not remove_matches_wheel("zzz-4.5.7", result)
assert not remove_matches_wheel("zzz-7.8.9", result)


@pytest.mark.parametrize(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also include some test cases with periods and runs of underscores? The latter is not something I'd expect an user to pass to pip, but it'd be good to validate this is following the specifications.

Comment on lines +71 to +73
def _populate_wheel_cache(
wheel_cache_dir: str, *name_and_versions: str
) -> List[Tuple[str, str]]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this! It's a good clean up!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pip cache remove should also work with a package name
5 participants