Skip to content

Fix regex unicode error #1853

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 7 commits into from
Closed

Conversation

julienmalard
Copy link
Contributor

@julienmalard julienmalard commented Oct 2, 2019

Summary of changes

Fixed regex error in [A-Za-z] test of package names so that the full range of Python's accepted unicode identifiers are accepted.

Closes #1399 (I think).

Pull Request Checklist

  • Changes have tests
  • News fragment added in changelog.d. See documentation for details

@@ -1313,13 +1313,15 @@ def get_default_cache():
or appdirs.user_cache_dir(appname='Python-Eggs')
)

unicode_marks = "\u0300-\u0370\u0483-\u048a\u0591-\u05be\u05bf-\u05c0\u05c1-\u05c3\u05c4-\u05c6\u05c7-\u05c8\u0610-\u061b\u064b-\u0660\u0670-\u0671\u06d6-\u06dd\u06df-\u06e5\u06e7-\u06e9\u06ea-\u06ee\u0711-\u0712\u0730-\u074b\u07a6-\u07b1\u07eb-\u07f4\u07fd-\u07fe\u0816-\u081a\u081b-\u0824\u0825-\u0828\u0829-\u082e\u0859-\u085c\u08d3-\u08e2\u08e3-\u0904\u093a-\u093d\u093e-\u0950\u0951-\u0958\u0962-\u0964\u0981-\u0984\u09bc-\u09bd\u09be-\u09c5\u09c7-\u09c9\u09cb-\u09ce\u09d7-\u09d8\u09e2-\u09e4\u09fe-\u09ff\u0a01-\u0a04\u0a3c-\u0a3d\u0a3e-\u0a43\u0a47-\u0a49\u0a4b-\u0a4e\u0a51-\u0a52\u0a70-\u0a72\u0a75-\u0a76\u0a81-\u0a84\u0abc-\u0abd\u0abe-\u0ac6\u0ac7-\u0aca\u0acb-\u0ace\u0ae2-\u0ae4\u0afa-\u0b00\u0b01-\u0b04\u0b3c-\u0b3d\u0b3e-\u0b45\u0b47-\u0b49\u0b4b-\u0b4e\u0b56-\u0b58\u0b62-\u0b64\u0b82-\u0b83\u0bbe-\u0bc3\u0bc6-\u0bc9\u0bca-\u0bce\u0bd7-\u0bd8\u0c00-\u0c05\u0c3e-\u0c45\u0c46-\u0c49\u0c4a-\u0c4e\u0c55-\u0c57\u0c62-\u0c64\u0c81-\u0c84\u0cbc-\u0cbd\u0cbe-\u0cc5\u0cc6-\u0cc9\u0cca-\u0cce\u0cd5-\u0cd7\u0ce2-\u0ce4\u0d00-\u0d04\u0d3b-\u0d3d\u0d3e-\u0d45\u0d46-\u0d49\u0d4a-\u0d4e\u0d57-\u0d58\u0d62-\u0d64\u0d82-\u0d84\u0dca-\u0dcb\u0dcf-\u0dd5\u0dd6-\u0dd7\u0dd8-\u0de0\u0df2-\u0df4\u0e31-\u0e32\u0e34-\u0e3b\u0e47-\u0e4f\u0eb1-\u0eb2\u0eb4-\u0eba\u0ebb-\u0ebd\u0ec8-\u0ece\u0f18-\u0f1a\u0f35-\u0f36\u0f37-\u0f38\u0f39-\u0f3a\u0f3e-\u0f40\u0f71-\u0f85\u0f86-\u0f88\u0f8d-\u0f98\u0f99-\u0fbd\u0fc6-\u0fc7\u102b-\u103f\u1056-\u105a\u105e-\u1061\u1062-\u1065\u1067-\u106e\u1071-\u1075\u1082-\u108e\u108f-\u1090\u109a-\u109e\u135d-\u1360\u1712-\u1715\u1732-\u1735\u1752-\u1754\u1772-\u1774\u17b4-\u17d4\u17dd-\u17de\u180b-\u180e\u1885-\u1887\u18a9-\u18aa\u1920-\u192c\u1930-\u193c\u1a17-\u1a1c\u1a55-\u1a5f\u1a60-\u1a7d\u1a7f-\u1a80\u1ab0-\u1abf\u1b00-\u1b05\u1b34-\u1b45\u1b6b-\u1b74\u1b80-\u1b83\u1ba1-\u1bae\u1be6-\u1bf4\u1c24-\u1c38\u1cd0-\u1cd3\u1cd4-\u1ce9\u1ced-\u1cee\u1cf2-\u1cf5\u1cf7-\u1cfa\u1dc0-\u1dfa\u1dfb-\u1e00\u20d0-\u20f1\u2cef-\u2cf2\u2d7f-\u2d80\u2de0-\u2e00\u302a-\u3030\u3099-\u309b\ua66f-\ua673\ua674-\ua67e\ua69e-\ua6a0\ua6f0-\ua6f2\ua802-\ua803\ua806-\ua807\ua80b-\ua80c\ua823-\ua828\ua880-\ua882\ua8b4-\ua8c6\ua8e0-\ua8f2\ua8ff-\ua900\ua926-\ua92e\ua947-\ua954\ua980-\ua984\ua9b3-\ua9c1\ua9e5-\ua9e6\uaa29-\uaa37\uaa43-\uaa44\uaa4c-\uaa4e\uaa7b-\uaa7e\uaab0-\uaab1\uaab2-\uaab5\uaab7-\uaab9\uaabe-\uaac0\uaac1-\uaac2\uaaeb-\uaaf0\uaaf5-\uaaf7\uabe3-\uabeb\uabec-\uabee\ufb1e-\ufb1f\ufe00-\ufe10\ufe20-\ufe30\u101fd-\u101fe\u102e0-\u102e1\u10376-\u1037b\u10a01-\u10a04\u10a05-\u10a07\u10a0c-\u10a10\u10a38-\u10a3b\u10a3f-\u10a40\u10ae5-\u10ae7\u10d24-\u10d28\u10f46-\u10f51\u11000-\u11003\u11038-\u11047\u1107f-\u11083\u110b0-\u110bb\u11100-\u11103\u11127-\u11135\u11145-\u11147\u11173-\u11174\u11180-\u11183\u111b3-\u111c1\u111c9-\u111cd\u1122c-\u11238\u1123e-\u1123f\u112df-\u112eb\u11300-\u11304\u1133b-\u1133d\u1133e-\u11345\u11347-\u11349\u1134b-\u1134e\u11357-\u11358\u11362-\u11364\u11366-\u1136d\u11370-\u11375\u11435-\u11447\u1145e-\u1145f\u114b0-\u114c4\u115af-\u115b6\u115b8-\u115c1\u115dc-\u115de\u11630-\u11641\u116ab-\u116b8\u1171d-\u1172c\u1182c-\u1183b\u11a01-\u11a0b\u11a33-\u11a3a\u11a3b-\u11a3f\u11a47-\u11a48\u11a51-\u11a5c\u11a8a-\u11a9a\u11c2f-\u11c37\u11c38-\u11c40\u11c92-\u11ca8\u11ca9-\u11cb7\u11d31-\u11d37\u11d3a-\u11d3b\u11d3c-\u11d3e\u11d3f-\u11d46\u11d47-\u11d48\u11d8a-\u11d8f\u11d90-\u11d92\u11d93-\u11d98\u11ef3-\u11ef7\u16af0-\u16af5\u16b30-\u16b37\u16f51-\u16f7f\u16f8f-\u16f93\u1bc9d-\u1bc9f\u1d165-\u1d16a\u1d16d-\u1d173\u1d17b-\u1d183\u1d185-\u1d18c\u1d1aa-\u1d1ae\u1d242-\u1d245\u1da00-\u1da37\u1da3b-\u1da6d\u1da75-\u1da76\u1da84-\u1da85\u1da9b-\u1daa0\u1daa1-\u1dab0\u1e000-\u1e007\u1e008-\u1e019\u1e01b-\u1e022\u1e023-\u1e025\u1e026-\u1e02b\u1e8d0-\u1e8d7\u1e944-\u1e94b\ue0100-\ue01f0"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really really hope this is not the right way to do this. I certainly don't think it should be all on one line.

@pganssle
Copy link
Member

pganssle commented Oct 3, 2019

There appears to be a canonical regex for package names, unless I'm getting that wrong. I'm not quite sure why we're not using it, though.

It does not seem to be compatible with unicode packages, though, so we should probably try and fix this at the spec level.

@julienmalard
Copy link
Contributor Author

julienmalard commented Oct 3, 2019

Unfortunately it does seem to be the only way to go at this time.
The official spec for identifiers is:

The identifier syntax is <XID_Start> <XID_Continue>.
ID_Start is defined as all characters having one of the general categories uppercase letters (Lu), lowercase letters (Ll), titlecase letters (Lt), modifier letters (Lm), other letters (Lo), letter numbers (Nl), the underscore, and characters carrying the Other_ID_Start property. XID_Start then closes this set under normalization, by removing all characters whose NFKC normalization is not of the form ID_Start ID_Continue
anymore.

(From PEP 3131.)

Unfortunately to use these unicode classes, we have to use the regex, and not the builtin re module. Am I right that adding regex as a dependency to setuptools would be a bad idea?

If so, we are I believe effectively stuck with ugly unicode ranges.

@pganssle pganssle closed this Oct 3, 2019
@pganssle pganssle reopened this Oct 3, 2019
@pganssle
Copy link
Member

pganssle commented Oct 3, 2019

Unfortunately it does seem to be the only way the go at this time.
The official spec for identifiers is:

That is for identifiers, this is for package names and versions. It's not clear to me that they are the same (though there's some case to be made that they should be harmonized in the official spec).

Unfortunately to use these unicode classes, we have to use the regex, and not the builtin re module. Am I right that adding regex as a dependency to setuptools would be a bad idea?

If so, we are I believe effectively stuck with ugly unicode ranges.

Well, the other option is to not use a regular expression, for example, this will cover most of what your regex does. Presumably unicodedata can also get the Other_ID_start property as well.

def fc(x):
    out = ""
    allowed_categories = {"Lt", "Lm", "Lu", "Ll", "Lt", "Lm", "Lo", "NI"}
    for c in x:
        if unicodedata.category(c) in allowed_categories:
            out += c
        else:
            out += "-"
    return re.sub("-+", "-", out).strip("-")

But you'll note that in fact "safe names" are not valid identifiers, since they actually replace all _s with -.

@jaraco
Copy link
Member

jaraco commented Oct 3, 2019

Am I right that adding regex as a dependency to setuptools would be a bad idea.

I'd like to think not, but it's not as simple as adding a dependency in another project. Any dependency (and its dependencies) must be vendored. This shouldn't be done with abandon, as it does impact downstream system packagers that work to unvendor the vendored packages.

@julienmalard
Copy link
Contributor Author

Thank you for the feedback! I think I will try out @pganssle 's approach first. By the way, pardon my ignorance, but what is the practical difference between the package "safe name" and its identifier in terms of how Python (or setuptools) uses them?

@pganssle
Copy link
Member

pganssle commented Oct 4, 2019

@julienmalard The two things are nearly entirely different. For something to be a valid identifier means that it can be used as a variable name in Python. For something to be a valid package name means that it can be used for the name of a package to be installed. For example my-package is a valid package name, but not a valid identifier, and in fact my-package, my_package and My-Package all refer to the same package (my-package is the canonical name).

@julienmalard julienmalard changed the title Fix Fix regex error Oct 14, 2019
@julienmalard julienmalard changed the title Fix regex error Fix regex unicode error Oct 14, 2019
@julienmalard
Copy link
Contributor Author

Thank you @pganssle for your clear explanation!
I have implemented your alternative to the regex as well and it seems to be passing all tests now.

Please let me know what you think!

Copy link
Member

@jaraco jaraco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please re-write f35c7da to avoid the stylistic changes? Stylistic changes should be made in a separate commit or PR. Some of these changes might be acceptable, but others are not (excess indentation/whitespace added). It's these changes that are creating the merge conflicts.

)


def fc(x, replace_with='-', also_accept=None, also_refuse=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function really needs a docstring, especially because fc doesn't mean anything to me.

@jaraco
Copy link
Member

jaraco commented May 25, 2020

I started down the path of removing the stylistic changes, and re-organizing the commits into first a test failure. However, I noticed that safe_name isn't implicated in the traceback of the original report, so I'm not even confident the additional test here exhibits the failure reported in the issue.

It's not obvious to me that the reported issue is even related to the fix herein. Let's go back to the issue and work on creating a minimal reproducer that exhibits the failure.

@jaraco jaraco closed this May 25, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

install breaks with unicode file names
3 participants