-
-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Fix regex unicode error #1853
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix regex unicode error #1853
Conversation
@@ -1313,13 +1313,15 @@ def get_default_cache(): | |||
or appdirs.user_cache_dir(appname='Python-Eggs') | |||
) | |||
|
|||
unicode_marks = "\u0300-\u0370\u0483-\u048a\u0591-\u05be\u05bf-\u05c0\u05c1-\u05c3\u05c4-\u05c6\u05c7-\u05c8\u0610-\u061b\u064b-\u0660\u0670-\u0671\u06d6-\u06dd\u06df-\u06e5\u06e7-\u06e9\u06ea-\u06ee\u0711-\u0712\u0730-\u074b\u07a6-\u07b1\u07eb-\u07f4\u07fd-\u07fe\u0816-\u081a\u081b-\u0824\u0825-\u0828\u0829-\u082e\u0859-\u085c\u08d3-\u08e2\u08e3-\u0904\u093a-\u093d\u093e-\u0950\u0951-\u0958\u0962-\u0964\u0981-\u0984\u09bc-\u09bd\u09be-\u09c5\u09c7-\u09c9\u09cb-\u09ce\u09d7-\u09d8\u09e2-\u09e4\u09fe-\u09ff\u0a01-\u0a04\u0a3c-\u0a3d\u0a3e-\u0a43\u0a47-\u0a49\u0a4b-\u0a4e\u0a51-\u0a52\u0a70-\u0a72\u0a75-\u0a76\u0a81-\u0a84\u0abc-\u0abd\u0abe-\u0ac6\u0ac7-\u0aca\u0acb-\u0ace\u0ae2-\u0ae4\u0afa-\u0b00\u0b01-\u0b04\u0b3c-\u0b3d\u0b3e-\u0b45\u0b47-\u0b49\u0b4b-\u0b4e\u0b56-\u0b58\u0b62-\u0b64\u0b82-\u0b83\u0bbe-\u0bc3\u0bc6-\u0bc9\u0bca-\u0bce\u0bd7-\u0bd8\u0c00-\u0c05\u0c3e-\u0c45\u0c46-\u0c49\u0c4a-\u0c4e\u0c55-\u0c57\u0c62-\u0c64\u0c81-\u0c84\u0cbc-\u0cbd\u0cbe-\u0cc5\u0cc6-\u0cc9\u0cca-\u0cce\u0cd5-\u0cd7\u0ce2-\u0ce4\u0d00-\u0d04\u0d3b-\u0d3d\u0d3e-\u0d45\u0d46-\u0d49\u0d4a-\u0d4e\u0d57-\u0d58\u0d62-\u0d64\u0d82-\u0d84\u0dca-\u0dcb\u0dcf-\u0dd5\u0dd6-\u0dd7\u0dd8-\u0de0\u0df2-\u0df4\u0e31-\u0e32\u0e34-\u0e3b\u0e47-\u0e4f\u0eb1-\u0eb2\u0eb4-\u0eba\u0ebb-\u0ebd\u0ec8-\u0ece\u0f18-\u0f1a\u0f35-\u0f36\u0f37-\u0f38\u0f39-\u0f3a\u0f3e-\u0f40\u0f71-\u0f85\u0f86-\u0f88\u0f8d-\u0f98\u0f99-\u0fbd\u0fc6-\u0fc7\u102b-\u103f\u1056-\u105a\u105e-\u1061\u1062-\u1065\u1067-\u106e\u1071-\u1075\u1082-\u108e\u108f-\u1090\u109a-\u109e\u135d-\u1360\u1712-\u1715\u1732-\u1735\u1752-\u1754\u1772-\u1774\u17b4-\u17d4\u17dd-\u17de\u180b-\u180e\u1885-\u1887\u18a9-\u18aa\u1920-\u192c\u1930-\u193c\u1a17-\u1a1c\u1a55-\u1a5f\u1a60-\u1a7d\u1a7f-\u1a80\u1ab0-\u1abf\u1b00-\u1b05\u1b34-\u1b45\u1b6b-\u1b74\u1b80-\u1b83\u1ba1-\u1bae\u1be6-\u1bf4\u1c24-\u1c38\u1cd0-\u1cd3\u1cd4-\u1ce9\u1ced-\u1cee\u1cf2-\u1cf5\u1cf7-\u1cfa\u1dc0-\u1dfa\u1dfb-\u1e00\u20d0-\u20f1\u2cef-\u2cf2\u2d7f-\u2d80\u2de0-\u2e00\u302a-\u3030\u3099-\u309b\ua66f-\ua673\ua674-\ua67e\ua69e-\ua6a0\ua6f0-\ua6f2\ua802-\ua803\ua806-\ua807\ua80b-\ua80c\ua823-\ua828\ua880-\ua882\ua8b4-\ua8c6\ua8e0-\ua8f2\ua8ff-\ua900\ua926-\ua92e\ua947-\ua954\ua980-\ua984\ua9b3-\ua9c1\ua9e5-\ua9e6\uaa29-\uaa37\uaa43-\uaa44\uaa4c-\uaa4e\uaa7b-\uaa7e\uaab0-\uaab1\uaab2-\uaab5\uaab7-\uaab9\uaabe-\uaac0\uaac1-\uaac2\uaaeb-\uaaf0\uaaf5-\uaaf7\uabe3-\uabeb\uabec-\uabee\ufb1e-\ufb1f\ufe00-\ufe10\ufe20-\ufe30\u101fd-\u101fe\u102e0-\u102e1\u10376-\u1037b\u10a01-\u10a04\u10a05-\u10a07\u10a0c-\u10a10\u10a38-\u10a3b\u10a3f-\u10a40\u10ae5-\u10ae7\u10d24-\u10d28\u10f46-\u10f51\u11000-\u11003\u11038-\u11047\u1107f-\u11083\u110b0-\u110bb\u11100-\u11103\u11127-\u11135\u11145-\u11147\u11173-\u11174\u11180-\u11183\u111b3-\u111c1\u111c9-\u111cd\u1122c-\u11238\u1123e-\u1123f\u112df-\u112eb\u11300-\u11304\u1133b-\u1133d\u1133e-\u11345\u11347-\u11349\u1134b-\u1134e\u11357-\u11358\u11362-\u11364\u11366-\u1136d\u11370-\u11375\u11435-\u11447\u1145e-\u1145f\u114b0-\u114c4\u115af-\u115b6\u115b8-\u115c1\u115dc-\u115de\u11630-\u11641\u116ab-\u116b8\u1171d-\u1172c\u1182c-\u1183b\u11a01-\u11a0b\u11a33-\u11a3a\u11a3b-\u11a3f\u11a47-\u11a48\u11a51-\u11a5c\u11a8a-\u11a9a\u11c2f-\u11c37\u11c38-\u11c40\u11c92-\u11ca8\u11ca9-\u11cb7\u11d31-\u11d37\u11d3a-\u11d3b\u11d3c-\u11d3e\u11d3f-\u11d46\u11d47-\u11d48\u11d8a-\u11d8f\u11d90-\u11d92\u11d93-\u11d98\u11ef3-\u11ef7\u16af0-\u16af5\u16b30-\u16b37\u16f51-\u16f7f\u16f8f-\u16f93\u1bc9d-\u1bc9f\u1d165-\u1d16a\u1d16d-\u1d173\u1d17b-\u1d183\u1d185-\u1d18c\u1d1aa-\u1d1ae\u1d242-\u1d245\u1da00-\u1da37\u1da3b-\u1da6d\u1da75-\u1da76\u1da84-\u1da85\u1da9b-\u1daa0\u1daa1-\u1dab0\u1e000-\u1e007\u1e008-\u1e019\u1e01b-\u1e022\u1e023-\u1e025\u1e026-\u1e02b\u1e8d0-\u1e8d7\u1e944-\u1e94b\ue0100-\ue01f0" | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really really hope this is not the right way to do this. I certainly don't think it should be all on one line.
There appears to be a canonical regex for package names, unless I'm getting that wrong. I'm not quite sure why we're not using it, though. It does not seem to be compatible with unicode packages, though, so we should probably try and fix this at the spec level. |
Unfortunately it does seem to be the only way to go at this time.
(From PEP 3131.) Unfortunately to use these unicode classes, we have to use the regex, and not the builtin re module. Am I right that adding regex as a dependency to setuptools would be a bad idea? If so, we are I believe effectively stuck with ugly unicode ranges. |
That is for identifiers, this is for package names and versions. It's not clear to me that they are the same (though there's some case to be made that they should be harmonized in the official spec).
Well, the other option is to not use a regular expression, for example, this will cover most of what your regex does. Presumably def fc(x):
out = ""
allowed_categories = {"Lt", "Lm", "Lu", "Ll", "Lt", "Lm", "Lo", "NI"}
for c in x:
if unicodedata.category(c) in allowed_categories:
out += c
else:
out += "-"
return re.sub("-+", "-", out).strip("-") But you'll note that in fact "safe names" are not valid identifiers, since they actually replace all |
I'd like to think not, but it's not as simple as adding a dependency in another project. Any dependency (and its dependencies) must be vendored. This shouldn't be done with abandon, as it does impact downstream system packagers that work to unvendor the vendored packages. |
Thank you for the feedback! I think I will try out @pganssle 's approach first. By the way, pardon my ignorance, but what is the practical difference between the package "safe name" and its identifier in terms of how Python (or setuptools) uses them? |
@julienmalard The two things are nearly entirely different. For something to be a valid identifier means that it can be used as a variable name in Python. For something to be a valid package name means that it can be used for the name of a package to be installed. For example |
Thank you @pganssle for your clear explanation! Please let me know what you think! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please re-write f35c7da to avoid the stylistic changes? Stylistic changes should be made in a separate commit or PR. Some of these changes might be acceptable, but others are not (excess indentation/whitespace added). It's these changes that are creating the merge conflicts.
) | ||
|
||
|
||
def fc(x, replace_with='-', also_accept=None, also_refuse=None): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function really needs a docstring, especially because fc
doesn't mean anything to me.
I started down the path of removing the stylistic changes, and re-organizing the commits into first a test failure. However, I noticed that It's not obvious to me that the reported issue is even related to the fix herein. Let's go back to the issue and work on creating a minimal reproducer that exhibits the failure. |
Summary of changes
Fixed regex error in [A-Za-z] test of package names so that the full range of Python's accepted unicode identifiers are accepted.
Closes #1399 (I think).
Pull Request Checklist