Description
Describe the bug
Compiling the Pomsky expression [word] targeting the Python flavor produces \w. But Python's \w doesn't match the Unicode spec:
- It matches the Letter (Lm, Lt, Lu, Ll, Lo) general categories instead of the Alphabetic property.
- It matches code points with a Numeric_Type of Digit, Decimal, or Numeric, but it should match just the Decimal_Number (Nd) general category.
- It doesn't match the Mark (Mn, Mc, Me) general categories, nor Connector_Punctuation (Pc), except for the underscore _.
- It doesn't match characters with the Join_Control property (U+200C, U+200D).
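The points above can be checked directly against CPython's re module. The following is a minimal sketch, not part of Pomsky; the helper matches_word and the sample code points are chosen here purely for illustration.

```python
import re
import unicodedata

# A str pattern, so Unicode matching is the default in Python 3.
WORD = re.compile(r"\w")


def matches_word(ch: str) -> bool:
    """Return True if the single character `ch` is matched by Python's \\w."""
    return WORD.fullmatch(ch) is not None


# Illustrative sample code points (picked for this sketch).
samples = {
    "letter (Lo)": "\u0939",            # DEVANAGARI LETTER HA
    "spacing mark (Mc)": "\u093f",      # DEVANAGARI VOWEL SIGN I
    "nonspacing mark (Mn)": "\u094d",   # DEVANAGARI SIGN VIRAMA
    "other number (No)": "\u00bd",      # VULGAR FRACTION ONE HALF
    "connector punct. (Pc)": "\u203f",  # UNDERTIE
    "underscore (Pc)": "_",
    "join control (Cf)": "\u200c",      # ZERO WIDTH NON-JOINER
}

for label, ch in samples.items():
    category = unicodedata.category(ch)
    print(f"U+{ord(ch):04X} {category} {label}: {matches_word(ch)}")
```

The marks, the undertie, and the zero width non-joiner are rejected, while the fraction (which is not in Nd) is accepted.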
To Reproduce
Run pomsky -f python '[word]+'
Run regex-test -f python '\w+' -t "\u0939\u093f\u0928\u094d\u0926\u0940"
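If regex-test is not available, the same behavior can be observed with plain Python. This is a sketch that assumes only the standard re module; the variable names are illustrative.

```python
import re

# The Hindi word from the reproduction step, written as escapes:
# letter, vowel sign (Mc), letter, virama (Mn), letter, vowel sign (Mc).
text = "\u0939\u093f\u0928\u094d\u0926\u0940"

# Python's \w skips the combining marks, so the word is split apart.
matches = re.findall(r"\w+", text)

print(len(text), "code points in the input")
print(len(matches), "separate matches:", matches)
```

Instead of one match spanning all six code points, the pattern finds three single-letter matches, because each combining mark ends the current run of word characters.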
Expected behavior
Note that Python's re module does not support Unicode properties, so it's impossible to polyfill proper Unicode support.
Therefore, [word] should be forbidden in the Python regex flavor, unless Unicode is disabled; then it should produce [a-zA-Z0-9_].
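For comparison, Python's own re.ASCII flag already restricts \w to the same set as [a-zA-Z0-9_]. The sketch below checks this over a limited code point range; same_set and the limit of 0x3000 are assumptions made for this example, and the flag is a feature of Python's re module, not of Pomsky.

```python
import re

UNICODE_WORD = re.compile(r"\w")          # default Unicode behavior
ASCII_WORD = re.compile(r"\w", re.ASCII)  # \w restricted to ASCII
ASCII_CLASS = re.compile(r"[a-zA-Z0-9_]")  # the proposed output


def same_set(a: re.Pattern, b: re.Pattern, limit: int = 0x3000) -> bool:
    """Check that two single-character patterns agree on every code point below `limit`."""
    return all(
        (a.fullmatch(chr(cp)) is None) == (b.fullmatch(chr(cp)) is None)
        for cp in range(limit)
    )


print("ASCII \\w equals [a-zA-Z0-9_]:", same_set(ASCII_WORD, ASCII_CLASS))
print("Unicode \\w equals [a-zA-Z0-9_]:", same_set(UNICODE_WORD, ASCII_CLASS))
```

Only the ASCII-restricted pattern agrees with the explicit character class, which is why emitting [a-zA-Z0-9_] is safe when Unicode is disabled.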
This is not a satisfactory solution, however, since this makes it impossible to match non-ASCII word characters. Some people may find \w useful even though it is incorrect and only matches a subset of word characters. That is why another Python flavor should be added, targeting the regex module, which has much better Unicode support.
Alternatives
Add a nonstandard_unicode mode, so \w can be used in flavors where \w matches some non-ASCII word characters, but not all (i.e. Python and .NET).