Describe the bug
Compiling the Pomsky expression [word] for the Python flavor produces \w. But Python's \w doesn't follow the Unicode specification for word characters:
It matches the Letter (Lm, Lt, Lu, Ll, Lo) general categories instead of the Alphabetic property.
It matches code points with a Numeric_Type of Digit, Decimal, or Numeric, but it should match only the Decimal_Number (Nd) general category.
It doesn't match the Mark (Mn, Mc, Me) general categories, nor Connector_Punctuation (Pc), except for the underscore _.
It doesn't match characters with the Join_Control property (U+200C, U+200D).
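The discrepancies listed above can be checked directly against Python's re module. The following is a minimal sketch, not part of the original report: the sample code points and the helper name re_matches_word are picked here purely for illustration, one per category mentioned.

```python
import re
import unicodedata

# Illustrative sample code points (chosen for this sketch, not from the report).
SAMPLES = {
    "\u24b6": "CIRCLED LATIN CAPITAL LETTER A: Alphabetic, but general category So",
    "\u00b2": "SUPERSCRIPT TWO: Numeric_Type=Digit, general category No",
    "\u00bd": "VULGAR FRACTION ONE HALF: Numeric_Type=Numeric, general category No",
    "\u093f": "DEVANAGARI VOWEL SIGN I: general category Mc (Mark)",
    "\u203f": "UNDERTIE: general category Pc (Connector_Punctuation)",
    "\u200d": "ZERO WIDTH JOINER: Join_Control",
}


def re_matches_word(ch: str) -> bool:
    """Return True if the re module's \\w matches the single character ch."""
    return re.fullmatch(r"\w", ch) is not None


for ch, note in SAMPLES.items():
    category = unicodedata.category(ch)
    print(f"U+{ord(ch):04X} ({category}) matched by \\w: {re_matches_word(ch)}  # {note}")
```

If the report is accurate, only the two numeric characters should be matched; the other four are word characters under the Unicode definition but should be rejected by re.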
To Reproduce
Run pomsky -f python '[word]+'
Run regex-test -f python '\w+' -t "\u0939\u093f\u0928\u094d\u0926\u0940"
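The second step can also be reproduced without regex-test, straight from a Python interpreter. This is a small sketch using only the standard library; the variable names are illustrative.

```python
import re

# The Hindi word from the reproduction step: letters interleaved with
# vowel signs and a virama, which are all in the Mark categories.
word = "\u0939\u093f\u0928\u094d\u0926\u0940"

# \w+ cannot span the marks, so the word is split at every mark.
pieces = re.findall(r"\w+", word)
whole = re.fullmatch(r"\w+", word)

print(pieces)
print(whole)
```

Since the marks are not matched, the word is broken into its three base letters and fullmatch finds no match for the whole string.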
Expected behavior
Note that Python's re module does not support Unicode properties, so it's impossible to polyfill proper Unicode support.
Therefore, [word] should be forbidden in the Python regex flavor, unless Unicode is disabled; then it should produce [a-zA-Z0-9_].
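For the Unicode-disabled case, Python's re.ASCII flag already restricts \w to ASCII. The sketch below checks, under that assumption, that \w with re.ASCII and the proposed [a-zA-Z0-9_] accept exactly the same code points; it illustrates the equivalence and says nothing about how Pomsky itself would emit the class.

```python
import re
import sys

ascii_word = re.compile(r"\w", re.ASCII)   # \w with Unicode matching disabled
explicit = re.compile(r"[a-zA-Z0-9_]")     # the class proposed in this issue

# Collect every code point on which the two patterns disagree.
mismatches = [
    cp
    for cp in range(sys.maxunicode + 1)
    if bool(ascii_word.fullmatch(chr(cp))) != bool(explicit.fullmatch(chr(cp)))
]

print(len(mismatches))
```

An empty mismatch list means the explicit class is a safe output for [word] when Unicode is turned off.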
This is not a satisfactory solution on its own, however, because it makes it impossible to match non-ASCII word characters. Some people may find \w useful even though it is incorrect and only matches a subset of word characters. For that reason, a second Python flavor should be added that targets the third-party regex module, which has much better Unicode support.
Alternatives
Add a nonstandard_unicode mode, so \w can be used in flavors where \w matches some non-ASCII word characters, but not all (i.e. Python and .NET)
Related
python/cpython#44795