\w in Python does not conform to Unicode #86

Open
Aloso opened this issue Mar 28, 2023 · 0 comments
Labels
bug (Something isn't working), C-compat (Compatibility between regex flavors)

Aloso commented Mar 28, 2023

Describe the bug

Compiling the Pomsky expression [word] targeting the Python flavor produces \w. But Python's \w doesn't match the Unicode spec:

  • It matches the Letter (Lm, Lt, Lu, Ll, Lo) general categories, instead of the Alphabetic property

  • It matches code points with a Numeric_Type of Digit, Decimal, or Numeric, but it should match just the Decimal_Number (Nd) general category.

  • It doesn't match the Mark (Mn, Mc, Me) general categories, nor Connector_Punctuation (Pc), except for the underscore _.

  • It doesn't match characters with the Join_Control property (U+200C, U+200D)
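The points above can be checked from Python itself. The following is a minimal sketch using only the standard library; the sample code points and the helper name `matches_python_word` are chosen here for illustration and are not part of Pomsky. In this sketch, the two numeric samples (No, with Numeric_Type Numeric and Digit) are accepted by `\w`, while the marks, the non-underscore connector punctuation, and the zero width joiner are rejected.

```python
import re
import unicodedata

# One illustrative code point per discrepancy (samples picked for this sketch).
SAMPLES = [
    "\u00bd",  # VULGAR FRACTION ONE HALF: category No, Numeric_Type=Numeric
    "\u2460",  # CIRCLED DIGIT ONE: category No, Numeric_Type=Digit
    "\u093f",  # DEVANAGARI VOWEL SIGN I: category Mc (spacing mark)
    "\u0301",  # COMBINING ACUTE ACCENT: category Mn (nonspacing mark)
    "\u203f",  # UNDERTIE: category Pc (connector punctuation)
    "\u200d",  # ZERO WIDTH JOINER: category Cf, Join_Control
]


def matches_python_word(ch: str) -> bool:
    """Return True if Python's re module treats ch as a word character."""
    return re.fullmatch(r"\w", ch) is not None


for ch in SAMPLES:
    print(
        f"U+{ord(ch):04X}",
        unicodedata.category(ch),
        unicodedata.name(ch),
        matches_python_word(ch),
    )
```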

To Reproduce

Run pomsky -f python '[word]+'

Run regex-test -f python '\w+' -t "\u0939\u093f\u0928\u094d\u0926\u0940"
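The same check can be sketched directly with Python's `re` module, without `regex-test`. The test string is the word "हिन्दी"; the variable names below are illustrative only. Because the vowel signs and the virama are marks, `\w+` splits the word into three single-letter runs and cannot match it as a whole.

```python
import re

# "हिन्दी": letters (Lo) interleaved with marks (Mc, Mn, Mc).
hindi = "\u0939\u093f\u0928\u094d\u0926\u0940"

# Runs of word characters as Python's re sees them.
runs = re.findall(r"\w+", hindi)

# A conforming \w+ would match the whole word; Python's does not.
whole = re.fullmatch(r"\w+", hindi)

print(runs)
print(whole)
```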

Expected behavior

Note that Python's re module does not support Unicode properties, so it's impossible to polyfill proper Unicode support.
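As a small illustration of that limitation, the sketch below probes whether `re` accepts a `\p{...}` escape. The function name `supports_unicode_properties` is hypothetical; on the CPython versions I am aware of, `re` rejects the pattern as a bad escape, though a future release could change this.

```python
import re


def supports_unicode_properties() -> bool:
    """Probe whether the stdlib re module compiles a \\p{...} property escape."""
    try:
        re.compile(r"\p{Alphabetic}")
    except re.error:
        # re treats \p as an unknown escape and refuses to compile the pattern.
        return False
    return True


print(supports_unicode_properties())
```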

Therefore, [word] should be forbidden in the Python regex flavor, unless Unicode is disabled; then it should produce [a-zA-Z0-9_].
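For reference, Python's own ASCII mode already behaves like that character class. The sketch below compares `\w` under `re.ASCII` against `[a-zA-Z0-9_]` over a range of code points; mapping Pomsky's "Unicode disabled" to `re.ASCII` is an assumption made for this comparison, and `first_difference` is a hypothetical helper.

```python
import re
from typing import Optional

ascii_word = re.compile(r"\w", re.ASCII)   # \w restricted to ASCII
explicit = re.compile(r"[a-zA-Z0-9_]")     # the class proposed above


def first_difference(limit: int = 0x110000) -> Optional[int]:
    """Return the first code point below limit where the two patterns disagree."""
    for cp in range(limit):
        ch = chr(cp)
        if bool(ascii_word.fullmatch(ch)) != bool(explicit.fullmatch(ch)):
            return cp
    return None


# Checking a prefix of the code space keeps the run short.
print(first_difference(0x3000))
```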

This is not a satisfactory solution, however, because it makes it impossible to match non-ASCII word characters. Some people may find \w useful even though it is incorrect and only matches a subset of word characters. That is why another Python flavor should be added, targeting the third-party regex module, which has much better Unicode support.

Alternatives

Add a nonstandard_unicode mode, so that \w can be used in flavors where it matches some, but not all, non-ASCII word characters (i.e. Python and .NET).

Related

python/cpython#44795

Aloso added the bug (Something isn't working) and C-compat (Compatibility between regex flavors) labels Mar 28, 2023
Aloso changed the title from "\w in Python is does not conform to Unicode" to "\w in Python does not conform to Unicode" Mar 28, 2023