Support for PEP3131 (Non-ASCII Identifiers) #160

DylanLukes · 2020-01-24T01:29:03Z

In the process of using Baron for some research on source pulled from hundreds/thousands of repositories on GitHub, I've found that in many cases Baron is unable to tokenize/parse source containing non-ASCII identifiers.

Non-ASCII identifiers have been supported since Python 3.0, as specified by PEP3131.

This pull request includes some very small changes that allow Baron to handle non-ASCII identifiers:

- Replace the native re module with a dependency on the regex module.
  - This is because regex supports Unicode character property classes.
- Replace the regex for NAME tokens:
  - Before: [a-zA-Z_]\w*
  - After: [\p{XID_Start}_]\p{XID_Continue}*
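To illustrate why the ASCII-only pattern is too narrow, here is a minimal sketch that uses only the standard library. It compares the "Before" pattern against `str.isidentifier()`, which serves here only as a stand-in for the XID-based pattern; the PR itself relies on the third-party regex module instead. The helper names `old_pattern_accepts` and `python_accepts` are made up for this example.

```python
import re

# The ASCII-only NAME pattern quoted above as "Before".
OLD_NAME = re.compile(r'[a-zA-Z_]\w*')


def old_pattern_accepts(name):
    """Return True if the old pattern matches the whole string."""
    return OLD_NAME.fullmatch(name) is not None


def python_accepts(name):
    """Return True if Python 3 itself treats the string as an identifier.

    This is a stand-in for the XID-based pattern, not what the PR uses.
    """
    return name.isidentifier()


for name in ['foo', '_bar', 'β', '가사']:
    print(name, old_pattern_accepts(name), python_accepts(name))
```

On Python 3 the old pattern rejects 'β' and '가사' because neither starts with an ASCII letter or underscore, while Python itself accepts both as identifiers.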

I have checked that all tests pass without regression, and have added another simple test:

def test_name_unicode():
    match('β', 'NAME')
    match('가사', 'NAME')

Note:

PEP3131 states:

The identifier syntax is <XID_Start> <XID_Continue>*.

However, this seems to be an error, as XID_Start does not contain _ by default (though the Unicode specifications suggest that a Start class could or should contain it).
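A rough way to see why the underscore needs to be listed explicitly, using only the standard library. The set of general categories below is an assumption taken from the Python language reference's description of identifier start characters; it only approximates XID_Start, since it ignores Other_ID_Start and the NFKC-related adjustments. The helper name `in_start_categories` is made up for this sketch.

```python
import unicodedata

# General categories the Python language reference lists for identifier
# start characters, before the underscore is added separately.
# Assumption: this approximates XID_Start, it is not the exact property.
ID_START_CATEGORIES = {'Lu', 'Ll', 'Lt', 'Lm', 'Lo', 'Nl'}


def in_start_categories(ch):
    """Return True if the character's general category is a start category."""
    return unicodedata.category(ch) in ID_START_CATEGORIES


print(unicodedata.category('_'))   # general category of the underscore
print(in_start_categories('_'))    # is it covered by the letter categories?
print(in_start_categories('β'))    # a Greek lowercase letter, for comparison
print('_'.isidentifier())          # Python still accepts it as an identifier
```

The underscore has general category Pc (connector punctuation), so it falls outside the letter categories, which is consistent with writing the character class as [\p{XID_Start}_] rather than \p{XID_Start} alone.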

DylanLukes · 2020-01-24T23:43:14Z

Looks like there's a failing test on 2.7, will fix.

The 2.6 failure is unrelated to this PR:

$ curl -sSf --retry 5 -o python-2.6.tar.bz2 ${archive_url}
curl: (22) The requested URL returned error: 404 Not Found

DylanLukes · 2020-01-25T00:31:29Z

Alright, tests now all pass on 2.7 and up! I ended up making them conditional on the Python version, as it turns out the derived Unicode categories differ between Python 2 and Python 3.

That is, "α" is matched by "\p{XID_Start}" on Python 3, but not on Python 2.

In summary: this set of changes adds support for Python 3's Unicode identifiers... but only if you're using Python 3.

DylanLukes added 4 commits January 23, 2020 17:15

replace re => regex

88864d7

NAME regex "[a-zA-Z_]\w*" => "[\p{XID_Start}_]\p{XID_Continue}*"

99514e9

add test for unicode name

9517d7e

support for splitting non-ASCII

19d61ef

DylanLukes added 2 commits January 24, 2020 16:03

minor cleanup

e6628bc

add six, make unicode tests conditional on py version

4c20209
