Hack grammar contains invalid Oniguruma token \xff #159

Closed
slevithan opened this issue Nov 26, 2024 · 2 comments

@slevithan
Contributor

slevithan commented Nov 26, 2024

This regex includes the character class [^a-zA-Z0-9_\\x7f-\\xff] (with backslashes escaped for JSON). The \x7f-\xff range in this negated class triggers an Oniguruma bug, which in turn causes a bug in the Hack grammar. Although the range appears to exclude code points U+7F to U+FF, it actually excludes U+7F to U+10FFFF. The reason is described in this comment.
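
To make the difference concrete, here is a minimal sketch that emulates both readings with Python's `re` module. It does not run Oniguruma: `intended` is the class as it reads at face value, and `effective` is a hand-written model of the behavior described above. Both names and the sample characters are illustrative assumptions.

```python
import re

# The class as it reads at face value: exclude only U+7F..U+FF.
intended = re.compile(r"[^a-zA-Z0-9_\x7f-\xff]")

# Model of the behavior described in this issue (an emulation, not
# Oniguruma itself): the range effectively runs up to U+10FFFF.
effective = re.compile(r"[^a-zA-Z0-9_\x7f-\U0010ffff]")


def matches(pattern, ch):
    """Return True if the single character ch is matched by the class."""
    return pattern.fullmatch(ch) is not None


samples = ["-", "a", "\u00e9", "\u0100", "\U0001f600"]
for ch in samples:
    print(f"U+{ord(ch):04X}", matches(intended, ch), matches(effective, ch))
```

The two models agree up to U+FF and diverge for every code point above it, such as U+0100.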

This can be fixed by simply changing the \xff to \x{ff}. Oniguruma then stops interpreting it as an invalid standalone encoded byte value (which is how it treats unenclosed \xHH for values above 7F) and instead interprets it as a code point value (which is always the case for the enclosed form \x{...}). Note that this handling is specific to Oniguruma and does not apply to other regex flavors.

Also note that this is the only place in the grammar that \\x7f-\\xff appears, but the correct version \\x{7f}-\\x{ff} appears 28 times.

To fix this, replace the \\xff with \\x{ff}; it will then work correctly in Oniguruma. Replacing the \\x7f with \\x{7f} as well is optional.
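
For anyone who wants to apply the same rewrite mechanically, the following is a rough sketch, not part of the grammar or of any tool mentioned here. The helper name `enclose_high_hex_escapes` is hypothetical. It assumes the pattern has already been JSON-decoded, so that `\\xff` in the grammar file appears as `\xff`.

```python
import re

# Scan escapes pairwise (a backslash plus the following character) so that
# an escaped backslash followed by literal "xff" is not mistaken for \xff.
_ESCAPE = re.compile(r"\\(?:x([0-9A-Fa-f]{2})|.)", re.DOTALL)


def enclose_high_hex_escapes(pattern: str) -> str:
    """Rewrite unenclosed \\xHH escapes above 7F to the enclosed \\x{HH} form.

    Hypothetical helper for illustration; expects JSON-decoded regex source.
    """
    def fix(match):
        hex_digits = match.group(1)
        if hex_digits is not None and int(hex_digits, 16) > 0x7F:
            return "\\x{" + hex_digits + "}"
        # Leave every other escape (including \x7f and \x{...}) untouched.
        return match.group(0)

    return _ESCAPE.sub(fix, pattern)


print(enclose_high_hex_escapes(r"[^a-zA-Z0-9_\x7f-\xff]"))
```

The sketch leaves `\x7f` alone, since enclosing it is optional, and it does not touch escapes that are already in the enclosed form.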

@slevithan
Contributor Author

Re-opening after additional research. I've also edited my comment above to add more clarifying details.

slevithan reopened this Nov 27, 2024
@slevithan
Contributor Author

In addition to causing edge-case bugs, this issue also prevents the Hack grammar from running in Shiki with its JS engine (which transpiles Oniguruma regexes to JS using Oniguruma-To-ES). Oniguruma-To-ES intentionally doesn't reproduce Oniguruma's bugs in handling unenclosed \xF5 through \xFF, and instead throws for them (as Oniguruma does for \x80 through \xF4).
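
The value ranges mentioned in this thread can be summarized in a small sketch. This is a simplified model of the behavior as described here for a standalone unenclosed `\xHH` escape. It is not the API of Oniguruma or Oniguruma-To-ES, and the function name and labels are made up for illustration.

```python
def classify_unenclosed_hex(value: int) -> str:
    """Classify a standalone unenclosed \\xHH value per the ranges in this thread.

    Hypothetical labels:
      "ok"    - 00..7F, unaffected by this issue
      "error" - 80..F4, described here as rejected by Oniguruma
      "buggy" - F5..FF, described here as accepted by Oniguruma but mishandled
                (Oniguruma-To-ES is described as throwing for these too)
    """
    if not 0 <= value <= 0xFF:
        raise ValueError("unenclosed \\xHH holds exactly two hex digits")
    if value <= 0x7F:
        return "ok"
    if value <= 0xF4:
        return "error"
    return "buggy"


for value in (0x7F, 0x80, 0xF4, 0xF5, 0xFF):
    print(f"\\x{value:02X}", classify_unenclosed_hex(value))
```

Under this model, the `\xff` in the Hack grammar falls into the last group, which matches the description of Oniguruma accepting it while the JS engine rejects it.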
