Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimization: Remove unnecessary elements in character classes #59

Open
RunDevelopment opened this issue Dec 24, 2022 · 5 comments
Open
Labels
C-optimize Issue or feature request for an optimization enhancement New feature or request

Comments

@RunDevelopment
Copy link

Is your feature request related to a problem? Please describe.

Pomsky currently does not remove unnecessary elements in character classes. E.g. [ w "abc" ] compiles to [\wabc] (Java). However, the abc is unnecessary because [\wabc] == \w.

Describe the solution you'd like

Remove unnecessary elements in character classes to optimize and simplify them.

Additional context

This requires knowing the precise set of characters accepted by each character class element. For an example implementation of this, checkout the regexp/no-dupe-characters-character-class rule.

@RunDevelopment RunDevelopment added the enhancement New feature or request label Dec 24, 2022
@Aloso
Copy link
Member

Aloso commented Dec 28, 2022

Thanks for your feature request! This is already on my to-do list, but is tricky to get right.

Another reason why we need this is to prevent the following:

![w !d]

A negated character set matching neither \w nor \D matches nothing, which is forbidden in Rust. So I'm working on a way to determine whether two character classes overlap, are disjunct, or one is a subset of the other.

@RunDevelopment
Copy link
Author

So I'm working on a way to determine whether two character classes overlap, are disjunct, or one is a subset of the other.

The exact set of characters matched by each character set is defined in pomsky, right? Then couldn't you parse them into an interval set? These interval sets can be efficiently unioned, intersected, and compared (equal, subset, disjoint).
That's what the regex crate also does under the hood. We also do this for eslint-plugin-regexp. Having this representation for characters, character sets, and character classes makes it pretty easy to implement some optimizations.

@Aloso
Copy link
Member

Aloso commented Dec 28, 2022

Yes, except that we want to preserve \w, \d, \s, \p{Greek}, \p{Separator}, etc. rather than lowering them to a lot of ranges, so we can emit the smallest possible output.

@RunDevelopment
Copy link
Author

Preserving character sets and Unicode properties is not mutually exclusive with using interval sets. It's of course true that interval sets do not preserve the elements that created them, but that's also not really a problem. I meant to suggest that the optimizer should have a way to get the interval set from character elements, not that character classes should be represented by interval sets.

@Aloso Aloso added the C-optimize Issue or feature request for an optimization label Dec 28, 2022
@Aloso Aloso added this to the v0.10 milestone Jan 14, 2023
@Aloso Aloso modified the milestones: v0.10, v0.11 Oct 25, 2023
@Aloso
Copy link
Member

Aloso commented Nov 26, 2024

This has been implemented for the simple case (overlapping character ranges):

['a'-'f' 'b'-'g']  // --> [a-g]
['a'-'z' 'b']      // --> [a-z]

There's a special case to remove digit when word is present. Otherwise, character classes and Unicode properties aren't handled yet.

@Aloso Aloso removed this from the v0.11 milestone Nov 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-optimize Issue or feature request for an optimization enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants