Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimization: Deduplicate alternatives #60

Open
RunDevelopment opened this issue Dec 25, 2022 · 3 comments
Open

Optimization: Deduplicate alternatives #60

RunDevelopment opened this issue Dec 25, 2022 · 3 comments
Labels
C-optimize Issue or feature request for an optimization enhancement New feature or request

Comments

@RunDevelopment
Copy link

Is your feature request related to a problem? Please describe.

Duplicate alternatives can happen when combining different variables, or simply by accident.

Example: A subset of C keyword with while appearing twice:

% (
  | "break"
  | "case"
  | "char"
  | "const"
  | "continue"
  | "default"
  | "do"
  | "double"
  | "else"
  | "float"
  | "for"
  | "if"
  | "int"
  | "long"
  | "return"
  | "short"
  | "sizeof"
  | "struct"
  | "switch"
  | "void"
  | "while"
  | "while"
) %

Describe the solution you'd like

Assuming no side effects (e.g. capturing groups), duplicate alternatives can be removed without affecting the behavior of the pattern (negatively).

More specifically, given an alternation, let index(x) -> int be a function that returns the index of a given alternative of the alternation in its list of alternatives. If there exist two alternatives a and b such that index(a) < index(b), then b can be removed without effecting the pattern.

This optimization might even prevent very simple cases of exponential backtracking. E.g. ( "a" | "a" )+ "b".

Describe alternatives you've considered

A more complete solution would be to implement a more complete DFA-based solution to determine whether a given alternative is a subset of the union of all previous alternatives in the alternation. However, this is significantly more computationally expensive and does not support irregular features such as assertions.

Additional context

The regexp/no-dupe-disjunctions rule is an implementation of this optimization. This rule uses the DFA approach along if a simpler regex-comparison approach to determine duplicate alternatives.

@RunDevelopment RunDevelopment added the enhancement New feature or request label Dec 25, 2022
@Aloso Aloso added the C-optimize Issue or feature request for an optimization label Dec 28, 2022
@lppedd
Copy link

lppedd commented Jan 2, 2023

We need to make a list of all these optimizations.
Even if they don't get into the compiler immediately, they can be supported in the IDE as inspections.

@Aloso
Copy link
Member

Aloso commented Jan 3, 2023

@lppedd while an inspection would be nice, it should also exist as an optimization, which works in the following situation:

let x = 'foo' | 'bar' | 'baz';
let y = 'bar' | 'quux';

x | y

The output is foo|bar|baz|bar|quux, but should be optimized to foo|bar|baz|quux.

@Aloso
Copy link
Member

Aloso commented Nov 26, 2024

The more general (and perhaps more common) optimization would be prefix merging:

'foobar' | 'foo'     -> 'foo' 'bar'?
'boring' | 'boeing'  -> 'bo' ('ring' | 'eing')

(The latter could even become bo[er]ing by looking at the end as well, but this is not necessary)

The tricky part is to do this without changing precedence. It can be simplified by not using ? and compiling the first one to foo(?:bar|), so we don't have to take laziness into account.

Alternatives can sometimes be reordered, but not always:

'foobar' | ['d'-'f'] | 'foo'

Here it's not allowed, since the character set includes the f. It could be turned into the following:

['de'] | 'f' ('oobar' | '' | 'oo')

by removing the f from the character set, but I won't implement that, since the benefit is unclear.

Note that compiling it to [de]|f(?:oo(bar)?)? would be wrong because it has different precedence, so the foobar and foo alternatives can't be merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-optimize Issue or feature request for an optimization enhancement New feature or request
Projects
Status: No status
Development

No branches or pull requests

3 participants