
Using trieregex for a list not of words, but of regular expressions? #1

Open
leoplusx opened this issue Sep 21, 2021 · 3 comments

@leoplusx

Thanks for making this. Tries are extremely powerful, and your module makes them easy to use.

I'm wondering if I can also use it to optimize a list of regular expressions instead of a list of words.

By regular expressions I mean anything I would feed into re.search(regex, string), for example: flags like (?mi), non-capturing groups, and so on.

I tried doing that, but your module simply escaped my regular expressions, i.e., treated them as words to be matched verbatim. So it seems the algorithm you use to generate the trie only works for words (strings to be matched verbatim), not for regular expressions.
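The escaping behavior described above can be demonstrated with the standard library alone. This is a sketch using re.escape, not trieregex's actual code, and the pattern strings are made-up examples:

```python
import re

# Two regexes one might hope to combine (hypothetical examples).
patterns = [r"(?i)cat\d+", r"(?:dog|wolf)"]

# A literal-string trie builder escapes every metacharacter, so each
# pattern is matched verbatim rather than interpreted as a regex.
escaped = [re.escape(p) for p in patterns]

# The escaped pattern matches only the literal source text of the regex...
assert re.fullmatch(escaped[0], r"(?i)cat\d+")
# ...and no longer matches strings the original regex would have matched:
assert not re.fullmatch(escaped[0], "CAT123")
```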

Am I overlooking something? Do you think it's even possible to do for a list of regular expressions what you've done for lists of words here? Basically, a general regex optimizer.

Thanks!

@ermanh (Owner) commented Sep 26, 2021

Hi leobaumgardt,
Thanks for reaching out; I'm glad you find this tool interesting. You're right: trieregex only works on a list of literal strings, not regexes.
I have thought about supporting regexes before, but there are some major challenges:

(1) A trie normally breaks down strings into single characters, but regexes cannot be broken down that way due to groups, escaped characters, and so on. So a minimal unbreakable unit in a regex can involve a long string of characters, and there are countless possibilities.
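As a rough illustration of point (1), here is a toy tokenizer (hypothetical, not part of trieregex) that splits a regex into candidate minimal units. Even this tiny sketch is defeated by nested groups, inline flags, and quantifiers:

```python
import re

# Toy splitter: an escape sequence, a character class, or a flat group
# counts as one atom; everything else is a single character.
# Nested groups, flags like (?i), and quantifiers are NOT handled.
UNIT = re.compile(r"\\.|\[[^\]]*\]|\([^()]*\)|.")

atoms = UNIT.findall(r"\d[a-z](foo)x")
print(atoms)  # ['\\d', '[a-z]', '(foo)', 'x']
```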

(2) In a regex you can declare a character or group to be of variable length (by adding ?, *, +, or {2,}, for example), and the quantifier can be greedy or lazy (e.g., *?, +?). When two regexes specify different variable lengths for the same (sub)pattern, often the only way to combine them is a full disjunction (i.e., r"regex1|regex2"), in which case there is essentially no optimization.
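The full-disjunction fallback mentioned above can be sketched in a few lines (the helper name is illustrative):

```python
import re

def disjoin(patterns):
    """Combine regexes by plain alternation: the no-optimization fallback.

    Each pattern is wrapped in a non-capturing group so quantifiers and
    alternations inside it stay self-contained.
    """
    return "|".join(f"(?:{p})" for p in patterns)

# Two patterns with conflicting quantifiers on the same subpattern:
combined = disjoin([r"ab+c", r"ab*?c"])
assert combined == "(?:ab+c)|(?:ab*?c)"
assert re.fullmatch(combined, "abbc")  # matched by the first alternative
assert re.fullmatch(combined, "ac")    # matched by the second alternative
```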

(3) If one regex has anchors (such as ^ and $) and the other doesn't, you would likewise fall back on a simple disjunction. The same goes for two regexes with different lookaround declarations.

(4) When you have many items in a disjunction and need to manage priority (the regex engine stops searching once an earlier-listed alternative matches), it's easier to organize them in code as separate regexes rather than as one big regex, especially if they are long and complicated.
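For instance, keeping the regexes in an ordered list makes the priority explicit and easy to reorder (a sketch; the patterns and function name are invented):

```python
import re

# Floats must be tried before ints, or "3.14" would match as just "3".
rules = [r"\d+\.\d+", r"\d+"]

def first_match(text):
    """Try each pattern in priority order; return the first match found."""
    for pattern in rules:
        m = re.match(pattern, text)
        if m:
            return m.group()
    return None

assert first_match("3.14 apples") == "3.14"
assert first_match("7 apples") == "7"
```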

These are just some of the issues off the top of my head that would make implementing a "regex optimizer" quite a bit more complicated, and make it hard to guarantee that no unintended errors or side effects result. I'm also not sure the practical benefit is high enough: for more complicated regexes, simple disjunction is very likely the only solution, in which case you don't need a special tool to join them.

There are likely other aspects not mentioned above that would add further complexity, especially for an optimizer that truly and fully supports all the specialized features of a pattern-matching language that already operates at such a powerful level of abstraction. If you have thoughts to the contrary or suggestions, though, I am all ears!

@ermanh (Owner) commented Sep 26, 2021

So in sum, for the regex optimizer you are proposing, we cannot rely on a single trie to store and then retrieve a list of complex regexes. You would likely need at least a list first, for all the regexes that have to remain fully disjunctive with each other. Then maybe some separate groups of simple regexes could be combined and converted into separate tries, except that each node in such a trie is not necessarily a single character. You would then need to write detection rules for determining which series of characters should be treated as a single node, conversion rules for character classes and ranges so you can apply set-algebra methods to combine them, and also methods for restoring group backreferences ("\1" referring to the first capturing group that appears in the regex). And maybe there are more advanced features or even more complicated situations I haven't thought about yet.
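The set-algebra idea for character classes might look like this in miniature. This is a hypothetical helper that only understands plain characters and simple ranges; negation, escapes, and shorthand classes are out of scope:

```python
def merge_classes(class_a, class_b):
    """Combine two simple character classes via set union.

    E.g. '[a-c]' and '[b-e]' cover a through e combined; the result is
    emitted un-collapsed as '[abcde]'. Only plain characters and simple
    ranges are handled; real classes need far more machinery.
    """
    def expand(cls):
        body, chars, i = cls.strip("[]"), set(), 0
        while i < len(body):
            if i + 2 < len(body) and body[i + 1] == "-":
                # A range like 'a-c': add every character in it.
                chars.update(chr(c) for c in range(ord(body[i]), ord(body[i + 2]) + 1))
                i += 3
            else:
                chars.add(body[i])
                i += 1
        return chars

    merged = sorted(expand(class_a) | expand(class_b))
    return "[" + "".join(merged) + "]"

print(merge_classes("[a-c]", "[b-e]"))  # [abcde]
```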

In any case, it would be a fairly different beast compared to how trieregex works, and way more complex.

@ermanh (Owner) commented Sep 26, 2021

Oh, one more practical downside: really long regexes take a lot of time to process (probably due to heavy memory usage?), and combining a list of regexes is considerably more likely to produce a still-really-long regex than running trieregex on literal strings is. Unless the regexes are all quite simple and can actually be shortened instead of resorting to full disjunction (in which case they might be simple enough to combine manually), one would be better off keeping them separate for modularity, so that processing is less intensive (fewer bottleneck issues) and maintenance is easier.
