Using trieregex for a list not of words, but of regular expressions? #1
Hi leobaumgardt,

(1) A trie normally breaks strings down into single characters, but regexes cannot be broken down that way because of groups, escaped characters, and so on. A minimal unbreakable unit in a regex can therefore span many characters, and there are countless possibilities.

(2) In a regex you can declare a character or group to be of variable length (by adding ?, *, +, or {2,}, for example), and matching can be greedy or lazy (e.g., *?, +?). When two regexes specify different variable-length behavior for a (sub)pattern, combining them is often only possible as a full disjunction (i.e., r"regex1|regex2"), which provides essentially no optimization.

(3) If one regex has anchors (such as ^ and $) and the other doesn't, you would likewise fall back to a simple disjunction. The same holds for two regexes with different lookaround declarations.

(4) When a disjunction has many items and you need to manage priority (the regex engine stops searching as soon as an earlier-listed alternative matches), it is easier to organize them in code as separate regexes rather than as one big regex, especially if they are long and complicated.

These are just some of the issues, off the top of my head, that would make implementing a "regex optimizer" considerably more complicated, and make it hard to guarantee no unintended errors or side effects. I'm also not sure the practical benefit is high enough: a simple disjunction is very likely the only solution for more complicated regexes, and you don't need a special tool to join regexes that way. There are likely other aspects, not mentioned above, that would add further complexity and difficulty to creating the kind of optimizer you are proposing, especially one that truly and fully supports all the specialized features of a pattern-matching language that already offers such a powerful level of abstraction.
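Point (2) can be made concrete with a small sketch (the patterns below are made up for illustration): two regexes that differ only in greediness match different spans of the same input, so the only generally safe way to merge them is a plain non-capturing disjunction, which optimizes nothing:

```python
import re

# Two hypothetical patterns that differ only in quantifier behavior:
# one greedy, one lazy -- a merged trie could not preserve both semantics.
greedy = r"<.+>"   # matches as much as possible
lazy = r"<.+?>"    # matches as little as possible

# The only generally safe combination is a plain non-capturing disjunction,
# which offers no real optimization over keeping the patterns separate.
combined = rf"(?:{greedy}|{lazy})"

s = "<a><b>"
print(re.match(greedy, s).group())    # <a><b>  (greedy spans both tags)
print(re.match(lazy, s).group())      # <a>     (lazy stops at the first >)
print(re.match(combined, s).group())  # <a><b>  (first alternative wins)
```

Note that the combined pattern also silently imposes a priority: the greedy alternative, listed first, wins, so the disjunction is not even a neutral union of the two behaviors.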
If you have some thoughts to the contrary or suggestions though, I am all ears!
So, in sum, for the regex optimizer you are proposing, we cannot rely on a single trie to store and then retrieve a list of complex regexes. You would likely need at least a list first, for all the regexes that must remain fully disjunctive from each other; then perhaps some separate groups of simple regexes could be combined and converted into separate tries, except that each node in such a trie is no longer necessarily a single character. You would then need detection rules for determining which series of characters should be treated as a single node, conversion rules for character classes and ranges so that set-algebra methods can combine them, and also methods for restoring group backreferences ("\1" referring to the first capturing group that appears in the regex). Maybe there are more advanced features or even more complicated situations I haven't thought of yet. In any case, it would be a fairly different beast from how trieregex works, and far more complex.
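The "detection rules" mentioned above could start from a tokenizer that groups characters into minimal unbreakable units. A rough, deliberately incomplete sketch (the function name and the syntax it covers are illustrative, not part of trieregex, and real regex grammar has far more cases):

```python
import re

# Very rough sketch: split a regex into minimal units that a regex-aware
# trie could treat as single nodes. Handles only escapes, simple character
# classes, and a trailing quantifier -- nothing like full regex syntax.
TOKEN = re.compile(
    r"""(\\.            # escaped character, e.g. \d or \.
         |\[[^\]]*\]    # character class, e.g. [a-z]
         |.             # any other single character
        )
        ([*+?]\??|\{\d+(?:,\d*)?\})?   # optional quantifier, possibly lazy
    """,
    re.VERBOSE,
)

def tokenize(pattern: str) -> list[str]:
    # Keep each atom glued to its quantifier, since "d" and "d+" are
    # different nodes as far as a regex trie is concerned.
    return ["".join(m.groups("")) for m in TOKEN.finditer(pattern)]

print(tokenize(r"ab\d+[xy]?"))   # ['a', 'b', '\\d+', '[xy]?']
print(tokenize(r"\d{2,4}x"))     # ['\\d{2,4}', 'x']
```

Even this toy version shows the problem: groups, alternation, lookarounds, and nested quantifiers would all need their own rules before any two tokenized regexes could safely share trie nodes.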
Oh, one more practical downside: really long regexes take a lot of time to process (probably due to heavy memory usage?), and combining a list of regexes is considerably more likely to leave you with a still really long regex than running trieregex over literal strings is. The exception is when the regexes are all quite simple and can actually be shortened rather than falling back to full disjunction, but then they might be simple enough to combine manually. Otherwise, one would be better off keeping them separate for modularity, so that processing is less intensive (fewer bottleneck issues) and maintenance is easier.
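The modularity point can be as simple as compiling the patterns once and scanning with each in turn; the patterns and helper name below are made up for illustration:

```python
import re

# Hypothetical patterns kept separate instead of merged into one giant
# regex: each stays readable, testable, and cheap to modify on its own.
PATTERNS = [
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),  # ISO-style date
    re.compile(r"\b[\w.]+@[\w.]+\b"),      # rough email shape
]

def find_all(text: str) -> list[str]:
    hits = []
    for pat in PATTERNS:  # priority is just list order -- easy to reason about
        hits.extend(pat.findall(text))
    return hits

print(find_all("Sent 2021-03-05 to a@b.com"))
# ['2021-03-05', 'a@b.com']
```

Swapping, reordering, or profiling one pattern at a time is trivial here, whereas in a single combined regex every change risks side effects on the other alternatives.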
Thanks for making this. Tries are extremely powerful, and your module makes them easy to use.
I'm wondering if I can use it also to optimize a list of regular expressions instead of a list of words.
By regular expressions I mean anything I would feed into re.findall(regex, string), for example. So flags like (?mi), non-capturing groups, etc.
I tried doing that, but your module simply escaped my regular expressions, i.e. treated them as words to be matched verbatim. So it seems the algorithm you are using to generate the trie only works for words, or strings to be matched verbatim, and not for regular expressions.
Am I overlooking something? Do you think it's even possible to do for a list of regular expressions what you've done for lists of words here? Basically, a general regex optimizer.
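For contrast with what I'm asking for, here is a minimal sketch of the literal-word trie approach (my own toy version, not trieregex's actual implementation); the re.escape call in it is the kind of step that makes a pattern like r"\d+" come back matched verbatim:

```python
import re

def trie_regex(words: list[str]) -> str:
    """Build a regex matching exactly the given literal words via a trie.
    Minimal sketch only -- not trieregex's actual implementation."""
    trie: dict = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}  # end-of-word marker

    def emit(node: dict) -> str:
        optional = "" in node  # a word may end at this node
        branches = [re.escape(ch) + emit(child)
                    for ch, child in node.items() if ch != ""]
        if not branches:
            return ""
        body = branches[0] if len(branches) == 1 else "(?:" + "|".join(branches) + ")"
        return "(?:" + body + ")?" if optional else body

    return emit(trie)

pattern = trie_regex(["cat", "car", r"\d+"])
# Every metacharacter was escaped, so r"\d+" matches only the literal
# three characters backslash-d-plus, not one or more digits.
print(bool(re.fullmatch(pattern, "cat")))   # True
print(bool(re.fullmatch(pattern, "123")))   # False
print(bool(re.fullmatch(pattern, r"\d+")))  # True (the literal text)
```

So the sharing of prefixes ("cat"/"car" become ca(?:t|r)) only works because every node is a single, escaped, literal character, which is exactly what a general regex input would break.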
Thanks!