Rexlex, short for (R)egular (Ex)pressions and (Lex)ers, provides configurable and scalable Regular Expression
Note that Rexlex Matching is a deprecated feature. Matching and Searching is subject to the successor project patternsearchalgorithms.
Pattern pattern = Pattern.compile("\\d{2}-\\d{2}-\\d{4}", new DefaultMatcherBuilder());
Finder matcher = pattern.finder("born on 04-07-1946");
while (matcher.find()) {
System.out.println("found text = " + matcher.match.text());
System.out.println("at = " + matcher.match.start());
System.out.println("to = " + matcher.match.end());
}
for (Match match : matcher.findAll()) {
System.out.println("found text = " + match.text());
System.out.println("at = " + match.start());
System.out.println("to = " + match.end());
}
Pattern pattern = Pattern.compile("\\d{2}-\\d{2}-\\d{4}", new DefaultMatcherBuilder());
Matcher matcher = builder.matcher(04-07-1946");
System.out.prinltn("matches: " + matcher.matches());
The DefaultMatcherBuilder
is fast for matching (O(n)), but naive for searching (O(n²)). This was the primary motivation to provide more advanced MatcherBuilder
s which are more efficient in searching.
Until version 0.2.11. there were two further MatcherBuilder
s:
- SearchMatcherBuilder (two phase search with O(n) for finding position and O(n) for finding the pattern at that position)
- OptimizedMatcherBuilder (recognizes simple patterns, that can be recognized with much faster multi-string-search)
Both were correct in finding match positions, but finding the best pattern at the position was not completely correct. There also was a working alternative with patternsearchalgorithms, so we decided to exclude these implementations.
First the lexer tokens which are produced when finding a certain lexing idiom must be defined. To do this you must implement the class Token:
public class MyToken implements Token {
private String literal;
private TokenType type;
public TestToken(String literal, TokenType type) {
this.literal = literal;
this.type = type;
}
@Override
public String getLiteral() {
return literal;
}
@Override
public TokenType getType() {
return type;
}
...
}
This default implementation should be sufficient in most cases, but be free to extend this type with methods you later need.
Rexlex has three default token types (in the enum DefaultTokenType). You may want to extend the token types. TokenTypes could be enums or classes. Note that in latter case you should correctly implement the methods hashCode and equals.
public enum MyTokenType implements TokenType {
A,B,REMAINDER;
@Override
public boolean error() {
return false;
}
@Override
public boolean accept() {
return true;
}
}
Then write the token factory.
public class MyTokenFactory implements TokenFactory<TestToken>{
@Override
public MyToken createToken(String literal, TokenType type) {
return new MyToken(literal, type);
}
}
Having the tokens and the token factory you can build a lexer. In the following code we assume that you have defined additional token types A, B and REMAINDER:
Map<String, TokenType> patternToTypes = new HashMap();
patternToTypes.put("a", A); //any match for 'a' will return token type A
patternToTypes.put("b", B); //any match for 'b' will return token type B
DynamicLexer<TestToken> lexer = new DynamicLexer<TestToken>(patternToTypes, REMAINDER, factory); // nonmatched strings will return REMAINDER
Iterator<MyToken> tokens = lexer.lex("abc");
MyToken a = tokens.next(); // == new MyToken("a", A)
MyToken b = tokens.next(); // == new MyToken("b", B)
MyToken c = tokens.next(); // == new MyToken("c", REMAINDER)
Common regex packages use nondeterministic automatons (NFA) to capture the regular expression. Nondeterministic automatons are based on backtracking to match a string. This has several advantages (e.g. group capturing, greedy/lazy/possesive matching, lookahead/lookbehind, backreferences). The disadvantage is, that such implementations do not perform well - especially when the regular expression contains branches (e.g. 'a|b') or captures an infinite number of chars (e.g. 'ab'). The match time is dependent on the nodes in the automaton and the chars to match (O(m^2n), where m = number of automaton nodes, n = number of chars to match)
In many cases regular expressions do not need to provide the upper features, instead they should perform well. Rexlex compiles a deterministic automaton (DFA) from a given regular expression. The match time of such an automaton is linear dependent on the number of chars to match (O(n), where n = number of chars to match).
Consider that you have a number of regular expressions available at runtime and you want to build a lexer from this set. Typical lexer generators allow you to generate code. They are designed to be generated before you write/link the code using the Lexer.
Rexlex allows you to generate a lexer at runtime - no code generation, no class loading, no need to depend on code not generated yet. Creating and instantly using a new Lexer from a variable set of Lexing Idioms with rexlex is much faster and easier than using a lexer generator. This allows you to generate families of languages with differences only in the Lexing Idioms, as well as extendable languages (where you could add new lexing idioms at runtime).
Of course this comes with a price of less performance at lexing time and the language itself cannot be specified, but must be programmed. Whenever you have a nonomodifiable DSL based on nonvariable lexing idioms you should prefer a lexer generator. Otherwise rexlex lexing could be a fine alternative.
We support the regular expression syntax of regexparser.
Java regular expressions (java.util.Pattern) are quickly created and optimized. Simple regular expressions are executed quite fast.
Rexlex regular expressions need a long creation and optimization time. After this initial effort the execution time is no longer dependent on pattern complexity.
Use Java regular expressions:
- if the expression is short and simple
- if the expression is matched only a few times
- if the expression is often created (e.g. in a loop)
Use Rexlex regular expressions:
- if the expression is long or complex
- if the expression is matched many times
- if the expression is once created and often applied
A performance benchmark for regex packages can be found at https://github.com/almondtools/regexbench.
This benchmark does not only check the performance but also the correctness of the results:
- each benchmark fails if the expected number matches is not found
- DFA packages cannot compute the same groups as NFA packages - accepted difference
- Non-Posix-NFA packages (as jregex and java.util.regex) do not always detect the longest leftmost match - accepted difference
<dependency>
<groupId>com.github.almondtools</groupId>
<artifactId>rexlex</artifactId>
<version>0.2.13</version>
</dependency>