-
Notifications
You must be signed in to change notification settings - Fork 160
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
use a open addressing map to store folding paths #36
base: master
Are you sure you want to change the base?
Conversation
java/com/google/re2j/Inst.java
Outdated
int[] folds = Unicode.optimizedFoldOrbit(r0); | ||
|
||
if (folds == null) { | ||
return (r == Unicode.optimizedFoldNonOrbit(r0)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you benchmarked this against an implementation based on Character.toLowerCase(r) == r? Ideally with a few changes like that, we could throw away all the Unicode tables and use the ones built into java.lang.
Be sure to use the int (code point) not char (UTF-16 code) variants of those functions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess you are talking about the optimizedFoldNonOrbit? no I have not, was trying to be the least impactful as possible :) will have a look. From what I understand the Orbit exception will still need to be there ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the orbit case will still need special handling. I suggest you do this experiment as a separate change.
java/com/google/re2j/Unicode.java
Outdated
static int[] optimizedFoldOrbit(int r) { | ||
if (r >= ORBIT_MIN_VALUE) { | ||
return FOLD_MAP.get(r); | ||
} else return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use {...} around every block, or outdent the else block:
if (r < ORBIT_MIN_VALUE) {
return null;
}
return FOLD_MAP.get(r);
java/com/google/re2j/Unicode.java
Outdated
private static final int ORBIT_MIN_VALUE = UnicodeTables.CASE_ORBIT[0][0]; | ||
|
||
static { | ||
for(int i = 0; i < UnicodeTables.CASE_ORBIT.length; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for (int[][] orbit : UnicodeTables.CASE_ORBIT)
java/com/google/re2j/Unicode.java
Outdated
int[] folds = new int[0]; | ||
int r1 = r0; | ||
while((r1 = simpleFold(r1)) != r0) { | ||
folds = Arrays.copyOf(folds, folds.length + 1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't an orbit necessarily of length at least 3? I realize they are small and few, but it seems perverse to copy the entire array each time we add an element. Use a temporary large enough for the longest orbit (MAX_DISTANCE?) and copy it once before each call to FOLD_MAP.put.
java/com/google/re2j/Unicode.java
Outdated
} | ||
|
||
/* | ||
associate a rune to it's unfolded case equivalent |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"A FoldMap maps a rune to an array of runes that are equivalent to it in a case-insensitive pattern."
java/com/google/re2j/Unicode.java
Outdated
return toUpper(r); | ||
} | ||
|
||
private static final FoldMap FOLD_MAP = new FoldMap(UnicodeTables.CASE_ORBIT.length * 4); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whence 4? Is it a load factor based on MAX_DISTANCE?
I don't think this should be a parameter. FoldMap is instantiated exactly once; specialize it.
java/com/google/re2j/Inst.java
Outdated
@@ -13,6 +13,8 @@ | |||
*/ | |||
class Inst { | |||
|
|||
private static final int RUNES_LINER_SEARCH_SIZE = 10; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LINEAR?
Move this constant into the sole function that uses it.
java/com/google/re2j/Inst.java
Outdated
// Otherwise binary search. | ||
for (int lo = 0, hi = runes.length / 2; lo < hi; ) { | ||
// Otherwise binary search on rest of the array | ||
for (int lo = 0, hi = (runes.length - RUNES_LINER_SEARCH_SIZE) / 2; lo < hi; ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure it's worth the fuss to subtract the portion initially scanned. It's a logarithmic algorithm and this is not the common case. I'm sure it's fast enough to search the complete array. Did you find evidence to the contrary?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, will remove it
java/com/google/re2j/Inst.java
Outdated
@@ -84,12 +98,12 @@ boolean matchRune(int r) { | |||
} | |||
} | |||
|
|||
// Otherwise binary search. | |||
for (int lo = 0, hi = runes.length / 2; lo < hi; ) { | |||
// Otherwise binary search on rest of the array |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While you're here, please document:
// Invariant: lo, hi, m are even.
java/com/google/re2j/Unicode.java
Outdated
associate a rune to it's unfolded case equivalent | ||
*/ | ||
private static class FoldMap { | ||
private static final int MAX_DISTANCE = 4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Document:
// number of closed-hashing probes before giving up
java/com/google/re2j/Unicode.java
Outdated
private final int mask; | ||
private final int maxDistance; | ||
|
||
FoldMap(int size) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FoldMap serves exactly one purpose. It doesn't need a parameter, and it doesn't need to be separated from the logic that instantiates and populates it.
I did check the Character.toLower/toUpper it does not affect the results, though I change the isUpper/isLower to use the Character one and it broke some test so I rolled it back. |
This optimisation came from profiling my app with real payload, the simpleFold was taking around 15% cpu and is now negligible.
I added a few test to verify that the logic is still working as expected.