-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Introduce a mapping to map sparse labels to a continuous range #14494
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Whoa, exciting! I will try to review soon! Thanks @gf2121. |
This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this opto!
This was my first pass through so I left some probably head scratcher / fresh eyes confusing sort of comments :)
Thnks @gf2121 and sorry for the slow review... thank you Stale Bot (hmm we really need a better name for this incredibly important bot -- Lars?).
meta.writeVLong(root.fp); | ||
index.writeLong(0L); // additional 8 bytes for over-reading | ||
meta.writeVLong(index.getFilePointer()); | ||
status = Status.SAVED; | ||
} | ||
|
||
void saveNodes(IndexOutput index) throws IOException { | ||
/** | ||
* Save label dictionary and return a label map that narrow labels' value to a constant value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
constant
-> compact
?
assert label == labelsSeen.length() - 1 | ||
|| labelsSeen.nextSetBit(label + 1) == DocIdSetIterator.NO_MORE_DOCS; | ||
if (labels.length == 0 || labels[labels.length - 1] - labels[0] + 1 == labels.length) { | ||
out.writeVInt(0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe add a comment?
I think what this is doing is writing a 0
byte if there were no labels, or, if the labels were already fully dense/compact to begin with? And something else (above) will be able to tell the difference?
Edit: OK I see that 0
at read-time means "don't map", i.e. use the int label directly. And this works for the "no labels at all" case because who cares what the mapping is if you will never use it (sort of like if a tree falls in the woods and nobody hears, did it make any sound?).
lookUpLabel = targetLabel; | ||
} else { | ||
lookUpLabel = labelMap[targetLabel]; | ||
if (lookUpLabel == -1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm is this paranoia? Why would we write a label ord that wasn't mapped to a valid label?
@Override | ||
public void push(Node node) { | ||
if (labelMap != null) { | ||
node.label = labelMap[node.label]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe assert node.label != -1
? This label must've been defined?
if (cnt == 0) { | ||
return null; | ||
} else { | ||
int[] labelMap = new int[TrieBuilder.BYTE_RANGE]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we make this byte[]
instad? This added heap could add up for many shards X many indices for multitenant (OpenSearch, Elastcisearch, Solr, ...) cases, and with many separate indexed fields?
@@ -71,6 +74,46 @@ public void testOneByteTerms() throws Exception { | |||
} | |||
} | |||
|
|||
public void testContinuousValues() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we also explicitly test both ends (0, 255) directly? Maybe 2-3 labels starting with 0, and ending with 255? Trying to tease out any scary ob1-Kenobi bugs!
@@ -63,7 +67,7 @@ private enum Status { | |||
private static class Node { | |||
|
|||
// The utf8 digit that leads to this Node, 0 for root node | |||
private final int label; | |||
private int label; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm why did we need to un-final? Don't we remap on write, not update each Node
's label in place?
@@ -74,14 +77,39 @@ IndexInput floorData(TrieReader r) throws IOException { | |||
final RandomAccessInput access; | |||
final IndexInput input; | |||
final Node root; | |||
final int[] labelMap; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a comment explaining a bit what's happening here? Explain that it is global across the entire Trie, and it remaps the written labels to compact ordinals for more efficient storage w/ negligible impact on terms lookup performance?
In ASCII encoding, numbers, lowercase letters, uppercase letters, and punctuation have discontinuous value ranges. This PR proposes to use a mapping to map the discontinuous value ranges to a continuous value range starting from 0. This can save some storage space and increase the probability of using the
BITSET
strategy instead of theARRAY
strategy, which may also improve some performance.In luceneutil, this patch reduces ~1.5% size of tip, including ~5% for the
id
field, which encoded with base36 (numbers and lowercase letters). Performance is generally even.