Skip to content

Introduce a mapping to map sparse labels to a continuous range #14494

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 111 commits into
base: main
Choose a base branch
from

Conversation

gf2121
Copy link
Contributor

@gf2121 gf2121 commented Apr 15, 2025

In ASCII encoding, numbers, lowercase letters, uppercase letters, and punctuation have discontinuous value ranges. This PR proposes to use a mapping to map the discontinuous value ranges to a continuous value range starting from 0. This can save some storage space and increase the probability of using the BITSET strategy instead of the ARRAY strategy, which may also improve some performance.

In luceneutil, this patch reduces ~1.5% size of tip, including ~5% for the id field, which encoded with base36 (numbers and lowercase letters). Performance is generally even.

@mikemccand
Copy link
Member

Whoa, exciting! I will try to review soon! Thanks @gf2121.

Copy link

github-actions bot commented May 4, 2025

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label May 4, 2025
Copy link
Member

@mikemccand mikemccand left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this opto!

This was my first pass through so I left some probably head scratcher / fresh eyes confusing sort of comments :)

Thnks @gf2121 and sorry for the slow review... thank you Stale Bot (hmm we really need a better name for this incredibly important bot -- Lars?).

meta.writeVLong(root.fp);
index.writeLong(0L); // additional 8 bytes for over-reading
meta.writeVLong(index.getFilePointer());
status = Status.SAVED;
}

void saveNodes(IndexOutput index) throws IOException {
/**
* Save label dictionary and return a label map that narrow labels' value to a constant value
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

constant -> compact?

assert label == labelsSeen.length() - 1
|| labelsSeen.nextSetBit(label + 1) == DocIdSetIterator.NO_MORE_DOCS;
if (labels.length == 0 || labels[labels.length - 1] - labels[0] + 1 == labels.length) {
out.writeVInt(0);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add a comment?

I think what this is doing is writing a 0 byte if there were no labels, or, if the labels were already fully dense/compact to begin with? And something else (above) will be able to tell the difference?

Edit: OK I see that 0 at read-time means "don't map", i.e. use the int label directly. And this works for the "no labels at all" case because who cares what the mapping is if you will never use it (sort of like if a tree falls in the woods and nobody hears, did it make any sound?).

lookUpLabel = targetLabel;
} else {
lookUpLabel = labelMap[targetLabel];
if (lookUpLabel == -1) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm is this paranoia? Why would we write a label ord that wasn't mapped to a valid label?

@Override
public void push(Node node) {
if (labelMap != null) {
node.label = labelMap[node.label];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe assert node.label != -1? This label must've been defined?

if (cnt == 0) {
return null;
} else {
int[] labelMap = new int[TrieBuilder.BYTE_RANGE];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we make this byte[] instad? This added heap could add up for many shards X many indices for multitenant (OpenSearch, Elastcisearch, Solr, ...) cases, and with many separate indexed fields?

@@ -71,6 +74,46 @@ public void testOneByteTerms() throws Exception {
}
}

public void testContinuousValues() throws Exception {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we also explicitly test both ends (0, 255) directly? Maybe 2-3 labels starting with 0, and ending with 255? Trying to tease out any scary ob1-Kenobi bugs!

@@ -63,7 +67,7 @@ private enum Status {
private static class Node {

// The utf8 digit that leads to this Node, 0 for root node
private final int label;
private int label;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm why did we need to un-final? Don't we remap on write, not update each Node's label in place?

@@ -74,14 +77,39 @@ IndexInput floorData(TrieReader r) throws IOException {
final RandomAccessInput access;
final IndexInput input;
final Node root;
final int[] labelMap;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a comment explaining a bit what's happening here? Explain that it is global across the entire Trie, and it remaps the written labels to compact ordinals for more efficient storage w/ negligible impact on terms lookup performance?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants