Introduce a mapping to map sparse labels to a continuous range #14494

gf2121 · 2025-04-15T07:53:26Z

In ASCII encoding, digits, lowercase letters, uppercase letters, and punctuation occupy discontinuous value ranges. This PR introduces a mapping from those discontinuous ranges to a continuous range starting at 0. This can save some storage space and makes it more likely that the BITSET strategy is used instead of the ARRAY strategy, which may also improve performance somewhat.

In luceneutil, this patch reduces the size of tip by ~1.5%, including ~5% for the id field, which is encoded with base36 (digits and lowercase letters). Performance is roughly flat.
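
To make the idea concrete, here is a minimal sketch of such a mapping. It is not the code from this PR: the class and method names are hypothetical, and it only assumes that labels are unsigned bytes and that unseen labels are marked with -1, as the `labelMap` snippets in the review suggest.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.BitSet;

// Hypothetical illustration; names are not taken from the PR.
final class LabelMapSketch {
  static final int BYTE_RANGE = 256;

  /** Returns a 256-entry table: label -> dense ordinal, or -1 if the label never occurs. */
  static int[] buildLabelMap(byte[][] terms) {
    // Collect every distinct byte value that appears in any term.
    BitSet seen = new BitSet(BYTE_RANGE);
    for (byte[] term : terms) {
      for (byte b : term) {
        seen.set(b & 0xFF); // treat bytes as unsigned labels
      }
    }
    // Assign ordinals 0, 1, 2, ... in ascending label order.
    int[] map = new int[BYTE_RANGE];
    Arrays.fill(map, -1);
    int ord = 0;
    for (int label = seen.nextSetBit(0); label >= 0; label = seen.nextSetBit(label + 1)) {
      map[label] = ord++;
    }
    return map;
  }

  public static void main(String[] args) {
    byte[][] terms = {
      "0123456789abcdefghijklmnopqrstuvwxyz".getBytes(StandardCharsets.US_ASCII)
    };
    int[] map = buildLabelMap(terms);
    // Raw base36 labels span '0' (48) .. 'z' (122), i.e. 75 values; mapped they span 0 .. 35.
    System.out.println(map['0'] + " " + map['9'] + " " + map['a'] + " " + map['z']);
    // prints "0 9 10 35"
  }
}
```

With base36 labels the raw range covers 75 values while the mapped range covers only 36, which is why a bitset over the mapped range can be smaller.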

mikemccand · 2025-04-19T20:43:28Z

Whoa, exciting! I will try to review soon! Thanks @gf2121.

github-actions · 2025-05-04T00:28:49Z

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

mikemccand

I like this opto!

This was my first pass through, so I left some probably head-scratcher / fresh-eyes / confused sort of comments :)

Thanks @gf2121 and sorry for the slow review... thank you Stale Bot (hmm we really need a better name for this incredibly important bot -- Lars?).

mikemccand · 2025-05-05T11:52:59Z

lucene/core/src/java/org/apache/lucene/codecs/lucene103/blocktree/TrieBuilder.java

    meta.writeVLong(root.fp);
    index.writeLong(0L); // additional 8 bytes for over-reading
    meta.writeVLong(index.getFilePointer());
    status = Status.SAVED;
  }

-  void saveNodes(IndexOutput index) throws IOException {
+  /**
+   * Save label dictionary and return a label map that narrow labels' value to a constant value


constant -> compact?

mikemccand · 2025-05-05T12:12:03Z

lucene/core/src/java/org/apache/lucene/codecs/lucene103/blocktree/TrieBuilder.java

+    assert label == labelsSeen.length() - 1
+        || labelsSeen.nextSetBit(label + 1) == DocIdSetIterator.NO_MORE_DOCS;
+    if (labels.length == 0 || labels[labels.length - 1] - labels[0] + 1 == labels.length) {
+      out.writeVInt(0);


Maybe add a comment?

I think what this is doing is writing a 0 byte if there were no labels, or, if the labels were already fully dense/compact to begin with? And something else (above) will be able to tell the difference?

Edit: OK I see that 0 at read-time means "don't map", i.e. use the int label directly. And this works for the "no labels at all" case because who cares what the mapping is if you will never use it (sort of like if a tree falls in the woods and nobody hears, did it make any sound?).
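
The condition in the snippet above can be read as a density check. The following is a small sketch of that check in isolation, assuming `labels` is sorted ascending and contains no duplicates; the class and method names are hypothetical, and what the PR writes in the non-dense case is not shown in this excerpt.

```java
// Hypothetical illustration of the "do we need a map at all?" decision.
final class DenseLabelCheck {

  /**
   * Returns false when there are no labels, or when the labels already form a
   * gap-free run (first, first + 1, ..., last). In both cases the raw label can
   * be used directly and the writer can emit a 0 to mean "don't map".
   */
  static boolean needsMapping(int[] sortedLabels) {
    if (sortedLabels.length == 0) {
      return false; // nothing will ever be looked up, so any mapping works
    }
    int span = sortedLabels[sortedLabels.length - 1] - sortedLabels[0] + 1;
    return span != sortedLabels.length; // a gap exists iff span exceeds the count
  }

  public static void main(String[] args) {
    System.out.println(needsMapping(new int[] {}));           // false
    System.out.println(needsMapping(new int[] {5, 6, 7}));    // false
    System.out.println(needsMapping(new int[] {48, 57, 97})); // true
  }
}
```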

mikemccand · 2025-05-05T12:16:20Z

lucene/core/src/java/org/apache/lucene/codecs/lucene103/blocktree/TrieReader.java

+      lookUpLabel = targetLabel;
+    } else {
+      lookUpLabel = labelMap[targetLabel];
+      if (lookUpLabel == -1) {


Hmm is this paranoia? Why would we write a label ord that wasn't mapped to a valid label?
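
One possible reading, sketched below under the assumption that `targetLabel` comes from the term being looked up rather than from the index: a lookup term may contain a byte that no indexed term ever used, so the map has no ordinal for it. The class and method names are hypothetical.

```java
// Hypothetical illustration of the read-time label lookup.
final class LabelLookupSketch {

  /**
   * Translates a raw label from the lookup term into the label stored in the trie.
   * A null map means "don't map"; -1 means the label occurs nowhere in the trie,
   * so the caller can report "not found" without descending further.
   */
  static int lookUpLabel(int[] labelMap, int targetLabel) {
    if (labelMap == null) {
      return targetLabel;
    }
    return labelMap[targetLabel];
  }

  public static void main(String[] args) {
    int[] labelMap = new int[256];
    java.util.Arrays.fill(labelMap, -1);
    labelMap['a'] = 0; // only 'a' and 'c' were ever indexed
    labelMap['c'] = 1;
    System.out.println(lookUpLabel(labelMap, 'c')); // 1
    System.out.println(lookUpLabel(labelMap, 'b')); // -1: 'b' was never indexed
    System.out.println(lookUpLabel(null, 'b'));     // 98: no map, raw label used
  }
}
```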

mikemccand · 2025-05-05T12:19:45Z

lucene/core/src/java/org/apache/lucene/codecs/lucene103/blocktree/TrieBuilder.java

+          @Override
+          public void push(Node node) {
+            if (labelMap != null) {
+              node.label = labelMap[node.label];


Maybe assert node.label != -1? This label must've been defined?

mikemccand · 2025-05-05T12:23:14Z

lucene/core/src/java/org/apache/lucene/codecs/lucene103/blocktree/TrieReader.java

+    if (cnt == 0) {
+      return null;
+    } else {
+      int[] labelMap = new int[TrieBuilder.BYTE_RANGE];


Could we make this byte[] instead? This added heap could add up for many shards X many indices for multitenant (OpenSearch, Elasticsearch, Solr, ...) cases, and with many separate indexed fields?
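
A rough sketch of what a `byte[]` variant could look like follows. It is not from the PR and rests on one assumption: a map is only built when the labels have at least one gap, so there are at most 255 distinct labels, ordinals stay in 0..254, and 0xFF is free to serve as the "unmapped" sentinel. All names here are hypothetical.

```java
import java.util.Arrays;

// Hypothetical illustration: 256 bytes per map instead of 1024 with int[].
final class ByteLabelMapSketch {
  static final int BYTE_RANGE = 256;
  static final byte UNMAPPED = (byte) 0xFF;

  /** sortedLabels: distinct labels in 0..255, ascending, at most 255 of them. */
  static byte[] build(int[] sortedLabels) {
    if (sortedLabels.length >= BYTE_RANGE) {
      throw new IllegalArgumentException("all 256 labels present: no map needed");
    }
    byte[] map = new byte[BYTE_RANGE];
    Arrays.fill(map, UNMAPPED);
    for (int ord = 0; ord < sortedLabels.length; ord++) {
      map[sortedLabels[ord]] = (byte) ord; // ord <= 254, so it never collides with 0xFF
    }
    return map;
  }

  /** Returns the ordinal for label, or -1 if the label is unmapped. */
  static int lookUp(byte[] map, int label) {
    int ord = map[label] & 0xFF; // Java bytes are signed, so mask before comparing
    return ord == 0xFF ? -1 : ord;
  }

  public static void main(String[] args) {
    byte[] map = build(new int[] {0, 100, 255});
    System.out.println(lookUp(map, 0) + " " + lookUp(map, 100) + " " + lookUp(map, 255));
    // prints "0 1 2"
    System.out.println(lookUp(map, 1)); // -1
  }
}
```

The mask in `lookUp` matters for ordinals of 128 and above, which would otherwise read back as negative values.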

mikemccand · 2025-05-05T12:25:23Z

lucene/core/src/test/org/apache/lucene/codecs/lucene103/blocktree/TestTrie.java

@@ -71,6 +74,46 @@ public void testOneByteTerms() throws Exception {
    }
  }

+  public void testContinuousValues() throws Exception {


Could we also explicitly test both ends (0, 255) directly? Maybe 2-3 labels starting with 0, and ending with 255? Trying to tease out any scary ob1-Kenobi bugs!

mikemccand · 2025-05-05T12:26:21Z

lucene/core/src/java/org/apache/lucene/codecs/lucene103/blocktree/TrieBuilder.java

@@ -63,7 +67,7 @@ private enum Status {
  private static class Node {

    // The utf8 digit that leads to this Node, 0 for root node
-    private final int label;
+    private int label;


Hmm why did we need to un-final? Don't we remap on write, not update each Node's label in place?

mikemccand · 2025-05-05T12:27:54Z

lucene/core/src/java/org/apache/lucene/codecs/lucene103/blocktree/TrieReader.java

@@ -74,14 +77,39 @@ IndexInput floorData(TrieReader r) throws IOException {
  final RandomAccessInput access;
  final IndexInput input;
  final Node root;
+  final int[] labelMap;


Could you add a comment explaining a bit what's happening here? Explain that it is global across the entire Trie, and it remaps the written labels to compact ordinals for more efficient storage w/ negligible impact on terms lookup performance?

gf2121 added 30 commits March 4, 2025 16:30

trie test pass and writer test pass

8a54c70

iter

2f1da21

most tests passed

e5507bb

concurrency issue

a261ea7

concurrency issue

2ee3af9

all tests passed

aa12a84

rename classes and impl stats

583de5d

No need msb

5d30fd4

tidy

1d24e48

remove vlong

5dbb77a

no fst

53dfcb8

try to reduce virtual call

21e1798

iter

35760bf

shift one

13f10ae

delta codec

cb4c420

iter

7570df1

specialized single child

af2781a

license

63b0cdd

iter

72fa0ba

iter

8387fc7

iter

82b3172

tidy

fb4de0f

add assumption check

b09ba58

clean up writer code

94c269a

iter

0584e2d

remove assumption

2ca0255

iter

142e626

iter

10e080c

iter

5651468

iter

8d8c018

gf2121 and others added 22 commits April 4, 2025 17:04

Merge remote-tracking branch 'origin/main' into label_map

6dbe0ac

supplier

f60a03d

iter

76300e1

review iter

882c082

change 101

773be41

null map for continuous value range

922c1c3

Merge branch 'trie' into label_map

2f37e3e

Merge branch 'main' into trie

25f8248

move to backward codecs

ff02aeb

review iter

5d53689

fix distribution test

cd4e6d3

confirm tests passed

82eea61

delete PostingIndexInput for lucene101

2f1e435

fix javadoc for public fields

05df771

Merge branch 'main' into trie

b8191a4

fix 10.2.0.zip

c1b0044

fix 103 -> 90

e2384a6

Merge remote-tracking branch 'my/trie' into label_map

6d9146c

iter

ab252cd

Merge remote-tracking branch 'origin/main' into label_map

c3699c3

unnecessary diff

fd25fe1

iter

96fd1a4

github-project-automation bot added this to OpenSearch Lucene & Core Performance Tracking Apr 15, 2025

github-project-automation bot moved this to Open in OpenSearch Lucene & Core Performance Tracking Apr 15, 2025

github-actions bot added the module:core/codecs label Apr 15, 2025

gf2121 requested a review from mikemccand April 15, 2025 11:04

github-actions bot added the Stale label May 4, 2025

mikemccand approved these changes May 5, 2025
