Implement ORDER BY BM25 #1434

jbellis · 2024-11-20T20:43:18Z

What is the issue

https://github.com/riptano/cndb/issues/11725

What does this PR fix and why was it fixed

...

Checklist before you submit for review

Make sure there is a PR in the CNDB project updating the Converged Cassandra version
Use NoSpamLogger for log lines that may appear frequently in the logs
Verify test results on Butler
Test coverage for new/modified code is > 80%
Proper code formatting
Proper title for each commit starting with the project-issue number, like CNDB-1234
Each commit has a meaningful description
Each commit is not very long and contains related changes
Renames, moves and reformatting are in distinct commits

jbellis · 2024-11-22T19:34:33Z

(force push is identical code that CI ran previously, just cleaned up the history)

michaeljmarshall

Looks great! I'm really happy with how much this cleaned up the index based ordering logic in several classes.

Left a handful of minor comments/questions.

src/java/org/apache/cassandra/cql3/restrictions/StatementRestrictions.java

test/unit/org/apache/cassandra/index/sai/StorageAttachedIndexTest.java

src/java/org/apache/cassandra/index/sai/QueryContext.java

src/java/org/apache/cassandra/schema/ColumnMetadata.java

src/java/org/apache/cassandra/index/sai/plan/StorageAttachedIndexSearcher.java

src/java/org/apache/cassandra/db/filter/ColumnFilter.java

src/java/org/apache/cassandra/db/Columns.java

src/java/org/apache/cassandra/cql3/statements/SelectStatement.java

michaeljmarshall

Nice work! I left several minor comments and a few larger questions.

src/java/org/apache/cassandra/cql3/GeoDistanceRelation.java

src/java/org/apache/cassandra/cql3/Operator.java

src/java/org/apache/cassandra/index/sai/disk/v1/postings/IntersectingPostingList.java

src/java/org/apache/cassandra/index/sai/utils/BM25Utils.java

src/java/org/apache/cassandra/index/sai/plan/Orderer.java

src/java/org/apache/cassandra/index/sai/plan/TopKProcessor.java

src/java/org/apache/cassandra/index/sai/plan/StorageAttachedIndexSearcher.java

michaeljmarshall · 2024-12-09T21:25:10Z

src/java/org/apache/cassandra/index/sai/disk/v1/InvertedIndexSearcher.java

+        var documentFrequencies = postingLists.entrySet().stream().collect(Collectors.toMap(Map.Entry::getKey, e -> (long) e.getValue().size()));
+
+        try (var pkm = primaryKeyMapFactory.newPerSSTablePrimaryKeyMap();
+             var merged = IntersectingPostingList.intersect(List.copyOf(postingLists.values())))


What is the motivation for using intersection here as opposed to union? I know that our : operator does intersection, but I think that is the opposite of the typical expectation for analyzed text search. Since BM25 is its own thing, I wonder if we can break with our previous design decision and union these results. I am assuming that docs with more terms will naturally be higher in the list, anyway.

jbellis · 2024-12-11

(1) we're materializing the results into memory and there's no real way around that, so union increases the risk that something explodes

(2) I'm relying on "Our experiments focused on conjunctive query evaluation, where the document must contain all query terms; previous work [1] has shown that this approach yields comparable end-to-end effectiveness to disjunctive query evaluation, but is faster." in https://dl.acm.org/doi/abs/10.1145/2600428.2609460 [the work cited is by one of the authors so I assume it's accurate]

jbellis · 2024-12-11

"we're materializing the results into memory and there's no real way around that"

Actually, we can avoid this (while still doing a single pass) if we materialize average document length in the statistics.
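
To make the point about average document length concrete, here is a minimal sketch of textbook BM25 scoring. It is not the code from this PR: the class name `Bm25Sketch`, the method signatures, the `K1`/`B` defaults, and the "+1 inside the log" IDF form are all assumptions chosen for illustration. The thing to notice is that `avgDocLength` appears in every document's score, so a scorer that has to derive it from the matching rows must read all of them before it can score any of them, whereas a value stored in the index statistics is available up front.

```java
import java.util.List;
import java.util.Map;

public class Bm25Sketch {
    // Common BM25 defaults; the values used by the PR are not stated in this thread.
    static final double K1 = 1.2;
    static final double B = 0.75;

    /** Assumed IDF form (+1 inside the log); non-negative while docFreq <= totalDocs. */
    static double idf(long totalDocs, long docFreq) {
        return Math.log(1.0 + (totalDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    /**
     * Scores one document.
     * termFreqs: occurrences of each query term in this document.
     * docFreqs:  number of documents containing each query term.
     */
    static double score(List<String> queryTerms, Map<String, Integer> termFreqs, int docLength,
                        double avgDocLength, Map<String, Long> docFreqs, long totalDocs) {
        // Length normalization: this is the only place avgDocLength is needed.
        double norm = K1 * (1.0 - B + B * docLength / avgDocLength);
        double score = 0.0;
        for (String term : queryTerms) {
            int tf = termFreqs.getOrDefault(term, 0);
            score += idf(totalDocs, docFreqs.get(term)) * tf * (K1 + 1.0) / (tf + norm);
        }
        return score;
    }

    public static void main(String[] args) {
        List<String> query = List.of("cassandra", "index");
        Map<String, Long> docFreqs = Map.of("cassandra", 10L, "index", 40L);
        Map<String, Integer> termFreqs = Map.of("cassandra", 2, "index", 1);

        // Same term counts, different lengths: the shorter document ranks higher.
        System.out.println(score(query, termFreqs, 50, 100.0, docFreqs, 1000));
        System.out.println(score(query, termFreqs, 200, 100.0, docFreqs, 1000));
    }
}
```

With intersection, every candidate contains every query term, so `tf` is never zero in the loop; with union, the `getOrDefault` fallback would contribute zero for missing terms.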

…ction in SingleColumnRelation.newEQRestriction. This eliminates the need for skipMerge and special cases in doMergeWith, and moves the issuing of warnings next to the place where the transformation occurs instead of doing it much later in RowFilterValidator (which is no longer needed)

jbellis · 2024-12-16T16:39:08Z

3 of the CI failures are straightforwardly false positives.

VectorHybridSearchTest failure looks suspicious

junit.framework.AssertionFailedError: Resource leaks were detected during this test. Add -Dcassandra.debugrefcount=true to analyze the leaks expected:<0> but was:<1>
	at org.apache.cassandra.index.sai.utils.ResourceLeakDetector.afterIfSuccessful(ResourceLeakDetector.java:79)
	at org.apache.cassandra.index.sai.utils.ResourceLeakDetector$1.afterIfSuccessful(ResourceLeakDetector.java:65)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:37)

But I think it's also a false positive:

Does not reproduce locally
Between last successful run and failed run is exactly one commit (3ad8ae2) which only changes MultipleColumnIndexTest.java and NativeIndexDDLTest.java

cassci-bot · 2024-12-16T19:25:06Z

❌ Build ds-cassandra-pr-gate/PR-1434 rejected by Butler

2 new test failure(s) in 16 builds
See build details here

Found 2 new test failures

Test	Explanation	Branch history	Upstream history
...a,dataset=int,wide=true,scenario=SSTABLE_QUERY]	regression	🔴🔵🔵🔵🔵🔵🔵	🔵🔵🔵🔵🔵🔵🔵
...map<time,time>,wide=false,scenario=MIXED_QUERY]	regression	🔴🔵🔵🔵🔵🔵🔵	🔵🔵🔵🔵🔵🔵🔵

Found 71 known test failures

sonarqubecloud · 2024-12-17T21:00:17Z

Quality Gate passed

Issues
9 New issues
0 Accepted issues

Measures
0 Security Hotspots
92.5% Coverage on New Code
0.7% Duplication on New Code

See analysis details on SonarQube Cloud

jbellis force-pushed the scoreordering-6 branch from 3a50985 to 3848e94 Compare November 22, 2024 19:34

michaeljmarshall approved these changes Nov 22, 2024

jbellis force-pushed the scoreordering-6 branch from 03036e4 to a5060e9 Compare November 25, 2024 15:39

jbellis added 2 commits November 25, 2024 10:25

remove unnecessary generification of IndexColumnComparator

ddd9b69

Simplify the ordering logic by making IndexColumnComparator only resp…

9f1b794

…onsible for ANN index queries. Other global orderings will be represented by a SingleColumnComparator with clustered=true instead.

jbellis force-pushed the scoreordering-6 branch from a5060e9 to 5776439 Compare November 25, 2024 16:25

jbellis added 2 commits November 26, 2024 09:21

CNDB-11725 use +score pseudo-column to order ANN results with instead…

9986108

… of recomputing scores on the coordinator

CNDB-11725 add SYNTHETIC ColumnMetadata.Kind to represent the score c…

e0ea872

…olumn

jbellis force-pushed the scoreordering-6 branch from 5776439 to e0ea872 Compare November 26, 2024 15:32

merge with main

478ca65

jbellis changed the title ~~Send similarity score from writer to coordinator for faster sorting~~ Implement ORDER BY BM25 Dec 6, 2024

jbellis added 2 commits December 6, 2024 16:30

implement BM25

237cec4

re-disallow DESC with ORDER BY ANN

56a6e0f

jbellis force-pushed the scoreordering-6 branch from 162e02e to 56a6e0f Compare December 6, 2024 22:30

cleanup and comments

30b6545

michaeljmarshall reviewed Dec 9, 2024

jbellis added 5 commits December 11, 2024 09:00

address review notes

6c9a0e6

remove unused limit parameter from IndexSearcher::search

e107fcc

eliminate currentRowIds

cfe204a

add testMatchingAllowed and make it work via shouldMerge

3d17e2f

jbellis force-pushed the scoreordering-6 branch from 5f5eb0e to 3d17e2f Compare December 11, 2024 21:36

jbellis added 2 commits December 12, 2024 08:31

disambiguate the BM25 error message when the index isn't analyzed

907a2ee

validateOptions treats analyzed and un-analyzed indexes as distinct, …

3315a12

…testTwoIndexes passes

jbellis force-pushed the scoreordering-6 branch from 01d3adc to b19fead Compare December 12, 2024 19:38

detect and reject ambiguous equality predicates; testAmbiguousPredica…

c0de416

…tes passes

jbellis force-pushed the scoreordering-6 branch 2 times, most recently from 2babc86 to 50c4a57 Compare December 13, 2024 14:36

don't inject +score unless coordinator requests it; this is a cleaner…

3967e7c

… approach than ignoring it when serialization fails later

jbellis force-pushed the scoreordering-6 branch from 50c4a57 to 3967e7c Compare December 13, 2024 14:44

jbellis added 8 commits December 13, 2024 14:36

fix getEqBahavior, this is most of the test failures

893b87b

LongBM25Test

c1eaa63

misc bugfixes related to zero matches for a term

ddbfd16

ramIndexer deduplicates (term, row) pairs

7ff2374

need to use compareUnsigned once we have more than 4 KINDs

cfa1157

simplify

1620d3e

make SYNTHETIC the first Column Kind instead of the last. This avoids…

73f35df

… breaking the assumption in BTreeRow that complex regular/static columns sort last

fix tests

3ad8ae2
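
The "need to use compareUnsigned once we have more than 4 KINDs" commit above is easier to follow with a small example. The thread does not show the comparison code, so the layout below is an assumption: a hypothetical 64-bit sort key whose top three bits hold the column kind's ordinal. Under that assumption, ordinals 0 to 3 leave the sign bit clear, but ordinal 4 sets it, and a signed comparison then puts the fifth kind before all the others.

```java
public class KindOrderSketch {
    // Hypothetical layout: the kind's ordinal occupies the top three bits of the key.
    static long sortKey(int kindOrdinal, long rest) {
        return ((long) kindOrdinal << 61) | rest;
    }

    public static void main(String[] args) {
        long kind3 = sortKey(3, 0); // top bits 011: sign bit clear
        long kind4 = sortKey(4, 0); // top bits 100: sign bit set, so the long is negative

        // Signed comparison sorts kind4 before kind3, which is the wrong order.
        System.out.println(Long.compare(kind3, kind4) < 0);         // prints false
        // Unsigned comparison treats the sign bit as an ordinary high bit.
        System.out.println(Long.compareUnsigned(kind3, kind4) < 0); // prints true
    }
}
```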

jbellis added 2 commits December 16, 2024 12:35

DRY refactor

dbbc678

add tests for unknown query terms, duplicate query terms, no query terms

1b57d55

add validation and reject queries with no analyzed terms

jbellis added 3 commits December 17, 2024 07:56

// since doc frequencies can be an estimate from the index histogram…

0b5ce5c

…, which does not have bounded error, // cap frequencies to total rows so that the IDF term doesn't turn negative

parameterize version to test with/without histograms

f3f7a15

merge with main

3416389
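
The frequency-capping commit in this group can be illustrated with a short worked example. The IDF formula below is an assumption (the common "+1 inside the log" variant), and `IdfCapSketch`, `idf`, and `cappedIdf` are hypothetical names rather than identifiers from the PR. It shows why an overestimated document frequency matters: once the estimate exceeds the total row count far enough, the argument of the logarithm drops below 1 and the IDF goes negative, while clamping the estimate to the row count keeps it positive.

```java
public class IdfCapSketch {
    /** Assumed IDF form: ln(1 + (N - df + 0.5) / (df + 0.5)). */
    static double idf(long totalRows, long docFreq) {
        return Math.log(1.0 + (totalRows - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** Clamps an estimated document frequency to the total row count before computing IDF. */
    static double cappedIdf(long totalRows, long estimatedDocFreq) {
        return idf(totalRows, Math.min(estimatedDocFreq, totalRows));
    }

    public static void main(String[] args) {
        long totalRows = 100;
        long estimate = 150; // an estimate that overshoots the real row count

        System.out.println(idf(totalRows, estimate) < 0);       // prints true: uncapped IDF is negative
        System.out.println(cappedIdf(totalRows, estimate) > 0); // prints true: capped IDF stays positive
    }
}
```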

jbellis requested a review from jacek-lewandowski December 17, 2024 16:00

actually parameterize both versions

d83e18d
