Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement ORDER BY BM25 #1434

Open
wants to merge 31 commits into
base: main
Choose a base branch
from
Open

Implement ORDER BY BM25 #1434

wants to merge 31 commits into from

Conversation

jbellis
Copy link

@jbellis jbellis commented Nov 20, 2024

What is the issue

https://github.com/riptano/cndb/issues/11725

What does this PR fix and why was it fixed

...

Checklist before you submit for review

  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit staring with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits

@jbellis
Copy link
Author

jbellis commented Nov 22, 2024

(force push is identical code that CI ran previously, just cleaned up the history)

Copy link
Member

@michaeljmarshall michaeljmarshall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! I'm really happy with how much this cleaned up the index based ordering logic in several classes.

Left a handful of minor comments/questions.

…onsible for ANN index queries. Other global orderings will be represented by a SingleColumnComparator with clustered=true instead.
@jbellis jbellis changed the title Send similarity score from writer to coordinator for faster sorting Implement ORDER BY BM25 Dec 6, 2024
Copy link
Member

@michaeljmarshall michaeljmarshall left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! I left several minor comments and a few larger questions.

src/java/org/apache/cassandra/cql3/Operator.java Outdated Show resolved Hide resolved
src/java/org/apache/cassandra/index/sai/plan/Orderer.java Outdated Show resolved Hide resolved
var documentFrequencies = postingLists.entrySet().stream().collect(Collectors.toMap(Map.Entry::getKey, e -> (long) e.getValue().size()));

try (var pkm = primaryKeyMapFactory.newPerSSTablePrimaryKeyMap();
var merged = IntersectingPostingList.intersect(List.copyOf(postingLists.values())))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the motivation for using intersection here as opposed to union? I know that our : operator does intersection, but I think that is the opposite of the typical expectation for analyzed text search. Since BM25 is its own thing, I wonder if we can break with our previous design decision and union these results. I am assuming that docs with more terms will naturally be higher in the list, anyway.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(1) we're materializing the results into memory and there's no real way around that so union increases the risk that something explodes

(2) I'm relying on "Our experiments focused on conjunctive query evaluation, where the document must contain all query terms; previous work [1] has shown that this approach yields comparable end-to-end effectiveness to disjunctive query evaluation, but is faster." in https://dl.acm.org/doi/abs/10.1145/2600428.2609460 [the work cited is by one of the authors so I assume it's accurate]

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we're materializing the results into memory and there's no real way around that

actually we can avoid this (while still doing a single pass) if we materialize average document length in the statistics

…ction in SingleColumnRelation.newEQRestriction. This eliminates the need for skipMerge and special cases in doMergeWith, and moves the issuing of warnings next to the place where the transformation occurs instead of doing it much later in RowFilterValidator (which is no longer needed)
@jbellis jbellis force-pushed the scoreordering-6 branch 2 times, most recently from 2babc86 to 50c4a57 Compare December 13, 2024 14:36
… approach than ignoring it when serialization fails later
@jbellis
Copy link
Author

jbellis commented Dec 16, 2024

3 of the CI failures are straightforwardly false positives.

VectorHybridSearchTest failure looks suspicious

junit.framework.AssertionFailedError: Resource leaks were detected during this test. Add -Dcassandra.debugrefcount=true to analyze the leaks expected:<0> but was:<1>
	at org.apache.cassandra.index.sai.utils.ResourceLeakDetector.afterIfSuccessful(ResourceLeakDetector.java:79)
	at org.apache.cassandra.index.sai.utils.ResourceLeakDetector$1.afterIfSuccessful(ResourceLeakDetector.java:65)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:37)

But I think it's also a false positive:

  1. Does not reproduce locally
  2. Between last successful run and failed run is exactly one commit (3ad8ae2) which only changes MultipleColumnIndexTest.java and NativeIndexDDLTest.java

@cassci-bot
Copy link

❌ Build ds-cassandra-pr-gate/PR-1434 rejected by Butler


2 new test failure(s) in 16 builds
See build details here


Found 2 new test failures

Test Explanation Branch history Upstream history
...a,dataset=int,wide=true,scenario=SSTABLE_QUERY] regression 🔴🔵🔵🔵🔵🔵🔵 🔵🔵🔵🔵🔵🔵🔵
...map<time,time>,wide=false,scenario=MIXED_QUERY] regression 🔴🔵🔵🔵🔵🔵🔵 🔵🔵🔵🔵🔵🔵🔵

Found 71 known test failures

…, which does not have bounded error,

        // cap frequencies to total rows so that the IDF term doesn't turn negative
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants