-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement ORDER BY BM25 #1434
base: main
Are you sure you want to change the base?
Implement ORDER BY BM25 #1434
Conversation
3a50985
to
3848e94
Compare
(force push is identical code that CI ran previously, just cleaned up the history) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! I'm really happy with how much this cleaned up the index based ordering logic in several classes.
Left a handful of minor comments/questions.
src/java/org/apache/cassandra/cql3/restrictions/StatementRestrictions.java
Show resolved
Hide resolved
test/unit/org/apache/cassandra/index/sai/StorageAttachedIndexTest.java
Outdated
Show resolved
Hide resolved
src/java/org/apache/cassandra/index/sai/plan/StorageAttachedIndexSearcher.java
Outdated
Show resolved
Hide resolved
src/java/org/apache/cassandra/cql3/statements/SelectStatement.java
Outdated
Show resolved
Hide resolved
03036e4
to
a5060e9
Compare
…onsible for ANN index queries. Other global orderings will be represented by a SingleColumnComparator with clustered=true instead.
a5060e9
to
5776439
Compare
… of recomputing scores on the coordinator
5776439
to
e0ea872
Compare
162e02e
to
56a6e0f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! I left several minor comments and a few larger questions.
src/java/org/apache/cassandra/index/sai/disk/v1/postings/IntersectingPostingList.java
Outdated
Show resolved
Hide resolved
src/java/org/apache/cassandra/index/sai/disk/v1/postings/IntersectingPostingList.java
Outdated
Show resolved
Hide resolved
src/java/org/apache/cassandra/index/sai/disk/v1/postings/IntersectingPostingList.java
Outdated
Show resolved
Hide resolved
src/java/org/apache/cassandra/index/sai/plan/TopKProcessor.java
Outdated
Show resolved
Hide resolved
src/java/org/apache/cassandra/index/sai/plan/StorageAttachedIndexSearcher.java
Outdated
Show resolved
Hide resolved
var documentFrequencies = postingLists.entrySet().stream().collect(Collectors.toMap(Map.Entry::getKey, e -> (long) e.getValue().size())); | ||
|
||
try (var pkm = primaryKeyMapFactory.newPerSSTablePrimaryKeyMap(); | ||
var merged = IntersectingPostingList.intersect(List.copyOf(postingLists.values()))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the motivation for using intersection here as opposed to union? I know that our :
operator does intersection, but I think that is the opposite of the typical expectation for analyzed text search. Since BM25 is its own thing, I wonder if we can break with our previous design decision and union these results. I am assuming that docs with more terms will naturally be higher in the list, anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(1) we're materializing the results into memory and there's no real way around that so union increases the risk that something explodes
(2) I'm relying on "Our experiments focused on conjunctive query evaluation, where the document must contain all query terms; previous work [1] has shown that this approach yields comparable end-to-end effectiveness to disjunctive query evaluation, but is faster." in https://dl.acm.org/doi/abs/10.1145/2600428.2609460 [the work cited is by one of the authors so I assume it's accurate]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we're materializing the results into memory and there's no real way around that
actually we can avoid this (while still doing a single pass) if we materialize average document length in the statistics
…ction in SingleColumnRelation.newEQRestriction. This eliminates the need for skipMerge and special cases in doMergeWith, and moves the issuing of warnings next to the place where the transformation occurs instead of doing it much later in RowFilterValidator (which is no longer needed)
5f5eb0e
to
3d17e2f
Compare
01d3adc
to
b19fead
Compare
2babc86
to
50c4a57
Compare
… approach than ignoring it when serialization fails later
50c4a57
to
3967e7c
Compare
… breaking the assumption in BTreeRow that complex regular/static columns sort last
3 of the CI failures are straightforwardly false positives. VectorHybridSearchTest failure looks suspicious
But I think it's also a false positive:
|
add validation and reject queries with no analyzed terms
❌ Build ds-cassandra-pr-gate/PR-1434 rejected by Butler2 new test failure(s) in 16 builds Found 2 new test failures
Found 71 known test failures |
…, which does not have bounded error, // cap frequencies to total rows so that the IDF term doesn't turn negative
Quality Gate passedIssues Measures |
What is the issue
https://github.com/riptano/cndb/issues/11725
What does this PR fix and why was it fixed
...
Checklist before you submit for review
NoSpamLogger
for log lines that may appear frequently in the logs