Boost symbol matches in BM25 #876
Conversation
termFreqs := map[string]int{}
for _, m := range cands {
	term := string(m.substrLowered)
	if m.fileName || p.matchesSymbol(m) {
Personally, this is still on the right side of the "black magic" line :) I didn't tune any parameters, just threw in this check and it works well across two eval datasets.
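The boosted counting in the diff above can be sketched roughly as follows. Note this is a simplified illustration: the `candidate` struct and the 5x boost weight are assumptions made for this sketch, not the exact types or constants in zoekt.

```go
package main

import (
	"fmt"
	"strings"
)

// candidate is a simplified stand-in for zoekt's candidateMatch;
// the fields here are hypothetical and chosen for illustration.
type candidate struct {
	term     string
	fileName bool
	symbol   bool
}

// termFrequencies mirrors the boosted counting from the diff above:
// filename and symbol matches contribute more weight than a plain
// substring match. The 5x factor is an assumed value for this sketch,
// not necessarily the weight used in zoekt.
func termFrequencies(cands []candidate) map[string]int {
	tf := map[string]int{}
	for _, c := range cands {
		term := strings.ToLower(c.term)
		if c.fileName || c.symbol {
			tf[term] += 5
		} else {
			tf[term]++
		}
	}
	return tf
}

func main() {
	cands := []candidate{
		{term: "Tar", symbol: true}, // symbol match: boosted
		{term: "tar"},               // plain content match: counts once
	}
	fmt.Println(termFrequencies(cands)["tar"]) // 5 + 1 = 6
}
```

The appeal of reusing the filename boost is that it introduces no new tunable parameter, which matches the "no tuning" comment above.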
	"testing"
)

func TestCalculateTermFrequency(t *testing.T) {
Now calculateTermFrequency requires access to indexData, so it's hard to unit test. I checked the e2e test scoring_test.go carefully to confirm the new calculation is correct.
@@ -588,6 +588,22 @@ func findMaxOverlappingSection(secs []DocumentSection, off, sz uint32) (uint32, bool) {
	return uint32(j), ol1 > 0
}

func (p *contentProvider) matchesSymbol(cm *candidateMatch) bool {
We are duplicating some checks, since we run both calculateTermFrequency for the overall file score, plus candidateMatchScore for the individual chunk scores. It would be good to unify these, but I didn't want to embark on a big refactor in this PR.
Spent some time trying to understand this, and I think I've got a decent handle on it. Seems reasonable to me, but I'm not super well versed in this part of the codebase.
When digging into our Natural Language Search (NLS) eval results, I found that one of the leading causes of poor results for flexible search types like "Fuzzy symbol search" and "Find logic" was noisy matches in the top results. Currently, our BM25 ranking rewards every substring match equally. So for a query like 'extract tar', any match on 'tar' (even within unrelated terms like 'start') counts toward the term frequency.
This PR helps reduce that noise by boosting symbol matches the same way we boost filename matches. Our NLS evals show a positive improvement, and context evals are the tiniest bit better.
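The 'tar'-in-'start' problem described above can be seen with a plain substring count. This is an illustrative toy, not zoekt's actual matcher (which operates over its index rather than raw strings):

```go
package main

import (
	"fmt"
	"strings"
)

// countSubstring counts every occurrence of term in content, the way a
// purely substring-based term frequency would: every hit counts equally,
// whether or not it falls inside an unrelated word.
func countSubstring(content, term string) int {
	return strings.Count(strings.ToLower(content), strings.ToLower(term))
}

func main() {
	content := "func startServer() { // restart target tarball"
	// "tar" is found inside "startServer", "restart", "target", and "tarball",
	// even though only the last two have anything to do with tar archives.
	fmt.Println(countSubstring(content, "tar")) // 4
}
```

Under the PR's approach, only a match that lands on a symbol (or filename) would get the extra weight, so incidental hits inside identifiers like startServer no longer dominate the term frequency.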
Note: I also tried rewarding matches on word boundaries, taking inspiration from candidateMatchScore. I did not see any improvement in results from "stacking" this on top of the symbol boost. It also felt like I was really venturing into "overfitting" territory, as it requires a new tunable parameter.

Closes SPLF-758
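For reference, the word-boundary idea mentioned in the note above could look something like the following. This is a hypothetical, ASCII-only sketch of such a check, not the implementation in candidateMatchScore:

```go
package main

import (
	"fmt"
	"unicode"
)

// onWordBoundary reports whether content[off:off+length] both starts and
// ends at a word boundary. ASCII-only simplification for illustration;
// a real implementation would decode runes rather than index bytes.
func onWordBoundary(content string, off, length int) bool {
	isWord := func(b byte) bool {
		r := rune(b)
		return unicode.IsLetter(r) || unicode.IsDigit(r) || r == '_'
	}
	startOK := off == 0 || !isWord(content[off-1])
	end := off + length
	endOK := end == len(content) || !isWord(content[end])
	return startOK && endOK
}

func main() {
	content := "restart the tar extractor"
	fmt.Println(onWordBoundary(content, 3, 3))  // "tar" inside "restart": false
	fmt.Println(onWordBoundary(content, 12, 3)) // standalone "tar": true
}
```

As the note says, rewarding this on top of the symbol boost adds a tunable weight, which is exactly the kind of parameter the symbol boost avoids.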