Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Boost symbol matches in BM25 #876

Merged
merged 1 commit into from
Dec 12, 2024
Merged

Boost symbol matches in BM25 #876

merged 1 commit into from
Dec 12, 2024

Conversation

jtibshirani
Copy link
Member

@jtibshirani jtibshirani commented Dec 11, 2024

When digging into our Natural Language Search (NLS) eval results, I found that one of the leading causes for flexible search types like "Fuzzy symbol search" and "Find logic" was noisy matches in top results. Currently, our BM25 ranking rewards any substring match equally. So for queries like 'extract tar', any match on 'tar' (even within unrelated terms like 'start', etc.) counts towards the term frequency.

This PR helps reduce noise by boosting symbol matches the same as we do filename matches. Our NLS evals show positive improvement, and context evals are the tiniest bit better.

Note: I also tried rewarding matches on word boundaries, taking inspiration from candidateMatchScore. I did not see any improvement in results by "stacking" this on top of this symbol boost. It also felt like I was really venturing into "overfitting" territory, as it requires a new tunable parameter.

Closes SPLF-758

@cla-bot cla-bot bot added the cla-signed label Dec 11, 2024
termFreqs := map[string]int{}
for _, m := range cands {
term := string(m.substrLowered)
if m.fileName || p.matchesSymbol(m) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally, this is still on the right side of the "black magic" line :) I didn't tune any parameters, just threw in this check and it works well across two eval datasets.

"testing"
)

func TestCalculateTermFrequency(t *testing.T) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now calculateTermFrequency requires access to indexData, so it's hard to unit test. I checked the e2e test scoring_test.go carefully to confirm the new calculation is correct.

@@ -588,6 +588,22 @@ func findMaxOverlappingSection(secs []DocumentSection, off, sz uint32) (uint32,
return uint32(j), ol1 > 0
}

func (p *contentProvider) matchesSymbol(cm *candidateMatch) bool {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are duplicating some checks, since we run both calculateTermFrequency for the overall file score, plus candidateMatchScore for the individual chunk scores. It would be good to unify these, but I didn't want to embark on a big refactor in this PR.

@jtibshirani
Copy link
Member Author

NLS eval results (before vs. after)
Results improve across all question types and metrics.

Screenshot 2024-12-11 at 9 38 28 AM Screenshot 2024-12-11 at 9 38 20 AM

@jtibshirani jtibshirani requested review from camdencheek and a team December 11, 2024 18:13
Copy link
Member

@camdencheek camdencheek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spent some time trying to understand this, and I think I've got a decent handle. Seems reasonable to me, but I'm not super well versed in this part of the codebase

@jtibshirani jtibshirani merged commit c03b77f into main Dec 12, 2024
11 checks passed
@jtibshirani jtibshirani deleted the jtibs/bm25 branch December 12, 2024 17:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants