Boost symbol matches in BM25 #876
Conversation
termFreqs := map[string]int{}
for _, m := range cands {
	term := string(m.substrLowered)
	if m.fileName || p.matchesSymbol(m) {
Personally, this is still on the right side of the "black magic" line :) I didn't tune any parameters, just threw in this check and it works well across two eval datasets.
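The boosted counting in the diff above can be sketched roughly as follows. Note this is a simplified illustration: the `candidate` struct and the 5x boost weight are assumptions made for this sketch, not the exact types or constants in zoekt.

```go
package main

import (
	"fmt"
	"strings"
)

// candidate is a simplified stand-in for zoekt's candidateMatch;
// the fields here are hypothetical and chosen for illustration.
type candidate struct {
	term     string
	fileName bool
	symbol   bool
}

// termFrequencies mirrors the boosted counting from the diff above:
// filename and symbol matches contribute more weight than a plain
// substring match. The 5x factor is an assumed value for this sketch,
// not necessarily the weight used in zoekt.
func termFrequencies(cands []candidate) map[string]int {
	tf := map[string]int{}
	for _, c := range cands {
		term := strings.ToLower(c.term)
		if c.fileName || c.symbol {
			tf[term] += 5
		} else {
			tf[term]++
		}
	}
	return tf
}

func main() {
	cands := []candidate{
		{term: "Tar", symbol: true}, // symbol match: boosted
		{term: "tar"},               // plain content match: counts once
	}
	fmt.Println(termFrequencies(cands)["tar"]) // 5 + 1 = 6
}
```

The appeal of reusing the filename boost is that it introduces no new tunable parameter, which matches the "no tuning" comment above.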
	"testing"
)

func TestCalculateTermFrequency(t *testing.T) {
Now calculateTermFrequency requires access to indexData, so it's hard to unit test. I checked the e2e test scoring_test.go carefully to confirm the new calculation is correct.
@@ -588,6 +588,22 @@ func findMaxOverlappingSection(secs []DocumentSection, off, sz uint32) (uint32, bool) {
	return uint32(j), ol1 > 0
}

func (p *contentProvider) matchesSymbol(cm *candidateMatch) bool {
We are duplicating some checks, since we run both calculateTermFrequency for the overall file score, plus candidateMatchScore for the individual chunk scores. It would be good to unify these, but I didn't want to embark on a big refactor in this PR.
Spent some time trying to understand this, and I think I've got a decent handle on it. Seems reasonable to me, but I'm not super well versed in this part of the codebase.
When digging into our Natural Language Search (NLS) eval results, I found that one of the leading causes of poor results for flexible search types like "Fuzzy symbol search" and "Find logic" was noisy matches in the top results. Currently, our BM25 ranking rewards every substring match equally. So for a query like 'extract tar', any match on 'tar' (even within unrelated terms like 'start') counts toward the term frequency.
This PR helps reduce that noise by boosting symbol matches the same way we boost filename matches. Our NLS evals show a positive improvement, and context evals are the tiniest bit better.
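The 'tar'-in-'start' problem described above can be seen with a plain substring count. This is an illustrative toy, not zoekt's actual matcher (which operates over its index rather than raw strings):

```go
package main

import (
	"fmt"
	"strings"
)

// countSubstring counts every occurrence of term in content, the way a
// purely substring-based term frequency would: every hit counts equally,
// whether or not it falls inside an unrelated word.
func countSubstring(content, term string) int {
	return strings.Count(strings.ToLower(content), strings.ToLower(term))
}

func main() {
	content := "func startServer() { // restart target tarball"
	// "tar" is found inside "startServer", "restart", "target", and "tarball",
	// even though only the last two have anything to do with tar archives.
	fmt.Println(countSubstring(content, "tar")) // 4
}
```

Under the PR's approach, only a match that lands on a symbol (or filename) would get the extra weight, so incidental hits inside identifiers like startServer no longer dominate the term frequency.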
Note: I also tried rewarding matches on word boundaries, taking inspiration from candidateMatchScore. I did not see any improvement in results from "stacking" this on top of the symbol boost. It also felt like I was really venturing into "overfitting" territory, as it requires a new tunable parameter.

Closes SPLF-758
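For reference, the word-boundary idea mentioned in the note above could look something like the following. This is a hypothetical, ASCII-only sketch of such a check, not the implementation in candidateMatchScore:

```go
package main

import (
	"fmt"
	"unicode"
)

// onWordBoundary reports whether content[off:off+length] both starts and
// ends at a word boundary. ASCII-only simplification for illustration;
// a real implementation would decode runes rather than index bytes.
func onWordBoundary(content string, off, length int) bool {
	isWord := func(b byte) bool {
		r := rune(b)
		return unicode.IsLetter(r) || unicode.IsDigit(r) || r == '_'
	}
	startOK := off == 0 || !isWord(content[off-1])
	end := off + length
	endOK := end == len(content) || !isWord(content[end])
	return startOK && endOK
}

func main() {
	content := "restart the tar extractor"
	fmt.Println(onWordBoundary(content, 3, 3))  // "tar" inside "restart": false
	fmt.Println(onWordBoundary(content, 12, 3)) // standalone "tar": true
}
```

As the note says, rewarding this on top of the symbol boost adds a tunable weight, which is exactly the kind of parameter the symbol boost avoids.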