Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dampen repetition-boost with log2 #658

Closed
wants to merge 1 commit into from
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
11 changes: 8 additions & 3 deletions eval.go
Original file line number Diff line number Diff line change
Expand Up @@ -19,6 +19,7 @@ import (
"fmt"
"log"
"math"
"math/bits"
"regexp/syntax"
"sort"
"strings"
Expand Down Expand Up @@ -415,7 +416,7 @@ func (d *indexData) scoreFile(fileMatch *FileMatch, doc uint32, mt matchTree, kn
}

maxFileScore := 0.0
repetitions := 0
repetitions := uint(0)
for i := range fileMatch.LineMatches {
if maxFileScore < fileMatch.LineMatches[i].Score {
maxFileScore = fileMatch.LineMatches[i].Score
Expand All @@ -442,8 +443,12 @@ func (d *indexData) scoreFile(fileMatch *FileMatch, doc uint32, mt matchTree, kn
// the matches.
fileMatch.addScore("fragment", maxFileScore, opts.DebugScore)

// Prefer docs with several top-scored matches.
fileMatch.addScore("repetition-boost", scoreRepetitionFactor*float64(repetitions), opts.DebugScore)
if repetitions != 0 {
// Prefer docs with several top-scored matches. We use log_2 (bits.Len) to
// prevent the repetitions overriding other factors. In this way it acts
// more like a tie break.
fileMatch.addScore("repetition-boost", scoreRepetitionFactor*float64(bits.Len(repetitions)), opts.DebugScore)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious why you pick bits.Len instead of math.Log2? Zoekt already uses math.Log for scoring traction. Doesn't bits.Len lead to a step-like behavior, boosting 7 repetitions just as much as 4?

math.Log (or Log2) would give an almost linear or at least strictly monotonous behavior for small repetitions and still dampen the boost for large number of repetitions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I over worried about perf. You are right, I should probably just call log.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stefanhengl What do you reckon? This will be a lot of calls to this. The other use of log happens at index time not per filematch.

BenchmarkBoostLen-32            1000000000               0.2085 ns/op
BenchmarkBoostLog-32            191207702                6.110 ns/op

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be more worried about the behaviour of the step-like function than about the performance. With 6ns/op you have to call this a lot to make a dent on a search? I think we can merge this, but we have to watch out for cases of good files being ranked lower because we haven't crossed the next 2x threshold to get a higher boost.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It also makes sense to me to use math.Log -- it seems really minor compared to the overall cost of assessing a match. And it makes things a bit more robust / easier to reason about.

}

if opts.UseDocumentRanks && len(d.ranks) > int(doc) {
weight := scoreFileRankFactor
Expand Down