-
Notifications
You must be signed in to change notification settings - Fork 101
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
dampen repetition-boost with log2 #658
Conversation
I sometimes notice very poor quality documents getting boosted on common terms due to them containing lots of results. This factor feels like it should work more as a tie-breaker, than overriding all other factors. Additionally this only adds the score if it will be non-zero. I've noticed the majority of search results have 0 here, so removes the noise in the debug output. Test Plan: searching for "class user" only had code in the top results.
8820c2c
to
6a372ee
Compare
// Prefer docs with several top-scored matches. We use log_2 (bits.Len) to | ||
// prevent the repetitions overriding other factors. In this way it acts | ||
// more like a tie break. | ||
fileMatch.addScore("repetition-boost", scoreRepetitionFactor*float64(bits.Len(repetitions)), opts.DebugScore) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Curious why you pick bits.Len
instead of math.Log2
? Zoekt already uses math.Log
for scoring traction
. Doesn't bits.Len lead to a step-like behavior, boosting 7 repetitions just as much as 4?
math.Log (or Log2) would give an almost linear or at least strictly monotonous behavior for small repetitions and still dampen the boost for large number of repetitions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because I over worried about perf. You are right, I should probably just call log.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stefanhengl What do you reckon? This will be a lot of calls to this. The other use of log happens at index time not per filematch.
BenchmarkBoostLen-32 1000000000 0.2085 ns/op
BenchmarkBoostLog-32 191207702 6.110 ns/op
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be more worried about the behaviour of the step-like function than about the performance. With 6ns/op you have to call this a lot to make a dent on a search? I think we can merge this, but we have to watch out for cases of good files being ranked lower because we haven't crossed the next 2x threshold to get a higher boost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It also makes sense to me to use math.Log
-- it seems really minor compared to the overall cost of assessing a match. And it makes things a bit more robust / easier to reason about.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When I was working on ranking, I actually didn't see good examples where repetition factor helped! Its definition is a bit narrow: the number of matches in the document that share the top score. Also, am I reading the code correctly that it doesn't apply to chunk matches (which is what we use in Sourcegraph search?)
Maybe we could just remove it to simplify and think about adding it back in a more solid way.
Here is the context why we added the repetition boost. |
Agreed I only ever saw it annoy me, which is why I ended up making it use log. I quite like the idea of removing it. Especially since we should really also be taking into account the document size (and maybe an estimate on how often a keyword appears in the corpus) |
@stefanhengl sorry I missed this comment. Lets chat about this later. |
I sometimes notice very poor quality documents getting boosted on common terms due to them containing lots of results. This factor feels like it should work more as a tie-breaker, than overriding all other factors.
Note: This work was guided by experimentation locally. Still need to set up a more robust way to track the effects of ranking changes.
Test Plan: searching for "class user" only had code in the top results.