-
Notifications
You must be signed in to change notification settings - Fork 110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ranking: allow partial overlap with symbol #742
Conversation
@@ -5,7 +5,7 @@ targetRank: 13 | |||
github.com/golang/go/src/internal/coverage/stringtab/stringtab.go | |||
19:type Writer struct { | |||
27:func (stw *Writer) InitWriter() { | |||
9: "internal/coverage/slicereader" | |||
70:func (stw *Writer) Write(w io.Writer) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: this looks like a much better line match than the older import line.
@@ -0,0 +1,38 @@ | |||
queryString: time compare\( | |||
query: (and substr:"time" substr:"compare(") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
neat that this works now :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To check I understand -- will the Sourcegraph query parser convert user input like "func resolve(" into this form, where the parenthesis is escaped?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will the Sourcegraph query parser convert user input like "func resolve(" into this form, where the parenthesis is escaped?
Not exactly. In Sourcegraph, if the pattern compare(
is labeled as "literal" (which it is if the query is patternType:keyword time compare(
), we escape the "(" before parsing it as regexp. Then we convert it to a substring query. This means, what Zoekt gets is a substring query with pattern "compare("
.
The test here is based on the behavior of Zoekt, which expects valid regex as input.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking good to me! Playing "Devil's Advocate", are there any classes of queries where you think this will make things worse?
@@ -0,0 +1,38 @@ | |||
queryString: time compare\( | |||
query: (and substr:"time" substr:"compare(") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To check I understand -- will the Sourcegraph query parser convert user input like "func resolve(" into this form, where the parenthesis is escaped?
@@ -637,7 +666,7 @@ func (p *contentProvider) candidateMatchScore(ms []*candidateMatch, language str | |||
} else if startMatch || endMatch { | |||
addScore("EdgeSymbol", (scoreSymbol+scorePartialSymbol)/2) | |||
} else { | |||
addScore("InnerSymbol", scorePartialSymbol) | |||
addScore("OverlapSymbol", scorePartialSymbol) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like how this change is so simple. Switching from Inner -> Overlap makes sense to me conceptually.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some dark arts, but should we consider innersymbol worth more than overlapsymbol? Hard to imagine a query where you get both innersymbol and overlapsymbol so likely not worth it.
Whenever you search something that didn't match a symbol and now it matches one partially, things might get worse. For example, you might be looking for all TODOs in a repo. Previously, you could have used |
I see. This sounds like an unusual case, so not a big concern. I guess another example would be... given a symbol
That may be a bit surprising. But I like your simple approach, we can play around with it and adjust later if needed. |
I think this isn't possible? For
I'm having a hard time coming up with a query that looks reasonable where previously we had nicer behaviour. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
contentprovider.go
Outdated
// Find the first section that overlaps. | ||
j := sort.Search(len(secs), func(i int) bool { return secs[i].End > off }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do you have an example where it is important to find the max overlap? Can't we just score this based on the first overlap found. If I am not mistaken it doesn't affect the score since any overlap you get a boost.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought a bit more about this and went the following route: We now calculate the relative overlap and return the section with the largest relative overlap. This way we prefer complete over partial overlaps, even if the partial overlap is bigger. It's important which section we pick because the boost depends on the symbol type behind the section.
contentprovider.go
Outdated
// Find the first section that might overlap | ||
j := sort.Search(len(secs), func(i int) bool { return secs[i].End > off }) | ||
|
||
if j == len(secs) || secs[j].Start >= off+sz { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: I don't know if this makes the code better, but I felt like I had to keep ensuring that .Start was with off and .End was off+sz. Maybe two local variables start := off; end := off+sz
would make it more obvious at each site we do a comparison that we are doing the right one?
return 0, false | ||
} | ||
|
||
relOverlap := func(j int) float64 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can be more direct than relative overlap. If I am not mistaken, we only care about partial vs full overlap? Could we not do something like:
// if the 2nd section has a full overlap always return it
if j+1 < len(secs) && secs[j+1].Start >= off && secs[j+1].End <= off+sz {
return j+1, true
}
// Otherwise the first section is atleast as good of an overlap
return j, secs[j].Start < off+sz
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I am not mistaken, we only care about partial vs full overlap?
I think the amount of the overlap should count. For example, if your code looks like
aaaa bbbb
both aaaa
and bbbb
are symbols. If you search for a bbb
it does make sense to prefer the second symbol over the first: the smaller the overlap is, the higher the chance that the match is low quality.
Sure, both are just partial overlaps and get the same boost for that. However, we don't just care about partial vs. complete overlap. The two symbols might be of a different kind and hence get a different score in scoreSymbolKind
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
aah that is a good point. Almost makes me think we should be scoring every symbol match and taking the max score?
With this update, we score matches which overlap with a symbol in any way. Before, we only added a score if the match was fully contained in a symbol.
Test plan: