Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ranking: allow partial overlap with symbol #742

Merged
merged 6 commits into from
Mar 1, 2024
Merged

Conversation

stefanhengl
Copy link
Member

@stefanhengl stefanhengl commented Feb 22, 2024

With this update, we score matches which overlap with a symbol in any way. Before, we only added a score if the match was fully contained in a symbol.

Test plan:

  • new unit tests
  • updated scoring tests

@@ -5,7 +5,7 @@ targetRank: 13
github.com/golang/go/src/internal/coverage/stringtab/stringtab.go
19:type Writer struct {
27:func (stw *Writer) InitWriter() {
9: "internal/coverage/slicereader"
70:func (stw *Writer) Write(w io.Writer) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: this looks like a much better line match than the older import line.

@@ -0,0 +1,38 @@
queryString: time compare\(
query: (and substr:"time" substr:"compare(")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

neat that this works now :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To check I understand -- will the Sourcegraph query parser convert user input like "func resolve(" into this form, where the parenthesis is escaped?

Copy link
Member Author

@stefanhengl stefanhengl Feb 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will the Sourcegraph query parser convert user input like "func resolve(" into this form, where the parenthesis is escaped?

Not exactly. In Sourcegraph, if the pattern compare( is labeled as "literal" (which it is if the query is patternType:keyword time compare(), we escape the "(" before parsing it as regexp. Then we convert it to a substring query. This means, what Zoekt gets is a substring query with pattern "compare(".

The test here is based on the behavior of Zoekt, which expects valid regex as input.

Copy link
Member

@jtibshirani jtibshirani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good to me! Playing "Devil's Advocate", are there any classes of queries where you think this will make things worse?

@@ -0,0 +1,38 @@
queryString: time compare\(
query: (and substr:"time" substr:"compare(")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To check I understand -- will the Sourcegraph query parser convert user input like "func resolve(" into this form, where the parenthesis is escaped?

@@ -637,7 +666,7 @@ func (p *contentProvider) candidateMatchScore(ms []*candidateMatch, language str
} else if startMatch || endMatch {
addScore("EdgeSymbol", (scoreSymbol+scorePartialSymbol)/2)
} else {
addScore("InnerSymbol", scorePartialSymbol)
addScore("OverlapSymbol", scorePartialSymbol)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like how this change is so simple. Switching from Inner -> Overlap makes sense to me conceptually.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some dark arts, but should we consider innersymbol worth more than overlapsymbol? Hard to imagine a query where you get both innersymbol and overlapsymbol so likely not worth it.

@stefanhengl
Copy link
Member Author

Playing "Devil's Advocate", are there any classes of queries where you think this will make things worse?

Whenever you search something that didn't match a symbol and now it matches one partially, things might get worse. For example, you might be looking for all TODOs in a repo. Previously, you could have used TODO( to match the pattern "TODO()", but now this will rank functions TODO(....) as top hit.

@stefanhengl stefanhengl marked this pull request as ready for review February 23, 2024 14:23
@jtibshirani
Copy link
Member

Whenever you search something that didn't match a symbol and now it matches one partially, things might get worse

I see. This sounds like an unusual case, so not a big concern.

I guess another example would be... given a symbol WriteFiles, these two queries now produce same score:

  • WriteFile
  • FilesBuffer

That may be a bit surprising. But I like your simple approach, we can play around with it and adjust later if needed.

@keegancsmith
Copy link
Member

I guess another example would be... given a symbol WriteFiles, these two queries now produce same score:

  • WriteFile
  • FilesBuffer

I think this isn't possible? For FilesBuffer query to get the overlap symbol you need both FilesBuffer to appear in the corpus and partially be symbol. So it won't match on WriteFiles unless your language somehow has

WriteFilesBuffer
^^^^^^^^^^        only this part is symbol
          ^^^^^^  this part is not a symbol

I'm having a hard time coming up with a query that looks reasonable where previously we had nicer behaviour.

Copy link
Member

@keegancsmith keegancsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Comment on lines 561 to 562
// Find the first section that overlaps.
j := sort.Search(len(secs), func(i int) bool { return secs[i].End > off })
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you have an example where it is important to find the max overlap? Can't we just score this based on the first overlap found. If I am not mistaken it doesn't affect the score since any overlap you get a boost.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought a bit more about this and went the following route: We now calculate the relative overlap and return the section with the largest relative overlap. This way we prefer complete over partial overlaps, even if the partial overlap is bigger. It's important which section we pick because the boost depends on the symbol type behind the section.

// Find the first section that might overlap
j := sort.Search(len(secs), func(i int) bool { return secs[i].End > off })

if j == len(secs) || secs[j].Start >= off+sz {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: I don't know if this makes the code better, but I felt like I had to keep ensuring that .Start was with off and .End was off+sz. Maybe two local variables start := off; end := off+sz would make it more obvious at each site we do a comparison that we are doing the right one?

return 0, false
}

relOverlap := func(j int) float64 {
Copy link
Member

@keegancsmith keegancsmith Mar 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can be more direct than relative overlap. If I am not mistaken, we only care about partial vs full overlap? Could we not do something like:

// if the 2nd section has a full overlap always return it
if j+1 < len(secs) && secs[j+1].Start >= off && secs[j+1].End <= off+sz {
  return j+1, true
}
// Otherwise the first section is atleast as good of an overlap
return j, secs[j].Start < off+sz

Copy link
Member Author

@stefanhengl stefanhengl Mar 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I am not mistaken, we only care about partial vs full overlap?

I think the amount of the overlap should count. For example, if your code looks like

aaaa bbbb

both aaaa and bbbb are symbols. If you search for a bbb it does make sense to prefer the second symbol over the first: the smaller the overlap is, the higher the chance that the match is low quality.

Sure, both are just partial overlaps and get the same boost for that. However, we don't just care about partial vs. complete overlap. The two symbols might be of a different kind and hence get a different score in scoreSymbolKind.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aah that is a good point. Almost makes me think we should be scoring every symbol match and taking the max score?

@stefanhengl stefanhengl merged commit 74bbcc9 into main Mar 1, 2024
8 checks passed
@stefanhengl stefanhengl deleted the sh/scoring/overlap branch March 1, 2024 09:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants