Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ranking: allow partial overlap with symbol #742

Merged
merged 6 commits into from
Mar 1, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion .github/workflows/ci.yml
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,8 @@ name: CI
jobs:
test:
runs-on: ubuntu-latest
container: alpine:edge # go1.19 needs > alpine 3.15
# Pinned to alpine 3.19 to fix go version to 1.21. Remove this once Sourcegraph is on Go 1.22.
container: alpine:3.19
steps:
- name: checkout
uses: actions/checkout@v3
Expand Down
32 changes: 32 additions & 0 deletions build/scoring_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -162,6 +162,38 @@ func TestJava(t *testing.T) {
// 7000 (symbol) + 900 (Java enum) + 500 (word) + 300 (atom) + 10 (file order)
wantScore: 8710,
},
{
fileName: "example.java",
content: exampleJava,
query: &query.Substring{Content: true, Pattern: "unInnerInterface("},
language: "Java",
// 4000 (overlap Symbol) + 700 (Java method) + 50 (partial word) + 10 (file order)
wantScore: 4760,
},
{
fileName: "example.java",
content: exampleJava,
query: &query.Substring{Content: true, Pattern: "InnerEnum"},
language: "Java",
// 7000 (Symbol) + 900 (Java enum) + 500 (word) + 10 (file order)
wantScore: 8410,
},
{
fileName: "example.java",
content: exampleJava,
query: &query.Substring{Content: true, Pattern: "enum InnerEnum"},
language: "Java",
// 5500 (edge Symbol) + 900 (Java enum) + 500 (word) + 10 (file order)
wantScore: 6910,
},
{
fileName: "example.java",
content: exampleJava,
query: &query.Substring{Content: true, Pattern: "public enum InnerEnum {"},
language: "Java",
// 4000 (overlap Symbol) + 900 (Java enum) + 500 (word) + 10 (file order)
wantScore: 5410,
},
}

for _, c := range cases {
Expand Down
51 changes: 48 additions & 3 deletions contentprovider.go
Original file line number Diff line number Diff line change
Expand Up @@ -552,6 +552,51 @@ func findSection(secs []DocumentSection, off, sz uint32) (uint32, bool) {
return 0, false
}

// findMaxOverlappingSection returns the index of the section in secs that
// overlaps the most with the area defined by off and sz, relative to the size
// of the section. If no section overlaps, it returns 0, false. If multiple
// sections overlap the same amount, the first one is returned.
//
// The implementation assumes that sections do not overlap and are sorted by
// DocumentSection.Start.
func findMaxOverlappingSection(secs []DocumentSection, off, sz uint32) (uint32, bool) {
start := off
end := off + sz

// Find the first section that might overlap
j := sort.Search(len(secs), func(i int) bool { return secs[i].End > start })

if j == len(secs) || secs[j].Start >= end {
// No overlap.
return 0, false
}

relOverlap := func(j int) float64 {
Copy link
Member

@keegancsmith keegancsmith Mar 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can be more direct than relative overlap. If I am not mistaken, we only care about partial vs full overlap? Could we not do something like:

// if the 2nd section has a full overlap always return it
if j+1 < len(secs) && secs[j+1].Start >= off && secs[j+1].End <= off+sz {
  return j+1, true
}
// Otherwise the first section is atleast as good of an overlap
return j, secs[j].Start < off+sz

Copy link
Member Author

@stefanhengl stefanhengl Mar 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I am not mistaken, we only care about partial vs full overlap?

I think the amount of the overlap should count. For example, if your code looks like

aaaa bbbb

both aaaa and bbbb are symbols. If you search for a bbb it does make sense to prefer the second symbol over the first: the smaller the overlap is, the higher the chance that the match is low quality.

Sure, both are just partial overlaps and get the same boost for that. However, we don't just care about partial vs. complete overlap. The two symbols might be of a different kind and hence get a different score in scoreSymbolKind.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aah that is a good point. Almost makes me think we should be scoring every symbol match and taking the max score?

secSize := secs[j].End - secs[j].Start
if secSize == 0 {
return 0
}
// This cannot overflow because we make sure there is overlap before calling relOverlap
overlap := min(secs[j].End, end) - max(secs[j].Start, start)
return float64(overlap) / float64(secSize)
}

ol1 := relOverlap(j)
if epsilonEqualsOne(ol1) || j == len(secs)-1 || secs[j+1].Start >= end {
return uint32(j), ol1 > 0
}

// We know that [off,off+sz[ overlaps with at least 2 sections. We only have to check
// if the second section overlaps more than the first one, because a third
// section can only overlap if the overlap with the second section is complete.
ol2 := relOverlap(j + 1)
if ol2 > ol1 {
return uint32(j + 1), ol2 > 0
}

return uint32(j), ol1 > 0
}

func (p *contentProvider) findSymbol(cm *candidateMatch) (DocumentSection, *Symbol, bool) {
if cm.fileName {
return DocumentSection{}, nil, false
Expand All @@ -561,8 +606,8 @@ func (p *contentProvider) findSymbol(cm *candidateMatch) (DocumentSection, *Symb

secIdx, ok := cm.symbolIdx, cm.symbol
if !ok {
// Not from a symbol matchtree. Lets see if it intersects with a symbol.
secIdx, ok = findSection(secs, cm.byteOffset, cm.byteMatchSz)
// Not from a symbol matchTree. Let's see if it overlaps with a symbol.
secIdx, ok = findMaxOverlappingSection(secs, cm.byteOffset, cm.byteMatchSz)
}
if !ok {
return DocumentSection{}, nil, false
Expand Down Expand Up @@ -637,7 +682,7 @@ func (p *contentProvider) candidateMatchScore(ms []*candidateMatch, language str
} else if startMatch || endMatch {
addScore("EdgeSymbol", (scoreSymbol+scorePartialSymbol)/2)
} else {
addScore("InnerSymbol", scorePartialSymbol)
addScore("OverlapSymbol", scorePartialSymbol)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like how this change is so simple. Switching from Inner -> Overlap makes sense to me conceptually.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some dark arts, but should we consider innersymbol worth more than overlapsymbol? Hard to imagine a query where you get both innersymbol and overlapsymbol so likely not worth it.

}

// Score based on symbol data
Expand Down
53 changes: 53 additions & 0 deletions contentprovider_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -407,3 +407,56 @@ func TestColumnHelper(t *testing.T) {
t.Fatal("unexpected value for backwards call for first column on second line", got)
}
}

func TestFindMaxOverlappingSection(t *testing.T) {
secs := []DocumentSection{
{Start: 0, End: 5},
{Start: 8, End: 19},
{Start: 22, End: 26},
}
// 012345678901234567890123456
// [....[
// [..........[
// [...[

testcases := []struct {
name string
off uint32
sz uint32
wantSecIdx uint32
wantOverlap bool
}{
{off: 0, sz: 1, wantSecIdx: 0, wantOverlap: true},
{off: 0, sz: 5, wantSecIdx: 0, wantOverlap: true},
{off: 2, sz: 5, wantSecIdx: 0, wantOverlap: true},
{off: 2, sz: 50, wantSecIdx: 1, wantOverlap: true},
{off: 4, sz: 10, wantSecIdx: 1, wantOverlap: true},
{off: 5, sz: 15, wantSecIdx: 1, wantOverlap: true},
{off: 18, sz: 10, wantSecIdx: 2, wantOverlap: true},

// Prefer full overlap, break ties by preferring the earlier section
{off: 10, sz: 20, wantSecIdx: 2, wantOverlap: true},
{off: 0, sz: 100, wantSecIdx: 0, wantOverlap: true},
{off: 8, sz: 100, wantSecIdx: 1, wantOverlap: true},
{off: 0, sz: 10, wantSecIdx: 0, wantOverlap: true},
{off: 16, sz: 10, wantSecIdx: 2, wantOverlap: true},

// No overlap
{off: 5, sz: 2, wantOverlap: false},
{off: 20, sz: 1, wantOverlap: false},
{off: 99, sz: 1, wantOverlap: false},
{off: 0, sz: 0, wantOverlap: false},
}

for _, tt := range testcases {
t.Run(fmt.Sprintf("off=%d size=%d", tt.off, tt.sz), func(t *testing.T) {
got, haveOverlap := findMaxOverlappingSection(secs, tt.off, tt.sz)
if haveOverlap != tt.wantOverlap {
t.Fatalf("expected overlap %v, got %v", tt.wantOverlap, haveOverlap)
}
if got != tt.wantSecIdx {
t.Fatalf("expected section %d, got %d", tt.wantSecIdx, got)
}
})
}
}
1 change: 1 addition & 0 deletions internal/e2e/e2e_rank_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -54,6 +54,7 @@ func TestRanking(t *testing.T) {
q("test server", "github.com/golang/go/src/net/http/httptest/server.go"),
q("bytes buffer", "github.com/golang/go/src/bytes/buffer.go"),
q("bufio buffer", "github.com/golang/go/src/bufio/scan.go"),
q("time compare\\(", "github.com/golang/go/src/time/time.go"),

// sourcegraph/sourcegraph
q("graphql type User", "github.com/sourcegraph/sourcegraph/cmd/frontend/graphqlbackend/schema.graphql"),
Expand Down
2 changes: 1 addition & 1 deletion internal/e2e/testdata/coverage_data_writer.txt
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@ targetRank: 13
github.com/golang/go/src/internal/coverage/stringtab/stringtab.go
19:type Writer struct {
27:func (stw *Writer) InitWriter() {
9: "internal/coverage/slicereader"
70:func (stw *Writer) Write(w io.Writer) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor: this looks like a much better line match than the older import line.

hidden 16 more line matches

github.com/golang/go/src/cmd/cover/func.go
Expand Down
8 changes: 4 additions & 4 deletions internal/e2e/testdata/rank_stats.txt
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
queries: 14
recall@1: 9 (64%)
recall@5: 11 (79%)
mrr: 0.710733
queries: 15
recall@1: 10 (67%)
recall@5: 12 (80%)
mrr: 0.730017
38 changes: 38 additions & 0 deletions internal/e2e/testdata/time_compare.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,38 @@
queryString: time compare\(
query: (and substr:"time" substr:"compare(")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

neat that this works now :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To check I understand -- will the Sourcegraph query parser convert user input like "func resolve(" into this form, where the parenthesis is escaped?

Copy link
Member Author

@stefanhengl stefanhengl Feb 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will the Sourcegraph query parser convert user input like "func resolve(" into this form, where the parenthesis is escaped?

Not exactly. In Sourcegraph, if the pattern compare( is labeled as "literal" (which it is if the query is patternType:keyword time compare(), we escape the "(" before parsing it as regexp. Then we convert it to a substring query. This means, what Zoekt gets is a substring query with pattern "compare(".

The test here is based on the behavior of Zoekt, which expects valid regex as input.

targetRank: 1

**github.com/golang/go/src/time/time.go**
129:type Time struct {
79:package time
271:func (t Time) Compare(u Time) int {
hidden 250 more line matches

github.com/sourcegraph/sourcegraph/internal/api/api.go
127:func (r ExternalRepoSpec) Compare(s ExternalRepoSpec) int {
7: "time"
170: CreatedAt time.Time // the date when this settings value was created

github.com/sourcegraph/sourcegraph/client/shared/src/codeintel/scip.ts
117: public compare(other: Range): number {
53: return this.compare(other) < 0
56: return this.compare(other) <= 0
hidden 10 more line matches

github.com/golang/go/src/strings/compare.go
13:func Compare(a, b string) int {
14: // NOTE(rsc): This function does NOT call the runtime cmpstring function,

github.com/golang/go/src/go/constant/value.go
1337:func Compare(x_ Value, op token.Token, y_ Value) bool {
1102:// Division by zero leads to a run-time panic.
1381: re := Compare(x.re, token.EQL, y.re)
hidden 1 more line matches

github.com/golang/go/src/syscall/zsyscall_windows.go
878:func GetSystemTimeAsFileTime(time *Filetime) {
1088:func SetFileTime(handle Handle, ctime *Filetime, atime *Filetime, wtime *Filetime) (err error) {
132: procGetSystemTimeAsFileTime = modkernel32.NewProc("GetSystemTimeAsFileTime")
hidden 19 more line matches

hidden 139 more file matches
4 changes: 2 additions & 2 deletions internal/e2e/testdata/zoekt_searcher.txt
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@ hidden 13 more line matches
github.com/sourcegraph/zoekt/rpc/internal/srv/srv.go
33:type Searcher struct {
34: Searcher zoekt.Searcher
7: "github.com/sourcegraph/zoekt"
37:func (s *Searcher) Search(ctx context.Context, args *SearchArgs, reply *SearchReply) error {
hidden 9 more line matches

github.com/sourcegraph/sourcegraph/doc/admin/observability/dashboards.md
Expand All @@ -35,7 +35,7 @@ hidden 1 more line matches
github.com/sourcegraph/zoekt/json/json.go
26: Searcher zoekt.Searcher
25:type jsonSearcher struct {
9: "github.com/sourcegraph/zoekt"
48:func (s *jsonSearcher) jsonSearch(w http.ResponseWriter, req *http.Request) {
hidden 16 more line matches

hidden 119 more file matches
4 changes: 2 additions & 2 deletions read_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -306,8 +306,8 @@ func TestReadSearch(t *testing.T) {
continue
}

if d := cmp.Diff(res.Files, want.FileMatches[j]); d != "" {
t.Errorf("matches for %s on %s\n%s", q, name, d)
if d := cmp.Diff(want.FileMatches[j], res.Files); d != "" {
t.Errorf("matches for %s on %s (-want +got)\n%s", q, name, d)
}
}
}
Expand Down
4 changes: 2 additions & 2 deletions testdata/golden/TestReadSearch/ctagsrepo_v16.00000.golden
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,7 @@
"Before": null,
"After": null,
"FileName": false,
"Score": 501,
"Score": 6801,
"DebugScore": "",
"LineFragments": [
{
Expand All @@ -29,7 +29,7 @@
}
],
"Checksum": "n9fUYqacPXg=",
"Score": 510
"Score": 6810
}
],
[
Expand Down
4 changes: 2 additions & 2 deletions testdata/golden/TestReadSearch/ctagsrepo_v17.00000.golden
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,7 @@
"Before": null,
"After": null,
"FileName": false,
"Score": 501,
"Score": 6801,
"DebugScore": "",
"LineFragments": [
{
Expand All @@ -29,7 +29,7 @@
}
],
"Checksum": "n9fUYqacPXg=",
"Score": 510
"Score": 6810
}
],
[
Expand Down
4 changes: 2 additions & 2 deletions testdata/golden/TestReadSearch/repo2_v16.00000.golden
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,7 @@
"Before": null,
"After": null,
"FileName": false,
"Score": 501,
"Score": 6801,
"DebugScore": "",
"LineFragments": [
{
Expand All @@ -29,7 +29,7 @@
}
],
"Checksum": "Ju1TnQKZ6mE=",
"Score": 510
"Score": 6810
}
],
[
Expand Down
Loading