archive: e2e test for ranking against sourcegraph repo #695
This is an initial framework for having golden file results for search results against a real repository. At first we have only added one query and one repository, but it should be straightforward to grow this list further.
The golden files we write to disk are a summary of results with debug information. This matches how we have been using the zoekt CLI tool on the keyword branch during our ranking work.
Test Plan: go test
This seems like a solid + simple way to index a repo snapshot at a particular time!
One overall comment: I had been thinking that along with each query, we'd also provide the 1-2 files we consider to be most relevant. We could show some visual indication of the result that's relevant, and also report a metric like "recall at 5". This helps make trade-offs when reviewing changes to results. For example, maybe we see a bunch of changes in results: knowing what files are relevant and how their positions changed can help determine if the changes are overall positive.
What do you think? How were you thinking we'd make use of this test for evaluating changes to ranking?
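To make the "recall at 5" idea concrete, here is a minimal sketch, assuming each query is annotated with the file names we consider relevant and that results are an ordered list of file names. `recallAtK` and its parameters are hypothetical and not part of zoekt.

```go
package main

// recallAtK returns the fraction of relevant files that appear in the
// first k entries of ranked. It returns 0 when no relevant files are given.
// Hypothetical helper for illustration; not part of zoekt.
func recallAtK(ranked, relevant []string, k int) float64 {
	if len(relevant) == 0 {
		return 0
	}
	if k < 0 {
		k = 0
	}
	if k > len(ranked) {
		k = len(ranked)
	}
	// Collect the top-k file names for constant-time lookups.
	top := make(map[string]bool, k)
	for _, name := range ranked[:k] {
		top[name] = true
	}
	found := 0
	for _, name := range relevant {
		if top[name] {
			found++
		}
	}
	return float64(found) / float64(len(relevant))
}
```

With 1-2 relevant files per query, this value is 0, 0.5, or 1, so a change in it between two runs is easy to read when reviewing a ranking change.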
Agreed, I can add that. Mind me doing that as a follow-up PR to avoid this one getting too big?
In tests we want to assert on behaviour. I am thinking we could assert on acceptable recall? Or maybe just log it? Or hook this up to a tool we can run outside of tests? WDYT?
Agreed, this would be useful. Right now I am concerned, for example, that including the debugscore information is too noisy.
I was imagining we would inspect the changes to the snapshot files to realise the impact. It would still be "feeling based", so your idea of outputting a single metric (etc) is really useful. I'll follow up with those ideas. I've added a task to the tracking issue titled "ranking: add recall to zoekt test".
Makes sense to do this in a follow-up. I'll review this PR right now.
I like your approach of asserting on these as "gold" results -- it helps prevent accidental ranking changes, and also forces us to evaluate any ranking changes in a more rigorous way. For me it'd be useful to just log the recall as part of the test output. It'd also be nice if we could show a visual indication in results of what file is relevant... something like
Looks good to me! I left a few non-blocking comments.
@jtibshirani I'm going to follow up with a recall measurement. I realised it isn't clear to me what exactly the metric will be. When I google recall measurements, recall is often presented as a % of the total corpus, which is not that helpful to us. I can think of two systems: 1 point every time a document we want appears in the top 5, or a score where the top doc is worth 5 points and the points decrease for lower positions. Additionally, I was thinking it may be useful to track which line we show. E.g. some of the improvements we have made have made us more likely to show the class definition in the file at the top, rather than a random other part of the document that matches.
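The two systems mentioned above could be sketched as follows. This is only an illustration: `hitsAtK` and `positionScore` are hypothetical names, and the choice that the score drops by one point per position (5, 4, 3, 2, 1) is an assumption, since the comment does not say how the points decrease.

```go
package main

// hitsAtK awards one point for each wanted file that appears in the
// first k results. Hypothetical helper for illustration.
func hitsAtK(ranked, wanted []string, k int) int {
	return score(ranked, wanted, k, func(int) int { return 1 })
}

// positionScore awards k points for a wanted file at the top, k-1 for
// the next position, and so on down to 1 point at position k.
// The linear decrease is an assumption made for this sketch.
func positionScore(ranked, wanted []string, k int) int {
	return score(ranked, wanted, k, func(i int) int { return k - i })
}

// score sums points(i) over every 0-based position i < k whose file
// is in wanted.
func score(ranked, wanted []string, k int, points func(int) int) int {
	want := make(map[string]bool, len(wanted))
	for _, name := range wanted {
		want[name] = true
	}
	total := 0
	for i, name := range ranked {
		if i >= k {
			break
		}
		if want[name] {
			total += points(i)
		}
	}
	return total
}
```

The first scheme only says whether the wanted files made the cut, while the second also moves when a wanted file shifts position inside the top 5.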
In my experience it's common to report a couple metrics to try to capture the overall quality. For our problem I think these are most helpful:
As background, a lot of the traditional metrics (recall as % of corpus, mAP, NDCG) assume that there are a bunch of docs throughout the corpus that are relevant to the query in different degrees. I don't think that matches our use case well -- there are usually 1-2 docs that are highly relevant or "correct" answers, so we can use simple binary metrics that focus on whether we retrieved those.
Fixes https://github.com/sourcegraph/sourcegraph/issues/57666