score: experimental extension novelty in sorting #665

Merged
merged 3 commits into from
Oct 26, 2023

@keegancsmith keegancsmith commented Oct 19, 2023

Right now we boost a result whose file extension hasn't been seen higher in the ranking into the 3rd position. This is gated by an environment variable, which defaults to on. I want to explore whether we can turn on this behaviour from the query language.

Test Plan: go run ./cmd/zoekt foo

Right now we boost a result whose file extension hasn't been seen higher
in the ranking into the 3rd position. This is gated by an environment
variable. I want to explore whether we can turn on this behaviour from
the query language.

Test Plan: ZOEKT_NOVELTY=1 go run ./cmd/zoekt foo
@keegancsmith keegancsmith requested a review from a team October 19, 2023 15:24
@keegancsmith keegancsmith marked this pull request as ready for review October 24, 2023 15:08
@keegancsmith
Member Author

@jtibshirani @stefanhengl after playing around with this a bunch, I'm really enjoying it. Keen to ship it in Sourcegraph. What do you think?

There is one change I may make before following up. When aggregating results in the frontend with streaming, we may call this multiple times. I suppose we only want this behaviour for the very first call.

@jtibshirani
Member

@keegancsmith seems like a nice direction! Can you share some examples where it really helps, to help me get a feel for things too? I'm also curious -- if we had a great notion of "file importance", would this still be as helpful? In my work with keyword search, I've noticed the top results can be filled with build files or other noise, but we could try to address that directly.

@keegancsmith
Member Author

I don't have concrete examples of it boosting something I wanted; in most of my testing, the result I wanted was already at the top. However, it feels good and that is what I am going on (sorry for being so non-empirical).

@keegancsmith
Member Author

Stefan just reminded me of one real example we came across. We boosted a Markdown file into the third spot; it was related to the query and was part of what we wanted to see.

@keegancsmith keegancsmith merged commit 1a3dddc into main Oct 26, 2023
8 checks passed
@keegancsmith keegancsmith deleted the k/novelty branch October 26, 2023 09:02
@jtibshirani
Member

Sorry for the slow review on my end! In general it does feel important to balance relevance vs. diversity for broad queries. Another "diversity rule" that could be helpful: in the absence of file filters, at least one file in the top 3 should be a code file (not build, not docs).

@keegancsmith
Member Author

Diversity is the word I was looking for; it feels like a much better descriptor than novelty. Nice idea, filed https://github.com/sourcegraph/sourcegraph/issues/57975
