-
-
Notifications
You must be signed in to change notification settings - Fork 261
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add new similar note search algorithm #971
Add new similar note search algorithm #971
Conversation
5ab5d95
to
5672023
Compare
5672023
to
281bd69
Compare
281bd69
to
f1ddbf1
Compare
src/search/findSimilarNotes.ts
Outdated
* @param db - The Orama database. | ||
* @returns The embeddings for the given note titles. | ||
*/ | ||
async function getEmbeddings(noteTitles: string[], db: Orama<any>): Promise<number[][]> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: getNoteEmbeddings
may be a clearer name?
src/search/findSimilarNotes.ts
Outdated
const debug = getSettings().debug; | ||
const embeddings: number[][] = []; | ||
for (const noteTitle of noteTitles) { | ||
const noteFile = await getNoteFileFromTitle(app.vault, noteTitle); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not particularly for this PR but right now we are not differentiating notes with the same title under different paths. This was on me. In the future we should use a unique identifier i.e. note path.
When a user types [[
the list should show the corresponding path next to each title (just like OB itself does it), and in the background only path should be used as the unique id.
Thanks for the detailed descriptions! Generally LGTM! Just to prompt some ideas:
Some future considerations: this feature should be conditional on whether indexing is enabled on mobile. |
Thanks for the great feedback.
|
|
Another thing just came to mind: although we will enable auto index in all modes, we should still have an option NEVER available since some users I know don't want to trigger auto indexing ever. In that case, should the note similarity UI show a "refresh" button if NEVER is selected? |
5db1551
to
cd86e2d
Compare
cd86e2d
to
8d15e1c
Compare
@@ -314,18 +314,25 @@ export default class CopilotPlugin extends Plugin { | |||
}); | |||
|
|||
this.addCommand({ | |||
id: "find-similar-notes", | |||
name: "Find similar notes to active note", | |||
id: "find-relevant-notes", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the term "relevant" is more accurate as we are adding information that is more than just similarity
@@ -57,7 +57,7 @@ | |||
"prettier": "^3.3.3", | |||
"ts-jest": "^29.1.0", | |||
"tslib": "2.4.0", | |||
"typescript": "4.7.4", | |||
"typescript": "^5.7.2", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
upgrade TypeScript to get the filter function type working correctly
@logancyang I revamped the implementation and addressed some of your concerns:
I haven't implemented the tag similarity score yet because I don't have a good solution to run Jaccard similarity efficiently. I hope this is something that we can rely on Orama. We can follow up on this if you have ideas. We can add similarity adjustment in settings in a follow-up PR, too. It's worth discussing whether we want to let the user adjust the vector search threshold, the merged score threshold, or maybe the weights for each category. |
if (!db) throw new Error("DB not initialized"); | ||
if (!path) return; | ||
const result = await search(db, { | ||
term: path, | ||
properties: ["path"], | ||
exact: true, | ||
includeVectors: true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you figure out how to make exact
work? Is it still returning partial matches?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No. I created a repro sandbox and shared it with this issue: oramasearch/orama#866
This looks awesome! I can immediately tell this is gonna be so useful!
Questions:
Some more explorations if you are interested: graph-based similarity metrics |
/** | ||
* @deprecated File display title can be duplicated, so we should use file path | ||
* instead of title to find the note file. | ||
*/ | ||
export async function getNoteFileFromTitle(vault: Vault, noteTitle: string): Promise<TFile | null> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this should be replaced everywhere, thanks for marking it deprecated.
@@ -36,6 +40,9 @@ export async function getNoteFileFromTitle(vault: Vault, noteTitle: string): Pro | |||
return null; | |||
} | |||
|
|||
/** | |||
* @deprecated Use app.vault.getAbstractFileByPath() instead. | |||
*/ | |||
export const getNotesFromPath = async (vault: Vault, path: string): Promise<TFile[]> => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
getNotesFromPath
gets an array of notes from a folder path, it's not the same as app.vault.getAbstractFileByPath()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Removed the deprecated annotation. Maybe it's worth renaming the function to be getNotesFromFolder and rewrite the manual matching logic with getAbstractFileByPath. It works with folder as well. I will leave it to a future PR.
@@ -565,6 +575,7 @@ export async function safeFetch(url: string, options: RequestInit): Promise<Resp | |||
url: url, | |||
type: "basic", | |||
redirected: false, | |||
bytes: () => Promise.resolve(new Uint8Array(0)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's this for? Why do I get The bytes property isn't part of the standard Response interface
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yes I didn't manually set it in the IDE. Now it's fine 👍
I agree. I can see when notes with higher similarity score are ranked lower, it can cause confusion. A few thoughts about revealing one score:
Maybe we can hide all numeric value and just show badges. For example:
Users can see similarity score on hover the similarity badge
We can add a small ℹ️ icon in the corner and user can see the recommendation in the tooltip. Maybe we hide it if the model used is good enough.
That's a fair point. Let me update that.
I can turn it up to 0.6. It will be challenging to have one formula that fits everyone's note taking habit. I personally don't use tags as much and weight manual linked note more. I can imagine people can have a different preference. We probably need to open up the relevance adjustment in the settings eventually to give the choice back to users. |
c8a628c
to
aae1a6a
Compare
This commit addressed the latest feedback. PTAL. |
I also prefer no numeric metric and only have badges. The v0 version you have is spot on! So to summarize, we may benefit from exposing some things to user settings in the future:
(As for max number of results, are we returning 20 from similarity search and including all from links at the moment? Effectively we don't have a max on the overall result, correct?) We can defer things to future PRs, this one LGTM! |
Add a new
similarrelevant notes-searching algorithm with the following flows:The algorithm divided the relevance score calculation into three steps:
The results of each step then merge and dictate the relevance ranking. Compared to the previous implementation, which takes the linked note embedding into the vector search, the new version adds weight to the linked notes without polluting the vector search results.
The UI change is experimental. We eventually want to integrate relevant notes into the chat UI.
Here are some examples running the new algorithm in my vault and how they compare to smart connections. My vault is indexed with text-embedding-3-large. Smart connection used BGE-micro-v2.
Copilot results are better. Showed more relevant notes (probably thanks to a better embedding model) and ranked the backlink note higher.
Copilot results include links and backlinks.
Relevant notes of note in Chinese work properly now. Backlink and link also helped.
Everything below is from the original post, which is no longer relevant. Kept for documentation purposes.