Add new similar note search algorithm #971

Merged (4 commits) on Dec 28, 2024

Conversation

zeroliu
Collaborator

@zeroliu zeroliu commented Dec 25, 2024

Add a new algorithm for finding relevant notes, with the following flow:

Screenshot 2024-12-26 at 11 49 28 PM

The algorithm divides the relevance score calculation into three steps, weighted as follows:

  • Vector similarity of the current note (0.5)
  • Outgoing links (0.25)
  • Backlinks (0.25)

The results of the three steps are then merged to determine the relevance ranking. The previous implementation folded the linked notes' embeddings into the vector search; the new version adds weight to linked notes without polluting the vector search results.
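
The weighted merge can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `Candidate` shape and the function names are assumptions, and only the weights (0.5, 0.25, 0.25) come from the description above.

```typescript
// Hypothetical shape of a candidate note; not taken from the plugin's code.
interface Candidate {
  path: string;
  similarity?: number; // vector similarity in [0, 1], if the note was a search hit
  hasOutgoingLink?: boolean; // the current note links to this note
  hasBacklink?: boolean; // this note links to the current note
}

// Weights from the PR description.
const WEIGHTS = { similarity: 0.5, outgoingLink: 0.25, backlink: 0.25 };

// Combine the three signals into a single relevance score in [0, 1].
function relevanceScore(c: Candidate): number {
  return (
    WEIGHTS.similarity * (c.similarity ?? 0) +
    WEIGHTS.outgoingLink * (c.hasOutgoingLink ? 1 : 0) +
    WEIGHTS.backlink * (c.hasBacklink ? 1 : 0)
  );
}

// Sort a copy of the candidates from most to least relevant.
function rankByRelevance(candidates: Candidate[]): Candidate[] {
  return [...candidates].sort((a, b) => relevanceScore(b) - relevanceScore(a));
}
```

Under these weights, a note that is only moderately similar but has a backlink can outrank a highly similar note with no links, which matches the ranking behavior described in the examples.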

The UI change is experimental. We eventually want to integrate relevant notes into the chat UI.

Here are some examples of running the new algorithm in my vault and how the results compare to Smart Connections. My vault is indexed with text-embedding-3-large. Smart Connections used BGE-micro-v2.

Copilot's results are better: it showed more relevant notes (probably thanks to a better embedding model) and ranked the backlinked note higher.

Screenshot 2024-12-26 at 11 37 03 PM

Copilot results include links and backlinks.

Screenshot 2024-12-26 at 11 34 03 PM

Relevant notes for a note written in Chinese now work properly. Backlinks and links also helped.

Screenshot 2024-12-26 at 11 35 07 PM

Everything below is from the original post, which is no longer relevant. Kept for documentation purposes.

Screenshot 2024-12-25 at 3 48 48 PM
  1. Find embeddings of the current note
  2. Find embeddings of the linked notes ([[linked note]])
  3. Calculate a weighted average embedding of the found embeddings. The current note's embeddings are weighted more heavily than the linked notes' embeddings.
  4. Execute a vector search based on the average embedding.
  5. Pick the highest score if multiple chunks of a note are included in the results.
  6. Filter out the matching chunks of the current note.
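
Steps 3, 5, and 6 of this original approach can be sketched as below. This is an illustration under stated assumptions, not the original code: `ChunkHit`, `weightedAverageEmbedding`, and `bestHitPerNote` are hypothetical names, averaging each group before weighting is an assumption, and the 0.7 default is the weight mentioned in the review discussion.

```typescript
// Hypothetical chunk-level search hit; field names are assumptions.
interface ChunkHit {
  path: string;
  score: number;
}

// Element-wise mean of a list of equal-length vectors, or null if empty.
function meanVector(vectors: number[][]): number[] | null {
  if (vectors.length === 0) return null;
  const sum = new Array<number>(vectors[0].length).fill(0);
  for (const v of vectors) {
    v.forEach((x, i) => {
      sum[i] += x;
    });
  }
  return sum.map((x) => x / vectors.length);
}

// Step 3: weighted average of the current note's and linked notes' embeddings.
function weightedAverageEmbedding(
  current: number[][],
  linked: number[][],
  currentWeight = 0.7
): number[] {
  const c = meanVector(current);
  const l = meanVector(linked);
  if (!c) throw new Error("current note has no embeddings");
  if (!l) return c; // no linked notes: fall back to the current note alone
  return c.map((x, i) => currentWeight * x + (1 - currentWeight) * l[i]);
}

// Steps 5 and 6: keep the best-scoring chunk per note and drop the current note.
function bestHitPerNote(hits: ChunkHit[], currentPath: string): ChunkHit[] {
  const best = new Map<string, ChunkHit>();
  for (const hit of hits) {
    if (hit.path === currentPath) continue;
    const prev = best.get(hit.path);
    if (!prev || hit.score > prev.score) best.set(hit.path, hit);
  }
  return Array.from(best.values()).sort((a, b) => b.score - a.score);
}
```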

@zeroliu zeroliu force-pushed the zero/similar-notes-algo branch from 5ab5d95 to 5672023 on December 25, 2024 22:18
@zeroliu zeroliu marked this pull request as ready for review December 25, 2024 23:12
@zeroliu zeroliu requested a review from logancyang December 25, 2024 23:13
@zeroliu zeroliu force-pushed the zero/similar-notes-algo branch from 5672023 to 281bd69 on December 25, 2024 23:43
@zeroliu zeroliu force-pushed the zero/similar-notes-algo branch from 281bd69 to f1ddbf1 on December 25, 2024 23:52
* @param db - The Orama database.
* @returns The embeddings for the given note titles.
*/
async function getEmbeddings(noteTitles: string[], db: Orama<any>): Promise<number[][]> {
Owner

@logancyang logancyang Dec 26, 2024

nit: getNoteEmbeddings may be a clearer name?

const debug = getSettings().debug;
const embeddings: number[][] = [];
for (const noteTitle of noteTitles) {
const noteFile = await getNoteFileFromTitle(app.vault, noteTitle);
Owner

@logancyang logancyang Dec 26, 2024

Not specifically for this PR, but right now we do not differentiate between notes with the same title under different paths. That was on me. In the future we should use a unique identifier, i.e. the note path.

When a user types [[, the list should show the corresponding path next to each title (just as Obsidian itself does), and in the background only the path should be used as the unique ID.
SCR-20241226-clof

@logancyang
Owner

logancyang commented Dec 26, 2024

Thanks for the detailed descriptions! Generally LGTM!

Just to prompt some ideas:

  • What's the effect of changing the weights 0.7 and 0.3? Do you see any particular outcome that you may prefer and why?
  • Right now we are using outward links but not incoming ones. If we were to design for bidirectional links vs in vs out ones, what could be the best strategy?
  • Do you think tags can contribute as a strong signal like links? If so, we can consider using Jaccard similarity of tags as a signal. (tags is in the doc property but I'm not sure if it's correctly populated, we need to double-check.)
  • Explainability and user perception are important IMO. If we are using links and other things as our similarity signals, we may consider showing badges such as "linked" or "#<shared tags>" next to the result titles to let users know why a note was matched.
  • We may consider letting the user set the min similarity threshold
  • As for Chinese and other languages, I will add more plus embedding models: plus-medium, plus-large and plus-multimodal. They should have better performance for multilingual use.
    • I'm particularly curious about the Chinese performance of Cohere's multilingual-lite vs Voyage AI's voyage-3 and voyage-3-large. If you have time, you can compare them; if not, don't worry!

Some future considerations: this feature should be conditional on whether indexing is enabled on mobile.

@zeroliu
Collaborator Author

zeroliu commented Dec 26, 2024

Thanks for the great feedback.

  • I started with 50/50. The content of the linked notes felt ranked too high for notes that link to other notes for reference. 70/30 felt more balanced, but I'm sure there is room for improvement. I wonder if there is a benchmark of some sort that we can use to optimize.
  • I found information about getting links and backlinks from the file cache instead of from string matches. It should help us target the right file and take backlinks into consideration. Let me try those APIs later today.
  • Can you help me understand how we can use hybrid search to help with tags? Will it call external APIs?
  • Will the plus model be more powerful than text-embedding-3-large? I wonder whether other factors affected the results. I can't imagine the local model used by Smart Connections is better.

@logancyang
Owner

  • I initially suggested HybridRetriever with tags as salience terms but then changed it to tags in the doc property. The reasons:
      1. Orama hybrid search does a fine job with text and vector weights, but it's still hard to debug if it's not giving the desired result.
      2. Our own metric, such as Jaccard similarity of tags, can be combined with other metrics like links and vector similarity in a more flexible way. It is easier to debug and tune.
    • The choice is yours; both are feasible IMO.
  • The current plus-small uses voyage-3-lite, which is mainly an English-focused model. The BGE models are specially tuned for Chinese, so it's understandable that they do well with Chinese. But I'm still surprised that the Voyage model is giving almost random results, which is very suspicious. You can check the C-MTEB leaderboard; the voyage-3 series still ranks higher than BGE-micro-v2, if I'm not mistaken. So a deeper dive may be needed. I suggest trying some of the higher-ranked C-MTEB models to see if you can at least get some good results for Chinese.

@logancyang
Owner

Another thing just came to mind: although we will enable auto index in all modes, we should still have an option NEVER available since some users I know don't want to trigger auto indexing ever. In that case, should the note similarity UI show a "refresh" button if NEVER is selected?

@zeroliu zeroliu force-pushed the zero/similar-notes-algo branch 3 times, most recently from 5db1551 to cd86e2d on December 27, 2024 07:22
@zeroliu zeroliu force-pushed the zero/similar-notes-algo branch from cd86e2d to 8d15e1c on December 27, 2024 08:03
@@ -314,18 +314,25 @@ export default class CopilotPlugin extends Plugin {
  });

  this.addCommand({
-   id: "find-similar-notes",
-   name: "Find similar notes to active note",
+   id: "find-relevant-notes",
Collaborator Author

I think the term "relevant" is more accurate as we are adding information that is more than just similarity

@@ -57,7 +57,7 @@
  "prettier": "^3.3.3",
  "ts-jest": "^29.1.0",
  "tslib": "2.4.0",
- "typescript": "4.7.4",
+ "typescript": "^5.7.2",
Collaborator Author

upgrade TypeScript to get the filter function type working correctly

@zeroliu
Collaborator Author

zeroliu commented Dec 27, 2024

@logancyang I revamped the implementation and addressed some of your concerns:

  • Both outgoing and backlinks are considered for the relevance score calculation.
  • Links and backlinks now come from the Obsidian file cache instead of a title string match. This works for notes that share the same display title.
  • Explainability is improved. Users can see the following metadata in the result: (Similarity: 80% | Backlink | link). This UI will probably need some tweaks when it moves to the chat UI.
  • Chinese-language notes are fixed. (I'm unsure whether the problem was caused by my previous implementation or by a mistake in my indexing.)
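
As a rough illustration of path-based link lookup, the sketch below works on a record shaped like Obsidian's `app.metadataCache.resolvedLinks` (source path to destination path to link count). Whether this PR reads that exact property is not stated here, and `getOutgoingLinks` and `getBacklinks` are hypothetical helper names.

```typescript
// Same shape as Obsidian's `app.metadataCache.resolvedLinks`:
// source path -> destination path -> number of links.
type ResolvedLinks = Record<string, Record<string, number>>;

// Paths of the notes that the given note links to.
function getOutgoingLinks(links: ResolvedLinks, path: string): string[] {
  return Object.keys(links[path] ?? {});
}

// Paths of the notes that link to the given note (self-links excluded).
function getBacklinks(links: ResolvedLinks, path: string): string[] {
  return Object.keys(links).filter(
    (source) => source !== path && path in links[source]
  );
}
```

Because both helpers key on the full path, two notes with the same display title in different folders are treated as distinct notes.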

I haven't implemented the tag similarity score yet because I don't have a good way to run Jaccard similarity efficiently. I hope this is something we can rely on Orama for. We can follow up on this if you have ideas. We can add similarity adjustment to the settings in a follow-up PR, too. It's worth discussing whether we want to let the user adjust the vector search threshold, the merged score threshold, or maybe the weights for each category.

@zeroliu zeroliu requested a review from logancyang December 27, 2024 08:20
if (!db) throw new Error("DB not initialized");
if (!path) return;
const result = await search(db, {
term: path,
properties: ["path"],
exact: true,
includeVectors: true,
Owner

Did you figure out how to make exact work? Is it still returning partial matches?

Collaborator Author

@zeroliu zeroliu Dec 28, 2024

No. I created a repro sandbox and shared it in this issue: oramasearch/orama#866

@logancyang
Owner

This looks awesome! I can immediately tell this is gonna be so useful!

  • For the final UI, my two cents is that we should show only one "number" metric for each title. It seems showing the final "relevance score" is better because the user may ask "why a higher similarity score isn't ranked higher".
  • As for Jaccard similarity, there's no need to compute pairwise results for all pairs. We can just use the tags from the source note to search Orama on the "tags" field for docs with any shared tag, and compute Jaccard on this subset, it shouldn't be a large set. Let me know if you have any concerns.
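
The subset-based Jaccard idea could look roughly like this. It is a sketch under assumptions: `TaggedDoc` and `tagSimilarity` are hypothetical names, and the candidate list stands in for whatever the Orama search on the "tags" field returns (that query is not shown here).

```typescript
// Jaccard similarity of two tag lists: |A ∩ B| / |A ∪ B|.
function jaccard(a: string[], b: string[]): number {
  const setA = new Set(a);
  const setB = new Set(b);
  let shared = 0;
  setA.forEach((tag) => {
    if (setB.has(tag)) shared++;
  });
  const union = setA.size + setB.size - shared;
  return union === 0 ? 0 : shared / union;
}

// Hypothetical document shape; assumed to come from a tag search on the index.
interface TaggedDoc {
  path: string;
  tags: string[];
}

// Score only the candidate subset instead of every pair of notes in the vault.
function tagSimilarity(sourceTags: string[], candidates: TaggedDoc[]): Map<string, number> {
  const scores = new Map<string, number>();
  for (const doc of candidates) {
    const score = jaccard(sourceTags, doc.tags);
    if (score > 0) scores.set(doc.path, score);
  }
  return scores;
}
```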

Questions:

  • The most critical question is that this feature relies heavily on the embedding model used. Is there something we could do in the UI/UX to inform or guide the user as to which embedding models we recommend?
  • Should bidirectional links have a score of 2x that of only a link or backlink? To me personally, it feels like a 1.25x or 1.5x importance because I usually see any link as bidirectional.
  • Can a weight of 0.5 for links potentially overshadow vector similarity? Perhaps when we have shared tags, links and shared tags can have a combined weight of 0.5?
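
To make the bidirectional-link question concrete, here is a small sketch. The names are hypothetical, and the factor of 1.5 is just one of the values floated above, not a confirmed implementation detail.

```typescript
// Hypothetical link scoring: a single link or backlink counts 1x, and a
// bidirectional link counts `bidirectionalFactor`x instead of 2x.
function linkScore(
  hasOutgoingLink: boolean,
  hasBacklink: boolean,
  bidirectionalFactor = 1.5
): number {
  if (hasOutgoingLink && hasBacklink) return bidirectionalFactor;
  return hasOutgoingLink || hasBacklink ? 1 : 0;
}

// Normalize to [0, 1] so the link component can be combined with a
// similarity score under a fixed weight.
function normalizedLinkScore(
  hasOutgoingLink: boolean,
  hasBacklink: boolean,
  bidirectionalFactor = 1.5
): number {
  return linkScore(hasOutgoingLink, hasBacklink, bidirectionalFactor) / bidirectionalFactor;
}
```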

Some more explorations if you are interested: graph-based similarity metrics

/**
* @deprecated File display title can be duplicated, so we should use file path
* instead of title to find the note file.
*/
export async function getNoteFileFromTitle(vault: Vault, noteTitle: string): Promise<TFile | null> {
Owner

Yeah this should be replaced everywhere, thanks for marking it deprecated.

@@ -36,6 +40,9 @@ export async function getNoteFileFromTitle(vault: Vault, noteTitle: string): Pro
return null;
}

/**
* @deprecated Use app.vault.getAbstractFileByPath() instead.
*/
export const getNotesFromPath = async (vault: Vault, path: string): Promise<TFile[]> => {
Owner

getNotesFromPath gets an array of notes from a folder path; it's not the same as app.vault.getAbstractFileByPath().

Collaborator Author

I see. I removed the deprecated annotation. It may be worth renaming the function to getNotesFromFolder and rewriting the manual matching logic with getAbstractFileByPath, which works with folders as well. I will leave that to a future PR.

src/utils.ts (outdated)
@@ -565,6 +575,7 @@ export async function safeFetch(url: string, options: RequestInit): Promise<Resp
url: url,
type: "basic",
redirected: false,
bytes: () => Promise.resolve(new Uint8Array(0)),
Owner

What's this for? Why do I get "The bytes property isn't part of the standard Response interface"?

Collaborator Author

I got a type error here after upgrading TypeScript. Without this change, the build fails.

Did you run npm install with the latest package.json in this PR and select the latest TypeScript version in the IDE?

Screenshot 2024-12-27 at 11 03 01 PM

Owner

Oh yes I didn't manually set it in the IDE. Now it's fine 👍

@zeroliu
Collaborator Author

zeroliu commented Dec 28, 2024

  • For the final UI, my two cents is that we should show only one "number" metric for each title. It seems showing the final "relevance score" is better because the user may ask "why a higher similarity score isn't ranked higher".

I agree. I can see that when notes with a higher similarity score are ranked lower, it can cause confusion. A few thoughts about revealing one score:

  • How would we normalize the score? If the score is made up of similar content, links, and tags, a note that shares very similar content but no links or tags can get at most 0.5. It can be confusing to show people 0.5 or 50%.
  • When a note has links but is not indexed, would people understand what a relevance score of 50% means and how it is calculated from links? (e.g. No index | backlink | link).

Maybe we can hide all numeric values and just show badges. For example:

  • Note Title 1 (Strong similarity) (Links) (Tags)
  • Note Title 2 (Moderate similarity)
  • Note Title 3 (Moderate similarity) (Links)

Users can see the similarity score when hovering over the similarity badge.
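
A badge-only display along these lines might be sketched as follows. The thresholds, the `RelevantNote` shape, and the function names are placeholders chosen for illustration; the thread does not specify where "strong" and "moderate" begin.

```typescript
// Placeholder thresholds; the discussion does not define these values.
const STRONG_SIMILARITY = 0.75;
const MODERATE_SIMILARITY = 0.5;

// Hypothetical result shape used only for this sketch.
interface RelevantNote {
  title: string;
  similarity?: number; // undefined when the note is not indexed
  hasLinks: boolean;
  hasSharedTags: boolean;
}

// Translate the raw signals into user-facing badges without exposing numbers.
function badgesFor(note: RelevantNote): string[] {
  const badges: string[] = [];
  if (note.similarity !== undefined) {
    if (note.similarity >= STRONG_SIMILARITY) badges.push("Strong similarity");
    else if (note.similarity >= MODERATE_SIMILARITY) badges.push("Moderate similarity");
  }
  if (note.hasLinks) badges.push("Links");
  if (note.hasSharedTags) badges.push("Tags");
  return badges;
}

// Render a result line such as "Note Title 1 (Strong similarity) (Links) (Tags)".
function formatResult(note: RelevantNote): string {
  return [note.title, ...badgesFor(note).map((badge) => `(${badge})`)].join(" ");
}
```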

The most critical question is that this feature relies heavily on the embedding model used. Is there something we could do in UIUX that informs/guides the user as to which embedding models we recommend?

We can add a small ℹ️ icon in the corner so that users can see the recommendation in a tooltip. Maybe we hide it if the model in use is good enough.

Should bidirectional links have a score of 2x that of only a link or backlink? To me personally, it feels like a 1.25x or 1.5x importance because I usually see any link as bidirectional.

That's a fair point. Let me update that.

Can a weight of 0.5 for links potentially overshadow vector similarity? Perhaps when we have shared tags, links and shared tags can have a combined weight of 0.5?

I can turn it up to 0.6. It will be challenging to find one formula that fits everyone's note-taking habits. I personally don't use tags much and give more weight to manually linked notes. I can imagine other people having different preferences. We will probably need to expose the relevance adjustment in the settings eventually to give the choice back to users.

@zeroliu zeroliu force-pushed the zero/similar-notes-algo branch from c8a628c to aae1a6a on December 28, 2024 07:02
@zeroliu
Collaborator Author

zeroliu commented Dec 28, 2024

@logancyang 34208dc (#971)

This commit addressed the latest feedback. PTAL.

@zeroliu zeroliu requested a review from logancyang December 28, 2024 07:14
@logancyang
Owner

I also prefer no numeric metric and only have badges. The v0 version you have is spot on!

So to summarize, we may benefit from exposing some things to user settings in the future:

  • Min similarity threshold
  • Weights for similarity, links and tags
  • Max number of results

(As for max number of results, are we returning 20 from similarity search and including all from links at the moment? Effectively we don't have a max on the overall result, correct?)
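
If a threshold and an overall cap were exposed as settings, applying them could look like the sketch below. Everything here is hypothetical (the settings do not exist yet, and the names are invented for illustration); per-signal weights could be added to the same settings object. Linked notes bypass the similarity threshold in this sketch, which is an assumption rather than a decision from the thread.

```typescript
// Hypothetical settings shape; none of these names exist in the plugin yet.
interface RelevanceSettings {
  minSimilarity: number; // vector hits below this are dropped
  maxResults: number; // cap on the merged result list
}

// Hypothetical merged result used only for this sketch.
interface ScoredNote {
  path: string;
  similarity?: number;
  viaLink: boolean; // found through a link or backlink
  score: number; // merged relevance score
}

// Drop weak vector-only hits, rank by merged score, then cap the list.
function applySettings(notes: ScoredNote[], settings: RelevanceSettings): ScoredNote[] {
  return notes
    .filter((n) => n.viaLink || (n.similarity ?? 0) >= settings.minSimilarity)
    .sort((a, b) => b.score - a.score)
    .slice(0, settings.maxResults);
}
```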

We can defer things to future PRs, this one LGTM!

@logancyang logancyang merged commit a462b82 into logancyang:master Dec 28, 2024
2 checks passed