Add new similar note search algorithm #971

zeroliu · 2024-12-25T21:51:53Z

Add a new ~~similar~~ relevant notes-searching algorithm with the following flows:

The algorithm divides the relevance score calculation into three steps:

Vector similarity of the current note (0.5)
Outgoing links (0.25)
Backlinks (0.25)

The results of the three steps are then merged to determine the relevance ranking. Compared to the previous implementation, which folded the linked notes' embeddings into the vector search query, the new version adds weight to linked notes without polluting the vector search results.
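As a rough illustration of the merge step, here is a minimal sketch that combines the three signals using the weights listed above. The names `RelevanceSignals`, `relevanceScore`, and `rankNotes` are hypothetical, and the actual implementation in this PR may combine the signals differently.

```typescript
// Weights taken from the three steps listed in the description.
const WEIGHTS = { similarity: 0.5, outgoingLink: 0.25, backlink: 0.25 };

// Hypothetical per-note signals; `similarity` is absent when the note
// was not returned by the vector search.
interface RelevanceSignals {
  similarity?: number; // expected to be in [0, 1]
  hasOutgoingLink: boolean;
  hasBacklink: boolean;
}

// Weighted sum of the three signals.
function relevanceScore(s: RelevanceSignals): number {
  return (
    WEIGHTS.similarity * (s.similarity ?? 0) +
    (s.hasOutgoingLink ? WEIGHTS.outgoingLink : 0) +
    (s.hasBacklink ? WEIGHTS.backlink : 0)
  );
}

// Rank notes (keyed by path) by their merged score, highest first.
function rankNotes(
  signals: Record<string, RelevanceSignals>
): Array<{ path: string; score: number }> {
  return Object.keys(signals)
    .map((path) => ({ path, score: relevanceScore(signals[path]) }))
    .sort((a, b) => b.score - a.score);
}
```

With these weights, a linked note with moderate similarity can outrank an unlinked note with high similarity, which is the intended effect of the link signals.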

The UI change is experimental. We eventually want to integrate relevant notes into the chat UI.

Here are some examples of running the new algorithm in my vault and how the results compare to Smart Connections. My vault is indexed with text-embedding-3-large; Smart Connections used BGE-micro-v2.

Copilot results are better. They show more relevant notes (probably thanks to a better embedding model) and rank the backlinked note higher.

Copilot results include links and backlinks.

Relevant notes for a note written in Chinese work properly now. Backlinks and links also helped.

Everything below is from the original post, which is no longer relevant. Kept for documentation purposes.

Find embeddings of the current note
Find embeddings of the linked notes ([[linked note]])
Calculate a weighted average embedding of the found embeddings. The current note's embeddings are weighted more heavily than the linked notes' embeddings.
Execute a vector search based on the average embedding.
Pick the highest score if multiple chunks of a note are included in the results.
Filter out the matching chunks of the current note.
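
The original flow above can be sketched roughly as follows. This is an illustration only: it assumes one embedding vector per note, uses the 0.7 / 0.3 split mentioned in the review discussion as the default weight, and leaves out the vector search itself. The names `weightedAverageEmbedding`, `ChunkHit`, and `bestScorePerNote` are hypothetical.

```typescript
type Vector = number[];

// Weighted average of the current note's embedding and its linked notes'
// embeddings. The linked notes share the remaining weight equally.
function weightedAverageEmbedding(
  current: Vector,
  linked: Vector[],
  currentWeight = 0.7
): Vector {
  if (linked.length === 0) return current.slice();
  const linkedWeight = (1 - currentWeight) / linked.length;
  return current.map(
    (value, i) =>
      currentWeight * value +
      linked.reduce((sum, v) => sum + linkedWeight * v[i], 0)
  );
}

// One search hit per chunk; several chunks may belong to the same note.
interface ChunkHit {
  path: string;
  score: number;
}

// Keep the highest-scoring chunk per note and drop chunks of the current note.
function bestScorePerNote(hits: ChunkHit[], currentPath: string): ChunkHit[] {
  const best: Record<string, number> = {};
  for (const hit of hits) {
    if (hit.path === currentPath) continue;
    if (!(hit.path in best) || hit.score > best[hit.path]) {
      best[hit.path] = hit.score;
    }
  }
  return Object.keys(best)
    .map((path) => ({ path, score: best[path] }))
    .sort((a, b) => b.score - a.score);
}
```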

logancyang · 2024-12-26T09:42:15Z

src/search/findSimilarNotes.ts

+ * @param db - The Orama database.
+ * @returns The embeddings for the given note titles.
+ */
+async function getEmbeddings(noteTitles: string[], db: Orama<any>): Promise<number[][]> {


nit: getNoteEmbeddings may be a clearer name?

logancyang · 2024-12-26T09:44:54Z

src/search/findSimilarNotes.ts

+  const debug = getSettings().debug;
+  const embeddings: number[][] = [];
+  for (const noteTitle of noteTitles) {
+    const noteFile = await getNoteFileFromTitle(app.vault, noteTitle);


Not specifically for this PR, but right now we do not differentiate notes with the same title under different paths. That was on me. In the future we should use a unique identifier, i.e. the note path.

When a user types [[, the list should show the corresponding path next to each title (just like Obsidian itself does), and in the background only the path should be used as the unique ID.

logancyang · 2024-12-26T09:59:16Z

Thanks for the detailed descriptions! Generally LGTM!

Just to prompt some ideas:

What's the effect of changing the weights 0.7 and 0.3? Do you see any particular outcome that you may prefer and why?
Right now we are using outward links but not incoming ones. If we were to design for bidirectional links vs in vs out ones, what could be the best strategy?
Do you think tags can contribute as a strong signal like links? If so, we can consider using Jaccard similarity of tags as a signal. (tags is in the doc property but I'm not sure if it's correctly populated, we need to double-check.)
Explainability and user perception are important IMO. If we are using links and other things as our similarity signals, we may consider showing badges such as "linked" or "#<shared tags>" next to the result titles to let users know why a note was surfaced.
We may consider letting the user set the min similarity threshold
As for Chinese and other languages, I will add more plus embedding models: plus-medium, plus-large and plus-multimodal. They should have better performance for multilingual use.
- I'm particularly curious about the Chinese performance of Cohere's multilingual-lite vs Voyage AI's voyage-3 and voyage-3-large, if you have time you can compare them, if not don't worry!

Some future considerations: this feature should be conditional on whether indexing is enabled on mobile.

zeroliu · 2024-12-26T21:17:52Z

Thanks for the great feedback.

I started with 50/50. The content of the linked notes felt ranked too high for notes that link to other notes for reference. 70/30 felt more balanced, but I'm sure there is room for improvement. I wonder if there is a benchmark of some sort that we can use to optimize.
I found information about getting links and backlinks from the file cache instead of from string matches. It should help us target the right file and take backlinks into consideration. Let me try those APIs later today.
Can you help me understand how we can use hybrid search to help with tags? Will it call external APIs?
Will the plus model be more powerful than text-embedding-3-large? I wonder whether other factors affected the results. I can't imagine the local model used by Smart Connections is better.

logancyang · 2024-12-27T01:18:40Z

I initially suggested HybridRetriever with tags as salience terms but then changed it to tags in the doc property. The reasons:
1. Orama hybrid search does a fine job with text and vector weights, but it's still hard to debug if it's not giving the desired result.
2. Our own metric, like Jaccard similarity of tags, can be combined with other metrics like links and vector similarity in a more flexible way. It is easier to debug and tune.
The choice is yours; both are feasible IMO.
The current plus-small uses voyage-3-lite, which is mainly an English-focused model. The BGE models are specially tuned for Chinese, so it's understandable that they do well with Chinese. But I'm still surprised that the voyage model is giving almost random results; that is very suspicious. You can check the C-MTEB leaderboard: the voyage-3 series still ranks higher than BGE-micro-v2, if I'm not mistaken. So a deeper dive may be needed. I suggest trying some of the other higher-ranked C-MTEB models to see if you can at least get some good results for Chinese.

logancyang · 2024-12-27T07:12:33Z

Another thing just came to mind: although we will enable auto index in all modes, we should still have an option NEVER available since some users I know don't want to trigger auto indexing ever. In that case, should the note similarity UI show a "refresh" button if NEVER is selected?

zeroliu · 2024-12-27T08:03:42Z

src/main.ts

@@ -314,18 +314,25 @@ export default class CopilotPlugin extends Plugin {
    });

    this.addCommand({
-      id: "find-similar-notes",
-      name: "Find similar notes to active note",
+      id: "find-relevant-notes",


I think the term "relevant" is more accurate as we are adding information that is more than just similarity

zeroliu · 2024-12-27T08:08:57Z

package.json

@@ -57,7 +57,7 @@
    "prettier": "^3.3.3",
    "ts-jest": "^29.1.0",
    "tslib": "2.4.0",
-    "typescript": "4.7.4",
+    "typescript": "^5.7.2",


upgrade TypeScript to get the filter function type working correctly
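
The comment does not say which `filter` typing issue was hit. One common case, shown here purely as an assumption, is narrowing the element type after filtering out `null` values. The explicit type guard below works on any TypeScript version; TypeScript 5.5 and later can infer the same predicate for a simple callback such as `(x) => x !== null`. The names `isNotNull` and `maybePaths` are hypothetical.

```typescript
// Explicit type guard: tells the compiler that surviving elements are T.
function isNotNull<T>(value: T | null): value is T {
  return value !== null;
}

const maybePaths: Array<string | null> = ["a.md", null, "b.md"];

// With the guard, the result is typed string[] rather than (string | null)[].
const paths: string[] = maybePaths.filter(isNotNull);
```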

zeroliu · 2024-12-27T08:16:52Z

@logancyang I revamped the implementation and addressed some of your concerns:

Both outgoing links and backlinks are considered in the relevance score calculation.
Links and backlinks now come from the Obsidian file cache instead of title string matching. This works for notes that share the same display title.
Explainability is improved. Users can see the following metadata in the result: (Similarity: 80% | Backlink | link). This UI probably needs some tweaks when moving to the chat UI.
Chinese-language results are fixed. (I'm unsure whether the problem came from my previous implementation or from a mistake in my indexing.)
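
One way to read links and backlinks from the cache is through the map that Obsidian exposes as `app.metadataCache.resolvedLinks`, which maps each source path to its destination paths and link counts. The sketch below takes that map as a plain parameter so it can run outside Obsidian. The helper names are hypothetical, and this is not necessarily how the PR reads the cache.

```typescript
// Same shape as Obsidian's `app.metadataCache.resolvedLinks`:
// source path -> (destination path -> number of links).
type ResolvedLinks = Record<string, Record<string, number>>;

// Paths that the given note links to.
function getOutgoingLinks(resolvedLinks: ResolvedLinks, path: string): string[] {
  return Object.keys(resolvedLinks[path] ?? {});
}

// Paths of notes that link to the given note.
function getBacklinks(resolvedLinks: ResolvedLinks, path: string): string[] {
  return Object.keys(resolvedLinks).filter(
    (source) => source !== path && path in resolvedLinks[source]
  );
}
```

Because both helpers work on full paths, two notes with the same display title in different folders are treated as distinct notes.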

I haven't implemented the tag similarity score yet because I don't have a good solution for running Jaccard similarity efficiently. I hope this is something we can rely on Orama for. We can follow up on this if you have ideas. We can add similarity adjustment in settings in a follow-up PR, too. It's worth discussing whether we want to let the user adjust the vector search threshold, the merged score threshold, or maybe the weights for each category.

logancyang · 2024-12-27T21:33:30Z

src/search/dbOperations.ts

    if (!db) throw new Error("DB not initialized");
    if (!path) return;
    const result = await search(db, {
      term: path,
      properties: ["path"],
      exact: true,
+      includeVectors: true,


Did you figure out how to make exact work? Is it still returning partial matches?

No. I created a repro sandbox and shared it with this issue: oramasearch/orama#866

logancyang · 2024-12-27T22:56:29Z

This looks awesome! I can immediately tell this is gonna be so useful!

For the final UI, my two cents is that we should show only one "number" metric for each title. It seems showing the final "relevance score" is better because the user may ask "why a higher similarity score isn't ranked higher".
As for Jaccard similarity, there's no need to compute pairwise results for all pairs. We can just use the tags from the source note to search Orama on the "tags" field for docs with any shared tag, and compute Jaccard on this subset, it shouldn't be a large set. Let me know if you have any concerns.
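
A minimal sketch of the tag idea above: restrict the candidates to docs that share at least one tag with the source note, then compute Jaccard similarity only on that subset. An in-memory filter stands in for the Orama search on the "tags" field here, and the names `TaggedDoc`, `jaccard`, and `tagSimilarity` are hypothetical.

```typescript
// Jaccard similarity: |A ∩ B| / |A ∪ B|, defined as 0 when both sets are empty.
function jaccard(a: string[], b: string[]): number {
  const setA = new Set(a);
  const setB = new Set(b);
  if (setA.size === 0 && setB.size === 0) return 0;
  let intersection = 0;
  setA.forEach((tag) => {
    if (setB.has(tag)) intersection++;
  });
  return intersection / (setA.size + setB.size - intersection);
}

interface TaggedDoc {
  path: string;
  tags: string[];
}

// Score only the docs that share at least one tag with the source note.
function tagSimilarity(
  source: TaggedDoc,
  docs: TaggedDoc[]
): Array<{ path: string; score: number }> {
  const sourceTags = new Set(source.tags);
  return docs
    .filter((d) => d.path !== source.path && d.tags.some((t) => sourceTags.has(t)))
    .map((d) => ({ path: d.path, score: jaccard(source.tags, d.tags) }))
    .sort((a, b) => b.score - a.score);
}
```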

Questions:

The most critical question is that this feature relies heavily on the embedding model used. Is there something we could do in UIUX that informs/guides the user as to which embedding models we recommend?
Should bidirectional links have a score of 2x that of only a link or backlink? To me personally, it feels like a 1.25x or 1.5x importance because I usually see any link as bidirectional.
Can a weight of 0.5 for links potentially overshadow vector similarity? Perhaps when we have shared tags, links and shared tags can have a combined weight of 0.5?
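
To make the bidirectional-link question concrete, here is a small sketch in which a note linked in both directions gets a boost over a single link rather than double the score. The base value of 0.25 and the 1.5x boost are illustrative assumptions, and `linkScore` is a hypothetical name.

```typescript
interface LinkDirection {
  outgoing: boolean;
  backlink: boolean;
}

// A single link (either direction) contributes `base`; a bidirectional link
// contributes `base * bidirectionalBoost` instead of 2 * base.
function linkScore(
  link: LinkDirection,
  base = 0.25,
  bidirectionalBoost = 1.5
): number {
  if (link.outgoing && link.backlink) return base * bidirectionalBoost;
  if (link.outgoing || link.backlink) return base;
  return 0;
}
```

Keeping the boost as a parameter makes it easy to try 1.25x against 1.5x on a real vault.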

Some more explorations if you are interested: graph-based similarity metrics

logancyang · 2024-12-27T23:12:47Z

src/utils.ts

+/**
+ * @deprecated File display title can be duplicated, so we should use file path
+ * instead of title to find the note file.
+ */
 export async function getNoteFileFromTitle(vault: Vault, noteTitle: string): Promise<TFile | null> {


Yeah this should be replaced everywhere, thanks for marking it deprecated.

logancyang · 2024-12-27T23:13:34Z

src/utils.ts

@@ -36,6 +40,9 @@ export async function getNoteFileFromTitle(vault: Vault, noteTitle: string): Pro
  return null;
 }

+/**
+ * @deprecated Use app.vault.getAbstractFileByPath() instead.
+ */
 export const getNotesFromPath = async (vault: Vault, path: string): Promise<TFile[]> => {


getNotesFromPath gets an array of notes from a folder path, it's not the same as app.vault.getAbstractFileByPath().

I see. Removed the deprecated annotation. Maybe it's worth renaming the function to getNotesFromFolder and rewriting the manual matching logic with getAbstractFileByPath, which works with folders as well. I will leave that to a future PR.

logancyang · 2024-12-27T23:16:24Z

src/utils.ts

@@ -565,6 +575,7 @@ export async function safeFetch(url: string, options: RequestInit): Promise<Resp
    url: url,
    type: "basic",
    redirected: false,
+    bytes: () => Promise.resolve(new Uint8Array(0)),


What's this for? Why do I get "The bytes property isn't part of the standard Response interface"?

I got a type error here after upgrading TypeScript. Without this change, the build will fail.

Did you run npm install with the latest package.json in this PR and select the latest TypeScript version in the IDE?

Oh yes I didn't manually set it in the IDE. Now it's fine 👍

zeroliu · 2024-12-28T06:42:59Z

> For the final UI, my two cents is that we should show only one "number" metric for each title. It seems showing the final "relevance score" is better because the user may ask "why a higher similarity score isn't ranked higher".

I agree. I can see that when notes with a higher similarity score are ranked lower, it can cause confusion. A few thoughts about revealing one score:

How would we normalize the score? If the score is made up of similar content, links, and tags, a note that shares very similar content but no links or tags can get at most 0.5. It can be confusing to show people 0.5 or 50%.
When a note has links but is not indexed, would people understand what a 50% relevance score means and how it is calculated from links? (e.g. No index | backlink | link).

Maybe we can hide all numeric values and just show badges. For example:

Note Title 1 (Strong similarity) (Links) (Tags)
Note Title 2 (Moderate similarity)
Note Title 3 (Moderate similarity) (Links)

Users can see the similarity score by hovering over the similarity badge.
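
A rough sketch of how the badge-only display could be derived from the underlying signals. The 0.75 and 0.5 thresholds are illustrative assumptions rather than values from this PR, and the names `RelevantNote`, `similarityLabel`, `badgesFor`, and `renderLine` are hypothetical.

```typescript
interface RelevantNote {
  title: string;
  similarity?: number; // undefined when the note is not indexed
  hasLinks: boolean;
  hasSharedTags: boolean;
}

// Map a numeric similarity to a coarse label; low scores get no badge.
function similarityLabel(similarity: number): string | null {
  if (similarity >= 0.75) return "Strong similarity";
  if (similarity >= 0.5) return "Moderate similarity";
  return null;
}

// Collect the badges that explain why a note was surfaced.
function badgesFor(note: RelevantNote): string[] {
  const badges: string[] = [];
  const label =
    note.similarity === undefined ? null : similarityLabel(note.similarity);
  if (label) badges.push(label);
  if (note.hasLinks) badges.push("Links");
  if (note.hasSharedTags) badges.push("Tags");
  return badges;
}

// Render one result line in the format of the example above.
function renderLine(note: RelevantNote): string {
  return [note.title, ...badgesFor(note).map((b) => `(${b})`)].join(" ");
}
```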

> The most critical question is that this feature relies heavily on the embedding model used. Is there something we could do in UIUX that informs/guides the user as to which embedding models we recommend?

We can add a small ℹ️ icon in the corner so that the user can see the recommendation in a tooltip. Maybe we hide it if the model in use is good enough.

> Should bidirectional links have a score of 2x that of only a link or backlink? To me personally, it feels like a 1.25x or 1.5x importance because I usually see any link as bidirectional.

That's a fair point. Let me update that.

> Can a weight of 0.5 for links potentially overshadow vector similarity? Perhaps when we have shared tags, links and shared tags can have a combined weight of 0.5?

I can turn it up to 0.6. It will be challenging to have one formula that fits everyone's note-taking habits. I personally don't use tags as much and weight manually linked notes more. I can imagine people having different preferences. We probably need to open up the relevance adjustment in the settings eventually to give the choice back to users.

zeroliu · 2024-12-28T07:14:07Z

@logancyang 34208dc (#971)

This commit addressed the latest feedback. PTAL.

logancyang · 2024-12-28T18:39:02Z

I also prefer showing only badges and no numeric metric. The v0 version you have is spot on!

So to summarize, we may benefit from exposing some things to user settings in the future:

Min similarity threshold
Weights for similarity, links and tags
Max number of results

(As for max number of results, are we returning 20 from similarity search and including all from links at the moment? Effectively we don't have a max on the overall result, correct?)
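
The settings summarized above could look roughly like the sketch below. All names and default values here are placeholders for illustration; nothing in this PR defines them. The sketch also assumes that linked or tagged notes are kept even when their similarity falls below the threshold.

```typescript
// Hypothetical user-facing settings for relevant notes.
interface RelevantNotesSettings {
  minSimilarity: number;
  weights: { similarity: number; links: number; tags: number };
  maxResults: number;
}

// Placeholder defaults for illustration only.
const DEFAULT_SETTINGS: RelevantNotesSettings = {
  minSimilarity: 0.4,
  weights: { similarity: 0.5, links: 0.5, tags: 0 },
  maxResults: 20,
};

interface Candidate {
  path: string;
  similarity: number;
  linkScore: number;
  tagScore: number;
}

// Filter, score, sort, and cap the candidates according to the settings.
function applySettings(
  candidates: Candidate[],
  settings: RelevantNotesSettings = DEFAULT_SETTINGS
): Array<{ path: string; score: number }> {
  const { weights } = settings;
  return candidates
    .filter(
      (c) =>
        c.similarity >= settings.minSimilarity ||
        c.linkScore > 0 ||
        c.tagScore > 0
    )
    .map((c) => ({
      path: c.path,
      score:
        weights.similarity * c.similarity +
        weights.links * c.linkScore +
        weights.tags * c.tagScore,
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, settings.maxResults);
}
```

Applying `maxResults` after the merge gives a cap on the overall result list rather than only on the similarity search.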

We can defer things to future PRs, this one LGTM!

zeroliu force-pushed the zero/similar-notes-algo branch from 5ab5d95 to 5672023 Compare December 25, 2024 22:18

zeroliu marked this pull request as ready for review December 25, 2024 23:12

zeroliu requested a review from logancyang December 25, 2024 23:13

zeroliu force-pushed the zero/similar-notes-algo branch from 5672023 to 281bd69 Compare December 25, 2024 23:43

Add new similar note search algorithm

f1ddbf1

zeroliu force-pushed the zero/similar-notes-algo branch from 281bd69 to f1ddbf1 Compare December 25, 2024 23:52

logancyang reviewed Dec 26, 2024

zeroliu force-pushed the zero/similar-notes-algo branch 3 times, most recently from 5db1551 to cd86e2d Compare December 27, 2024 07:22

Change to new algorithm

8d15e1c

zeroliu force-pushed the zero/similar-notes-algo branch from cd86e2d to 8d15e1c Compare December 27, 2024 08:03

zeroliu commented Dec 27, 2024

zeroliu requested a review from logancyang December 27, 2024 08:20

logancyang reviewed Dec 27, 2024

logancyang reviewed Dec 27, 2024

Upgrade typescript

aae1a6a

zeroliu force-pushed the zero/similar-notes-algo branch from c8a628c to aae1a6a Compare December 28, 2024 07:02

Redo links score algorithm

34208dc

zeroliu requested a review from logancyang December 28, 2024 07:14

logancyang merged commit a462b82 into logancyang:master Dec 28, 2024
2 checks passed
