Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

embeddings: add support for prefixes in embeddings #4524

Draft
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

eliransin
Copy link

Some of the embedding models (e.g nomic-embed-text) are trained to embed for several purposes, this on the flip side, means that if not given the purpose for the embedding the retrieval results might and probably will be suboptimal. This change aims to deal with it by adding a configuration option for chunk and query prefixes with the option to support more embedding purposes in the future (i.e clustering).

Some refferences to justify this change:

  1. https://huggingface.co/nomic-ai/nomic-embed-text-v1.5#task-instruction-prefixes The nomic-embed-text prefixes section
  2. https://www.youtube.com/watch?v=76EIC_RaDNw (Don’t Embed Wrong! by Matt Williams)

Description

[ What changed? Feel free to be brief. ]

Checklist

  • The relevant docs, if any, have been updated or created
  • The relevant tests, if any, have been updated or created

Screenshots

[ For visual changes, include screenshots. ]

Testing instructions

Nothing special just adding some prefixes and making sure that a different index is being created. Some better retrievals might be observed but I guess it depend on the size of the index (the smaller the index the less the effect I assume).

Copy link

netlify bot commented Mar 7, 2025

Deploy Preview for continuedev canceled.

Name Link
🔨 Latest commit 51090ef
🔍 Latest deploy log https://app.netlify.com/sites/continuedev/deploys/67de531fdd868e0008e9dee1

Copy link
Contributor

@sestinj sestinj left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@eliransin thanks for the great work here! Everything written so far looks good to me. There's one thing we should add, which is support for the config.yaml format. This would be added in the config-yaml/src/schemas/models.ts embedOptionsSchema, and then should be passed through here: https://github.com/continuedev/continue/blob/c1699c975fc252d2ed14b8c4a9770d802852cd3b/core/config/yaml/models.ts

@eliransin eliransin requested a review from a team as a code owner March 18, 2025 04:49
@eliransin eliransin requested review from RomneyDa and removed request for a team March 18, 2025 04:49
@eliransin
Copy link
Author

Changes in this version:

  1. rebased on main and resolved conflicts
  2. added zod definitions as instructed in cr comments
    🙂

@eliransin eliransin requested a review from sestinj March 19, 2025 19:22
@eliransin
Copy link
Author

eliransin commented Mar 20, 2025

@sestinj I have a bit of a noob qusetion/s, I noticed that config-yaml/src/schemas/models.ts embedOptionsSchema sits in its own package.
should I bump the version of this package (and probably also bump up the coresponding entry in the core package.json) ? If I do, is the dependency installed from local source or should I first open a PR to update this package and have the new version uploaded and registered in package registy and only then do a second PR that uses this package in core?
Tx

@eliransin
Copy link
Author

Summary:
Figured out I should do the following, but unsure if I need to do it in two separate steps:

  1. Make changes to packages/config-yaml including bumping up the version
  2. Update the dependencies of core to the new @continuedev/config-yml version
  3. Add the extra assignment from the newly added option in core/config/yaml/models.ts
  4. All of the current already made changes

My question is:
Should 1 be in a separate PR?

I am going to go ahead and submit one unified PR and will assume that if this is the wrong thing to do, checks will fail 🙂

Probably will only do it in a day or so, so maybe I'll get an answer by then...

Some of the embedding models (e.g  nomic-embed-text) are trained to embed for
several purposes, this on the flip side, means that if not given the purpose
for the embedding the retrieval results might and probably will be suboptimal.
This change aims to deal with it by adding a configuration option for chunk and
query prefixes with the option to support more embedding purposes in the future
(i.e clustering).

Some refferences to justify this change:
1. https://huggingface.co/nomic-ai/nomic-embed-text-v1.5#task-instruction-prefixes
   The nomic-embed-text prefixes section
2. https://www.youtube.com/watch?v=76EIC_RaDNw (Don’t Embed Wrong! by Matt
   Williams)

Signed-off-by: Eliran Sinvani <eliransin@gmail.com>
After adding embedding prefix model support we also need to add support for it
in config-yaml, this in turn require us to bump up the version of config-yaml.

Signed-off-by: Eliran Sinvani <eliransin@gmail.com>
after adding zod schema definitions for the newly added embeddings
prefix in config-yaml package. let's integrate it.

Signed-off-by: Eliran Sinvani <eliransin@gmail.com>
@eliransin
Copy link
Author

@sestinj I digged a bit in the previous commits, it seams like I need to first have the package in registry so probably it should occure in 2 separate stages. I will break this PR into two.

@eliransin eliransin marked this pull request as draft March 22, 2025 06:58
@eliransin
Copy link
Author

@sestinj converted this to draft until the config-yaml package is published and proper adjutments are made as a result.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants