-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
embeddings: add support for prefixes in embeddings #4524
base: main
Are you sure you want to change the base?
Conversation
✅ Deploy Preview for continuedev canceled.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@eliransin thanks for the great work here! Everything written so far looks good to me. There's one thing we should add, which is support for the config.yaml format. This would be added in the config-yaml/src/schemas/models.ts
embedOptionsSchema
, and then should be passed through here: https://github.com/continuedev/continue/blob/c1699c975fc252d2ed14b8c4a9770d802852cd3b/core/config/yaml/models.ts
611c16b
to
cf83f62
Compare
Changes in this version:
|
@sestinj I have a bit of a noob qusetion/s, I noticed that |
Summary:
My question is: I am going to go ahead and submit one unified PR and will assume that if this is the wrong thing to do, checks will fail 🙂 Probably will only do it in a day or so, so maybe I'll get an answer by then... |
Some of the embedding models (e.g nomic-embed-text) are trained to embed for several purposes, this on the flip side, means that if not given the purpose for the embedding the retrieval results might and probably will be suboptimal. This change aims to deal with it by adding a configuration option for chunk and query prefixes with the option to support more embedding purposes in the future (i.e clustering). Some refferences to justify this change: 1. https://huggingface.co/nomic-ai/nomic-embed-text-v1.5#task-instruction-prefixes The nomic-embed-text prefixes section 2. https://www.youtube.com/watch?v=76EIC_RaDNw (Don’t Embed Wrong! by Matt Williams) Signed-off-by: Eliran Sinvani <eliransin@gmail.com>
After adding embedding prefix model support we also need to add support for it in config-yaml, this in turn require us to bump up the version of config-yaml. Signed-off-by: Eliran Sinvani <eliransin@gmail.com>
cf83f62
to
10e72a6
Compare
after adding zod schema definitions for the newly added embeddings prefix in config-yaml package. let's integrate it. Signed-off-by: Eliran Sinvani <eliransin@gmail.com>
10e72a6
to
51090ef
Compare
@sestinj I digged a bit in the previous commits, it seams like I need to first have the package in registry so probably it should occure in 2 separate stages. I will break this PR into two. |
@sestinj converted this to draft until the config-yaml package is published and proper adjutments are made as a result. |
Some of the embedding models (e.g nomic-embed-text) are trained to embed for several purposes, this on the flip side, means that if not given the purpose for the embedding the retrieval results might and probably will be suboptimal. This change aims to deal with it by adding a configuration option for chunk and query prefixes with the option to support more embedding purposes in the future (i.e clustering).
Some refferences to justify this change:
Description
[ What changed? Feel free to be brief. ]
Checklist
Screenshots
[ For visual changes, include screenshots. ]
Testing instructions
Nothing special just adding some prefixes and making sure that a different index is being created. Some better retrievals might be observed but I guess it depend on the size of the index (the smaller the index the less the effect I assume).