Builtin Cohere Embed processor #2167
Thank you @grvsahil! The main thing I'd like to see changed is dropping the README and relying on the automatically generated specification (by running make generate). The other comments are small details. Let me know if they make sense.
@@ -20,6 +20,19 @@ Provides Cohere processors for command, embed and rerank models.
| `backoffRetry.min` | The minimum waiting time before retrying. | false | `100ms` |
| `backoffRetry.max` | The maximum waiting time before retrying. | false | `5s` |

### Embed Processor Configuration
I'll leave a similar comment in https://github.com/ConduitIO/conduit/pull/2165/files, but I believe a README is neither needed nor encouraged. We rely on specgen to generate the parameters automatically; those are then used to generate the .json spec, which is exposed on the Conduit site.
Done.
@@ -57,6 +57,7 @@ var DefaultBuiltinProcessors = map[string]ProcessorPluginConstructor{
"unwrap.opencdc": unwrap.NewOpenCDCProcessor,
"webhook.http": webhook.NewHTTPProcessor,
"cohere.command": cohere.NewCommandProcessor,
"cohere.embed": cohere.NewEmbedProcessor,
Similar to https://github.com/ConduitIO/conduit/pull/2165/files, it'd be nice to add this entry alphabetically to respect the existing order, thank you!
Done.
Looks good. Just left some questions. Happy to review everything on a single branch ([feat/cohere-processor](https://github.com/ConduitIO/conduit/tree/feat/cohere-processor)).
APIKey string `json:"apiKey" validate:"required"`
// Specifies the type of input passed to the model. Required for embed models v3 and higher.
// Allowed values: search_document, search_query, classification, clustering, image.
InputType string `json:"inputType"`
Per the code comment above, this is required for embed models v3 and higher. Maybe we should add a validation that returns an error if this value is not provided in those situations?
Or does this not refer to Model, but rather to the case where we use a different client (other than V2EmbedRequest)?
Sure, I'll add a validation for InputType if the model is v3 or higher.
It does indeed refer to Model.
config *embedProcConfig
}

func (e *embedClient) Embed(ctx context.Context, texts []string) ([][]float64, error) {
I'll look in the other PR, but does this need to be exported?
No, I'll make it unexported.
return nil, err
}

return resp.GetEmbeddings().GetFloat(), nil
Looks like this could return nil, but we're not checking for this possibility in processBatch. Should we return an error instead?
embeddings can only be nil if p.client.embed() fails, but in that case the function enters the if err != nil block, retries if possible, or returns with an error if unrecoverable. Since we only access embeddings after a successful response, I don't think a check is necessary.
* feat: cohere processor
* fix: missing apikey
* fix: linters
* fix: readme lints
* fix: small fix
* fix: refactored separate processors
* fix: refactor local scope
* feat: handling chat response
* fix: configurable request
* fix: readme config updated
* fix: small change
* feat: refactored and examples
* fix: linters
* fix: examples test for command model and generate json spec
* remove mock processor impl
* Builtin Cohere Embed processor (#2167)
  * feat: embed processor
  * update mod file
  * resolved PR comments, added batching support & test cases
  * fix: header in embed_examples_test.go
  * fix: update examples test and generate json spec
  * resolved PR comments
  ---------
  Co-authored-by: Gaurav Sahil <[email protected]>
---------
Co-authored-by: Gaurav Sahil <[email protected]>
Co-authored-by: Gaurav Kumar Sahil <[email protected]>
Description
This includes Cohere's embed processor.
Quick checks