Builtin Cohere Embed processor #2167

Merged
merged 7 commits into from
Mar 6, 2025

grvsahil
Contributor

Description

This PR adds a builtin processor for Cohere's embed models.

Quick checks

  • I have followed the Code Guidelines.
  • There is no other pull request for the same update/change.
  • I have written unit tests.
  • I have made sure that the PR is of reasonable size and can be easily reviewed.

@grvsahil grvsahil linked an issue Feb 25, 2025 that may be closed by this pull request
Member

@raulb raulb left a comment

Thank you @grvsahil! The main thing I'd like to see changed is dropping the README and relying on the automatically generated specification (running make generate). The other comments are small details. Let me know if they make sense.

@@ -20,6 +20,19 @@ Provides Cohere processors for command, embed and rerank models.
| `backoffRetry.min` | The minimum waiting time before retrying. | false | `100ms` |
| `backoffRetry.max` | The maximum waiting time before retrying. | false | `5s` |

### Embed Processor Configuration
Member

I'll leave a similar comment in https://github.com/ConduitIO/conduit/pull/2165/files, but I believe a README is neither needed nor encouraged. We rely on specgen to generate the parameters automatically and produce the .json spec, which is then exposed on the Conduit site.

Contributor Author

@grvsahil grvsahil Feb 28, 2025

Done.

@@ -57,6 +57,7 @@ var DefaultBuiltinProcessors = map[string]ProcessorPluginConstructor{
"unwrap.opencdc": unwrap.NewOpenCDCProcessor,
"webhook.http": webhook.NewHTTPProcessor,
"cohere.command": cohere.NewCommandProcessor,
"cohere.embed": cohere.NewEmbedProcessor,
Member

Similar to https://github.com/ConduitIO/conduit/pull/2165/files, it'd be nice to add this entry in its alphabetical position to respect the existing order, thank you!

Contributor Author

@grvsahil grvsahil Feb 28, 2025

Done.

@raulb raulb mentioned this pull request Feb 26, 2025
4 tasks
@grvsahil grvsahil mentioned this pull request Feb 27, 2025
4 tasks
Member

@raulb raulb left a comment

Looks good. Just left some questions. Happy to review everything on a single branch ([feat/cohere-processor](https://github.com/ConduitIO/conduit/tree/feat/cohere-processor))

APIKey string `json:"apiKey" validate:"required"`
// Specifies the type of input passed to the model. Required for embed models v3 and higher.
// Allowed values: search_document, search_query, classification, clustering, image.
InputType string `json:"inputType"`
Member

Per the code comment above, this is required for embed models v3 and higher. Maybe we should add validation that returns an error if this value is not provided in those situations?

Or does this refer not to Model, but to using a different client (something other than V2EmbedRequest)?

Contributor Author

Sure, I'll add validation for InputType when the model is v3 or higher.
It does refer to Model.

config *embedProcConfig
}

func (e *embedClient) Embed(ctx context.Context, texts []string) ([][]float64, error) {
Member

I'll look in the other PR, but does this need to be exported?

Contributor Author

No, I'll make it unexported.

return nil, err
}

return resp.GetEmbeddings().GetFloat(), nil
Member

Looks like this could return nil, but we're not checking for that possibility in processBatch. Should we return an error instead?

Contributor Author

@grvsahil grvsahil Mar 6, 2025

embeddings can only be nil if p.client.embed() fails, but in that case the function enters the if err != nil block and either retries, if possible, or returns an error if the failure is unrecoverable. Since we only access embeddings after a successful response, I don't think the extra check is necessary.

@grvsahil grvsahil merged commit 6ec08bb into feat/cohere-processor Mar 6, 2025
3 checks passed
@grvsahil grvsahil deleted the feat/cohere-embed-processor branch March 6, 2025 15:58
parikshitg added a commit that referenced this pull request Mar 7, 2025
* feat: cohere processor

* fix: missing apikey

* fix: linters

* fix: readme lints

* fix: small fix

* fix: refactored separate processors

* fix: refactor local scope

* feat: handling chat response

* fix: configurable request

* fix: readme config updated

* fix: small change

* feat: refactored and examples

* fix: linters

* fix: examples test for command model and generate json spec

* remove mock processor impl

* Builtin Cohere Embed processor (#2167)

* feat: embed processor

* update mod file

* resolved PR comments, added batching support & test cases

* fix: header in embed_examples_test.go

* fix: update examples test and generate json spec

* resolved PR comments

---------

Co-authored-by: Gaurav Sahil <[email protected]>

---------

Co-authored-by: Gaurav Sahil <[email protected]>
Co-authored-by: Gaurav Kumar Sahil <[email protected]>
Development

Successfully merging this pull request may close these issues.

Cohere Embed Processor
2 participants