Cohere Embedders are not functioning correctly #283

Closed
bilgeyucel opened this issue Jan 29, 2024 · 2 comments
Labels: bug, integration:cohere

@bilgeyucel
Contributor

Describe the bug
Cohere embedders don't work as expected: for the same query, the documents retrieved with Cohere embeddings are unexpected compared to those retrieved with OpenAI embeddings. Detailed information is in this Discord thread.

To Reproduce
See the code snippet and the example docs in this Discord thread.
I could reproduce the problem with both InMemoryDocumentStore and ChromaDocumentStore.
Updated version for cohere-haystack 0.3.0 👇

from haystack.document_stores.types import DuplicatePolicy
from haystack.components.embedders.openai_document_embedder import OpenAIDocumentEmbedder
from haystack.components.embedders.openai_text_embedder import OpenAITextEmbedder
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.writers import DocumentWriter
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack import Pipeline
from haystack_integrations.components.embedders.cohere import CohereDocumentEmbedder, CohereTextEmbedder

import pickle

with open('docs.pkl', 'rb') as f:
    docs = pickle.load(f)
print(len(docs))

embedding = "Cohere"

if embedding == "OpenAI":
    doc_embedder = OpenAIDocumentEmbedder(model="text-embedding-3-small")
    query_embedder = OpenAITextEmbedder(model="text-embedding-3-small")
elif embedding == "Cohere":
    doc_embedder = CohereDocumentEmbedder(model="embed-multilingual-v3.0", input_type="search_document")
    query_embedder = CohereTextEmbedder(model="embed-multilingual-v3.0", input_type="search_query")

# Ingestion
doc_store = InMemoryDocumentStore(embedding_similarity_function="cosine")  # also tried the default `dot_product`; the scores were the same
doc_writer = DocumentWriter(document_store=doc_store, policy=DuplicatePolicy.SKIP)

ingestion_pipe = Pipeline()
ingestion_pipe.add_component("doc_embedder", doc_embedder)
ingestion_pipe.add_component("doc_writer", doc_writer)

ingestion_pipe.connect("doc_embedder.documents", "doc_writer.documents")
ingestion_pipe.run({"doc_embedder": {"documents": docs}})

# Retrieval
retriever = InMemoryEmbeddingRetriever(document_store=doc_store, top_k=5)

retrieval_pipe = Pipeline()
retrieval_pipe.add_component("query_embedder", query_embedder)
retrieval_pipe.add_component("retriever", retriever)
retrieval_pipe.connect("query_embedder.embedding", "retriever.query_embedding")

# Run
query = "What do participants say about their families?"
retrieved_chunks = retrieval_pipe.run({"query_embedder": {"text": query}})

for chunk in retrieved_chunks["retriever"]["documents"]:
    print(chunk.meta, round(chunk.score, 3), chunk.content)
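
The comment on the `InMemoryDocumentStore` line notes that `cosine` and `dot_product` gave the same scores. One possible explanation, which this thread does not confirm, is that the embeddings are unit length, in which case the two measures coincide. The sketch below shows only that mathematical fact with made-up three-dimensional vectors; it does not reproduce any score scaling the document store may apply, and the helper names are hypothetical.

```python
import math


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


# Made-up vectors for illustration only; they are not real embeddings.
raw_query = [0.2, 0.9, 0.4]
raw_doc = [0.7, 0.1, 0.6]

unit_query = normalize(raw_query)
unit_doc = normalize(raw_doc)

# For unit-length vectors the two measures agree up to floating-point error.
same_for_unit_vectors = math.isclose(dot(unit_query, unit_doc), cosine(unit_query, unit_doc))

# For vectors that are not unit length they generally differ.
differs_for_raw_vectors = not math.isclose(dot(raw_query, raw_doc), cosine(raw_query, raw_doc))

print(same_for_unit_vectors, differs_for_raw_vectors)
```

If that assumption holds, switching the similarity function would not be expected to change the ranking, so it would not help narrow down the bug.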

Describe your environment (please complete the following information):

  • OS: macOS Sonoma 14.2.1
  • Haystack version: beta-5
  • Integration version: cohere-haystack 0.3.0
@sjrl
Contributor

sjrl commented Jan 29, 2024

Hey @bilgeyucel, running the code you provided on PR #284, I get the following result:

{'filename': 'Interview Mateo', 'doc_id': 1, 'paragraph_id': 53, 'sentence_id': 137} 0.681 Interviewer: And what about your family?
{'filename': 'Interview Riley', 'doc_id': 2, 'paragraph_id': 95, 'sentence_id': 237} 0.681 Interviewer: And what about your family?
{'filename': 'Interview Thanh', 'doc_id': 4, 'paragraph_id': 191, 'sentence_id': 500} 0.595 Interviewer: How satisfied are you with the time you spend with your family?
{'filename': 'Interview Chris', 'doc_id': 0, 'paragraph_id': 22, 'sentence_id': 53} 0.576 Interviewer: And this applies to friends as well as to family?
{'filename': 'Interview Chris', 'doc_id': 0, 'paragraph_id': 20, 'sentence_id': 48} 0.545 Interviewer: And how satisfied are you with the amount of time you spend with your family or friends?

which I think looks much better than what was first posted on Discord. The previous result was:

{'filename': 'Interview Thanh', 'doc_id': 4, 'paragraph_id': 174, 'sentence_id': 461} 0.835 Do you have to care for another family member, or do you have some other jobs or volunteering you need to tend to?
{'filename': 'Interview Thanh', 'doc_id': 4, 'paragraph_id': 188, 'sentence_id': 493} 0.835 And then I kind of like also learned that I am in a different phase of life now, with more responsibilities and with a kid, my own things on the side and my job.
{'filename': 'Interview Riley', 'doc_id': 2, 'paragraph_id': 103, 'sentence_id': 265} 0.835 What would a typical working day look like for you?
{'filename': 'Interview Selim', 'doc_id': 3, 'paragraph_id': 116, 'sentence_id': 297} 0.835 That always takes a little time.
{'filename': 'Interview Chris', 'doc_id': 0, 'paragraph_id': 34, 'sentence_id': 85} 0.696 Let me put it this way.

Could you try running the code again with my PR to see if it also works for you?
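
One pattern in the previous result is that four unrelated sentences share the exact score 0.835, whereas the only tie in the new result (0.681) is between two identical sentences. A minimal sketch of a check for that pattern follows; `find_score_ties` is a hypothetical helper, not part of Haystack, and the data is copied from the two listings above.

```python
from collections import defaultdict


def find_score_ties(results, ndigits=3):
    """Return {score: [contents]} for scores shared by more than one distinct content.

    `results` is an iterable of (score, content) pairs. Identical contents with
    the same score are expected and are not reported.
    """
    groups = defaultdict(set)
    for score, content in results:
        groups[round(score, ndigits)].add(content)
    return {s: sorted(c) for s, c in groups.items() if len(c) > 1}


# Result reported before the PR (scores and sentences from the listing above).
before = [
    (0.835, "Do you have to care for another family member, or do you have some other jobs or volunteering you need to tend to?"),
    (0.835, "And then I kind of like also learned that I am in a different phase of life now, with more responsibilities and with a kid, my own things on the side and my job."),
    (0.835, "What would a typical working day look like for you?"),
    (0.835, "That always takes a little time."),
    (0.696, "Let me put it this way."),
]

# Result reported with the PR applied.
after = [
    (0.681, "Interviewer: And what about your family?"),
    (0.681, "Interviewer: And what about your family?"),
    (0.595, "Interviewer: How satisfied are you with the time you spend with your family?"),
    (0.576, "Interviewer: And this applies to friends as well as to family?"),
    (0.545, "Interviewer: And how satisfied are you with the amount of time you spend with your family or friends?"),
]

ties_before = find_score_ties(before)
ties_after = find_score_ties(after)

print(len(ties_before), len(ties_after))
```

With the retrieval pipeline from the issue, the pairs could be built as `(chunk.score, chunk.content)` for each retrieved chunk. A non-empty result does not prove a bug, but it is a cheap signal that distinct texts may have received the same embedding.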

@ZanSara
Contributor

ZanSara commented Feb 6, 2024

Fixed by #284

ZanSara closed this as completed Feb 6, 2024