Cohere Embedders are not functioning correctly #283

bilgeyucel · 2024-01-29T12:22:35Z

Describe the bug
Cohere embedders don't work as expected. The documents retrieved with Cohere embeddings are unexpected compared with those retrieved with OpenAI embeddings. You can find the detailed information in this Discord thread.

To Reproduce
See the code snippet and the example docs in this Discord thread.
I could reproduce it with both InMemoryDocumentStore and ChromaDocumentStore.
Updated version for cohere-haystack 0.3.0 👇

from haystack.document_stores.types import DuplicatePolicy
from haystack.components.embedders.openai_document_embedder import OpenAIDocumentEmbedder
from haystack.components.embedders.openai_text_embedder import OpenAITextEmbedder
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.writers import DocumentWriter
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack import Pipeline
from haystack_integrations.components.embedders.cohere import CohereDocumentEmbedder, CohereTextEmbedder

import pickle

with open('docs.pkl', 'rb') as f:
    docs = pickle.load(f)
print(len(docs))

embedding = "Cohere"

if embedding == "OpenAI":
    doc_embedder = OpenAIDocumentEmbedder(model="text-embedding-3-small")
    query_embedder = OpenAITextEmbedder(model="text-embedding-3-small")
elif embedding == "Cohere":
    doc_embedder = CohereDocumentEmbedder(model="embed-multilingual-v3.0", input_type="search_document")
    query_embedder = CohereTextEmbedder(model="embed-multilingual-v3.0", input_type="search_query")

# Ingestion
doc_store = InMemoryDocumentStore(embedding_similarity_function="cosine")  # the default `dot_product` gave the same scores
doc_writer = DocumentWriter(document_store=doc_store, policy=DuplicatePolicy.SKIP)

ingestion_pipe = Pipeline()
ingestion_pipe.add_component("doc_embedder", doc_embedder)
ingestion_pipe.add_component("doc_writer", doc_writer)

ingestion_pipe.connect("doc_embedder.documents", "doc_writer.documents")
ingestion_pipe.run({"doc_embedder": {"documents": docs}})

# Retrieval
retriever = InMemoryEmbeddingRetriever(document_store=doc_store, top_k=5)

retrieval_pipe = Pipeline()
retrieval_pipe.add_component("query_embedder", query_embedder)
retrieval_pipe.add_component("retriever", retriever)
retrieval_pipe.connect("query_embedder.embedding", "retriever.query_embedding")

# Run
query = "What do participants say about their families?"
retrieved_chunks = retrieval_pipe.run({"query_embedder": {"text": query}})

for chunk in retrieved_chunks["retriever"]["documents"]:
    print(chunk.meta, round(chunk.score, 3), chunk.content)
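
The comment in the snippet above notes that `cosine` and `dot_product` produced the same scores. One possible explanation is that the embeddings have unit length, in which case the two measures coincide. The sketch below illustrates this in plain Python with made-up vectors; whether the Cohere embeddings in this report are actually normalized is an assumption, not something confirmed in this issue.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


# Made-up vectors, for illustration only (not real embeddings)
query = normalize([3.0, 4.0])
doc = normalize([1.0, 2.0])

# For unit-length vectors both norms are 1, so the two scores agree
print(math.isclose(dot(query, doc), cosine(query, doc)))
```

If the two similarity functions agree on real data as well, switching between them would not be expected to change the ranking.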

Describe your environment (please complete the following information):

OS: macOS Sonoma 14.2.1
Haystack version: beta-5
Integration version: cohere-haystack 0.3.0


sjrl · 2024-01-29T14:11:52Z

Hey @bilgeyucel, running the code you provided with PR #284, I get the following result:

{'filename': 'Interview Mateo', 'doc_id': 1, 'paragraph_id': 53, 'sentence_id': 137} 0.681 Interviewer: And what about your family?
{'filename': 'Interview Riley', 'doc_id': 2, 'paragraph_id': 95, 'sentence_id': 237} 0.681 Interviewer: And what about your family?
{'filename': 'Interview Thanh', 'doc_id': 4, 'paragraph_id': 191, 'sentence_id': 500} 0.595 Interviewer: How satisfied are you with the time you spend with your family?
{'filename': 'Interview Chris', 'doc_id': 0, 'paragraph_id': 22, 'sentence_id': 53} 0.576 Interviewer: And this applies to friends as well as to family?
{'filename': 'Interview Chris', 'doc_id': 0, 'paragraph_id': 20, 'sentence_id': 48} 0.545 Interviewer: And how satisfied are you with the amount of time you spend with your family or friends?

I think this looks much better than what was first posted on Discord. The previous result was:

{'filename': 'Interview Thanh', 'doc_id': 4, 'paragraph_id': 174, 'sentence_id': 461} 0.835 Do you have to care for another family member, or do you have some other jobs or volunteering you need to tend to?
{'filename': 'Interview Thanh', 'doc_id': 4, 'paragraph_id': 188, 'sentence_id': 493} 0.835 And then I kind of like also learned that I am in a different phase of life now, with more responsibilities and with a kid, my own things on the side and my job.
{'filename': 'Interview Riley', 'doc_id': 2, 'paragraph_id': 103, 'sentence_id': 265} 0.835 What would a typical working day look like for you?
{'filename': 'Interview Selim', 'doc_id': 3, 'paragraph_id': 116, 'sentence_id': 297} 0.835 That always takes a little time.
{'filename': 'Interview Chris', 'doc_id': 0, 'paragraph_id': 34, 'sentence_id': 85} 0.696 Let me put it this way.

Could you try running the code again with my PR to see if it also works for you?
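
The identical 0.835 score on four unrelated sentences in the previous result would be consistent with several documents sharing one embedding, though that is an assumption rather than something established in this thread. A minimal sketch of a check for that pattern follows. `find_shared_embeddings` is a hypothetical helper, and the sample objects are stand-ins with made-up embeddings; it only assumes each document exposes `content` and `embedding` attributes.

```python
from collections import defaultdict
from types import SimpleNamespace


def find_shared_embeddings(documents, ndigits=6):
    """Return groups of distinct texts whose embeddings are identical after rounding."""
    groups = defaultdict(set)
    for doc in documents:
        if doc.embedding is None:
            continue  # skip documents that were never embedded
        key = tuple(round(x, ndigits) for x in doc.embedding)
        groups[key].add(doc.content)
    # Keep only embeddings shared by more than one distinct text
    return [sorted(texts) for texts in groups.values() if len(texts) > 1]


# Stand-in documents with made-up embeddings, for illustration only
sample = [
    SimpleNamespace(content="That always takes a little time.", embedding=[0.1, 0.2]),
    SimpleNamespace(content="Let me put it this way.", embedding=[0.3, 0.4]),
    SimpleNamespace(content="What would a typical working day look like for you?", embedding=[0.1, 0.2]),
]

shared = find_shared_embeddings(sample)
print(shared)
```

With the reproduction code above, the same helper could be applied to the stored documents, for example `find_shared_embeddings(doc_store.filter_documents())`, to see whether different texts ended up with the same vector.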

ZanSara · 2024-02-06T10:16:01Z

Fixed by #284

bilgeyucel added the bug and integration:cohere labels Jan 29, 2024

sjrl mentioned this issue Jan 29, 2024

fix: Cohere inconsistent embeddings and documents lengths #284

Merged

masci assigned ZanSara Feb 5, 2024

ZanSara closed this as completed Feb 6, 2024
