Describe the bug
Cohere embedders don't work as expected. With the same documents and query, the retrieved documents differ unexpectedly from those retrieved with OpenAI embeddings. You can find the detailed information in this Discord thread.
To Reproduce
See the code snippet and the example docs in this discord thread
I could reproduce this with both InMemoryDocumentStore and ChromaDocumentStore.
Updated version for cohere-haystack 0.3.0 👇
from haystack.document_stores.types import DuplicatePolicy
from haystack.components.embedders.openai_document_embedder import OpenAIDocumentEmbedder
from haystack.components.embedders.openai_text_embedder import OpenAITextEmbedder
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack.components.writers import DocumentWriter
from haystack.components.retrievers.in_memory import InMemoryEmbeddingRetriever
from haystack import Pipeline
from haystack_integrations.components.embedders.cohere import CohereDocumentEmbedder, CohereTextEmbedder
import pickle

with open('docs.pkl', 'rb') as f:
    docs = pickle.load(f)
print(len(docs))

embedding = "Cohere"
if embedding == "OpenAI":
    doc_embedder = OpenAIDocumentEmbedder(model="text-embedding-3-small")
    query_embedder = OpenAITextEmbedder(model="text-embedding-3-small")
elif embedding == "Cohere":
    doc_embedder = CohereDocumentEmbedder(model="embed-multilingual-v3.0", input_type="search_document")
    query_embedder = CohereTextEmbedder(model="embed-multilingual-v3.0", input_type="search_query")

# Ingestion
doc_store = InMemoryDocumentStore(embedding_similarity_function="cosine")  # tried with the default `dot_product` and the scores were the same
doc_writer = DocumentWriter(document_store=doc_store, policy=DuplicatePolicy.SKIP)
ingestion_pipe = Pipeline()
ingestion_pipe.add_component("doc_embedder", doc_embedder)
ingestion_pipe.add_component("doc_writer", doc_writer)
ingestion_pipe.connect("doc_embedder.documents", "doc_writer.documents")
ingestion_pipe.run({"doc_embedder": {"documents": docs}})

# Retrieval
retriever = InMemoryEmbeddingRetriever(document_store=doc_store, top_k=5)
retrieval_pipe = Pipeline()
retrieval_pipe.add_component("query_embedder", query_embedder)
retrieval_pipe.add_component("retriever", retriever)
retrieval_pipe.connect("query_embedder.embedding", "retriever.query_embedding")

# Run
query = "What do participants say about their families?"
retrieved_chunks = retrieval_pipe.run({"query_embedder": {"text": query}})
for chunk in retrieved_chunks["retriever"]["documents"]:
    print(chunk.meta, round(chunk.score, 3), chunk.content)
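A note on the comment in the snippet that `cosine` and `dot_product` gave the same scores: that is what one would expect if the embeddings are unit-length, because cosine similarity is the dot product divided by the two norms. I have not verified that the embeddings here are normalized, so treat that as an assumption. The sketch below uses toy vectors (not real model output) and hypothetical helper names to show the effect:

```python
import math

def dot(a, b):
    # Plain dot product of two equal-length vectors.
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    # Dot product divided by the product of the two vector norms.
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def normalize(v):
    # Scale a vector to unit length.
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

# Toy vectors standing in for embeddings (assumption: not real model output).
raw_query = [0.2, 0.7, 0.1]
raw_doc = [0.9, 0.1, 0.3]

# On raw vectors the two scores generally differ.
print(dot(raw_query, raw_doc), cosine(raw_query, raw_doc))

# On unit-length vectors the two scores coincide.
query = normalize(raw_query)
doc = normalize(raw_doc)
print(math.isclose(dot(query, doc), cosine(query, doc)))
```

If that assumption holds, switching the similarity function would not change the ranking, so it would not help narrow down the problem.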
Describe your environment (please complete the following information):
OS: macOS Sonoma 14.2.1
Haystack version: beta-5
Integration version: cohere-haystack 0.3.0
Hey @bilgeyucel, running the code you provided on PR #284, I get the following result:
{'filename': 'Interview Mateo', 'doc_id': 1, 'paragraph_id': 53, 'sentence_id': 137} 0.681 Interviewer: And what about your family?
{'filename': 'Interview Riley', 'doc_id': 2, 'paragraph_id': 95, 'sentence_id': 237} 0.681 Interviewer: And what about your family?
{'filename': 'Interview Thanh', 'doc_id': 4, 'paragraph_id': 191, 'sentence_id': 500} 0.595 Interviewer: How satisfied are you with the time you spend with your family?
{'filename': 'Interview Chris', 'doc_id': 0, 'paragraph_id': 22, 'sentence_id': 53} 0.576 Interviewer: And this applies to friends as well as to family?
{'filename': 'Interview Chris', 'doc_id': 0, 'paragraph_id': 20, 'sentence_id': 48} 0.545 Interviewer: And how satisfied are you with the amount of time you spend with your family or friends?
I think this looks much better than what was first posted in Discord. The previous result was:
{'filename': 'Interview Thanh', 'doc_id': 4, 'paragraph_id': 174, 'sentence_id': 461} 0.835 Do you have to care for another family member, or do you have some other jobs or volunteering you need to tend to?
{'filename': 'Interview Thanh', 'doc_id': 4, 'paragraph_id': 188, 'sentence_id': 493} 0.835 And then I kind of like also learned that I am in a different phase of life now, with more responsibilities and with a kid, my own things on the side and my job.
{'filename': 'Interview Riley', 'doc_id': 2, 'paragraph_id': 103, 'sentence_id': 265} 0.835 What would a typical working day look like for you?
{'filename': 'Interview Selim', 'doc_id': 3, 'paragraph_id': 116, 'sentence_id': 297} 0.835 That always takes a little time.
{'filename': 'Interview Chris', 'doc_id': 0, 'paragraph_id': 34, 'sentence_id': 85} 0.696 Let me put it this way.
Could you try running the code again with my PR to see if it also works for you?
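In the previous result, four unrelated sentences share the exact same score (0.835). One possible way to check for that pattern is to count how many stored document embeddings are identical to one another. This is only a sketch: `count_duplicate_embeddings` is a hypothetical helper, not part of Haystack, and the sample vectors are made up.

```python
from collections import Counter

def count_duplicate_embeddings(embeddings, ndigits=6):
    """Count vectors that are identical (after rounding) to at least one other vector."""
    # Round each component so tiny floating-point differences do not hide duplicates.
    keys = Counter(tuple(round(x, ndigits) for x in emb) for emb in embeddings)
    return sum(n for n in keys.values() if n > 1)

# Made-up vectors: the first two are identical, the third is distinct.
sample = [
    [0.1, 0.2, 0.3],
    [0.1, 0.2, 0.3],
    [0.5, 0.1, 0.0],
]
print(count_duplicate_embeddings(sample))
```

With the snippet from the issue, the vectors could presumably be collected with something like `[d.embedding for d in doc_store.filter_documents()]`; I have not run that here. A count well above zero for documents with different content would point at the embedding step rather than the retriever.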