Mongo Dense Retriever - Unrecognized pipeline stage name: '$vectorSearch' #583

Closed
1 task done
tillwf opened this issue Mar 15, 2024 · 1 comment
tillwf commented Mar 15, 2024

Describe the bug
Same error as in deepset-ai/haystack#7031, but with haystack-ai==2.0.0.

Error message

haystack.document_stores.errors.errors.DocumentStoreError: Retrieval of documents from MongoDB Atlas failed: Unrecognized pipeline stage name: '$vectorSearch', full error: {'ok': 0.0, 'errmsg': "Unrecognized pipeline stage name: '$vectorSearch'", 'code': 40324, 'codeName': 'Location40324', '$clusterTime': {'clusterTime': Timestamp(1710497081, 11), 'signature': {'hash': b'\xd7"bC\xf4\xd0\xb8\r\xc8\xe56b/xn\xda\'\x98B\x7f', 'keyId': 7294832502911270929}}, 'operationTime': Timestamp(1710497081, 11)}
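
For context, the stage named in the error is part of the aggregation pipeline that the server receives, and the server is what rejects it (code 40324). The sketch below shows roughly what such a pipeline looks like. It is an illustration only: the helper name build_vector_search_pipeline, the numCandidates factor, and the projected fields are assumptions, not the integration's actual code.

```python
from typing import Any


def build_vector_search_pipeline(
    query_embedding: list[float],
    index_name: str,
    top_k: int = 10,
    path: str = "embedding",
) -> list[dict[str, Any]]:
    """Build an aggregation pipeline whose first stage is $vectorSearch.

    Hypothetical helper for illustration; not taken from the Haystack integration.
    """
    if not query_embedding:
        raise ValueError("query_embedding must not be empty")
    return [
        {
            "$vectorSearch": {
                "index": index_name,          # name of the Atlas vector search index
                "path": path,                 # document field that stores the vector
                "queryVector": query_embedding,
                "numCandidates": top_k * 10,  # assumed oversampling factor
                "limit": top_k,
            }
        },
        # Expose the similarity score next to the document content.
        {"$project": {"content": 1, "score": {"$meta": "vectorSearchScore"}}},
    ]


pipeline = build_vector_search_pipeline([0.1] * 384, "embedding_index", top_k=3)
```

A server that does not know the $vectorSearch stage fails on the first element of this list, regardless of how the index was defined.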

To Reproduce
Here is a minimal script to reproduce:

import os

from haystack import Pipeline
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.embedders import SentenceTransformersTextEmbedder
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.rankers import TransformersSimilarityRanker
from haystack.utils import ComponentDevice
from haystack.utils import Secret
from haystack_integrations.components.retrievers.mongodb_atlas import MongoDBAtlasEmbeddingRetriever
from haystack_integrations.document_stores.mongodb_atlas import MongoDBAtlasDocumentStore

document_store = MongoDBAtlasDocumentStore(
    mongo_connection_string=Secret.from_env_var("MONGO_CONNECTION_STRING"),
    database_name=os.getenv("MONGO_DB"),
    collection_name="articles_embeddings",
    vector_search_index="embedding_index",
)
document_cleaner = DocumentCleaner(
    remove_empty_lines=True,
    remove_extra_whitespaces=True,
    remove_repeated_substrings=False,
)
document_splitter = DocumentSplitter(
    split_by="word",
    split_length=512,
    split_overlap=32
)
document_embedder = SentenceTransformersDocumentEmbedder(
    model="BAAI/bge-small-en-v1.5",
    device=ComponentDevice.from_str("cuda:0")
)
text_embedder = SentenceTransformersTextEmbedder(
    model="BAAI/bge-small-en-v1.5",
    device=ComponentDevice.from_str("cuda:0")
)

embedding_retriever = MongoDBAtlasEmbeddingRetriever(document_store=document_store)
ranker = TransformersSimilarityRanker(model="BAAI/bge-reranker-base")

pipeline = Pipeline()
pipeline.add_component("text_embedder", text_embedder)
pipeline.add_component("embedding_retriever", embedding_retriever)
pipeline.add_component("ranker", ranker)

pipeline.connect("text_embedder", "embedding_retriever")
pipeline.connect("embedding_retriever", "ranker")

# First search to warm the model
pipeline.run(
    {
        "text_embedder": {
            "text": "test"
        },
        "ranker": {
            "query": "test"
        }
    }
)

Here is a screenshot of the index I created:
image

and the JSON definition I used to create it:

{
  "fields":[
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 384,
      "similarity": "cosine"
    }
  ]
}
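
As a quick sanity check on this definition, the sketch below compares the length of a query embedding with the index's numDimensions (384 here, which is also the vector size reported for BAAI/bge-small-en-v1.5 in this thread). The function name check_embedding_matches_index is hypothetical, and a mismatch would not by itself explain the '$vectorSearch' error above; this only rules out one unrelated misconfiguration.

```python
# Index definition copied from the issue.
index_definition = {
    "fields": [
        {
            "type": "vector",
            "path": "embedding",
            "numDimensions": 384,
            "similarity": "cosine",
        }
    ]
}


def check_embedding_matches_index(embedding, definition, path="embedding"):
    """Return the indexed dimension if the embedding length matches it.

    Hypothetical helper: raises ValueError on a length mismatch and
    KeyError if no vector field is defined for the given path.
    """
    for field in definition["fields"]:
        if field.get("type") == "vector" and field.get("path") == path:
            expected = field["numDimensions"]
            if len(embedding) != expected:
                raise ValueError(
                    f"embedding has {len(embedding)} dimensions, index expects {expected}"
                )
            return expected
    raise KeyError(f"no vector field defined for path {path!r}")


dimension = check_embedding_matches_index([0.1] * 384, index_definition)
```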

FAQ Check

System:

  • OS: Ubuntu, Python 3.10.7
  • GPU/CPU: GPU
  • Haystack version (commit or version number): v2.0.0
  • DocumentStore: MongoDBAtlasDocumentStore
  • Reader:
  • Retriever: MongoDBAtlasEmbeddingRetriever
@anakin87 transferred this issue from deepset-ai/haystack Mar 15, 2024
@anakin87 added the bug and integration:mongodb-atlas labels Mar 15, 2024
@masci added the P2 label Mar 22, 2024

anakin87 commented Apr 3, 2024

I cannot reproduce the bug.

What I did:

  • create a new database (test) and collection (test)
  • create a new index (vector_index) with the same configuration as yours:
{
  "fields":[
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 384,
      "similarity": "cosine"
    }
  ]
}
  • run the following script:
import os

from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.dataclasses import Document
from haystack_integrations.components.retrievers.mongodb_atlas import MongoDBAtlasEmbeddingRetriever
from haystack_integrations.document_stores.mongodb_atlas import MongoDBAtlasDocumentStore

os.environ["MONGO_CONNECTION_STRING"]="..."

document_store = MongoDBAtlasDocumentStore(
    database_name="test",
    collection_name="test",
    vector_search_index="vector_index",
)

# indexing phase
docs = [
    Document(content="This is a test", meta={"name": "test"}),
    Document(content="this is a document about dogs", meta={"name": "dog_doc"}),
    Document(content="this is a document about cats", meta={"name": "cat_doc"}),
]

embedder = SentenceTransformersDocumentEmbedder(model="BAAI/bge-small-en-v1.5")
embedder.warm_up()

docs_with_embeddings = embedder.run(docs)["documents"]

print(document_store.write_documents(docs_with_embeddings))
# 3

# retrieval phase
retriever = MongoDBAtlasEmbeddingRetriever(document_store=document_store, top_k=3)
results = retriever.run(query_embedding=[0.1]*384)
print(results)
# {'documents': [Document(id=0fc6abdbe5192ea10917b506084077451b47ccf097d5899f963a193b048a33a7, content: 'this is a document about cats', meta: {'name': 'cat_doc'}, score: 0.5037540197372437, embedding: vector of size 384), Document(id=ffd30337557ed1870cb5833d832c1a3c41f4889b3545e9c0b5e69108592661fd, content: 'This is a test', meta: {'name': 'test'}, score: 0.503305971622467, embedding: vector of size 384), Document(id=274731104067ab6f2e07380d4b1cd20112b26cd99fc6f36da8f9f4a7d6f06e00, content: 'this is a document about dogs', meta: {'name': 'dog_doc'}, score: 0.5031192898750305, embedding: vector of size 384)]}

I also tried a more complex example, with a Retrieval Pipeline with a Text Embedder and a Ranker, but I cannot reproduce the error.

@tillwf I'm closing the issue. Feel free to reopen it and add more details if the problem persists.
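
One detail that may be worth including when reopening is the MongoDB version the cluster reports, since an "Unrecognized pipeline stage name" error comes from the server rather than from Haystack. The sketch below is a hedged diagnostic: the minimum versions (6.0.11 and 7.0.2) are an assumption based on my reading of the Atlas Vector Search documentation and should be verified there, and the function names are hypothetical.

```python
import re

# ASSUMPTION: minimum patch release per release line that accepts $vectorSearch.
# Verify these numbers against the current Atlas documentation.
ASSUMED_MIN_VERSIONS = {(6, 0): (6, 0, 11), (7, 0): (7, 0, 2)}


def parse_version(version: str) -> tuple:
    """Parse the leading 'major.minor.patch' of a server version string."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def may_support_vector_search(version: str) -> bool:
    """Guess whether a server version is new enough for $vectorSearch."""
    parsed = parse_version(version)
    minimum = ASSUMED_MIN_VERSIONS.get(parsed[:2])
    if minimum is not None:
        return parsed >= minimum
    # Release lines not listed above: treat anything after 7.0 as supported.
    return parsed[:2] > (7, 0)


# Usage against a live cluster (requires pymongo and a reachable deployment):
# from pymongo import MongoClient
# version = MongoClient(os.environ["MONGO_CONNECTION_STRING"]).server_info()["version"]
# print(version, may_support_vector_search(version))
```

If the reported version is below the assumed minimum, or the connection string points at a deployment other than an Atlas cluster, that may explain why the same index definition works on one setup and fails on another.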

@anakin87 closed this as completed Apr 3, 2024