Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: MongoDB Atlas keyword search #1200

Closed
wants to merge 15 commits into from

Conversation

kanenorman
Copy link
Contributor

@kanenorman kanenorman commented Nov 19, 2024

Related Issues

Proposed Changes:

Added MongoDBAtlasDocumentStore._fulltext_retrieval() and the corresponding MongoDBAtlasFullTextRetriever component, enabling users to leverage MongoDB Atlas's full-text search capabilities.

How did you test it?

  • Updated existing test to handle new init parameter full_text_search_index
  • Added test for MongoDBAtlasFullTextRetriever component to test_retriever.py

I did not implement a test for full-text retrieval. Similar to test_embedding_retrieval.py, it would likely be more secure and maintainable for a member of the Haystack team to write the test for a collection created in Haystack's MongoDB Atlas cluster, specifically in a file called test_fulltext_retrieval.py.. (I would also be willing to write test if that is preferred).

Notes for the reviewer

In MongoDBAtlasDocumentStore._fulltext_retrieval(), filters are currently passed directly to the $match stage, similar to how Langchain implements this behavior. While this approach works, it is not optimal for performance, as outlined in the MongoDB Atlas documentation. A more efficient solution would involve adding a filter clause to the compound operator. However, this requires additional work since the filter syntax used in the $match stage is not directly compatible with the $filter clause. Implementing this would require writing a new helper function, _normalize_compound_filter, to convert Haystack filter syntax into the format understood by MongoDB’s compound filter. Given that this would significantly expand the size & scope of this PR, I believe it is more appropriate to address this enhancement in a separate PR.

Checklist

@kanenorman kanenorman requested a review from a team as a code owner November 19, 2024 01:08
@kanenorman kanenorman requested review from mpangrazzi and removed request for a team November 19, 2024 01:08
@github-actions github-actions bot added integration:mongodb-atlas type:documentation Improvements or additions to documentation labels Nov 19, 2024
@kanenorman kanenorman changed the title Mongodb keyword search feat: MongoDB Atlas keyword search Nov 19, 2024
@kanenorman
Copy link
Contributor Author

kanenorman commented Nov 19, 2024

I'm encountering a peculiar configuration error in CI when running this line of code in the test:

connection: MongoClient = MongoClient(
    os.environ["MONGO_CONNECTION_STRING"], 
    driver=DriverInfo(name="MongoDBAtlasHaystackIntegration")
)

The error is as follows:

/home/runner/.local/share/hatch/env/virtual/mongodb-atlas-haystack/BNcyCByD/mongodb-atlas-haystack/lib/python3.10/site-packages/pymongo/synchronous/mongo_client.py:797: in __init__
    seeds.update(uri_parser.split_hosts(entity, port))
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

hosts = '', default_port = 27017

    def split_hosts(hosts: str, default_port: Optional[int] = DEFAULT_PORT) -> list[_Address]:
        """Takes a string of the form host1[:port],host2[:port]...
        Splits it into (host, port) tuples. If [:port] isn't present, the
        default_port is used.
        ...
        """
        nodes = []
        for entity in hosts.split(","):
            if not entity:
>               raise ConfigurationError("Empty host (or extra comma in host list).")
E               pymongo.errors.ConfigurationError: Empty host (or extra comma in host list).

/home/runner/.local/share/hatch/env/virtual/mongodb-atlas-haystack/BNcyCByD/mongodb-atlas-haystack/lib/python3.10/site-packages/pymongo/uri_parser.py:392: ConfigurationError

This issue occurs only on Mac/Linux CI but not on Windows. Curiously, I cannot reproduce the failure locally on my Mac. I'll continue investigating the root cause of this inconsistency in the CI environment.

EDIT:
@mpangrazzi. Realized I won't have access to {{ secrets.MONGO_CONNECTION_STRING }} in the CI Workflow from a public fork.

image

@anakin87 anakin87 self-requested a review November 21, 2024 09:50
@anakin87
Copy link
Member

Hey @kanenorman, thanks for the contribution and sorry for the delay...

I will try to have a look in the next few days!

@anakin87 anakin87 assigned anakin87 and unassigned anakin87 Nov 21, 2024
@vblagoje vblagoje requested review from vblagoje and removed request for anakin87 November 21, 2024 14:14
@vblagoje
Copy link
Member

@kanenorman my colleague @mpangrazzi and I will take over the review of this PR.

This PR already looks good. Before we dive into details, would you please:

  • add test_fulltext_retrieval.py as you suggested
  • if possible a new example for this feature in examples dir

Can we somehow use both text-search and vector indices? If so an example of both retrieval setups for our users would be great. If this is possible perhaps @mpangrazzi and I can work on this example.

Regarding CI - you are right. Let me see internally what we can do here short of migrating your branch to our main fork and two of us doing the final polishing, checks and the final integration.

@kanenorman
Copy link
Contributor Author

@vblagoje - Thank you! I'll address the two bullet points you mentioned above.

Also,

Can we somehow use both text-search and vector indices? If so, an example of both retrieval setups for our users would be great. If this is possible, perhaps @mpangrazzi and I can work on this example.

For this, are you requesting to see a hybrid retrieval scenario where we demonstrate:

  • MongoDBAtlasFullTextRetriever utilizing MongoDBAtlasDocumentStore.full_text_search_index
  • MongoDBAtlasEmbeddingRetriever utilizing MongoDBAtlasDocumentStore.vector_search_index
  • Similar to the pgvector hybrid example

@vblagoje
Copy link
Member

vblagoje commented Nov 26, 2024

Yes @kanenorman - it would be nice for the community to see this example and for all of us to easily verify we nailed this one. 🙏

@kanenorman
Copy link
Contributor Author

@vblagoje - I've added integrations/mongodb_atlas/examples/hybrid_retrieval.py to demonstrate how MongoDBAtlasFullTextRetriever and MongoDBAtlasEmbeddingRetriever can be used together to create a hybrid retrieval pipeline.

For the full-text search test in test_fulltext_retrieval.py, I completed the following setup:

  1. Created the full_text_index in MongoDB atlas UI
  2. Inserted test documents into the document store:
docs = [
    Document(
        content="The quick brown fox chased the dog",
        meta={"meta_field": "right_value"}
    ),
    Document(
        content="The fox was brown",
        meta={"meta_field": "right_value"}
     ),
    Document(content="The lazy dog"),
    Document(content="fox fox fox"),
]
document_store.write_documents(docs)
  1. Added a synonym mapping for testing:
{"mappingType":"equivalent","synonyms":["fox","reynard"]}

Please let me know if you need any additional details or assistance setting up the test environment.

@mpangrazzi
Copy link
Contributor

@kanenorman thanks for your work! 🙏 I'll review with @vblagoje soon!

@CLAassistant
Copy link

CLAassistant commented Dec 3, 2024

CLA assistant check
All committers have signed the CLA.

@mpangrazzi
Copy link
Contributor

mpangrazzi commented Dec 3, 2024

Hi @kanenorman! We've reviewed the code and opened this (on our branches we have no secrets issue and tests are running).

We updated test_fulltext_retrieval.py according to your comment, but it seems that some tests are still failing (see here).

Are they running correctly on your side? How have you created the full_text_index? Thanks!

@kanenorman
Copy link
Contributor Author

kanenorman commented Dec 3, 2024

Hi, @mpangrazzi. I apologize for not including this in my previous comment.

This is the index that I created where fox_synonyms is a collection containing the synonym mapping I mentioned in the previous comment. The test passed on my local machine when I ran with these configurations. please let me know if there are further issues.

{
  "mappings": {
    "dynamic": false,
    "fields": {
      "content": [
        {
          "type": "string"
        }
      ]
    }
  },
  "synonyms": [
    {
      "analyzer": "lucene.standard",
      "name": "synonym_mapping",
      "source": {
        "collection": "fox_synonyms"
      }
    }
  ]
}

@mpangrazzi
Copy link
Contributor

@kanenorman Thanks for your answer! We have everything working correctly here. We'll do a final review soon.

@mpangrazzi
Copy link
Contributor

@kanenorman Closing this since #1228 has been merged!

@mpangrazzi mpangrazzi closed this Dec 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
integration:mongodb-atlas type:documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Full-text MongoDBAtlas Retriever
5 participants