From f91f1783a1a4336fb30b12763eb63d246d17cfaa Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Bilge=20Y=C3=BCcel?= Date: Tue, 22 Oct 2024 13:56:31 +0300 Subject: [PATCH] Update azure-cosmos-db to reflect postgresql support (#281) * Update azure-cosmos-db to reflect postgresql support * Update azure-cosmos-db.md * Update azure-cosmos-db.md * Add links to the retriever components for full pipeline examples --- integrations/azure-cosmos-db.md | 98 ++++++++++----------------------- 1 file changed, 29 insertions(+), 69 deletions(-) diff --git a/integrations/azure-cosmos-db.md b/integrations/azure-cosmos-db.md index 756cf81e..1cd4d0a0 100644 --- a/integrations/azure-cosmos-db.md +++ b/integrations/azure-cosmos-db.md @@ -8,8 +8,6 @@ authors: github: deepset-ai twitter: deepset_ai linkedin: https://www.linkedin.com/company/deepset-ai/ -pypi: https://pypi.org/project/mongodb-atlas-haystack/ -repo: https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/mongodb_atlas type: Document Store report_issue: https://github.com/deepset-ai/haystack-core-integrations/issues logo: /logos/azure-cosmos-db.png @@ -21,22 +19,30 @@ version: Haystack 2.0 - [Overview](#overview) - [Installation](#installation) -- [Usage](#usage) +- [Usage (MongoDB)](#usage-mongodb) +- [Usage (PostgreSQL)](#usage-postgresql) ## Overview -[Azure Cosmos DB](https://learn.microsoft.com/en-us/azure/cosmos-db/introduction) is a fully managed NoSQL, relational, and vector database for modern app development. It offers single-digit millisecond response times, automatic and instant scalability, and guaranteed speed at any scale. It is the database that ChatGPT relies on to dynamically scale with high reliability and low maintenance. +[Azure Cosmos DB](https://learn.microsoft.com/en-us/azure/cosmos-db/introduction) is a fully managed NoSQL, relational, and vector database for modern app development. 
It offers single-digit millisecond response times, automatic and instant scalability, and guaranteed speed at any scale. It is the database that ChatGPT relies on to dynamically scale with high reliability and low maintenance. Haystack supports **MongoDB** and **PostgreSQL** clusters running on Azure Cosmos DB. [Azure Cosmos DB for MongoDB](https://learn.microsoft.com/en-us/azure/cosmos-db/mongodb/introduction) makes it easy to use Azure Cosmos DB as if it were a MongoDB database. You can use your existing MongoDB skills and continue to use your favorite MongoDB drivers, SDKs, and tools by pointing your application to the connection string for your account using the API for MongoDB. Learn more in the [Azure Cosmos DB for MongoDB documentation](https://learn.microsoft.com/en-us/azure/cosmos-db/mongodb/). +[Azure Cosmos DB for PostgreSQL](https://learn.microsoft.com/en-us/azure/cosmos-db/postgresql/introduction) is a managed service for PostgreSQL extended with the Citus open source superpower of distributed tables. This superpower enables you to build highly scalable relational apps. You can start building apps on a single node cluster, as you would with PostgreSQL. As your app's scalability and performance requirements grow, you can seamlessly scale to multiple nodes by transparently distributing your tables. Learn more in the [Azure Cosmos DB for PostgreSQL documentation](https://learn.microsoft.com/en-us/azure/cosmos-db/postgresql/). + ## Installation -It's possible to connect to your MongoDB cluster in Azure Cosmos DB through the `MongoDBAtlasDocumentStore`. For that, install the `mongo-atlas-haystack` integration. +It's possible to connect to your **MongoDB** cluster on Azure Cosmos DB through the `MongoDBAtlasDocumentStore`. For that, install the `mongodb-atlas-haystack` integration. ```bash pip install mongodb-atlas-haystack ``` -## Usage +If you want to connect to the **PostgreSQL** cluster on Azure Cosmos DB, install the `pgvector-haystack` integration. 
+```bash +pip install pgvector-haystack +``` + +## Usage (MongoDB) To use Azure Cosmos DB for MongoDB with `MongoDBAtlasDocumentStore`, you'll need to set up an Azure Cosmos DB for MongoDB vCore cluster through the Azure portal. For a step-by-step guide, refer to [Quickstart: Azure Cosmos DB for MongoDB vCore](https://learn.microsoft.com/en-us/azure/cosmos-db/mongodb/vcore/quickstart-portal). @@ -70,75 +76,29 @@ document_store = MongoDBAtlasDocumentStore( document_store.write_documents([Document(content="this is my first doc")]) ``` +Now, you can go ahead and build your Haystack pipeline using `MongoDBAtlasEmbeddingRetriever`. Check out the [MongoDBAtlasEmbeddingRetriever docs](https://docs.haystack.deepset.ai/docs/mongodbatlasembeddingretriever) for the full pipeline example. -### Example pipelines +## Usage (PostgreSQL) -Here is some example code of an end-to-end RAG app built on Azure Cosmos DB: one indexing pipeline that embeds the documents, -and a generative pipeline that can be used for question answering. +To use Azure Cosmos DB for PostgreSQL with `PgvectorDocumentStore`, you'll need to set up a PostgreSQL cluster through the Azure portal. For a step-by-step guide, refer to [Quickstart: Azure Cosmos DB for PostgreSQL](https://learn.microsoft.com/en-us/azure/cosmos-db/postgresql/quickstart-create-portal). + +After setting up your cluster, configure the `PG_CONN_STR` environment variable using the connection string for your cluster. You can find the connection string by following the instructions [here](https://learn.microsoft.com/en-us/azure/cosmos-db/postgresql/quickstart-connect-psql). 
The format should look like this: ```python -from haystack import Pipeline, Document -from haystack.document_stores.types import DuplicatePolicy -from haystack.components.writers import DocumentWriter -from haystack.components.generators import OpenAIGenerator -from haystack.components.builders.prompt_builder import PromptBuilder -from haystack.components.embedders import SentenceTransformersDocumentEmbedder, SentenceTransformersTextEmbedder -from haystack_integrations.document_stores.mongodb_atlas import MongoDBAtlasDocumentStore -from haystack_integrations.components.retrievers.mongodb_atlas import MongoDBAtlasEmbeddingRetriever +import os -# Create some example documents -documents = [ - Document(content="My name is Jean and I live in Paris."), - Document(content="My name is Mark and I live in Berlin."), - Document(content="My name is Giorgio and I live in Rome."), -] +os.environ['PG_CONN_STR'] = "host=c-..postgres.cosmos.azure.com port=5432 dbname=citus user=citus password={your_password} sslmode=require" +``` -document_store = MongoDBAtlasDocumentStore( - database_name="quickstartDB", # your db name - collection_name="sampleCollection", # your collection name - vector_search_index="haystack-test", # your cluster name -) +Once this is done, you can initialize the [`PgvectorDocumentStore`](https://docs.haystack.deepset.ai/docs/pgvectordocumentstore) in Haystack with the appropriate configuration. 
-# Define some more components -doc_writer = DocumentWriter(document_store=document_store, policy=DuplicatePolicy.SKIP) -doc_embedder = SentenceTransformersDocumentEmbedder(model="intfloat/e5-base-v2") -query_embedder = SentenceTransformersTextEmbedder(model="intfloat/e5-base-v2") - -# Pipeline that ingests document for retrieval -indexing_pipe = Pipeline() -indexing_pipe.add_component(instance=doc_embedder, name="doc_embedder") -indexing_pipe.add_component(instance=doc_writer, name="doc_writer") - -indexing_pipe.connect("doc_embedder.documents", "doc_writer.documents") -indexing_pipe.run({"doc_embedder": {"documents": documents}}) - -# Build a RAG pipeline with a Retriever to get documents relevant to -# the query, a PromptBuilder to create a custom prompt and the OpenAIGenerator (LLM) -prompt_template = """ -Given these documents, answer the question.\nDocuments: -{% for doc in documents %} - {{ doc.content }} -{% endfor %} - -\nQuestion: {{question}} -\nAnswer: -""" -rag_pipeline = Pipeline() -rag_pipeline.add_component(instance=query_embedder, name="query_embedder") -rag_pipeline.add_component(instance=MongoDBAtlasEmbeddingRetriever(document_store=document_store), name="retriever") -rag_pipeline.add_component(instance=PromptBuilder(template=prompt_template), name="prompt_builder") -rag_pipeline.add_component(instance=OpenAIGenerator(), name="llm") -rag_pipeline.connect("query_embedder", "retriever.query_embedding") -rag_pipeline.connect("embedding_retriever", "prompt_builder.documents") -rag_pipeline.connect("prompt_builder", "llm") - -# Ask a question on the data you just added. -question = "Where does Mark live?" 
-result = rag_pipeline.run( - { - "query_embedder": {"text": question}, - "prompt_builder": {"question": question}, - } +```python +document_store = PgvectorDocumentStore( + table_name="haystack_documents", + embedding_dimension=1024, + vector_function="cosine_similarity", + search_strategy="hnsw", + recreate_table=True, ) -print(result) ``` +Now, you can go ahead and build your Haystack pipeline using `PgvectorEmbeddingRetriever` and `PgvectorKeywordRetriever`. Check out the [PgvectorEmbeddingRetriever docs](https://docs.haystack.deepset.ai/docs/pgvectorembeddingretriever) for the full pipeline example.
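
The `PG_CONN_STR` value set in the PostgreSQL usage section is a libpq keyword/value connection string. As a minimal sketch, assuming you would rather assemble that string from separate settings than paste it by hand, the helpers below join the fields and quote any value that is empty or contains spaces, single quotes, or backslashes. `quote_conn_value` and `build_pg_conn_str` are hypothetical names introduced for illustration and are not part of Haystack or the Azure tooling; the host, the password fallback, and the `PG_PASSWORD` variable name are placeholders.

```python
import os


def quote_conn_value(value: str) -> str:
    """Quote a value for a libpq keyword/value connection string.

    Empty values and values containing spaces, single quotes, or
    backslashes are wrapped in single quotes, with backslashes and
    single quotes escaped by a backslash.
    """
    if value and not any(ch in value for ch in " '\\"):
        return value
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def build_pg_conn_str(
    host: str,
    password: str,
    dbname: str = "citus",
    user: str = "citus",
    port: int = 5432,
    sslmode: str = "require",
) -> str:
    """Assemble the connection string in the field order used on this page."""
    fields = {
        "host": host,
        "port": str(port),
        "dbname": dbname,
        "user": user,
        "password": password,
        "sslmode": sslmode,
    }
    return " ".join(f"{key}={quote_conn_value(val)}" for key, val in fields.items())


if __name__ == "__main__":
    # Placeholder host and a hypothetical PG_PASSWORD variable: replace both
    # with the values shown for your cluster in the Azure portal.
    os.environ["PG_CONN_STR"] = build_pg_conn_str(
        host="your-cluster-host",
        password=os.environ.get("PG_PASSWORD", "your_password"),
    )
```

Keeping the password in its own environment variable, rather than hard-coding the whole string, also keeps the secret out of source control.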