diff --git a/docs/docs/integrations/providers/upstash.mdx b/docs/docs/integrations/providers/upstash.mdx
index ff39b87649708..3ed8508360611 100644
--- a/docs/docs/integrations/providers/upstash.mdx
+++ b/docs/docs/integrations/providers/upstash.mdx
@@ -1,6 +1,130 @@
-# Upstash Redis
+Upstash offers developers serverless databases and messaging
+platforms to build powerful applications without having to worry
+about the operational complexity of running databases at scale.
+
+One significant advantage of Upstash is that their databases support HTTP and all of their SDKs use HTTP.
+This means that you can run them on serverless platforms, at the edge, or on any platform that does not support TCP connections.
+
+Currently, there are two Upstash integrations available for LangChain:
+Upstash Vector as a vector embedding database and Upstash Redis as a cache and memory store.
+
+# Upstash Vector
+
+Upstash Vector is a serverless vector database that can be used to store and query vectors.
+
+## Installation
+
+Create a new serverless vector database at the [Upstash Console](https://console.upstash.com/vector).
+Select your preferred distance metric and dimension count according to your model.
+
+Install the Upstash Vector Python SDK with `pip install upstash-vector`.
+The Upstash Vector integration in LangChain is a wrapper around the Upstash Vector Python SDK, which is why the `upstash-vector` package is required.
+
+## Integrations
+
+Create an `UpstashVectorStore` object using credentials from the Upstash Console.
+You also need to pass in an `Embeddings` object which can turn text into vector embeddings.
+
+```python
+from langchain_community.vectorstores.upstash import UpstashVectorStore
+from langchain_openai import OpenAIEmbeddings
+import os
+
+os.environ["UPSTASH_VECTOR_REST_URL"] = ""
+os.environ["UPSTASH_VECTOR_REST_TOKEN"] = ""
+
+# An Embeddings object turns text into vector embeddings
+embeddings = OpenAIEmbeddings()
+
+store = UpstashVectorStore(
+    embedding=embeddings
+)
+```
+
+### Inserting Vectors
+
+```python
+from langchain.text_splitter import CharacterTextSplitter
+from langchain_community.document_loaders import TextLoader
+from langchain_openai import OpenAIEmbeddings
+
+loader = TextLoader("../../modules/state_of_the_union.txt")
+documents = loader.load()
+text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
+docs = text_splitter.split_documents(documents)
+
+# Create a new embeddings object
+embeddings = OpenAIEmbeddings()
+
+# Create a new UpstashVectorStore object
+store = UpstashVectorStore(
+    embedding=embeddings
+)
+
+# Insert the document embeddings into the store
+store.add_documents(docs)
+```
+
+When inserting documents, they are first embedded using the `Embeddings` object.
+
+Most embedding models can embed multiple documents at once, so the documents are batched and embedded in parallel.
+The size of these batches can be controlled with the `embedding_chunk_size` parameter.
+
+The embedded vectors are then stored in the Upstash Vector database. When they are sent, multiple vectors are batched together to reduce the number of HTTP requests.
+The size of this batch can be controlled with the `batch_size` parameter. Upstash Vector has a limit of 1000 vectors per batch in the free tier.
-Upstash offers developers serverless databases and messaging platforms to build powerful applications without having to worry about the operational complexity of running databases at scale.
+```python
+store.add_documents(
+    docs,
+    batch_size=100,
+    embedding_chunk_size=200
+)
+```
+
+### Querying Vectors
+
+Vectors can be queried using a text query or another vector.
+
+The returned value is a list of Document objects.
+
+```python
+result = store.similarity_search(
+    "The United States of America",
+    k=5
+)
+```
+
+Or using a vector:
+
+```python
+vector = embeddings.embed_query("Hello world")
+
+result = store.similarity_search_by_vector(
+    vector,
+    k=5
+)
+```
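+
+Scores can be returned alongside the results with `similarity_search_with_score`. As a rough sketch, reusing the `store` from above, each result is a `(Document, score)` tuple:
+
+```python
+results = store.similarity_search_with_score(
+    "The United States of America",
+    k=5
+)
+
+for doc, score in results:
+    print(doc.metadata, score)
+```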
+
+### Deleting Vectors
+
+Vectors can be deleted by their IDs.
+
+```python
+store.delete(["id1", "id2"])
+```
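+
+The whole index can also be cleared with the `delete_all` flag. A minimal sketch, again assuming the `store` from above:
+
+```python
+# Removes every vector in the index
+store.delete(delete_all=True)
+```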
+
+### Getting information about the store
+
+You can get information about your database, like the distance metric and dimension, using the `info` function.
+
+When an insert happens, indexing takes place in the database. While this is happening, new vectors can not be queried. `pendingVectorCount` represents the number of vectors that are currently being indexed.
+
+```python
+info = store.info()
+print(info)
+
+# Output:
+# {'vectorCount': 44, 'pendingVectorCount': 0, 'indexSize': 2642412, 'dimension': 1536, 'similarityFunction': 'COSINE'}
+```
+
+# Upstash Redis
 
 This page covers how to use [Upstash Redis](https://upstash.com/redis) with LangChain.
 
@@ -12,7 +136,6 @@ This page covers how to use [Upstash Redis](https://upstash.com/redis) with Lang
 ## Integrations
 All of Upstash-LangChain integrations are based on `upstash-redis` Python SDK being utilized as wrappers for LangChain.
 This SDK utilizes Upstash Redis DB by giving UPSTASH_REDIS_REST_URL and UPSTASH_REDIS_REST_TOKEN parameters from the console.
-One significant advantage of this is that, this SDK uses a REST API. This means, you can run this in serverless platforms, edge or any platform that does not support TCP connections.
 
 ### Cache
 
diff --git a/docs/docs/integrations/vectorstores/upstash.ipynb b/docs/docs/integrations/vectorstores/upstash.ipynb
new file mode 100644
index 0000000000000..b6821c2dd4ea2
--- /dev/null
+++ b/docs/docs/integrations/vectorstores/upstash.ipynb
@@ -0,0 +1,343 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Upstash Vector\n",
+    "\n",
+    "> [Upstash Vector](https://upstash.com/docs/vector/overall/whatisvector) is a serverless vector database designed for working with vector embeddings.\n",
+    ">\n",
+    "> The LangChain integration is a wrapper around the [upstash-vector](https://github.com/upstash/vector-py) package.\n",
+    ">\n",
+    "> The Python package uses the [Vector REST API](https://upstash.com/docs/vector/api/get-started) behind the scenes."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Installation\n",
+    "\n",
+    "Create a free vector database from the [Upstash Console](https://console.upstash.com/vector) with the desired dimensions and distance metric.\n",
+    "\n",
+    "You can then create an `UpstashVectorStore` instance by:\n",
+    "\n",
+    "- Providing the environment variables `UPSTASH_VECTOR_REST_URL` and `UPSTASH_VECTOR_REST_TOKEN`\n",
+    "\n",
+    "- Giving them as parameters to the constructor\n",
+    "\n",
+    "- Passing an Upstash Vector `Index` instance to the constructor\n",
+    "\n",
+    "Also, an `Embeddings` instance is required to turn given texts into embeddings. Here we use `OpenAIEmbeddings` as an example."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%pip install langchain-openai langchain upstash-vector"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "\n",
+    "from langchain_community.vectorstores.upstash import UpstashVectorStore\n",
+    "from langchain_openai import OpenAIEmbeddings\n",
+    "\n",
+    "os.environ[\"OPENAI_API_KEY\"] = \"\"\n",
+    "os.environ[\"UPSTASH_VECTOR_REST_URL\"] = \"\"\n",
+    "os.environ[\"UPSTASH_VECTOR_REST_TOKEN\"] = \"\"\n",
+    "\n",
+    "# Create an embeddings instance\n",
+    "embeddings = OpenAIEmbeddings()\n",
+    "\n",
+    "# Create a vector store instance\n",
+    "store = UpstashVectorStore(embedding=embeddings)"
+   ]
+  },
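+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As a sketch of the alternatives listed above, the credentials can also be passed to the constructor directly, or an existing Upstash Vector `Index` instance can be wrapped. The URL and token values below are placeholders:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from upstash_vector import Index\n",
+    "\n",
+    "# Passing the credentials as constructor parameters\n",
+    "store_from_credentials = UpstashVectorStore(\n",
+    "    embedding=embeddings,\n",
+    "    index_url=\"<UPSTASH_VECTOR_REST_URL>\",\n",
+    "    index_token=\"<UPSTASH_VECTOR_REST_TOKEN>\",\n",
+    ")\n",
+    "\n",
+    "# Wrapping an existing Index instance\n",
+    "index = Index(url=\"<UPSTASH_VECTOR_REST_URL>\", token=\"<UPSTASH_VECTOR_REST_TOKEN>\")\n",
+    "store_from_index = UpstashVectorStore(embedding=embeddings, index=index)"
+   ]
+  },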
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Load documents\n",
+    "\n",
+    "Load an example text file and split it into chunks which can be turned into vector embeddings."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "[Document(page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \\n\\nLast year COVID-19 kept us apart. This year we are finally together again. \\n\\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \\n\\nWith a duty to one another to the American people to the Constitution. \\n\\nAnd with an unwavering resolve that freedom will always triumph over tyranny. \\n\\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \\n\\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \\n\\nHe met the Ukrainian people. \\n\\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.', metadata={'source': 'docs/docs/modules/state_of_the_union.txt'}),\n",
+       " Document(page_content='Groups of citizens blocking tanks with their bodies. Everyone from students to retirees teachers turned soldiers defending their homeland. \\n\\nIn this struggle as President Zelenskyy said in his speech to the European Parliament “Light will win over darkness.” The Ukrainian Ambassador to the United States is here tonight. \\n\\nLet each of us here tonight in this Chamber send an unmistakable signal to Ukraine and to the world. \\n\\nPlease rise if you are able and show that, Yes, we the United States of America stand with the Ukrainian people. \\n\\nThroughout our history we’ve learned this lesson when dictators do not pay a price for their aggression they cause more chaos. \\n\\nThey keep moving. \\n\\nAnd the costs and the threats to America and the world keep rising. \\n\\nThat’s why the NATO Alliance was created to secure peace and stability in Europe after World War 2. \\n\\nThe United States is a member along with 29 other nations. \\n\\nIt matters. American diplomacy matters. American resolve matters.', metadata={'source': 'docs/docs/modules/state_of_the_union.txt'}),\n",
+       " Document(page_content='Putin’s latest attack on Ukraine was premeditated and unprovoked. \\n\\nHe rejected repeated efforts at diplomacy. \\n\\nHe thought the West and NATO wouldn’t respond. And he thought he could divide us at home. Putin was wrong. We were ready. Here is what we did. \\n\\nWe prepared extensively and carefully. \\n\\nWe spent months building a coalition of other freedom-loving nations from Europe and the Americas to Asia and Africa to confront Putin. \\n\\nI spent countless hours unifying our European allies. We shared with the world in advance what we knew Putin was planning and precisely how he would try to falsely justify his aggression. \\n\\nWe countered Russia’s lies with truth. \\n\\nAnd now that he has acted the free world is holding him accountable. \\n\\nAlong with twenty-seven members of the European Union including France, Germany, Italy, as well as countries like the United Kingdom, Canada, Japan, Korea, Australia, New Zealand, and many others, even Switzerland.', metadata={'source': 'docs/docs/modules/state_of_the_union.txt'})]"
+      ]
+     },
+     "execution_count": 15,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "from langchain.text_splitter import CharacterTextSplitter\n",
+    "from langchain_community.document_loaders import TextLoader\n",
+    "from langchain_openai import OpenAIEmbeddings\n",
+    "\n",
+    "loader = TextLoader(\"../../modules/state_of_the_union.txt\")\n",
+    "documents = loader.load()\n",
+    "text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)\n",
+    "docs = text_splitter.split_documents(documents)\n",
+    "\n",
+    "docs[:3]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Inserting documents\n",
+    "\n",
+    "The vectorstore embeds text chunks using the embedding object and batch inserts them into the database. This returns a list of ids for the inserted vectors."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 25,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "['95362512-0801-4b33-8e32-91ed563b25e5',\n",
+       " '7ee0cb06-0987-4d31-9089-c5b6c42fea08',\n",
+       " '40abd35c-e687-476c-a426-fcbb1fd679d8',\n",
+       " '4450d872-56b0-49a2-aa91-a5a718815bec',\n",
+       " '00a6df29-621b-4a48-a7d7-9a81ca4030de']"
+      ]
+     },
+     "execution_count": 25,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "inserted_vectors = store.add_documents(docs)\n",
+    "\n",
+    "inserted_vectors[:5]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Texts can also be added to the store, with optional metadata for each one, using the `add_texts` method."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 26,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "['6010f8ef-ee75-4c37-8db8-052431a8bd01',\n",
+       " 'a388f11a-7cc3-4878-a8ac-77ba7c47fb82']"
+      ]
+     },
+     "execution_count": 26,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "store.add_texts(\n",
+    "    [\"This is a test\", \"This is another test\"],\n",
+    "    [\n",
+    "        {\"title\": \"Test 1\", \"author\": \"John Doe\", \"date\": \"2021-01-01\"},\n",
+    "        {\"title\": \"Test 2\", \"author\": \"Jane Doe\", \"date\": \"2021-01-02\"},\n",
+    "    ],\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Querying\n",
+    "\n",
+    "The database can be queried using a vector or a text prompt.\n",
+    "If a text prompt is used, it's first converted into an embedding and then queried.\n",
+    "\n",
+    "The `k` parameter specifies how many results to return from the query."
+ ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "[Document(page_content='And my report is this: the State of the Union is strong—because you, the American people, are strong. \\n\\nWe are stronger today than we were a year ago. \\n\\nAnd we will be stronger a year from now than we are today. \\n\\nNow is our moment to meet and overcome the challenges of our time. \\n\\nAnd we will, as one people. \\n\\nOne America. \\n\\nThe United States of America. \\n\\nMay God bless you all. May God protect our troops.', metadata={'source': 'docs/docs/modules/state_of_the_union.txt'}),\n", + " Document(page_content='And built the strongest, freest, and most prosperous nation the world has ever known. \\n\\nNow is the hour. \\n\\nOur moment of responsibility. \\n\\nOur test of resolve and conscience, of history itself. \\n\\nIt is in this moment that our character is formed. Our purpose is found. Our future is forged. \\n\\nWell I know this nation. \\n\\nWe will meet the test. \\n\\nTo protect freedom and liberty, to expand fairness and opportunity. \\n\\nWe will save democracy. \\n\\nAs hard as these times have been, I am more optimistic about America today than I have been my whole life. \\n\\nBecause I see the future that is within our grasp. \\n\\nBecause I know there is simply nothing beyond our capacity. \\n\\nWe are the only nation on Earth that has always turned every crisis we have faced into an opportunity. \\n\\nThe only nation that can be defined by a single word: possibilities. \\n\\nSo on this night, in our 245th year as a nation, I have come to report on the State of the Union.', metadata={'source': 'docs/docs/modules/state_of_the_union.txt'}),\n", + " Document(page_content='Groups of citizens blocking tanks with their bodies. Everyone from students to retirees teachers turned soldiers defending their homeland. \\n\\nIn this struggle as President Zelenskyy said in his speech to the European Parliament “Light will win over darkness.” The Ukrainian Ambassador to the United States is here tonight. \\n\\nLet each of us here tonight in this Chamber send an unmistakable signal to Ukraine and to the world. \\n\\nPlease rise if you are able and show that, Yes, we the United States of America stand with the Ukrainian people. \\n\\nThroughout our history we’ve learned this lesson when dictators do not pay a price for their aggression they cause more chaos. \\n\\nThey keep moving. \\n\\nAnd the costs and the threats to America and the world keep rising. \\n\\nThat’s why the NATO Alliance was created to secure peace and stability in Europe after World War 2. \\n\\nThe United States is a member along with 29 other nations. \\n\\nIt matters. American diplomacy matters. American resolve matters.', metadata={'source': 'docs/docs/modules/state_of_the_union.txt'}),\n", + " Document(page_content='Madam Speaker, Madam Vice President, our First Lady and Second Gentleman. Members of Congress and the Cabinet. Justices of the Supreme Court. My fellow Americans. \\n\\nLast year COVID-19 kept us apart. This year we are finally together again. \\n\\nTonight, we meet as Democrats Republicans and Independents. But most importantly as Americans. \\n\\nWith a duty to one another to the American people to the Constitution. \\n\\nAnd with an unwavering resolve that freedom will always triumph over tyranny. 
\\n\\nSix days ago, Russia’s Vladimir Putin sought to shake the foundations of the free world thinking he could make it bend to his menacing ways. But he badly miscalculated. \\n\\nHe thought he could roll into Ukraine and the world would roll over. Instead he met a wall of strength he never imagined. \\n\\nHe met the Ukrainian people. \\n\\nFrom President Zelenskyy to every Ukrainian, their fearlessness, their courage, their determination, inspires the world.', metadata={'source': 'docs/docs/modules/state_of_the_union.txt'}),\n", + " Document(page_content='For that purpose we’ve mobilized American ground forces, air squadrons, and ship deployments to protect NATO countries including Poland, Romania, Latvia, Lithuania, and Estonia. \\n\\nAs I have made crystal clear the United States and our Allies will defend every inch of territory of NATO countries with the full force of our collective power. \\n\\nAnd we remain clear-eyed. The Ukrainians are fighting back with pure courage. But the next few days weeks, months, will be hard on them. \\n\\nPutin has unleashed violence and chaos. But while he may make gains on the battlefield – he will pay a continuing high price over the long run. \\n\\nAnd a proud Ukrainian people, who have known 30 years of independence, have repeatedly shown that they will not tolerate anyone who tries to take their country backwards. \\n\\nTo all Americans, I will be honest with you, as I’ve always promised. A Russian dictator, invading a foreign country, has costs around the world.', metadata={'source': 'docs/docs/modules/state_of_the_union.txt'})]" + ] + }, + "execution_count": 27, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "result = store.similarity_search(\"The United States of America\", k=5)\n", + "result" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Querying with score\n", + "\n", + "The score of the query can be included for every result. \n", + "\n", + "> The score returned in the query requests is a normalized value between 0 and 1, where 1 indicates the highest similarity and 0 the lowest regardless of the similarity function used. For more information look at the [docs](https://upstash.com/docs/vector/overall/features#vector-similarity-functions)." 
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 28,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "{'source': 'docs/docs/modules/state_of_the_union.txt'} - 0.9181416\n",
+      "{'source': 'docs/docs/modules/state_of_the_union.txt'} - 0.91668516\n",
+      "{'source': 'docs/docs/modules/state_of_the_union.txt'} - 0.9117657\n",
+      "{'source': 'docs/docs/modules/state_of_the_union.txt'} - 0.90447474\n",
+      "{'source': 'docs/docs/modules/state_of_the_union.txt'} - 0.9022917\n"
+     ]
+    }
+   ],
+   "source": [
+    "result = store.similarity_search_with_score(\"The United States of America\", k=5)\n",
+    "\n",
+    "for doc, score in result:\n",
+    "    print(f\"{doc.metadata} - {score}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Deleting vectors\n",
+    "\n",
+    "Vectors can be deleted by their ids."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 20,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "store.delete(inserted_vectors)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Clearing the vector database\n",
+    "\n",
+    "This will clear the vector database."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 21,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "store.delete(delete_all=True)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Getting info about vector database\n",
+    "\n",
+    "You can get information about your database, like the distance metric and dimension, using the `info` function.\n",
+    "\n",
+    "> When an insert happens, indexing takes place in the database. While this is happening, new vectors can not be queried. `pendingVectorCount` represents the number of vectors that are currently being indexed."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 32,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "{'vectorCount': 44, 'pendingVectorCount': 0, 'indexSize': 2642412, 'dimension': 1536, 'similarityFunction': 'COSINE'}\n"
+     ]
+    },
+    {
+     "data": {
+      "text/plain": [
+       "InfoResult(vector_count=44, pending_vector_count=0, index_size=2642412, dimension=1536, similarity_function='COSINE')"
+      ]
+     },
+     "execution_count": 32,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "store.info()"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "ai",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.4"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/docs/docs/modules/data_connection/indexing.ipynb b/docs/docs/modules/data_connection/indexing.ipynb
index 45a5d92a42bc9..1f67758005d00 100644
--- a/docs/docs/modules/data_connection/indexing.ipynb
+++ b/docs/docs/modules/data_connection/indexing.ipynb
@@ -60,7 +60,7 @@
 " * document addition by id (`add_documents` method with `ids` argument)\n",
 " * delete by id (`delete` method with `ids` argument)\n",
 "\n",
- "Compatible Vectorstores: `AnalyticDB`, `AstraDB`, `AwaDB`, `Bagel`, `Cassandra`, `Chroma`, `DashVector`, `DatabricksVectorSearch`, `DeepLake`, `Dingo`, `ElasticVectorSearch`, `ElasticsearchStore`, `FAISS`, `HanaDB`, `Milvus`, `MyScale`, `PGVector`, `Pinecone`, `Qdrant`, `Redis`, `Rockset`, `ScaNN`, `SupabaseVectorStore`, `SurrealDBStore`, `TimescaleVector`, 
`Vald`, `Vearch`, `VespaStore`, `Weaviate`, `ZepVectorStore`.\n", + "Compatible Vectorstores: `AnalyticDB`, `AstraDB`, `AwaDB`, `Bagel`, `Cassandra`, `Chroma`, `DashVector`, `DatabricksVectorSearch`, `DeepLake`, `Dingo`, `ElasticVectorSearch`, `ElasticsearchStore`, `FAISS`, `HanaDB`, `Milvus`, `MyScale`, `PGVector`, `Pinecone`, `Qdrant`, `Redis`, `Rockset`, `ScaNN`, `SupabaseVectorStore`, `SurrealDBStore`, `TimescaleVector`, `UpstashVectorStore`, `Vald`, `Vearch`, `VespaStore`, `Weaviate`, `ZepVectorStore`.\n", " \n", "## Caution\n", "\n", diff --git a/libs/community/langchain_community/vectorstores/__init__.py b/libs/community/langchain_community/vectorstores/__init__.py index da412557f3753..e845114ae038c 100644 --- a/libs/community/langchain_community/vectorstores/__init__.py +++ b/libs/community/langchain_community/vectorstores/__init__.py @@ -428,6 +428,12 @@ def _import_typesense() -> Any: return Typesense +def _import_upstash() -> Any: + from langchain_community.vectorstores.upstash import UpstashVectorStore + + return UpstashVectorStore + + def _import_usearch() -> Any: from langchain_community.vectorstores.usearch import USearch @@ -617,6 +623,8 @@ def __getattr__(name: str) -> Any: return _import_timescalevector() elif name == "Typesense": return _import_typesense() + elif name == "UpstashVectorStore": + return _import_upstash() elif name == "USearch": return _import_usearch() elif name == "Vald": @@ -704,6 +712,7 @@ def __getattr__(name: str) -> Any: "Tigris", "TimescaleVector", "Typesense", + "UpstashVectorStore", "USearch", "Vald", "Vearch", diff --git a/libs/community/langchain_community/vectorstores/upstash.py b/libs/community/langchain_community/vectorstores/upstash.py new file mode 100644 index 0000000000000..037946830f869 --- /dev/null +++ b/libs/community/langchain_community/vectorstores/upstash.py @@ -0,0 +1,774 @@ +from __future__ import annotations + +import logging +import uuid +from typing import TYPE_CHECKING, Any, Iterable, List, Optional, Tuple + +import numpy as np +from langchain_core.documents import Document +from langchain_core.embeddings import Embeddings +from langchain_core.utils.iter import batch_iterate +from langchain_core.vectorstores import VectorStore + +from langchain_community.vectorstores.utils import ( + maximal_marginal_relevance, +) + +if TYPE_CHECKING: + from upstash_vector import AsyncIndex, Index + +logger = logging.getLogger(__name__) + + +class UpstashVectorStore(VectorStore): + """Upstash Vector vector store + + To use, the ``upstash-vector`` python package must be installed. + + Also an Upstash Vector index is required. First create a new Upstash Vector index + and copy the `index_url` and `index_token` variables. Then either pass + them through the constructor or set the environment + variables `UPSTASH_VECTOR_REST_URL` and `UPSTASH_VECTOR_REST_TOKEN`. + + Example: + .. code-block:: python + + from langchain_community.vectorstores.upstash import UpstashVectorStore + from langchain_community.embeddings.openai import OpenAIEmbeddings + + embeddings = OpenAIEmbeddings() + vectorstore = UpstashVectorStore( + embedding=embeddings, + index_url="...", + index_token="..." + ) + + # or + + import os + + os.environ["UPSTASH_VECTOR_REST_URL"] = "..." + os.environ["UPSTASH_VECTOR_REST_TOKEN"] = "..." 
+
+            vectorstore = UpstashVectorStore(
+                embedding=embeddings
+            )
+    """
+
+    def __init__(
+        self,
+        text_key: str = "text",
+        index: Optional[Index] = None,
+        async_index: Optional[AsyncIndex] = None,
+        index_url: Optional[str] = None,
+        index_token: Optional[str] = None,
+        embedding: Optional[Embeddings] = None,
+    ):
+        """
+        Constructor for UpstashVectorStore.
+
+        If index or index_url and index_token are not provided, the constructor will
+        attempt to create an index using the environment variables
+        `UPSTASH_VECTOR_REST_URL` and `UPSTASH_VECTOR_REST_TOKEN`.
+
+        Args:
+            text_key: Key to store the text in metadata.
+            index: UpstashVector Index object.
+            async_index: UpstashVector AsyncIndex object, provide only if async
+                functions are needed.
+            index_url: URL of the UpstashVector index.
+            index_token: Token of the UpstashVector index.
+            embedding: Embeddings object.
+
+        Example:
+            .. code-block:: python
+
+                from langchain_community.vectorstores.upstash import UpstashVectorStore
+                from langchain_community.embeddings.openai import OpenAIEmbeddings
+
+                embeddings = OpenAIEmbeddings()
+                vectorstore = UpstashVectorStore(
+                    embedding=embeddings,
+                    index_url="...",
+                    index_token="..."
+                )
+
+                # With an existing index
+                from upstash_vector import Index
+
+                index = Index(url="...", token="...")
+                vectorstore = UpstashVectorStore(
+                    embedding=embeddings,
+                    index=index
+                )
+        """
+
+        try:
+            from upstash_vector import AsyncIndex, Index
+        except ImportError:
+            raise ImportError(
+                "Could not import upstash_vector python package. "
+                "Please install it with `pip install upstash-vector`."
+            )
+
+        if index:
+            if not isinstance(index, Index):
+                raise ValueError(
+                    "Passed index object should be an "
+                    "instance of upstash_vector.Index, "
+                    f"got {type(index)}"
+                )
+            self._index = index
+            logger.info("Using the index passed as parameter")
+        if async_index:
+            if not isinstance(async_index, AsyncIndex):
+                raise ValueError(
+                    "Passed index object should be an "
+                    "instance of upstash_vector.AsyncIndex, "
+                    f"got {type(async_index)}"
+                )
+            self._async_index = async_index
+            logger.info("Using the async index passed as parameter")
+
+        if index_url and index_token:
+            self._index = Index(url=index_url, token=index_token)
+            self._async_index = AsyncIndex(url=index_url, token=index_token)
+            logger.info("Created index from the index_url and index_token parameters")
+        elif not index and not async_index:
+            self._index = Index.from_env()
+            self._async_index = AsyncIndex.from_env()
+            logger.info("Created index using environment variables")
+
+        self._embeddings = embedding
+        self._text_key = text_key
+
+    @property
+    def embeddings(self) -> Optional[Embeddings]:
+        """Access the query embedding object if available."""
+        return self._embeddings
+
+    def _embed_documents(self, texts: Iterable[str]) -> List[List[float]]:
+        """Embed strings using the embeddings object"""
+        if not self._embeddings:
+            raise ValueError(
+                "No embeddings object provided. "
+                "Pass an embeddings object to the constructor."
+            )
+        return self._embeddings.embed_documents(list(texts))
+
+    def _embed_query(self, text: str) -> List[float]:
+        """Embed query text using the embeddings object."""
+        if not self._embeddings:
+            raise ValueError(
+                "No embeddings object provided. "
+                "Pass an embeddings object to the constructor."
+            )
+        return self._embeddings.embed_query(text)
+
+    def add_documents(
+        self,
+        documents: Iterable[Document],
+        ids: Optional[List[str]] = None,
+        batch_size: int = 32,
+        embedding_chunk_size: int = 1000,
+    ) -> List[str]:
+        """
+        Get the embeddings for the documents and add them to the vectorstore.
+
+        Documents are sent to the embeddings object
+        in batches of size `embedding_chunk_size`.
+        The embeddings are then upserted into the vectorstore
+        in batches of size `batch_size`.
+
+        Args:
+            documents: Iterable of Documents to add to the vectorstore.
+            ids: Optional list of ids to associate with the documents.
+            batch_size: Batch size to use when upserting the embeddings.
+                Upstash supports at max 1000 vectors per request.
+            embedding_chunk_size: Chunk size to use when embedding the texts.
+
+        Returns:
+            List of ids from adding the texts into the vectorstore.
+
+        """
+        texts = [doc.page_content for doc in documents]
+        metadatas = [doc.metadata for doc in documents]
+
+        return self.add_texts(
+            texts,
+            metadatas=metadatas,
+            batch_size=batch_size,
+            ids=ids,
+            embedding_chunk_size=embedding_chunk_size,
+        )
+
+    async def aadd_documents(
+        self,
+        documents: Iterable[Document],
+        ids: Optional[List[str]] = None,
+        batch_size: int = 32,
+        embedding_chunk_size: int = 1000,
+    ) -> List[str]:
+        """
+        Get the embeddings for the documents and add them to the vectorstore.
+
+        Documents are sent to the embeddings object
+        in batches of size `embedding_chunk_size`.
+        The embeddings are then upserted into the vectorstore
+        in batches of size `batch_size`.
+
+        Args:
+            documents: Iterable of Documents to add to the vectorstore.
+            ids: Optional list of ids to associate with the documents.
+            batch_size: Batch size to use when upserting the embeddings.
+                Upstash supports at max 1000 vectors per request.
+            embedding_chunk_size: Chunk size to use when embedding the texts.
+
+        Returns:
+            List of ids from adding the texts into the vectorstore.
+
+        """
+        texts = [doc.page_content for doc in documents]
+        metadatas = [doc.metadata for doc in documents]
+
+        return await self.aadd_texts(
+            texts,
+            metadatas=metadatas,
+            ids=ids,
+            batch_size=batch_size,
+            embedding_chunk_size=embedding_chunk_size,
+        )
+
+    def add_texts(
+        self,
+        texts: Iterable[str],
+        metadatas: Optional[List[dict]] = None,
+        ids: Optional[List[str]] = None,
+        batch_size: int = 32,
+        embedding_chunk_size: int = 1000,
+    ) -> List[str]:
+        """
+        Get the embeddings for the texts and add them to the vectorstore.
+
+        Texts are sent to the embeddings object
+        in batches of size `embedding_chunk_size`.
+        The embeddings are then upserted into the vectorstore
+        in batches of size `batch_size`.
+
+        Args:
+            texts: Iterable of strings to add to the vectorstore.
+            metadatas: Optional list of metadatas associated with the texts.
+            ids: Optional list of ids to associate with the texts.
+            batch_size: Batch size to use when upserting the embeddings.
+                Upstash supports at max 1000 vectors per request.
+            embedding_chunk_size: Chunk size to use when embedding the texts.
+
+        Returns:
+            List of ids from adding the texts into the vectorstore.
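+
+        Example:
+            .. code-block:: python
+
+                # Illustrative sketch: `vector_store` is assumed to be an
+                # initialized UpstashVectorStore and `texts` a list of strings
+                ids = vector_store.add_texts(
+                    texts,
+                    batch_size=100,
+                    embedding_chunk_size=200,
+                )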
+
+        """
+        texts = list(texts)
+        ids = ids or [str(uuid.uuid4()) for _ in texts]
+
+        # Copy metadatas to avoid modifying the original documents
+        if metadatas:
+            metadatas = [m.copy() for m in metadatas]
+        else:
+            metadatas = [{} for _ in texts]
+
+        # Add text to metadata
+        for metadata, text in zip(metadatas, texts):
+            metadata[self._text_key] = text
+
+        for i in range(0, len(texts), embedding_chunk_size):
+            chunk_texts = texts[i : i + embedding_chunk_size]
+            chunk_ids = ids[i : i + embedding_chunk_size]
+            chunk_metadatas = metadatas[i : i + embedding_chunk_size]
+            embeddings = self._embed_documents(chunk_texts)
+
+            for batch in batch_iterate(
+                batch_size, zip(chunk_ids, embeddings, chunk_metadatas)
+            ):
+                self._index.upsert(vectors=batch)
+
+        return ids
+
+    async def aadd_texts(
+        self,
+        texts: Iterable[str],
+        metadatas: Optional[List[dict]] = None,
+        ids: Optional[List[str]] = None,
+        batch_size: int = 32,
+        embedding_chunk_size: int = 1000,
+    ) -> List[str]:
+        """
+        Get the embeddings for the texts and add them to the vectorstore.
+
+        Texts are sent to the embeddings object
+        in batches of size `embedding_chunk_size`.
+        The embeddings are then upserted into the vectorstore
+        in batches of size `batch_size`.
+
+        Args:
+            texts: Iterable of strings to add to the vectorstore.
+            metadatas: Optional list of metadatas associated with the texts.
+            ids: Optional list of ids to associate with the texts.
+            batch_size: Batch size to use when upserting the embeddings.
+                Upstash supports at max 1000 vectors per request.
+            embedding_chunk_size: Chunk size to use when embedding the texts.
+
+        Returns:
+            List of ids from adding the texts into the vectorstore.
+
+        """
+        texts = list(texts)
+        ids = ids or [str(uuid.uuid4()) for _ in texts]
+
+        # Copy metadatas to avoid modifying the original documents
+        if metadatas:
+            metadatas = [m.copy() for m in metadatas]
+        else:
+            metadatas = [{} for _ in texts]
+
+        # Add text to metadata
+        for metadata, text in zip(metadatas, texts):
+            metadata[self._text_key] = text
+
+        for i in range(0, len(texts), embedding_chunk_size):
+            chunk_texts = texts[i : i + embedding_chunk_size]
+            chunk_ids = ids[i : i + embedding_chunk_size]
+            chunk_metadatas = metadatas[i : i + embedding_chunk_size]
+            embeddings = self._embed_documents(chunk_texts)
+
+            for batch in batch_iterate(
+                batch_size, zip(chunk_ids, embeddings, chunk_metadatas)
+            ):
+                await self._async_index.upsert(vectors=batch)
+
+        return ids
+
+    def similarity_search_with_score(
+        self,
+        query: str,
+        k: int = 4,
+    ) -> List[Tuple[Document, float]]:
+        """Retrieve texts most similar to query and
+        convert the result to `Document` objects.
+
+        Args:
+            query: Text to look up documents similar to.
+            k: Number of Documents to return. Defaults to 4.
+
+        Returns:
+            List of Documents most similar to the query and score for each
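+
+        Example:
+            .. code-block:: python
+
+                # Illustrative sketch: `vectorstore` is assumed to be an
+                # initialized UpstashVectorStore
+                results = vectorstore.similarity_search_with_score(
+                    "what did the president say?", k=4
+                )
+                for doc, score in results:
+                    print(doc.page_content, score)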
+        """
+        return self.similarity_search_by_vector_with_score(
+            self._embed_query(query), k=k
+        )
+
+    async def asimilarity_search_with_score(
+        self,
+        query: str,
+        k: int = 4,
+    ) -> List[Tuple[Document, float]]:
+        """Retrieve texts most similar to query and
+        convert the result to `Document` objects.
+
+        Args:
+            query: Text to look up documents similar to.
+            k: Number of Documents to return. Defaults to 4.
+
+        Returns:
+            List of Documents most similar to the query and score for each
+        """
+        return await self.asimilarity_search_by_vector_with_score(
+            self._embed_query(query), k=k
+        )
+
+    def _process_results(self, results: List) -> List[Tuple[Document, float]]:
+        docs = []
+        for res in results:
+            metadata = res.metadata
+            if metadata and self._text_key in metadata:
+                text = metadata.pop(self._text_key)
+                doc = Document(page_content=text, metadata=metadata)
+                docs.append((doc, res.score))
+            else:
+                logger.warning(
+                    f"Found document with no `{self._text_key}` key. Skipping."
+                )
+        return docs
+
+    def similarity_search_by_vector_with_score(
+        self,
+        embedding: List[float],
+        k: int = 4,
+    ) -> List[Tuple[Document, float]]:
+        """Return texts whose embedding is closest to the given embedding"""
+
+        results = self._index.query(
+            vector=embedding,
+            top_k=k,
+            include_metadata=True,
+        )
+
+        return self._process_results(results)
+
+    async def asimilarity_search_by_vector_with_score(
+        self,
+        embedding: List[float],
+        k: int = 4,
+    ) -> List[Tuple[Document, float]]:
+        """Return texts whose embedding is closest to the given embedding"""
+
+        results = await self._async_index.query(
+            vector=embedding,
+            top_k=k,
+            include_metadata=True,
+        )
+
+        return self._process_results(results)
+
+    def similarity_search(
+        self,
+        query: str,
+        k: int = 4,
+    ) -> List[Document]:
+        """Return documents most similar to query.
+
+        Args:
+            query: Text to look up documents similar to.
+            k: Number of Documents to return. Defaults to 4.
+
+        Returns:
+            List of Documents most similar to the query
+        """
+        docs_and_scores = self.similarity_search_with_score(query, k=k)
+        return [doc for doc, _ in docs_and_scores]
+
+    async def asimilarity_search(
+        self,
+        query: str,
+        k: int = 4,
+    ) -> List[Document]:
+        """Return documents most similar to query.
+
+        Args:
+            query: Text to look up documents similar to.
+            k: Number of Documents to return. Defaults to 4.
+
+        Returns:
+            List of Documents most similar to the query
+        """
+        docs_and_scores = await self.asimilarity_search_with_score(query, k=k)
+        return [doc for doc, _ in docs_and_scores]
+
+    def similarity_search_by_vector(
+        self, embedding: List[float], k: int = 4
+    ) -> List[Document]:
+        """Return documents closest to the given embedding.
+
+        Args:
+            embedding: Embedding to look up documents similar to.
+            k: Number of Documents to return. Defaults to 4.
+
+        Returns:
+            List of Documents most similar to the query
+        """
+        docs_and_scores = self.similarity_search_by_vector_with_score(embedding, k=k)
+        return [doc for doc, _ in docs_and_scores]
+
+    async def asimilarity_search_by_vector(
+        self, embedding: List[float], k: int = 4
+    ) -> List[Document]:
+        """Return documents closest to the given embedding.
+
+        Args:
+            embedding: Embedding to look up documents similar to.
+            k: Number of Documents to return. Defaults to 4.
+
+        Returns:
+            List of Documents most similar to the query
+        """
+        docs_and_scores = await self.asimilarity_search_by_vector_with_score(
+            embedding, k=k
+        )
+        return [doc for doc, _ in docs_and_scores]
+
+    def _similarity_search_with_relevance_scores(
+        self,
+        query: str,
+        k: int = 4,
+        **kwargs: Any,
+    ) -> List[Tuple[Document, float]]:
+        """
+        Since Upstash always returns relevance scores, default implementation is used.
+ """ + return self.similarity_search_with_score(query, k=k, **kwargs) + + async def _asimilarity_search_with_relevance_scores( + self, + query: str, + k: int = 4, + **kwargs: Any, + ) -> List[Tuple[Document, float]]: + """ + Since Upstash always returns relevance scores, default implementation is used. + """ + return await self.asimilarity_search_with_score(query, k=k, **kwargs) + + def max_marginal_relevance_search_by_vector( + self, + embedding: List[float], + k: int = 4, + fetch_k: int = 20, + lambda_mult: float = 0.5, + ) -> List[Document]: + """Return docs selected using the maximal marginal relevance. + + Maximal marginal relevance optimizes for similarity to query AND diversity + among selected documents. + + Args: + embedding: Embedding to look up documents similar to. + k: Number of Documents to return. Defaults to 4. + fetch_k: Number of Documents to fetch to pass to MMR algorithm. + lambda_mult: Number between 0 and 1 that determines the degree + of diversity among the results with 0 corresponding + to maximum diversity and 1 to minimum diversity. + Defaults to 0.5. + Returns: + List of Documents selected by maximal marginal relevance. + """ + results = self._index.query( + vector=embedding, + top_k=fetch_k, + include_vectors=True, + include_metadata=True, + ) + mmr_selected = maximal_marginal_relevance( + np.array([embedding], dtype=np.float32), + [item.vector for item in results], + k=k, + lambda_mult=lambda_mult, + ) + selected = [results[i].metadata for i in mmr_selected] + return [ + Document(page_content=metadata.pop((self._text_key)), metadata=metadata) # type: ignore since include_metadata=True + for metadata in selected + ] + + async def amax_marginal_relevance_search_by_vector( + self, + embedding: List[float], + k: int = 4, + fetch_k: int = 20, + lambda_mult: float = 0.5, + ) -> List[Document]: + """Return docs selected using the maximal marginal relevance. + + Maximal marginal relevance optimizes for similarity to query AND diversity + among selected documents. + + Args: + embedding: Embedding to look up documents similar to. + k: Number of Documents to return. Defaults to 4. + fetch_k: Number of Documents to fetch to pass to MMR algorithm. + lambda_mult: Number between 0 and 1 that determines the degree + of diversity among the results with 0 corresponding + to maximum diversity and 1 to minimum diversity. + Defaults to 0.5. + Returns: + List of Documents selected by maximal marginal relevance. + """ + results = await self._async_index.query( + vector=embedding, + top_k=fetch_k, + include_vectors=True, + include_metadata=True, + ) + mmr_selected = maximal_marginal_relevance( + np.array([embedding], dtype=np.float32), + [item.vector for item in results], + k=k, + lambda_mult=lambda_mult, + ) + selected = [results[i].metadata for i in mmr_selected] + return [ + Document(page_content=metadata.pop((self._text_key)), metadata=metadata) # type: ignore since include_metadata=True + for metadata in selected + ] + + def max_marginal_relevance_search( + self, + query: str, + k: int = 4, + fetch_k: int = 20, + lambda_mult: float = 0.5, + ) -> List[Document]: + """Return docs selected using the maximal marginal relevance. + + Maximal marginal relevance optimizes for similarity to query AND diversity + among selected documents. + + Args: + query: Text to look up documents similar to. + k: Number of Documents to return. Defaults to 4. + fetch_k: Number of Documents to fetch to pass to MMR algorithm. 
+ lambda_mult: Number between 0 and 1 that determines the degree + of diversity among the results with 0 corresponding + to maximum diversity and 1 to minimum diversity. + Defaults to 0.5. + Returns: + List of Documents selected by maximal marginal relevance. + """ + embedding = self._embed_query(query) + return self.max_marginal_relevance_search_by_vector( + embedding=embedding, k=k, fetch_k=fetch_k, lambda_mult=lambda_mult + ) + + async def amax_marginal_relevance_search( + self, + query: str, + k: int = 4, + fetch_k: int = 20, + lambda_mult: float = 0.5, + ) -> List[Document]: + """Return docs selected using the maximal marginal relevance. + + Maximal marginal relevance optimizes for similarity to query AND diversity + among selected documents. + + Args: + query: Text to look up documents similar to. + k: Number of Documents to return. Defaults to 4. + fetch_k: Number of Documents to fetch to pass to MMR algorithm. + lambda_mult: Number between 0 and 1 that determines the degree + of diversity among the results with 0 corresponding + to maximum diversity and 1 to minimum diversity. + Defaults to 0.5. + Returns: + List of Documents selected by maximal marginal relevance. + """ + embedding = self._embed_query(query) + return await self.amax_marginal_relevance_search_by_vector( + embedding=embedding, k=k, fetch_k=fetch_k, lambda_mult=lambda_mult + ) + + @classmethod + def from_texts( + cls, + texts: List[str], + embedding: Embeddings, + metadatas: Optional[List[dict]] = None, + ids: Optional[List[str]] = None, + embedding_chunk_size: int = 1000, + batch_size: int = 32, + text_key: str = "text", + index: Optional[Index] = None, + async_index: Optional[AsyncIndex] = None, + index_url: Optional[str] = None, + index_token: Optional[str] = None, + ) -> UpstashVectorStore: + """Create a new UpstashVectorStore from a list of texts. + + Example: + .. code-block:: python + from langchain_community.vectorstores.upstash import UpstashVectorStore + from langchain_community.embeddings import OpenAIEmbeddings + + embeddings = OpenAIEmbeddings() + vector_store = UpstashVectorStore.from_texts( + texts, + embeddings, + ) + """ + vector_store = cls( + embedding=embedding, + text_key=text_key, + index=index, + async_index=async_index, + index_url=index_url, + index_token=index_token, + ) + + vector_store.add_texts( + texts, + metadatas=metadatas, + ids=ids, + batch_size=batch_size, + embedding_chunk_size=embedding_chunk_size, + ) + return vector_store + + def delete( + self, + ids: Optional[List[str]] = None, + delete_all: Optional[bool] = None, + batch_size=1000, + ) -> None: + """Delete by vector IDs + + Args: + ids: List of ids to delete. + delete_all: Delete all vectors in the index. + batch_size: Batch size to use when deleting the embeddings. + Upstash supports at max 1000 deletions per request. + """ + + if delete_all: + self._index.reset() + elif ids is not None: + for batch in batch_iterate(batch_size, ids): + self._index.delete(ids=batch) + else: + raise ValueError("Either ids or delete_all should be provided") + + return None + + async def adelete( + self, + ids: Optional[List[str]] = None, + delete_all: Optional[bool] = None, + batch_size=1000, + ) -> None: + """Delete by vector IDs + + Args: + ids: List of ids to delete. + delete_all: Delete all vectors in the index. + batch_size: Batch size to use when deleting the embeddings. + Upstash supports at max 1000 deletions per request. 
+        """
+
+        if delete_all:
+            await self._async_index.reset()
+        elif ids is not None:
+            for batch in batch_iterate(batch_size, ids):
+                await self._async_index.delete(ids=batch)
+        else:
+            raise ValueError("Either ids or delete_all should be provided")
+
+        return None
+
+    def info(self):
+        """Get statistics about the index.
+
+        Returns:
+            - total number of vectors
+            - total number of vectors waiting to be indexed
+            - total size of the index on disk in bytes
+            - dimension count for the index
+            - similarity function selected for the index
+        """
+        return self._index.info()
+
+    async def ainfo(self):
+        """Get statistics about the index.
+
+        Returns:
+            - total number of vectors
+            - total number of vectors waiting to be indexed
+            - total size of the index on disk in bytes
+            - dimension count for the index
+            - similarity function selected for the index
+        """
+        return await self._async_index.info()
diff --git a/libs/community/tests/integration_tests/.env.example b/libs/community/tests/integration_tests/.env.example
index 99be838353376..6effabe6afabb 100644
--- a/libs/community/tests/integration_tests/.env.example
+++ b/libs/community/tests/integration_tests/.env.example
@@ -50,3 +50,7 @@ POWERBI_NUMROWS=_num_rows_in_your_test_table
 
 # MongoDB Atlas Vector Search
 MONGODB_ATLAS_URI=your_mongodb_atlas_connection_string
+
+# Upstash Vector
+UPSTASH_VECTOR_REST_URL=your_upstash_vector_rest_url
+UPSTASH_VECTOR_REST_TOKEN=your_upstash_vector_rest_token
diff --git a/libs/community/tests/integration_tests/vectorstores/test_upstash.py b/libs/community/tests/integration_tests/vectorstores/test_upstash.py
new file mode 100644
index 0000000000000..a2cf155c774b5
--- /dev/null
+++ b/libs/community/tests/integration_tests/vectorstores/test_upstash.py
@@ -0,0 +1,243 @@
+"""Test Upstash Vector functionality."""
+
+import os
+from time import sleep
+
+import pytest
+from langchain_core.documents import Document
+from upstash_vector import AsyncIndex, Index
+
+from langchain_community.vectorstores.upstash import UpstashVectorStore
+from tests.integration_tests.vectorstores.fake_embeddings import (
+    FakeEmbeddings,
+)
+
+
+@pytest.fixture(scope="function", autouse=True)
+def fixture():
+    index = Index.from_env()
+    index.reset()
+    wait_for_indexing(index)
+
+
+def wait_for_indexing(store: UpstashVectorStore):
+    while store.info().pending_vector_count != 0:
+        # Wait for indexing to complete
+        sleep(0.5)
+
+
+def test_upstash_simple_insert() -> None:
+    """Test end to end construction and search."""
+    texts = ["foo", "bar", "baz"]
+    store = UpstashVectorStore.from_texts(texts=texts, embedding=FakeEmbeddings())
+    wait_for_indexing(store)
+    output = store.similarity_search("foo", k=1)
+    assert output == [Document(page_content="foo")]
+
+
+@pytest.mark.asyncio
+async def test_upstash_simple_insert_async() -> None:
+    """Test end to end construction and search."""
+    texts = ["foo", "bar", "baz"]
+    store = UpstashVectorStore.from_texts(texts=texts, embedding=FakeEmbeddings())
+    wait_for_indexing(store)
+    output = await store.asimilarity_search("foo", k=1)
+    assert output == [Document(page_content="foo")]
+
+
+def test_upstash_with_metadatas() -> None:
+    """Test end to end construction and search."""
+    texts = ["foo", "bar", "baz"]
+    metadatas = [{"page": str(i)} for i in range(len(texts))]
+    store = UpstashVectorStore.from_texts(
+        texts=texts,
+        embedding=FakeEmbeddings(),
+        metadatas=metadatas,
+    )
+    wait_for_indexing(store)
+    output = store.similarity_search("foo", k=1)
+    assert output == [Document(page_content="foo", metadata={"page": "0"})]
+
+
+@pytest.mark.asyncio +async def test_upstash_with_metadatas_async() -> None: + """Test end to end construction and search.""" + texts = ["foo", "bar", "baz"] + metadatas = [{"page": str(i)} for i in range(len(texts))] + store = UpstashVectorStore.from_texts( + texts=texts, + embedding=FakeEmbeddings(), + metadatas=metadatas, + ) + wait_for_indexing(store) + output = await store.asimilarity_search("foo", k=1) + assert output == [Document(page_content="foo", metadata={"page": "0"})] + + +def test_upstash_with_metadatas_with_scores() -> None: + """Test end to end construction and scored search.""" + texts = ["foo", "bar", "baz"] + metadatas = [{"page": str(i)} for i in range(len(texts))] + store = UpstashVectorStore.from_texts( + texts=texts, + embedding=FakeEmbeddings(), + metadatas=metadatas, + ) + wait_for_indexing(store) + output = store.similarity_search_with_score("foo", k=1) + assert output == [(Document(page_content="foo", metadata={"page": "0"}), 1.0)] + + +@pytest.mark.asyncio +async def test_upstash_with_metadatas_with_scores_async() -> None: + """Test end to end construction and scored search.""" + texts = ["foo", "bar", "baz"] + metadatas = [{"page": str(i)} for i in range(len(texts))] + store = UpstashVectorStore.from_texts( + texts=texts, + embedding=FakeEmbeddings(), + metadatas=metadatas, + ) + wait_for_indexing(store) + output = await store.asimilarity_search_with_score("foo", k=1) + assert output == [(Document(page_content="foo", metadata={"page": "0"}), 1.0)] + + +def test_upstash_with_metadatas_with_scores_using_vector() -> None: + """Test end to end construction and scored search, using embedding vector.""" + texts = ["foo", "bar", "baz"] + metadatas = [{"page": str(i)} for i in range(len(texts))] + embeddings = FakeEmbeddings() + + store = UpstashVectorStore.from_texts( + texts=texts, + embedding=embeddings, + metadatas=metadatas, + ) + wait_for_indexing(store) + embedded_query = embeddings.embed_query("foo") + output = store.similarity_search_by_vector_with_score(embedding=embedded_query, k=1) + assert output == [(Document(page_content="foo", metadata={"page": "0"}), 1.0)] + + +@pytest.mark.asyncio +async def test_upstash_with_metadatas_with_scores_using_vector_async() -> None: + """Test end to end construction and scored search, using embedding vector.""" + texts = ["foo", "bar", "baz"] + metadatas = [{"page": str(i)} for i in range(len(texts))] + embeddings = FakeEmbeddings() + + store = UpstashVectorStore.from_texts( + texts=texts, + embedding=embeddings, + metadatas=metadatas, + ) + wait_for_indexing(store) + embedded_query = embeddings.embed_query("foo") + output = await store.asimilarity_search_by_vector_with_score( + embedding=embedded_query, k=1 + ) + assert output == [(Document(page_content="foo", metadata={"page": "0"}), 1.0)] + + +def test_upstash_mmr() -> None: + """Test end to end construction and search.""" + texts = ["foo", "bar", "baz"] + store = UpstashVectorStore.from_texts(texts=texts, embedding=FakeEmbeddings()) + wait_for_indexing(store) + output = store.max_marginal_relevance_search("foo", k=1) + assert output == [Document(page_content="foo")] + + +@pytest.mark.asyncio +async def test_upstash_mmr_async() -> None: + """Test end to end construction and search.""" + texts = ["foo", "bar", "baz"] + store = UpstashVectorStore.from_texts(texts=texts, embedding=FakeEmbeddings()) + wait_for_indexing(store) + output = await store.amax_marginal_relevance_search("foo", k=1) + assert output == [Document(page_content="foo")] + + +def 
test_upstash_mmr_by_vector() -> None: + """Test end to end construction and search.""" + texts = ["foo", "bar", "baz"] + embeddings = FakeEmbeddings() + store = UpstashVectorStore.from_texts(texts=texts, embedding=embeddings) + wait_for_indexing(store) + embedded_query = embeddings.embed_query("foo") + output = store.max_marginal_relevance_search_by_vector(embedded_query, k=1) + assert output == [Document(page_content="foo")] + + +@pytest.mark.asyncio +async def test_upstash_mmr_by_vector_async() -> None: + """Test end to end construction and search.""" + texts = ["foo", "bar", "baz"] + embeddings = FakeEmbeddings() + store = UpstashVectorStore.from_texts(texts=texts, embedding=embeddings) + wait_for_indexing(store) + embedded_query = embeddings.embed_query("foo") + output = await store.amax_marginal_relevance_search_by_vector(embedded_query, k=1) + assert output == [Document(page_content="foo")] + + +def test_init_from_index() -> None: + index = Index.from_env() + + store = UpstashVectorStore(index=index) + + assert store.info() is not None + + +@pytest.mark.asyncio +async def test_init_from_async_index() -> None: + index = AsyncIndex.from_env() + + store = UpstashVectorStore(async_index=index) + + assert await store.ainfo() is not None + + +def test_init_from_credentials() -> None: + store = UpstashVectorStore( + index_url=os.environ["UPSTASH_VECTOR_REST_URL"], + index_token=os.environ["UPSTASH_VECTOR_REST_TOKEN"], + ) + + assert store.info() is not None + + +@pytest.mark.asyncio +async def test_init_from_credentials_async() -> None: + store = UpstashVectorStore( + index_url=os.environ["UPSTASH_VECTOR_REST_URL"], + index_token=os.environ["UPSTASH_VECTOR_REST_TOKEN"], + ) + + assert await store.ainfo() is not None + + +def test_upstash_add_documents_no_metadata() -> None: + store = UpstashVectorStore(embedding=FakeEmbeddings()) + store.add_documents([Document(page_content="foo")]) + wait_for_indexing(store) + + search = store.similarity_search("foo") + assert search == [Document(page_content="foo")] + + +def test_upstash_add_documents_mixed_metadata() -> None: + store = UpstashVectorStore(embedding=FakeEmbeddings()) + docs = [ + Document(page_content="foo"), + Document(page_content="bar", metadata={"baz": 1}), + ] + ids = ["0", "1"] + actual_ids = store.add_documents(docs, ids=ids) + wait_for_indexing(store) + assert actual_ids == ids + search = store.similarity_search("foo bar") + assert sorted(search, key=lambda d: d.page_content) == sorted( + docs, key=lambda d: d.page_content + ) diff --git a/libs/community/tests/unit_tests/vectorstores/test_indexing_docs.py b/libs/community/tests/unit_tests/vectorstores/test_indexing_docs.py index 85c5312d1f924..9bfdce9b324de 100644 --- a/libs/community/tests/unit_tests/vectorstores/test_indexing_docs.py +++ b/libs/community/tests/unit_tests/vectorstores/test_indexing_docs.py @@ -75,6 +75,7 @@ def check_compatibility(vector_store: VectorStore) -> bool: "SurrealDBStore", "TileDB", "TimescaleVector", + "UpstashVectorStore", "Vald", "Vearch", "VespaStore", diff --git a/libs/community/tests/unit_tests/vectorstores/test_public_api.py b/libs/community/tests/unit_tests/vectorstores/test_public_api.py index 48b51accdda87..242e427f2168a 100644 --- a/libs/community/tests/unit_tests/vectorstores/test_public_api.py +++ b/libs/community/tests/unit_tests/vectorstores/test_public_api.py @@ -63,6 +63,7 @@ "Tigris", "TimescaleVector", "Typesense", + "UpstashVectorStore", "USearch", "Vald", "Vearch",