diff --git a/docs/docs/concepts/vectorstores.mdx b/docs/docs/concepts/vectorstores.mdx
index 23ed6623c1a5d..0ffdbbf49e3ba 100644
--- a/docs/docs/concepts/vectorstores.mdx
+++ b/docs/docs/concepts/vectorstores.mdx
@@ -11,46 +11,60 @@
 
 This conceptual overview focuses on text-based indexing and retrieval for simplicity.
 However, embedding models can be [multi-modal](https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-multimodal-embeddings)
-and vectorstores can be used to store and retrieve a variety of data types beyond text.
+and vector stores can be used to store and retrieve a variety of data types beyond text.
 :::
 
 ## Overview
 
-Vectorstores are a powerful and efficient way to index and retrieve unstructured data.
-They leverage vector [embeddings](/docs/concepts/embedding_models/), which are numerical representations of unstructured data that capture semantic meaning.
-At their core, vectorstores utilize specialized data structures called vector indices.
-These indices are designed to perform efficient similarity searches over embedding vectors, allowing for rapid retrieval of relevant information based on semantic similarity rather than exact keyword matches.
+Vector stores are specialized data stores that enable indexing and retrieving information based on vector representations.
 
-## Key concept
+These vectors, called [embeddings](/docs/concepts/embedding_models/), capture the semantic meaning of the data that has been embedded.
+
+Vector stores are frequently used to search over unstructured data, such as text, images, and audio, to retrieve relevant information based on semantic similarity rather than exact keyword matches.
 
 ![Vectorstores](/img/vectorstores.png)
 
-There are [many different types of vectorstores](/docs/integrations/vectorstores/).
-LangChain provides a universal interface for working with them, providing standard methods for common operations.
+## Integrations
 
-## Adding documents
+LangChain has a large number of vector store integrations, allowing users to easily switch between different vector store implementations.
 
-Using [Pinecone](https://python.langchain.com/api_reference/pinecone/vectorstores/langchain_pinecone.vectorstore.PineconeVectorStore.html#langchain_pinecone.vectorstores.PineconeVectorStore) as an example, we initialize a vectorstore with the [embedding](/docs/concepts/embedding_models/) model we want to use:
+Please see the [full list of LangChain vector store integrations](/docs/integrations/vectorstores/).
 
-```python
-from pinecone import Pinecone
-from langchain_openai import OpenAIEmbeddings
-from langchain_pinecone import PineconeVectorStore
+## Interface
+
+LangChain provides a standard interface for working with vector stores, so the same code can be used across different vector store implementations.
+
+The interface consists of basic methods for writing, deleting, and searching for documents in the vector store.
+
+The key methods are:
+
+- `add_documents`: Add a list of documents to the vector store.
+- `delete`: Delete a list of documents from the vector store.
+- `similarity_search`: Search for documents similar to a given query.
 
-# Initialize Pinecone
-pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
+## Initialization
+
+Most vector stores in LangChain accept an embedding model as an argument when initializing the vector store.
+
+We will use LangChain's [InMemoryVectorStore](https://python.langchain.com/api_reference/core/vectorstores/langchain_core.vectorstores.in_memory.InMemoryVectorStore.html) implementation to illustrate the API.
+
+```python
+from langchain_core.vectorstores import InMemoryVectorStore
 
 # Initialize with an embedding model
-vector_store = PineconeVectorStore(index=pc.Index(index_name), embedding=OpenAIEmbeddings())
+vector_store = InMemoryVectorStore(embedding=SomeEmbeddingModel())
 ```
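+
+For example, assuming the `langchain-openai` package is installed and an OpenAI API key is configured in the environment, the same store could be initialized with a concrete embedding model (the model name below is purely illustrative):
+
+```python
+from langchain_core.vectorstores import InMemoryVectorStore
+from langchain_openai import OpenAIEmbeddings
+
+# Any embedding model integration can be used here; OpenAI is just one example.
+vector_store = InMemoryVectorStore(embedding=OpenAIEmbeddings(model="text-embedding-3-small"))
+```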
 
-Given a vectorstore, we need the ability to add documents to it.
-The `add_texts` and `add_documents` methods can be used to add texts (strings) and documents (LangChain [Document](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html) objects) to a vectorstore, respectively.
-As an example, we can create a list of [Documents](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html).
+## Adding documents
+
+To add documents, use the `add_documents` method.
+
+This API works with a list of [Document](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html) objects. `Document` objects all have `page_content` and `metadata` attributes, making them a universal way to store unstructured text and associated metadata.
 
 ```python
 from langchain_core.documents import Document
+
 document_1 = Document(
     page_content="I had chocalate chip pancakes and scrambled eggs for breakfast this morning.",
     metadata={"source": "tweet"},
 )
@@ -60,28 +74,26 @@ document_2 = Document(
     page_content="The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.",
     metadata={"source": "news"},
 )
+
 documents = [document_1, document_2]
+
+vector_store.add_documents(documents=documents)
 ```
 
-When we use the `add_documents` method to add the documents to the vectorstore, the vectorstore will use the provided embedding model to create an embedding of each document.
-What happens if we add the same document twice?
-Many vectorstores support [`upsert`](https://docs.pinecone.io/guides/data/upsert-data) functionality, which combines the functionality of inserting and updating records.
-To use this, we simply supply a unique identifier for each document when we add it to the vectorstore using `add_documents` or `add_texts`.
-If the record doesn't exist, it inserts a new record.
-If the record already exists, it updates the existing record.
+You should usually provide IDs for the documents you add to the vector store, so
+that you can update an existing document instead of adding the same document multiple times.
 
 ```python
-# Given a list of documents and a vector store
-uuids = [str(uuid4()) for _ in range(len(documents))]
-vector_store.add_documents(documents=documents, ids=uuids)
+vector_store.add_documents(documents=documents, ids=["doc1", "doc2"])
 ```
 
-:::info[Further reading]
+## Delete
 
-* See the [full list of LangChain vectorstore integrations](/docs/integrations/vectorstores/).
-* See Pinecone's [documentation](https://docs.pinecone.io/guides/data/upsert-data) on the `upsert` method.
+To delete documents, use the `delete` method, which takes a list of document IDs to delete.
 
-:::
+```python
+vector_store.delete(ids=["doc1"])
+```
 
 ## Search
 
@@ -98,16 +110,8 @@ A critical advantage of embeddings vectors is they can be compared using many si
 - **Euclidean Distance**: Measures the straight-line distance between two points.
 - **Dot Product**: Measures the projection of one vector onto another.
 
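+For intuition, here is a minimal, illustrative sketch (assuming NumPy is available) of how these metrics could be computed for two toy embedding vectors; vector stores perform comparisons like these internally, at much larger scale:
+
+```python
+import numpy as np
+
+# Two toy embedding vectors (real embeddings typically have hundreds or thousands of dimensions)
+v1 = np.array([1.0, 2.0, 3.0])
+v2 = np.array([2.0, 3.0, 4.0])
+
+cosine_similarity = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
+euclidean_distance = np.linalg.norm(v1 - v2)
+dot_product = np.dot(v1, v2)
+```
+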
-The choice of similarity metric can sometimes be selected when initializing the vectorstore.
-As an example, Pinecone allows the user to select the [similarity metric on index creation](/docs/integrations/vectorstores/pinecone/#initialization).
-
-```python
-pc.create_index(
-    name=index_name,
-    dimension=3072,
-    metric="cosine",
-)
-```
+The choice of similarity metric can sometimes be selected when initializing the vector store. Please refer
+to the documentation of the specific vector store you are using to see what similarity metrics are supported.
 
 :::info[Further reading]
 
@@ -153,10 +157,11 @@ This allows structured filters to reduce the size of the similarity search space
 2. **Metadata search**: Apply structured query to the metadata, filtering specific documents.
 
 Vectorstore support for metadata filtering is typically dependent on the underlying vector store implementation.
+
 Here is example usage with [Pinecone](/docs/integrations/vectorstores/pinecone/#query-directly), showing that we filter for all documents that have the metadata key `source` with value `tweet`.
 
 ```python
-results = vectorstore.similarity_search(
+vectorstore.similarity_search(
     "LangChain provides abstractions to make working with LLMs easy",
     k=2,
     filter={"source": "tweet"},