Commit

Update
eyurtsev committed Oct 22, 2024
1 parent 66ebd78 commit eeabfb3
Showing 1 changed file with 49 additions and 44 deletions.
93 changes: 49 additions & 44 deletions docs/docs/concepts/vectorstores.mdx
@@ -11,46 +11,60 @@

This conceptual overview focuses on text-based indexing and retrieval for simplicity.
However, embedding models can be [multi-modal](https://cloud.google.com/vertex-ai/generative-ai/docs/embeddings/get-multimodal-embeddings)
and vectorstores can be used to store and retrieve a variety of data types beyond text.
and vector stores can be used to store and retrieve a variety of data types beyond text.
:::

## Overview

Vectorstores are a powerful and efficient way to index and retrieve unstructured data.
They leverage vector [embeddings](/docs/concepts/embedding_models/), which are numerical representations of unstructured data that capture semantic meaning.
At their core, vectorstores utilize specialized data structures called vector indices.
These indices are designed to perform efficient similarity searches over embedding vectors, allowing for rapid retrieval of relevant information based on semantic similarity rather than exact keyword matches.
Vector stores are specialized data stores that enable indexing and retrieving information based on vector representations.

## Key concept
These vectors, called [embeddings](/docs/concepts/embedding_models/), capture the semantic meaning of data that has been embedded.

Vector stores are frequently used to search over unstructured data, such as text, images, and audio, to retrieve relevant information based on semantic similarity rather than exact keyword matches.

![Vectorstores](/img/vectorstores.png)

There are [many different types of vectorstores](/docs/integrations/vectorstores/).
LangChain provides a universal interface for working with them, providing standard methods for common operations.
## Integrations

## Adding documents
LangChain has a large number of vectorstore integrations, allowing users to easily switch between different vectorstore implementations.

Using [Pinecone](https://python.langchain.com/api_reference/pinecone/vectorstores/langchain_pinecone.vectorstore.PineconeVectorStore.html#langchain_pinecone.vectorstores.PineconeVectorStore) as an example, we initialize a vectorstore with the [embedding](/docs/concepts/embedding_models/) model we want to use:
Please see the [full list of LangChain vectorstore integrations](/docs/integrations/vectorstores/).

```python
from pinecone import Pinecone
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
## Interface

LangChain provides a standard interface for working with vector stores, allowing users to easily switch between different vectorstore implementations.

The interface consists of basic methods for writing, deleting and searching for documents in the vector store.

The key methods are:

- `add_documents`: Add a list of documents to the vector store.
- `delete`: Delete a list of documents from the vector store.
- `similarity_search`: Search for documents similar to a given query.

# Initialize Pinecone
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
## Initialization

Most vector stores in LangChain accept an embedding model as an argument when initializing the vector store.

We will use LangChain's [InMemoryVectorStore](https://python.langchain.com/api_reference/core/vectorstores/langchain_core.vectorstores.in_memory.InMemoryVectorStore.html) implementation to illustrate the API.

```python
from langchain_core.vectorstores import InMemoryVectorStore

# Initialize with an embedding model
vector_store = PineconeVectorStore(index=pc.Index(index_name), embedding=OpenAIEmbeddings())
vector_store = InMemoryVectorStore(embedding=SomeEmbeddingModel())
```
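To make the role of the embedding model concrete, here is a minimal sketch of a store that keeps the embedding function it receives at initialization and uses it for both writes and searches. `ToyVectorStore`, `Doc`, and `toy_embed` are hypothetical names used only for illustration. This is not how `InMemoryVectorStore` is implemented, and the bag-of-words "embedding" is an assumption standing in for a real model.

```python
import math
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def toy_embed(text):
    """Toy 'embedding': word counts. A real store would call an embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ToyVectorStore:
    def __init__(self, embedding):
        self._embedding = embedding  # kept and reused for writes and queries
        self._rows = {}              # id -> (document, vector)

    def add_documents(self, documents, ids):
        # Writing under an existing ID overwrites the old entry (an upsert).
        for doc_id, doc in zip(ids, documents):
            self._rows[doc_id] = (doc, self._embedding(doc.page_content))

    def delete(self, ids):
        for doc_id in ids:
            self._rows.pop(doc_id, None)

    def similarity_search(self, query, k=4):
        query_vector = self._embedding(query)
        ranked = sorted(
            self._rows.values(),
            key=lambda row: cosine(query_vector, row[1]),
            reverse=True,
        )
        return [doc for doc, _ in ranked[:k]]


store = ToyVectorStore(embedding=toy_embed)
store.add_documents(
    [Doc("pancakes and eggs for breakfast"), Doc("cloudy weather forecast for tomorrow")],
    ids=["doc1", "doc2"],
)
top = store.similarity_search("what is the weather forecast", k=1)
```

Because the same embedding function is applied to documents and queries, their vectors live in the same space and can be compared directly.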

Given a vectorstore, we need the ability to add documents to it.
The `add_texts` and `add_documents` methods can be used to add texts (strings) and documents (LangChain [Document](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html) objects) to a vectorstore, respectively.
As an example, we can create a list of [Documents](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html).
## Adding documents

To add documents, use the `add_documents` method.

This API works with a list of [Document](https://api.python.langchain.com/en/latest/documents/langchain_core.documents.base.Document.html) objects.
`Document` objects all have `page_content` and `metadata` attributes, making them a universal way to store unstructured text and associated metadata.

```python
from langchain_core.documents import Document

document_1 = Document(
page_content="I had chocolate chip pancakes and scrambled eggs for breakfast this morning.",
metadata={"source": "tweet"},
@@ -60,28 +74,26 @@ document_2 = Document(
page_content="The weather forecast for tomorrow is cloudy and overcast, with a high of 62 degrees.",
metadata={"source": "news"},
)

documents = [document_1, document_2]

vector_store.add_documents(documents=documents)
```

When we use the `add_documents` method to add the documents to the vectorstore, the vectorstore will use the provided embedding model to create an embedding of each document.
What happens if we add the same document twice?
Many vectorstores support [`upsert`](https://docs.pinecone.io/guides/data/upsert-data) functionality, which combines the functionality of inserting and updating records.
To use this, we simply supply a unique identifier for each document when we add it to the vectorstore using `add_documents` or `add_texts`.
If the record doesn't exist, it inserts a new record.
If the record already exists, it updates the existing record.
You should usually provide IDs for the documents you add to the vector store, so
that instead of adding the same document multiple times, you can update the existing document.

```python
# Given a list of documents and a vector store
uuids = [str(uuid4()) for _ in range(len(documents))]
vector_store.add_documents(documents=documents, ids=uuids)
vector_store.add_documents(documents=documents, ids=["doc1", "doc2"])
```

:::info[Further reading]
## Delete

* See the [full list of LangChain vectorstore integrations](/docs/integrations/vectorstores/).
* See Pinecone's [documentation](https://docs.pinecone.io/guides/data/upsert-data) on the `upsert` method.
To delete documents, use the `delete` method, which takes a list of document IDs to delete.

:::
```python
vector_store.delete(ids=["doc1"])
```

## Search

@@ -98,16 +110,8 @@ A critical advantage of embeddings vectors is they can be compared using many si
- **Euclidean Distance**: Measures the straight-line distance between two points.
- **Dot Product**: Measures the projection of one vector onto another.
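As a rough illustration of how these metrics differ, the sketch below computes the dot product and Euclidean distance, plus cosine similarity for comparison, on small hand-picked vectors. The function names and example vectors are assumptions made for this example; production vector stores use optimized index structures rather than plain Python loops.

```python
import math


def dot_product(a, b):
    """Sum of element-wise products; grows with both alignment and magnitude."""
    return sum(x * y for x, y in zip(a, b))


def euclidean_distance(a, b):
    """Straight-line distance between two points; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Dot product of the length-normalized vectors; ignores magnitude."""
    return dot_product(a, b) / (
        math.sqrt(dot_product(a, a)) * math.sqrt(dot_product(b, b))
    )


u = [1.0, 0.0]
v = [2.0, 0.0]  # same direction as u, twice the length
w = [0.0, 3.0]  # orthogonal to u

same_direction = (cosine_similarity(u, v), euclidean_distance(u, v), dot_product(u, v))
orthogonal = (cosine_similarity(u, w), euclidean_distance(u, w), dot_product(u, w))
```

For `u` and `v`, cosine similarity treats the vectors as identical in direction even though their Euclidean distance is nonzero, which is why the choice of metric can change the ranking of results.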

The choice of similarity metric can sometimes be selected when initializing the vectorstore.
As an example, Pinecone allows the user to select the [similarity metric on index creation](/docs/integrations/vectorstores/pinecone/#initialization).

```python
pc.create_index(
name=index_name,
dimension=3072,
metric="cosine",
)
```
The similarity metric can sometimes be selected when initializing the vector store. Please refer
to the documentation of the specific vector store you are using to see which similarity metrics are supported.

:::info[Further reading]

@@ -153,10 +157,11 @@ This allows structured filters to reduce the size of the similarity search space
2. **Metadata search**: Apply a structured query to the metadata to filter for specific documents.

Vectorstore support for metadata filtering is typically dependent on the underlying vector store implementation.
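The general idea of combining a metadata filter with a similarity search can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the `records` layout, the helper names `matches` and `filtered_search`, the exact-match filter semantics, and the two-dimensional vectors are all invented for the example, and real vector stores each define their own filter syntax.

```python
def dot(a, b):
    """Dot product used here as the similarity score."""
    return sum(x * y for x, y in zip(a, b))


def matches(metadata, metadata_filter):
    """True when every key in the filter equals the record's metadata value."""
    return all(metadata.get(key) == value for key, value in metadata_filter.items())


def filtered_search(records, query_vector, k, metadata_filter):
    # Step 1: narrow the search space with the structured filter.
    candidates = [r for r in records if matches(r["metadata"], metadata_filter)]
    # Step 2: rank only the surviving candidates by similarity.
    candidates.sort(key=lambda r: dot(query_vector, r["vector"]), reverse=True)
    return candidates[:k]


# Hypothetical pre-embedded records; the vectors are made up for illustration.
records = [
    {"id": "doc1", "vector": [0.9, 0.1], "metadata": {"source": "tweet"}},
    {"id": "doc2", "vector": [0.8, 0.2], "metadata": {"source": "news"}},
    {"id": "doc3", "vector": [0.1, 0.9], "metadata": {"source": "tweet"}},
]

hits = filtered_search(records, [1.0, 0.0], k=2, metadata_filter={"source": "tweet"})
hit_ids = [r["id"] for r in hits]
```

Note that `doc2` is excluded even though it scores higher than `doc3`, because the filter is applied before ranking.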

Here is example usage with [Pinecone](/docs/integrations/vectorstores/pinecone/#query-directly), showing that we filter for all documents that have the metadata key `source` with value `tweet`.

```python
results = vectorstore.similarity_search(
vectorstore.similarity_search(
"LangChain provides abstractions to make working with LLMs easy",
k=2,
filter={"source": "tweet"},