feat: Add PineconeDocumentStore (v2) #43
Closed
Related Issues:
Fixes #6055
Pinecone Document Store
Based on the basic contract that all DocumentStores are expected to follow, we have implemented the following classes for the document store.
Indexing Pipeline
First, the document store node is added, which initializes the document store. The documents are passed as a list of Document objects and added to the document store using the write_documents method. Next, the embedder node is added; it uses the DocumentEmbedder to create the corresponding embedding for each document. The documents are then stored in the document store together with their embeddings.
Indexing pipelines prepare your files for search. Their main objective is to convert your files into Haystack Documents, so that they can be saved in a DocumentStore.
During indexing, we do not use any Retriever, but rather a DocumentEmbedder. This class accepts a model name and simply adds embeddings to the Documents it receives.
Embedders encode a list of data points (strings, images, etc.) into a list of vectors (i.e., the embeddings) using a model. Embedders are used in both indexing pipelines (to encode documents) and query pipelines (to encode the query).
When embedding documents, the Embedder receives a list of Document objects as input. For each item in the list, the corresponding vectors are computed and stored in the embedding field of the item itself. The list is then returned as the output.
The Document class contains the query and the embedding. In v1 the Document class and the embedding were separate.
When working with documents, it is also possible to compute embeddings for the document's metadata. In this case, the Embedder is responsible for performing any text manipulation needed to prepare for the actual embedding process.
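The indexing flow described above can be sketched roughly as follows. This is a minimal illustration, not the Haystack API: `Doc`, `toy_embed`, and `embed_documents` are hypothetical stand-ins, and the "model" is a toy function that only exists to make the example runnable.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a Document: content, metadata, optional embedding."""
    content: str
    meta: Dict[str, str] = field(default_factory=dict)
    embedding: Optional[List[float]] = None


def toy_embed(text: str, dim: int = 4) -> List[float]:
    """Toy 'model' (assumption): a deterministic vector built from character codes."""
    vec = [0.0] * dim
    for i, ch in enumerate(text):
        vec[i % dim] += ord(ch) / 1000.0
    return vec


def embed_documents(docs: List[Doc], meta_fields: Optional[List[str]] = None) -> List[Doc]:
    """Compute one vector per document, store it on the document, return the list."""
    for doc in docs:
        # Optionally prepend selected metadata values to the text before embedding.
        parts = [doc.meta[key] for key in (meta_fields or []) if key in doc.meta]
        parts.append(doc.content)
        doc.embedding = toy_embed("\n".join(parts))
    return docs


docs = embed_documents(
    [Doc("Pinecone is a vector database.", {"title": "Intro"})],
    meta_fields=["title"],
)
```

The embedding is written onto each item and the same list is returned, which mirrors the "stored in the embedding field of the item itself" behaviour described above.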
Query Pipeline
First, the document store node, which holds the stored documents, is added. Then the embedding node is added; its TextEmbedder class creates the embedding of the query. The filter_documents method retrieves the documents that match the specified filters. Finally, the retriever node is added, which calls the run method to retrieve the relevant documents from the document store.
First the documents with the query and embeddings are created.
During querying, the first step is no longer a Retriever but a StringEmbedder. It converts the query into its embedding representation and forwards it to a Retriever that expects it. When embedding queries, the Embedder receives a list of strings as input and transforms them into a list of vectors, which is returned as output.
The embedders were part of the retriever in v1. In v2, the embedders are separate components used for creating embeddings.
Retrievers retrieve Documents from the DocumentStores. Each Retriever is specific to, and aware of, the Store being used. For example, the PineconeRetriever is used with the PineconeDocumentStore. Retrievers are commonly used in query pipelines (not in indexing pipelines).
Methods implemented:
get_index_stats: It returns statistics about the index's contents, including the vector count per namespace and the number of dimensions. New in v2.
count_documents: Returns the number of documents present in the document store. Similar to get_all_documents in v1. New in v2.
filter_documents: Takes the query and its embedding as input and returns a list of the documents that match the provided filters. Added: Filtering by the query embedding, similar to update_embedding in v1.
write_documents: Takes a list of Document objects as input, whereas in v1 it took the embeddings and the documents as input. If a document is passed without an embedding, it is still written to Pinecone, using a dummy embedding whose value is specified by DOCUMENT_WITHOUT_EMBEDDING.
get_documents_by_id: Retrieves documents from the index using their IDs. Since Pinecone does not support headers, we removed the headers parameter that was present in v1.
delete_documents: Deletes all the documents with matching document_ids from the document store. Since Pinecone does not support headers, we removed the headers parameter that was present in v1.
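The dummy-embedding fallback described for write_documents can be sketched as below. This is an illustration under assumptions, not the code from this PR: `placeholder_embedding` and `to_upsert_records` are hypothetical helpers, the placeholder value is arbitrary (the actual value of DOCUMENT_WITHOUT_EMBEDDING is not shown here), and storing the content under a `content` metadata key is also an assumption.

```python
from typing import Any, Dict, List, Optional

INDEX_DIMENSION = 4  # assumed dimension of the index in this sketch


def placeholder_embedding(dimension: int) -> List[float]:
    """Hypothetical dummy vector; the real DOCUMENT_WITHOUT_EMBEDDING value may differ."""
    return [0.5] * dimension


def to_upsert_records(
    docs: List[Dict[str, Any]], dimension: int = INDEX_DIMENSION
) -> List[Dict[str, Any]]:
    """Turn document dicts into records with 'id', 'values', and 'metadata' keys."""
    records = []
    for doc in docs:
        embedding: Optional[List[float]] = doc.get("embedding")
        if embedding is None:
            # No embedding was passed: fall back to the placeholder so the
            # document can still be written.
            embedding = placeholder_embedding(dimension)
        records.append(
            {"id": doc["id"], "values": embedding, "metadata": {"content": doc["content"]}}
        )
    return records


records = to_upsert_records([
    {"id": "a", "content": "has a vector", "embedding": [0.1, 0.2, 0.3, 0.4]},
    {"id": "b", "content": "no vector yet"},
])
```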
Retriever
The PineconeRetriever class retrieves documents from the PineconeDocumentStore. It is similar to the BaseRetriever class of the retriever node in v1. In v1, the BaseRetriever took a string as input and returned the documents most relevant to the query. The PineconeRetriever takes the embedding of the query as input and returns a dictionary of the retrieved documents.
Methods implemented:
run: Takes the embedding of the query, filters, the top_k value, and scale score as input and returns a dictionary of the retrieved documents. New in v2.
to_dict: Serializes the Retriever component to a dictionary. New in v2.
from_dict: Deserializes the Retriever component from a dictionary. New in v2.
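To make the shape of these three methods concrete, here is a minimal sketch with an in-memory store. `ToyStore` and `ToyRetriever` are hypothetical names, the serialized dictionary layout is an assumption, filtering is simplified to exact key/value matches, and the scale score option is left out.

```python
from typing import Any, Dict, List, Optional

Record = Dict[str, Any]


class ToyStore:
    """Hypothetical in-memory store; each record has 'id', 'embedding', and 'meta'."""

    def __init__(self, records: List[Record]):
        self.records = records

    def query_by_embedding(
        self, embedding: List[float], top_k: int, filters: Optional[Dict[str, Any]] = None
    ) -> List[Record]:
        # Keep records whose metadata equals every key/value pair in the filter.
        wanted = filters or {}
        kept = [r for r in self.records if all(r["meta"].get(k) == v for k, v in wanted.items())]
        # Rank by dot product with the query embedding, highest first.
        kept.sort(key=lambda r: sum(a * b for a, b in zip(embedding, r["embedding"])), reverse=True)
        return kept[:top_k]


class ToyRetriever:
    """Hypothetical retriever: takes a query embedding, returns a dictionary of documents."""

    def __init__(self, store: ToyStore, filters: Optional[Dict[str, Any]] = None, top_k: int = 10):
        self.store = store
        self.filters = filters
        self.top_k = top_k

    def run(
        self,
        query_embedding: List[float],
        filters: Optional[Dict[str, Any]] = None,
        top_k: Optional[int] = None,
    ) -> Dict[str, List[Record]]:
        # Arguments passed at run time override the values given at construction.
        docs = self.store.query_by_embedding(
            query_embedding,
            top_k=self.top_k if top_k is None else top_k,
            filters=self.filters if filters is None else filters,
        )
        return {"documents": docs}

    def to_dict(self) -> Dict[str, Any]:
        # Only the init parameters are serialized; the store is re-attached on load.
        return {"type": "ToyRetriever", "init_parameters": {"filters": self.filters, "top_k": self.top_k}}

    @classmethod
    def from_dict(cls, data: Dict[str, Any], store: ToyStore) -> "ToyRetriever":
        return cls(store=store, **data["init_parameters"])


store = ToyStore([
    {"id": "1", "embedding": [1.0, 0.0], "meta": {"lang": "en"}},
    {"id": "2", "embedding": [0.0, 1.0], "meta": {"lang": "en"}},
    {"id": "3", "embedding": [0.9, 0.1], "meta": {"lang": "de"}},
])
retriever = ToyRetriever(store, top_k=2)
result = retriever.run([1.0, 0.0], filters={"lang": "en"})
restored = ToyRetriever.from_dict(retriever.to_dict(), store)
```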
Pinecone Concepts
Pinecone supports dense, sparse, and sparse-dense vectors. Currently, Haystack supports only dense vectors, with support for sparse and sparse-dense vectors forthcoming.
Adding vectors:
Create a new index. At this time we need to specify the index name, metric ('euclidean', 'cosine', or 'dotproduct'), and dimension.
Wait for the index to be created.
Generate vectors using any embedding model.
Each vector generated by the embedding model can be augmented with metadata.
Pinecone supports up to 40 KB of metadata per vector.
Upsert the vectors, typically in batches.
Pinecone allows you to partition the records in an index into namespaces, which can be specified during the upsert.
Queries and other operations can then be restricted by namespace.
After indexing the data, we can check the number of vectors created.
Once our index is populated, queries can be run.
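The "upsert in batches" step above can be sketched without a live index. `in_batches` is a hypothetical helper, and the `sent` list stands in for the index; with a real client, each batch would be passed to the client's upsert call (optionally with a namespace) at the point where the sketch appends to `sent`.

```python
from typing import Any, Dict, Iterator, List


def in_batches(records: List[Dict[str, Any]], batch_size: int) -> Iterator[List[Dict[str, Any]]]:
    """Yield consecutive slices of at most batch_size records."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


# Five toy records in the 'id' / 'values' / 'metadata' shape.
records = [
    {"id": f"doc-{i}", "values": [float(i), 0.0], "metadata": {"source": "demo"}}
    for i in range(5)
]

sent: List[List[str]] = []  # stand-in for the index (assumption for this sketch)
for batch in in_batches(records, batch_size=2):
    # A real upsert of `batch` would happen here.
    sent.append([record["id"] for record in batch])
```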
Query
The Query operation searches the index using a query vector. It retrieves the IDs of the most similar records in the index, along with their similarity scores.
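As a rough picture of what such a query returns, the sketch below runs a brute-force cosine-similarity search over a tiny in-memory dictionary. It is only a local stand-in for the hosted search; `cosine` and `top_k_matches` are hypothetical helpers, and the cosine metric is just one of the metrics an index can be created with.

```python
import math
from typing import Dict, List, Tuple


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_matches(index: Dict[str, List[float]], query: List[float], k: int) -> List[Tuple[str, float]]:
    """Return the k (id, score) pairs most similar to the query, best first."""
    scored = [(record_id, cosine(query, vector)) for record_id, vector in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


index = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
matches = top_k_matches(index, query=[1.0, 0.1], k=2)
```

Each match pairs a record ID with its similarity score, which is the same kind of result the Query operation is described as returning.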
Namespaces
Pinecone allows you to partition the records in an index into namespaces.
When upserting vectors into Pinecone you can select a namespace you want to upsert the vectors into.
Namespaces are created automatically the first time they are used to upsert records. If the namespace doesn't exist, it is created implicitly. Namespaces are uniquely identified by a namespace name.
When querying the vector database, you can restrict the query to a specific namespace.
Queries and other operations are then limited to one namespace, so different requests can search different subsets of your index.
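The namespace behaviour described above (implicit creation on first upsert, operations scoped to one namespace) can be mimicked in memory. `NamespacedIndex` is a hypothetical class for illustration only, and using the empty string as the default namespace is an assumption of this sketch.

```python
from collections import defaultdict
from typing import Dict, List


class NamespacedIndex:
    """Hypothetical in-memory index whose records are partitioned by namespace."""

    def __init__(self) -> None:
        # A namespace comes into existence the first time something is upserted into it.
        self._data: Dict[str, Dict[str, List[float]]] = defaultdict(dict)

    def upsert(self, record_id: str, values: List[float], namespace: str = "") -> None:
        self._data[namespace][record_id] = values

    def ids(self, namespace: str = "") -> List[str]:
        # Read with .get so that looking up an unknown namespace does not create it.
        return sorted(self._data.get(namespace, {}))

    def namespaces(self) -> List[str]:
        return sorted(self._data)


idx = NamespacedIndex()
idx.upsert("doc-1", [0.1, 0.2], namespace="news")
idx.upsert("doc-2", [0.3, 0.4], namespace="blogs")
idx.upsert("doc-3", [0.5, 0.6], namespace="news")
```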
Metadata Filtering
Pinecone lets you attach metadata key-value pairs to vectors in an index, and specify filter expressions when you query the index. Metadata can be included in upsert requests as you insert your vectors. You can limit your vector search based on metadata.
Searches with metadata filters retrieve exactly the number of nearest-neighbor results that match the filters.
Metadata filter expressions can be included with queries to limit the search to only vectors matching the filter expression. Filters can be built using a set of operators that can be applied to strings and/or numbers, such as $eq, $ne, $gt, $gte, $lt, $lte, $in, and $nin.
The metadata filters can be combined with AND and OR, using the $and and $or operators.
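To show how filter expressions compose, the sketch below evaluates a filter dictionary against a metadata dictionary locally. `matches` is a hypothetical helper, not part of any client library; it only approximates the service's behaviour, and treating a missing metadata field as a non-match is a simplification made for this example.

```python
from typing import Any, Callable, Dict

COMPARATORS: Dict[str, Callable[[Any, Any], bool]] = {
    "$eq": lambda value, target: value == target,
    "$ne": lambda value, target: value != target,
    "$gt": lambda value, target: value > target,
    "$gte": lambda value, target: value >= target,
    "$lt": lambda value, target: value < target,
    "$lte": lambda value, target: value <= target,
    "$in": lambda value, target: value in target,
    "$nin": lambda value, target: value not in target,
}


def matches(metadata: Dict[str, Any], expr: Dict[str, Any]) -> bool:
    """Evaluate a filter expression against one metadata dict (local approximation)."""
    for key, condition in expr.items():
        if key == "$and":
            ok = all(matches(metadata, sub) for sub in condition)
        elif key == "$or":
            ok = any(matches(metadata, sub) for sub in condition)
        elif key not in metadata:
            ok = False  # simplification: a missing field never matches
        elif isinstance(condition, dict):
            ok = all(COMPARATORS[op](metadata[key], target) for op, target in condition.items())
        else:
            # A bare value is treated as shorthand for {"$eq": value}.
            ok = metadata[key] == condition
        if not ok:
            return False
    return True


# genre is drama or comedy, AND (year >= 2020 OR rating > 8.5)
expr = {
    "$and": [
        {"genre": {"$in": ["drama", "comedy"]}},
        {"$or": [{"year": {"$gte": 2020}}, {"rating": {"$gt": 8.5}}]},
    ]
}
recent_drama = matches({"genre": "drama", "year": 2021, "rating": 7.0}, expr)
```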
Deleting vectors by metadata filter
Vectors can also be deleted by specifying metadata filters. This is done by passing a metadata filter expression to the delete operation. This deletes all vectors matching the metadata filter expression.
This code was written collaboratively with @awinml.