feat: Add PineconeDocumentStore (v2) #43

Closed · wants to merge 6 commits

Conversation

@vrunm (Contributor) commented Oct 29, 2023

Related Issues:

Fixes #6055

Pinecone Document store

Based on the basic contract that all DocumentStores are expected to follow, we have implemented the following classes:

class PineconeDocumentStore:
    def __init__(
        self,
        api_key: str,
        environment: str = "us-west1-gcp",
        pinecone_index: Optional["pinecone.Index"] = None,
        embedding_dim: int = 768,
        batch_size: int = 100,
        return_embedding: bool = False,
        index: str = "document",
        similarity: str = "cosine",
        replicas: int = 1,
        shards: int = 1,
        namespace: Optional[str] = None,
        embedding_field: str = "embedding",
        progress_bar: bool = True,
        duplicate_documents: str = "overwrite",
        recreate_index: bool = False,
        metadata_config: Optional[Dict] = None,
        validate_index_sync: bool = True,
    )   
    def get_index_stats(self) -> Dict[str, Any]
    def write_documents(self, documents: List[Document], policy: DuplicatePolicy = DuplicatePolicy.FAIL) -> None
    def get_document_count(
        self,
        filters: Dict[str, Any] = None,
        index: Optional[str] = None,
        only_documents_without_embedding: bool = False,
        headers: Optional[Dict[str, str]] = None,
        namespace: Optional[str] = None,
        type_metadata: Optional[DocTypeMetadata] = None,
    ) -> int
    def get_embedding_count(
        self,
        filters: Optional[Dict[str, Any]] = None,
        index: Optional[str] = None,
        namespace: Optional[str] = None,
    ) -> int
    def count_documents(self) -> int
    def query_by_embedding(
        self,
        query_embedding: List[float],
        filters: Optional[Dict[str, Any]] = None,
        top_k: int = 10,
        scale_score: bool = True,
        return_embedding: Optional[bool] = None,
    ) -> List[Document]
    def filter_documents(self, filters: Optional[Dict[str, Any]] = None) -> List[Document]
    def get_documents_by_id(
        self,
        ids: List[str],
        index: Optional[str] = None,
        batch_size: int = 100,
        return_embedding: Optional[bool] = None,
        namespace: Optional[str] = None,
        include_type_metadata: Optional[bool] = False,
    ) -> List[Document]
    def delete_documents(self, document_ids: List[str]) -> None
    def delete_index(self, index: Optional[str]) -> None
   
class PineconeRetriever: 
    def run(
        self,
        query_embedding: List[float],
        filters: Optional[Dict[str, Any]] = None,
        top_k: Optional[int] = None,
        scale_score: Optional[bool] = None,
        return_embedding: Optional[bool] = None,
    ) -> Dict[str, List[Document]]
    def from_dict(cls, data: Dict[str, Any]) -> "PineconeRetriever" 
    def to_dict(self) -> Dict[str, Any]
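
For illustration, a minimal sketch of instantiating the store with the constructor above; the import path pinecone_haystack is an assumption and may differ in the final package:

import os

from pinecone_haystack import PineconeDocumentStore  # import path is an assumption

document_store = PineconeDocumentStore(
    api_key=os.environ["PINECONE_API_KEY"],
    environment="us-west1-gcp",
    index="document",
    embedding_dim=768,
    similarity="cosine",
)
print(document_store.get_index_stats())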

Indexing Pipeline

First, the document store is initialized. Then the embedder component is added, which creates the corresponding embedding for each document using the DocumentEmbedder. Finally, the documents, now carrying their embeddings, are written to the document store as a list of Document objects using the write_documents method.

Indexing pipelines prepare your files for search. Their main objective is to convert your files into Haystack Documents, so that they can be saved in a DocumentStore.

During indexing, we do not use any Retriever, but rather a DocumentEmbedder. This class accepts a model name and simply adds embeddings to the Documents it receives.

Embedders encode a list of data points (strings, images, etc.) into a list of vectors (i.e., the embeddings) using a model. Embedders are used both in indexing pipelines (to encode documents) and in query pipelines (to encode the query).

When embedding documents, the Embedder receives a list of Document objects as input. For each item in the list, the corresponding vectors are computed and stored in the embedding field of the item itself. The list is then returned as the output.

The Document class now holds both the content and its embedding, whereas in v1 the Document and its embedding were kept separate.

When working with documents, embeddings can also be computed for a document's metadata. In this case, the Embedder is responsible for any text manipulation needed to prepare the input for the actual embedding step.
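
A minimal indexing sketch along these lines, assuming the Haystack 2.x preview components (SentenceTransformersDocumentEmbedder, DocumentWriter) and the document_store created above; exact import paths and parameter names were still in flux at the time:

from haystack import Document, Pipeline
from haystack.components.embedders import SentenceTransformersDocumentEmbedder
from haystack.components.writers import DocumentWriter

docs = [Document(content="Pinecone is a managed vector database.")]

indexing = Pipeline()
# The embedder fills the `embedding` field of each Document it receives.
indexing.add_component("embedder", SentenceTransformersDocumentEmbedder())
# The writer calls document_store.write_documents under the hood.
indexing.add_component("writer", DocumentWriter(document_store=document_store))
indexing.connect("embedder.documents", "writer.documents")

indexing.run({"embedder": {"documents": docs}})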

Query Pipeline

First, the document store holding the indexed documents is added. Then the embedding component is added: the TextEmbedder creates the embedding of the query. The filter_documents method can be used to restrict the search to documents matching specific filters. Finally, the retriever component is added; its run method retrieves the relevant documents from the document store.

During querying, the first step is no longer a Retriever but a TextEmbedder. It converts the query into its embedding representation and forwards it to a Retriever that expects an embedding. When embedding queries, the Embedder receives a list of strings as input and returns a list of vectors as output.

In v1, the embedders were part of the retriever. In v2, the embedders are separate components used for creating embeddings.

Retrievers retrieve Documents from DocumentStores. Each Retriever is specific to, and aware of, the store it works with; for example, PineconeRetriever works with the PineconeDocumentStore. Retrievers are commonly used in query pipelines (not in indexing pipelines).
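
A corresponding query sketch, with assumed component names (SentenceTransformersTextEmbedder) and an assumed PineconeRetriever constructor and import path:

from haystack import Pipeline
from haystack.components.embedders import SentenceTransformersTextEmbedder

from pinecone_haystack import PineconeRetriever  # import path is an assumption

query_pipeline = Pipeline()
query_pipeline.add_component("text_embedder", SentenceTransformersTextEmbedder())
query_pipeline.add_component("retriever", PineconeRetriever(document_store=document_store))
# The query embedding produced by the text embedder feeds the retriever's run method.
query_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")

result = query_pipeline.run({"text_embedder": {"text": "What is Pinecone?"}})
print(result["retriever"]["documents"])  # output key assumed to be "documents"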

Methods implemented:

  • get_index_stats: It returns statistics about the index's contents, including the vector count per namespace and the number of dimensions. New in v2.

  • count_documents: Returns the number of documents present in the document store. Similar to get_all_documents in v1. New in v2.

  • filter_documents: Takes a dictionary of filters as input and returns the list of documents that match them. Filtering by query embedding (similar to update_embedding in v1) is exposed separately via query_by_embedding.

  • write_documents: Now takes only Document objects as input, whereas in v1 it took both the embeddings and the documents. If a document is passed without an embedding, it is still written: because Pinecone requires a vector for every record, we write it with a dummy embedding, specified by the value of DOCUMENT_WITHOUT_EMBEDDING.

  • get_documents_by_id: Retrieves documents from the index by their IDs. Since Pinecone does not support headers, the headers parameter present in v1 has been removed.

  • delete_documents: Deletes all documents with matching document_ids from the document store. Since Pinecone does not support headers, the headers parameter present in v1 has been removed.
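
A short usage sketch of these methods, continuing with the document_store from above; the DuplicatePolicy import path and the Pinecone-style filter syntax are assumptions, since the filter format was still being refactored at the time:

from haystack import Document
from haystack.document_stores.types import DuplicatePolicy  # import path is an assumption

docs = [Document(content="Example document", meta={"genre": "documentation"})]
document_store.write_documents(docs, policy=DuplicatePolicy.OVERWRITE)

print(document_store.count_documents())
print(document_store.get_index_stats())

# Retrieve documents matching a metadata filter, then delete them by ID.
matching = document_store.filter_documents(filters={"genre": {"$eq": "documentation"}})
document_store.delete_documents(document_ids=[doc.id for doc in matching])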

Retriever

The PineconeRetriever class retrieves documents from the PineconeDocumentStore. It is similar to the BaseRetriever class of the retriever node in v1. In v1, the BaseRetriever took a string as input and returned the documents most relevant to the query. The PineconeRetriever takes the embedding of the query as input and returns a dictionary containing the retrieved documents.

Methods implemented:

  • run: It takes the embedding of the query, filters, the top_k value, and scale_score as input, and returns a dictionary of the retrieved documents. New in v2.

  • to_dict: Serializes the Retriever component to a dictionary. New in v2.

  • from_dict: Deserializes the Retriever component from a dictionary. New in v2.
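
A sketch of using the retriever directly; the constructor arguments are assumptions, as the PR description only shows run, to_dict, and from_dict:

retriever = PineconeRetriever(document_store=document_store, top_k=5)  # constructor args assumed

query_embedding = [0.1] * 768  # in a real pipeline this comes from a text embedder
output = retriever.run(query_embedding=query_embedding, filters={"genre": {"$eq": "documentation"}})
print(output["documents"])

# Round-trip (de)serialization
data = retriever.to_dict()
restored = PineconeRetriever.from_dict(data)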

Pinecone Concepts

Pinecone supports dense, sparse, and sparse-dense vectors. Currently, Haystack supports only dense vectors, with support for sparse and sparse-dense vectors forthcoming.

Adding vectors:

  1. Create a new index. At creation time we need to specify the index name, the metric ('euclidean', 'cosine', or 'dotproduct'), and the dimension.

  2. Wait for the index to be created.

  3. Generate vectors using any embedding model.
    Each vector generated by the embedding model can be augmented with metadata.
    Pinecone supports 40 KB of metadata per vector.

  4. Upsert the vectors, typically in batches.
    Pinecone allows you to partition the records in an index into namespaces,
    which can be specified during the upsert.
    Queries and other operations can then be restricted by namespace.

  5. After indexing the data, we can check the number of vectors created.
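
A sketch of these steps using the pinecone client API as it was at the time of this PR (pinecone.init / pinecone.create_index); later client versions expose a Pinecone class instead, and the index name, vectors, and metadata below are only placeholders:

import pinecone

pinecone.init(api_key="YOUR_API_KEY", environment="us-west1-gcp")

# Steps 1-2: create the index and connect to it once it is ready.
pinecone.create_index("document", dimension=768, metric="cosine")
index = pinecone.Index("document")

# Steps 3-4: upsert (id, vector, metadata) tuples, optionally into a namespace.
vectors = [
    ("doc-1", [0.1] * 768, {"genre": "documentation"}),
    ("doc-2", [0.2] * 768, {"genre": "blog"}),
]
index.upsert(vectors=vectors, namespace="my-namespace")

# Step 5: check how many vectors the index now holds.
print(index.describe_index_stats())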

Once our index is populated, queries can be run.

Query

The Query operation searches the index using a query vector. It retrieves the IDs of the most similar records in the index, along with their similarity scores.
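
For example, continuing the client sketch above (the query vector and namespace are placeholders):

result = index.query(
    vector=[0.1] * 768,
    top_k=10,
    namespace="my-namespace",
    include_metadata=True,
)
# Each match carries the record ID and its similarity score.
for match in result.matches:
    print(match.id, match.score)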

Namespaces

Pinecone allows you to partition the records in an index into namespaces.
When upserting vectors into Pinecone, you can select the namespace to upsert them into.
Namespaces are created implicitly the first time they are used in an upsert; each namespace is uniquely identified by its name.
When querying the vector database, you can restrict the query to a specific namespace.
Queries and other operations are then limited to that namespace, so different requests can search different subsets of your index.
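
For instance, continuing the sketch above, an upsert and a query can both be pinned to one namespace:

# Upsert into a dedicated namespace (created implicitly on first use)...
index.upsert(vectors=[("doc-3", [0.3] * 768, {"genre": "news"})], namespace="news-articles")

# ...and restrict the search to that namespace only.
index.query(vector=[0.3] * 768, top_k=5, namespace="news-articles")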

Metadata Filtering

Pinecone lets you attach metadata key-value pairs to vectors in an index, and specify filter expressions when you query the index. Metadata can be included in upsert requests as you insert your vectors. You can limit your vector search based on metadata.

Searches with metadata filters retrieve exactly the number of nearest-neighbor
results that match the filters.

Metadata filter expressions can be included with queries to limit the search to vectors matching the filter expression. Filters are built from a set of operators that can be applied to strings and/or numbers, and individual conditions can be combined with $and and $or (see the example after the list below):

  • $eq - Equal to (number, string, boolean)
  • $ne - Not equal to (number, string, boolean)
  • $gt - Greater than (number)
  • $gte - Greater than or equal to (number)
  • $lt - Less than (number)
  • $lte - Less than or equal to (number)
  • $in - In array (string or number)
  • $nin - Not in array (string or number)
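
For example, a query restricted to hypothetical genre and year metadata fields, combining two conditions with $and:

index.query(
    vector=[0.1] * 768,
    top_k=10,
    include_metadata=True,
    filter={
        "$and": [
            {"genre": {"$in": ["documentation", "blog"]}},
            {"year": {"$gte": 2022}},
        ]
    },
)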

Deleting vectors by metadata filter

Vectors can also be deleted by metadata filter: passing a metadata filter expression to the delete operation deletes all vectors that match it.
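
For example, again with the hypothetical genre field and namespace from the sketches above:

# Deletes every vector in the namespace whose metadata matches the filter.
index.delete(filter={"genre": {"$eq": "blog"}}, namespace="my-namespace")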


This code was written collaboratively with @awinml.

@vrunm vrunm requested a review from a team as a code owner October 29, 2023 07:22
@vrunm vrunm requested review from anakin87 and removed request for a team October 29, 2023 07:22
@anakin87 anakin87 self-assigned this Oct 30, 2023
@anakin87 (Member)

Hey, @vrunm and @awinml...

Thanks for opening this PR!

The review will probably take a few days, because we are refactoring the Document class and this will also affect the Document Store.

Please bear with me and thank you again!

@anakin87 (Member)

Hey, I made some changes:

  • I adapted this PR to the recent Document refactoring
  • I set up the test workflow

Before improving this PR, we would like to wait until some other refactorings (especially on filtering mechanisms) are in place...

So please bear with me again...

@CLAassistant commented Dec 6, 2023

CLA assistant check
All committers have signed the CLA.

@anakin87 (Member) commented Dec 6, 2023

Hey, I'm closing this in favor of #81.

I moved your commits there, so don't worry: your work will be recognized...

I do this for two reasons:

  • having the draft PR in an official branch makes it possible to run tests using our API Key (not supported in forks)
  • after the recent changes, there is quite a bit of work to be done to adapt your good contribution

Thanks again!

@anakin87 anakin87 closed this Dec 6, 2023