pqai index
This service maintains indexes. Indexes are special data structures that enable fast search through a corpus. Although this service's scope is not limited to vector indexes, they are the kind of indexes used in the current implementation.
Vector indexes store real-valued vectors (typically having hundreds of dimensions) that are themselves representations of real-world entities such as text or images.
Given a query vector, vector indexes make it easy (and fast) to find other vectors similar to it, as measured by a metric such as cosine distance. To identify the real-world entity to which each similar vector corresponds, vector indexes also carry an ID associated with each vector. When the vectors correspond to patents, for example, the IDs may be their patent numbers.
Deep neural networks can create vector representations of real-world data (such as text, images, sounds - pretty much anything) that encode important semantic information about it. This enables searching items by meaning rather than by matching their surface forms alone. This is where vector indexes come in handy.
Without vector indexes, search could still be performed by comparing a query vector with each of the item vectors and finding the most similar ones. Such a one-to-many comparison, however, would be excessively slow for millions of vectors.
Vector indexes find similar vectors by performing what is called "approximate nearest neighbor" (ANN) search. During indexing, a specialized data structure is created that makes it really easy (and fast) to find some of the closest neighbors to a given query vector. Note that the results from such an index are not guaranteed to be exact, but in practice (when the indexing parameters are set properly and a small number of neighbors is requested) the approximate neighbors are the same as the true closest neighbors.
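For intuition, the brute-force alternative looks something like the following sketch (plain numpy, not part of this codebase); it is exactly this linear scan over all vectors that ANN indexes avoid:

import numpy as np

def brute_force_search(qvec, vectors, n_results):
    """Compare the query against every indexed vector (cosine similarity)."""
    q = qvec / np.linalg.norm(qvec)
    V = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    similarities = V @ q                # one comparison per indexed vector
    order = np.argsort(-similarities)   # most similar first
    return order[:n_results]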
root/
 |-- core/
 |    |-- indexer.py        # code for creating indexes
 |    |-- indexes.py        # code for searching through indexes
 |    |-- storage.py        # code for storing and loading indexes
 |-- indexes/               # put index files here
 |-- tests/
 |    |-- test_indexes/     # toy indexes used for testing
 |    |-- test_server.py    # tests for the REST API
 |    |-- test_indexes.py
 |    |-- test_indexer.py
 |-- main.py                # defines the REST API
 |
 |-- requirements.txt       # list of Python dependencies
 |
 |-- Dockerfile             # Docker files
 |-- docker-compose.yml
 |
 |-- env                    # .env file template
 |-- deploy.sh              # script for setting up the service locally
core/indexer.py
This module contains the code that creates vector indexes. It defines the following two classes:
- FaissIndexCreator
- AnnoyIndexCreator
The two differ only in the type of indexes they create.
FaissIndexCreator creates indexes in the .faiss format. More information about this format can be found in the FAISS documentation. FAISS indexes need to be loaded into main memory for searching, due to which their memory requirements are high. Searching is very fast, however, and GPU acceleration can also be used to further reduce search latency.
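At the level of the underlying library, loading and searching a FAISS index looks roughly like the following sketch (plain FAISS, shown only for orientation; this service wraps these calls behind its own classes):

import faiss
import numpy as np

# Load a previously saved index entirely into main memory
index = faiss.read_index("test_index.faiss")

# FAISS expects a 2D float32 batch of query vectors
qvecs = np.random.rand(1, 128).astype("float32")
distances, ids = index.search(qvecs, 10)  # 10 nearest neighbors per query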
In memory-restricted settings, however, AnnoyIndexCreator is a better choice. It outputs indexes in the .ann format. More information about this format can be found in the Annoy documentation. These indexes need not be loaded into main memory for searching; instead, they use memory mapping to search directly from the disk. Searching is generally slower, but for small indexes the performance is acceptable.
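With the underlying annoy library, the memory mapping is what the load call does (again a sketch for orientation, independent of this service's wrappers):

from annoy import AnnoyIndex

a = AnnoyIndex(768, "angular")  # dimensionality and distance metric
a.load("Y02T.ttl.ann")          # memory-maps the file instead of reading it into RAM
neighbor_ids = a.get_nns_by_vector([0.0] * 768, 10)  # 10 approximate neighbors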
To create an index, you just need to supply the vectors and their associated labels to the create method of either class. Typical usage is as follows:
from core.indexer import FaissIndexCreator
import numpy as np

# Random vectors and string labels to be indexed
n_vectors = 20000
n_dims = 128
shape = (n_vectors, n_dims)
vectors = np.random.normal(size=shape).astype("float32")
labels = [str(i) for i in range(n_vectors)]

index_name = "test_index"
save_dir = "./"
options = {"normalize": True, "factory_string": "OPQ16_64,HNSW32"}

creator = FaissIndexCreator(**options)
creator.create(
    name=index_name,
    vectors=vectors,
    labels=labels,
    n_train=None,
    save_dir=save_dir
)
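A note on the options above: factory_string follows FAISS's index_factory syntax. Here, "OPQ16_64" appears to specify an OPQ pre-transform (16 sub-vectors, reducing vectors to 64 dimensions) and "HNSW32" an HNSW graph with 32 links per node; consult the FAISS documentation for the full syntax.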
The above code will create an index file test_index.faiss and a metadata file test_index.config.json in the current working directory. These files can then be read by a FaissIndexReader instance to create a FaissIndex object, which can be used to find vectors similar to a given query vector.
core/indexes.py
This module provides wrappers around FAISS and Annoy indexes through the following classes:
- FaissIndex
- AnnoyIndex
Both of these classes inherit their interface from the abstract VectorIndex class, also defined in the same module. VectorIndex exposes a single abstract method called search, which takes two arguments: a query vector and the number of results to return.
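In outline, that interface looks like the following sketch (inferred from the description above; the actual code in core/indexes.py may differ in its details):

from abc import ABC, abstractmethod

class VectorIndex(ABC):
    """Common interface implemented by FaissIndex and AnnoyIndex."""

    @abstractmethod
    def search(self, qvec, n_results):
        """Return the n_results items most similar to qvec."""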
Initializing these classes requires pre-instantiated core FAISS and Annoy index objects. To hide those details from the users of these classes, two "index readers" are provided by this module:
- FaissIndexReader
- AnnoyIndexReader
Both of these expose a method called read_from_files which, as the name implies, accepts paths to the files created by FaissIndexCreator and AnnoyIndexCreator. It returns FaissIndex and AnnoyIndex objects, respectively, whose search methods can be used directly. In practice, therefore, FaissIndex and AnnoyIndex objects may never need to be instantiated directly.
Typical usage for reading a FAISS index and searching through it is as follows:
import numpy as np
from core.indexes import FaissIndexReader

index_file = "B68G.abs.faiss"
json_file = "B68G.abs.items.json"
reader = FaissIndexReader()
index = reader.read_from_files(index_file, json_file)

# Search for the 10 nearest neighbors of a 768-dimensional query vector
qvec = np.ones(768)
n_results = 10
results = index.search(qvec, n_results)
For an Annoy index, typical usage is as follows:
import numpy as np
from core.indexes import AnnoyIndexReader

ann_file = "Y02T.ttl.ann"
json_file = "Y02T.ttl.items.json"
reader = AnnoyIndexReader(768, "angular")  # dimensionality and distance metric
index = reader.read_from_files(ann_file, json_file)

# Search for the 10 nearest neighbors of a 768-dimensional query vector
qvec = np.ones(768)
n_results = 10
results = index.search(qvec, n_results)
core/storage.py
A single big vector index is quite difficult to manage in real-life situations (for example, it is hard to update with new vectors), due to which it is preferable to have multiple indexes in a production setting. Keeping track of which indexes are present and reading each of them according to its type (FAISS, Annoy, or other) can also be a hassle.
To mitigate this situation, this module provides the IndexStorage class, which provides a way to manage any mix of FAISS and Annoy indexes. They should all be stored in a single directory, the path of which is supplied when IndexStorage is instantiated. During initialization, it discovers all indexes in the given directory and loads them into memory (if needed). They can then be listed using the available method and obtained through the get method, which takes a prefix and returns all indexes whose names start with that prefix.
Typical usage is as follows:
from core.storage import IndexStorage

index_dir = "./indexes"
indexes = IndexStorage(index_dir)
print(indexes.available())  # list of index names

for index_id in indexes.available():
    index = indexes.get(index_id)
    print(type(index))  # FaissIndex or AnnoyIndex
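For illustration, the discovery step during initialization could work along the following lines (a hypothetical sketch based on the file extensions used by the creators, not the actual implementation in core/storage.py):

from pathlib import Path

def discover_indexes(index_dir):
    """Map index names to their type and file path (hypothetical helper)."""
    found = {}
    for path in Path(index_dir).iterdir():
        if path.suffix == ".faiss":
            found[path.stem] = ("faiss", path)
        elif path.suffix == ".ann":
            found[path.stem] = ("annoy", path)
    return found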
Prerequisites
The following deployment steps assume that you are running a Linux distribution and have Git and Docker installed on your system.
Setup
The easiest way to get this service up and running on your local system is to follow these steps:
1. Clone the repository.

   git clone https://github.com/pqaidevteam/pqai-index.git

2. Using the env template in the repository, create a .env file and set the environment variables.

   cd pqai-index
   cp env .env
   nano .env

3. Run the deploy.sh script.

   chmod +x deploy.sh
   bash ./deploy.sh
This will create a Docker image and run it as a Docker container on the port number you specified in the .env file.
Alternatively, after following steps (1) and (2) above, you can use the command python main.py to run the service in a terminal.
This service is dependent on the following other services:
- pqai-encoder (only during indexing)
The following services depend on this service:
- pqai-gateway
- Johnson, Jeff, Matthijs Douze, and Hervé Jégou. "Billion-Scale Similarity Search with GPUs." arXiv, February 28, 2017. http://arxiv.org/abs/1702.08734.