pqai index
This service maintains indexes. Indexes are special data structures that enable fast search through a corpus. Although this service's scope is not limited to vector indexes, they are the kind of indexes used in the current implementation.
Vector indexes store real-valued vectors (typically having hundreds of dimensions) that are themselves representations of real-world entities such as text or images.
Given a query vector, vector indexes make it easy (and fast) to find other vectors similar to it, as measured by a metric such as cosine distance. To identify the real-world entity to which each similar vector corresponds, vector indexes also carry an ID associated with each vector. When the vectors correspond to patents, for example, the IDs may be their patent numbers.
Deep neural networks can create vector representations of real-world data (such as text, images, sounds - pretty much anything) that encode important semantic information about it. This enables searching items by meaning rather than by matching their surface forms alone. This is where vector indexes come in handy.
Without vector indexes, search could still be performed by comparing a query vector with each of the item vectors and finding the most similar ones. Such a one-to-many comparison, however, would be excessively slow for millions of vectors.
Vector indexes find similar vectors by performing what is called "approximate nearest neighbor" (ANN) search. During indexing, a specialized data structure is created that makes it really easy (and fast) to find some of the closest neighbors to a given query vector. Note that the results from such an index are not guaranteed to be exact, but in practice (when the indexing parameters are set properly and a small number of neighbors is requested) the approximate neighbors are the same as the true closest neighbors.
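For intuition, the brute-force alternative looks something like the following sketch (plain numpy, not part of this codebase); it is exactly this linear scan over all vectors that ANN indexes avoid:

import numpy as np

def brute_force_search(qvec, vectors, n_results):
    """Compare the query against every indexed vector (cosine similarity)."""
    q = qvec / np.linalg.norm(qvec)
    V = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    similarities = V @ q                # one comparison per indexed vector
    order = np.argsort(-similarities)   # most similar first
    return order[:n_results]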
root/
 |-- core/
 |    |-- indexer.py        # code for creating indexes
 |    |-- indexes.py        # code for searching through indexes
 |    |-- storage.py        # code for storing and loading indexes
 |-- indexes/               # put index files here
 |-- tests/
 |    |-- test_indexes/     # toy indexes used for testing
 |    |-- test_server.py    # tests for the REST API
 |    |-- test_indexes.py
 |    |-- test_indexer.py
 |-- main.py                # defines the REST API
 |
 |-- requirements.txt       # list of Python dependencies
 |
 |-- Dockerfile             # Docker files
 |-- docker-compose.yml
 |
 |-- env                    # .env file template
 |-- deploy.sh              # script for setting up the service locally
core/indexer.py
This module contains the code that creates vector indexes. It defines the following two classes:
- FaissIndexCreator
- AnnoyIndexCreator
The two differ only in the type of indexes they create.
FaissIndexCreator creates indexes in the .faiss format. More information about this format can be found in the FAISS documentation. FAISS indexes need to be loaded into main memory for searching, due to which their memory requirements are high. Searching is very fast, however, and GPU acceleration can also be used to further reduce search latency.
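At the level of the underlying library, loading and searching a FAISS index looks roughly like the following sketch (plain FAISS, shown only for orientation; this service wraps these calls behind its own classes):

import faiss
import numpy as np

# Load a previously saved index entirely into main memory
index = faiss.read_index("test_index.faiss")

# FAISS expects a 2D float32 batch of query vectors
qvecs = np.random.rand(1, 128).astype("float32")
distances, ids = index.search(qvecs, 10)  # 10 nearest neighbors per query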
In memory-restricted settings, however, AnnoyIndexCreator is a better choice. It outputs indexes in the .ann format. More information about this format can be found in the Annoy documentation. These indexes need not be loaded into main memory for searching; instead, they use memory mapping to search directly from the disk. Searching is generally slower, but for small indexes the performance is acceptable.
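With the underlying annoy library, the memory mapping is what the load call does (again a sketch for orientation, independent of this service's wrappers):

from annoy import AnnoyIndex

a = AnnoyIndex(768, "angular")  # dimensionality and distance metric
a.load("Y02T.ttl.ann")          # memory-maps the file instead of reading it into RAM
neighbor_ids = a.get_nns_by_vector([0.0] * 768, 10)  # 10 approximate neighbors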
To create an index, you just need to supply the vectors and their associated labels to the create method of either class. Typical usage is as follows:
from core.indexer import FaissIndexCreator
import numpy as np

# Random vectors and string labels to be indexed
n_vectors = 20000
n_dims = 128
shape = (n_vectors, n_dims)
vectors = np.random.normal(size=shape).astype("float32")
labels = [str(i) for i in range(n_vectors)]

index_name = "test_index"
save_dir = "./"
options = {"normalize": True, "factory_string": "OPQ16_64,HNSW32"}

creator = FaissIndexCreator(**options)
creator.create(
    name=index_name,
    vectors=vectors,
    labels=labels,
    n_train=None,
    save_dir=save_dir
)
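A note on the options above: factory_string follows FAISS's index_factory syntax. Here, "OPQ16_64" appears to specify an OPQ pre-transform (16 sub-vectors, reducing vectors to 64 dimensions) and "HNSW32" an HNSW graph with 32 links per node; consult the FAISS documentation for the full syntax.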
The above code will create an index file test_index.faiss and a metadata file test_index.config.json in the current working directory. These files can then be read by a FaissIndexReader instance to create a FaissIndex object, which can be used to find vectors similar to a given query vector.
core/indexes.py
This module provides wrappers around FAISS and Annoy indexes through the following classes:
- FaissIndex
- AnnoyIndex
Both of these classes inherit their interface from the abstract VectorIndex class, also defined in the same module. VectorIndex exposes a single abstract method called search, which takes two arguments: a query vector and the number of results to return.
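In outline, that interface looks like the following sketch (inferred from the description above; the actual code in core/indexes.py may differ in its details):

from abc import ABC, abstractmethod

class VectorIndex(ABC):
    """Common interface implemented by FaissIndex and AnnoyIndex."""

    @abstractmethod
    def search(self, qvec, n_results):
        """Return the n_results items most similar to qvec."""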
Initializing these classes requires pre-instantiated core FAISS and Annoy index objects. To hide those details from the users of these classes, two "index readers" are provided by this module:
- FaissIndexReader
- AnnoyIndexReader
Both of these expose a method called read_from_files which, as the name implies, accepts paths to the files created by FaissIndexCreator and AnnoyIndexCreator. It returns FaissIndex and AnnoyIndex objects, respectively, whose search methods can be used directly. In practice, therefore, FaissIndex and AnnoyIndex objects may never need to be instantiated directly.
Typical usage for reading a FAISS index and searching through it is as follows:
import numpy as np
from core.indexes import FaissIndexReader

index_file = "B68G.abs.faiss"
json_file = "B68G.abs.items.json"
reader = FaissIndexReader()
index = reader.read_from_files(index_file, json_file)

# Search for the 10 nearest neighbors of a 768-dimensional query vector
qvec = np.ones(768)
n_results = 10
results = index.search(qvec, n_results)
For an Annoy index, typical usage is as follows:
import numpy as np
from core.indexes import AnnoyIndexReader

ann_file = "Y02T.ttl.ann"
json_file = "Y02T.ttl.items.json"
reader = AnnoyIndexReader(768, "angular")  # dimensionality and distance metric
index = reader.read_from_files(ann_file, json_file)

# Search for the 10 nearest neighbors of a 768-dimensional query vector
qvec = np.ones(768)
n_results = 10
results = index.search(qvec, n_results)
core/storage.py
A single big vector index is quite difficult to manage in real-life situations (for example, it is hard to update with new vectors), due to which it is preferable to have multiple indexes in a production setting. Keeping track of which indexes are present and reading each of them according to its type (FAISS, Annoy, or other) can also be a hassle.
To mitigate this situation, this module provides the IndexStorage class, which provides a way to manage any mix of FAISS and Annoy indexes. They should all be stored in a single directory, the path of which is supplied when IndexStorage is instantiated. During initialization, it discovers all indexes in the given directory and loads them into memory (if needed). They can then be listed using the available method and obtained through the get method, which takes a prefix and returns all indexes whose names start with that prefix.
Typical usage is as follows:
from core.storage import IndexStorage

index_dir = "./indexes"
indexes = IndexStorage(index_dir)
print(indexes.available())  # list of index names

for index_id in indexes.available():
    index = indexes.get(index_id)
    print(type(index))  # FaissIndex or AnnoyIndex
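For illustration, the discovery step during initialization could work along the following lines (a hypothetical sketch based on the file extensions used by the creators, not the actual implementation in core/storage.py):

from pathlib import Path

def discover_indexes(index_dir):
    """Map index names to their type and file path (hypothetical helper)."""
    found = {}
    for path in Path(index_dir).iterdir():
        if path.suffix == ".faiss":
            found[path.stem] = ("faiss", path)
        elif path.suffix == ".ann":
            found[path.stem] = ("annoy", path)
    return found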
Prerequisites
The following deployment steps assume that you are running a Linux distribution and have Git and Docker installed on your system.
Setup
The easiest way to get this service up and running on your local system is to follow these steps:
1. Clone the repository.

   git clone https://github.com/pqaidevteam/pqai-index.git

2. Using the env template in the repository, create a .env file and set the environment variables.

   cd pqai-index
   cp env .env
   nano .env

3. Run the deploy.sh script.

   chmod +x deploy.sh
   bash ./deploy.sh
This will create a Docker image and run it as a Docker container on the port number you specified in the .env file.
Alternatively, after following steps (1) and (2) above, you can use the command python main.py to run the service in a terminal.
This service is dependent on the following other services:
- pqai-encoder (only during indexing)
The following services depend on this service:
- pqai-gateway
- Johnson, Jeff, Matthijs Douze, and Hervé Jégou. "Billion-Scale Similarity Search with GPUs." arXiv, February 28, 2017. http://arxiv.org/abs/1702.08734.