0.5.0
What's Changed
Added code for Qdrant, a vector database built in Rust
Includes:
Key features
Bulk index both the data and the associated vectors (sentence embeddings generated with sentence-transformers) into Qdrant so that we can perform similarity search on phrases.
- Unlike keyword-based search, similarity search requires vectors that come from an NLP (typically transformer) model
- We use a pretrained model from sentence-transformers
  - multi-qa-distilbert-cos-v1 is the model used: As per the docs, "This model was tuned for semantic search: Given a query/question, it can find relevant passages. It was trained on a large and diverse set of (question, answer) pairs."
- Unlike other cases, generating sentence embeddings on a large batch of text is quite slow on a CPU, so code is provided to generate ONNX-optimized and quantized models, which let us generate the vectors and index them into the database more rapidly without a GPU
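The bulk-indexing flow described above can be sketched as a simple batching loop. This is a minimal illustration, not the repository's actual code: the `embed` and `upsert` callables, the `"text"` field name, and the batch size are all hypothetical stand-ins for the sentence-transformers model call and the Qdrant write.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List


def batched(items: Iterable[Dict], size: int) -> Iterator[List[Dict]]:
    """Yield successive lists of at most `size` items."""
    it = iter(items)
    while chunk := list(islice(it, size)):
        yield chunk


def index_in_batches(
    records: Iterable[Dict],
    embed: Callable[[List[str]], List[List[float]]],  # hypothetical: wraps the embedding model
    upsert: Callable[[List[Dict], List[List[float]]], None],  # hypothetical: wraps the database write
    batch_size: int = 64,  # assumed value, tune for your hardware
) -> int:
    """Embed and store records batch by batch; return the number indexed."""
    total = 0
    for batch in batched(records, batch_size):
        # "text" is an assumed field name for the phrase to embed
        vectors = embed([record["text"] for record in batch])
        upsert(batch, vectors)
        total += len(batch)
    return total


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without a model or a database
    fake_embed = lambda texts: [[float(len(t))] for t in texts]
    store: List[tuple] = []
    fake_upsert = lambda batch, vectors: store.extend(zip(batch, vectors))

    reviews = [{"text": f"review {i}"} for i in range(10)]
    print(index_in_batches(reviews, fake_embed, fake_upsert, batch_size=4))
```

Passing the model and the database write in as callables keeps the loop independent of whether the vanilla or the quantized ONNX model produces the vectors.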
Notes on ONNX performance
It looks like ONNX does utilize all available CPU cores when processing the text and generating the embeddings (observed on an AWS EC2 T2 Ubuntu instance with a single 4-core CPU).
On average, the entire wine reviews dataset of 129,971 reviews is vectorized and ingested into Qdrant in 34 minutes via the quantized ONNX model, as opposed to more than 1 hour for the regular sbert model downloaded from the sentence-transformers repo. The quantized ONNX model is also ~33% smaller than the original model.
- sbert model: Processes roughly 51 items/sec
- Quantized onnxruntime model: Processes roughly 92 items/sec
This amounts to a roughly 1.8x reduction in indexing time, with a ~26% smaller (quantized) model that loads and processes results faster. To verify that the embeddings from the quantized models are of similar quality, some example cosine similarities are shown below.
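As a quick check of the arithmetic, the 1.8x figure follows directly from the two throughput numbers above. The sketch below only restates those reported rates; the function names are illustrative.

```python
SBERT_ITEMS_PER_SEC = 51  # reported rate for the regular sbert model
ONNX_ITEMS_PER_SEC = 92   # reported rate for the quantized onnxruntime model


def speedup(baseline_rate: float, new_rate: float) -> float:
    """How many times faster the new model processes items."""
    return new_rate / baseline_rate


def relative_time(baseline_rate: float, new_rate: float) -> float:
    """Fraction of the baseline processing time the new model needs."""
    return baseline_rate / new_rate


print(round(speedup(SBERT_ITEMS_PER_SEC, ONNX_ITEMS_PER_SEC), 2))        # 1.8
print(round(relative_time(SBERT_ITEMS_PER_SEC, ONNX_ITEMS_PER_SEC), 2))  # 0.55
```

In other words, at these rates the quantized model needs roughly 55% of the time the regular model takes to process the same number of items.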
Example results:
The following results are for the sentence-transformers/multi-qa-MiniLM-L6-cos-v1 model, which was built for semantic similarity tasks.
Vanilla model
---
Loading vanilla sentence transformer model
---
Similarity between 'I'm very happy' and 'I am so glad': [0.74601071]
Similarity between 'I'm very happy' and 'I'm so sad': [0.6456476]
Similarity between 'I'm very happy' and 'My dog is missing': [0.09541589]
Similarity between 'I'm very happy' and 'The universe is so vast!': [0.27607652]
Quantized ONNX model
---
Loading quantized ONNX model
---
The ONNX file model_optimized_quantized.onnx is not a regular name used in optimum.onnxruntime, the ORTModel might not behave as expected.
Similarity between 'I'm very happy' and 'I am so glad': [0.74153285]
Similarity between 'I'm very happy' and 'I'm so sad': [0.65299551]
Similarity between 'I'm very happy' and 'My dog is missing': [0.09312761]
Similarity between 'I'm very happy' and 'The universe is so vast!': [0.26112114]
As can be seen, the similarity scores are very close to those of the vanilla model, but the model is ~26% smaller and we are able to process the sentences much faster on the same CPU.
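To put a number on "very close", the sketch below defines cosine similarity in plain Python and then compares the scores printed above for the two models. The score values are copied from the output shown; the helper names are illustrative and the pure-Python cosine function is a stand-in for whatever library routine produced the original numbers.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Similarity of each sentence to "I'm very happy", as printed above
vanilla: Dict[str, float] = {
    "I am so glad": 0.74601071,
    "I'm so sad": 0.6456476,
    "My dog is missing": 0.09541589,
    "The universe is so vast!": 0.27607652,
}
quantized: Dict[str, float] = {
    "I am so glad": 0.74153285,
    "I'm so sad": 0.65299551,
    "My dog is missing": 0.09312761,
    "The universe is so vast!": 0.26112114,
}

# Absolute gap between the two models for each sentence
gaps = {s: abs(vanilla[s] - quantized[s]) for s in vanilla}
worst = max(gaps, key=gaps.get)
print(worst, round(gaps[worst], 4))

# The ordering of sentences by similarity is the same for both models
rank = lambda scores: sorted(scores, key=scores.get, reverse=True)
print(rank(vanilla) == rank(quantized))
```

For these four examples the largest gap is about 0.015, and both models rank the sentences in the same order.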