Merge pull request #17 from prrao87/qdrant
Qdrant: A vector database built on Rust
prrao87 authored Apr 22, 2023
2 parents def1585 + 1ee5c02 commit dc1607d
Showing 19 changed files with 942 additions and 2 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -134,4 +134,5 @@ dmypy.json
# data
data/*.json
data/*.jsonl
*/*/meili_data
*/*/meili_data
dbs/qdrant/scripts/onnx_models
2 changes: 1 addition & 1 deletion README.md
@@ -8,10 +8,10 @@ Example code is provided for numerous databases, along with FastAPI docker deplo
* Neo4j
* Elasticsearch
* Meilisearch
* Qdrant

#### 🚧 Coming soon

* Qdrant
* Weaviate


12 changes: 12 additions & 0 deletions dbs/qdrant/.env.example
@@ -0,0 +1,12 @@
QDRANT_VERSION = "v1.1.1"
QDRANT_PORT = 6333
QDRANT_HOST = "localhost"
QDRANT_SERVICE = "qdrant"
API_PORT = 8005
EMBEDDING_MODEL_CHECKPOINT = "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"

# Container image tag
TAG = "0.1.0"

# Docker project namespace (defaults to the current folder name if not set)
COMPOSE_PROJECT_NAME = qdrant_wine
14 changes: 14 additions & 0 deletions dbs/qdrant/Dockerfile
@@ -0,0 +1,14 @@
FROM python:3.10-slim-bullseye

WORKDIR /wine

COPY ./requirements-docker.txt /wine/requirements-docker.txt

RUN pip install --no-cache-dir -U pip wheel setuptools
RUN pip install --no-cache-dir -r /wine/requirements-docker.txt

COPY ./api /wine/api
COPY ./schemas /wine/schemas
COPY ./scripts/onnx_models /wine/scripts/onnx_models

EXPOSE 8000
158 changes: 158 additions & 0 deletions dbs/qdrant/README.md
@@ -0,0 +1,158 @@
# Qdrant

[Qdrant](https://qdrant.tech/) is a vector database and vector similarity search engine written in Rust. The primary use case for a vector database is semantic similarity search, where items are retrieved based on how close their vector embeddings are to the embedding of a query.

* Which wines have reviews most similar to the phrase "tuscany red"?
* Which of those wines cost no more than a given maximum price?

Code is provided for ingesting the wine reviews dataset into Qdrant in an async fashion. In addition, a query API written in FastAPI is also provided that allows a user to query available endpoints. As always in FastAPI, documentation is available via OpenAPI (http://localhost:8000/docs).

* All code (wherever possible) is async
* [Pydantic](https://docs.pydantic.dev) is used for schema validation, both prior to data ingestion and during API request handling
* The same schema is used for data ingestion and for the API, so there is only one source of truth regarding how the data is handled
* For ease of reproducibility, the whole setup is orchestrated and deployed via docker

## Setup

This code base has been tested with Python 3.10 and requires Python 3.10 or later. Install dependencies via `requirements.txt`.

```sh
# Set up the environment for the first time
python -m venv qdrant_venv # python -> python 3.10

# Activate the environment (for subsequent runs)
source qdrant_venv/bin/activate

python -m pip install -r requirements.txt
```

---

## Step 1: Set up containers

Use the provided `docker-compose.yml` to initiate separate containers, one that runs Qdrant, and another one that serves as an API on top of the database.

```sh
docker compose up -d
```

This compose file starts a Qdrant database with a persistent volume, configured via the variables in `.env`. The `QDRANT_SERVICE` variable in the environment file names the database service that the FastAPI server (running as a separate service, in a separate container) connects to. Both containers communicate over the network that they share, on the port numbers specified.

The services can be stopped at any time for maintenance and updates.

```sh
docker compose down
```

**Note:** The setup shown here would not be ideal in production, as there are other details related to security and scalability that a simple docker setup does not address, but it is a good starting point for experimenting!


## Step 2: Ingest the data

Because Qdrant is a vector database, we ingest not only the wine reviews JSON blobs for each item, but also vectors (i.e., sentence embeddings) for the fields on which we want to perform a semantic similarity search. For this dataset, it's reasonable to expect that a simple concatenation of fields like `title`, `variety` and `description` would result in a useful sentence embedding that can be compared against a search query (which is also converted to a vector during query time).

As an example, consider the following data snippet from the `data/` directory in this repo:

```json
"title": "Castello San Donato in Perano 2009 Riserva (Chianti Classico)",
"description": "Made from a blend of 85% Sangiovese and 15% Merlot, this ripe wine delivers soft plum, black currants, clove and cracked pepper sensations accented with coffee and espresso notes. A backbone of firm tannins give structure. Drink now through 2019.",
"variety": "Red Blend"
```
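
A minimal sketch of the concatenation described above, assuming each record is a Python dict with these three keys. The helper name `to_embedding_text` and the single-space separator are illustrative assumptions; the loader script in this repo may join the fields differently.

```python
def to_embedding_text(
    record: dict,
    fields: tuple[str, ...] = ("title", "variety", "description"),
) -> str:
    """Join the selected text fields into one string to be embedded."""
    # Skip missing or empty fields so that "None" never ends up in the text
    parts = [str(record[f]).strip() for f in fields if record.get(f)]
    return " ".join(parts)


# Shortened version of the record shown above
record = {
    "title": "Castello San Donato in Perano 2009 Riserva (Chianti Classico)",
    "description": "A backbone of firm tannins give structure.",
    "variety": "Red Blend",
}
text = to_embedding_text(record)
```

The resulting string is what would be passed to the embedding model, both for each wine at ingestion time and, in the same way, for the search terms at query time.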

### Choice of embedding model

[SentenceTransformers](https://www.sbert.net/) is a Python framework for a range of sentence and text embeddings. It results from extensive work on fine-tuning BERT to work well on semantic similarity tasks using Siamese BERT networks, where the model is trained to predict the similarity between sentence pairs. The original work is [described here](https://arxiv.org/abs/1908.10084).

#### Why use sentence transformers?

Although larger and more powerful text embedding models exist (such as [OpenAI embeddings](https://platform.openai.com/docs/guides/embeddings)), they can become expensive because they charge per token of text they generate vectors for. SentenceTransformers models are free and open-source, and have been optimized for years for performance (to utilize all CPU cores) as well as accuracy. A full list of sentence transformer models [is on their project page](https://www.sbert.net/docs/pretrained_models.html).

For this work, it makes sense to use one of the fastest models in this list, the `multi-qa-MiniLM-L6-cos-v1` **uncased** model. As the name suggests, it was tuned for semantic search and question answering, and generates sentence embeddings for single sentences or paragraphs up to a maximum sequence length of 512 tokens. It was trained on 215M question-answer pairs from various sources. Compared to the more general-purpose `all-MiniLM-L6-v2` model, it shows slightly improved performance on semantic search tasks while offering similar speed. [See the sbert docs](https://www.sbert.net/docs/pretrained_models.html) for more details on performance comparisons between the various pretrained models.
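
The `cos` in the model name indicates that its embeddings are intended to be compared via cosine similarity. As a rough illustration of the ranking a vector database performs when a collection uses cosine distance, the pure-Python sketch below scores toy 3-dimensional vectors (the real model produces 384-dimensional ones). The function name, the wine names and the vectors are all made up for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


# Toy query vector and toy "wine" vectors (hypothetical values)
query = [1.0, 0.0, 1.0]
candidates = {
    "wine_a": [2.0, 0.0, 2.0],  # same direction as the query
    "wine_b": [0.0, 1.0, 0.0],  # orthogonal to the query
    "wine_c": [1.0, 1.0, 0.0],  # partially aligned with the query
}

# Rank candidates from most to least similar to the query
ranked = sorted(
    candidates,
    key=lambda name: cosine_similarity(query, candidates[name]),
    reverse=True,
)
```

Because cosine similarity ignores vector length, `wine_a` scores highest even though its vector is twice as long as the query.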


### Run data loader

Data is ingested into the Qdrant database through the scripts in the `scripts` directory.

```sh
cd scripts
python bulk_index_sbert.py
```

This script validates the input JSON data via [Pydantic](https://docs.pydantic.dev), and then indexes the records into Qdrant using the [Qdrant Python client](https://github.com/qdrant/qdrant-client).

We simply concatenate the key fields that contain useful information about each wine, and vectorize them prior to indexing them to the database.


## Step 3: Test API

Once the data has been successfully loaded into Qdrant and the containers are up and running, we can test out a search query via an HTTP request as follows.

```sh
curl -X 'GET' \
'http://localhost:8000/wine/search?terms=tuscany%20red&max_price=50'
```

This cURL request passes the search terms "**tuscany red**" and a maximum price of 50 to the `/wine/search` endpoint. The FastAPI backend converts the search terms into a vector using the embedding model, and Qdrant then returns the wines in the `wines` collection whose vectors are closest to it and whose `price` does not exceed `max_price`. If the setup was done correctly, we should see a response like the following:

```json
[
{
"wineID": 66393,
"country": "Italy",
"title": "Capezzana 1999 Ghiaie Della Furba Red (Tuscany)",
"description": "Very much a baby, this is one big, bold, burly Cab-Merlot-Syrah blend that's filled to the brim with extracted plum fruit, bitter chocolate and earth. It takes a long time in the glass for it to lose its youthful, funky aromatics, and on the palate things are still a bit scattered. But in due time things will settle and integrate",
"points": 90,
"price": 49,
"variety": "Red Blend",
"winery": "Capezzana"
},
{
"wineID": 40960,
"country": "Italy",
"title": "Fattoria di Grignano 2011 Pietramaggio Red (Toscana)",
"description": "Here's a simple but well made red from Tuscany that has floral aromas of violet and rose with berry notes. The palate offers bright cherry, red currant and a touch of spice. Pair this with pasta dishes or grilled vegetables.",
"points": 86,
"price": 11,
"variety": "Red Blend",
"winery": "Fattoria di Grignano"
},
{
"wineID": 73595,
"country": "Italy",
"title": "I Giusti e Zanza 2011 Belcore Red (Toscana)",
"description": "With aromas of violet, tilled soil and red berries, this blend of Sangiovese and Merlot recalls sunny Tuscany. It's loaded with wild cherry flavors accented by white pepper, cinnamon and vanilla. The palate is uplifted by vibrant acidity and fine tannins.",
"points": 89,
"price": 27,
"variety": "Red Blend",
"winery": "I Giusti e Zanza"
}
]
```

Not bad! This example correctly returns some highly rated Tuscan red wines along with their price and country of origin (obviously, Italy in this case).

## Step 4: Extend the API

The API can be easily extended with the provided structure.

- The `schemas` directory houses the Pydantic schemas, both for the data input as well as for the endpoint outputs
    - As the data model gets more complex, we can add more files and separate the ingestion logic from the API logic here
- The `api/routers` directory contains the endpoint routes so that we can provide additional endpoints that answer more business questions
    - For example: "What are the top rated wines from Argentina?"
    - In general, it makes sense to organize specific business use cases into their own router files
- The `api/main.py` file collects all the routes and schemas to run the API


#### Existing endpoints

So far, the following endpoints that help answer interesting questions have been implemented.

```
GET
/wine/search
Semantic similarity search
```

More to come soon!

Empty file added dbs/qdrant/api/__init__.py
Empty file.
14 changes: 14 additions & 0 deletions dbs/qdrant/api/config.py
@@ -0,0 +1,14 @@
from pydantic import BaseSettings


class Settings(BaseSettings):
    # Values are read from environment variables or from the .env file
    qdrant_service: str
    qdrant_port: str
    qdrant_host: str
    api_port: str
    embedding_model_checkpoint: str
    tag: str

    class Config:
        env_file = ".env"
52 changes: 52 additions & 0 deletions dbs/qdrant/api/main.py
@@ -0,0 +1,52 @@
from collections.abc import AsyncGenerator
from contextlib import asynccontextmanager
from functools import lru_cache

from fastapi import FastAPI
from qdrant_client import QdrantClient

from api.config import Settings
from api.routers.wine import wine_router

from scripts.onnx_optimizer import get_embedding_pipeline


@lru_cache()
def get_settings():
    # Use lru_cache to avoid loading .env file for every request
    return Settings()


@asynccontextmanager
async def lifespan(app: FastAPI) -> AsyncGenerator[None, None]:
    """Async context manager for Qdrant database connection."""
    settings = get_settings()
    model_checkpoint = settings.embedding_model_checkpoint
    app.model = get_embedding_pipeline(
        "scripts/onnx_models", model_filename="model_optimized_quantized.onnx"
    )
    app.client = QdrantClient(host=settings.qdrant_service, port=settings.qdrant_port)
    print("Successfully connected to Qdrant")
    yield
    print("Successfully closed Qdrant connection and released resources")


app = FastAPI(
    title="REST API for wine reviews on Qdrant",
    description=(
        "Query from a Qdrant database of 130k wine reviews from the Wine Enthusiast magazine"
    ),
    version=get_settings().tag,
    lifespan=lifespan,
)


@app.get("/", include_in_schema=False)
async def root():
    return {
        "message": "REST API for querying Qdrant database of 130k wine reviews from the Wine Enthusiast magazine"
    }


# Attach routes
app.include_router(wine_router, prefix="/wine", tags=["wine"])
72 changes: 72 additions & 0 deletions dbs/qdrant/api/routers/wine.py
@@ -0,0 +1,72 @@
from qdrant_client import QdrantClient
from qdrant_client.http import models
from fastapi import APIRouter, HTTPException, Query, Request
from optimum.pipelines import pipeline

from schemas.retriever import SimilaritySearch

wine_router = APIRouter()


# --- Routes ---


@wine_router.get(
    "/search",
    response_model=list[SimilaritySearch],
    response_description="Search wines by title, description and variety",
)
def search_by_keywords(
    request: Request,
    terms: str = Query(description="Search wine by keywords in title, description and variety"),
    max_price: float = Query(
        default=10000.0, description="Specify the maximum price for the wine (e.g., 30)"
    ),
) -> list[SimilaritySearch] | None:
    model = request.app.model
    client = request.app.client
    collection = "wines"
    result = _search_by_keywords(client, model, collection, terms, max_price)
    if not result:
        raise HTTPException(
            status_code=404,
            detail=f"No wine with the provided terms '{terms}' found in database - please try again",
        )
    return result


# --- Helper functions ---


def _search_by_keywords(
    client: QdrantClient, model: pipeline, collection: str, terms: str, max_price: float
) -> list[SimilaritySearch] | None:
    """Convert input text query into a vector for lookup in the db"""
    vector = model(terms)[0][0]

    # Define a range filter for wine price (named to avoid shadowing the builtin `filter`)
    price_filter = models.Filter(
        **{
            "must": [
                {
                    "key": "price",
                    "range": {
                        "lte": max_price,
                    },
                }
            ]
        }
    )

    # Use `vector` for similarity search on the closest vectors in the collection
    # Note: recent versions of qdrant-client name the result-count argument `limit` (formerly `top`)
    search_result = client.search(
        collection_name=collection, query_vector=vector, query_filter=price_filter, limit=5
    )
    # `search_result` contains found vector ids with similarity scores along with the stored payload
    # For now we are interested in payload only
    payloads = [hit.payload for hit in search_result]
    # Qdrant doesn't appear to have a sort option for fields other than similarity score, so we just sort it ourselves
    payloads = sorted(payloads, key=lambda x: x["points"], reverse=True)
    if not payloads:
        return None
    return payloads
37 changes: 37 additions & 0 deletions dbs/qdrant/docker-compose.yml
@@ -0,0 +1,37 @@
version: "3"

services:
  qdrant:
    image: qdrant/qdrant:${QDRANT_VERSION}
    restart: unless-stopped
    environment:
      - QDRANT_HOST=${QDRANT_HOST}
    ports:
      - ${QDRANT_PORT}:6333
    volumes:
      - qdrant_storage:/qdrant/storage
    # networks:
    #   - wine

  # fastapi:
  #   image: qdrant_wine_fastapi:${TAG}
  #   build: .
  #   restart: unless-stopped
  #   env_file:
  #     - .env
  #   ports:
  #     - ${API_PORT}:8000
  #   depends_on:
  #     - qdrant
  #   volumes:
  #     - ./:/wine
  #   networks:
  #     - wine
  #   command: uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload

volumes:
  qdrant_storage:

# networks:
#   wine:
#     driver: bridge