Merge pull request #17 from prrao87/qdrant

Qdrant: A vector database built on Rust
prrao87 · Apr 22, 2023 · dc1607d
2 parents def1585 + 1ee5c02

Showing 19 changed files with 942 additions and 2 deletions.
diff --git a/.gitignore b/.gitignore
@@ -134,4 +134,5 @@ dmypy.json
 # data
 data/*.json
 data/*.jsonl
-*/*/meili_data
+*/*/meili_data
+dbs/qdrant/scripts/onnx_models
diff --git a/README.md b/README.md
@@ -8,10 +8,10 @@ Example code is provided for numerous databases, along with FastAPI docker deplo
 * Neo4j
 * Elasticsearch
 * Meilisearch
+* Qdrant
 
 #### 🚧 Coming soon
 
-* Qdrant
 * Weaviate
 
 

diff --git a/dbs/qdrant/.env.example b/dbs/qdrant/.env.example
@@ -0,0 +1,12 @@
+QDRANT_VERSION = "v1.1.1"
+QDRANT_PORT = 6333
+QDRANT_HOST = "localhost"
+QDRANT_SERVICE = "qdrant"
+API_PORT = 8005
+EMBEDDING_MODEL_CHECKPOINT = "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
+
+# Container image tag
+TAG = "0.1.0"
+
+# Docker project namespace (defaults to the current folder name if not set)
+COMPOSE_PROJECT_NAME = qdrant_wine
diff --git a/dbs/qdrant/Dockerfile b/dbs/qdrant/Dockerfile
@@ -0,0 +1,14 @@
+FROM python:3.10-slim-bullseye
+
+WORKDIR /wine
+
+COPY ./requirements-docker.txt /wine/requirements-docker.txt
+
+RUN pip install --no-cache-dir -U pip wheel setuptools
+RUN pip install --no-cache-dir -r /wine/requirements-docker.txt
+
+COPY ./api /wine/api
+COPY ./schemas /wine/schemas
+COPY ./scripts/onnx_models /wine/scripts/onnx_models
+
+EXPOSE 8000
diff --git a/dbs/qdrant/README.md b/dbs/qdrant/README.md
@@ -0,0 +1,158 @@
+# Qdrant
+
+[Qdrant](https://qdrant.tech/) is a vector database and vector similarity search engine written in Rust. The primary use case for a vector database is semantic similarity search: retrieving the items whose stored vector embeddings are closest in meaning to a query, even when they share few exact keywords with it.
+
+* Which wines best match a free-text query such as "tuscany red"?
+* Which of those matches cost no more than a given maximum price?
+
+Code is provided for ingesting the wine reviews dataset into Qdrant in an async fashion. In addition, a query API written in FastAPI is also provided that allows a user to query available endpoints. As always in FastAPI, documentation is available via OpenAPI (http://localhost:8000/docs).
+
+* All code (wherever possible) is async
+* [Pydantic](https://docs.pydantic.dev) is used for schema validation, both prior to data ingestion and during API request handling
+  * The same schema is used for data ingestion and for the API, so there is only one source of truth regarding how the data is handled
+* For ease of reproducibility, the whole setup is orchestrated and deployed via docker
+
+## Setup
+
+Note that this code base has been tested in Python 3.10, and requires a minimum of Python 3.10 to work. Install dependencies via `requirements.txt`.
+
+```sh
+# Setup the environment for the first time
+python -m venv qdrant_venv  # python -> python 3.10
+
+# Activate the environment (for subsequent runs)
+source qdrant_venv/bin/activate
+
+python -m pip install -r requirements.txt
+```
+
+--- 
+
+## Step 1: Set up containers
+
+Use the provided `docker-compose.yml` to initiate separate containers, one that runs Qdrant, and another one that serves as an API on top of the database.
+
+```
+docker compose up -d
+```
+
+This compose file starts a Qdrant database backed by a persistent volume, configured with the values specified in `.env`. The `QDRANT_SERVICE` variable in the environment file names the database service so that a FastAPI server (running as a separate service, in a separate container) can reach it downstream. Both containers communicate with one another over the network that they share, on the exact port numbers specified.
+
+The services can be stopped at any time for maintenance and updates.
+
+```
+docker compose down
+```
+
+**Note:** The setup shown here is not ideal for production, as it leaves out details related to security and scalability that a simple docker setup does not address. It is, however, a good starting point to begin experimenting!
+
+
+## Step 2: Ingest the data
+
+Because Qdrant is a vector database, we ingest not only the wine reviews JSON blobs for each item, but also vectors (i.e., sentence embeddings) for the fields on which we want to perform a semantic similarity search. For this dataset, it's reasonable to expect that a simple concatenation of fields like `title`, `variety` and `description` would result in a useful sentence embedding that can be compared against a search query (which is also converted to a vector during query time).
+
+As an example, consider the following data snippet from the `data/` directory in this repo:
+
+```json
+"title": "Castello San Donato in Perano 2009 Riserva  (Chianti Classico)",
+"description": "Made from a blend of 85% Sangiovese and 15% Merlot, this ripe wine delivers soft plum, black currants, clove and cracked pepper sensations accented with coffee and espresso notes. A backbone of firm tannins give structure. Drink now through 2019.",
+"variety": "Red Blend"
+```
+
+### Choice of embedding model
+
+[SentenceTransformers](https://www.sbert.net/) is a Python framework for a range of sentence and text embeddings. It results from extensive work on fine-tuning BERT to work well on semantic similarity tasks using Siamese BERT networks, where the model is trained to predict the similarity between sentence pairs. The original work is [described here](https://arxiv.org/abs/1908.10084).
+
+#### Why use sentence transformers?
+
+Although larger and more powerful text embedding models exist (such as [OpenAI embeddings](https://platform.openai.com/docs/guides/embeddings)), they can become expensive, because they charge per token of text they generate vectors for. SentenceTransformers models are free and open-source, and have been optimized for years for performance (to utilize all CPU cores) as well as accuracy. A full list of sentence transformer models [is on their project page](https://www.sbert.net/docs/pretrained_models.html).
+
+For this work, it makes sense to use one of the fastest models in this list, the `multi-qa-MiniLM-L6-cos-v1` **uncased** model. As the name suggests, it was tuned for semantic search and question answering, and generates sentence embeddings for single sentences or paragraphs up to a maximum sequence length of 512. It was trained on 215M question answer pairs from various sources. Compared to the more general-purpose `all-MiniLM-L6-v2` model, it shows slightly improved performance on semantic search tasks while offering a similar level of speed. [See the sbert docs](https://www.sbert.net/docs/pretrained_models.html) for more details on performance comparisons between the various pretrained models.
+
+
+### Run data loader
+
+Data is ingested into the Qdrant database through the scripts in the `scripts` directory.
+
+```sh
+cd scripts
+python bulk_index_sbert.py
+```
+
+This script validates the input JSON data via [Pydantic](https://docs.pydantic.dev), and then indexes it to Qdrant using the [Qdrant Python client](https://github.com/qdrant/qdrant-client).
+
+We simply concatenate the key fields that contain useful information about each wine, and vectorize them prior to indexing them to the database.
+
+
+## Step 3: Test API
+
+Once the data has been successfully loaded into Qdrant and the containers are up and running, we can test out a search query via an HTTP request as follows.
+
+```sh
+curl -X 'GET' \
+  'http://localhost:8000/wine/search?terms=tuscany%20red&max_price=50'
+```
+
+This cURL request passes the search terms "**tuscany red**" and a maximum price of 50 to the `/wine/search` endpoint. The FastAPI backend converts the search terms into a sentence embedding, and Qdrant returns the wines whose stored vectors are most similar to it, restricted to those within the price limit. If the setup was done correctly, we should see the following response:
+
+```json
+[
+    {
+        "wineID": 66393,
+        "country": "Italy",
+        "title": "Capezzana 1999 Ghiaie Della Furba Red (Tuscany)",
+        "description": "Very much a baby, this is one big, bold, burly Cab-Merlot-Syrah blend that's filled to the brim with extracted plum fruit, bitter chocolate and earth. It takes a long time in the glass for it to lose its youthful, funky aromatics, and on the palate things are still a bit scattered. But in due time things will settle and integrate",
+        "points": 90,
+        "price": 49,
+        "variety": "Red Blend",
+        "winery": "Capezzana"
+    },
+    {
+        "wineID": 40960,
+        "country": "Italy",
+        "title": "Fattoria di Grignano 2011 Pietramaggio Red (Toscana)",
+        "description": "Here's a simple but well made red from Tuscany that has floral aromas of violet and rose with berry notes. The palate offers bright cherry, red currant and a touch of spice. Pair this with pasta dishes or grilled vegetables.",
+        "points": 86,
+        "price": 11,
+        "variety": "Red Blend",
+        "winery": "Fattoria di Grignano"
+    },
+    {
+        "wineID": 73595,
+        "country": "Italy",
+        "title": "I Giusti e Zanza 2011 Belcore Red (Toscana)",
+        "description": "With aromas of violet, tilled soil and red berries, this blend of Sangiovese and Merlot recalls sunny Tuscany. It's loaded with wild cherry flavors accented by white pepper, cinnamon and vanilla. The palate is uplifted by vibrant acidity and fine tannins.",
+        "points": 89,
+        "price": 27,
+        "variety": "Red Blend",
+        "winery": "I Giusti e Zanza"
+    }
+]
+```
+
+Not bad! This example correctly returns some highly rated Tuscan red wines along with their price and country of origin (obviously, Italy in this case).
+
+## Step 4: Extend the API
+
+The API can be easily extended with the provided structure.
+
+- The `schemas` directory houses the Pydantic schemas, both for the data input as well as for the endpoint outputs
+  - As the data model gets more complex, we can add more files and separate the ingestion logic from the API logic here
+- The `api/routers` directory contains the endpoint routes so that we can provide additional endpoints that answer more business questions
+  - E.g.: "What are the top rated wines from Argentina?"
+  - In general, it makes sense to organize specific business use cases into their own router files
+- The `api/main.py` file collects all the routes and schemas to run the API
+
+
+#### Existing endpoints
+
+So far, the following endpoints that help answer interesting questions have been implemented.
+
+```
+GET
+/wine/search
+Semantic similarity search
+```
+
+More to come soon!
+
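Note on the ingestion step: as a rough illustration of what the README above describes, the following sketch concatenates the `title`, `variety` and `description` fields of a review into the single text that would be passed to the embedding model. The helper name `build_embedding_text` and its `fields` default are assumptions made for this example and are not code from this pull request; the description is shortened from the README's sample.

```python
def build_embedding_text(review: dict, fields=("title", "variety", "description")) -> str:
    """Join the selected text fields into one string to be embedded.

    Hypothetical helper: missing or empty fields are skipped and
    repeated whitespace is collapsed to single spaces.
    """
    parts = [str(review[f]).strip() for f in fields if review.get(f)]
    return " ".join(" ".join(parts).split())


review = {
    "title": "Castello San Donato in Perano 2009 Riserva  (Chianti Classico)",
    "variety": "Red Blend",
    "description": "A backbone of firm tannins give structure.",
    "price": None,  # non-text fields are simply ignored
}
text = build_embedding_text(review)
print(text)
```

The same function could be applied to the search terms' counterpart on the ingestion side only; the query text itself is embedded as-is by the API.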
diff --git a/dbs/qdrant/api/__init__.py b/dbs/qdrant/api/__init__.py
diff --git a/dbs/qdrant/api/config.py b/dbs/qdrant/api/config.py
@@ -0,0 +1,14 @@
+from pydantic import BaseSettings
+
+
+class Settings(BaseSettings):
+    qdrant_service: str
+    qdrant_port: str
+    qdrant_host: str
+    qdrant_service: str
+    api_port: str
+    embedding_model_checkpoint: str
+    tag: str
+
+    class Config:
+        env_file = ".env"
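Note on field declarations: a settings class like the one above declares its fields through type annotations. The stdlib-only sketch below (no Pydantic involved; the class and function names are made up for illustration) shows why `name: str` and `name = str` are not interchangeable in a class body: the first records an annotation, while the second merely binds the `str` type to a class attribute.

```python
class WithAnnotation:
    api_port: str  # annotation only: declares a field of type str


class WithAssignment:
    api_port = str  # assignment: the attribute's value is the str type itself


def declared_fields(cls: type) -> list[str]:
    """Return the names a class declares via annotations."""
    return list(getattr(cls, "__annotations__", {}))


print(declared_fields(WithAnnotation))
print(declared_fields(WithAssignment))
print(WithAssignment.api_port is str)
```

Any library that builds its schema from annotations would therefore see a field in the first class only.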
diff --git a/dbs/qdrant/api/main.py b/dbs/qdrant/api/main.py
@@ -0,0 +1,52 @@
+from collections.abc import AsyncGenerator
+from contextlib import asynccontextmanager
+from functools import lru_cache
+
+from fastapi import FastAPI
+from qdrant_client import QdrantClient
+
+from api.config import Settings
+from api.routers.wine import wine_router
+
+from scripts.onnx_optimizer import get_embedding_pipeline
+
+
+@lru_cache()
+def get_settings():
+    # Use lru_cache to avoid loading .env file for every request
+    return Settings()
+
+
+@asynccontextmanager
+async def lifespan(app: FastAPI) -> AsyncGenerator[None, None]:
+    """Async context manager for Qdrant database connection."""
+    settings = get_settings()
+    model_checkpoint = settings.embedding_model_checkpoint
+    app.model = get_embedding_pipeline(
+        "scripts/onnx_models", model_filename="model_optimized_quantized.onnx"
+    )
+    app.client = QdrantClient(host=settings.qdrant_service, port=settings.qdrant_port)
+    print("Successfully connected to Qdrant")
+    yield
+    print("Successfully closed Qdrant connection and released resources")
+
+
+app = FastAPI(
+    title="REST API for wine reviews on Qdrant",
+    description=(
+        "Query from a Qdrant database of 130k wine reviews from the Wine Enthusiast magazine"
+    ),
+    version=get_settings().tag,
+    lifespan=lifespan,
+)
+
+
+@app.get("/", include_in_schema=False)
+async def root():
+    return {
+        "message": "REST API for querying Qdrant database of 130k wine reviews from the Wine Enthusiast magazine"
+    }
+
+
+# Attach routes
+app.include_router(wine_router, prefix="/wine", tags=["wine"])
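Note on the lifespan pattern: the `lifespan` function above attaches shared resources to the app before the `yield` and releases them after it. The sketch below reproduces that flow with the standard library only, so it can be run without FastAPI or Qdrant. The `SimpleNamespace` app and the dictionary standing in for a database client are assumptions made purely for illustration.

```python
import asyncio
from collections.abc import AsyncGenerator
from contextlib import asynccontextmanager
from types import SimpleNamespace


@asynccontextmanager
async def lifespan(app: SimpleNamespace) -> AsyncGenerator[None, None]:
    # Startup: attach shared resources once, before any request is served
    app.client = {"connected": True}  # stand-in for a database client
    try:
        yield
    finally:
        # Shutdown: runs when the context exits, even if an error occurred
        app.client["connected"] = False


async def main() -> list[bool]:
    app = SimpleNamespace()
    states = []
    async with lifespan(app):
        states.append(app.client["connected"])  # resource is live here
    states.append(app.client["connected"])  # released after exit
    return states


states = asyncio.run(main())
print(states)
```

Wrapping the `yield` in `try`/`finally` is one way to make sure the cleanup code runs regardless of how the application stops.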
diff --git a/dbs/qdrant/api/routers/wine.py b/dbs/qdrant/api/routers/wine.py
@@ -0,0 +1,72 @@
+from qdrant_client import QdrantClient
+from qdrant_client.http import models
+from fastapi import APIRouter, HTTPException, Query, Request
+from optimum.pipelines import pipeline
+
+from schemas.retriever import SimilaritySearch
+
+wine_router = APIRouter()
+
+
+# --- Routes ---
+
+
+@wine_router.get(
+    "/search",
+    response_model=list[SimilaritySearch],
+    response_description="Search wines by title, description and variety",
+)
+def search_by_keywords(
+    request: Request,
+    terms: str = Query(description="Search wine by keywords in title, description and variety"),
+    max_price: float = Query(
+        default=10000.0, description="Specify the maximum price for the wine (e.g., 30)"
+    ),
+) -> list[SimilaritySearch] | None:
+    model = request.app.model
+    client = request.app.client
+    collection = "wines"
+    result = _search_by_keywords(client, model, collection, terms, max_price)
+    if not result:
+        raise HTTPException(
+            status_code=404,
+            detail=f"No wine with the provided terms '{terms}' found in database - please try again",
+        )
+    return result
+
+
+# --- Helper functions ---
+
+
+def _search_by_keywords(
+    client: QdrantClient, model: pipeline, collection: str, terms: str, max_price: float
+) -> list[SimilaritySearch] | None:
+    """Convert input text query into a vector for lookup in the db"""
+    vector = model(terms)[0][0]
+
+    # Define a range filter for wine price
+    filter = models.Filter(
+        **{
+            "must": [
+                {
+                    "key": "price",
+                    "range": {
+                        "lte": max_price,
+                    },
+                }
+            ]
+        }
+    )
+
+    # Use `vector` for similarity search on the closest vectors in the collection
+    search_result = client.search(
+        collection_name=collection, query_vector=vector, query_filter=filter, limit=5
+    )
+    # `search_result` contains found vector ids with similarity scores along with the stored payload
+    # For now we are interested in payload only
+    payloads = [hit.payload for hit in search_result]
+    # Qdrant doesn't appear to have a sort option for fields other than similarity score, so we sort by points ourselves
+    payloads = sorted(payloads, key=lambda x: x["points"], reverse=True)
+    if not payloads:
+        return None
+    return payloads
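Note on the search flow: `_search_by_keywords` above filters by price, asks the database for the closest vectors, and then sorts the returned payloads by points. The sketch below walks through the same three steps in memory as a brute-force stand-in for what the database does with its index. The `cosine` and `search` functions, the choice of cosine similarity, and the 2-dimensional vectors are all assumptions invented for this example.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def search(query_vec, items, max_price, limit=5):
    # 1. Keep only items within the price limit (the "must" range filter)
    candidates = [it for it in items if it["payload"]["price"] <= max_price]
    # 2. Rank the remaining items by similarity to the query vector
    candidates.sort(key=lambda it: cosine(query_vec, it["vector"]), reverse=True)
    # 3. Keep the closest `limit` hits, then order their payloads by points
    payloads = [it["payload"] for it in candidates[:limit]]
    return sorted(payloads, key=lambda p: p["points"], reverse=True)


# Toy vectors and payloads, invented for this example
items = [
    {"vector": [1.0, 0.0], "payload": {"wineID": 1, "points": 86, "price": 11}},
    {"vector": [0.9, 0.1], "payload": {"wineID": 2, "points": 90, "price": 49}},
    {"vector": [0.0, 1.0], "payload": {"wineID": 3, "points": 95, "price": 20}},
    {"vector": [1.0, 0.1], "payload": {"wineID": 4, "points": 99, "price": 80}},
]
results = search([1.0, 0.0], items, max_price=50, limit=2)
print([p["wineID"] for p in results])
```

Note that the final sort only reorders the hits that survived the similarity cut-off: wine 3 has more points than wines 1 and 2 but is never returned, because its vector is not among the closest two.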
diff --git a/dbs/qdrant/docker-compose.yml b/dbs/qdrant/docker-compose.yml
@@ -0,0 +1,37 @@
+version: "3"
+
+services:
+  qdrant:
+    image: qdrant/qdrant:${QDRANT_VERSION}
+    restart: unless-stopped
+    environment:
+      - QDRANT_HOST=${QDRANT_HOST}
+    ports:
+      - ${QDRANT_PORT}:6333
+    volumes:
+      - qdrant_storage:/qdrant/storage
+  #   networks:
+  #     - wine
+
+  # fastapi:
+  #   image: qdrant_wine_fastapi:${TAG}
+  #   build: .
+  #   restart: unless-stopped
+  #   env_file:
+  #     - .env
+  #   ports:
+  #     - ${API_PORT}:8000
+  #   depends_on:
+  #     - qdrant
+  #   volumes:
+  #     - ./:/wine
+  #   networks:
+  #     - wine
+  #   command: uvicorn api.main:app --host 0.0.0.0 --port 8000 --reload
+
+volumes:
+  qdrant_storage:
+
+# networks:
+#   wine:
+#     driver: bridge