Merge pull request #21 from prrao87/weviate
Weaviate: An ML-first vector database for similarity/hybrid search
Showing 21 changed files with 1,486 additions and 1 deletion.
**`.gitignore`**

```diff
@@ -135,4 +135,4 @@ dmypy.json
 data/*.json
 data/*.jsonl
 dbs/meilisearch/meili_data
-dbs/qdrant/onnx_model/onnx
+*/*/onnx_model/onnx
```
**`.env`** (new file)

```sh
WEAVIATE_VERSION = "1.18.4"
WEAVIATE_PORT = 8080
WEAVIATE_HOST = "localhost"
WEAVIATE_SERVICE = "weaviate"
API_PORT = 8004
EMBEDDING_MODEL_CHECKPOINT = "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
ONNX_MODEL_FILENAME = "model_optimized_quantized.onnx"

# Container image tag
TAG = "0.1.0"

# Docker project namespace (defaults to the current folder name if not set)
COMPOSE_PROJECT_NAME = weaviate_wine
```
**`Dockerfile`** (new file)

```Dockerfile
FROM python:3.10-slim-bullseye

WORKDIR /wine

COPY ./requirements.txt /wine/requirements.txt

RUN pip install --no-cache-dir -U pip wheel setuptools
RUN pip install --no-cache-dir -r /wine/requirements.txt

COPY ./api /wine/api
COPY ./schemas /wine/schemas

EXPOSE 8000
```
**`Dockerfile` for the ONNX image** (new file)

```Dockerfile
FROM python:3.10-slim-bullseye

WORKDIR /wine

COPY ./requirements-onnx.txt /wine/requirements-onnx.txt

RUN pip install --no-cache-dir -U pip wheel setuptools
RUN pip install --no-cache-dir -r /wine/requirements-onnx.txt

COPY ./api /wine/api
COPY ./schemas /wine/schemas
COPY ./onnx_model /wine/onnx_model

EXPOSE 8000
```
**`README.md`** (new file)

# Weaviate

[Weaviate](https://weaviate.io/) is an ML-first vector search database written in Go. It stores data objects alongside their vector embeddings, scales to billions of objects, and allows sub-millisecond searches. The primary use case for a vector database is to retrieve the results that are most semantically similar to an input natural language query. Semantic similarity is obtained by comparing the sentence embeddings (which are n-dimensional vectors) of the input query with those of the data stored in the database. Most vector DBs, including Weaviate, store both the metadata (as JSON) and the sentence embeddings of the text we want to search on (as vectors), allowing much more flexible searches than keyword-only search databases. Weaviate even allows hybrid searches, giving developers the flexibility to decide which search methods work best on the data at hand.
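To make "semantic similarity" concrete: each piece of text is mapped to an n-dimensional vector, and two texts are considered similar when the angle between their vectors is small. Below is a minimal sketch of the cosine similarity computation (the metric behind the `cos` suffix in the embedding model used later; the 384 dimensions match the MiniLM-L6 model family):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 384-dim embeddings for a query and a stored wine review
query_vec = np.random.rand(384)
review_vec = np.random.rand(384)
print(cosine_similarity(query_vec, review_vec))  # closer to 1.0 => more similar
```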
Code is provided for ingesting the wine reviews dataset into Weaviate. In addition, a query API written in FastAPI is provided, allowing a user to query the available endpoints. As always with FastAPI, interactive documentation is available via OpenAPI (http://localhost:8005/docs).

* Unlike "normal" databases, in a vector DB the vectorization process is the biggest bottleneck, and because many vector DBs are relatively new, they do not yet support async indexing (although they might, soon).
* It doesn't make sense to focus on async requests for vector DBs at present -- rather, it makes more sense to focus on speeding up the vectorization process.
* [Pydantic](https://docs.pydantic.dev) is used for schema validation, both prior to data ingestion and during API request handling.
* For ease of reproducibility during development, the whole setup is orchestrated and deployed via Docker.
## Setup

Note that this code base has been tested on Python 3.10, and requires a minimum of Python 3.10 to work. Install dependencies via `requirements.txt`.

```sh
# Set up the environment for the first time
python -m venv weaviate_venv  # python -> python 3.10

# Activate the environment (for subsequent runs)
source weaviate_venv/bin/activate

python -m pip install -r requirements.txt
```

---
## Step 1: Set up containers

Docker compose files are provided, which start a persistent-volume Weaviate database with credentials specified in `.env`. The `weaviate` variable in the environment file, referenced under the `fastapi` service, indicates that the database service is opened up to FastAPI (running as a separate service, in a separate container) downstream. The two containers can communicate with one another over the common network they share, on the exact port numbers specified.
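As a quick sanity check that the two services can talk to each other, something like the following can be run from inside the `fastapi` container (a sketch, assuming the v3 `weaviate-client` and the service name/port set in `.env`):

```python
import weaviate

# Inside the compose network, the database is reachable via its service name
# ("weaviate", per WEAVIATE_SERVICE in .env), not via localhost
client = weaviate.Client("http://weaviate:8080")
print(client.is_ready())  # True once the Weaviate container is up and healthy
```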
The database and API services can be restarted at any time for maintenance and updates by simply running `docker restart <container_name>`.

**💡 Note:** The setup shown here would not be ideal in production, as there are other details related to security and scalability that are not addressed via simple Docker setups, but this is a good starting point to begin experimenting!

### Option 1: Use `sbert` model

If using the `sbert` model [from the sentence-transformers repo](https://www.sbert.net/) directly, use the provided `docker-compose.yml` to initiate two separate containers: one runs Weaviate, and the other serves as an API on top of the database.

**⚠️ Note**: This approach will attempt to run `sbert` on a GPU if available, and otherwise on CPU (while utilizing all CPU cores). This may not yield the fastest vectorization on CPU-only machines -- a more optimized version is provided [below](#option-2-use-onnxruntime-model).

```sh
docker compose -f docker-compose.yml up -d
```

Tear down the services using the following command.

```sh
docker compose -f docker-compose.yml down
```
### Option 2: Use `onnxruntime` model

An approach to make sentence embedding generation more efficient is to optimize and quantize the original `sbert` model via [ONNX (Open Neural Network Exchange)](https://huggingface.co/docs/transformers/serialization). This framework provides a standard interface for optimizing deep learning models and their computational graphs so that they execute much faster and with fewer resources on specialized runtimes and hardware.

To deploy the services with the optimized `sbert` model, use the provided `docker-compose-onnx.yml` to initiate two separate containers: one runs Weaviate, and the other serves as an API on top of the database.

**⚠️ Note**: This approach requires additional packages from Hugging Face, on top of the `sbert` modules. **Currently (as of early 2023), these only work on Python 3.10**, so make sure to use Python 3.10 if ONNX complains about module installations via `pip`.

```sh
docker compose -f docker-compose-onnx.yml up -d
```

Tear down the services using the following command.

```sh
docker compose -f docker-compose-onnx.yml down
```
## Step 2: Ingest the data

We ingest both the JSON data (for full-text search and filtering) and the sentence embedding vectors (for similarity search) into Weaviate. For this dataset, it's reasonable to expect that a simple concatenation of fields like `title`, `country`, `province`, `variety` and `description` results in a useful vector that can be compared against a search query, which is vectorized into the same embedding space.

As an example, consider the following data snippet from the `data/` directory in this repo:

```json
"variety": "Red Blend",
"country": "Italy",
"province": "Tuscany",
"title": "Castello San Donato in Perano 2009 Riserva (Chianti Classico)",
"description": "Made from a blend of 85% Sangiovese and 15% Merlot, this ripe wine delivers soft plum, black currants, clove and cracked pepper sensations accented with coffee and espresso notes. A backbone of firm tannins give structure. Drink now through 2019."
```

The above fields are concatenated for vectorization, and then indexed along with the data to Weaviate.
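In code, the concatenation step amounts to something like the sketch below; the exact field order and separator used by the ingestion scripts may differ.

```python
record = {
    "variety": "Red Blend",
    "country": "Italy",
    "province": "Tuscany",
    "title": "Castello San Donato in Perano 2009 Riserva (Chianti Classico)",
    "description": "Made from a blend of 85% Sangiovese and 15% Merlot, ...",
}

# Join the searchable fields into a single string to feed the embedding model
fields = ["title", "description", "variety", "country", "province"]
to_vectorize = " ".join(record[field] for field in fields)
```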
### Choice of embedding model

[SentenceTransformers](https://www.sbert.net/) is a Python framework for a range of sentence and text embeddings. It results from extensive work on fine-tuning BERT to work well on semantic similarity tasks using Siamese BERT networks, where the model is trained to predict the similarity between sentence pairs. The original work is [described here](https://arxiv.org/abs/1908.10084).

#### Why use sentence transformers?

Although larger and more powerful text embedding models exist (such as [OpenAI embeddings](https://platform.openai.com/docs/guides/embeddings)), they can get expensive quickly, as they are not free and charge per token of text. SentenceTransformers are free and open-source, and have been optimized over the years both to utilize all CPU cores and to reduce model size while maintaining accuracy. A full list of sentence transformer models [is in the project page](https://www.sbert.net/docs/pretrained_models.html).

For this work, it makes sense to use one of the fastest models in this list, the `multi-qa-MiniLM-L6-cos-v1` **uncased** model. As per the docs, it was tuned for semantic search and question answering, and generates sentence embeddings for single sentences or paragraphs up to a maximum sequence length of 512 tokens. It was trained on 215M question-answer pairs from various sources. Compared to the more general-purpose `all-MiniLM-L6-v2` model, it shows slightly improved results on semantic search tasks at a comparable speed. [See the sbert docs](https://www.sbert.net/docs/pretrained_models.html) for more details on performance comparisons between the various pretrained models.
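Generating embeddings with this model takes only a few lines with the `sentence-transformers` package; a minimal sketch:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")

# encode() accepts a single string or a list of strings
vector = model.encode("a bold Tuscan red blend with firm tannins")
print(vector.shape)  # (384,) -- the MiniLM-L6 models produce 384-dim embeddings
```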
### Build ONNX optimized model files

A key step, if using the ONNX runtime to speed up vectorization, is to build optimized and quantized models from the base `sbert` model. This is done by running the script `onnx_optimizer.py` in the `onnx_model/` directory.

The optimization/quantization is done using a modified version of [the methods in this blog post](https://www.philschmid.de/optimize-sentence-transformers). We only perform dynamic quantization for now, as static quantization requires a very hardware- and OS-specific set of instructions that doesn't generalize -- it only makes sense in a production environment that is expected to serve thousands of requests in a short time. As further reading, a detailed explanation of the difference between static and dynamic quantization [is available in the Hugging Face docs](https://huggingface.co/docs/optimum/concept_guides/quantization).

```sh
cd onnx_model
python onnx_optimizer.py  # python -> python 3.10
```

Running this script generates a new directory `onnx_model/onnx` with the optimized and quantized models, along with their associated model config files.

* `model_optimized.onnx`
* `model_optimized_quantized.onnx`

The `model_optimized_quantized.onnx` is a dynamically-quantized model file that is ~26% smaller than the original model in this case, and generates sentence embeddings roughly 1.8x faster than the original sentence transformers model, thanks to the optimized ONNX runtime. A more detailed blog post benchmarking these numbers will be published shortly!
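For reference, the optimize-then-quantize flow from the linked blog post looks roughly like the sketch below. The actual `onnx_optimizer.py` may differ in its details, and the `avx512_vnni` quantization config is an assumption targeting modern Intel CPUs (older `optimum` versions used `from_transformers=True` instead of `export=True`).

```python
from optimum.onnxruntime import ORTModelForFeatureExtraction, ORTOptimizer, ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig, OptimizationConfig
from transformers import AutoTokenizer

model_id = "sentence-transformers/multi-qa-MiniLM-L6-cos-v1"
save_dir = "onnx"

# Export the PyTorch model to ONNX
model = ORTModelForFeatureExtraction.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Graph optimization -> writes model_optimized.onnx to save_dir
optimizer = ORTOptimizer.from_pretrained(model)
optimizer.optimize(save_dir=save_dir, optimization_config=OptimizationConfig(optimization_level=99))

# Dynamic quantization (no calibration data needed, unlike static quantization)
# -> writes model_optimized_quantized.onnx to save_dir
quantizer = ORTQuantizer.from_pretrained(save_dir, file_name="model_optimized.onnx")
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=True)
quantizer.quantize(save_dir=save_dir, quantization_config=qconfig)

tokenizer.save_pretrained(save_dir)
```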
### Run data loader

Data is ingested into the Weaviate database through the scripts in the `scripts` directory. The scripts validate the input JSON data via [Pydantic](https://docs.pydantic.dev), and then index both the JSON data and the vectors to Weaviate using the [Weaviate Python client](https://github.com/weaviate/weaviate-python-client).

As mentioned before, the fields `variety`, `country`, `province`, `title` and `description` are concatenated, vectorized, and then indexed to Weaviate.
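The core of the indexing loop, using the v3 Python client's batching API, looks something like this sketch (the `Wine` class name and the placeholder vector are illustrative; the real scripts compute the vector from the concatenated fields):

```python
import weaviate

client = weaviate.Client("http://localhost:8080")
client.batch.configure(batch_size=100)

doc = {
    "title": "Castello San Donato in Perano 2009 Riserva (Chianti Classico)",
    "country": "Italy",
    "province": "Tuscany",
    "variety": "Red Blend",
}
vector = [0.0] * 384  # in practice: the sbert/ONNX embedding of the concatenated text

# The batch context manager accumulates objects and flushes them in chunks
with client.batch as batch:
    batch.add_data_object(data_object=doc, class_name="Wine", vector=vector)
```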
#### Option 1: Use `sbert`

If running on a MacBook or another machine without a GPU, it's possible to generate sentence embeddings using the original `sbert` model, as per the `EMBEDDING_MODEL_CHECKPOINT` variable in the `.env` file.

```sh
cd scripts
python bulk_index_sbert.py
```

#### Option 2: Use `onnx` quantized model

If running on a remote Linux CPU instance, it is highly recommended to use the ONNX quantized version of the `EMBEDDING_MODEL_CHECKPOINT` model specified in `.env`. On appropriate hardware (modern Intel chips), it can vastly outperform the original `sbert` model on a conventional CPU, allowing for lower-cost and higher-throughput indexing of much larger datasets, all with very low memory consumption (under 2 GB).

```sh
cd scripts
python bulk_index_onnx.py
```
### Time to index dataset

Because vectorizing a large dataset can be an expensive step, part of the goal of this exercise is to see whether we can do so on CPU, with the fewest resources possible.

In short, we are able to index all 129,971 wine reviews from the dataset in **28 min 30 sec**. The conditions under which this indexing time was achieved are listed below.

* Ubuntu 22.04 EC2 `T2.xlarge` instance on AWS (1 CPU with 4 cores, 16 GB of RAM)
* Python 3.10.10 (did not use Python 3.11 because ONNX doesn't support it yet)
* Quantized ONNX version of the `sentence-transformers/multi-qa-MiniLM-L6-cos-v1` sentence transformer
* Weaviate version `1.18.4`
## Step 3: Test API

Once the data has been successfully loaded into Weaviate and the containers are up and running, we can test out a search query via an HTTP request as follows.

```sh
curl -X 'GET' \
  'http://0.0.0.0:8005/wine/search?terms=tuscany%20red&max_price=100&country=Italy'
```

This cURL request passes the search terms "**tuscany red**", along with the country "Italy" and a maximum price of "100", to the `/wine/search` endpoint, which the FastAPI backend parses into a working filter query for Weaviate. The query retrieves results that are semantically similar to the input query for red Tuscan wines, and, if the setup was done correctly, we should see the following response:

```json
[
  {
    "id": 8456,
    "country": "Italy",
    "province": "Tuscany",
    "title": "Petra 2008 Petra Red (Toscana)",
    "description": "From one of Italy's most important showcase designer wineries, this blend of Cabernet Sauvignon and Merlot lives up to its super Tuscan celebrity. It is gently redolent of dark chocolate, ripe fruit, leather, tobacco and crushed black pepper—the bouquet's elegant moderation is one of its strongest points. The mouthfeel is rich, creamy and long. Drink after 2018.",
    "points": 92,
    "price": 80.0,
    "variety": "Red Blend",
    "winery": "Petra"
  },
  {
    "id": 896,
    "country": "Italy",
    "province": "Tuscany",
    "title": "Le Buche 2006 Giuseppe Olivi Memento Red (Toscana)",
    "description": "Le Buche is an interesting winery to watch, and its various Tuscan blends show great promise. Memento is equal parts Sangiovese and Syrah with a soft, velvety texture and a bright berry finish.",
    "points": 90,
    "price": 45.0,
    "variety": "Red Blend",
    "winery": "Le Buche"
  },
  {
    "id": 9343,
    "country": "Italy",
    "province": "Tuscany",
    "title": "Poggio Mandorlo 2008 Red (Toscana)",
    "description": "Made from Merlot and Cabernet Franc, this structured red offers aromas of black currant, toast, graphite and a whiff of cedar. The firm palate offers coconut, coffee, grilled sage and red berry alongside bracing tannins. Drink sooner rather than later to capture the fruit richness.",
    "points": 89,
    "price": 60.0,
    "variety": "Red Blend",
    "winery": "Poggio Mandorlo"
  }
]
```

Not bad! This example correctly returns some highly rated Tuscan red wines from Italy, along with their price. More specific search queries, such as low/high acidity or the flavour profiles of wines, can also be entered to get more relevant results by country.
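Under the hood, the backend vectorizes the search terms and issues a GraphQL query through the v3 Python client. A sketch of what such a query might look like (the `Wine` class, property names and filter structure here are assumptions based on the examples above):

```python
import weaviate
from sentence_transformers import SentenceTransformer

client = weaviate.Client("http://localhost:8080")
model = SentenceTransformer("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")

# Vectorize the raw search terms into the same embedding space as the data
near_vec = {"vector": model.encode("tuscany red").tolist()}

# Structured filters are applied on top of the vector search
where_filter = {
    "operator": "And",
    "operands": [
        {"path": ["country"], "operator": "Equal", "valueText": "Italy"},
        {"path": ["price"], "operator": "LessThanEqual", "valueNumber": 100},
    ],
}

response = (
    client.query.get("Wine", ["title", "description", "country", "province", "price", "points"])
    .with_near_vector(near_vec)
    .with_where(where_filter)
    .with_limit(3)
    .do()
)
print(response["data"]["Get"]["Wine"])
```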
## Step 4: Extend the API

The API can be easily extended with the provided structure.

- The `schemas` directory houses the Pydantic schemas, both for the data input and for the endpoint outputs
  - As the data model gets more complex, we can add more files and separate the ingestion logic from the API logic here
- The `api/routers` directory contains the endpoint routes, so that we can provide additional endpoints that answer more business questions
  - For example: "What are the top rated wines from Argentina?" (see the sketch after this list)
  - In general, it makes sense to organize specific business use cases into their own router files
- The `api/main.py` file collects all the routes and schemas to run the API
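As an illustration, a hypothetical router for the Argentina question might look like the sketch below. The file name, route and sort clause are made up for this example, and the router would still need to be registered in `api/main.py` via `app.include_router`.

```python
# api/routers/country.py (hypothetical file)
from fastapi import APIRouter, Query, Request

country_router = APIRouter()


@country_router.get("/top_by_country")
async def top_by_country(
    request: Request,
    country: str = Query(..., description="Country to rank wines for"),
):
    # request.app.client is the Weaviate client attached in the lifespan handler
    response = (
        request.app.client.query.get("Wine", ["title", "points", "price"])
        .with_where({"path": ["country"], "operator": "Equal", "valueText": country})
        .with_sort({"path": ["points"], "order": "desc"})
        .with_limit(5)
        .do()
    )
    return response["data"]["Get"]["Wine"]
```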
#### Existing endpoints

As an example, some search endpoints are implemented and can be accessed via the API at the following URLs.

```
GET
/wine/search
Semantic similarity search
```

```
GET
/wine/search_by_country
Semantic similarity search for wines by country
```

```
GET
/wine/search_by_filters
Semantic similarity search for wines by country, price and points (review ratings)
```

```
GET
/wine/count_by_country
Get counts of wines by country
```

```
GET
/wine/count_by_filters
Get counts of wines by country, price and points (review ratings)
```
Empty file.
**`api/config.py`** (new file)

```python
from pydantic import BaseSettings


class Settings(BaseSettings):
    # Each field maps (case-insensitively) to a variable of the same name in `.env`
    weaviate_service: str
    weaviate_port: str
    weaviate_host: str
    api_port: str
    embedding_model_checkpoint: str
    onnx_model_filename: str
    tag: str

    class Config:
        env_file = ".env"
```
**`api/main.py`** (new file)

```python
from collections.abc import AsyncGenerator
from contextlib import asynccontextmanager
from functools import lru_cache

import weaviate
from fastapi import FastAPI

from api.config import Settings
from api.routers.wine import wine_router

try:
    from optimum.onnxruntime import ORTModelForCustomTasks
    from optimum.pipelines import pipeline
    from transformers import AutoTokenizer

    model_type = "onnx"
except ModuleNotFoundError:
    from sentence_transformers import SentenceTransformer

    model_type = "sbert"


@lru_cache()
def get_settings():
    # Use lru_cache to avoid loading .env file for every request
    return Settings()


def get_embedding_pipeline(onnx_path, model_filename: str):
    """
    Create a sentence embedding pipeline using the optimized ONNX model, if available in the environment
    """
    # Reload tokenizer
    tokenizer = AutoTokenizer.from_pretrained(onnx_path)
    optimized_model = ORTModelForCustomTasks.from_pretrained(onnx_path, file_name=model_filename)
    embedding_pipeline = pipeline("feature-extraction", model=optimized_model, tokenizer=tokenizer)
    return embedding_pipeline


@asynccontextmanager
async def lifespan(app: FastAPI) -> AsyncGenerator[None, None]:
    """Async context manager for Weaviate database connection."""
    settings = get_settings()
    model_checkpoint = settings.embedding_model_checkpoint
    if model_type == "sbert":
        app.model = SentenceTransformer(model_checkpoint)
        app.model_type = "sbert"
    elif model_type == "onnx":
        app.model = get_embedding_pipeline(
            "onnx_model/onnx", model_filename=settings.onnx_model_filename
        )
        app.model_type = "onnx"
    # Create Weaviate client
    HOST = settings.weaviate_service
    PORT = settings.weaviate_port
    app.client = weaviate.Client(f"http://{HOST}:{PORT}")
    print("Successfully connected to Weaviate")
    yield
    print("Successfully closed Weaviate connection and released resources")


app = FastAPI(
    title="REST API for wine reviews on Weaviate",
    description=(
        "Query from a Weaviate database of 130k wine reviews from the Wine Enthusiast magazine"
    ),
    version=get_settings().tag,
    lifespan=lifespan,
)


@app.get("/", include_in_schema=False)
async def root():
    return {
        "message": "REST API for querying Weaviate database of 130k wine reviews from the Wine Enthusiast magazine"
    }


# Attach routes
app.include_router(wine_router, prefix="/wine", tags=["wine"])
```