Skip to content

Commit

Permalink
Enabled document preparation and vectorstores microservices on Redis …
Browse files Browse the repository at this point in the history
…and Qdrant (opea-project#55)

* enabled vectordb service for redis.

Signed-off-by: Ye, Xinyu <[email protected]>

* enabled vectordb service for qdrant.

Signed-off-by: Ye, Xinyu <[email protected]>

* enabled dataprep for redis and qdrant.

Signed-off-by: Ye, Xinyu <[email protected]>

* added readme for dataprep and vectorstores on redis and qdrant.

Signed-off-by: Ye, Xinyu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* added requirements.

Signed-off-by: Ye, Xinyu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* added dockers in dataprep.

Signed-off-by: Ye, Xinyu <[email protected]>

* modified readme.

Signed-off-by: Ye, Xinyu <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Ye, Xinyu <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
  • Loading branch information
XinyuYe-Intel and pre-commit-ci[bot] authored May 16, 2024
1 parent 4ecd616 commit a7c5ebd
Show file tree
Hide file tree
Showing 23 changed files with 542 additions and 0 deletions.
1 change: 1 addition & 0 deletions comps/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,7 @@
LLMParamsDoc,
SearchedDoc,
TextDoc,
DocPath,
)

# Microservice
Expand Down
4 changes: 4 additions & 0 deletions comps/cores/proto/docarray.py
Original file line number Diff line number Diff line change
Expand Up @@ -29,6 +29,10 @@ class Base64ByteStrDoc(BaseDoc):
byte_str: str


class DocPath(BaseDoc):
path: str


class EmbedDoc768(BaseDoc):
text: str
embedding: conlist(float, min_length=768, max_length=768)
Expand Down
49 changes: 49 additions & 0 deletions comps/dataprep/langchain/qdrant/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
# 🚀Start Microservice with Python

## Install Requirements

```bash
pip install -r requirements.txt
```

## Start Qdrant server

Please refer to this [readme](../../../vectorstores/langchain/qdrant/README.md).

## Start document preparation microservice for Qdrant with Python Script

Start document preparation microservice for Qdrant with below command.

```bash
python prepare_doc_qdrant.py
```

# 🚀Start Microservice with Docker

## Build Docker Image

```bash
cd ../../
docker build -t opea/gen-ai-comps:dataprep-qdrant-xeon-server --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/dataprep/langchain/qdrant/docker/Dockerfile .
```

## Run Docker with CLI

```bash
docker run -d --name="dataprep-qdrant-server" -p 8000:8000 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy opea/gen-ai-comps:dataprep-qdrant-xeon-server
```

## Run Docker with Docker Compose

```bash
cd docker
docker compose -f docker-compose-dataprep-qdrant.yaml up -d
```

# Invoke Microservices

Once document preparation microservice for Qdrant is started, user can use below command to invoke the microservice to convert the document to embedding and save to the database.

```bash
curl -X POST -H "Content-Type: application/json" -d '{"path":"/path/to/document"}' http://localhost:6000/v1/dataprep
```
13 changes: 13 additions & 0 deletions comps/dataprep/langchain/qdrant/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
# Copyright (c) 2024 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
28 changes: 28 additions & 0 deletions comps/dataprep/langchain/qdrant/config.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
# Copyright (c) 2024 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os

# Embedding model
EMBED_MODEL = os.getenv("EMBED_MODEL", "sentence-transformers/all-MiniLM-L6-v2")

# Qdrant configuration
QDRANT_HOST = os.getenv("QDRANT", "localhost")
QDRANT_PORT = int(os.getenv("QDRANT_PORT", 6333))
COLLECTION_NAME = os.getenv("COLLECTION_NAME", "rag-qdrant")

# LLM/Embedding endpoints
TGI_LLM_ENDPOINT = os.getenv("TGI_LLM_ENDPOINT", "http://localhost:8080")
TGI_LLM_ENDPOINT_NO_RAG = os.getenv("TGI_LLM_ENDPOINT_NO_RAG", "http://localhost:8081")
TEI_EMBEDDING_ENDPOINT = os.getenv("TEI_ENDPOINT")
40 changes: 40 additions & 0 deletions comps/dataprep/langchain/qdrant/docker/Dockerfile
Original file line number Diff line number Diff line change
@@ -0,0 +1,40 @@
# Copyright (c) 2024 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

FROM python:3.11-slim

ENV LANG C.UTF-8

RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missing \
libgl1-mesa-glx \
libjemalloc-dev \
vim

RUN useradd -m -s /bin/bash user && \
mkdir -p /home/user && \
chown -R user /home/user/

USER user

COPY comps /home/user/comps

RUN pip install --no-cache-dir --upgrade pip && \
pip install --no-cache-dir -r /home/user/comps/dataprep/langchain/requirements.txt

ENV PYTHONPATH=$PYTHONPATH:/home/user

WORKDIR /home/user/comps/dataprep/langchain/redis

ENTRYPOINT ["python", "prepare_doc_redis.py"]

Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
# Copyright (c) 2024 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

version: "3"
services:
qdrant-vector-db:
image: qdrant/qdrant
container_name: qdrant-vector-db
ports:
- "6333:6333"
- "6334:6334"
72 changes: 72 additions & 0 deletions comps/dataprep/langchain/qdrant/prepare_doc_qdrant.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,72 @@
# Copyright (c) 2024 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os

from config import COLLECTION_NAME, EMBED_MODEL, QDRANT_HOST, QDRANT_PORT
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceBgeEmbeddings, HuggingFaceEmbeddings, HuggingFaceHubEmbeddings
from langchain_community.vectorstores import Qdrant

from comps import DocPath, opea_microservices, register_microservice
from comps.dataprep.langchain.utils import docment_loader

tei_embedding_endpoint = os.getenv("TEI_ENDPOINT")


@register_microservice(
name="opea_service@prepare_doc_qdrant",
expose_endpoint="/v1/dataprep",
host="0.0.0.0",
port=6000,
input_datatype=DocPath,
output_datatype=None,
)
def ingest_documents(doc_path: DocPath):
"""Ingest document to Qdrant."""
doc_path = doc_path.path
print(f"Parsing document {doc_path}.")

text_splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=100, add_start_index=True)
content = docment_loader(doc_path)
chunks = text_splitter.split_text(content)

print("Done preprocessing. Created ", len(chunks), " chunks of the original pdf")
# Create vectorstore
if tei_embedding_endpoint:
# create embeddings using TEI endpoint service
embedder = HuggingFaceHubEmbeddings(model=tei_embedding_endpoint)
else:
# create embeddings using local embedding model
embedder = HuggingFaceBgeEmbeddings(model_name=EMBED_MODEL)

# Batch size
batch_size = 32
num_chunks = len(chunks)
for i in range(0, num_chunks, batch_size):
batch_chunks = chunks[i : i + batch_size]
batch_texts = batch_chunks

_ = Qdrant.from_texts(
texts=batch_texts,
embedding=embedder,
collection_name=COLLECTION_NAME,
host=QDRANT_HOST,
port=QDRANT_PORT,
)
print(f"Processed batch {i//batch_size + 1}/{(num_chunks-1)//batch_size + 1}")


if __name__ == "__main__":
opea_microservices["opea_service@prepare_doc_qdrant"].start()
10 changes: 10 additions & 0 deletions comps/dataprep/langchain/qdrant/requirements.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
docarray[full]
easyocr
fastapi
fitz
huggingface_hub
langchain
numpy
Pillow
sentence_transformers
shortuuid
49 changes: 49 additions & 0 deletions comps/dataprep/langchain/redis/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,49 @@
# 🚀Start Microservice with Python

## Install Requirements

```bash
pip install -r requirements.txt
```

## Start Redis Stack Server

Please refer to this [readme](../../../vectorstores/langchain/redis/README.md).

## Start Document Preparation Microservice for Redis with Python Script

Start document preparation microservice for Redis with below command.

```bash
python prepare_doc_redis.py
```

# 🚀Start Microservice with Docker

## Build Docker Image

```bash
cd ../../
docker build -t opea/gen-ai-comps:dataprep-redis-xeon-server --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/dataprep/langchain/redis/docker/Dockerfile .
```

## Run Docker with CLI

```bash
docker run -d --name="dataprep-redis-server" -p 8000:8000 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy opea/gen-ai-comps:dataprep-redis-xeon-server
```

## Run Docker with Docker Compose

```bash
cd docker
docker compose -f docker-compose-dataprep-redis.yaml up -d
```

# Invoke Microservices

Once document preparation microservice for Redis is started, user can use below command to invoke the microservice to convert the document to embedding and save to the database.

```bash
curl -X POST -H "Content-Type: application/json" -d '{"path":"/path/to/document"}' http://localhost:6000/v1/dataprep
```
13 changes: 13 additions & 0 deletions comps/dataprep/langchain/redis/__init__.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,13 @@
# Copyright (c) 2024 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
File renamed without changes.
40 changes: 40 additions & 0 deletions comps/dataprep/langchain/redis/docker/Dockerfile
Original file line number Diff line number Diff line change
@@ -0,0 +1,40 @@
# Copyright (c) 2024 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

FROM python:3.11-slim

ENV LANG C.UTF-8

RUN apt-get update -y && apt-get install -y --no-install-recommends --fix-missing \
libgl1-mesa-glx \
libjemalloc-dev \
vim

RUN useradd -m -s /bin/bash user && \
mkdir -p /home/user && \
chown -R user /home/user/

USER user

COPY comps /home/user/comps

RUN pip install --no-cache-dir --upgrade pip && \
pip install --no-cache-dir -r /home/user/comps/dataprep/langchain/requirements.txt

ENV PYTHONPATH=$PYTHONPATH:/home/user

WORKDIR /home/user/comps/dataprep/langchain/redis

ENTRYPOINT ["python", "prepare_doc_redis.py"]

Original file line number Diff line number Diff line change
@@ -0,0 +1,32 @@
# Copyright (c) 2024 Intel Corporation
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

version: "3"
services:
redis-vector-db:
image: redis/redis-stack:7.2.0-v9
container_name: redis-vector-db
ports:
- "6379:6379"
- "8001:8001"
dataprep-redis:
image: opea/gen-ai-comps:dataprep-redis-xeon-server
container_name: dataprep-redis-server
ports:
- "8000:8000"
ipc: host
environment:
http_proxy: ${http_proxy}
https_proxy: ${https_proxy}
restart: unless-stopped
Loading

0 comments on commit a7c5ebd

Please sign in to comment.