Merge branch 'arangodb' of https://github.com/arangoml/GenAIComps into arangodb
aMahanna committed Jan 5, 2025
2 parents 9a7d810 + 50b2639 commit 86aca77
Showing 18 changed files with 906 additions and 127 deletions.
4 changes: 4 additions & 0 deletions .github/workflows/docker/compose/retrievers-compose.yaml
@@ -47,3 +47,7 @@ services:
build:
dockerfile: comps/retrievers/neo4j/llama_index/Dockerfile
image: ${REGISTRY:-opea}/retriever-neo4j-llamaindex:${TAG:-latest}
retriever-arango:
build:
dockerfile: comps/retrievers/arango/langchain/Dockerfile
image: ${REGISTRY:-opea}/retriever-arango:${TAG:-latest}
42 changes: 22 additions & 20 deletions comps/dataprep/arango/langchain/README.md
@@ -1,6 +1,6 @@
# Dataprep Microservice with ArangoDB

## 🚀Start Microservice with Python
## 🚀 1. Start Microservice with Python

### Install Requirements

@@ -31,27 +31,27 @@ export ARANGO_DB_NAME=${your_db_name}
export PYTHONPATH=${path_to_comps}
```

### Start Document Preparation Microservice for ArangoDB with Python Script
See below for additional environment variables that can be set.

Start document preparation microservice for ArangoDB with below command.
### Start Dataprep Service

```bash
python prepare_doc_arango.py
```

## 🚀Start Microservice with Docker
## 🚀 2. Start Microservice with Docker

### Build Docker Image

```bash
cd ../../../../
cd /your/path/to/GenAIComps
docker build -t opea/dataprep-arango:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/dataprep/arango/langchain/Dockerfile .
```

### Run Docker with CLI

```bash
docker run -d --name="dataprep-arango-server" -p 6007:6007 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy opea/dataprep-arango:latest
docker run -d --name="dataprep-arango-server" -p 6007:6007 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e ... opea/dataprep-arango:latest
```

### Run Docker with Docker Compose
@@ -61,13 +61,9 @@ cd comps/dataprep/arango/langchain
docker compose -f docker-compose-dataprep-arango.yaml up -d
```

## Invoke Microservice
## 🚀 3. Consume Retriever Service

Once document preparation microservice for ArangoDB is started, user can use below command to invoke the microservice to convert the document to embedding and save to the database.

After the service is complete a Graph is created in ArangoDB. The default graph name is `Graph`, you can specify the graph name by `-F "graph_name=${your_graph_name}"` in the curl command.

By default, the microservice will create embeddings for the documents if embedding environment variables are specified. You can specify `-F "create_embeddings=false"` to skip the embedding creation.
An ArangoDB Graph is created from the documents provided to the microservice. The microservice will extract entities from the documents and create nodes and relationships in the graph based on the entities extracted. The microservice will also create embeddings for the documents if embedding environment variables are specified.

```bash
curl -X POST \
@@ -77,7 +73,11 @@ curl -X POST \
http://localhost:6007/v1/dataprep
```

You can specify chunk_size and chunk_size by the following commands.
You can specify the graph name with `-F "graph_name=${your_graph_name}"` in the curl command.

By default, the microservice will create embeddings for the documents if embedding environment variables are specified. You can specify `-F "create_embeddings=false"` to skip document embedding creation.

You can also specify the `chunk_size` and `chunk_overlap` with the following parameters:

```bash
curl -X POST \
@@ -89,11 +89,11 @@ curl -X POST \
http://localhost:6007/v1/dataprep
```
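
The optional form fields named above (`graph_name`, `create_embeddings`, `chunk_size`, `chunk_overlap`) can also be assembled programmatically. The following is a minimal Python sketch, assuming a hypothetical helper `build_form_args` that is not part of the microservice. It only produces the `-F` arguments for those options and leaves the file upload part of the command as shown in the README.

```python
from typing import List, Optional


def build_form_args(
    graph_name: Optional[str] = None,
    create_embeddings: Optional[bool] = None,
    chunk_size: Optional[int] = None,
    chunk_overlap: Optional[int] = None,
) -> List[str]:
    """Return curl `-F key=value` arguments for the options that were set.

    Hypothetical helper for illustration; the field names come from the README.
    """
    options = {
        "graph_name": graph_name,
        "create_embeddings": create_embeddings,
        "chunk_size": chunk_size,
        "chunk_overlap": chunk_overlap,
    }
    args: List[str] = []
    for key, value in options.items():
        if value is None:
            continue  # unset options are left to the service defaults
        if isinstance(value, bool):
            value = str(value).lower()  # the README writes booleans as "false"
        args += ["-F", f"{key}={value}"]
    return args
```

Options left as `None` are omitted, so the service defaults apply to them.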

We support table extraction from pdf documents. You can specify process_table and table_strategy by the following commands. "table_strategy" refers to the strategies to understand tables for table retrieval. As the setting progresses from "fast" to "hq" to "llm," the focus shifts towards deeper table understanding at the expense of processing speed. The default strategy is "fast".

Note: If you specify "table_strategy=llm", You should first start TGI Service, please refer to 1.2.1, 1.3.1 in https://github.com/opea-project/GenAIComps/tree/main/comps/llms/README.md, and then `export TGI_LLM_ENDPOINT="http://${your_ip}:8008"`.
We support table extraction from pdf documents. You can specify `process_table` and `table_strategy` with the following parameters:
- `table_strategy` refers to the strategies to understand tables for table retrieval. As the setting progresses from `"fast"` to `"hq"` to `"llm"`, the focus shifts towards deeper table understanding at the expense of processing speed. The default strategy is `"fast"`.
- `process_table` refers to whether to process tables in the document. The default value is `False`.

For ensure the quality and comprehensiveness of the extracted entities, we recommend to use `gpt-4o` as the default model for parsing the document. To enable the openai service, please `export OPENAI_API_KEY=xxxx` before using this services.
Note: If you specify `"table_strategy=llm"`, you should first start the TGI Service. Please refer to 1.2.1, 1.3.1 in https://github.com/opea-project/GenAIComps/tree/main/comps/llms/README.md, and then `export TGI_LLM_ENDPOINT="http://${your_ip}:8008"`.

```bash
curl -X POST \
@@ -107,13 +107,15 @@ curl -X POST \

---

Additional options that can be specified from the environment variables are as follows (default values are in the config.py file):
Additional options that can be specified from the environment variables are as follows (default values are also in the `config.py` file):

ArangoDB Configuration:
ArangoDB Connection configuration
- `ARANGO_URL`: The URL for the ArangoDB service.
- `ARANGO_USERNAME`: The username for the ArangoDB service.
- `ARANGO_PASSWORD`: The password for the ArangoDB service.
- `ARANGO_DB_NAME`: The name of the database to use for the ArangoDB service.

ArangoDB Graph Insertion configuration
- `USE_ONE_ENTITY_COLLECTION`: If set to True, the microservice will use a single entity collection for all nodes. If set to False, the microservice will use a separate collection by node type. Defaults to `True`.
- `INSERT_ASYNC`: If set to True, the microservice will insert the data into ArangoDB asynchronously. Defaults to `False`.
- `ARANGO_BATCH_SIZE`: The batch size for the microservice to insert the data. Defaults to `500`.
@@ -127,7 +129,7 @@ Text Generation Inference Configuration
- `TGI_LLM_TIMEOUT`: The timeout for the TGI service. Defaults to `600`.

Text Embeddings Inferencing Configuration
**Note**: This is optional functionality to generate embeddings for text chunks.
**Note**: This is optional functionality to generate embeddings for documents (i.e text chunks).
- `TEI_EMBEDDING_ENDPOINT`: The endpoint for the TEI service.
- `HUGGINGFACEHUB_API_TOKEN`: The API token for the Hugging Face Hub.
- `TEI_EMBED_MODEL`: The model to use for the TEI service. Defaults to `BAAI/bge-base-en-v1.5`.
4 changes: 2 additions & 2 deletions comps/dataprep/arango/langchain/config.py
@@ -3,13 +3,13 @@

import os

# ArangoDB configuration
# ArangoDB Connection configuration
ARANGO_URL = os.getenv("ARANGO_URL", "http://localhost:8529")
ARANGO_USERNAME = os.getenv("ARANGO_USERNAME", "root")
ARANGO_PASSWORD = os.getenv("ARANGO_PASSWORD", "test")
ARANGO_DB_NAME = os.getenv("ARANGO_DB_NAME", "_system")

# ArangoDB graph configuration
# ArangoDB Graph Insertion configuration
USE_ONE_ENTITY_COLLECTION = os.getenv("USE_ONE_ENTITY_COLLECTION", True)
INSERT_ASYNC = os.getenv("INSERT_ASYNC", False)
ARANGO_BATCH_SIZE = os.getenv("ARANGO_BATCH_SIZE", 500)
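
`os.getenv` returns a string whenever the variable is set. With the defaults above, a setting such as `INSERT_ASYNC=False` therefore arrives as the non-empty string `"False"`, which is truthy, and `ARANGO_BATCH_SIZE=1000` arrives as `"1000"` rather than an integer. Whether other code converts these values is not visible in this diff. The following is a minimal sketch of explicit parsing, assuming hypothetical helpers `env_bool` and `env_int` that are not part of this commit:

```python
import os


def env_bool(name: str, default: bool) -> bool:
    """Read a boolean environment variable (hypothetical helper).

    Accepts common spellings of "true"; anything else counts as False.
    """
    raw = os.getenv(name)
    if raw is None:
        return default  # variable not set: keep the configured default
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def env_int(name: str, default: int) -> int:
    """Read an integer environment variable (hypothetical helper)."""
    raw = os.getenv(name)
    return default if raw is None else int(raw)


USE_ONE_ENTITY_COLLECTION = env_bool("USE_ONE_ENTITY_COLLECTION", True)
INSERT_ASYNC = env_bool("INSERT_ASYNC", False)
ARANGO_BATCH_SIZE = env_int("ARANGO_BATCH_SIZE", 500)
```

With this approach the three settings always have the types that their documented defaults (`True`, `False`, `500`) suggest.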
205 changes: 103 additions & 102 deletions comps/dataprep/arango/langchain/prepare_doc_arango.py
@@ -19,6 +19,8 @@
INSERT_ASYNC,
NODE_PROPERTIES,
OPENAI_API_KEY,
OPENAI_CHAT_MODEL,
OPENAI_CHAT_TEMPERATURE,
OPENAI_EMBED_DIMENSIONS,
OPENAI_EMBED_MODEL,
RELATIONSHIP_PROPERTIES,
@@ -84,110 +86,13 @@
logger.error(f"Could not set custom Prompt: {e}")


def ingest_data_to_arango(doc_path: DocPath, graph_name: str, create_embeddings: bool) -> bool:
def ingest_data_to_arango(doc_path: DocPath, graph_name: str, generate_chunk_embeddings: bool) -> bool:
"""Ingest document to ArangoDB."""
path = doc_path.path

if logflag:
logger.info(f"Parsing document {path}.")

#############################
# Text Generation Inference #
#############################

if OPENAI_API_KEY:
if logflag:
logger.info("OpenAI API Key is set. Verifying its validity...")
openai.api_key = OPENAI_API_KEY

try:
openai.models.list()
if logflag:
logger.info("OpenAI API Key is valid.")
llm = ChatOpenAI(temperature=0, model_name="gpt-4o")
except openai.error.AuthenticationError:
if logflag:
logger.info("OpenAI API Key is invalid.")
except Exception as e:
if logflag:
logger.info(f"An error occurred while verifying the API Key: {e}")

elif TGI_LLM_ENDPOINT:
llm = HuggingFaceEndpoint(
endpoint_url=TGI_LLM_ENDPOINT,
max_new_tokens=TGI_LLM_MAX_NEW_TOKENS,
top_k=TGI_LLM_TOP_K,
top_p=TGI_LLM_TOP_P,
temperature=TGI_LLM_TEMPERATURE,
timeout=TGI_LLM_TIMEOUT,
)
else:
raise ValueError("No text generation inference endpoint is set.")

try:
llm_transformer = LLMGraphTransformer(
llm=llm,
allowed_nodes=ALLOWED_NODES,
allowed_relationships=ALLOWED_RELATIONSHIPS,
prompt=PROMPT_TEMPLATE,
node_properties=NODE_PROPERTIES if NODE_PROPERTIES else False,
relationship_properties=RELATIONSHIP_PROPERTIES if RELATIONSHIP_PROPERTIES else False,
)
except (TypeError, ValueError) as e:
if logflag:
logger.warning(f"Advanced LLMGraphTransformer failed: {e}")
# Fall back to basic config
try:
llm_transformer = LLMGraphTransformer(llm=llm)
except (TypeError, ValueError) as e:
if logflag:
logger.error(f"Failed to initialize LLMGraphTransformer: {e}")
raise

########################################
# Text Embeddings Inference (optional) #
########################################

embeddings = None
if create_embeddings:
if OPENAI_API_KEY:
# Use OpenAI embeddings
embeddings = OpenAIEmbeddings(
model=OPENAI_EMBED_MODEL,
dimensions=OPENAI_EMBED_DIMENSIONS,
)

elif TEI_EMBEDDING_ENDPOINT and HUGGINGFACEHUB_API_TOKEN:
# Use TEI endpoint service
embeddings = HuggingFaceHubEmbeddings(
model=TEI_EMBEDDING_ENDPOINT,
huggingfacehub_api_token=HUGGINGFACEHUB_API_TOKEN,
)
elif TEI_EMBED_MODEL:
# Use local embedding model
embeddings = HuggingFaceBgeEmbeddings(model_name=TEI_EMBED_MODEL)
else:
if logflag:
logger.warning("No embeddings environment variables are set, cannot generate embeddings.")
embeddings = None

############
# ArangoDB #
############

client = ArangoClient(hosts=ARANGO_URL)
sys_db = client.db(name="_system", username=ARANGO_USERNAME, password=ARANGO_PASSWORD, verify=True)

if not sys_db.has_database(ARANGO_DB_NAME):
sys_db.create_database(ARANGO_DB_NAME)

db = client.db(name=ARANGO_DB_NAME, username=ARANGO_USERNAME, password=ARANGO_PASSWORD, verify=True)

graph = ArangoGraph(
db=db,
include_examples=False,
generate_schema_on_init=False,
)

############
# Chunking #
############
@@ -221,14 +126,19 @@ def ingest_data_to_arango(doc_path: DocPath, graph_name: str, create_embeddings:
table_chunks = get_tables_result(path, doc_path.table_strategy)
if isinstance(table_chunks, list):
chunks = chunks + table_chunks

if logflag:
logger.info(f"Done preprocessing. Created {len(chunks)} chunks of the original file.")

################################
# Graph generation & insertion #
################################

generate_chunk_embeddings = embeddings is not None
graph = ArangoGraph(
db=db,
include_examples=False,
generate_schema_on_init=False,
)

for text in chunks:
document = Document(page_content=text)
@@ -294,7 +204,7 @@ async def ingest_documents(
table_strategy=table_strategy,
),
graph_name=graph_name,
create_embeddings=create_embeddings,
generate_chunk_embeddings=create_embeddings and embeddings is not None,
)
uploaded_files.append(save_path)
if logflag:
@@ -323,7 +233,7 @@ async def ingest_documents(
table_strategy=table_strategy,
),
graph_name=graph_name,
create_embeddings=create_embeddings,
generate_chunk_embeddings=create_embeddings and embeddings is not None,
)
except json.JSONDecodeError:
raise HTTPException(status_code=500, detail="Fail to ingest data into qdrant.")
@@ -340,4 +250,95 @@ async def ingest_documents(


if __name__ == "__main__":

#############################
# Text Generation Inference #
#############################

if OPENAI_API_KEY:
if logflag:
logger.info("OpenAI API Key is set. Verifying its validity...")
openai.api_key = OPENAI_API_KEY

try:
openai.models.list()
if logflag:
logger.info("OpenAI API Key is valid.")
llm = ChatOpenAI(temperature=OPENAI_CHAT_TEMPERATURE, model_name=OPENAI_CHAT_MODEL)
except openai.error.AuthenticationError:
if logflag:
logger.info("OpenAI API Key is invalid.")
except Exception as e:
if logflag:
logger.info(f"An error occurred while verifying the API Key: {e}")

elif TGI_LLM_ENDPOINT:
llm = HuggingFaceEndpoint(
endpoint_url=TGI_LLM_ENDPOINT,
max_new_tokens=TGI_LLM_MAX_NEW_TOKENS,
top_k=TGI_LLM_TOP_K,
top_p=TGI_LLM_TOP_P,
temperature=TGI_LLM_TEMPERATURE,
timeout=TGI_LLM_TIMEOUT,
)
else:
raise ValueError("No text generation inference endpoint is set.")

try:
llm_transformer = LLMGraphTransformer(
llm=llm,
allowed_nodes=ALLOWED_NODES,
allowed_relationships=ALLOWED_RELATIONSHIPS,
prompt=PROMPT_TEMPLATE,
node_properties=NODE_PROPERTIES or False,
relationship_properties=RELATIONSHIP_PROPERTIES or False,
)
except (TypeError, ValueError) as e:
if logflag:
logger.warning(f"Advanced LLMGraphTransformer failed: {e}")
# Fall back to basic config
try:
llm_transformer = LLMGraphTransformer(llm=llm)
except (TypeError, ValueError) as e:
if logflag:
logger.error(f"Failed to initialize LLMGraphTransformer: {e}")
raise

########################################
# Text Embeddings Inference (optional) #
########################################

if OPENAI_API_KEY:
# Use OpenAI embeddings
embeddings = OpenAIEmbeddings(
model=OPENAI_EMBED_MODEL,
dimensions=OPENAI_EMBED_DIMENSIONS,
)

elif TEI_EMBEDDING_ENDPOINT and HUGGINGFACEHUB_API_TOKEN:
# Use TEI endpoint service
embeddings = HuggingFaceHubEmbeddings(
model=TEI_EMBEDDING_ENDPOINT,
huggingfacehub_api_token=HUGGINGFACEHUB_API_TOKEN,
)
elif TEI_EMBED_MODEL:
# Use local embedding model
embeddings = HuggingFaceBgeEmbeddings(model_name=TEI_EMBED_MODEL)
else:
if logflag:
logger.warning("No embeddings environment variables are set, cannot generate embeddings.")
embeddings = None

############
# ArangoDB #
############

client = ArangoClient(hosts=ARANGO_URL)
sys_db = client.db(name="_system", username=ARANGO_USERNAME, password=ARANGO_PASSWORD, verify=True)

if not sys_db.has_database(ARANGO_DB_NAME):
sys_db.create_database(ARANGO_DB_NAME)

db = client.db(name=ARANGO_DB_NAME, username=ARANGO_USERNAME, password=ARANGO_PASSWORD, verify=True)

opea_microservices["opea_service@prepare_doc_arango"].start()
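
The embedding setup in the `__main__` block above picks a backend in a fixed order: OpenAI when `OPENAI_API_KEY` is set, then the TEI endpoint when both `TEI_EMBEDDING_ENDPOINT` and `HUGGINGFACEHUB_API_TOKEN` are set, then a local model named by `TEI_EMBED_MODEL`, and otherwise none. That priority order, together with the `create_embeddings and embeddings is not None` check passed to `ingest_data_to_arango`, can be sketched without the LangChain classes. The names `select_embedding_backend` and `should_embed` are hypothetical and used for illustration only:

```python
from typing import Optional


def select_embedding_backend(
    openai_api_key: Optional[str],
    tei_endpoint: Optional[str],
    hf_token: Optional[str],
    tei_model: Optional[str],
) -> Optional[str]:
    """Return a label for the backend the priority order would choose."""
    if openai_api_key:
        return "openai"  # OpenAI embeddings take precedence
    if tei_endpoint and hf_token:
        return "tei"  # remote TEI endpoint needs both endpoint and token
    if tei_model:
        return "local"  # fall back to a locally loaded model
    return None  # nothing configured: embeddings cannot be generated


def should_embed(create_embeddings: bool, backend: Optional[str]) -> bool:
    """Mirror of `create_embeddings and embeddings is not None`."""
    return create_embeddings and backend is not None
```

Under this sketch, a request with `create_embeddings=true` still skips embedding generation when no backend is configured, which matches the flag computed in `ingest_documents`.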