Merge branch 'arangodb' of https://github.com/arangoml/GenAIComps into arangodb
arangoml · Jan 5, 2025 · 86aca77
2 parents 9a7d810 + 50b2639
commit 86aca77
Showing 18 changed files with 906 additions and 127 deletions.
diff --git a/.github/workflows/docker/compose/retrievers-compose.yaml b/.github/workflows/docker/compose/retrievers-compose.yaml
@@ -47,3 +47,7 @@ services:
     build:
       dockerfile: comps/retrievers/neo4j/llama_index/Dockerfile
     image: ${REGISTRY:-opea}/retriever-neo4j-llamaindex:${TAG:-latest}
+  retriever-arango:
+    build:
+      dockerfile: comps/retrievers/arango/langchain/Dockerfile
+    image: ${REGISTRY:-opea}/retriever-arango:${TAG:-latest}
diff --git a/comps/dataprep/arango/langchain/README.md b/comps/dataprep/arango/langchain/README.md
@@ -1,6 +1,6 @@
 # Dataprep Microservice with ArangoDB
 
-## 🚀Start Microservice with Python
+## 🚀 1. Start Microservice with Python
 
 ### Install Requirements
 
@@ -31,27 +31,27 @@ export ARANGO_DB_NAME=${your_db_name}
 export PYTHONPATH=${path_to_comps}
 ```
 
-### Start Document Preparation Microservice for ArangoDB with Python Script
+See below for additional environment variables that can be set.
 
-Start document preparation microservice for ArangoDB with below command.
+### Start Dataprep Service
 
 ```bash
 python prepare_doc_arango.py
 ```
 
-## 🚀Start Microservice with Docker
+## 🚀 2. Start Microservice with Docker
 
 ### Build Docker Image
 
 ```bash
-cd ../../../../
+cd /your/path/to/GenAIComps
 docker build -t opea/dataprep-arango:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/dataprep/arango/langchain/Dockerfile .
 ```
 
 ### Run Docker with CLI
 
 ```bash
-docker run -d --name="dataprep-arango-server" -p 6007:6007 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy opea/dataprep-arango:latest
+docker run -d --name="dataprep-arango-server" -p 6007:6007 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e ... opea/dataprep-arango:latest
 ```
 
 ### Run Docker with Docker Compose
@@ -61,13 +61,9 @@ cd comps/dataprep/arango/langchain
 docker compose -f docker-compose-dataprep-arango.yaml up -d
 ```
 
-## Invoke Microservice
+## 🚀 3. Consume Dataprep Service
 
-Once document preparation microservice for ArangoDB is started, user can use below command to invoke the microservice to convert the document to embedding and save to the database.
-
-After the service is complete a Graph is created in ArangoDB. The default graph name is `Graph`, you can specify the graph name by `-F "graph_name=${your_graph_name}"` in the curl command.
-
-By default, the microservice will create embeddings for the documents if embedding environment variables are specified. You can specify `-F "create_embeddings=false"` to skip the embedding creation.
+An ArangoDB Graph is created from the documents provided to the microservice. The microservice will extract entities from the documents and create nodes and relationships in the graph based on the entities extracted. The microservice will also create embeddings for the documents if embedding environment variables are specified.
 
 ```bash
 curl -X POST \
@@ -77,7 +73,11 @@ curl -X POST \
     http://localhost:6007/v1/dataprep
 ```
 
-You can specify chunk_size and chunk_size by the following commands.
+You can specify the graph name with `-F "graph_name=${your_graph_name}"` in the curl command.
+
+By default, the microservice will create embeddings for the documents if embedding environment variables are specified. You can specify `-F "create_embeddings=false"` to skip document embedding creation.
+
+You can also specify the `chunk_size` and `chunk_overlap` with the following parameters:
 
 ```bash
 curl -X POST \
@@ -89,11 +89,11 @@ curl -X POST \
     http://localhost:6007/v1/dataprep
 ```
 
-We support table extraction from pdf documents. You can specify process_table and table_strategy by the following commands. "table_strategy" refers to the strategies to understand tables for table retrieval. As the setting progresses from "fast" to "hq" to "llm," the focus shifts towards deeper table understanding at the expense of processing speed. The default strategy is "fast".
-
-Note: If you specify "table_strategy=llm", You should first start TGI Service, please refer to 1.2.1, 1.3.1 in https://github.com/opea-project/GenAIComps/tree/main/comps/llms/README.md, and then `export TGI_LLM_ENDPOINT="http://${your_ip}:8008"`.
+We support table extraction from PDF documents. You can specify `process_table` and `table_strategy` with the following parameters:
+- `table_strategy` refers to the strategies to understand tables for table retrieval. As the setting progresses from `"fast"` to `"hq"` to `"llm"`, the focus shifts towards deeper table understanding at the expense of processing speed. The default strategy is `"fast"`.
+- `process_table` refers to whether to process tables in the document. The default value is `False`.
 
-For ensure the quality and comprehensiveness of the extracted entities, we recommend to use `gpt-4o` as the default model for parsing the document. To enable the openai service, please `export OPENAI_API_KEY=xxxx` before using this services.
+Note: If you specify `"table_strategy=llm"`, you should first start the TGI Service. Please refer to 1.2.1, 1.3.1 in https://github.com/opea-project/GenAIComps/tree/main/comps/llms/README.md, and then `export TGI_LLM_ENDPOINT="http://${your_ip}:8008"`.
 
 ```bash
 curl -X POST \
@@ -107,13 +107,15 @@ curl -X POST \
 
 ---
 
-Additional options that can be specified from the environment variables are as follows (default values are in the config.py file):
+Additional options that can be specified from the environment variables are as follows (default values are also in the `config.py` file):
 
-ArangoDB Configuration:
+ArangoDB Connection configuration
 - `ARANGO_URL`: The URL for the ArangoDB service.
 - `ARANGO_USERNAME`: The username for the ArangoDB service.
 - `ARANGO_PASSWORD`: The password for the ArangoDB service.
 - `ARANGO_DB_NAME`: The name of the database to use for the ArangoDB service.
+
+ArangoDB Graph Insertion configuration
 - `USE_ONE_ENTITY_COLLECTION`: If set to True, the microservice will use a single entity collection for all nodes. If set to False, the microservice will use a separate collection by node type. Defaults to `True`.
 - `INSERT_ASYNC`: If set to True, the microservice will insert the data into ArangoDB asynchronously. Defaults to `False`.
 - `ARANGO_BATCH_SIZE`: The batch size for the microservice to insert the data. Defaults to `500`.
@@ -127,7 +129,7 @@ Text Generation Inference Configuration
 - `TGI_LLM_TIMEOUT`: The timeout for the TGI service. Defaults to `600`.
 
 Text Embeddings Inferencing Configuration
-**Note**: This is optional functionality to generate embeddings for text chunks. 
+**Note**: This is optional functionality to generate embeddings for documents (i.e., text chunks).
 - `TEI_EMBEDDING_ENDPOINT`: The endpoint for the TEI service.
 - `HUGGINGFACEHUB_API_TOKEN`: The API token for the Hugging Face Hub.
 - `TEI_EMBED_MODEL`: The model to use for the TEI service. Defaults to `BAAI/bge-base-en-v1.5`.
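
Review note: the README hunks above describe the optional form fields (`graph_name`, `create_embeddings`, `chunk_size`, `chunk_overlap`, `process_table`, `table_strategy`) only through `curl -F` flags. A minimal Python sketch of the same fields is shown below, assuming a client wants to build them programmatically. The helper name `build_dataprep_form` is hypothetical and not part of this repository; the field names and the `"fast"`/`"hq"`/`"llm"` values are taken from the README text, and the lowercase `"true"`/`"false"` encoding follows the `create_embeddings=false` example.

```python
from typing import Dict, Optional

# Strategy values as listed in the README; "fast" is the documented default.
VALID_TABLE_STRATEGIES = ("fast", "hq", "llm")


def build_dataprep_form(
    graph_name: Optional[str] = None,
    create_embeddings: bool = True,
    chunk_size: Optional[int] = None,
    chunk_overlap: Optional[int] = None,
    process_table: bool = False,
    table_strategy: str = "fast",
) -> Dict[str, str]:
    """Hypothetical helper: return the optional form fields as strings.

    Each key mirrors one `-F "key=value"` flag from the README's curl examples.
    Fields left as None are omitted so the service-side defaults apply.
    """
    if table_strategy not in VALID_TABLE_STRATEGIES:
        raise ValueError(f"table_strategy must be one of {VALID_TABLE_STRATEGIES}")

    form = {
        "create_embeddings": str(create_embeddings).lower(),
        "process_table": str(process_table).lower(),
        "table_strategy": table_strategy,
    }
    if graph_name is not None:
        form["graph_name"] = graph_name
    if chunk_size is not None:
        form["chunk_size"] = str(chunk_size)
    if chunk_overlap is not None:
        form["chunk_overlap"] = str(chunk_overlap)
    return form


if __name__ == "__main__":
    # Skip embedding creation and write into a custom graph.
    print(build_dataprep_form(graph_name="MyGraph", create_embeddings=False))
```

The returned dictionary could be passed as the form data of a POST to `http://localhost:6007/v1/dataprep` alongside the file upload. The upload field itself is not visible in these hunks, so it is left out of the sketch.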

diff --git a/comps/dataprep/arango/langchain/config.py b/comps/dataprep/arango/langchain/config.py
@@ -3,13 +3,13 @@
 
 import os
 
-# ArangoDB configuration
+# ArangoDB Connection configuration
 ARANGO_URL = os.getenv("ARANGO_URL", "http://localhost:8529")
 ARANGO_USERNAME = os.getenv("ARANGO_USERNAME", "root")
 ARANGO_PASSWORD = os.getenv("ARANGO_PASSWORD", "test")
 ARANGO_DB_NAME = os.getenv("ARANGO_DB_NAME", "_system")
 
-# ArangoDB graph configuration
+# ArangoDB Graph Insertion configuration
 USE_ONE_ENTITY_COLLECTION = os.getenv("USE_ONE_ENTITY_COLLECTION", True)
 INSERT_ASYNC = os.getenv("INSERT_ASYNC", False)
 ARANGO_BATCH_SIZE = os.getenv("ARANGO_BATCH_SIZE", 500)
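
Review note: `os.getenv` returns a string whenever the variable is set, so `USE_ONE_ENTITY_COLLECTION=False` in the environment yields the non-empty string `"False"`, which is truthy, and `ARANGO_BATCH_SIZE=500` yields `"500"` rather than an integer. One possible way to handle this is sketched below; `env_bool` and `env_int` are hypothetical helpers, not existing functions in `config.py`, and the set of accepted spellings is an assumption.

```python
import os

# Accepted spellings are an assumption for this sketch.
_TRUE_VALUES = {"1", "true", "yes", "on"}
_FALSE_VALUES = {"0", "false", "no", "off", ""}


def env_bool(name: str, default: bool) -> bool:
    """Read a boolean environment variable, falling back to `default` if unset."""
    raw = os.getenv(name)
    if raw is None:
        return default
    value = raw.strip().lower()
    if value in _TRUE_VALUES:
        return True
    if value in _FALSE_VALUES:
        return False
    raise ValueError(f"{name}={raw!r} is not a recognized boolean value")


def env_int(name: str, default: int) -> int:
    """Read an integer environment variable, falling back to `default` if unset."""
    raw = os.getenv(name)
    return default if raw is None else int(raw)


if __name__ == "__main__":
    os.environ["USE_ONE_ENTITY_COLLECTION"] = "False"
    os.environ["ARANGO_BATCH_SIZE"] = "250"

    # The plain getenv pattern treats the string "False" as truthy.
    print(bool(os.getenv("USE_ONE_ENTITY_COLLECTION", True)))  # True
    print(env_bool("USE_ONE_ENTITY_COLLECTION", True))         # False
    print(env_int("ARANGO_BATCH_SIZE", 500))                   # 250
```

With helpers along these lines, the three graph insertion settings would keep their documented defaults (`True`, `False`, `500`) while honoring values supplied through Docker `-e` flags.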

diff --git a/comps/dataprep/arango/langchain/prepare_doc_arango.py b/comps/dataprep/arango/langchain/prepare_doc_arango.py
@@ -19,6 +19,8 @@
     INSERT_ASYNC,
     NODE_PROPERTIES,
     OPENAI_API_KEY,
+    OPENAI_CHAT_MODEL,
+    OPENAI_CHAT_TEMPERATURE,
     OPENAI_EMBED_DIMENSIONS,
     OPENAI_EMBED_MODEL,
     RELATIONSHIP_PROPERTIES,
@@ -84,110 +86,13 @@
         logger.error(f"Could not set custom Prompt: {e}")
 
 
-def ingest_data_to_arango(doc_path: DocPath, graph_name: str, create_embeddings: bool) -> bool:
+def ingest_data_to_arango(doc_path: DocPath, graph_name: str, generate_chunk_embeddings: bool) -> bool:
     """Ingest document to ArangoDB."""
     path = doc_path.path
+
     if logflag:
         logger.info(f"Parsing document {path}.")
 
-    #############################
-    # Text Generation Inference #
-    #############################
-
-    if OPENAI_API_KEY:
-        if logflag:
-            logger.info("OpenAI API Key is set. Verifying its validity...")
-        openai.api_key = OPENAI_API_KEY
-
-        try:
-            openai.models.list()
-            if logflag:
-                logger.info("OpenAI API Key is valid.")
-            llm = ChatOpenAI(temperature=0, model_name="gpt-4o")
-        except openai.error.AuthenticationError:
-            if logflag:
-                logger.info("OpenAI API Key is invalid.")
-        except Exception as e:
-            if logflag:
-                logger.info(f"An error occurred while verifying the API Key: {e}")
-
-    elif TGI_LLM_ENDPOINT:
-        llm = HuggingFaceEndpoint(
-            endpoint_url=TGI_LLM_ENDPOINT,
-            max_new_tokens=TGI_LLM_MAX_NEW_TOKENS,
-            top_k=TGI_LLM_TOP_K,
-            top_p=TGI_LLM_TOP_P,
-            temperature=TGI_LLM_TEMPERATURE,
-            timeout=TGI_LLM_TIMEOUT,
-        )
-    else:
-        raise ValueError("No text generation inference endpoint is set.")
-
-    try:
-        llm_transformer = LLMGraphTransformer(
-            llm=llm,
-            allowed_nodes=ALLOWED_NODES,
-            allowed_relationships=ALLOWED_RELATIONSHIPS,
-            prompt=PROMPT_TEMPLATE,
-            node_properties=NODE_PROPERTIES if NODE_PROPERTIES else False,
-            relationship_properties=RELATIONSHIP_PROPERTIES if RELATIONSHIP_PROPERTIES else False,
-        )
-    except (TypeError, ValueError) as e:
-        if logflag:
-            logger.warning(f"Advanced LLMGraphTransformer failed: {e}")
-        # Fall back to basic config
-        try:
-            llm_transformer = LLMGraphTransformer(llm=llm)
-        except (TypeError, ValueError) as e:
-            if logflag:
-                logger.error(f"Failed to initialize LLMGraphTransformer: {e}")
-            raise
-
-    ########################################
-    # Text Embeddings Inference (optional) #
-    ########################################
-
-    embeddings = None
-    if create_embeddings:
-        if OPENAI_API_KEY:
-            # Use OpenAI embeddings
-            embeddings = OpenAIEmbeddings(
-                model=OPENAI_EMBED_MODEL,
-                dimensions=OPENAI_EMBED_DIMENSIONS,
-            )
-
-        elif TEI_EMBEDDING_ENDPOINT and HUGGINGFACEHUB_API_TOKEN:
-            # Use TEI endpoint service
-            embeddings = HuggingFaceHubEmbeddings(
-                model=TEI_EMBEDDING_ENDPOINT,
-                huggingfacehub_api_token=HUGGINGFACEHUB_API_TOKEN,
-            )
-        elif TEI_EMBED_MODEL:
-            # Use local embedding model
-            embeddings = HuggingFaceBgeEmbeddings(model_name=TEI_EMBED_MODEL)
-        else:
-            if logflag:
-                logger.warning("No embeddings environment variables are set, cannot generate embeddings.")
-            embeddings = None
-
-    ############
-    # ArangoDB #
-    ############
-
-    client = ArangoClient(hosts=ARANGO_URL)
-    sys_db = client.db(name="_system", username=ARANGO_USERNAME, password=ARANGO_PASSWORD, verify=True)
-
-    if not sys_db.has_database(ARANGO_DB_NAME):
-        sys_db.create_database(ARANGO_DB_NAME)
-
-    db = client.db(name=ARANGO_DB_NAME, username=ARANGO_USERNAME, password=ARANGO_PASSWORD, verify=True)
-
-    graph = ArangoGraph(
-        db=db,
-        include_examples=False,
-        generate_schema_on_init=False,
-    )
-
     ############
     # Chunking #
     ############
@@ -221,14 +126,19 @@ def ingest_data_to_arango(doc_path: DocPath, graph_name: str, create_embeddings:
         table_chunks = get_tables_result(path, doc_path.table_strategy)
         if isinstance(table_chunks, list):
             chunks = chunks + table_chunks
+
     if logflag:
         logger.info(f"Done preprocessing. Created {len(chunks)} chunks of the original file.")
 
     ################################
     # Graph generation & insertion #
     ################################
 
-    generate_chunk_embeddings = embeddings is not None
+    graph = ArangoGraph(
+        db=db,
+        include_examples=False,
+        generate_schema_on_init=False,
+    )
 
     for text in chunks:
         document = Document(page_content=text)
@@ -294,7 +204,7 @@ async def ingest_documents(
                     table_strategy=table_strategy,
                 ),
                 graph_name=graph_name,
-                create_embeddings=create_embeddings,
+                generate_chunk_embeddings=create_embeddings and embeddings is not None,
             )
             uploaded_files.append(save_path)
             if logflag:
@@ -323,7 +233,7 @@ async def ingest_documents(
                         table_strategy=table_strategy,
                     ),
                     graph_name=graph_name,
-                    create_embeddings=create_embeddings,
+                    generate_chunk_embeddings=create_embeddings and embeddings is not None,
                 )
             except json.JSONDecodeError:
                 raise HTTPException(status_code=500, detail="Fail to ingest data into qdrant.")
@@ -340,4 +250,95 @@ async def ingest_documents(
 
 
 if __name__ == "__main__":
+
+    #############################
+    # Text Generation Inference #
+    #############################
+
+    if OPENAI_API_KEY:
+        if logflag:
+            logger.info("OpenAI API Key is set. Verifying its validity...")
+        openai.api_key = OPENAI_API_KEY
+
+        try:
+            openai.models.list()
+            if logflag:
+                logger.info("OpenAI API Key is valid.")
+            llm = ChatOpenAI(temperature=OPENAI_CHAT_TEMPERATURE, model_name=OPENAI_CHAT_MODEL)
+        except openai.error.AuthenticationError:
+            if logflag:
+                logger.info("OpenAI API Key is invalid.")
+        except Exception as e:
+            if logflag:
+                logger.info(f"An error occurred while verifying the API Key: {e}")
+
+    elif TGI_LLM_ENDPOINT:
+        llm = HuggingFaceEndpoint(
+            endpoint_url=TGI_LLM_ENDPOINT,
+            max_new_tokens=TGI_LLM_MAX_NEW_TOKENS,
+            top_k=TGI_LLM_TOP_K,
+            top_p=TGI_LLM_TOP_P,
+            temperature=TGI_LLM_TEMPERATURE,
+            timeout=TGI_LLM_TIMEOUT,
+        )
+    else:
+        raise ValueError("No text generation inference endpoint is set.")
+
+    try:
+        llm_transformer = LLMGraphTransformer(
+            llm=llm,
+            allowed_nodes=ALLOWED_NODES,
+            allowed_relationships=ALLOWED_RELATIONSHIPS,
+            prompt=PROMPT_TEMPLATE,
+            node_properties=NODE_PROPERTIES or False,
+            relationship_properties=RELATIONSHIP_PROPERTIES or False,
+        )
+    except (TypeError, ValueError) as e:
+        if logflag:
+            logger.warning(f"Advanced LLMGraphTransformer failed: {e}")
+        # Fall back to basic config
+        try:
+            llm_transformer = LLMGraphTransformer(llm=llm)
+        except (TypeError, ValueError) as e:
+            if logflag:
+                logger.error(f"Failed to initialize LLMGraphTransformer: {e}")
+            raise
+
+    ########################################
+    # Text Embeddings Inference (optional) #
+    ########################################
+
+    if OPENAI_API_KEY:
+        # Use OpenAI embeddings
+        embeddings = OpenAIEmbeddings(
+            model=OPENAI_EMBED_MODEL,
+            dimensions=OPENAI_EMBED_DIMENSIONS,
+        )
+
+    elif TEI_EMBEDDING_ENDPOINT and HUGGINGFACEHUB_API_TOKEN:
+        # Use TEI endpoint service
+        embeddings = HuggingFaceHubEmbeddings(
+            model=TEI_EMBEDDING_ENDPOINT,
+            huggingfacehub_api_token=HUGGINGFACEHUB_API_TOKEN,
+        )
+    elif TEI_EMBED_MODEL:
+        # Use local embedding model
+        embeddings = HuggingFaceBgeEmbeddings(model_name=TEI_EMBED_MODEL)
+    else:
+        if logflag:
+            logger.warning("No embeddings environment variables are set, cannot generate embeddings.")
+        embeddings = None
+
+    ############
+    # ArangoDB #
+    ############
+
+    client = ArangoClient(hosts=ARANGO_URL)
+    sys_db = client.db(name="_system", username=ARANGO_USERNAME, password=ARANGO_PASSWORD, verify=True)
+
+    if not sys_db.has_database(ARANGO_DB_NAME):
+        sys_db.create_database(ARANGO_DB_NAME)
+
+    db = client.db(name=ARANGO_DB_NAME, username=ARANGO_USERNAME, password=ARANGO_PASSWORD, verify=True)
+
     opea_microservices["opea_service@prepare_doc_arango"].start()
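
Review note: in the startup block above, a failed OpenAI key check is only logged, so `llm` is never assigned on that path and the following `LLMGraphTransformer(llm=llm)` call would raise a `NameError`. Because the setup now lives under `if __name__ == "__main__":`, the names `llm_transformer`, `embeddings`, and `db` also exist only when the file is run as a script. The sketch below shows one way the selection logic could be made explicit. It is an illustration under stated assumptions, not the author's implementation: `select_llm_backend`, `should_embed_chunks`, and the `verify_openai_key` callback are hypothetical names, and falling back to TGI when the key check fails is a design choice of this sketch.

```python
from typing import Callable, Optional


def select_llm_backend(
    openai_api_key: Optional[str],
    tgi_endpoint: Optional[str],
    verify_openai_key: Callable[[str], bool],
) -> str:
    """Return "openai" or "tgi", or raise if no usable backend is configured.

    `verify_openai_key` stands in for the `openai.models.list()` check and
    should return False (not raise) when the key is rejected.
    """
    if openai_api_key and verify_openai_key(openai_api_key):
        return "openai"
    if tgi_endpoint:
        # Reached when no key is set, or when the key failed verification.
        return "tgi"
    raise ValueError("No text generation inference endpoint is set.")


def should_embed_chunks(create_embeddings: bool, embeddings: Optional[object]) -> bool:
    """Mirror the diff's `create_embeddings and embeddings is not None` condition."""
    return create_embeddings and embeddings is not None


if __name__ == "__main__":
    # A rejected key falls back to the TGI endpoint instead of leaving `llm` unset.
    print(select_llm_backend("sk-invalid", "http://localhost:8008", lambda key: False))
    # The request asked for embeddings, but no embedding backend was configured.
    print(should_embed_chunks(True, None))
```

Returning a backend name keeps the decision separate from client construction, so the `ChatOpenAI` or `HuggingFaceEndpoint` object can be built in one place once the choice is known.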