MultimodalQnA image query, pdf, dynamic ports, and UI updates #1381

Open

wants to merge 39 commits into base: main

Commits
0c80db1
MultimodalQnA updates to add support for image queries (#23)
dmsuehir Dec 16, 2024
6ccced2
Merge branch 'main' of github.com:mhbuehler/GenAIExamples into mmqna-…
dmsuehir Dec 16, 2024
7e05f57
Added logic for image_query
HarshaRamayanam Dec 17, 2024
e25ada0
Merge branch 'hramayan/img_query' into mmqna-image-query
HarshaRamayanam Dec 17, 2024
e903d33
Dynamic wait for lvm service in MultimodalQnA tests (#33)
mhbuehler Dec 18, 2024
480bb4a
Update branch used for testing (#35)
dmsuehir Dec 18, 2024
a0487bb
Merge branch 'main' into mmqna-image-query
mhbuehler Dec 19, 2024
d2f8f64
Fix comps clone so it points to our branch (#36)
mhbuehler Dec 19, 2024
e47e9c1
Made all accessible GenAIExamples Ports dynamic (#34)
okhleif-IL Dec 30, 2024
c4d569c
Added fix for video upload format
HarshaRamayanam Jan 7, 2025
8360618
Merge pull request #38 from mhbuehler/hramayan/vid_upload_bug_fix
HarshaRamayanam Jan 8, 2025
8b54159
Adds transcript to MMQnA conversation (#37)
mhbuehler Jan 8, 2025
5378cb4
MultimodalQnA PDF Upload & Display (#32)
mhbuehler Jan 8, 2025
72c0709
merging main
okhleif-IL Jan 8, 2025
d14ab65
Merge branch 'mmqna-image-query' of https://github.com/mhbuehler/GenA…
okhleif-IL Jan 8, 2025
b67d3ce
Merge branch 'main' of github.com:mhbuehler/GenAIExamples into mmqna-…
dmsuehir Jan 9, 2025
05812b5
removed merge conflicts
okhleif-IL Jan 9, 2025
fda37f4
Fixed Whisper Service Port merge (#39)
okhleif-IL Jan 10, 2025
675735e
Fixes bug with image query (#40)
HarshaRamayanam Jan 10, 2025
21c3f88
Merge branch 'main' into mmqna-image-query
mhbuehler Jan 10, 2025
da1e6d2
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 10, 2025
5961795
Add asr to list of images to build in tests (#41)
dmsuehir Jan 10, 2025
17a25cc
Merge branch 'main' into mmqna-image-query
mhbuehler Jan 13, 2025
152a6b6
Merge branch 'main' into mmqna-image-query
mhbuehler Jan 14, 2025
53cc8aa
Merge branch 'main' into mmqna-image-query
mhbuehler Jan 15, 2025
100e7c7
Revert clones to OPEA main branch
mhbuehler Jan 15, 2025
48d06fe
Rollback accidental commit
mhbuehler Jan 15, 2025
b426966
Fix env vars in the MMQnA test environment and align_inputs error (#45)
dmsuehir Jan 15, 2025
343403f
Fix for omitted transcripts for pdfs (#44)
mhbuehler Jan 15, 2025
b1dd5d9
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 15, 2025
2f27302
Increase timeout waiting the LVM to download in Gaudi MMQnA test (#47)
dmsuehir Jan 15, 2025
7e3eca2
Fixes some issues for image queries (#48)
mhbuehler Jan 16, 2025
888cb71
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 16, 2025
9402fcd
Merge branch 'main' of github.com:mhbuehler/GenAIExamples into mmqna-…
dmsuehir Jan 16, 2025
0f0bcb5
MMQnA Doc Updates for Latest Release (#43)
okhleif-IL Jan 16, 2025
a7cc4bf
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 16, 2025
512a3f4
Update compose.yaml to use fixed internal port for the whisper server…
dmsuehir Jan 16, 2025
7fb22f3
Change git clone in rocm test script to dev branch (#49)
mhbuehler Jan 16, 2025
428c0ea
Merge branch 'main' into mmqna-image-query
ashahba Jan 17, 2025
5 changes: 2 additions & 3 deletions MultimodalQnA/Dockerfile
@@ -16,13 +16,12 @@ RUN useradd -m -s /bin/bash user && \

WORKDIR $HOME


# Stage 2: latest GenAIComps sources
FROM base AS git

RUN apt-get update && apt-get install -y --no-install-recommends git
RUN git clone --depth 1 https://github.com/opea-project/GenAIComps.git

#RUN git clone --depth 1 https://github.com/opea-project/GenAIComps.git
RUN git clone --depth 1 https://github.com/mhbuehler/GenAIComps.git --single-branch --branch mmqna-image-query

# Stage 3: common layer shared by services using GenAIComps
FROM base AS comps-base
52 changes: 41 additions & 11 deletions MultimodalQnA/README.md
@@ -1,8 +1,8 @@
# MultimodalQnA Application

Suppose you possess a set of videos and wish to perform question-answering to extract insights from these videos. To respond to your questions, it typically necessitates comprehension of visual cues within the videos, knowledge derived from the audio content, or often a mix of both these visual elements and auditory facts. The MultimodalQnA framework offers an optimal solution for this purpose.
Suppose you have a collection of videos, images, audio files, and PDFs (or any combination of them) and wish to perform question-answering to extract insights from these documents. To respond to your questions, the system needs to comprehend a mix of textual, visual, and audio facts drawn from the document contents. The MultimodalQnA framework offers an optimal solution for this purpose.

`MultimodalQnA` addresses your questions by dynamically fetching the most pertinent multimodal information (frames, transcripts, and/or captions) from your collection of videos, images, and audio files. For this purpose, MultimodalQnA utilizes [BridgeTower model](https://huggingface.co/BridgeTower/bridgetower-large-itm-mlm-gaudi), a multimodal encoding transformer model which merges visual and textual data into a unified semantic space. During the ingestion phase, the BridgeTower model embeds both visual cues and auditory facts as texts, and those embeddings are then stored in a vector database. When it comes to answering a question, the MultimodalQnA will fetch its most relevant multimodal content from the vector store and feed it into a downstream Large Vision-Language Model (LVM) as input context to generate a response for the user.
`MultimodalQnA` addresses your questions by dynamically fetching the most pertinent multimodal information (e.g., images, transcripts, and captions) from your collection of video, image, audio, and PDF files. For this purpose, MultimodalQnA utilizes the [BridgeTower model](https://huggingface.co/BridgeTower/bridgetower-large-itm-mlm-gaudi), a multimodal encoding transformer model that merges visual and textual data into a unified semantic space. During the ingestion phase, the BridgeTower model embeds both visual cues and auditory facts as texts, and those embeddings are then stored in a vector database. When answering a question, MultimodalQnA fetches the most relevant multimodal content from the vector store and feeds it into a downstream Large Vision-Language Model (LVM) as input context to generate a response for the user.
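
For example, once the services are deployed, a question can be posed to the MegaService gateway with a single request (a minimal sketch; the gateway port `8888`, the `${host_ip}` variable, and the endpoint path follow the defaults used in the deployment guides, and the question text is only an illustration):

```bash
# Ask a question against the previously ingested multimodal content
# (the MegaService gateway listens on port 8888 by default)
curl http://${host_ip}:8888/v1/multimodalqna \
  -H "Content-Type: application/json" \
  -X POST \
  -d '{"messages": "What happens in this video?"}'
```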

The MultimodalQnA architecture is shown below:

@@ -87,12 +87,12 @@ In the below, we provide a table that describes for each microservice component
<details>
<summary><b>Gaudi default compose.yaml</b></summary>

| MicroService | Open Source Project | HW | Port | Endpoint |
| ------------ | --------------------- | ----- | ---- | ----------------------------------------------- |
| Embedding | Langchain | Xeon | 6000 | /v1/embeddings |
| Retriever | Langchain, Redis | Xeon | 7000 | /v1/multimodal_retrieval |
| LVM | Langchain, TGI | Gaudi | 9399 | /v1/lvm |
| Dataprep | Redis, Langchain, TGI | Gaudi | 6007 | /v1/generate_transcripts, /v1/generate_captions |
| MicroService | Open Source Project | HW | Port | Endpoint |
| ------------ | --------------------- | ----- | ---- | --------------------------------------------------------------------- |
| Embedding | Langchain | Xeon | 6000 | /v1/embeddings |
| Retriever | Langchain, Redis | Xeon | 7000 | /v1/multimodal_retrieval |
| LVM | Langchain, TGI | Gaudi | 9399 | /v1/lvm |
| Dataprep | Redis, Langchain, TGI | Gaudi | 6007 | /v1/generate_transcripts, /v1/generate_captions, /v1/ingest_with_text |

</details>

@@ -172,8 +172,38 @@ docker compose -f compose.yaml up -d

## MultimodalQnA Demo on Gaudi2

![MultimodalQnA-upload-waiting-screenshot](./assets/img/upload-gen-trans.png)
### Multimodal QnA UI

![MultimodalQnA-upload-done-screenshot](./assets/img/upload-gen-captions.png)
![MultimodalQnA-ui-screenshot](./assets/img/mmqna-ui.png)

![MultimodalQnA-query-example-screenshot](./assets/img/example_query.png)
### Video Ingestion

![MultimodalQnA-ingest-video-screenshot](./assets/img/video-ingestion.png)

### Text Query following the ingestion of a Video

![MultimodalQnA-video-query-screenshot](./assets/img/video-query.png)

### Image Ingestion

![MultimodalQnA-ingest-image-screenshot](./assets/img/image-ingestion.png)

### Text Query following the ingestion of an image

![MultimodalQnA-video-query-screenshot](./assets/img/image-query.png)

### Audio Ingestion

![MultimodalQnA-audio-ingestion-screenshot](./assets/img/audio-ingestion.png)

### Text Query following the ingestion of an Audio Podcast

![MultimodalQnA-audio-query-screenshot](./assets/img/audio-query.png)

### PDF Ingestion

![MultimodalQnA-upload-pdf-screenshot](./assets/img/ingest_pdf.png)

### Text query following the ingestion of a PDF

![MultimodalQnA-pdf-query-example-screenshot](./assets/img/pdf-query.png)
Binary file added MultimodalQnA/assets/img/audio-ingestion.png
Binary file added MultimodalQnA/assets/img/audio-query.png
Binary file added MultimodalQnA/assets/img/image-ingestion.png
Binary file added MultimodalQnA/assets/img/image-query.png
Binary file added MultimodalQnA/assets/img/ingest_pdf.png
Binary file added MultimodalQnA/assets/img/mmqna-ui.png
Binary file added MultimodalQnA/assets/img/pdf-query.png
Binary file added MultimodalQnA/assets/img/video-ingestion.png
Binary file added MultimodalQnA/assets/img/video-query.png
114 changes: 78 additions & 36 deletions MultimodalQnA/docker_compose/intel/cpu/xeon/README.md
@@ -40,6 +40,10 @@ lvm
===
Port 9399 - Open to 0.0.0.0/0

whisper
===
Port 7066 - Open to 0.0.0.0/0

dataprep-multimodal-redis
===
Port 6007 - Open to 0.0.0.0/0
@@ -75,34 +79,47 @@ export your_no_proxy=${your_no_proxy},"External_Public_IP"
export no_proxy=${your_no_proxy}
export http_proxy=${your_http_proxy}
export https_proxy=${your_http_proxy}
export EMBEDDER_PORT=6006
export MMEI_EMBEDDING_ENDPOINT="http://${host_ip}:$EMBEDDER_PORT"
export MM_EMBEDDING_PORT_MICROSERVICE=6000
export WHISPER_SERVER_PORT=7066
export WHISPER_SERVER_ENDPOINT="http://${host_ip}:${WHISPER_SERVER_PORT}/v1/asr"
export REDIS_URL="redis://${host_ip}:6379"
export MM_EMBEDDING_SERVICE_HOST_IP=${host_ip}
export MM_RETRIEVER_SERVICE_HOST_IP=${host_ip}
export LVM_SERVICE_HOST_IP=${host_ip}
export MEGA_SERVICE_HOST_IP=${host_ip}
export WHISPER_PORT=7066
export WHISPER_SERVER_ENDPOINT="http://${host_ip}:${WHISPER_PORT}/v1/asr"
export WHISPER_MODEL="base"
export MAX_IMAGES=1
export REDIS_DB_PORT=6379
export REDIS_INSIGHTS_PORT=8001
export REDIS_URL="redis://${host_ip}:${REDIS_DB_PORT}"
export REDIS_HOST=${host_ip}
export INDEX_NAME="mm-rag-redis"
export DATAPREP_MMR_PORT=6007
export DATAPREP_INGEST_SERVICE_ENDPOINT="http://${host_ip}:${DATAPREP_MMR_PORT}/v1/ingest_with_text"
export DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT="http://${host_ip}:${DATAPREP_MMR_PORT}/v1/generate_transcripts"
export DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT="http://${host_ip}:${DATAPREP_MMR_PORT}/v1/generate_captions"
export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:${DATAPREP_MMR_PORT}/v1/dataprep/get_files"
export DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:${DATAPREP_MMR_PORT}/v1/dataprep/delete_files"
export EMM_BRIDGETOWER_PORT=6006
export EMBEDDING_MODEL_ID="BridgeTower/bridgetower-large-itm-mlm-itc"
export BRIDGE_TOWER_EMBEDDING=true
export MMEI_EMBEDDING_ENDPOINT="http://${host_ip}:$EMM_BRIDGETOWER_PORT"
export MM_EMBEDDING_PORT_MICROSERVICE=6000
export REDIS_RETRIEVER_PORT=7000
export LVM_PORT=9399
export LLAVA_SERVER_PORT=8399
export LVM_ENDPOINT="http://${host_ip}:8399"
export EMBEDDING_MODEL_ID="BridgeTower/bridgetower-large-itm-mlm-itc"
export LVM_MODEL_ID="llava-hf/llava-1.5-7b-hf"
export WHISPER_MODEL="base"
export MM_EMBEDDING_SERVICE_HOST_IP=${host_ip}
export MM_RETRIEVER_SERVICE_HOST_IP=${host_ip}
export LVM_SERVICE_HOST_IP=${host_ip}
export MEGA_SERVICE_HOST_IP=${host_ip}
export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/multimodalqna"
export DATAPREP_INGEST_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/ingest_with_text"
export DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_transcripts"
export DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_captions"
export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_files"
export DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_files"
export LVM_ENDPOINT="http://${host_ip}:$LLAVA_SERVER_PORT"
export MEGA_SERVICE_PORT=8888
export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:$MEGA_SERVICE_PORT/v1/multimodalqna"
export UI_PORT=5173
```

Note: Please replace `host_ip` with your external IP address; do not use localhost.

> Note: The `MAX_IMAGES` environment variable specifies the maximum number of images that will be sent from the LVM service to the LLaVA server.
> If an image list longer than `MAX_IMAGES` is sent to the LVM service, the list is shortened before being passed to the LLaVA service. When the list
> needs to be shortened, the most recent images (the ones at the end of the list) are prioritized. Some LLaVA models have not been trained with
> multiple images and may produce inaccurate results. If `MAX_IMAGES` is not set, it defaults to `1`.
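
For example, to let the LVM service forward up to two retrieved images per request, override the default along with the other environment variables above (a minimal sketch; whether multiple images actually improve answers depends on the LVM model in use):

```bash
# Allow the LVM service to send up to 2 images per request to the LLaVA server.
# Export this with the other variables; it takes effect when the services are
# started later with `docker compose -f compose.yaml up -d`.
export MAX_IMAGES=2
```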

## 🚀 Build Docker Images

### 1. Build embedding-multimodal-bridgetower Image
@@ -112,7 +129,7 @@ Build embedding-multimodal-bridgetower docker image
```bash
git clone https://github.com/opea-project/GenAIComps.git
cd GenAIComps
docker build --no-cache -t opea/embedding-multimodal-bridgetower:latest --build-arg EMBEDDER_PORT=$EMBEDDER_PORT --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/third_parties/bridgetower/src/Dockerfile .
docker build --no-cache -t opea/embedding-multimodal-bridgetower:latest --build-arg EMBEDDER_PORT=$EMM_BRIDGETOWER_PORT --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/third_parties/bridgetower/src/Dockerfile .
```

Build embedding microservice image
@@ -147,7 +164,7 @@ docker build --no-cache -t opea/lvm:latest --build-arg https_proxy=$https_proxy
docker build --no-cache -t opea/dataprep-multimodal-redis:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/dataprep/multimodal/redis/langchain/Dockerfile .
```

### 5. Build asr images
### 5. Build Whisper Server Image

Build whisper server image

@@ -214,14 +231,14 @@ docker compose -f compose.yaml up -d
1. embedding-multimodal-bridgetower

```bash
curl http://${host_ip}:${EMBEDDER_PORT}/v1/encode \
curl http://${host_ip}:${EMM_BRIDGETOWER_PORT}/v1/encode \
-X POST \
-H "Content-Type:application/json" \
-d '{"text":"This is example"}'
```

```bash
curl http://${host_ip}:${EMBEDDER_PORT}/v1/encode \
curl http://${host_ip}:${EMM_BRIDGETOWER_PORT}/v1/encode \
-X POST \
-H "Content-Type:application/json" \
-d '{"text":"This is example", "img_b64_str": "iVBORw0KGgoAAAANSUhEUgAAAAoAAAAKCAYAAACNMs+9AAAAFUlEQVR42mP8/5+hnoEIwDiqkL4KAcT9GO0U4BxoAAAAAElFTkSuQmCC"}'
@@ -247,13 +264,13 @@ curl http://${host_ip}:$MM_EMBEDDING_PORT_MICROSERVICE/v1/embeddings \

```bash
export your_embedding=$(python3 -c "import random; embedding = [random.uniform(-1, 1) for _ in range(512)]; print(embedding)")
curl http://${host_ip}:7000/v1/multimodal_retrieval \
curl http://${host_ip}:${REDIS_RETRIEVER_PORT}/v1/multimodal_retrieval \
-X POST \
-H "Content-Type: application/json" \
-d "{\"text\":\"test\",\"embedding\":${your_embedding}}"
```

4. asr
4. whisper

```bash
curl ${WHISPER_SERVER_ENDPOINT} \
@@ -274,14 +291,14 @@ curl http://${host_ip}:${LLAVA_SERVER_PORT}/generate \
6. lvm

```bash
curl http://${host_ip}:9399/v1/lvm \
curl http://${host_ip}:${LVM_PORT}/v1/lvm \
-X POST \
-H 'Content-Type: application/json' \
-d '{"retrieved_docs": [], "initial_query": "What is this?", "top_n": 1, "metadata": [{"b64_img_str": "iVBORw0KGgoAAAANSUhEUgAAAAoAAAAKCAYAAACNMs+9AAAAFUlEQVR42mP8/5+hnoEIwDiqkL4KAcT9GO0U4BxoAAAAAElFTkSuQmCC", "transcript_for_inference": "yellow image", "video_id": "8c7461df-b373-4a00-8696-9a2234359fe0", "time_of_frame_ms":"37000000", "source_video":"WeAreGoingOnBullrun_8c7461df-b373-4a00-8696-9a2234359fe0.mp4"}], "chat_template":"The caption of the image is: '\''{context}'\''. {question}"}'
```

```bash
curl http://${host_ip}:9399/v1/lvm \
curl http://${host_ip}:${LVM_PORT}/v1/lvm \
-X POST \
-H 'Content-Type: application/json' \
-d '{"image": "iVBORw0KGgoAAAANSUhEUgAAAAoAAAAKCAYAAACNMs+9AAAAFUlEQVR42mP8/5+hnoEIwDiqkL4KAcT9GO0U4BxoAAAAAElFTkSuQmCC", "prompt":"What is this?"}'
@@ -290,15 +307,15 @@ curl http://${host_ip}:9399/v1/lvm \
Also, validate LVM Microservice with empty retrieval results

```bash
curl http://${host_ip}:9399/v1/lvm \
curl http://${host_ip}:${LVM_PORT}/v1/lvm \
-X POST \
-H 'Content-Type: application/json' \
-d '{"retrieved_docs": [], "initial_query": "What is this?", "top_n": 1, "metadata": [], "chat_template":"The caption of the image is: '\''{context}'\''. {question}"}'
```

7. dataprep-multimodal-redis

Download a sample video, image, and audio file and create a caption
Download a sample video, image, PDF, and audio file and create a caption

```bash
export video_fn="WeAreGoingOnBullrun.mp4"
@@ -307,6 +324,9 @@ wget http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/WeAreGoing
export image_fn="apple.png"
wget https://github.com/docarray/docarray/blob/main/tests/toydata/image-data/apple.png?raw=true -O ${image_fn}

export pdf_fn="nke-10k-2023.pdf"
wget https://raw.githubusercontent.com/opea-project/GenAIComps/v1.1/comps/retrievers/redis/data/nke-10k-2023.pdf -O ${pdf_fn}

export caption_fn="apple.txt"
echo "This is an apple." > ${caption_fn}

@@ -325,7 +345,7 @@ curl --silent --write-out "HTTPSTATUS:%{http_code}" \
-F "files=@./${audio_fn}"
```

Also, test dataprep microservice with generating an image caption using lvm microservice
Also, test dataprep microservice with generating an image caption using lvm microservice.

```bash
curl --silent --write-out "HTTPSTATUS:%{http_code}" \
@@ -334,13 +354,14 @@ curl --silent --write-out "HTTPSTATUS:%{http_code}" \
-X POST -F "files=@./${image_fn}"
```

Now, test the microservice with posting a custom caption along with an image
Now, test the microservice with posting a custom caption along with an image and a PDF containing images and text.

```bash
curl --silent --write-out "HTTPSTATUS:%{http_code}" \
${DATAPREP_INGEST_SERVICE_ENDPOINT} \
-H 'Content-Type: multipart/form-data' \
-X POST -F "files=@./${image_fn}" -F "files=@./${caption_fn}"
-X POST -F "files=@./${image_fn}" -F "files=@./${caption_fn}" \
-F "files=@./${pdf_fn}"
```

Also, you are able to get the list of all files that you uploaded:
@@ -358,7 +379,8 @@ Then you will get the response python-style LIST like this. Notice the name of e
"WeAreGoingOnBullrun_7ac553a1-116c-40a2-9fc5-deccbb89b507.mp4",
"WeAreGoingOnBullrun_6d13cf26-8ba2-4026-a3a9-ab2e5eb73a29.mp4",
"apple_fcade6e6-11a5-44a2-833a-3e534cbe4419.png",
"AudioSample_976a85a6-dc3e-43ab-966c-9d81beef780c.wav
"nke-10k-2023_28000757-5533-4b1b-89fe-7c0a1b7e2cd0.pdf",
"AudioSample_976a85a6-dc3e-43ab-966c-9d81beef780c.wav"
]
```

@@ -372,21 +394,41 @@ curl -X POST \

8. MegaService

Test the MegaService with a text query:

```bash
curl http://${host_ip}:8888/v1/multimodalqna \
curl http://${host_ip}:${MEGA_SERVICE_PORT}/v1/multimodalqna \
-H "Content-Type: application/json" \
-X POST \
-d '{"messages": "What is the revenue of Nike in 2023?"}'
```

Test the MegaService with an audio query:

```bash
curl http://${host_ip}:${MEGA_SERVICE_PORT}/v1/multimodalqna \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": [{"type": "audio", "audio": "UklGRigAAABXQVZFZm10IBIAAAABAAEARKwAAIhYAQACABAAAABkYXRhAgAAAAEA"}]}]}'
```

Test the MegaService with a text and image query:

```bash
curl http://${host_ip}:${MEGA_SERVICE_PORT}/v1/multimodalqna \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": [{"type": "text", "text": "Green bananas in a tree"}, {"type": "image_url", "image_url": {"url": "http://images.cocodataset.org/test-stuff2017/000000004248.jpg"}}]}]}'
```

Test the MegaService with a back-and-forth conversation between the user and assistant:

```bash
curl http://${host_ip}:8888/v1/multimodalqna \
curl http://${host_ip}:${MEGA_SERVICE_PORT}/v1/multimodalqna \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": [{"type": "audio", "audio": "UklGRigAAABXQVZFZm10IBIAAAABAAEARKwAAIhYAQACABAAAABkYXRhAgAAAAEA"}]}]}'
```

```bash
curl http://${host_ip}:8888/v1/multimodalqna \
curl http://${host_ip}:${MEGA_SERVICE_PORT}/v1/multimodalqna \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": [{"type": "text", "text": "hello, "}, {"type": "image_url", "image_url": {"url": "https://www.ilankelman.org/stopsigns/australia.jpg"}}]}, {"role": "assistant", "content": "opea project! "}, {"role": "user", "content": "chao, "}], "max_tokens": 10}'
```