MultimodalQnA image query, pdf, dynamic ports, and UI updates #1381

Open

wants to merge 39 commits into base: main

Commits
0c80db1
MultimodalQnA updates to add support for image queries (#23)
dmsuehir Dec 16, 2024
6ccced2
Merge branch 'main' of github.com:mhbuehler/GenAIExamples into mmqna-…
dmsuehir Dec 16, 2024
7e05f57
Added logic for image_query
HarshaRamayanam Dec 17, 2024
e25ada0
Merge branch 'hramayan/img_query' into mmqna-image-query
HarshaRamayanam Dec 17, 2024
e903d33
Dynamic wait for lvm service in MultimodalQnA tests (#33)
mhbuehler Dec 18, 2024
480bb4a
Update branch used for testing (#35)
dmsuehir Dec 18, 2024
a0487bb
Merge branch 'main' into mmqna-image-query
mhbuehler Dec 19, 2024
d2f8f64
Fix comps clone so it points to our branch (#36)
mhbuehler Dec 19, 2024
e47e9c1
Made all accessible GenAIExamples Ports dynamic (#34)
okhleif-IL Dec 30, 2024
c4d569c
Added fix for video upload format
HarshaRamayanam Jan 7, 2025
8360618
Merge pull request #38 from mhbuehler/hramayan/vid_upload_bug_fix
HarshaRamayanam Jan 8, 2025
8b54159
Adds transcript to MMQnA conversation (#37)
mhbuehler Jan 8, 2025
5378cb4
MultimodalQnA PDF Upload & Display (#32)
mhbuehler Jan 8, 2025
72c0709
merging main
okhleif-IL Jan 8, 2025
d14ab65
Merge branch 'mmqna-image-query' of https://github.com/mhbuehler/GenA…
okhleif-IL Jan 8, 2025
b67d3ce
Merge branch 'main' of github.com:mhbuehler/GenAIExamples into mmqna-…
dmsuehir Jan 9, 2025
05812b5
removed merge conflicts
okhleif-IL Jan 9, 2025
fda37f4
Fixed Whisper Service Port merge (#39)
okhleif-IL Jan 10, 2025
675735e
Fixes bug with image query (#40)
HarshaRamayanam Jan 10, 2025
21c3f88
Merge branch 'main' into mmqna-image-query
mhbuehler Jan 10, 2025
da1e6d2
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 10, 2025
5961795
Add asr to list of images to build in tests (#41)
dmsuehir Jan 10, 2025
17a25cc
Merge branch 'main' into mmqna-image-query
mhbuehler Jan 13, 2025
152a6b6
Merge branch 'main' into mmqna-image-query
mhbuehler Jan 14, 2025
53cc8aa
Merge branch 'main' into mmqna-image-query
mhbuehler Jan 15, 2025
100e7c7
Revert clones to OPEA main branch
mhbuehler Jan 15, 2025
48d06fe
Rollback accidental commit
mhbuehler Jan 15, 2025
b426966
Fix env vars in the MMQnA test environment and align_inputs error (#45)
dmsuehir Jan 15, 2025
343403f
Fix for omitted transcripts for pdfs (#44)
mhbuehler Jan 15, 2025
b1dd5d9
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 15, 2025
2f27302
Increase timeout waiting the LVM to download in Gaudi MMQnA test (#47)
dmsuehir Jan 15, 2025
7e3eca2
Fixes some issues for image queries (#48)
mhbuehler Jan 16, 2025
888cb71
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 16, 2025
9402fcd
Merge branch 'main' of github.com:mhbuehler/GenAIExamples into mmqna-…
dmsuehir Jan 16, 2025
0f0bcb5
MMQnA Doc Updates for Latest Release (#43)
okhleif-IL Jan 16, 2025
a7cc4bf
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jan 16, 2025
512a3f4
Update compose.yaml to use fixed internal port for the whisper server…
dmsuehir Jan 16, 2025
7fb22f3
Change git clone in rocm test script to dev branch (#49)
mhbuehler Jan 16, 2025
428c0ea
Merge branch 'main' into mmqna-image-query
ashahba Jan 17, 2025
5 changes: 2 additions & 3 deletions MultimodalQnA/Dockerfile
@@ -16,13 +16,12 @@ RUN useradd -m -s /bin/bash user && \

WORKDIR $HOME


# Stage 2: latest GenAIComps sources
FROM base AS git

RUN apt-get update && apt-get install -y --no-install-recommends git
RUN git clone --depth 1 https://github.com/opea-project/GenAIComps.git

#RUN git clone --depth 1 https://github.com/opea-project/GenAIComps.git
RUN git clone --depth 1 https://github.com/mhbuehler/GenAIComps.git --single-branch --branch mmqna-image-query

# Stage 3: common layer shared by services using GenAIComps
FROM base AS comps-base
52 changes: 41 additions & 11 deletions MultimodalQnA/README.md
@@ -1,8 +1,8 @@
# MultimodalQnA Application

Suppose you possess a set of videos and wish to perform question-answering to extract insights from these videos. To respond to your questions, it typically necessitates comprehension of visual cues within the videos, knowledge derived from the audio content, or often a mix of both these visual elements and auditory facts. The MultimodalQnA framework offers an optimal solution for this purpose.
Suppose you have a collection of videos, images, audio files, and PDFs (or any combination of them) and wish to perform question-answering to extract insights from these documents. To respond to your questions, the system needs to comprehend a mix of textual, visual, and audio facts drawn from the document contents. The MultimodalQnA framework offers an optimal solution for this purpose.

`MultimodalQnA` addresses your questions by dynamically fetching the most pertinent multimodal information (frames, transcripts, and/or captions) from your collection of videos, images, and audio files. For this purpose, MultimodalQnA utilizes [BridgeTower model](https://huggingface.co/BridgeTower/bridgetower-large-itm-mlm-gaudi), a multimodal encoding transformer model which merges visual and textual data into a unified semantic space. During the ingestion phase, the BridgeTower model embeds both visual cues and auditory facts as texts, and those embeddings are then stored in a vector database. When it comes to answering a question, the MultimodalQnA will fetch its most relevant multimodal content from the vector store and feed it into a downstream Large Vision-Language Model (LVM) as input context to generate a response for the user.
`MultimodalQnA` addresses your questions by dynamically fetching the most pertinent multimodal information (e.g., images, transcripts, and captions) from your collection of video, image, audio, and PDF files. For this purpose, MultimodalQnA utilizes the [BridgeTower model](https://huggingface.co/BridgeTower/bridgetower-large-itm-mlm-gaudi), a multimodal encoding transformer model that merges visual and textual data into a unified semantic space. During the ingestion phase, the BridgeTower model embeds both visual cues and auditory facts as texts, and those embeddings are then stored in a vector database. When answering a question, MultimodalQnA fetches the most relevant multimodal content from the vector store and feeds it into a downstream Large Vision-Language Model (LVM) as input context to generate a response for the user.
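
For example, once the services are deployed, a question can be posed to the MegaService gateway with a single request (a minimal sketch; the gateway port `8888`, the `${host_ip}` variable, and the endpoint path follow the defaults used in the deployment guides, and the question text is only an illustration):

```bash
# Ask a question against the previously ingested multimodal content
# (the MegaService gateway listens on port 8888 by default)
curl http://${host_ip}:8888/v1/multimodalqna \
  -H "Content-Type: application/json" \
  -X POST \
  -d '{"messages": "What happens in this video?"}'
```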

The MultimodalQnA architecture is shown below:

@@ -87,12 +87,12 @@ In the below, we provide a table that describes for each microservice component
<details>
<summary><b>Gaudi default compose.yaml</b></summary>

| MicroService | Open Source Project | HW | Port | Endpoint |
| ------------ | --------------------- | ----- | ---- | ----------------------------------------------- |
| Embedding | Langchain | Xeon | 6000 | /v1/embeddings |
| Retriever | Langchain, Redis | Xeon | 7000 | /v1/multimodal_retrieval |
| LVM | Langchain, TGI | Gaudi | 9399 | /v1/lvm |
| Dataprep | Redis, Langchain, TGI | Gaudi | 6007 | /v1/generate_transcripts, /v1/generate_captions |
| MicroService | Open Source Project | HW | Port | Endpoint |
| ------------ | --------------------- | ----- | ---- | --------------------------------------------------------------------- |
| Embedding | Langchain | Xeon | 6000 | /v1/embeddings |
| Retriever | Langchain, Redis | Xeon | 7000 | /v1/multimodal_retrieval |
| LVM | Langchain, TGI | Gaudi | 9399 | /v1/lvm |
| Dataprep | Redis, Langchain, TGI | Gaudi | 6007 | /v1/generate_transcripts, /v1/generate_captions, /v1/ingest_with_text |

</details>

@@ -172,8 +172,38 @@ docker compose -f compose.yaml up -d

## MultimodalQnA Demo on Gaudi2

![MultimodalQnA-upload-waiting-screenshot](./assets/img/upload-gen-trans.png)
### Multimodal QnA UI

![MultimodalQnA-upload-done-screenshot](./assets/img/upload-gen-captions.png)
![MultimodalQnA-ui-screenshot](./assets/img/mmqna-ui.png)

![MultimodalQnA-query-example-screenshot](./assets/img/example_query.png)
### Video Ingestion

![MultimodalQnA-ingest-video-screenshot](./assets/img/video-ingestion.png)

### Text Query following the ingestion of a Video

![MultimodalQnA-video-query-screenshot](./assets/img/video-query.png)

### Image Ingestion

![MultimodalQnA-ingest-image-screenshot](./assets/img/image-ingestion.png)

### Text Query following the ingestion of an image

![MultimodalQnA-video-query-screenshot](./assets/img/image-query.png)

### Audio Ingestion

![MultimodalQnA-audio-ingestion-screenshot](./assets/img/audio-ingestion.png)

### Text Query following the ingestion of an Audio Podcast

![MultimodalQnA-audio-query-screenshot](./assets/img/audio-query.png)

### PDF Ingestion

![MultimodalQnA-upload-pdf-screenshot](./assets/img/ingest_pdf.png)

### Text query following the ingestion of a PDF

![MultimodalQnA-pdf-query-example-screenshot](./assets/img/pdf-query.png)
Binary file added MultimodalQnA/assets/img/audio-ingestion.png
Binary file added MultimodalQnA/assets/img/audio-query.png
Binary file added MultimodalQnA/assets/img/image-ingestion.png
Binary file added MultimodalQnA/assets/img/image-query.png
Binary file added MultimodalQnA/assets/img/ingest_pdf.png
Binary file added MultimodalQnA/assets/img/mmqna-ui.png
Binary file added MultimodalQnA/assets/img/pdf-query.png
Binary file added MultimodalQnA/assets/img/video-ingestion.png
Binary file added MultimodalQnA/assets/img/video-query.png
114 changes: 78 additions & 36 deletions MultimodalQnA/docker_compose/intel/cpu/xeon/README.md
@@ -40,6 +40,10 @@ lvm
===
Port 9399 - Open to 0.0.0.0/0

whisper
===
Port 7066 - Open to 0.0.0.0/0

dataprep-multimodal-redis
===
Port 6007 - Open to 0.0.0.0/0
@@ -75,34 +79,47 @@ export your_no_proxy=${your_no_proxy},"External_Public_IP"
export no_proxy=${your_no_proxy}
export http_proxy=${your_http_proxy}
export https_proxy=${your_http_proxy}
export EMBEDDER_PORT=6006
export MMEI_EMBEDDING_ENDPOINT="http://${host_ip}:$EMBEDDER_PORT"
export MM_EMBEDDING_PORT_MICROSERVICE=6000
export WHISPER_SERVER_PORT=7066
export WHISPER_SERVER_ENDPOINT="http://${host_ip}:${WHISPER_SERVER_PORT}/v1/asr"
export REDIS_URL="redis://${host_ip}:6379"
export MM_EMBEDDING_SERVICE_HOST_IP=${host_ip}
export MM_RETRIEVER_SERVICE_HOST_IP=${host_ip}
export LVM_SERVICE_HOST_IP=${host_ip}
export MEGA_SERVICE_HOST_IP=${host_ip}
export WHISPER_PORT=7066
export WHISPER_SERVER_ENDPOINT="http://${host_ip}:${WHISPER_PORT}/v1/asr"
export WHISPER_MODEL="base"
export MAX_IMAGES=1
export REDIS_DB_PORT=6379
export REDIS_INSIGHTS_PORT=8001
export REDIS_URL="redis://${host_ip}:${REDIS_DB_PORT}"
export REDIS_HOST=${host_ip}
export INDEX_NAME="mm-rag-redis"
export DATAPREP_MMR_PORT=6007
export DATAPREP_INGEST_SERVICE_ENDPOINT="http://${host_ip}:${DATAPREP_MMR_PORT}/v1/ingest_with_text"
export DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT="http://${host_ip}:${DATAPREP_MMR_PORT}/v1/generate_transcripts"
export DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT="http://${host_ip}:${DATAPREP_MMR_PORT}/v1/generate_captions"
export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:${DATAPREP_MMR_PORT}/v1/dataprep/get_files"
export DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:${DATAPREP_MMR_PORT}/v1/dataprep/delete_files"
export EMM_BRIDGETOWER_PORT=6006
export EMBEDDING_MODEL_ID="BridgeTower/bridgetower-large-itm-mlm-itc"
export BRIDGE_TOWER_EMBEDDING=true
export MMEI_EMBEDDING_ENDPOINT="http://${host_ip}:$EMM_BRIDGETOWER_PORT"
export MM_EMBEDDING_PORT_MICROSERVICE=6000
export REDIS_RETRIEVER_PORT=7000
export LVM_PORT=9399
export LLAVA_SERVER_PORT=8399
export LVM_ENDPOINT="http://${host_ip}:8399"
export EMBEDDING_MODEL_ID="BridgeTower/bridgetower-large-itm-mlm-itc"
export LVM_MODEL_ID="llava-hf/llava-1.5-7b-hf"
export WHISPER_MODEL="base"
export MM_EMBEDDING_SERVICE_HOST_IP=${host_ip}
export MM_RETRIEVER_SERVICE_HOST_IP=${host_ip}
export LVM_SERVICE_HOST_IP=${host_ip}
export MEGA_SERVICE_HOST_IP=${host_ip}
export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:8888/v1/multimodalqna"
export DATAPREP_INGEST_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/ingest_with_text"
export DATAPREP_GEN_TRANSCRIPT_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_transcripts"
export DATAPREP_GEN_CAPTION_SERVICE_ENDPOINT="http://${host_ip}:6007/v1/generate_captions"
export DATAPREP_GET_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/get_files"
export DATAPREP_DELETE_FILE_ENDPOINT="http://${host_ip}:6007/v1/dataprep/delete_files"
export LVM_ENDPOINT="http://${host_ip}:$LLAVA_SERVER_PORT"
export MEGA_SERVICE_PORT=8888
export BACKEND_SERVICE_ENDPOINT="http://${host_ip}:$MEGA_SERVICE_PORT/v1/multimodalqna"
export UI_PORT=5173
```

Note: Please replace `host_ip` with your external IP address; do not use localhost.

> Note: The `MAX_IMAGES` environment variable specifies the maximum number of images that will be sent from the LVM service to the LLaVA server.
> If an image list longer than `MAX_IMAGES` is sent to the LVM service, the list is shortened before being passed to the LLaVA service. When the list
> needs to be shortened, the most recent images (the ones at the end of the list) are prioritized. Some LLaVA models have not been trained with
> multiple images and may produce inaccurate results. If `MAX_IMAGES` is not set, it defaults to `1`.
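
For example, to let the LVM service forward up to two retrieved images per request, override the default along with the other environment variables above (a minimal sketch; whether multiple images actually improve answers depends on the LVM model in use):

```bash
# Allow the LVM service to send up to 2 images per request to the LLaVA server.
# Export this with the other variables; it takes effect when the services are
# started later with `docker compose -f compose.yaml up -d`.
export MAX_IMAGES=2
```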

## 🚀 Build Docker Images

### 1. Build embedding-multimodal-bridgetower Image
@@ -112,7 +129,7 @@ Build embedding-multimodal-bridgetower docker image
```bash
git clone https://github.com/opea-project/GenAIComps.git
cd GenAIComps
docker build --no-cache -t opea/embedding-multimodal-bridgetower:latest --build-arg EMBEDDER_PORT=$EMBEDDER_PORT --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/third_parties/bridgetower/src/Dockerfile .
docker build --no-cache -t opea/embedding-multimodal-bridgetower:latest --build-arg EMBEDDER_PORT=$EMM_BRIDGETOWER_PORT --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/third_parties/bridgetower/src/Dockerfile .
```

Build embedding microservice image
@@ -147,7 +164,7 @@ docker build --no-cache -t opea/lvm:latest --build-arg https_proxy=$https_proxy
docker build --no-cache -t opea/dataprep-multimodal-redis:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/dataprep/multimodal/redis/langchain/Dockerfile .
```

### 5. Build asr images
### 5. Build Whisper Server Image

Build whisper server image

@@ -214,14 +231,14 @@ docker compose -f compose.yaml up -d
1. embedding-multimodal-bridgetower

```bash
curl http://${host_ip}:${EMBEDDER_PORT}/v1/encode \
curl http://${host_ip}:${EMM_BRIDGETOWER_PORT}/v1/encode \
-X POST \
-H "Content-Type:application/json" \
-d '{"text":"This is example"}'
```

```bash
curl http://${host_ip}:${EMBEDDER_PORT}/v1/encode \
curl http://${host_ip}:${EMM_BRIDGETOWER_PORT}/v1/encode \
-X POST \
-H "Content-Type:application/json" \
-d '{"text":"This is example", "img_b64_str": "iVBORw0KGgoAAAANSUhEUgAAAAoAAAAKCAYAAACNMs+9AAAAFUlEQVR42mP8/5+hnoEIwDiqkL4KAcT9GO0U4BxoAAAAAElFTkSuQmCC"}'
@@ -247,13 +264,13 @@ curl http://${host_ip}:$MM_EMBEDDING_PORT_MICROSERVICE/v1/embeddings \

```bash
export your_embedding=$(python3 -c "import random; embedding = [random.uniform(-1, 1) for _ in range(512)]; print(embedding)")
curl http://${host_ip}:7000/v1/multimodal_retrieval \
curl http://${host_ip}:${REDIS_RETRIEVER_PORT}/v1/multimodal_retrieval \
-X POST \
-H "Content-Type: application/json" \
-d "{\"text\":\"test\",\"embedding\":${your_embedding}}"
```

4. asr
4. whisper

```bash
curl ${WHISPER_SERVER_ENDPOINT} \
@@ -274,14 +291,14 @@ curl http://${host_ip}:${LLAVA_SERVER_PORT}/generate \
6. lvm

```bash
curl http://${host_ip}:9399/v1/lvm \
curl http://${host_ip}:${LVM_PORT}/v1/lvm \
-X POST \
-H 'Content-Type: application/json' \
-d '{"retrieved_docs": [], "initial_query": "What is this?", "top_n": 1, "metadata": [{"b64_img_str": "iVBORw0KGgoAAAANSUhEUgAAAAoAAAAKCAYAAACNMs+9AAAAFUlEQVR42mP8/5+hnoEIwDiqkL4KAcT9GO0U4BxoAAAAAElFTkSuQmCC", "transcript_for_inference": "yellow image", "video_id": "8c7461df-b373-4a00-8696-9a2234359fe0", "time_of_frame_ms":"37000000", "source_video":"WeAreGoingOnBullrun_8c7461df-b373-4a00-8696-9a2234359fe0.mp4"}], "chat_template":"The caption of the image is: '\''{context}'\''. {question}"}'
```

```bash
curl http://${host_ip}:9399/v1/lvm \
curl http://${host_ip}:${LVM_PORT}/v1/lvm \
-X POST \
-H 'Content-Type: application/json' \
-d '{"image": "iVBORw0KGgoAAAANSUhEUgAAAAoAAAAKCAYAAACNMs+9AAAAFUlEQVR42mP8/5+hnoEIwDiqkL4KAcT9GO0U4BxoAAAAAElFTkSuQmCC", "prompt":"What is this?"}'
@@ -290,15 +307,15 @@ curl http://${host_ip}:9399/v1/lvm \
Also, validate LVM Microservice with empty retrieval results

```bash
curl http://${host_ip}:9399/v1/lvm \
curl http://${host_ip}:${LVM_PORT}/v1/lvm \
-X POST \
-H 'Content-Type: application/json' \
-d '{"retrieved_docs": [], "initial_query": "What is this?", "top_n": 1, "metadata": [], "chat_template":"The caption of the image is: '\''{context}'\''. {question}"}'
```

7. dataprep-multimodal-redis

Download a sample video, image, and audio file and create a caption
Download a sample video, image, PDF, and audio file and create a caption

```bash
export video_fn="WeAreGoingOnBullrun.mp4"
@@ -307,6 +324,9 @@ wget http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/WeAreGoing
export image_fn="apple.png"
wget https://github.com/docarray/docarray/blob/main/tests/toydata/image-data/apple.png?raw=true -O ${image_fn}

export pdf_fn="nke-10k-2023.pdf"
wget https://raw.githubusercontent.com/opea-project/GenAIComps/v1.1/comps/retrievers/redis/data/nke-10k-2023.pdf -O ${pdf_fn}

export caption_fn="apple.txt"
echo "This is an apple." > ${caption_fn}

@@ -325,7 +345,7 @@ curl --silent --write-out "HTTPSTATUS:%{http_code}" \
-F "files=@./${audio_fn}"
```

Also, test dataprep microservice with generating an image caption using lvm microservice
Also, test dataprep microservice with generating an image caption using lvm microservice.

```bash
curl --silent --write-out "HTTPSTATUS:%{http_code}" \
@@ -334,13 +354,14 @@ curl --silent --write-out "HTTPSTATUS:%{http_code}" \
-X POST -F "files=@./${image_fn}"
```

Now, test the microservice with posting a custom caption along with an image
Now, test the microservice with posting a custom caption along with an image and a PDF containing images and text.

```bash
curl --silent --write-out "HTTPSTATUS:%{http_code}" \
${DATAPREP_INGEST_SERVICE_ENDPOINT} \
-H 'Content-Type: multipart/form-data' \
-X POST -F "files=@./${image_fn}" -F "files=@./${caption_fn}"
-X POST -F "files=@./${image_fn}" -F "files=@./${caption_fn}" \
-F "files=@./${pdf_fn}"
```

Also, you are able to get the list of all files that you uploaded:
@@ -358,7 +379,8 @@ Then you will get the response python-style LIST like this. Notice the name of e
"WeAreGoingOnBullrun_7ac553a1-116c-40a2-9fc5-deccbb89b507.mp4",
"WeAreGoingOnBullrun_6d13cf26-8ba2-4026-a3a9-ab2e5eb73a29.mp4",
"apple_fcade6e6-11a5-44a2-833a-3e534cbe4419.png",
"AudioSample_976a85a6-dc3e-43ab-966c-9d81beef780c.wav
"nke-10k-2023_28000757-5533-4b1b-89fe-7c0a1b7e2cd0.pdf",
"AudioSample_976a85a6-dc3e-43ab-966c-9d81beef780c.wav"
]
```

@@ -372,21 +394,41 @@ curl -X POST \

8. MegaService

Test the MegaService with a text query:

```bash
curl http://${host_ip}:8888/v1/multimodalqna \
curl http://${host_ip}:${MEGA_SERVICE_PORT}/v1/multimodalqna \
-H "Content-Type: application/json" \
-X POST \
-d '{"messages": "What is the revenue of Nike in 2023?"}'
```

Test the MegaService with an audio query:

```bash
curl http://${host_ip}:${MEGA_SERVICE_PORT}/v1/multimodalqna \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": [{"type": "audio", "audio": "UklGRigAAABXQVZFZm10IBIAAAABAAEARKwAAIhYAQACABAAAABkYXRhAgAAAAEA"}]}]}'
```

Test the MegaService with a text and image query:

```bash
curl http://${host_ip}:${MEGA_SERVICE_PORT}/v1/multimodalqna \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": [{"type": "text", "text": "Green bananas in a tree"}, {"type": "image_url", "image_url": {"url": "http://images.cocodataset.org/test-stuff2017/000000004248.jpg"}}]}]}'
```

Test the MegaService with a back-and-forth conversation between the user and assistant:

```bash
curl http://${host_ip}:8888/v1/multimodalqna \
curl http://${host_ip}:${MEGA_SERVICE_PORT}/v1/multimodalqna \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": [{"type": "audio", "audio": "UklGRigAAABXQVZFZm10IBIAAAABAAEARKwAAIhYAQACABAAAABkYXRhAgAAAAEA"}]}]}'
```

```bash
curl http://${host_ip}:8888/v1/multimodalqna \
curl http://${host_ip}:${MEGA_SERVICE_PORT}/v1/multimodalqna \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "user", "content": [{"type": "text", "text": "hello, "}, {"type": "image_url", "image_url": {"url": "https://www.ilankelman.org/stopsigns/australia.jpg"}}]}, {"role": "assistant", "content": "opea project! "}, {"role": "user", "content": "chao, "}], "max_tokens": 10}'
```