[ChatQnA] Update the default LLM to llama3-8B on cpu/gpu/hpu
Update the default LLM to llama3-8B on cpu/nvgpu/amdgpu/gaudi to avoid the potential model-serving issue and the missing chat-template issue when using neural-chat-7b.

#1420
Signed-off-by: Wang, Kai Lawrence <[email protected]>
wangkl2 committed Jan 20, 2025
1 parent 6bfd156 commit 58a6a06
Showing 25 changed files with 68 additions and 52 deletions.
12 changes: 8 additions & 4 deletions ChatQnA/README.md
@@ -8,7 +8,7 @@ RAG bridges the knowledge gap by dynamically fetching relevant information from

| Cloud Provider | Intel Architecture | Intel Optimized Cloud Module for Terraform | Comments |
| -------------------- | --------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------- |
-| AWS | 4th Gen Intel Xeon with Intel AMX | [AWS Module](https://github.com/intel/terraform-intel-aws-vm/tree/main/examples/gen-ai-xeon-opea-chatqna) | Uses Intel/neural-chat-7b-v3-3 by default |
+| AWS | 4th Gen Intel Xeon with Intel AMX | [AWS Module](https://github.com/intel/terraform-intel-aws-vm/tree/main/examples/gen-ai-xeon-opea-chatqna) | Uses meta-llama/Meta-Llama-3-8B-Instruct by default |
| AWS Falcon2-11B | 4th Gen Intel Xeon with Intel AMX | [AWS Module with Falcon11B](https://github.com/intel/terraform-intel-aws-vm/tree/main/examples/gen-ai-xeon-opea-chatqna-falcon11B) | Uses TII Falcon2-11B LLM Model |
| GCP | 5th Gen Intel Xeon with Intel AMX | [GCP Module](https://github.com/intel/terraform-intel-gcp-vm/tree/main/examples/gen-ai-xeon-opea-chatqna) | Also supports Confidential AI by using Intel® TDX with 4th Gen Xeon |
| Azure | 5th Gen Intel Xeon with Intel AMX | Work-in-progress | Work-in-progress |
@@ -25,7 +25,7 @@ Use this if you are not using Terraform and have provisioned your system with an

## Manually Deploy ChatQnA Service

-The ChatQnA service can be effortlessly deployed on Intel Gaudi2, Intel Xeon Scalable Processors and Nvidia GPU.
+The ChatQnA service can be effortlessly deployed on Intel Gaudi2, Intel Xeon Scalable Processors, Nvidia GPU, and AMD GPU.

Two types of ChatQnA pipeline are supported now: `ChatQnA with/without Rerank`. The `ChatQnA without Rerank` pipeline (including Embedding, Retrieval, and LLM) is offered for Xeon customers who cannot run the rerank service on HPU yet require high performance and accuracy.

@@ -35,7 +35,11 @@ Quick Start Deployment Steps:
2. Run Docker Compose.
3. Consume the ChatQnA Service.

-Note: If you do not have docker installed you can run this script to install docker : `bash docker_compose/install_docker.sh`
+Note:
+
+1. If you do not have Docker installed, you can run this script to install it: `bash docker_compose/install_docker.sh`.
+
+2. The default LLM is `meta-llama/Meta-Llama-3-8B-Instruct`. Before deploying the application, make sure you have either requested and been granted access to it on [Hugging Face](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) or downloaded the model locally from [ModelScope](https://www.modelscope.cn/models).
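
Since `meta-llama/Meta-Llama-3-8B-Instruct` is a gated model, it can save a failed deployment to confirm the token actually has access before bringing the stack up. A minimal sketch, assuming `huggingface_hub` (which provides `huggingface-cli`) is installed and `${your_hf_token}` is the token tied to the approved access request:

```bash
# Sketch: confirm gated-model access before deploying (assumes huggingface_hub is installed)
export HF_TOKEN=${your_hf_token}
# Fetching only the config is enough to verify access; a 403 here means
# the access request on Hugging Face is still pending or was never made.
huggingface-cli download meta-llama/Meta-Llama-3-8B-Instruct config.json --token $HF_TOKEN
```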

### Quick Start: 1.Setup Environment Variable

@@ -213,7 +217,7 @@ By default, the embedding, reranking and LLM models are set to a default value a
| --------- | ------------------------- |
| Embedding | BAAI/bge-base-en-v1.5 |
| Reranking | BAAI/bge-reranker-base |
-| LLM | Intel/neural-chat-7b-v3-3 |
+| LLM | meta-llama/Meta-Llama-3-8B-Instruct |

Change the `xxx_MODEL_ID` in `docker_compose/xxx/set_env.sh` for your needs.
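
For example, to try a different instruction-tuned model, the variable can be exported after sourcing the script and before bringing the stack up. A minimal sketch, assuming the Xeon compose path and that the compose files read `LLM_MODEL_ID` from the environment (the substitute model and compose file name are shown for illustration only):

```bash
# Sketch: override the default LLM before starting the services
source docker_compose/intel/cpu/xeon/set_env.sh
export LLM_MODEL_ID="Qwen/Qwen2.5-7B-Instruct"   # hypothetical substitute model
docker compose -f docker_compose/intel/cpu/xeon/compose.yaml up -d
```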

2 changes: 1 addition & 1 deletion ChatQnA/chatqna.py
@@ -57,7 +57,7 @@ def generate_rag_prompt(question, documents):
RERANK_SERVER_PORT = int(os.getenv("RERANK_SERVER_PORT", 80))
LLM_SERVER_HOST_IP = os.getenv("LLM_SERVER_HOST_IP", "0.0.0.0")
LLM_SERVER_PORT = int(os.getenv("LLM_SERVER_PORT", 80))
-LLM_MODEL = os.getenv("LLM_MODEL", "Intel/neural-chat-7b-v3-3")
+LLM_MODEL = os.getenv("LLM_MODEL", "meta-llama/Meta-Llama-3-8B-Instruct")


def align_inputs(self, inputs, cur_node, runtime_graph, llm_parameters_dict, **kwargs):
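
Because these settings are read with `os.getenv`, the gateway can be repointed without touching the code. A rough sketch, assuming the megaservice is launched directly from the repo root with its dependencies installed (the host address is hypothetical):

```bash
# Sketch: redirect the gateway to another LLM endpoint via environment variables
export LLM_SERVER_HOST_IP="10.0.0.5"    # hypothetical LLM server address
export LLM_SERVER_PORT=9009
export LLM_MODEL="meta-llama/Meta-Llama-3-8B-Instruct"
python ChatQnA/chatqna.py
```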
6 changes: 4 additions & 2 deletions ChatQnA/docker_compose/amd/gpu/rocm/README.md
@@ -10,6 +10,8 @@ Quick Start Deployment Steps:
2. Run Docker Compose.
3. Consume the ChatQnA Service.

+Note: The default LLM is `meta-llama/Meta-Llama-3-8B-Instruct`. Before deploying the application, make sure you have either requested and been granted access to it on [Hugging Face](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) or downloaded the model locally from [ModelScope](https://www.modelscope.cn/models).

## Quick Start: 1.Setup Environment Variable

To set up environment variables for deploying ChatQnA services, follow these steps:
@@ -159,7 +161,7 @@ By default, the embedding, reranking and LLM models are set to a default value a
| --------- | ------------------------- |
| Embedding | BAAI/bge-base-en-v1.5 |
| Reranking | BAAI/bge-reranker-base |
-| LLM | Intel/neural-chat-7b-v3-3 |
+| LLM | meta-llama/Meta-Llama-3-8B-Instruct |

Change the `xxx_MODEL_ID` below for your needs.

@@ -179,7 +181,7 @@ Change the `xxx_MODEL_ID` below for your needs.
export CHATQNA_TGI_SERVICE_IMAGE="ghcr.io/huggingface/text-generation-inference:2.3.1-rocm"
export CHATQNA_EMBEDDING_MODEL_ID="BAAI/bge-base-en-v1.5"
export CHATQNA_RERANK_MODEL_ID="BAAI/bge-reranker-base"
-export CHATQNA_LLM_MODEL_ID="Intel/neural-chat-7b-v3-3"
+export CHATQNA_LLM_MODEL_ID="meta-llama/Meta-Llama-3-8B-Instruct"
export CHATQNA_TGI_SERVICE_PORT=8008
export CHATQNA_TEI_EMBEDDING_PORT=8090
export CHATQNA_TEI_EMBEDDING_ENDPOINT="http://${HOST_IP}:${CHATQNA_TEI_EMBEDDING_PORT}"
2 changes: 1 addition & 1 deletion ChatQnA/docker_compose/amd/gpu/rocm/set_env.sh
@@ -6,7 +6,7 @@
export CHATQNA_TGI_SERVICE_IMAGE="ghcr.io/huggingface/text-generation-inference:2.3.1-rocm"
export CHATQNA_EMBEDDING_MODEL_ID="BAAI/bge-base-en-v1.5"
export CHATQNA_RERANK_MODEL_ID="BAAI/bge-reranker-base"
-export CHATQNA_LLM_MODEL_ID="Intel/neural-chat-7b-v3-3"
+export CHATQNA_LLM_MODEL_ID="meta-llama/Meta-Llama-3-8B-Instruct"
export CHATQNA_TGI_SERVICE_PORT=18008
export CHATQNA_TEI_EMBEDDING_PORT=18090
export CHATQNA_TEI_EMBEDDING_ENDPOINT="http://${HOST_IP}:${CHATQNA_TEI_EMBEDDING_PORT}"
14 changes: 8 additions & 6 deletions ChatQnA/docker_compose/intel/cpu/xeon/README.md
@@ -10,6 +10,8 @@ Quick Start:
2. Run Docker Compose.
3. Consume the ChatQnA Service.

+Note: The default LLM is `meta-llama/Meta-Llama-3-8B-Instruct`. Before deploying the application, make sure you have either requested and been granted access to it on [Hugging Face](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) or downloaded the model locally from [ModelScope](https://www.modelscope.cn/models).

## Quick Start: 1.Setup Environment Variable

To set up environment variables for deploying ChatQnA services, follow these steps:
@@ -184,7 +186,7 @@ By default, the embedding, reranking and LLM models are set to a default value a
| --------- | ------------------------- |
| Embedding | BAAI/bge-base-en-v1.5 |
| Reranking | BAAI/bge-reranker-base |
-| LLM | Intel/neural-chat-7b-v3-3 |
+| LLM | meta-llama/Meta-Llama-3-8B-Instruct |

Change the `xxx_MODEL_ID` below for your needs.

@@ -195,7 +197,7 @@ For users in China who are unable to download models directly from Huggingface,
```bash
export HF_TOKEN=${your_hf_token}
export HF_ENDPOINT="https://hf-mirror.com"
-model_name="Intel/neural-chat-7b-v3-3"
+model_name="meta-llama/Meta-Llama-3-8B-Instruct"
# Start vLLM LLM Service
docker run -p 8008:80 -v ./data:/data --name vllm-service -e HF_ENDPOINT=$HF_ENDPOINT -e http_proxy=$http_proxy -e https_proxy=$https_proxy --shm-size 128g opea/vllm:latest --model $model_name --host 0.0.0.0 --port 80
# Start TGI LLM Service
@@ -204,7 +206,7 @@ For users in China who are unable to download models directly from Huggingface,

2. Offline

-- Search your model name in ModelScope. For example, check [this page](https://www.modelscope.cn/models/ai-modelscope/neural-chat-7b-v3-1/files) for model `neural-chat-7b-v3-1`.
+- Search your model name in ModelScope. For example, check [this page](https://modelscope.cn/models/LLM-Research/Meta-Llama-3-8B-Instruct/files) for model `Meta-Llama-3-8B-Instruct`.

- Click on the `Download this model` button, and choose one way to download the model to your local path `/path/to/model`.
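
Once the files are downloaded, one way to serve the local copy is to mount the path into the container and pass the mounted path as the model name, mirroring the online vLLM command above. A sketch, assuming the ModelScope download landed in `/path/to/model`:

```bash
# Sketch: serve a locally downloaded Meta-Llama-3-8B-Instruct with vLLM
# /path/to/model is the local directory produced by the ModelScope download
docker run -p 8008:80 -v /path/to/model:/data/Meta-Llama-3-8B-Instruct \
  --name vllm-service --shm-size 128g opea/vllm:latest \
  --model /data/Meta-Llama-3-8B-Instruct --host 0.0.0.0 --port 80
```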

@@ -337,7 +339,7 @@ For details on how to verify the correctness of the response, refer to [how-to-v
# either vLLM or TGI service
curl http://${host_ip}:9009/v1/chat/completions \
-X POST \
-d '{"model": "Intel/neural-chat-7b-v3-3", "messages": [{"role": "user", "content": "What is Deep Learning?"}], "max_tokens":17}' \
-d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "user", "content": "What is Deep Learning?"}], "max_tokens":17}' \
-H 'Content-Type: application/json'
```
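
For a quicker eyeball check than raw JSON, the generated text can be extracted with `jq` (assuming it is installed; the response follows the OpenAI-compatible schema used above):

```bash
# Sketch: pull only the assistant's reply out of the chat completion response
curl -s http://${host_ip}:9009/v1/chat/completions \
  -X POST \
  -d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "user", "content": "What is Deep Learning?"}], "max_tokens":17}' \
  -H 'Content-Type: application/json' | jq -r '.choices[0].message.content'
```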
@@ -450,7 +452,7 @@ Users could follow previous section to testing vLLM microservice or ChatQnA Mega
```bash
curl http://${host_ip}:9009/start_profile \
-H "Content-Type: application/json" \
-d '{"model": "Intel/neural-chat-7b-v3-3"}'
-d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct"}'
```
Users would see the below docker logs from vllm-service if profiling is started correctly.
@@ -473,7 +475,7 @@ By following command, users could stop vLLM profiling and generate a \*.pt.trace
# vLLM Service
curl http://${host_ip}:9009/stop_profile \
-H "Content-Type: application/json" \
-d '{"model": "Intel/neural-chat-7b-v3-3"}'
-d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct"}'
```
Users would see the below docker logs from vllm-service if profiling is stopped correctly.
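
The trace lands in the directory given by `VLLM_TORCH_PROFILER_DIR` inside the container (the Gaudi example in this commit starts vLLM with `/mnt`). A sketch for pulling it onto the host, assuming the same directory is used here:

```bash
# Sketch: copy the vLLM profiler trace out of the container
# assumes the service was started with VLLM_TORCH_PROFILER_DIR=/mnt
docker cp vllm-service:/mnt ./vllm_traces
ls ./vllm_traces    # look for the generated *.pt.trace file
```

The \*.pt.trace file can then be opened in a trace viewer such as https://ui.perfetto.dev.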
10 changes: 6 additions & 4 deletions ChatQnA/docker_compose/intel/cpu/xeon/README_pinecone.md
@@ -10,6 +10,8 @@ Quick Start:
2. Run Docker Compose.
3. Consume the ChatQnA Service.

+Note: The default LLM is `meta-llama/Meta-Llama-3-8B-Instruct`. Before deploying the application, make sure you have either requested and been granted access to it on [Hugging Face](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) or downloaded the model locally from [ModelScope](https://www.modelscope.cn/models).

## Quick Start: 1.Setup Environment Variable

To set up environment variables for deploying ChatQnA services, follow these steps:
@@ -187,7 +189,7 @@ By default, the embedding, reranking and LLM models are set to a default value a
| --------- | ------------------------- |
| Embedding | BAAI/bge-base-en-v1.5 |
| Reranking | BAAI/bge-reranker-base |
-| LLM | Intel/neural-chat-7b-v3-3 |
+| LLM | meta-llama/Meta-Llama-3-8B-Instruct |

Change the `xxx_MODEL_ID` below for your needs.

@@ -198,13 +200,13 @@ For users in China who are unable to download models directly from Huggingface,
```bash
export HF_TOKEN=${your_hf_token}
export HF_ENDPOINT="https://hf-mirror.com"
-model_name="Intel/neural-chat-7b-v3-3"
+model_name="meta-llama/Meta-Llama-3-8B-Instruct"
docker run -p 8008:80 -v ./data:/data --name vllm-service -e HF_ENDPOINT=$HF_ENDPOINT -e http_proxy=$http_proxy -e https_proxy=$https_proxy --shm-size 128g opea/vllm:latest --model $model_name --host 0.0.0.0 --port 80
```

2. Offline

-- Search your model name in ModelScope. For example, check [this page](https://www.modelscope.cn/models/ai-modelscope/neural-chat-7b-v3-1/files) for model `neural-chat-7b-v3-1`.
+- Search your model name in ModelScope. For example, check [this page](https://modelscope.cn/models/LLM-Research/Meta-Llama-3-8B-Instruct/files) for model `Meta-Llama-3-8B-Instruct`.

- Click on the `Download this model` button, and choose one way to download the model to your local path `/path/to/model`.

@@ -324,7 +326,7 @@ For details on how to verify the correctness of the response, refer to [how-to-v
```bash
curl http://${host_ip}:9009/v1/chat/completions \
-X POST \
-d '{"model": "Intel/neural-chat-7b-v3-3", "messages": [{"role": "user", "content": "What is Deep Learning?"}], "max_tokens":17}' \
-d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "user", "content": "What is Deep Learning?"}], "max_tokens":17}' \
-H 'Content-Type: application/json'
```
8 changes: 5 additions & 3 deletions ChatQnA/docker_compose/intel/cpu/xeon/README_qdrant.md
@@ -4,6 +4,8 @@ This document outlines the deployment process for a ChatQnA application utilizin

The default pipeline deploys with vLLM as the LLM serving component and leverages the rerank component.

+Note: The default LLM is `meta-llama/Meta-Llama-3-8B-Instruct`. Before deploying the application, make sure you have either requested and been granted access to it on [Hugging Face](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) or downloaded the model locally from [ModelScope](https://www.modelscope.cn/models).

## 🚀 Apply Xeon Server on AWS

To apply a Xeon server on AWS, start by creating an AWS account if you don't have one already. Then, head to the [EC2 Console](https://console.aws.amazon.com/ec2/v2/home) to begin the process. Within the EC2 service, select the Amazon EC2 M7i or M7i-flex instance type to leverage the power of 4th Generation Intel Xeon Scalable processors. These instances are optimized for high-performance computing and demanding workloads.
@@ -145,7 +147,7 @@ By default, the embedding, reranking and LLM models are set to a default value a
| --------- | ------------------------- |
| Embedding | BAAI/bge-base-en-v1.5 |
| Reranking | BAAI/bge-reranker-base |
-| LLM | Intel/neural-chat-7b-v3-3 |
+| LLM | meta-llama/Meta-Llama-3-8B-Instruct |

Change the `xxx_MODEL_ID` below for your needs.

@@ -181,7 +183,7 @@ export http_proxy=${your_http_proxy}
export https_proxy=${your_http_proxy}
export EMBEDDING_MODEL_ID="BAAI/bge-base-en-v1.5"
export RERANK_MODEL_ID="BAAI/bge-reranker-base"
-export LLM_MODEL_ID="Intel/neural-chat-7b-v3-3"
+export LLM_MODEL_ID="meta-llama/Meta-Llama-3-8B-Instruct"
export INDEX_NAME="rag-qdrant"
```

@@ -256,7 +258,7 @@ For details on how to verify the correctness of the response, refer to [how-to-v
```bash
curl http://${host_ip}:6042/v1/chat/completions \
-X POST \
-d '{"model": "Intel/neural-chat-7b-v3-3", "messages": [{"role": "user", "content": "What is Deep Learning?"}], "max_tokens":17}' \
-d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "user", "content": "What is Deep Learning?"}], "max_tokens":17}' \
-H 'Content-Type: application/json'
```

2 changes: 1 addition & 1 deletion ChatQnA/docker_compose/intel/cpu/xeon/set_env.sh
@@ -9,7 +9,7 @@ popd > /dev/null

export EMBEDDING_MODEL_ID="BAAI/bge-base-en-v1.5"
export RERANK_MODEL_ID="BAAI/bge-reranker-base"
-export LLM_MODEL_ID="Intel/neural-chat-7b-v3-3"
+export LLM_MODEL_ID="meta-llama/Meta-Llama-3-8B-Instruct"
export INDEX_NAME="rag-redis"
# Set it as a non-null string, such as true, if you want to enable logging facility,
# otherwise, keep it as "" to disable it.
8 changes: 5 additions & 3 deletions ChatQnA/docker_compose/intel/hpu/gaudi/README.md
@@ -10,6 +10,8 @@ Quick Start:
2. Run Docker Compose.
3. Consume the ChatQnA Service.

+Note: The default LLM is `meta-llama/Meta-Llama-3-8B-Instruct`. Before deploying the application, make sure you have either requested and been granted access to it on [Hugging Face](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) or downloaded the model locally from [ModelScope](https://www.modelscope.cn/models).

## Quick Start: 1.Setup Environment Variable

To set up environment variables for deploying ChatQnA services, follow these steps:
@@ -182,7 +184,7 @@ By default, the embedding, reranking and LLM models are set to a default value a
| --------- | ------------------------- |
| Embedding | BAAI/bge-base-en-v1.5 |
| Reranking | BAAI/bge-reranker-base |
-| LLM | Intel/neural-chat-7b-v3-3 |
+| LLM | meta-llama/Meta-Llama-3-8B-Instruct |

Change the `xxx_MODEL_ID` below for your needs.

@@ -193,7 +195,7 @@ For users in China who are unable to download models directly from Huggingface,
```bash
export HF_TOKEN=${your_hf_token}
export HF_ENDPOINT="https://hf-mirror.com"
-model_name="Intel/neural-chat-7b-v3-3"
+model_name="meta-llama/Meta-Llama-3-8B-Instruct"
# Start vLLM LLM Service
docker run -p 8007:80 -v ./data:/data --name vllm-gaudi-server -e HF_ENDPOINT=$HF_ENDPOINT -e http_proxy=$http_proxy -e https_proxy=$https_proxy --runtime=habana -e HABANA_VISIBLE_DEVICES=all -e OMPI_MCA_btl_vader_single_copy_mechanism=none -e HUGGING_FACE_HUB_TOKEN=$HF_TOKEN -e VLLM_TORCH_PROFILER_DIR="/mnt" --cap-add=sys_nice --ipc=host opea/vllm-gaudi:latest --model $model_name --tensor-parallel-size 1 --host 0.0.0.0 --port 80 --block-size 128 --max-num-seqs 256 --max-seq_len-to-capture 2048
# Start TGI LLM Service
@@ -202,7 +204,7 @@ For users in China who are unable to download models directly from Huggingface,

2. Offline

-- Search your model name in ModelScope. For example, check [this page](https://www.modelscope.cn/models/ai-modelscope/neural-chat-7b-v3-1/files) for model `neural-chat-7b-v3-1`.
+- Search your model name in ModelScope. For example, check [this page](https://modelscope.cn/models/LLM-Research/Meta-Llama-3-8B-Instruct/files) for model `Meta-Llama-3-8B-Instruct`.

- Click on the `Download this model` button, and choose one way to download the model to your local path `/path/to/model`.

@@ -231,15 +231,15 @@ and the log shows model warm up, please wait for a while and try it later.
```
2024-06-05T05:45:27.707509646Z 2024-06-05T05:45:27.707361Z WARN text_generation_router: router/src/main.rs:357: `--revision` is not set
2024-06-05T05:45:27.707539740Z 2024-06-05T05:45:27.707379Z WARN text_generation_router: router/src/main.rs:358: We strongly advise to set it to a known supported commit.
-2024-06-05T05:45:27.852525522Z 2024-06-05T05:45:27.852437Z INFO text_generation_router: router/src/main.rs:379: Serving revision bdd31cf498d13782cc7497cba5896996ce429f91 of model Intel/neural-chat-7b-v3-3
+2024-06-05T05:45:27.852525522Z 2024-06-05T05:45:27.852437Z INFO text_generation_router: router/src/main.rs:379: Serving revision bdd31cf498d13782cc7497cba5896996ce429f91 of model meta-llama/Meta-Llama-3-8B-Instruct
2024-06-05T05:45:27.867833811Z 2024-06-05T05:45:27.867759Z INFO text_generation_router: router/src/main.rs:221: Warming up model
```

### 5 MegaService

```
curl http://${host_ip}:8888/v1/chatqna -H "Content-Type: application/json" -d '{
"model": "Intel/neural-chat-7b-v3-3",
"model": "meta-llama/Meta-Llama-3-8B-Instruct",
"messages": "What is the revenue of Nike in 2023?"
}'
```
2 changes: 1 addition & 1 deletion ChatQnA/docker_compose/intel/hpu/gaudi/set_env.sh
@@ -9,7 +9,7 @@ popd > /dev/null

export EMBEDDING_MODEL_ID="BAAI/bge-base-en-v1.5"
export RERANK_MODEL_ID="BAAI/bge-reranker-base"
-export LLM_MODEL_ID="Intel/neural-chat-7b-v3-3"
+export LLM_MODEL_ID="meta-llama/Meta-Llama-3-8B-Instruct"
export INDEX_NAME="rag-redis"
# Set it as a non-null string, such as true, if you want to enable logging facility,
# otherwise, keep it as "" to disable it.
6 changes: 4 additions & 2 deletions ChatQnA/docker_compose/nvidia/gpu/README.md
@@ -9,6 +9,8 @@ Quick Start Deployment Steps:
3. Run Docker Compose.
4. Consume the ChatQnA Service.

+Note: The default LLM is `meta-llama/Meta-Llama-3-8B-Instruct`. Before deploying the application, make sure you have either requested and been granted access to it on [Hugging Face](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) or downloaded the model locally from [ModelScope](https://www.modelscope.cn/models).

## Quick Start: 1.Setup Environment Variable

To set up environment variables for deploying ChatQnA services, follow these steps:
@@ -169,7 +171,7 @@ By default, the embedding, reranking and LLM models are set to a default value a
| --------- | ------------------------- |
| Embedding | BAAI/bge-base-en-v1.5 |
| Reranking | BAAI/bge-reranker-base |
-| LLM | Intel/neural-chat-7b-v3-3 |
+| LLM | meta-llama/Meta-Llama-3-8B-Instruct |

Change the `xxx_MODEL_ID` below for your needs.

@@ -287,7 +289,7 @@ docker compose up -d
```bash
curl http://${host_ip}:8008/v1/chat/completions \
-X POST \
-d '{"model": "Intel/neural-chat-7b-v3-3", "messages": [{"role": "user", "content": "What is Deep Learning?"}], "max_tokens":17}' \
-d '{"model": "meta-llama/Meta-Llama-3-8B-Instruct", "messages": [{"role": "user", "content": "What is Deep Learning?"}], "max_tokens":17}' \
-H 'Content-Type: application/json'
```
