Commit ed16308 (parent 259099d)

Provide unified scalable deployment and benchmarking support for exam… (#1315)

Signed-off-by: Cathy Zhang <[email protected]>
Signed-off-by: letonghan <[email protected]>
Co-authored-by: letonghan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

Showing 6 changed files with 1,470 additions and 0 deletions.
New file (+83 lines):
```yaml
# Copyright (C) 2025 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

deploy:
  device: gaudi
  version: 1.1.0
  modelUseHostPath: /mnt/models
  HUGGINGFACEHUB_API_TOKEN: ""
  node: [1, 2, 4, 8]
  namespace: ""

  services:
    backend:
      instance_num: [2, 2, 4, 8]
      cores_per_instance: ""
      memory_capacity: ""

    teirerank:
      enabled: True
      model_id: ""
      replicaCount: [1, 1, 1, 1]
      cards_per_instance: 1

    tei:
      model_id: ""
      replicaCount: [1, 2, 4, 8]
      cores_per_instance: ""
      memory_capacity: ""

    llm:
      engine: tgi
      model_id: ""
      replicaCount: [7, 15, 31, 63]
      max_batch_size: [1, 2, 4, 8]
      max_input_length: ""
      max_total_tokens: ""
      max_batch_total_tokens: ""
      max_batch_prefill_tokens: ""
      cards_per_instance: 1

    data-prep:
      replicaCount: [1, 1, 1, 1]
      cores_per_instance: ""
      memory_capacity: ""

    retriever-usvc:
      replicaCount: [2, 2, 4, 8]
      cores_per_instance: ""
      memory_capacity: ""

    redis-vector-db:
      replicaCount: [1, 1, 1, 1]
      cores_per_instance: ""
      memory_capacity: ""

    chatqna-ui:
      replicaCount: [1, 1, 1, 1]

    nginx:
      replicaCount: [1, 1, 1, 1]

benchmark:
  # http request behavior related fields
  concurrency: [1, 2, 4]
  totoal_query_num: [2048, 4096]
  duration: [5, 10] # unit minutes
  query_num_per_concurrency: [4, 8, 16]
  possion: True
  possion_arrival_rate: 1.0
  warmup_iterations: 10
  seed: 1024

  # workload, all of the test cases will run for benchmark
  test_cases:
    - chatqnafixed
    - chatqna_qlist_pubmed:
        dataset: pub_med10 # pub_med10, pub_med100, pub_med1000
        user_queries: [1, 2, 4]
        query_token_size: 128 # if specified, means fixed query token size will be sent out

  llm:
    # specify the llm output token size
    max_token_size: [128, 256]
```
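The list-valued fields above (`node`, `replicaCount`, `instance_num`, `max_batch_size`) read as parallel arrays, one entry per node count in `node` — e.g. at 4 nodes, `llm` runs 31 replicas. A minimal sketch of that pairing, assuming the position-for-position mapping the matching list lengths suggest (the helper `values_for_node` is illustrative, not part of the repo):

```python
# Sketch: resolve per-service values for one node count, assuming list
# fields in the config are parallel to the `node` list (an assumption
# inferred from the matching list lengths, not a documented contract).
def values_for_node(config: dict, target_node: int) -> dict:
    idx = config["node"].index(target_node)  # position of the node count
    resolved = {}
    for service, fields in config["services"].items():
        # lists are indexed by node-count position; scalars pass through
        resolved[service] = {
            k: (v[idx] if isinstance(v, list) else v) for k, v in fields.items()
        }
    return resolved

config = {
    "node": [1, 2, 4, 8],
    "services": {
        "llm": {"engine": "tgi", "replicaCount": [7, 15, 31, 63]},
        "tei": {"replicaCount": [1, 2, 4, 8]},
    },
}
print(values_for_node(config, 4)["llm"])  # -> {'engine': 'tgi', 'replicaCount': 31}
```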
New file (+69 lines):
# ChatQnA Benchmarking

## Purpose

We aim to run these benchmarks and share them with the OPEA community for three primary reasons:

- To offer insights on inference throughput in real-world scenarios, helping you choose the best service or deployment for your needs.
- To establish a baseline for validating optimization solutions across different implementations, providing clear guidance on which methods are most effective for your use case.
- To inspire the community to build upon our benchmarks, allowing us to better quantify new solutions in conjunction with current leading LLMs, serving frameworks, etc.
||
## Table of Contents

- [Prerequisites](#prerequisites)
- [Overview](#overview)
  - [Using deploy_and_benchmark.py](#using-deploy_and_benchmarkpy-recommended)
- [Data Preparation](#data-preparation)
- [Configuration](#configuration)
||
## Prerequisites

Before running the benchmarks, ensure you have:

1. **Kubernetes Environment**

   - Kubernetes installation: use [kubespray](https://github.com/opea-project/docs/blob/main/guide/installation/k8s_install/k8s_install_kubespray.md) or another official Kubernetes installation guide
   - (Optional) [Kubernetes setup guide for Intel Gaudi](https://github.com/opea-project/GenAIInfra/blob/main/README.md#setup-kubernetes-cluster)

2. **Configuration YAML**
   The configuration file (e.g., `./ChatQnA/benchmark_chatqna.yaml`) consists of two main sections: deployment and benchmarking. Required fields must be filled with valid values (such as the Hugging Face token). All other fields can either be customized to your needs or left empty (`""`) to use the default values from the [helm charts](https://github.com/opea-project/GenAIInfra/tree/main/helm-charts).
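The "leave it empty to use the chart default" convention can be pictured as dropping empty-string fields before they are turned into Helm value overrides. A minimal sketch of that idea (`prune_empty` is a hypothetical helper, not the actual merge logic in `deploy_and_benchmark.py`):

```python
# Sketch: keep only the fields the user actually set; "" means "fall back
# to the helm chart default", per the convention described above.
# The field names mirror the sample YAML; the helper itself is hypothetical.
def prune_empty(fields: dict) -> dict:
    """Return a copy of `fields` without empty-string values."""
    return {k: v for k, v in fields.items() if v != ""}

tei = {"model_id": "", "replicaCount": [1, 2, 4, 8], "cores_per_instance": ""}
print(prune_empty(tei))  # -> {'replicaCount': [1, 2, 4, 8]}
```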
|
||
## Data Preparation

Before running benchmarks, you need to:

1. **Prepare Test Data**

   - Download the retrieval file (use the raw file URL rather than the GitHub page URL, so `wget` fetches the file itself, not HTML):
     ```bash
     wget https://raw.githubusercontent.com/opea-project/GenAIEval/main/evals/benchmark/data/upload_file.txt
     ```
   - For the `chatqna_qlist_pubmed` test case, prepare `pubmed_${max_lines}.txt` by following this [README](https://github.com/opea-project/GenAIEval/blob/main/evals/benchmark/stresscli/README_Pubmed_qlist.md)

2. **Prepare Model Files (Recommended)**
   ```bash
   pip install -U "huggingface_hub[cli]"
   sudo mkdir -p /mnt/models
   sudo chmod 777 /mnt/models
   huggingface-cli download --cache-dir /mnt/models Intel/neural-chat-7b-v3-3
   ```
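A small pre-flight check can confirm the test data from step 1 is in place before kicking off a long deploy-and-benchmark run. This sketch assumes the file names described above (`upload_file.txt`, `pubmed_${max_lines}.txt`); the `missing_inputs` helper is illustrative, not part of the repo:

```python
# Sketch: report which required benchmark inputs are absent from a data
# directory. File names follow the README conventions above; the helper
# itself is hypothetical.
from pathlib import Path

def missing_inputs(data_dir, pubmed_max_lines=None):
    required = ["upload_file.txt"]  # retrieval file, always needed
    if pubmed_max_lines is not None:  # only for chatqna_qlist_pubmed
        required.append(f"pubmed_{pubmed_max_lines}.txt")
    root = Path(data_dir)
    return [name for name in required if not (root / name).exists()]

missing = missing_inputs(".", pubmed_max_lines=10)
if missing:
    print("missing benchmark inputs:", ", ".join(missing))
```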
|
||
## Overview

The benchmarking process consists of two main components: deployment and benchmarking. We provide `deploy_and_benchmark.py` as a unified entry point that combines both steps.

### Using deploy_and_benchmark.py (Recommended)

The script `deploy_and_benchmark.py` serves as the main entry point. Here's an example using the ChatQnA configuration (you can replace it with any other example's configuration YAML file):

1. For a specific number of nodes:

   ```bash
   python deploy_and_benchmark.py ./ChatQnA/benchmark_chatqna.yaml --target-node 1
   ```

2. For all node configurations:

   ```bash
   python deploy_and_benchmark.py ./ChatQnA/benchmark_chatqna.yaml
   ```

   This will iterate through the node list in your configuration YAML file, performing deployment and benchmarking for each node count.
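The two invocation modes above differ only in which entries of the config's `node` list get a deploy-and-benchmark cycle. A minimal sketch of that selection logic, under the assumption that `--target-node` simply filters the list (`select_node_counts` is illustrative, not the script's actual internals):

```python
# Sketch: with --target-node, run one node count; without it, run every
# entry of the YAML's `node` list in order. Hypothetical helper.
def select_node_counts(node_list, target_node=None):
    if target_node is None:
        return list(node_list)  # iterate the full list from the YAML
    return [n for n in node_list if n == target_node]

print(select_node_counts([1, 2, 4, 8]))                 # -> [1, 2, 4, 8]
print(select_node_counts([1, 2, 4, 8], target_node=1))  # -> [1]
```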