Provide unified scalable deployment and benchmarking support for exam… (#1315)

Signed-off-by: Cathy Zhang <[email protected]>
Signed-off-by: letonghan <[email protected]>
Co-authored-by: letonghan <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
3 people authored Jan 24, 2025
1 parent 259099d commit ed16308
Showing 6 changed files with 1,470 additions and 0 deletions.
83 changes: 83 additions & 0 deletions ChatQnA/benchmark_chatqna.yaml
@@ -0,0 +1,83 @@
# Copyright (C) 2025 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

deploy:
device: gaudi
version: 1.1.0
modelUseHostPath: /mnt/models
HUGGINGFACEHUB_API_TOKEN: ""
node: [1, 2, 4, 8]
namespace: ""

services:
backend:
instance_num: [2, 2, 4, 8]
cores_per_instance: ""
memory_capacity: ""

teirerank:
enabled: True
model_id: ""
replicaCount: [1, 1, 1, 1]
cards_per_instance: 1

tei:
model_id: ""
replicaCount: [1, 2, 4, 8]
cores_per_instance: ""
memory_capacity: ""

llm:
engine: tgi
model_id: ""
replicaCount: [7, 15, 31, 63]
max_batch_size: [1, 2, 4, 8]
max_input_length: ""
max_total_tokens: ""
max_batch_total_tokens: ""
max_batch_prefill_tokens: ""
cards_per_instance: 1

data-prep:
replicaCount: [1, 1, 1, 1]
cores_per_instance: ""
memory_capacity: ""

retriever-usvc:
replicaCount: [2, 2, 4, 8]
cores_per_instance: ""
memory_capacity: ""

redis-vector-db:
replicaCount: [1, 1, 1, 1]
cores_per_instance: ""
memory_capacity: ""

chatqna-ui:
replicaCount: [1, 1, 1, 1]

nginx:
replicaCount: [1, 1, 1, 1]

benchmark:
# http request behavior related fields
concurrency: [1, 2, 4]
total_query_num: [2048, 4096]
duration: [5, 10] # unit: minutes
query_num_per_concurrency: [4, 8, 16]
poisson: True
poisson_arrival_rate: 1.0
warmup_iterations: 10
seed: 1024

# workload; all of the test cases listed below will be run during benchmarking
test_cases:
- chatqnafixed
- chatqna_qlist_pubmed:
dataset: pub_med10 # pub_med10, pub_med100, pub_med1000
user_queries: [1, 2, 4]
query_token_size: 128 # if specified, queries are sent with this fixed token size

llm:
# specify the llm output token size
max_token_size: [128, 256]
69 changes: 69 additions & 0 deletions README-deploy-benchmark.md
@@ -0,0 +1,69 @@
# ChatQnA Benchmarking

## Purpose

We aim to run these benchmarks and share them with the OPEA community for three primary reasons:

- To offer insights on inference throughput in real-world scenarios, helping you choose the best service or deployment for your needs.
- To establish a baseline for validating optimization solutions across different implementations, providing clear guidance on which methods are most effective for your use case.
- To inspire the community to build upon our benchmarks, allowing us to better quantify new solutions in conjunction with current leading LLMs, serving frameworks, etc.

## Table of Contents

- [Prerequisites](#prerequisites)
- [Data Preparation](#data-preparation)
- [Overview](#overview)
- [Using deploy_and_benchmark.py](#using-deploy_and_benchmarkpy-recommended)
- [Configuration](#configuration)

## Prerequisites

Before running the benchmarks, ensure you have:

1. **Kubernetes Environment**

- Kubernetes installation: Use [kubespray](https://github.com/opea-project/docs/blob/main/guide/installation/k8s_install/k8s_install_kubespray.md) or other official Kubernetes installation guides
   - (Optional) [Kubernetes setup guide for Intel Gaudi](https://github.com/opea-project/GenAIInfra/blob/main/README.md#setup-kubernetes-cluster)

2. **Configuration YAML**
The configuration file (e.g., `./ChatQnA/benchmark_chatqna.yaml`) consists of two main sections: deployment and benchmarking. Required fields must be filled with valid values (like the Hugging Face token). For all other fields, you can either customize them according to your needs or leave them empty ("") to use the default values from the [helm charts](https://github.com/opea-project/GenAIInfra/tree/main/helm-charts).

## Data Preparation

Before running benchmarks, you need to:

1. **Prepare Test Data**

- Download the retrieval file:
```bash
wget https://raw.githubusercontent.com/opea-project/GenAIEval/main/evals/benchmark/data/upload_file.txt
```
- For the `chatqna_qlist_pubmed` test case, prepare `pubmed_${max_lines}.txt` by following this [README](https://github.com/opea-project/GenAIEval/blob/main/evals/benchmark/stresscli/README_Pubmed_qlist.md)

2. **Prepare Model Files (Recommended)**
```bash
pip install -U "huggingface_hub[cli]"
sudo mkdir -p /mnt/models
sudo chmod 777 /mnt/models
huggingface-cli download --cache-dir /mnt/models Intel/neural-chat-7b-v3-3
```

## Overview

The benchmarking process consists of two main components: deployment and benchmarking. We provide `deploy_and_benchmark.py` as a unified entry point that combines both steps.

### Using deploy_and_benchmark.py (Recommended)

The script `deploy_and_benchmark.py` serves as the main entry point. Here's an example using the ChatQnA configuration (you can replace it with any other example's configuration YAML file):

1. For a specific number of nodes:

```bash
python deploy_and_benchmark.py ./ChatQnA/benchmark_chatqna.yaml --target-node 1
```

2. For all node configurations:
```bash
python deploy_and_benchmark.py ./ChatQnA/benchmark_chatqna.yaml
```
This will iterate through the node list in your configuration YAML file, performing deployment and benchmarking for each node count.