
Benchmarking on an instance type with NVIDIA GPU and the Triton inference server

  1. Follow the steps in the Benchmarking on an instance type with NVIDIA GPUs or AWS Chips section to install FMBench, but do not run any benchmarking tests yet.

  2. Once FMBench is installed, install the following additional dependencies for Triton.

     cd ~
     git clone https://github.com/triton-inference-server/tensorrtllm_backend.git --branch v0.12.0
     # Update the submodules
     cd tensorrtllm_backend
     # Install git-lfs if needed
     apt-get update && apt-get install git-lfs -y --no-install-recommends
     git lfs install
     git submodule update --init --recursive

  3. Now you are ready to run benchmarking with Triton. For example, to benchmark the Llama3-8b model on a g5.12xlarge, use the following command:

    fmbench --config-file /tmp/fmbench-read/configs/llama3/8b/config-llama3-8b-g5.12xl-tp-2-mc-max-triton-ec2.yml --local-mode yes --write-bucket placeholder --tmp-dir /tmp > fmbench.log 2>&1
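    If the benchmarking run does not start cleanly, one optional sanity check (paths assumed from step 2 above, not an FMBench requirement) is to confirm that the Triton backend repository was cloned with its submodules and LFS objects in place:

    cd ~/tensorrtllm_backend
    git submodule status | head -5   # submodules should show pinned commit hashes
    git lfs ls-files | head -5       # should list LFS-tracked files if git-lfs was set up correctly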
     
    Benchmark foundation models on AWS

    FMBench is a Python package for running performance benchmarks for any Foundation Model (FM) deployed on any AWS Generative AI service, be it Amazon SageMaker, Amazon Bedrock, Amazon EKS, or Amazon EC2. The FMs can be deployed on these platforms directly through FMBench or, if they are already deployed, they can be benchmarked through the Bring your own endpoint mode supported by FMBench.

    Here are some salient features of FMBench:

    1. Highly flexible: it allows any combination of instance types (g5, p4d, p5, Inf2), inference containers (DeepSpeed, TensorRT, HuggingFace TGI and others), and parameters such as tensor parallelism and rolling batch, as long as those are supported by the underlying platform.

    2. Benchmark any model: it can be used to benchmark open-source models, third-party models, and proprietary models trained by enterprises on their own data.

    3. Run anywhere: it can be run on any AWS platform where we can run Python, such as Amazon EC2, Amazon SageMaker, or even the AWS CloudShell. It is important to run this tool on an AWS platform so that internet round-trip time does not get included in the end-to-end response latency.

    The need for benchmarking

    Customers often wonder which AWS service is best for running FMs for their specific use-case and price performance requirements. While model evaluation metrics are available on several leaderboards (HELM, LMSys), price performance comparisons can be notoriously hard to find and even harder to trust. In such a scenario, we think it is best to run performance benchmarking yourself, either on your own dataset or on a similar (task-wise, prompt-size-wise) open-source dataset such as LongBench or QMSum. This is the problem that FMBench solves.

    FMBench: an open-source Python package for FM benchmarking on AWS

    FMBench runs inference requests against endpoints that are either deployed through FMBench itself (as in the case of SageMaker), available as a fully-managed endpoint (as in the case of Bedrock), or available as a bring-your-own endpoint. Metrics such as inference latency, transactions per minute, error rates, and cost per transaction are captured and presented in the form of a Markdown report containing explanatory text, tables, and figures. The figures and tables in the report provide insights into what might be the best serving stack (instance type, inference container, and configuration parameters) for a given FM for a given use-case.

    The following figure gives an example of the price performance numbers, including inference latency, transactions per minute, and concurrency level, for running the Llama2-13b model on different instance types available on SageMaker using prompts for a Q&A task created from the LongBench dataset; these prompts are between 3000 and 3840 tokens in length. Note that the numbers are hidden in this figure but you would be able to see them when you run FMBench yourself.

    The following table (also included in the report) provides information about the best available instance type for that experiment1.

    Information                    Value
    experiment_name                llama2-13b-inf2.24xlarge
    payload_file                   payload_en_3000-3840.jsonl
    instance_type                  ml.inf2.24xlarge
    concurrency                    **
    error_rate                     **
    prompt_token_count_mean        3394
    prompt_token_throughput        2400
    completion_token_count_mean    31
    completion_token_throughput    15
    latency_mean                   **
    latency_p50                    **
    latency_p95                    **
    latency_p99                    **
    transactions_per_minute        **
    price_per_txn                  **

    1 ** values hidden on purpose, these are available when you run the tool yourself.

    The report also includes latency vs prompt size charts for different concurrency levels. As expected, inference latency increases as prompt size increases, but what is interesting to note is that the increase is much larger at higher concurrency levels (and this behavior varies with instance types).

    Determine the optimal model for your generative AI workload

    Use FMBench to determine model accuracy using a panel of LLM evaluators (PoLL [1]). Here is one of the plots generated by FMBench to help answer the accuracy question for various FMs on Amazon Bedrock (the model ids in the charts have been blurred out on purpose; you can find them in the actual plot generated when running FMBench).

    References

    [1] Pat Verga et al., "Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models", arXiv:2404.18796, 2024.

    Model evaluations using panel of LLM evaluators

    FMBench release 2.0.0 adds support for evaluating candidate models using Majority Voting with a Panel of LLM Evaluators (PoLL). It gathers quantitative metrics such as Cosine Similarity and overall majority voting accuracy metrics to measure the similarity and accuracy of model responses compared to the ground truth.

    Accuracy is defined as the percentage of responses generated by the LLM that match the ground truth included in the dataset (as a separate column). In order to determine if an LLM-generated response matches the ground truth, we ask other LLMs, called the evaluator LLMs, to compare the LLM output and the ground truth and provide a verdict on whether the LLM-generated response is correct or not given the ground truth. Here is the link to the Anthropic Claude 3 Sonnet model prompt being used as an evaluator (or a judge model). A combination of the cosine similarity and the LLM evaluator verdict decides if the LLM-generated response is correct or incorrect. Finally, a single LLM evaluator could be biased or have inaccuracies, so instead of relying on the judgement of a single evaluator, we rely on the majority vote of 3 different LLM evaluators. By default we use the Anthropic Claude 3 Sonnet, Meta Llama3-70b and the Cohere Command R plus model as LLM evaluators. See Pat Verga et al., "Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models", arXiv:2404.18796, 2024, for more details on using a Panel of LLM Evaluators (PoLL).

    Evaluation Flow
    1. Provide a dataset that includes ground truth responses for each sample. FMBench uses the LongBench dataset by default.

    2. Configure the candidate models to be evaluated in the FMBench config file. See this config file for an example that runs evaluations for multiple models available via Amazon Bedrock. Running evaluations only requires the following two changes to the config file:

      • Set the 4_get_evaluations.ipynb: yes, see this line.
      • Set the ground_truth_col_key: answers and question_col_key: input parameters, see this line. The values of ground_truth_col_key and question_col_key are set to the names of the columns in the dataset that contain the ground truth and the question, respectively. (A quick way to verify these settings is shown in the snippet after this list.)
    3. Run FMBench, which will:

    4. Fetch the inference results containing the model responses

    5. Calculate quantitative metrics (Cosine Similarity)

    6. Use a Panel of LLM Evaluators to compare each model response to the ground truth

    7. Each LLM evaluator will provide a binary verdict (correct/incorrect) and an explanation

    8. Validate the LLM evaluations using Cosine Similarity thresholds

    9. Categorize the final evaluation for each response as correctly correct, correctly incorrect, or needs further evaluation

    10. Review the FMBench report to analyze the evaluation results and compare the performance of the candidate models. The report contains tables and charts that provide insights into model accuracy.
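    As a rough illustration of the configuration changes called out in step 2 above, you could copy a packaged config file and confirm the evaluation-related settings before a run. The file path below is an assumption based on the copy_s3_content.sh layout used elsewhere in these docs; the key names are the ones listed in step 2.

      # make a local copy of a packaged config file (path assumed)
      cp /tmp/fmbench-read/configs/bedrock/config-bedrock-llama3.yml my-eval-config.yml

      # the evaluation-related settings should read:
      #   4_get_evaluations.ipynb: yes
      #   ground_truth_col_key: answers
      #   question_col_key: input
      grep -nE "4_get_evaluations.ipynb|ground_truth_col_key|question_col_key" my-eval-config.yml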

    By leveraging ground truth data and a Panel of LLM Evaluators, FMBench provides a comprehensive and efficient way to assess the quality of generative AI models. The majority voting approach, combined with quantitative metrics, enables a robust evaluation that reduces bias and latency while maintaining consistency across responses.

    Advanced

    Beyond running FMBench with the configuration files provided, you may want to try out bringing your own dataset or endpoint to FMBench.

    Generate downstream summarized reports for further analysis

    You can use several results from various FMBench runs to generate a summarized report of all runs based on your cost, latency, and concurrency budgets. This report helps answer the following question:

    "What is the minimum number of instances N, of the most cost-optimal instance type T, that are needed to serve a real-time workload W while keeping the average transaction latency under L seconds?"

    W := {R transactions per minute, average prompt token length P, average generation token length G}
    • With this summarized report, we test the following hypothesis: at the low end of total requests per minute, smaller instances that provide good inference latency at low concurrencies suffice (said another way, the larger, more expensive instances are overkill at this stage); but as requests per minute increase, there comes an inflection point beyond which so many of the smaller instances would be required that it becomes more economical to use fewer instances of the larger, more expensive type.

    An example report that gets generated is as follows:

    Summary for payload: payload_en_x-y
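    To make the arithmetic behind the summary table concrete, the instance count for a given workload comes down to a ceiling division, and the cost is the count times the per-instance price. The sketch below uses made-up numbers; the workload rate, per-instance throughput, and hourly price are all assumptions, not FMBench outputs.

      workload_rpm=1000      # R: requests per minute the workload must sustain (assumed)
      per_instance_tpm=220   # transactions per minute one instance sustained in a run (assumed)
      hourly_price=10        # price per instance-hour in dollars (assumed)

      # ceil(workload_rpm / per_instance_tpm) instances are needed to serve the workload
      instances=$(( (workload_rpm + per_instance_tpm - 1) / per_instance_tpm ))
      echo "instances needed: ${instances}, hourly cost: $(( instances * hourly_price )) dollars"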
    • The metrics below in the table are examples and do not represent any specific model or instance type. This table can be used to analyze the cost and instance-maintenance tradeoffs for your use case. For example, instance_type_1 costs 10 dollars and requires 1 instance to host model_1 until it can handle 100 requests per minute. As the requests scale to 1,000 requests per minute, 5 instances are required and cost 50 dollars. As the requests scale to 10,000 requests per minute, the number of instances to maintain scales to 30, and the cost becomes 450 dollars.

    • On the other hand, instance_type_2 is more costly, with a price of $499 for 10,000 requests per minute to host the same model, but only requires 22 instances to maintain, which is 8 fewer than when the model is hosted on instance_type_1.

    • Based on these summaries, users can make decisions based on their use case priorities. For a real-time and latency-sensitive application, a user might select instance_type_2 to host model_1 since the user would have to maintain 8 fewer instances than when hosting the model on instance_type_1. Hosting the model on instance_type_2 would also maintain the p_95 latency (0.5s), which is half that of instance_type_1 (p_95 latency: 1s), even though it costs more than instance_type_1. On the other hand, if the application is cost sensitive, and the user is flexible enough to maintain more instances at a higher latency, they might want to shift gears to using instance_type_1.

    • Note: Based on varying needs for prompt size, cost, and latency, the table might change.

    experiment_name  instance_type    concurrency  latency_p95  transactions_per_minute  instance_count_and_cost_1_rpm  instance_count_and_cost_10_rpm  instance_count_and_cost_100_rpm  instance_count_and_cost_1000_rpm  instance_count_and_cost_10000_rpm
    model_1          instance_type_1  1            1.0          _                        (1, 10)                        (1, 10)                         (1, 10)                          (5, 50)                           (30, 450)
    model_1          instance_type_2  1            0.5          _                        (1, 10)                        (1, 20)                         (1, 20)                          (6, 47)                           (22, 499)

    FMBench Heatmap

    This step also generates a heatmap that contains information about each instance and how much it costs, with a per requests-per-minute (rpm) breakdown. The default breakdown is [1 rpm, 10 rpm, 100 rpm, 1000 rpm, 10000 rpm]. View an example of a heatmap below. The model name and instance type are masked but can be generated for your specific use case/requirements.

    Steps to run analytics
    1. Clone the FMBench repo from GitHub.

    2. Place all of the result-{model-id}-... folders that are generated from various runs in the top level directory.

    3. Run the following command to generate downstream analytics and summarized tables. Replace x, y, z and model_id with the latency, concurrency thresholds, payload file of interest (for example payload_en_1000-2000.jsonl) and the model_id respectively. The model_id would have to be appended to the results-{model-id} folders so the analytics.py file can generate a report for all of those respective result folders.

      python analytics/analytics.py --latency-threshold x --concurrency-threshold y  --payload-file z --model-id model_id\n
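      For instance, a run with a 2-second latency threshold, a concurrency threshold of 10, the payload file mentioned above, and a model id matching the suffix on your results-{model-id} folders might look like the following; all of the values here are illustrative placeholders, not recommendations.

      python analytics/analytics.py \
        --latency-threshold 2 \
        --concurrency-threshold 10 \
        --payload-file payload_en_1000-2000.jsonl \
        --model-id llama3-8b-instruct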
    "},{"location":"announcement.html","title":"Release 2.0 announcement","text":"

    We are excited to share news about a major FMBench release: FMBench 2.0 supports model evaluations through a panel of LLM evaluators 🎉. With the recent feature additions to FMBench we are already seeing increased interest from customers and hope to reach even more customers and have an even greater impact. Check out all the latest and greatest features from FMBench on the FMBench website.

    Support for Model Evaluations: FMBench now adds support for evaluating candidate models using Majority Voting with a Panel of LLM Evaluators. Customers can now use FMBench to evaluate model accuracy across open-source and custom datasets; thus FMBench now enables customers to measure not only performance (inference latency, cost, throughput) but also model accuracy.

    Native support for LLM compilation and deployment on AWS Silicon: FMBench now supports end-to-end compilation and model deployment on AWS Silicon. Customers no longer have to wait for models to be available for AWS Chips via SageMaker JumpStart, nor do they have to go through the process of compiling the model to Neuron themselves; FMBench does it all for them. Simply put the relevant configuration options in the FMBench config file and it will compile and deploy the model on SageMaker (config) or EC2 (config).

    Website for better user experience: FMBench has a website now along with an introduction video. The website is fully searchable to ease common tasks such as installation, finding the right config file, benchmarking on various hosting platforms (EC2, EKS, Bedrock, Neuron, Docker), model evaluation, etc. This website was created based on feedback from several internal teams and external customers.

    Native support for all AWS generative AI services: FMBench now benchmarks and evaluates any Foundation Model (FM) deployed on any AWS Generative AI service, be it Amazon SageMaker, Amazon Bedrock, Amazon EKS, or Amazon EC2. We initially built FMBench for SageMaker, later extended it to Bedrock, and then, based on customer requests, extended it to support models on EKS and EC2 as well. See the list of config files supported out of the box; you can use these config files either as-is or as templates for creating your own custom config.

    Benchmark models deployed on different AWS Generative AI services

    FMBench comes packaged with configuration files for benchmarking models on different AWS Generative AI services.

    Full list of benchmarked models

    Model EC2 g5 EC2 p4 EC2 p5 EC2 Inf2/Trn1 SageMaker g4dn/g5/p3 SageMaker Inf2/Trn1 SageMaker P4 SageMaker P5 Bedrock On-demand throughput Bedrock provisioned throughput Anthropic Claude-3 Sonnet ✅ ✅ Anthropic Claude-3 Haiku ✅ Mistral-7b-instruct ✅ ✅ ✅ ✅ ✅ Mistral-7b-AWQ ✅ Mixtral-8x7b-instruct ✅ Llama3.1-8b instruct ✅ ✅ ✅ ✅ ✅ ✅ ✅ Llama3.1-70b instruct ✅ ✅ ✅ Llama3-8b instruct ✅ ✅ ✅ ✅ ✅ ✅ ✅ Llama3-70b instruct ✅ ✅ ✅ ✅ ✅ Llama2-13b chat ✅ ✅ ✅ ✅ Llama2-70b chat ✅ ✅ ✅ ✅ Amazon Titan text lite ✅ Amazon Titan text express ✅ Cohere Command text ✅ Cohere Command light text ✅ AI21 J2 Mid ✅ AI21 J2 Ultra ✅ Gemma-2b ✅ Phi-3-mini-4k-instruct ✅ distilbert-base-uncased ✅

    Benchmark models on Bedrock

    Choose any config file from the bedrock folder and either run these directly or use them as templates for creating new config files specific to your use-case. Here is an example for benchmarking the Llama3 models on Bedrock.

    fmbench --config-file https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/src/fmbench/configs/bedrock/config-bedrock-llama3.yml > fmbench.log 2>&1

    Benchmark models on EC2

    You can use FMBench to benchmark models hosted on EC2. This can be done in one of two ways:

    • Deploy the model on your EC2 instance independently of FMBench and then benchmark it through the Bring your own endpoint mode.
    • Deploy the model on your EC2 instance through FMBench and then benchmark it.

    The steps for deploying the model on your EC2 instance are described below.

    👉 In this configuration both the model being benchmarked and FMBench are deployed on the same EC2 instance.

    Create a new EC2 instance suitable for hosting an LMI as per the steps described here. Note that you will need to select the correct AMI based on your instance type, this is called out in the instructions.

    The steps for benchmarking on different types of EC2 instances (GPU/CPU/Neuron) and different inference containers differ slightly. These are all described below.

    Benchmarking options on EC2
    • Benchmarking on an instance type with NVIDIA GPUs or AWS Chips
    • Benchmarking on an instance type with NVIDIA GPU and the Triton inference server
    • Benchmarking on an instance type with AWS Chips and the Triton inference server
    • Benchmarking on a CPU instance type with AMD processors
    • Benchmarking on a CPU instance type with Intel processors

    • Benchmarking the Triton inference server

    Benchmarking on an instance type with NVIDIA GPUs or AWS Chips
    1. Connect to your instance using any of the options in EC2 (SSH/EC2 Connect), run the following in the EC2 terminal. This command installs Anaconda on the instance which is then used to create a new conda environment for FMBench.

      wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
      bash Miniconda3-latest-Linux-x86_64.sh -b  # Run the Miniconda installer in batch mode (no manual intervention)
      rm -f Miniconda3-latest-Linux-x86_64.sh    # Remove the installer script after installation
      eval "$(/home/$USER/miniconda3/bin/conda shell.bash hook)" # Initialize conda for bash shell
      conda init  # Initialize conda, adding it to the shell
    2. Install docker-compose.

      sudo apt-get update
      sudo apt-get install --reinstall docker.io -y
      sudo apt-get install -y docker-compose
      docker compose version
    3. Setup the fmbench_python311 conda environment.

      conda create --name fmbench_python311 -y python=3.11 ipykernel
      source activate fmbench_python311
      pip install -U fmbench
    4. Create local directory structure needed for FMBench and copy all publicly available dependencies from the AWS S3 bucket for FMBench. This is done by running the copy_s3_content.sh script available as part of the FMBench repo. Replace /tmp in the command below with a different path if you want to store the config files and the FMBench generated data in a different directory.

      curl -s https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/copy_s3_content.sh | sh -s -- /tmp\n
    5. To download the model files from HuggingFace, create a hf_token.txt file in the /tmp/fmbench-read/scripts/ directory containing the Hugging Face token you would like to use. In the command below replace the hf_yourtokenstring with your Hugging Face token.

      echo hf_yourtokenstring > /tmp/fmbench-read/scripts/hf_token.txt\n
    6. Run FMBench with a packaged or a custom config file. This step will also deploy the model on the EC2 instance. The --write-bucket parameter value is just a placeholder and an actual S3 bucket is not required. Skip to the next step if benchmarking for AWS Chips. You could set the --tmp-dir flag to an EFA path instead of /tmp if using a shared path for storing config files and reports.

      fmbench --config-file /tmp/fmbench-read/configs/llama3/8b/config-ec2-llama3-8b.yml --local-mode yes --write-bucket placeholder --tmp-dir /tmp > fmbench.log 2>&1\n
    7. For example, to run FMBench on a llama3-8b-Instruct model on an inf2.48xlarge instance, run the command below. The config file for this example can be viewed here.

      fmbench --config-file /tmp/fmbench-read/configs/llama3/8b/config-ec2-llama3-8b-inf2-48xl.yml --local-mode yes --write-bucket placeholder --tmp-dir /tmp > fmbench.log 2>&1\n
    8. Open a new Terminal and do a tail on fmbench.log to see a live log of the run.

      tail -f fmbench.log\n
    9. All metrics are stored in the /tmp/fmbench-write directory created automatically by the fmbench package. Once the run completes all files are copied locally in a results-* folder as usual.
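    Once the run above completes, a quick way to confirm that the metrics and report were produced is to list the output locations; the folder name follows the results-* pattern mentioned in step 9, while the exact file names inside vary by config file.

      ls /tmp/fmbench-write | head   # intermediate metrics written during the run
      ls -d results-*                # final output folder(s) copied locally
      ls results-*/ | head           # report and supporting files for the run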

    Benchmarking on an instance type with NVIDIA GPU and the Triton inference server
    1. No special procedure is needed; just follow the steps in the Benchmarking on an instance type with NVIDIA GPUs or AWS Chips section and then run FMBench with a config file for Triton. For example, to benchmark the Llama3-8b model on a g5.12xlarge, use the following command (after completing the steps for setting up FMBench).

      fmbench --config-file /tmp/fmbench-read/configs/llama3/8b/config-llama3-8b-g5.12xl-tp-2-mc-max-triton-ec2.yml --local-mode yes --write-bucket placeholder --tmp-dir /tmp > fmbench.log 2>&1

    Benchmarking on an instance type with AWS Chips and the Triton inference server

    As of 2024-09-26 this has been tested on a trn1.32xlarge instance

    1. Connect to your instance using any of the options in EC2 (SSH/EC2 Connect), run the following in the EC2 terminal. This command installs Anaconda on the instance which is then used to create a new conda environment for FMBench. See instructions for downloading anaconda here. (Note: Configure the storage of your EC2 instance to 500GB for this test)

      # Install Docker and Git using the YUM package manager
      sudo yum install docker git -y

      # Start the Docker service
      sudo systemctl start docker

      # Download the Miniconda installer for Linux
      wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
      bash Miniconda3-latest-Linux-x86_64.sh -b  # Run the Miniconda installer in batch mode (no manual intervention)
      rm -f Miniconda3-latest-Linux-x86_64.sh    # Remove the installer script after installation
      eval "$(/home/$USER/miniconda3/bin/conda shell.bash hook)" # Initialize conda for bash shell
      conda init  # Initialize conda, adding it to the shell
    2. Setup the fmbench_python311 conda environment.

      # Create a new conda environment named 'fmbench_python311' with Python 3.11 and ipykernel
      conda create --name fmbench_python311 -y python=3.11 ipykernel

      # Activate the newly created conda environment
      source activate fmbench_python311

      # Upgrade pip and install the fmbench package
      pip install -U fmbench
    3. First we need to build the required docker image for triton, and push it locally. To do this, curl the Triton Dockerfile and the script to build and push the triton image locally:

      # curl the docker file for triton
      curl -o ./Dockerfile_triton https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/src/fmbench/scripts/triton/Dockerfile_triton

      # curl the script that builds and pushes the triton image locally
      curl -o build_and_push_triton.sh https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/src/fmbench/scripts/triton/build_and_push_triton.sh

      # Make the triton build and push script executable, and run it
      chmod +x build_and_push_triton.sh
      ./build_and_push_triton.sh
      - Now wait until the docker image is saved locally and then follow the instructions below to start a benchmarking test.

    4. Create local directory structure needed for FMBench and copy all publicly available dependencies from the AWS S3 bucket for FMBench. This is done by running the copy_s3_content.sh script available as part of the FMBench repo. Replace /tmp in the command below with a different path if you want to store the config files and the FMBench generated data in a different directory.

      curl -s https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/copy_s3_content.sh | sh -s -- /tmp\n
    5. To download the model files from HuggingFace, create a hf_token.txt file in the /tmp/fmbench-read/scripts/ directory containing the Hugging Face token you would like to use. In the command below replace the hf_yourtokenstring with your Hugging Face token.

      echo hf_yourtokenstring > /tmp/fmbench-read/scripts/hf_token.txt\n
    6. Run FMBench with a packaged or a custom config file. This step will also deploy the model on the EC2 instance. The --write-bucket parameter value is just a placeholder and an actual S3 bucket is not required. You could set the --tmp-dir flag to an EFA path instead of /tmp if using a shared path for storing config files and reports.

      fmbench --config-file /tmp/fmbench-read/configs/llama3/8b/config-llama3-8b-trn1-32xlarge-triton-vllm.yml --local-mode yes --write-bucket placeholder --tmp-dir /tmp > fmbench.log 2>&1\n
    7. Open a new Terminal and do a tail on fmbench.log to see a live log of the run.

      tail -f fmbench.log\n
    8. All metrics are stored in the /tmp/fmbench-write directory created automatically by the fmbench package. Once the run completes all files are copied locally in a results-* folder as usual.
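    If the run in step 6 cannot find a local Triton image, it is worth confirming that build_and_push_triton.sh from step 3 completed. The exact repository and tag name are set inside that script, so the filter below is only a guess:

      docker images | grep -i triton || echo "no local image matching 'triton' found yet"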

    Benchmarking on a CPU instance type with AMD processors

    As of 2024-08-27 this has been tested on a m7a.16xlarge instance

    1. Connect to your instance using any of the options in EC2 (SSH/EC2 Connect), run the following in the EC2 terminal. This command installs Anaconda on the instance which is then used to create a new conda environment for FMBench. See instructions for downloading anaconda here

      # Install Docker and Git using the YUM package manager
      sudo yum install docker git -y

      # Start the Docker service
      sudo systemctl start docker

      # Download the Miniconda installer for Linux
      wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
      bash Miniconda3-latest-Linux-x86_64.sh -b  # Run the Miniconda installer in batch mode (no manual intervention)
      rm -f Miniconda3-latest-Linux-x86_64.sh    # Remove the installer script after installation
      eval "$(/home/$USER/miniconda3/bin/conda shell.bash hook)" # Initialize conda for bash shell
      conda init  # Initialize conda, adding it to the shell
    2. Setup the fmbench_python311 conda environment.

      # Create a new conda environment named 'fmbench_python311' with Python 3.11 and ipykernel
      conda create --name fmbench_python311 -y python=3.11 ipykernel

      # Activate the newly created conda environment
      source activate fmbench_python311

      # Upgrade pip and install the fmbench package
      pip install -U fmbench
    3. Build the vllm container for serving the model.

      1. 👉 The vllm container we are building locally is going to be referenced in the FMBench config file.

      2. The container being built is for CPU only (GPU support might be added in the future).

        # Clone the vLLM project repository from GitHub
        git clone https://github.com/vllm-project/vllm.git

        # Change the directory to the cloned vLLM project
        cd vllm

        # Build a Docker image using the provided Dockerfile for CPU, with a shared memory size of 4GB
        sudo docker build -f Dockerfile.cpu -t vllm-cpu-env --shm-size=4g .
    4. Create local directory structure needed for FMBench and copy all publicly available dependencies from the AWS S3 bucket for FMBench. This is done by running the copy_s3_content.sh script available as part of the FMBench repo. Replace /tmp in the command below with a different path if you want to store the config files and the FMBench generated data in a different directory.

      curl -s https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/copy_s3_content.sh | sh -s -- /tmp\n
    5. To download the model files from HuggingFace, create a hf_token.txt file in the /tmp/fmbench-read/scripts/ directory containing the Hugging Face token you would like to use. In the command below replace the hf_yourtokenstring with your Hugging Face token.

      echo hf_yourtokenstring > /tmp/fmbench-read/scripts/hf_token.txt\n
    6. Before running FMBench, add the current user to the docker group. Run the following commands to run Docker without needing to use sudo each time.

      sudo usermod -a -G docker $USER
      newgrp docker
    7. Install docker-compose.

      DOCKER_CONFIG=${DOCKER_CONFIG:-$HOME/.docker}
      mkdir -p $DOCKER_CONFIG/cli-plugins
      sudo curl -L https://github.com/docker/compose/releases/latest/download/docker-compose-$(uname -s)-$(uname -m) -o $DOCKER_CONFIG/cli-plugins/docker-compose
      sudo chmod +x $DOCKER_CONFIG/cli-plugins/docker-compose
      docker compose version
    8. Run FMBench with a packaged or a custom config file. This step will also deploy the model on the EC2 instance. The --write-bucket parameter value is just a placeholder and an actual S3 bucket is not required. You could set the --tmp-dir flag to an EFA path instead of /tmp if using a shared path for storing config files and reports.

      fmbench --config-file /tmp/fmbench-read/configs/llama3/8b/config-ec2-llama3-8b-m7a-16xlarge.yml --local-mode yes --write-bucket placeholder --tmp-dir /tmp > fmbench.log 2>&1\n
    9. Open a new Terminal and do a tail on fmbench.log to see a live log of the run.

      tail -f fmbench.log\n
    10. All metrics are stored in the /tmp/fmbench-write directory created automatically by the fmbench package. Once the run completes all files are copied locally in a results-* folder as usual.

    Benchmarking on a CPU instance type with Intel processors

    As of 2024-08-27 this has been tested on c5.18xlarge and m5.16xlarge instances

    1. Connect to your instance using any of the options in EC2 (SSH/EC2 Connect), run the following in the EC2 terminal. This command installs Anaconda on the instance which is then used to create a new conda environment for FMBench. See instructions for downloading anaconda here

      # Install Docker and Git using the YUM package manager
      sudo yum install docker git -y

      # Start the Docker service
      sudo systemctl start docker

      # Download the Miniconda installer for Linux
      wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
      bash Miniconda3-latest-Linux-x86_64.sh -b # Run the Miniconda installer in batch mode (no manual intervention)
      rm -f Miniconda3-latest-Linux-x86_64.sh    # Remove the installer script after installation
      eval "$(/home/$USER/miniconda3/bin/conda shell.bash hook)" # Initialize conda for bash shell
      conda init  # Initialize conda, adding it to the shell
    2. Setup the fmbench_python311 conda environment.

      # Create a new conda environment named 'fmbench_python311' with Python 3.11 and ipykernel
      conda create --name fmbench_python311 -y python=3.11 ipykernel

      # Activate the newly created conda environment
      source activate fmbench_python311

      # Upgrade pip and install the fmbench package
      pip install -U fmbench
    3. Build the vllm container for serving the model.

      1. 👉 The vllm container we are building locally is going to be referenced in the FMBench config file.

      2. The container being built is for CPU only (GPU support might be added in the future).

        # Clone the vLLM project repository from GitHub
        git clone https://github.com/vllm-project/vllm.git

        # Change the directory to the cloned vLLM project
        cd vllm

        # Build a Docker image using the provided Dockerfile for CPU, with a shared memory size of 12GB
        sudo docker build -f Dockerfile.cpu -t vllm-cpu-env --shm-size=12g .
    4. Create local directory structure needed for FMBench and copy all publicly available dependencies from the AWS S3 bucket for FMBench. This is done by running the copy_s3_content.sh script available as part of the FMBench repo. Replace /tmp in the command below with a different path if you want to store the config files and the FMBench generated data in a different directory.

      curl -s https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/copy_s3_content.sh | sh -s -- /tmp\n
    5. To download the model files from HuggingFace, create a hf_token.txt file in the /tmp/fmbench-read/scripts/ directory containing the Hugging Face token you would like to use. In the command below replace the hf_yourtokenstring with your Hugging Face token.

      echo hf_yourtokenstring > /tmp/fmbench-read/scripts/hf_token.txt\n
    6. Before running FMBench, add the current user to the docker group. Run the following commands to run Docker without needing to use sudo each time.

      sudo usermod -a -G docker $USER
      newgrp docker
    7. Install docker-compose.

      DOCKER_CONFIG=${DOCKER_CONFIG:-$HOME/.docker}
      mkdir -p $DOCKER_CONFIG/cli-plugins
      sudo curl -L https://github.com/docker/compose/releases/latest/download/docker-compose-$(uname -s)-$(uname -m) -o $DOCKER_CONFIG/cli-plugins/docker-compose
      sudo chmod +x $DOCKER_CONFIG/cli-plugins/docker-compose
      docker compose version
    8. Run FMBench with a packaged or a custom config file. This step will also deploy the model on the EC2 instance. The --write-bucket parameter value is just a placeholder and an actual S3 bucket is not required. You could set the --tmp-dir flag to an EFA path instead of /tmp if using a shared path for storing config files and reports.

      fmbench --config-file /tmp/fmbench-read/configs/llama3/8b/config-ec2-llama3-8b-c5-18xlarge.yml --local-mode yes --write-bucket placeholder --tmp-dir /tmp > fmbench.log 2>&1\n
    9. Open a new Terminal and do a tail on fmbench.log to see a live log of the run.

      tail -f fmbench.log\n
    10. All metrics are stored in the /tmp/fmbench-write directory created automatically by the fmbench package. Once the run completes all files are copied locally in a results-* folder as usual.

    Benchmark models on EKS

    You can use FMBench to benchmark models hosted on EKS. This can be done in one of two ways:

    • Deploy the model on your EKS cluster independently of FMBench and then benchmark it through the Bring your own endpoint mode.
    • Deploy the model on your EKS cluster through FMBench and then benchmark it.

    The steps for deploying the model on your EKS cluster are described below.

    👉 EKS cluster creation itself is not a part of the FMBench functionality; the cluster needs to exist before you run the following steps. Steps for cluster creation are provided in this file, but it is best to consult the DoEKS repo on GitHub for comprehensive instructions.

    1. Add the following IAM policies to your existing FMBench Role:

      1. AmazonEKSClusterPolicy: This policy provides Kubernetes the permissions it requires to manage resources on your behalf.

      2. AmazonEKS_CNI_Policy: This policy provides the Amazon VPC CNI Plugin (amazon-vpc-cni-k8s) the permissions it requires to modify the IP address configuration on your EKS worker nodes. This permission set allows the CNI to list, describe, and modify Elastic Network Interfaces on your behalf.

      3. AmazonEKSWorkerNodePolicy: This policy allows Amazon EKS worker nodes to connect to Amazon EKS Clusters.
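      For example, these managed policies could be attached from the AWS CLI; the role name below is a placeholder for your actual FMBench role:

        aws iam attach-role-policy --role-name <your-fmbench-role> --policy-arn arn:aws:iam::aws:policy/AmazonEKSClusterPolicy
        aws iam attach-role-policy --role-name <your-fmbench-role> --policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
        aws iam attach-role-policy --role-name <your-fmbench-role> --policy-arn arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy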

    2. Once the EKS cluster is available you can use either of the following two config files, or create your own using these files as examples, to run benchmarking for these models. These config files require that the EKS cluster has been created as per the steps in these instructions.

      1. config-llama3-8b-eks-inf2.yml: Deploy Llama3 on Trn1/Inf2 instances.

      2. config-mistral-7b-eks-inf2.yml: Deploy Mistral 7b on Trn1/Inf2 instances.

      For more information about the blueprints used by FMBench to deploy these models, view: DoEKS docs gen-ai.

    3. Run the Llama3-8b benchmarking using the command below (replace the config file as needed for a different model). This will first deploy the model on your EKS cluster and then run benchmarking on the deployed model.

      fmbench --config-file https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/src/fmbench/configs/llama3/8b/config-llama3-8b-eks-inf2.yml > fmbench.log 2>&1
    4. As the model is getting deployed you might want to run the following kubectl commands to monitor the deployment progress. Set the model_namespace to llama3 or mistral or a different model as appropriate.

      1. kubectl get pods -n <model_namespace> -w: Watch the pods in the model specific namespace.
      2. kubectl -n karpenter get pods: Get the pods in the karpenter namespace.
      3. kubectl describe pod -n <model_namespace> <pod-name>: Describe a specific pod in the model-specific namespace to view the live logs.
    "},{"location":"benchmarking_on_sagemaker.html","title":"Benchmark models on SageMaker","text":"

    Choose any config file from the model specific folders, for example the Llama3 folder for Llama3 family of models. These configuration files also include instructions for FMBench to first deploy the model on SageMaker using your configured instance type and inference parameters of choice and then run the benchmarking. Here is an example for benchmarking Llama3-8b model on an ml.inf2.24xlarge and ml.g5.12xlarge instance.

    fmbench --config-file https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/src/fmbench/configs/llama3/8b/config-llama3-8b-inf2-g5.yml > fmbench.log 2>&1
    "},{"location":"build.html","title":"Building the FMBench Python package","text":"

    If you would like to build a dev version of FMBench for your own development and testing purposes, the following steps describe how to do that.

    1. Clone the FMBench repo from GitHub.

    2. Make any code changes as needed.

    3. Install poetry.

      pip install poetry mkdocs-material mknotebooks
    4. Change directory to the FMBench repo directory and run poetry build.

      poetry build
    5. The .whl file is generated in the dist folder. Install the .whl in your current Python environment.

      pip install dist/fmbench-X.Y.Z-py3-none-any.whl
    6. Run FMBench as usual through the FMBench CLI command.

    7. You may have added new config files as part of your work; to make sure these files are called out in manifest.txt, run the following command. This command will overwrite the existing manifest.txt and manifest.md files, and both files need to be committed to the repo. Reach out to the maintainers of this repo so that they can add new or modified config files to the blogs bucket (the CloudFormation stack would fail if a new file is added to the manifest but is not available for download through the S3 bucket).

      python create_manifest.py
    8. To publish updated documentation run the following command. You need to be added as a contributor to the FMBench repo to be able to publish to the website, so this command will not work for you otherwise.

      mkdocs gh-deploy
    "},{"location":"byo_dataset.html","title":"Bring your own dataset","text":"

    By default FMBench uses the LongBench dataset for testing the models, but this is not the only dataset you can test with. You may want to test with other datasets available on HuggingFace or use your own datasets for testing. You can do this by converting your dataset to the JSON lines format. We provide a code sample for converting any HuggingFace dataset into JSON lines format and uploading it to the S3 bucket used by FMBench in the bring_your_own_dataset notebook. Follow the steps described in the notebook to bring your own dataset for testing with FMBench.
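    As a minimal sketch of what the notebook does (the dataset name and output file below are only examples, and depending on your datasets version loading LongBench may require trust_remote_code=True):

      # convert a HuggingFace dataset split to JSON lines for use with FMBench
      from datasets import load_dataset

      ds = load_dataset("THUDM/LongBench", "2wikimqa", split="test")
      ds.to_json("2wikimqa.jsonl", lines=True)  # writes one JSON object per line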

    "},{"location":"byo_dataset.html#support-for-open-orca-dataset","title":"Support for Open-Orca dataset","text":"

    For support for the Open-Orca dataset and the corresponding prompts for Llama3, Llama2 and Mistral, see:

    1. bring_your_own_dataset.ipynb
    2. prompt templates
    3. Llama3 config file with OpenOrca
    "},{"location":"byo_rest_predictor.html","title":"Bring your own REST Predictor (data-on-eks version)","text":"

    FMBench now provides an example of bringing your own endpoint as a REST Predictor for benchmarking. View this script as an example. This script is an inference file for the NousResearch/Llama-2-13b-chat-hf model deployed on an Amazon EKS cluster using Ray Serve. The model is deployed via data-on-eks which is a comprehensive resource for scaling your data and machine learning workloads on Amazon EKS and unlocking the power of Gen AI. Using data-on-eks, you can harness the capabilities of AWS Trainium, AWS Inferentia and NVIDIA GPUs to scale and optimize your Gen AI workloads and benchmark those models on FMBench with ease.

    "},{"location":"byoe.html","title":"Bring your own endpoint (a.k.a. support for external endpoints)","text":"

    If you have an endpoint deployed on say Amazon EKS or Amazon EC2 or have your models hosted on a fully-managed service such as Amazon Bedrock, you can still bring your endpoint to FMBench and run tests against your endpoint. To do this you need to do the following:

    1. Create a derived class from the FMBenchPredictor abstract class and provide implementations for the constructor, the get_predictions method and the endpoint_name property. See SageMakerPredictor for an example. Save this file locally as, say, my_custom_predictor.py (a minimal sketch is shown after this list).

    2. Upload your new Python file (my_custom_predictor.py) for your custom FMBench predictor to your FMBench read bucket and the scripts prefix specified in the s3_read_data section (read_bucket and scripts_prefix).

    3. Edit the configuration file you are using for your FMBench for the following:

      • Skip the deployment step by setting the 2_deploy_model.ipynb step under run_steps to no.
      • Set the inference_script under any experiment in the experiments section for which you want to use your new custom inference script to point to your new Python file (my_custom_predictor.py) that contains your custom predictor.
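    Referring back to step 1 above, here is a minimal, hedged sketch of a custom predictor. Only the FMBenchPredictor base class, the get_predictions method and the endpoint_name property come from the description above; the import path, constructor signature, payload fields and return format are assumptions that should be verified against the SageMakerPredictor example.

      # my_custom_predictor.py -- illustrative sketch only
      import json
      import requests  # assumes a plain REST endpoint, e.g. a model served on EC2/EKS

      from fmbench.scripts.fmbench_predictor import FMBenchPredictor  # assumed module path


      class MyCustomPredictor(FMBenchPredictor):
          def __init__(self, endpoint_name: str, inference_spec: dict, metadata: dict = None):
              # store whatever your endpoint needs; this parameter list is an assumption
              self._endpoint_name = endpoint_name
              self._inference_spec = inference_spec or {}

          def get_predictions(self, payload: dict) -> dict:
              # send the prompt to the endpoint and return the response;
              # the payload and response shapes are assumptions
              response = requests.post(self._endpoint_name,
                                       json={"inputs": payload.get("inputs")},
                                       timeout=300)
              response.raise_for_status()
              return json.loads(response.text)

          @property
          def endpoint_name(self) -> str:
              return self._endpoint_name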
    "},{"location":"ec2.html","title":"Run FMBench on Amazon EC2","text":"

    For some enterprise scenarios it might be desirable to run FMBench directly on an EC2 instance with no dependency on S3. Here are the steps to do this:

    1. Have a t3.xlarge (or larger) instance in the Running state. Make sure that the instance has at least 50GB of disk space, that the IAM role associated with your EC2 instance has the AmazonSageMakerFullAccess policy attached to it, and that sagemaker.amazonaws.com is added to its Trust relationships.

      {
          "Effect": "Allow",
          "Principal": {
              "Service": "sagemaker.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
      }

    2. Setup the fmbench_python311 conda environment. This step requires conda to be installed on the EC2 instance; see instructions for downloading Anaconda.

      conda create --name fmbench_python311 -y python=3.11 ipykernel
      source activate fmbench_python311;
      pip install -U fmbench
    3. Create local directory structure needed for FMBench and copy all publicly available dependencies from the AWS S3 bucket for FMBench. This is done by running the copy_s3_content.sh script available as part of the FMBench repo.

      curl -s https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/copy_s3_content.sh | sh
    4. Run FMBench with a quickstart config file.

      fmbench --config-file /tmp/fmbench-read/configs/llama2/7b/config-llama2-7b-g5-quick.yml --local-mode yes > fmbench.log 2>&1
    5. Open a new Terminal and navigate to the foundation-model-benchmarking-tool directory and do a tail on fmbench.log to see a live log of the run.

      tail -f fmbench.log
    6. All metrics are stored in the /tmp/fmbench-write directory created automatically by the fmbench package. Once the run completes all files are copied locally in a results-* folder as usual.

    "},{"location":"features.html","title":"FMBench features","text":"

    Support for Model Evaluations: FMBench now adds support for evaluating candidate models using Majority Voting with a Panel of LLM Evaluators. Customers can now use FMBench to evaluate model accuracy across open-source and custom datasets, thus FMBench now enables customers to not only measure performance (inference latency, cost, throughput) but also model accuracy.

    Native support for LLM compilation and deployment on AWS Silicon: FMBench now supports end-to-end compilation and model deployment on AWS Silicon. Customers no longer have to wait for models to be available for AWS Chips via SageMaker JumpStart, and neither do they have to go through the process of compiling the model to Neuron themselves; FMBench does it all for them. They can simply put the relevant configuration options in the FMBench config file and it will compile and deploy the model on SageMaker (config) or EC2 (config).

    Website for better user experience: FMBench has a website now along with an introduction video. The website is fully searchable to ease common tasks such as installation, finding the right config file, benchmarking on various hosting platforms (EC2, EKS, Bedrock, Neuron, Docker), model evaluation, etc. This website was created based on feedback from several internal teams and external customers.

    Native support for all AWS generative AI services: FMBench now benchmarks and evaluates any Foundation Model (FM) deployed on any AWS Generative AI service, be it Amazon SageMaker, Amazon Bedrock, Amazon EKS, or Amazon EC2. We initially built FMBench for SageMaker, and later extended it to Bedrock and then based on customer requests extended it to support models on EKS and EC2 as well. See list of config files supported out of the box, you can use these config files either as is or as templates for creating your own custom config.

    "},{"location":"gettingstarted.html","title":"Getting started with FMBench","text":"

    FMBench is available as a Python package on PyPI and is run as a command line tool once it is installed. All data, including metrics, reports and results, is stored in an Amazon S3 bucket.

    While technically you can run FMBench on any AWS compute, practically speaking we either run it on a SageMaker Notebook or on EC2. Both of these options are described below.

    Intro Video

    "},{"location":"gettingstarted.html#fmbench-in-a-client-server-configuration-on-amazon-ec2","title":"FMBench in a client-server configuration on Amazon EC2","text":"

    Oftentimes a platform team would like to have a set of LLM endpoints deployed in an account and available permanently for data science or application teams to benchmark performance and accuracy for their specific use-cases. They can take advantage of a special client-server configuration for FMBench, where FMBench is used to deploy models on EC2 instances in one AWS account (called the server account) and tests are run against these endpoints from FMBench deployed on EC2 instances in another AWS account (called the client account).

    This has the advantage that every team that wants to benchmark a set of LLMs does not first have to deploy the LLMs themselves; a platform team can do that for them and keep these LLMs available for a longer duration while the other teams run their benchmarks, for example against their specific datasets and for their specific cost and performance criteria. Using FMBench in this way makes the process simpler for both sides: the platform team can use FMBench to easily deploy the models with full control over the configuration of the serving stack without having to write any LLM deployment code for EC2, and the data science or application teams can test with different datasets, performance criteria and inference parameters. As long as the security groups have an inbound rule that allows access to the model endpoint (typically TCP port 8080), an FMBench installation in the client AWS account should be able to access an endpoint in the server AWS account.
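    For example, assuming the server account's endpoint instances share a security group, an inbound rule like the following (the security group ID and client CIDR are placeholders) would let the client account's FMBench installation reach the endpoints:

      aws ec2 authorize-security-group-ingress \
        --group-id sg-0123456789abcdef0 \
        --protocol tcp \
        --port 8080 \
        --cidr 10.0.0.0/16   # CIDR range used by the client account's FMBench instances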

    "},{"location":"manifest.html","title":"Files","text":"

    Here is a listing of the various configuration files available out-of-the-box with FMBench. Click on any link to view a file. You can use these files as-is or use them as templates to create a custom configuration file for your use-case of interest.

    bedrock \u251c\u2500\u2500 bedrock/config-bedrock-all-anthropic-models-longbench-data.yml \u251c\u2500\u2500 bedrock/config-bedrock-anthropic-models-OpenOrca.yml \u251c\u2500\u2500 bedrock/config-bedrock-claude.yml \u251c\u2500\u2500 bedrock/config-bedrock-evals-only-conc-1.yml \u251c\u2500\u2500 bedrock/config-bedrock-haiku-sonnet-majority-voting.yml \u251c\u2500\u2500 bedrock/config-bedrock-llama3-1-70b-streaming.yml \u251c\u2500\u2500 bedrock/config-bedrock-llama3-1-8b-streaming.yml \u251c\u2500\u2500 bedrock/config-bedrock-llama3-1-no-streaming.yml \u251c\u2500\u2500 bedrock/config-bedrock-llama3-streaming.yml \u251c\u2500\u2500 bedrock/config-bedrock-models-OpenOrca.yml \u251c\u2500\u2500 bedrock/config-bedrock-titan-text-express.yml \u2514\u2500\u2500 bedrock/config-bedrock.yml bert \u2514\u2500\u2500 bert/config-distilbert-base-uncased.yml byoe \u2514\u2500\u2500 byoe/config-model-byo-sagemaker-endpoint.yml eks_manifests \u251c\u2500\u2500 eks_manifests/llama3-ray-service.yaml \u2514\u2500\u2500 eks_manifests/mistral-ray-service.yaml gemma \u2514\u2500\u2500 gemma/config-gemma-2b-g5.yml llama2 \u251c\u2500\u2500 llama2/13b \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama2/13b/config-bedrock-sagemaker-llama2.yml \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama2/13b/config-byo-rest-ep-llama2-13b.yml \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama2/13b/config-llama2-13b-inf2-g5-p4d.yml \u2502\u00a0\u00a0 \u2514\u2500\u2500 llama2/13b/config-llama2-13b-inf2-g5.yml \u251c\u2500\u2500 llama2/70b \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama2/70b/config-ec2-llama2-70b.yml \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama2/70b/config-llama2-70b-g5-p4d-tgi.yml \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama2/70b/config-llama2-70b-g5-p4d-trt.yml \u2502\u00a0\u00a0 \u2514\u2500\u2500 llama2/70b/config-llama2-70b-inf2-g5.yml \u2514\u2500\u2500 llama2/7b \u251c\u2500\u2500 llama2/7b/config-llama2-7b-byo-sagemaker-endpoint.yml \u251c\u2500\u2500 llama2/7b/config-llama2-7b-g4dn-g5-trt.yml \u251c\u2500\u2500 llama2/7b/config-llama2-7b-g5-no-s3-quick.yml \u251c\u2500\u2500 llama2/7b/config-llama2-7b-g5-quick.yml \u2514\u2500\u2500 llama2/7b/config-llama2-7b-inf2-g5.yml llama3 \u251c\u2500\u2500 llama3/70b \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama3/70b/config-bedrock.yml -> ../../bedrock/config-bedrock.yml \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama3/70b/config-ec2-llama3-70b-instruct.yml \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama3/70b/config-ec2-neuron-llama3-70b-inf2-48xl-deploy-sm.yml \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama3/70b/config-llama3-70b-instruct-g5-48xl.yml \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama3/70b/config-llama3-70b-instruct-g5-p4d.yml \u2502\u00a0\u00a0 \u2514\u2500\u2500 llama3/70b/config-llama3-70b-instruct-p4d.yml \u2514\u2500\u2500 llama3/8b \u251c\u2500\u2500 llama3/8b/config-bedrock.yml \u251c\u2500\u2500 llama3/8b/config-ec2-llama3-8b-c5-18xlarge.yml \u251c\u2500\u2500 llama3/8b/config-ec2-llama3-8b-inf2-48xl.yml \u251c\u2500\u2500 llama3/8b/config-ec2-llama3-8b-m5-16xlarge.yml \u251c\u2500\u2500 llama3/8b/config-ec2-llama3-8b-m7a-16xlarge.yml \u251c\u2500\u2500 llama3/8b/config-ec2-llama3-8b-m7a-24xlarge.yml \u251c\u2500\u2500 llama3/8b/config-ec2-llama3-8b-m7i-12xlarge.yml \u251c\u2500\u2500 llama3/8b/config-ec2-llama3-8b-neuron-trn1-32xl-tp16-sm.yml \u251c\u2500\u2500 config-llama3-8b-trn1-32xl-tp16-bs-4-ec2.yml \u251c\u2500\u2500 llama3/8b/config-ec2-llama3-8b.yml \u251c\u2500\u2500 llama3/8b/config-ec2-neuron-llama3-8b-inf2-24xl-deploy-sm.yml \u251c\u2500\u2500 
llama3/8b/config-ec2-neuron-llama3-8b-inf2-48xl-deploy-sm.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-eks-inf2.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-g5-streaming.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-g5.12xl-tp-2-mc-max-djl-ec2.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-g5.12xl-tp-2-mc-max-triton-ec2.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-g5.12xl-tp-4-mc-max-djl-ec2.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-g5.12xl-tp-4-mc-max-triton-ec2.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-inf2-24xl-tp=8-bs=4-byoe.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-inf2-48xl-tp=8-bs=4-byoe.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-inf2-g5-byoe-w-openorca.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-inf2-g5.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-all.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-g5-12xl-4-instances.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-g5-12xl.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-g5-24xl.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-g5-2xl.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-g5-48xl.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-g5-p4d.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-g6-12xl.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-g6-24xl.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-g6-48xl.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-p4d-djl-lmi-dist.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-p4d-djl-vllm.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-p5-djl-lmi-dist.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-trn1-32xl-tp-16-bs-4-byoe.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-trn1-32xl-tp-8-bs-4-byoe.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-trn1-32xlarge-triton-vllm.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-trn1.yml \u251c\u2500\u2500 llama3/8b/llama3-8b-inf2-24xl-byoe-g5-12xl.yml \u251c\u2500\u2500 llama3/8b/llama3-8b-inf2-48xl-byoe-g5-24xl.yml \u2514\u2500\u2500 llama3/8b/llama3-8b-trn1-32xl-byoe-g5-24xl.yml llama3.1 \u251c\u2500\u2500 llama3.1/70b \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama3.1/70b/config-ec2-llama3-1-70b-inf2-48xl-deploy-ec2.yml \u2502\u00a0\u00a0 \u2514\u2500\u2500 llama3.1/70b/config-ec2-llama3-1-70b-inf2-deploy-sm.yml \u2514\u2500\u2500 llama3.1/8b \u251c\u2500\u2500 llama3.1/8b/client-config-ec2-llama3-1-8b.yml \u251c\u2500\u2500 llama3.1/8b/config-ec2-llama3-1-8b-inf2-48xl-deploy-ec2-tp24-bs12.yml \u251c\u2500\u2500 llama3.1/8b/config-ec2-llama3-1-8b-inf2-48xl-deploy-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-ec2-llama3-1-8b-inf2.yml \u251c\u2500\u2500 llama3.1/8b/config-ec2-llama3-1-8b-p4-tp-2-mc-max.yml \u251c\u2500\u2500 llama3.1/8b/config-ec2-llama3-1-8b-p4-tp-4-mc-max.yml \u251c\u2500\u2500 llama3.1/8b/config-ec2-llama3-1-8b-p4-tp-8-mc-max.yml \u251c\u2500\u2500 llama3.1/8b/config-ec2-llama3-1-8b-p5-tp-2-mc-max.yml \u251c\u2500\u2500 llama3.1/8b/config-ec2-llama3-1-8b-tp-8-mc-auto-p5.yml \u251c\u2500\u2500 llama3.1/8b/config-ec2-llama3-1-8b-trn1-32xl-deploy-ec2-tp32-bs8.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.12xl-tp-2-mc-auto-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.12xl-tp-2-mc-max-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.12xl-tp-4-mc-max-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.24xl-tp-2-mc-auto-ec2.yml 
\u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.24xl-tp-2-mc-max-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.24xl-tp-4-mc-max-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.2xl-tp-1-mc-max-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.48xl-tp-2-mc-auto-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.48xl-tp-2-mc-max-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.48xl-tp-4-mc-max-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.48xl-tp-8-mc-max-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-inf2-48xl-deploy-tp-24-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-inf2-48xl-deploy-tp-8-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-trn1-32xl-deploy-tp-8-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-trn32xl-triton-vllm.yml \u2514\u2500\u2500 llama3.1/8b/server-config-ec2-llama3-1-8b-inf2-48xl-deploy-ec2.yml mistral \u251c\u2500\u2500 mistral/config-mistral-7b-eks-inf2.yml \u251c\u2500\u2500 mistral/config-mistral-7b-tgi-g5.yml \u251c\u2500\u2500 mistral/config-mistral-7b-trn1-32xl-triton.yml \u251c\u2500\u2500 mistral/config-mistral-instruct-AWQ-p4d.yml \u251c\u2500\u2500 mistral/config-mistral-instruct-AWQ-p5-byo-ep.yml \u251c\u2500\u2500 mistral/config-mistral-instruct-AWQ-p5.yml \u251c\u2500\u2500 mistral/config-mistral-instruct-p4d.yml \u251c\u2500\u2500 mistral/config-mistral-instruct-v1-p5-trtllm.yml \u251c\u2500\u2500 mistral/config-mistral-instruct-v2-p4d-lmi-dist.yml \u251c\u2500\u2500 mistral/config-mistral-instruct-v2-p4d-trtllm.yml \u251c\u2500\u2500 mistral/config-mistral-instruct-v2-p5-lmi-dist.yml \u251c\u2500\u2500 mistral/config-mistral-instruct-v2-p5-trtllm.yml \u251c\u2500\u2500 mistral/config-mistral-trn1-32xl-deploy-ec2-tp32.yml \u2514\u2500\u2500 mistral/config-mistral-v3-inf2-48xl-deploy-ec2-tp24.yml model_eval_all_info.yml phi \u2514\u2500\u2500 phi/config-phi-3-g5.yml pricing.yml

    "},{"location":"mm_copies.html","title":"Running multiple model copies on Amazon EC2","text":"

    It is possible to run multiple copies of a model if the tensor parallelism degree and the number of GPUs/Neuron cores on the instance allow it. For example, if a model fits into 2 GPU devices and there are 8 devices available, then we could run 4 copies of the model on that instance. Some inference containers, such as the DJL Serving LMI, automatically start multiple copies of the model within the same inference container for the scenario described above. However, it is also possible to do this ourselves by running multiple containers and a load balancer through a Docker compose file. FMBench now supports this functionality through a single parameter called model_copies in the configuration file.

    For example, here is a snippet from the config-ec2-llama3-1-8b-p4-tp-2-mc-max config file. The new parameters are model_copies, tp_degree and shm_size in the inference_spec section. Note that the tp_degree in the inference_spec and option.tensor_parallel_degree in the serving.properties section should be set to the same value.

      inference_spec:
        # this should match one of the sections in the inference_parameters section above
        parameter_set: ec2_djl
        # how many copies of the model, "1", "2",..max
        # set to 1 in the code if not configured,
        # max: FMBench figures out the max number of model containers to be run
        #      based on TP degree configured and number of neuron cores/GPUs available.
        #      For example, if TP=2, GPUs=8 then FMBench will start 4 containers and 1 load balancer,
        # auto: only supported if the underlying inference container would automatically
        #       start multiple copies of the model internally based on TP degree and neuron cores/GPUs
        #       available. In this case only a single container is created, no load balancer is created.
        #       The DJL serving containers supports auto.
        model_copies: max
        # if you set the model_copies parameter then it is mandatory to set the
        # tp_degree, shm_size, model_loading_timeout parameters
        tp_degree: 2
        shm_size: 12g
        model_loading_timeout: 2400
      # modify the serving properties to match your model and requirements
      serving.properties: |
        engine=MPI
        option.tensor_parallel_degree=2
        option.max_rolling_batch_size=256
        option.model_id=meta-llama/Meta-Llama-3.1-8B-Instruct
        option.rolling_batch=lmi-dist
    "},{"location":"mm_copies.html#considerations-while-setting-the-model_copies-parameter","title":"Considerations while setting the model_copies parameter","text":"
    1. The model_copies parameter is an EC2 only parameter, which means that you cannot use it when deploying models on SageMaker for example.

    2. If you are looking for the best (lowest) inference latency, then you might get better results by setting tp_degree and option.tensor_parallel_degree to the total number of GPUs/Neuron cores available on your EC2 instance and model_copies to max, auto or 1. In other words, the model is sharded across all accelerators and only one copy of the model can run on that instance (therefore setting model_copies to max, auto or 1 all result in the same thing, i.e. a single copy of the model running on that EC2 instance).

    3. If you are looking for the best (highest) transaction throughput while keeping the inference latency within a given latency budget, then you might want to configure tp_degree and option.tensor_parallel_degree to the least number of GPUs/Neuron cores on which the model can run (for example, for Llama3.1-8b that would be 2 GPUs or 4 Neuron cores) and set model_copies to max. Let us understand this with an example: say you want to run Llama3.1-8b on a p4de.24xlarge instance type; if you set tp_degree and option.tensor_parallel_degree to 2 and model_copies to max, FMBench will start 4 containers (as the p4de.24xlarge has 8 GPUs) and an Nginx load balancer that will round-robin the incoming requests to these 4 containers. In the case of the DJL serving LMI you can achieve similar results by setting model_copies to auto, in which case FMBench will start a single container (and no load balancer, since there is only one container) and the DJL serving container will internally start 4 copies of the model within the same container and route the requests to these copies internally. Theoretically you should expect the same performance, but in our testing we have seen better performance with model_copies set to max and an external (Nginx) container doing the load balancing. The two settings are contrasted below.
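    To make the two strategies above concrete, here is an illustrative (not exhaustive) pair of settings for the Llama3.1-8b on p4de.24xlarge example; only the relevant keys are shown and everything else in the config file stays the same.

      # throughput-oriented: shard across 2 GPUs, let FMBench run 4 copies behind an Nginx load balancer
      inference_spec:
        model_copies: max
        tp_degree: 2
        shm_size: 12g
        model_loading_timeout: 2400
      serving.properties: |
        option.tensor_parallel_degree=2

    versus

      # latency-oriented: shard the model across all 8 GPUs, a single copy of the model
      inference_spec:
        model_copies: 1
        tp_degree: 8
        shm_size: 12g
        model_loading_timeout: 2400
      serving.properties: |
        option.tensor_parallel_degree=8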

    "},{"location":"neuron.html","title":"Benchmark foundation models for AWS Chips","text":"

    You can use FMBench for benchmarking foundation models on AWS Chips: Trainium 1 and Inferentia 2. This can be done on Amazon SageMaker, Amazon EKS or Amazon EC2. FMs need to be compiled for Neuron before they can be deployed on AWS Chips; this is made easier by SageMaker JumpStart, which provides most of the FMs as JumpStart Models that can be deployed on SageMaker directly. You can also compile models for Neuron yourself or have FMBench do it for you. All of these options are described below.

    "},{"location":"neuron.html#benchmarking-for-aws-chips-on-sagemaker","title":"Benchmarking for AWS Chips on SageMaker","text":"
    1. Several FMs are available through SageMaker JumpStart already compiled for Neuron and ready to deploy. See this link for more details.

    2. You can compile the model outside of FMBench using instructions available here and on the Neuron documentation, deploy on SageMaker and use FMBench in the bring your own endpoint mode, see this config file for an example.

    3. You can have FMBench compile and deploy the model on SageMaker for you. See this Llama3-8b config file or this Llama3.1-70b config file for examples; search this website for "inf2" or "trn1" to find others. In this case FMBench will download the model from Hugging Face (you need to provide your HuggingFace token in the /tmp/fmbench-read/scripts/hf_token.txt file; the file simply contains the token without any formatting), compile it for Neuron, upload the compiled model to S3 (you specify the bucket in the config file) and then deploy the model to a SageMaker endpoint.

    "},{"location":"neuron.html#benchmarking-for-aws-chips-on-ec2","title":"Benchmarking for AWS Chips on EC2","text":"

    You may want to benchmark models hosted directly on EC2. In this case both FMBench and the model run on the same EC2 instance, and FMBench deploys the model for you on that instance. See this Llama3.1-70b file or this Llama3-8b file for examples. In this case FMBench will download the model from Hugging Face (you need to provide your HuggingFace token in the /tmp/fmbench-read/scripts/hf_token.txt file; the file simply contains the token without any formatting), pull the inference container from the ECR repo and then run the container with the downloaded model; a local endpoint is exposed that FMBench then uses to run inference.

    "},{"location":"quickstart.html","title":"Quickstart - run FMBench on SageMaker Notebook","text":"
    1. Each FMBench run works with a configuration file that contains the information about the model, the deployment steps, and the tests to run. A typical FMBench workflow involves either directly using an already provided config file from the configs folder in the FMBench GitHub repo or editing an already provided config file as per your own requirements (say you want to try benchmarking on a different instance type, or a different inference container etc.).

      👉 A simple config file with key parameters annotated is included in this repo, see config-llama2-7b-g5-quick.yml. This file benchmarks performance of Llama2-7b on an ml.g5.xlarge instance and an ml.g5.2xlarge instance. You can use this config file as it is for this Quickstart.

    2. Launch the AWS CloudFormation template included in this repository using one of the buttons from the table below. The CloudFormation template creates the following resources within your AWS account: Amazon S3 buckets, an AWS IAM role and an Amazon SageMaker Notebook with this repository cloned. A read S3 bucket is created which contains all the files (configuration files, datasets) required to run FMBench, and a write S3 bucket is created which will hold the metrics and reports generated by FMBench. The CloudFormation stack takes about 5 minutes to create.

    AWS Region / Link: us-east-1 (N. Virginia), us-west-2 (Oregon), us-gov-west-1 (GovCloud West)
    1. Once the CloudFormation stack is created, navigate to SageMaker Notebooks and open the fmbench-notebook.

    2. On the fmbench-notebook open a Terminal and run the following commands.

      conda create --name fmbench_python311 -y python=3.11 ipykernel
      source activate fmbench_python311;
      pip install -U fmbench

    3. Now you are ready to run fmbench with the following command line. We will use a sample config file placed in the S3 bucket by the CloudFormation stack for a quick first run.

      1. We benchmark performance for the Llama2-7b model on a ml.g5.xlarge and a ml.g5.2xlarge instance type, using the huggingface-pytorch-tgi-inference inference container. This test would take about 30 minutes to complete and cost about $0.20.

      2. It uses a simple relationship of 750 words being equal to 1000 tokens; to get a more accurate representation of token counts use the Llama2 tokenizer (instructions are provided in the next section). It is strongly recommended that for more accurate results on token throughput you use a tokenizer specific to the model you are testing rather than the default tokenizer. See the instructions provided later in this document on how to use a custom tokenizer, and the short sketch at the end of this section.

        account=`aws sts get-caller-identity | jq .Account | tr -d '"'`
        region=`aws configure get region`
        fmbench --config-file s3://sagemaker-fmbench-read-${region}-${account}/configs/llama2/7b/config-llama2-7b-g5-quick.yml > fmbench.log 2>&1
      3. Open another terminal window and do a tail -f on the fmbench.log file to see all the traces being generated at runtime.

        tail -f fmbench.log
      4. 👉 For streaming support on SageMaker and Bedrock check out these config files:

        1. config-llama3-8b-g5-streaming.yml
        2. config-bedrock-llama3-streaming.yml
    4. The generated reports and metrics are available in the sagemaker-fmbench-write-<replace_w_your_aws_region>-<replace_w_your_aws_account_id> bucket. The metrics and report files are also downloaded locally into the results directory (created by FMBench), and the benchmarking report is available as a markdown file called report.md in that directory. You can view the rendered Markdown report in the SageMaker notebook itself or download the metrics and report files to your machine for offline analysis.
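    As referenced in step 3 above, here is a hedged sketch of how you could compare the default 750-words-per-1000-tokens heuristic with a model-specific tokenizer (the model id below is just an example of a publicly available tokenizer; use the tokenizer files for the model you are actually benchmarking):

      from transformers import AutoTokenizer

      # any tokenizer you have access to works the same way
      tokenizer = AutoTokenizer.from_pretrained("NousResearch/Llama-2-13b-chat-hf")
      prompt = "Summarize the following passage in two sentences: ..."
      n_words = len(prompt.split())
      print(f"heuristic estimate : {int(n_words * 1000 / 750)} tokens")
      print(f"tokenizer count    : {len(tokenizer.encode(prompt))} tokens")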

    "},{"location":"quickstart.html#fmbench-on-govcloud","title":"FMBench on GovCloud","text":"

    No special steps are required for running FMBench on GovCloud. The CloudFormation link for us-gov-west-1 has been provided in the section above.

    1. Not all models available via Bedrock or other services may be available in GovCloud. The following commands show how to run FMBench to benchmark the Amazon Titan Text Express model in the GovCloud. See the Amazon Bedrock GovCloud page for more details.

      account=`aws sts get-caller-identity | jq .Account | tr -d '"'`
      region=`aws configure get region`
      fmbench --config-file s3://sagemaker-fmbench-read-${region}-${account}/configs/bedrock/config-bedrock-titan-text-express.yml > fmbench.log 2>&1
    "},{"location":"releases.html","title":"Releases","text":""},{"location":"releases.html#207","title":"2.0.7","text":"
    1. Support Triton-TensorRT for GPU instances and Triton-vllm for AWS Chips.
    2. Misc. bug fixes.
    "},{"location":"releases.html#206","title":"2.0.6","text":"
    1. Run multiple model copies with the DJL serving container and an Nginx load balancer on Amazon EC2.
    2. Config files for Llama3.1-8b on g5, p4de and p5 Amazon EC2 instance types.
    3. Better analytics for creating internal leaderboards.
    "},{"location":"releases.html#205","title":"2.0.5","text":"
    1. Support for Intel CPU based instances such as c5.18xlarge and m5.16xlarge.
    "},{"location":"releases.html#204","title":"2.0.4","text":"
    1. Support for AMD CPU based instances such as m7a.
    "},{"location":"releases.html#203","title":"2.0.3","text":"
    1. Support for an EFA directory for benchmarking on EC2.
    "},{"location":"releases.html#202","title":"2.0.2","text":"
    1. Code cleanup, minor bug fixes and report improvements.
    "},{"location":"releases.html#200","title":"2.0.0","text":"
    1. 🚨 Model evaluations done by a Panel of LLM Evaluators 🚨
    "},{"location":"releases.html#v1052","title":"v1.0.52","text":"
    1. Compile for AWS Chips (Trainium, Inferentia) and deploy to SageMaker directly through FMBench.
    2. Llama3.1-8b and Llama3.1-70b config files for AWS Chips (Trainium, Inferentia).
    3. Misc. bug fixes.
    "},{"location":"releases.html#v1051","title":"v1.0.51","text":"
    1. FMBench has a website now. Rework the README file to make it lightweight.
    2. Llama3.1 config files for Bedrock.
    "},{"location":"releases.html#v1050","title":"v1.0.50","text":"
    1. Llama3-8b on Amazon EC2 inf2.48xlarge config file.
    2. Update to new version of DJL LMI (0.28.0).
    "},{"location":"releases.html#v1049","title":"v1.0.49","text":"
    1. Streaming support for Amazon SageMaker and Amazon Bedrock.
    2. Per-token latency metrics such as time to first token (TTFT) and mean time per-output token (TPOT).
    3. Misc. bug fixes.
    "},{"location":"releases.html#v1048","title":"v1.0.48","text":"
    1. Faster result file download at the end of a test run.
    2. Phi-3-mini-4k-instruct configuration file.
    3. Tokenizer and misc. bug fixes.
    "},{"location":"releases.html#v1047","title":"v1.0.47","text":"
    1. Run FMBench as a Docker container.
    2. Bug fixes for GovCloud support.
    3. Updated README for EKS cluster creation.
    "},{"location":"releases.html#v1046","title":"v1.0.46","text":"
    1. Native model deployment support for EC2 and EKS (i.e. you can now deploy and benchmark models on EC2 and EKS).
    2. FMBench is now available in GovCloud.
    3. Update to latest version of several packages.
    "},{"location":"releases.html#v1045","title":"v1.0.45","text":"
    1. Analytics for results across multiple runs.
    2. Llama3-70b config files for g5.48xlarge instances.
    "},{"location":"releases.html#v1044","title":"v1.0.44","text":"
    1. Endpoint metrics (CPU/GPU utilization, memory utilization, model latency) and invocation metrics (including errors) for SageMaker Endpoints.
    2. Llama3-8b config files for g6 instances.
    "},{"location":"releases.html#v1042","title":"v1.0.42","text":"
    1. Config file for running Llama3-8b on all instance types except p5.
    2. Fix bug with business summary chart.
    3. Fix bug with deploying model using a DJL DeepSpeed container in the no S3 dependency mode.
    "},{"location":"releases.html#v1040","title":"v1.0.40","text":"
    1. Make it easy to run FMBench on Amazon EC2 without any dependency on Amazon S3.
    "},{"location":"releases.html#v1039","title":"v1.0.39","text":"
    1. Add an internal FMBench website.
    "},{"location":"releases.html#v1038","title":"v1.0.38","text":"
    1. Support for running FMBench on Amazon EC2 without any dependency on Amazon S3.
    2. Llama3-8b-Instruct config file for ml.p5.48xlarge.
    "},{"location":"releases.html#v1037","title":"v1.0.37","text":"
    1. g5/p4d/inf2/trn1 specific config files for Llama3-8b-Instruct.
      1. p4d config file for both vllm and lmi-dist.
    "},{"location":"releases.html#v1036","title":"v1.0.36","text":"
    1. Fix bug at higher concurrency levels (20 and above).
    2. Support for instance count > 1.
    "},{"location":"releases.html#v1035","title":"v1.0.35","text":"
    1. Support for Open-Orca dataset and corresponding prompts for Llama3, Llama2 and Mistral.
    "},{"location":"releases.html#v1034","title":"v1.0.34","text":"
    1. Don't delete endpoints for the bring your own endpoint case.
    2. Fix bug with business summary chart.
    "},{"location":"releases.html#v1032","title":"v1.0.32","text":"
    1. Report enhancements: New business summary chart, config file embedded in the report, version numbering and others.

    2. Additional config files: Meta Llama3 on Inf2, Mistral instruct with lmi-dist on p4d and p5 instances.

    "},{"location":"resources.html","title":"Resources","text":""},{"location":"resources.html#pending-enhancements","title":"Pending enhancements","text":"

    View the ISSUES on GitHub and add any that you think would be a beneficial addition to this benchmarking harness.

    "},{"location":"resources.html#security","title":"Security","text":"

    See CONTRIBUTING for more information.

    "},{"location":"resources.html#license","title":"License","text":"

    This library is licensed under the MIT-0 License. See the LICENSE file.

    "},{"location":"results.html","title":"Results","text":"

    Depending upon the experiments in the config file, the FMBench run may take a few minutes to several hours. Once the run completes, you can find the report and metrics in the local results-* folder in the directory from which FMBench was run. The report and metrics are also written to the write S3 bucket set in the config file.

    Here is a screenshot of the report.md file generated by FMBench.

    "},{"location":"run_as_container.html","title":"Run FMBench as a Docker container","text":"

    You can now run FMBench on any platform where you can run a Docker container, for example on an EC2 VM, SageMaker Notebook etc. The advantage is that you do not have to install anything locally, so no conda installs needed anymore. Here are the steps to do that.

    1. Create local directory structure needed for FMBench and copy all publicly available dependencies from the AWS S3 bucket for FMBench. This is done by running the copy_s3_content.sh script available as part of the FMBench repo. You can place model specific tokenizers and any new configuration files you create in the /tmp/fmbench-read directory that is created after running the following command.

      curl -s https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/copy_s3_content.sh | sh
    2. That's it! You are now ready to run the container.

      # set the config file path to point to the config file of interest
      CONFIG_FILE=https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/src/fmbench/configs/llama2/7b/config-llama2-7b-g5-quick.yml
      docker run -v $(pwd)/fmbench:/app \
        -v /tmp/fmbench-read:/tmp/fmbench-read \
        -v /tmp/fmbench-write:/tmp/fmbench-write \
        aarora79/fmbench:v1.0.47 \
       "fmbench --config-file ${CONFIG_FILE} --local-mode yes --write-bucket placeholder > fmbench.log 2>&1"
    3. The above command will create an fmbench directory inside the current working directory. This directory contains the fmbench.log file and the results-* folder that is created once the run finishes.

    "},{"location":"website.html","title":"Create a website for FMBench reports","text":"

    When you use FMBench as a tool for benchmarking your foundation models you would soon want to have an easy way to view all the reports in one place and search through the results, for example, \"Llama3.1-8b results on trn1.32xlarge\". An FMBench website provides a simple way of viewing these results.

    Here are the steps to setup a website using mkdocs and nginx. The steps below generate a self-signed certificate for SSL and use username and password for authentication. It is strongly recommended that you use a valid SSL cert and a better authentication mechanism than username and password for your FMBench website.

    1. Start an Amazon EC2 machine which will host the FMBench website. A t3.xlarge machine with an Ubuntu AMI say ubuntu/images/hvm-ssd-gp3/ubuntu-noble-24.04-amd64-server-20240801 and 50GB storage is good enough. Allow SSH and TCP port 443 traffic from anywhere into that machine.

    2. SSH into that machine and install conda.

      wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
      bash Miniconda3-latest-Linux-x86_64.sh -b  # Run the Miniconda installer in batch mode (no manual intervention)
      rm -f Miniconda3-latest-Linux-x86_64.sh    # Remove the installer script after installation
      eval "$(/home/$USER/miniconda3/bin/conda shell.bash hook)" # Initialize conda for bash shell
      conda init  # Initialize conda, adding it to the shell
    3. Install docker-compose.

      sudo apt-get update
      sudo apt-get install --reinstall docker.io -y
      sudo apt-get install -y docker-compose
      sudo usermod -a -G docker $USER
      newgrp docker
      docker compose version
    4. Setup the fmbench_python311 conda environment and clone FMBench repo.

      conda create --name fmbench_python311 -y python=3.11 ipykernel
      source activate fmbench_python311
      pip install -U fmbench mkdocs mkdocs-material mknotebooks
      git clone https://github.com/aws-samples/foundation-model-benchmarking-tool.git
    5. Get the FMBench results data from Amazon S3 or whichever storage system you used to store all the results.

      curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
      sudo apt-get install unzip -y
      unzip awscliv2.zip
      sudo ./aws/install
      FMBENCH_S3_BUCKET=your-fmbench-s3-bucket-name-here
      aws s3 sync s3://$FMBENCH_S3_BUCKET $HOME/fmbench_data --exclude "*.json"
    6. Create a directory for the FMBench website contents.

      mkdir $HOME/fmbench_site
      mkdir $HOME/fmbench_site/ssl
      1. Setup SSL certs (we strongly encourage you to not use self-signed certs, this step here is just for demo purposes, get SSL certs the same way you get them for your current production workloads).

      sudo openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout $HOME/fmbench_site/ssl/nginx-selfsigned.key -out $HOME/fmbench_site/ssl/nginx-selfsigned.crt
    7. Create an .htpasswd file. The FMBench website will use fmbench_admin as the username and the password that you enter as part of the command below to allow login to the website.

      sudo apt-get install apache2-utils -y
      htpasswd -c $HOME/fmbench_site/.htpasswd fmbench_admin
    8. Create the mkdocs.yml file for the website.

      cd foundation-model-benchmarking-tool
      cp website/index.md $HOME/fmbench_data/
      cp -r img $HOME/fmbench_data/
      python website/create_fmbench_website.py
      mkdocs build -f website/mkdocs.yml --site-dir $HOME/fmbench_site/site
    9. Update the nginx.conf file. Note the hostname that is printed out below; the FMBench website will be served at this address.

      TOKEN=`curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600"`
      HOSTNAME=`curl -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/public-hostname`
      echo "hostname is: $HOSTNAME"
      sed "s/__HOSTNAME__/$HOSTNAME/g" website/nginx.conf.template > $HOME/fmbench_site/nginx.conf
    10. Serve the website.

      docker run --name fmbench-nginx -d -p 80:80 -p 443:443 \
        -v $HOME/fmbench_site/site:/usr/share/nginx/html \
        -v $HOME/fmbench_site/nginx.conf:/etc/nginx/nginx.conf \
        -v $HOME/fmbench_site/ssl:/etc/nginx/ssl \
        -v $HOME/fmbench_site/.htpasswd:/etc/nginx/.htpasswd \
        nginx
    11. Open a web browser and navigate to the hostname you noted in the step above, for example https://<your-ec2-hostname>.us-west-2.compute.amazonaws.com. Ignore the security warnings if you used a self-signed SSL cert (replace this with a cert that you would normally use on your production websites), and then enter the username and password (the username is fmbench_admin and the password is what you set when running the htpasswd command). You should see a website as shown in the screenshot below.

    "},{"location":"workflow.html","title":"Workflow for FMBench","text":"

    The workflow for FMBench is as follows:

    Create configuration file
        |
        |-----> Deploy model on SageMaker/Use models on Bedrock/Bring your own endpoint
                    |
                    |-----> Run inference against deployed endpoint(s)
                                     |
                                     |------> Create a benchmarking report
    1. Create a dataset of different prompt sizes and select one or more such datasets for running the tests.

      1. Currently FMBench supports datasets from LongBench and filters out individual items from the dataset based on their size in tokens (for example, prompts less than 500 tokens, between 500 and 1000 tokens, and so on). Alternatively, you can download the folder from this link to load the data.
    2. Deploy any model that is deployable on SageMaker on any supported instance type (g5, p4d, Inf2).

      1. Models can either be available via SageMaker JumpStart (list available here) or not available via JumpStart but still deployable on SageMaker through the low-level boto3 (Python) SDK (Bring Your Own Script).
      2. Model deployment is completely configurable in terms of the inference container to use, environment variables to set, serving.properties file to provide (for inference containers such as DJL that use it) and instance type to use.
    3. Benchmark FM performance in terms of inference latency, transactions per minute and dollar cost per transaction for any FM that can be deployed on SageMaker.

      1. Tests are run for each combination of the configured concurrency levels (i.e. the number of transactions, or inference requests, sent to the endpoint in parallel) and datasets. For example, run multiple datasets of, say, prompt sizes between 3000 and 4000 tokens at concurrency levels of 1, 2, 4, 6, 8 etc. to test how many transactions of what token length the endpoint can handle while still maintaining an acceptable level of inference latency.
    4. Generate a report that compares and contrasts the performance of the model over different test configurations and stores the reports in an Amazon S3 bucket.

      1. The report is generated in Markdown format and consists of plots, tables and text that highlight the key results and provide an overall recommendation on the best combination of instance type and serving stack to use for the model under test for a dataset of interest.
      2. The report is created as an artifact of reproducible research so that anyone having access to the model, instance type and serving stack can run the code and recreate the same results and report.
    5. Multiple configuration files that can be used as reference for benchmarking new models and instance types.

    "},{"location":"misc/ec2_instance_creation_steps.html","title":"Create an EC2 instance suitable for an LMI (Large Model Inference)","text":"

    Follow the steps below to create an EC2 instance for hosting a model in an LMI.

    1. On the homepage of the AWS Console go to 'EC2' - it is likely in recently visited:

    2. If not found, go to the search bar on the top of the page. Type ec2 into the search box and click the entry that pops up with the name EC2:

    3. Click "Instances":

    4. Click "Launch Instances":

    5. Type in a name for your instance (recommended to include your alias in the name), and then scroll down. Search for 'deep learning ami' in the box. Select the one that says Deep Learning OSS Nvidia Driver AMI GPU PyTorch for a GPU instance type, or select Deep Learning AMI Neuron (Ubuntu 22.04) for an Inferentia/Trainium instance type. Your version number might be different.

    6. Name your instance FMBenchInstance.

    7. Add a fmbench-version tag to your instance.

    8. Scroll down to Instance Type. For large model inference, the g5.12xlarge is recommended.

    1. Make a key pair by clicking Create new key pair. Give it a name, keep all settings as is, and then click "Create key pair".

    2. Skip over Network settings (leave it as it is), going straight to Configure storage. 45 GB, the suggested amount, is not nearly enough, and using that will cause the LMI docker container to download for an arbitrarily long time and then error out. Change it to 100 GB or more:

    3. Create an IAM role for your instance called FMBenchEC2Role. Attach the following permission policies: AmazonSageMakerFullAccess, AmazonBedrockFullAccess.

      Edit the trust policy to be the following:

      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Effect": "Allow",
                  "Principal": {
                      "Service": "ec2.amazonaws.com"
                  },
                  "Action": "sts:AssumeRole"
              },
              {
                  "Effect": "Allow",
                  "Principal": {
                      "Service": "sagemaker.amazonaws.com"
                  },
                  "Action": "sts:AssumeRole"
              },
              {
                  "Effect": "Allow",
                  "Principal": {
                      "Service": "bedrock.amazonaws.com"
                  },
                  "Action": "sts:AssumeRole"
              }
          ]
      }
      Select this role in the IAM instance profile setting of your instance.

    4. Then, we're done with the settings of the instance. Click Launch Instance to finish. You can connect to your EC2 instance using any of these options.

    "},{"location":"misc/eks_cluster-creation_steps.html","title":"EKS cluster creation steps","text":"

    The steps below create an EKS cluster called trainium-inferentia.

    1. Before we begin, ensure you have all the prerequisites in place to make the deployment process smooth and hassle-free. Ensure that you have installed the following tools on your machine: aws-cli, kubectl and terraform. We use the DoEKS repository as a guide to deploy the cluster infrastructure in an AWS account.

    2. Ensure that your account has enough Inf2 on-demand vCPUs, as most of the DoEKS blueprints utilize this specific instance. To increase the service quota navigate to the Service Quotas page for the region you are in. Then select services under the left side menu and search for Amazon Elastic Compute Cloud (Amazon EC2). This will bring up the service quota page; here search for inf and there should be an option for Running On-Demand Inf instances. Increase this quota to 300.
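      If you prefer the CLI, the same increase can be requested with the commands below (the quota code is looked up first because it varies; treat the JMESPath filter as a convenience, not an exact recipe):

        # find the quota code for "Running On-Demand Inf instances"
        aws service-quotas list-service-quotas --service-code ec2 \
          --query "Quotas[?contains(QuotaName, 'Inf')]"

        # request the increase, replacing <quota-code> with the code returned above
        aws service-quotas request-service-quota-increase --service-code ec2 \
          --quota-code <quota-code> --desired-value 300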

    3. Clone the DoEKS repository

      git clone https://github.com/awslabs/data-on-eks.git
    4. Ensure that the region names are correct in variables.tf file before running the cluster creation script.

    5. Ensure that the ELB to be created would be external facing. Change the helm value from internal to internet-facing here.

    6. Ensure that the IAM role you are using has the permissions needed to create the cluster. While we expect the following set of permissions to work, the current recommendation is to also add the AdministratorAccess permission to the IAM role. At a later date you could remove the AdministratorAccess policy and experiment with cluster creation without it.

      1. Attach the following managed policies: AmazonEKSClusterPolicy, AmazonEKS_CNI_Policy, and AmazonEKSWorkerNodePolicy.
      2. In addition to the managed policies add the following as inline policy. Replace your-account-id with the actual value of the AWS account id you are using.

        {\n\"Version\": \"2012-10-17\",\n\"Statement\": [\n    {\n        \"Sid\": \"VisualEditor0\",\n        \"Effect\": \"Allow\",\n        \"Action\": [\n            \"ec2:CreateVpc\",\n            \"ec2:DeleteVpc\"\n        ],\n        \"Resource\": [\n            \"arn:aws:ec2:*:your-account-id:ipv6pool-ec2/*\",\n            \"arn:aws:ec2::your-account-id:ipam-pool/*\",\n            \"arn:aws:ec2:*:your-account-id:vpc/*\"\n        ]\n    },\n    {\n        \"Sid\": \"VisualEditor1\",\n        \"Effect\": \"Allow\",\n        \"Action\": [\n            \"ec2:ModifyVpcAttribute\",\n            \"ec2:DescribeVpcAttribute\"\n        ],\n        \"Resource\": \"arn:aws:ec2:*:<your-account-id>:vpc/*\"\n    },\n    {\n        \"Sid\": \"VisualEditor2\",\n        \"Effect\": \"Allow\",\n        \"Action\": \"ec2:AssociateVpcCidrBlock\",\n        \"Resource\": [\n            \"arn:aws:ec2:*:your-account-id:ipv6pool-ec2/*\",\n            \"arn:aws:ec2::your-account-id:ipam-pool/*\",\n            \"arn:aws:ec2:*:your-account-id:vpc/*\"\n        ]\n    },\n    {\n        \"Sid\": \"VisualEditor3\",\n        \"Effect\": \"Allow\",\n        \"Action\": [\n            \"ec2:DescribeSecurityGroupRules\",\n            \"ec2:DescribeNatGateways\",\n            \"ec2:DescribeAddressesAttribute\"\n        ],\n        \"Resource\": \"*\"\n    },\n    {\n        \"Sid\": \"VisualEditor4\",\n        \"Effect\": \"Allow\",\n        \"Action\": [\n            \"ec2:CreateInternetGateway\",\n            \"ec2:RevokeSecurityGroupEgress\",\n            \"ec2:CreateRouteTable\",\n            \"ec2:CreateSubnet\"\n        ],\n        \"Resource\": [\n            \"arn:aws:ec2:*:your-account-id:security-group/*\",\n            \"arn:aws:ec2:*:your-account-id:internet-gateway/*\",\n            \"arn:aws:ec2:*:your-account-id:subnet/*\",\n            \"arn:aws:ec2:*:your-account-id:route-table/*\",\n            \"arn:aws:ec2::your-account-id:ipam-pool/*\",\n            \"arn:aws:ec2:*:your-account-id:vpc/*\"\n        ]\n    },\n    {\n        \"Sid\": \"VisualEditor5\",\n        \"Effect\": \"Allow\",\n        \"Action\": [\n            \"ec2:AttachInternetGateway\",\n            \"ec2:AssociateRouteTable\"\n        ],\n        \"Resource\": [\n            \"arn:aws:ec2:*:your-account-id:vpn-gateway/*\",\n            \"arn:aws:ec2:*:your-account-id:internet-gateway/*\",\n            \"arn:aws:ec2:*:your-account-id:subnet/*\",\n            \"arn:aws:ec2:*:your-account-id:route-table/*\",\n            \"arn:aws:ec2:*:your-account-id:vpc/*\"\n        ]\n    },\n    {\n        \"Sid\": \"VisualEditor6\",\n        \"Effect\": \"Allow\",\n        \"Action\": \"ec2:AllocateAddress\",\n        \"Resource\": [\n            \"arn:aws:ec2:*:your-account-id:ipv4pool-ec2/*\",\n            \"arn:aws:ec2:*:your-account-id:elastic-ip/*\"\n        ]\n    },\n    {\n        \"Sid\": \"VisualEditor7\",\n        \"Effect\": \"Allow\",\n        \"Action\": \"ec2:ReleaseAddress\",\n        \"Resource\": \"arn:aws:ec2:*:your-account-id:elastic-ip/*\"\n    },\n    {\n        \"Sid\": \"VisualEditor8\",\n        \"Effect\": \"Allow\",\n        \"Action\": \"ec2:CreateNatGateway\",\n        \"Resource\": [\n            \"arn:aws:ec2:*:your-account-id:subnet/*\",\n            \"arn:aws:ec2:*:your-account-id:natgateway/*\",\n            \"arn:aws:ec2:*:your-account-id:elastic-ip/*\"\n        ]\n    }\n]\n}\n
        1. Add the Role ARN and name here in the variables.tf file by updating these lines. Move the structure inside the default list and replace the role ARN and name values with the values for the role you are using.

    7. Navigate into the ai-ml/trainium-inferentia/ directory and run install.sh script.

      cd data-on-eks/ai-ml/trainium-inferentia/\n./install.sh\n

      Note: This step takes about 12-15 minutes to deploy the EKS infrastructure and cluster in the AWS account. To view more details on cluster creation, view an example here: Deploy Llama3 on EKS in the prerequisites section.

    8. After the cluster is created, navigate to the Karpenter EC2 node IAM role called karpenter-trainium-inferentia-XXXXXXXXXXXXXXXXXXXXXXXXX. Attach the following inline policy to the role:

      {\n    \"Version\": \"2012-10-17\",\n    \"Statement\": [\n        {\n            \"Sid\": \"Statement1\",\n            \"Effect\": \"Allow\",\n            \"Action\": [\n                \"iam:CreateServiceLinkedRole\"\n            ],\n            \"Resource\": \"*\"\n        }\n    ]\n}\n
    "},{"location":"misc/the-diy-version-w-gory-details.html","title":"The diy version w gory details","text":""},{"location":"misc/the-diy-version-w-gory-details.html#the-diy-version-with-gory-details","title":"The DIY version (with gory details)","text":"

    Follow the prerequisites below to set up your environment before running the code:

    1. Python 3.11: Setup a Python 3.11 virtual environment and install FMBench.

      python -m venv .fmbench\nsource .fmbench/bin/activate\npip install fmbench\n
    2. S3 buckets for test data, scripts, and results: Create two buckets within your AWS account:

      • Read bucket: This bucket contains tokenizer files, prompt template, source data and deployment scripts stored in a directory structure as shown below. FMBench needs to have read access to this bucket.

        s3://<read-bucket-name>\n    \u251c\u2500\u2500 source_data/\n    \u251c\u2500\u2500 source_data/<source-data-file-name>.json\n    \u251c\u2500\u2500 prompt_template/\n    \u251c\u2500\u2500 prompt_template/prompt_template.txt\n    \u251c\u2500\u2500 scripts/\n    \u251c\u2500\u2500 scripts/<deployment-script-name>.py\n    \u251c\u2500\u2500 tokenizer/\n    \u251c\u2500\u2500 tokenizer/tokenizer.json\n    \u251c\u2500\u2500 tokenizer/config.json\n
        • The details of the bucket structure are as follows:

          1. Source Data Directory: Create a source_data directory that stores the dataset you want to benchmark with. FMBench uses Q&A datasets from the LongBench dataset or alternatively from this link. Support for bringing your own dataset will be added soon.

            • Download the different files specified in the LongBench dataset into the source_data directory. Following is a good list to get started with:

              • 2wikimqa
              • hotpotqa
              • narrativeqa
              • triviaqa

              Store these files in the source_data directory.
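              If you prefer to pull these subsets programmatically rather than downloading the files by hand, the following is a minimal sketch of one way to do it, assuming the datasets Python package and that the LongBench subsets are published on the Hugging Face Hub under THUDM/LongBench; the repository id, split name and output directory are assumptions, so verify them against the LongBench documentation.

```python
# Sketch only: pull a few LongBench subsets and write them into source_data/.
# The repo id "THUDM/LongBench", the "test" split and the output directory are
# assumptions; older datasets versions (or trust_remote_code=True) may be needed.
from pathlib import Path

from datasets import load_dataset

subsets = ["2wikimqa", "hotpotqa", "narrativeqa", "triviaqa"]
out_dir = Path("source_data")
out_dir.mkdir(exist_ok=True)

for name in subsets:
    ds = load_dataset("THUDM/LongBench", name, split="test")
    # One JSON object per line, e.g. source_data/2wikimqa.json
    ds.to_json(str(out_dir / f"{name}.json"), lines=True)
```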

          2. Prompt Template Directory: Create a prompt_template directory that contains a prompt_template.txt file. This .txt file contains the prompt template that your specific model supports. FMBench already supports the prompt template compatible with Llama models.

          3. Scripts Directory: FMBench also supports a bring your own script (BYOS) mode for deploying models that are not natively available via SageMaker JumpStart, i.e., anything not included in this list. Here are the steps to use BYOS.

            1. Create a Python script to deploy your model on a SageMaker endpoint. This script needs to have a deploy function that 2_deploy_model.ipynb can invoke. See p4d_hf_tgi.py for reference; a minimal illustrative sketch is also shown after the list of supported inference scripts below.

            2. Place your deployment script in the scripts directory in your read bucket. If your script deploys a model directly from HuggingFace and needs to have access to a HuggingFace auth token, then create a file called hf_token.txt and put the auth token in that file. The .gitignore file in this repo has rules to not commit the hf_token.txt to the repo. Today, FMBench provides inference scripts for:

              • All SageMaker Jumpstart Models
              • Text-Generation-Inference (TGI) container supported models
              • Deep Java Library DeepSpeed container supported models

              Deployment scripts for the options above are available in the scripts directory; you can use these as references for creating your own deployment scripts as well.
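              To make the expected shape of a BYOS deployment script concrete, here is a minimal sketch. Only the existence of a deploy function comes from the step above; the parameter names, container versions and return value are assumptions, so treat p4d_hf_tgi.py in the scripts directory as the authoritative reference.

```python
# Hypothetical BYOS deployment script (e.g. scripts/my_model_deploy.py).
# The deploy() function name comes from the docs above; its arguments and
# return value are assumptions, not FMBench's actual contract.
from sagemaker.huggingface import HuggingFaceModel


def deploy(experiment_config: dict, role_arn: str) -> dict:
    """Deploy a Hugging Face model on a SageMaker endpoint and return its name."""
    model = HuggingFaceModel(
        role=role_arn,
        env={"HF_MODEL_ID": experiment_config["model_id"]},
        transformers_version="4.37.0",
        pytorch_version="2.1.0",
        py_version="py310",
    )
    predictor = model.deploy(
        initial_instance_count=1,
        instance_type=experiment_config["instance_type"],
        endpoint_name=experiment_config["endpoint_name"],
    )
    return {"endpoint_name": predictor.endpoint_name}
```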

          4. Tokenizer Directory: Place the tokenizer.json, config.json and any other files required for your model's tokenizer in the tokenizer directory. The tokenizer for your model should be compatible with the tokenizers package. FMBench uses AutoTokenizer.from_pretrained to load the tokenizer. As an example, to use the Llama 2 tokenizer for counting prompt and generation tokens for the Llama 2 family of models: accept the license here (meta approval form), download the tokenizer.json and config.json files from the Hugging Face website and place them in the tokenizer directory. A quick sanity-check sketch follows.
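            As a quick sanity check that the files you placed in the tokenizer directory will load, you can open them with AutoTokenizer.from_pretrained (which, per the note above, is what FMBench uses) and count the tokens in a prompt; the local path below is an assumption, so point it at a local copy of your tokenizer directory.

```python
# Sanity-check sketch: load the tokenizer files (tokenizer.json, config.json)
# from a local copy of the tokenizer directory and count tokens in a prompt.
# The "./tokenizer" path is an assumption.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./tokenizer")

prompt = "What is the capital of France?"
print(f"prompt token count: {len(tokenizer.encode(prompt))}")
```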

      • Write bucket: All prompt payloads, model endpoint information and metrics generated by FMBench are stored in this bucket. FMBench requires write permissions to store the results in this bucket. No directory structure needs to be pre-created in this bucket; everything is created by FMBench at runtime.

        ```{.bash}
        s3://<write-bucket-name>
            ├── 
            ├── /data
            ├── /data/metrics
            ├── /data/models
            ├── /data/prompts
        ```
        "},{"location":"index.html","title":"Benchmark foundation models on AWS","text":"

        FMBench is a Python package for running performance benchmarks for any Foundation Model (FM) deployed on any AWS Generative AI service, be it Amazon SageMaker, Amazon Bedrock, Amazon EKS, or Amazon EC2. The FMs can be deployed on these platforms directly through FMBench, or, if they are already deployed, they can be benchmarked through the Bring your own endpoint mode supported by FMBench.

        Here are some salient features of FMBench:

        1. Highly flexible: it allows for using any combination of instance types (g5, p4d, p5, Inf2), inference containers (DeepSpeed, TensorRT, HuggingFace TGI and others) and parameters such as tensor parallelism, rolling batch, etc., as long as those are supported by the underlying platform.

        2. Benchmark any model: it can be used to benchmark open-source models, third party models, and proprietary models trained by enterprises on their own data.

        3. Run anywhere: it can be run on any AWS platform where we can run Python, such as Amazon EC2, Amazon SageMaker, or even the AWS CloudShell. It is important to run this tool on an AWS platform so that internet round trip time does not get included in the end-to-end response time latency.

        "},{"location":"index.html#the-need-for-benchmarking","title":"The need for benchmarking","text":"

        Customers often wonder which AWS service is best for running FMs for their specific use-case and their specific price performance requirements. While model evaluation metrics are available on several leaderboards (HELM, LMSys), price performance comparisons can be notoriously hard to find and even harder to trust. In such a scenario, we think it is best to be able to run performance benchmarking yourself, either on your own dataset or on similar (task-wise, prompt-size-wise) open-source datasets such as LongBench or QMSum. This is the problem that FMBench solves.

        "},{"location":"index.html#fmbench-an-open-source-python-package-for-fm-benchmarking-on-aws","title":"FMBench: an open-source Python package for FM benchmarking on AWS","text":"

        FMBench runs inference requests against endpoints that are either deployed through FMBench itself (as in the case of SageMaker), available as a fully-managed endpoint (as in the case of Bedrock), or provided as a bring your own endpoint. Metrics such as inference latency, transactions per minute, error rates and cost per transaction are captured and presented in the form of a Markdown report containing explanatory text, tables and figures. The figures and tables in the report provide insights into what might be the best serving stack (instance type, inference container and configuration parameters) for a given FM for a given use-case.
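        To make the reported metrics concrete, the sketch below rolls a handful of hypothetical per-request records into latency percentiles, transactions per minute and cost per transaction. The numbers and field names are made up for illustration; this is not FMBench's internal code.

```python
# Illustration only -- not FMBench's implementation. Rolls hypothetical
# per-request records into the kind of metrics described above.
import statistics

records = [  # hypothetical per-request results
    {"latency_s": 1.8, "ok": True},
    {"latency_s": 2.1, "ok": True},
    {"latency_s": 2.4, "ok": False},
    {"latency_s": 1.9, "ok": True},
]
test_window_minutes = 2.0        # assumed length of the test window
instance_cost_per_hour = 10.0    # hypothetical instance price in USD

latencies = [r["latency_s"] for r in records]
successes = sum(r["ok"] for r in records)

metrics = {
    "latency_mean": statistics.mean(latencies),
    "latency_p95": statistics.quantiles(latencies, n=20)[18],  # 95th percentile
    "error_rate": 1 - successes / len(records),
    "transactions_per_minute": successes / test_window_minutes,
    "price_per_txn": (instance_cost_per_hour / 60) * test_window_minutes / successes,
}
print(metrics)
```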

        The following figure gives an example of the price performance numbers that include inference latency, transactions per minute and concurrency level for running the Llama2-13b model on different instance types available on SageMaker, using prompts for a Q&A task created from the LongBench dataset; these prompts are between 3000 and 3840 tokens in length. Note that the numbers are hidden in this figure but you would be able to see them when you run FMBench yourself.

        The following table (also included in the report) provides information about the best available instance type for that experiment1.

        Information | Value
        --- | ---
        experiment_name | llama2-13b-inf2.24xlarge
        payload_file | payload_en_3000-3840.jsonl
        instance_type | ml.inf2.24xlarge
        concurrency | **
        error_rate | **
        prompt_token_count_mean | 3394
        prompt_token_throughput | 2400
        completion_token_count_mean | 31
        completion_token_throughput | 15
        latency_mean | **
        latency_p50 | **
        latency_p95 | **
        latency_p99 | **
        transactions_per_minute | **
        price_per_txn | **

        1 ** values hidden on purpose, these are available when you run the tool yourself.

        The report also includes latency vs. prompt size charts for different concurrency levels. As expected, inference latency increases as prompt size increases, but what is interesting to note is that the increase is much steeper at higher concurrency levels (and this behavior varies with instance types).

        "},{"location":"index.html#determine-the-optimal-model-for-your-generative-ai-workload","title":"Determine the optimal model for your generative AI workload","text":"

        Use FMBench to determine model accuracy using a panel of LLM evaluators (PoLL [1]). Here is one of the plots generated by FMBench to help answer the accuracy question for various FMs on Amazon Bedrock (the model ids in the charts have been blurred out on purpose, you can find them in the actual plot generated on running FMBench).

        "},{"location":"index.html#references","title":"References","text":"

        [1] Pat Verga et al., \"Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models\", arXiv:2404.18796, 2024.

        "},{"location":"accuracy.html","title":"Model evaluations using panel of LLM evaluators","text":"

        FMBench release 2.0.0 adds support for evaluating candidate models using Majority Voting with a Panel of LLM Evaluators (PoLL). It gathers quantitative metrics such as Cosine Similarity and overall majority voting accuracy metrics to measure the similarity and accuracy of model responses compared to the ground truth.

        Accuracy is defined as the percentage of responses generated by the LLM that match the ground truth included in the dataset (as a separate column). In order to determine if an LLM generated response matches the ground truth, we ask other LLMs, called the evaluator LLMs, to compare the LLM output and the ground truth and provide a verdict on whether the LLM generated response is correct or not given the ground truth. Here is the link to the Anthropic Claude 3 Sonnet model prompt being used as an evaluator (or a judge model). A combination of the cosine similarity and the LLM evaluator verdict decides if the LLM generated response is correct or incorrect. Finally, a single LLM evaluator could be biased or have inaccuracies, so instead of relying on the judgement of a single evaluator, we rely on the majority vote of 3 different LLM evaluators. By default we use the Anthropic Claude 3 Sonnet, Meta Llama3-70b and the Cohere Command R plus model as LLM evaluators. See Pat Verga et al., \"Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models\", arXiv:2404.18796, 2024, for more details on using a Panel of LLM Evaluators (PoLL).
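        The sketch below illustrates the decision logic just described, combining a panel majority vote with a cosine similarity check. The 0.5 threshold, the verdict strings and the category names are assumptions chosen for illustration; this is not FMBench's actual implementation.

```python
# Illustration of the PoLL idea described above -- not FMBench's code.
# Threshold, verdict strings and category names are assumptions.
from collections import Counter


def final_category(evaluator_verdicts: list[str], cosine_similarity: float,
                   threshold: float = 0.5) -> str:
    """Combine the majority vote of the evaluator LLMs with cosine similarity."""
    majority = Counter(evaluator_verdicts).most_common(1)[0][0]
    if majority == "correct" and cosine_similarity >= threshold:
        return "correctly correct"
    if majority == "incorrect" and cosine_similarity < threshold:
        return "correctly incorrect"
    return "needs further evaluation"


# Verdicts from three evaluator LLMs for a single candidate-model response.
print(final_category(["correct", "correct", "incorrect"], cosine_similarity=0.82))
```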

        "},{"location":"accuracy.html#evaluation-flow","title":"Evaluation Flow","text":"
        1. Provide a dataset that includes ground truth responses for each sample. FMBench uses the LongBench dataset by default.

        2. Configure the candidate models to be evaluated in the FMBench config file. See this config file for an example that runs evaluations for multiple models available via Amazon Bedrock. Running evaluations only requires the following two changes to the config file:

          • Set 4_get_evaluations.ipynb: yes, see this line.
          • Set the ground_truth_col_key: answers and question_col_key: input parameters, see this line. The values of ground_truth_col_key and question_col_key are set to the names of the columns in the dataset that contain the ground truth and the question respectively.
        3. Run FMBench, which will:

          • Fetch the inference results containing the model responses
          • Calculate quantitative metrics (Cosine Similarity)
          • Use a Panel of LLM Evaluators to compare each model response to the ground truth
          • Have each LLM evaluator provide a binary verdict (correct/incorrect) and an explanation
          • Validate the LLM evaluations using Cosine Similarity thresholds
          • Categorize the final evaluation for each response as correctly correct, correctly incorrect, or needs further evaluation

        4. Review the FMBench report to analyze the evaluation results and compare the performance of the candidate models. The report contains tables and charts that provide insights into model accuracy.

        By leveraging ground truth data and a Panel of LLM Evaluators, FMBench provides a comprehensive and efficient way to assess the quality of generative AI models. The majority voting approach, combined with quantitative metrics, enables a robust evaluation that reduces bias and latency while maintaining consistency across responses.

        "},{"location":"advanced.html","title":"Advanced","text":"

        Beyond running FMBench with the configuration files provided, you may want to try bringing your own dataset or endpoint to FMBench.

        "},{"location":"analytics.html","title":"Generate downstream summarized reports for further analysis","text":"

        You can use the results from several FMBench runs to generate a summarized report of all runs based on your cost, latency, and concurrency budgets. This report helps answer the following question:

        \u201cWhat is the minimum number of instances N, of the most cost optimal instance type T, that are needed to serve a real-time workload W while keeping the average transaction latency under L seconds?\u201d

        W := {R transactions per minute, average prompt token length P, average generation token length G}\n
        • With this summarized report, we test the following hypothesis: at the low end of the total number of requests per minute, smaller instances that provide good inference latency at low concurrencies suffice (said another way, the larger more expensive instances are overkill at this stage), but as the number of requests per minute increases there comes an inflection point beyond which so many of the smaller instances would be required that it becomes more economical to use fewer of the larger, more expensive instances. The toy calculation below illustrates this.
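        The per-instance throughput and hourly prices in this sketch are made-up assumptions, not benchmark results; it only shows the shape of the inflection point: the smaller instance is cheaper at low request rates, while past a certain rate the larger instance needs far fewer copies and becomes the more economical choice.

```python
# Toy numbers only -- they do not come from any FMBench run.
import math

instance_types = {
    # name: (requests per minute one instance can sustain, hourly cost in USD)
    "smaller_instance": (20, 2.0),
    "larger_instance": (150, 12.0),
}

for workload_rpm in (10, 100, 1_000, 10_000):
    summary = {}
    for name, (rpm_per_instance, hourly_cost) in instance_types.items():
        count = math.ceil(workload_rpm / rpm_per_instance)
        summary[name] = (count, round(count * hourly_cost, 2))
    print(workload_rpm, summary)
```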
        "},{"location":"analytics.html#an-example-report-that-gets-generated-is-as-follows","title":"An example report that gets generated is as follows:","text":""},{"location":"analytics.html#summary-for-payload-payload_en_x-y","title":"Summary for payload: payload_en_x-y","text":"
        • The metrics in the table below are examples and do not represent any specific model or instance type. The table can be used to analyze cost and the number of instances to maintain for your use case. For example, instance_type_1 costs 10 dollars and requires 1 instance to host model_1 up to 100 requests per minute. As the requests scale to 1,000 requests per minute, 5 instances are required at a cost of 50 dollars. As the requests scale to 10,000 requests per minute, the number of instances to maintain scales to 30, and the cost becomes 450 dollars.

        • On the other hand, instance_type_2 is more costly, with a price of $499 at 10,000 requests per minute to host the same model, but it only requires 22 instances to maintain, which is 8 fewer than when the model is hosted on instance_type_1.

        • Based on these summaries, users can make decisions based on their use case priorities. For a real-time and latency sensitive application, a user might select instance_type_2 to host model_1 since the user would have to maintain 8 fewer instances than when hosting the model on instance_type_1. Hosting the model on instance_type_2 would also maintain a p_95 latency of 0.5s, which is half that of instance_type_1 (p_95 latency: 1s), even though it costs more than instance_type_1. On the other hand, if the application is cost sensitive and the user is flexible about maintaining more instances at a higher latency, they might want to shift gears to using instance_type_1.

        • Note: Based on varying needs for prompt size, cost, and latency, the table might change.

        experiment_name | instance_type | concurrency | latency_p95 | transactions_per_minute | instance_count_and_cost_1_rpm | instance_count_and_cost_10_rpm | instance_count_and_cost_100_rpm | instance_count_and_cost_1000_rpm | instance_count_and_cost_10000_rpm
        --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
        model_1 | instance_type_1 | 1 | 1.0 | _ | (1, 10) | (1, 10) | (1, 10) | (5, 50) | (30, 450)
        model_1 | instance_type_2 | 1 | 0.5 | _ | (1, 10) | (1, 20) | (1, 20) | (6, 47) | (22, 499)
        "},{"location":"analytics.html#fmbench-heatmap","title":"FMBench Heatmap","text":"

        This step also generates a heatmap that contains information about each instance and how much it costs, with a per requests-per-minute (rpm) breakdown. The default breakdown is [1 rpm, 10 rpm, 100 rpm, 1000 rpm, 10000 rpm]. View an example of a heatmap below. The model name and instance type are masked in the example, but the heatmap can be generated for your specific use case/requirements.

        "},{"location":"analytics.html#steps-to-run-analytics","title":"Steps to run analytics","text":"
        1. Clone the FMBench repo from GitHub.

        2. Place all of the result-{model-id}-... folders that are generated from various runs in the top level directory.

        3. Run the following command to generate downstream analytics and summarized tables. Replace x, y, z and model_id with the latency threshold, the concurrency threshold, the payload file of interest (for example payload_en_1000-2000.jsonl) and the model_id respectively. The model_id has to be appended to the results-{model-id} folders so that analytics.py can generate a report for all of those result folders.

          python analytics/analytics.py --latency-threshold x --concurrency-threshold y  --payload-file z --model-id model_id\n
        "},{"location":"announcement.html","title":"Release 2.0 announcement","text":"

        We are excited to share news about a major FMBench release: we now have release 2.0 of FMBench, which supports model evaluations through a panel of LLM evaluators \ud83c\udf89. With the recent feature additions to FMBench we are already seeing increased interest from customers and hope to reach even more customers and have an even greater impact. Check out all the latest and greatest features from FMBench on the FMBench website.

        Support for Model Evaluations: FMBench now adds support for evaluating candidate models using Majority Voting with a Panel of LLM Evaluators. Customers can now use FMBench to evaluate model accuracy across open-source and custom datasets, thus FMBench now enables customers to not only measure performance (inference latency, cost, throughput) but also model accuracy.

        Native support for LLM compilation and deployment on AWS Silicon: FMBench now supports end-to-end compilation and model deployment on AWS Silicon. Customers no longer have to wait for models to be available for AWS Chips via SageMaker JumpStart and neither do they have to go through the process of compiling the model to Neuron themselves, FMBench does it all for them. We can simply put the relevant configuration options in the FMBench config file and it will compile and deploy the model on SageMaker (config) or EC2 (config).

        Website for better user experience: FMBench has a website now along with an introduction video. The website is fully searchable to ease common tasks such as installation, finding the right config file, benchmarking on various hosting platforms (EC2, EKS, Bedrock, Neuron, Docker), model evaluation, etc. This website was created based on feedback from several internal teams and external customers.

        Native support for all AWS generative AI services: FMBench now benchmarks and evaluates any Foundation Model (FM) deployed on any AWS Generative AI service, be it Amazon SageMaker, Amazon Bedrock, Amazon EKS, or Amazon EC2. We initially built FMBench for SageMaker, and later extended it to Bedrock and then based on customer requests extended it to support models on EKS and EC2 as well. See list of config files supported out of the box, you can use these config files either as is or as templates for creating your own custom config.

        "},{"location":"benchmarking.html","title":"Benchmark models deployed on different AWS Generative AI services","text":"

        FMBench comes packaged with configuration files for benchmarking models on different AWS Generative AI services.

        "},{"location":"benchmarking.html#full-list-of-benchmarked-models","title":"Full list of benchmarked models","text":"Model EC2 g5 EC2 p4 EC2 p5 EC2 Inf2/Trn1 SageMaker g4dn/g5/p3 SageMaker Inf2/Trn1 SageMaker P4 SageMaker P5 Bedrock On-demand throughput Bedrock provisioned throughput Anthropic Claude-3 Sonnet \u2705 \u2705 Anthropic Claude-3 Haiku \u2705 Mistral-7b-instruct \u2705 \u2705 \u2705 \u2705 \u2705 Mistral-7b-AWQ \u2705 Mixtral-8x7b-instruct \u2705 Llama3.1-8b instruct \u2705 \u2705 \u2705 \u2705 \u2705 \u2705 \u2705 Llama3.1-70b instruct \u2705 \u2705 \u2705 Llama3-8b instruct \u2705 \u2705 \u2705 \u2705 \u2705 \u2705 \u2705 Llama3-70b instruct \u2705 \u2705 \u2705 \u2705 \u2705 Llama2-13b chat \u2705 \u2705 \u2705 \u2705 Llama2-70b chat \u2705 \u2705 \u2705 \u2705 Amazon Titan text lite \u2705 Amazon Titan text express \u2705 Cohere Command text \u2705 Cohere Command light text \u2705 AI21 J2 Mid \u2705 AI21 J2 Ultra \u2705 Gemma-2b \u2705 Phi-3-mini-4k-instruct \u2705 distilbert-base-uncased \u2705"},{"location":"benchmarking_on_bedrock.html","title":"Benchmark models on Bedrock","text":"

        Choose any config file from the bedrock folder and either run these directly or use them as templates for creating new config files specific to your use-case. Here is an example for benchmarking the Llama3 models on Bedrock.

        fmbench --config-file https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/src/fmbench/configs/bedrock/config-bedrock-llama3.yml > fmbench.log 2>&1\n
        "},{"location":"benchmarking_on_ec2.html","title":"Benchmark models on EC2","text":"

        You can use FMBench to benchmark models hosted on EC2. This can be done in one of two ways:

        • Deploy the model on your EC2 instance independently of FMBench and then benchmark it through the Bring your own endpoint mode.
        • Deploy the model on your EC2 instance through FMBench and then benchmark it.

        The steps for deploying the model on your EC2 instance are described below.

        \ud83d\udc49 In this configuration both the model being benchmarked and FMBench are deployed on the same EC2 instance.

        Create a new EC2 instance suitable for hosting an LMI as per the steps described here. Note that you will need to select the correct AMI based on your instance type, this is called out in the instructions.

        The steps for benchmarking on different types of EC2 instances (GPU/CPU/Neuron) and different inference containers differ slightly. These are all described below.

        "},{"location":"benchmarking_on_ec2.html#benchmarking-options-on-ec2","title":"Benchmarking options on EC2","text":"
        • Benchmarking on an instance type with NVIDIA GPUs or AWS Chips
        • Benchmarking on an instance type with NVIDIA GPU and the Triton inference server
        • Benchmarking on an instance type with AWS Chips and the Triton inference server
        • Benchmarking on a CPU instance type with AMD processors
        • Benchmarking on a CPU instance type with Intel processors

        • Benchmarking the Triton inference server

        "},{"location":"benchmarking_on_ec2.html#benchmarking-on-an-instance-type-with-nvidia-gpus-or-aws-chips","title":"Benchmarking on an instance type with NVIDIA GPUs or AWS Chips","text":"
        1. Connect to your instance using any of the options in EC2 (SSH/EC2 Connect), run the following in the EC2 terminal. This command installs Anaconda on the instance which is then used to create a new conda environment for FMBench.

          wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh\nbash Miniconda3-latest-Linux-x86_64.sh -b  # Run the Miniconda installer in batch mode (no manual intervention)\nrm -f Miniconda3-latest-Linux-x86_64.sh    # Remove the installer script after installation\neval \"$(/home/$USER/miniconda3/bin/conda shell.bash hook)\" # Initialize conda for bash shell\nconda init  # Initialize conda, adding it to the shell  \n
        2. Install docker-compose.

          sudo apt-get update\nsudo apt-get install --reinstall docker.io -y\nsudo apt-get install -y docker-compose\ndocker compose version \n
        3. Setup the fmbench_python311 conda environment.

          conda create --name fmbench_python311 -y python=3.11 ipykernel\nsource activate fmbench_python311;\npip install -U fmbench\n
        4. Create local directory structure needed for FMBench and copy all publicly available dependencies from the AWS S3 bucket for FMBench. This is done by running the copy_s3_content.sh script available as part of the FMBench repo. Replace /tmp in the command below with a different path if you want to store the config files and the FMBench generated data in a different directory.

          curl -s https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/copy_s3_content.sh | sh -s -- /tmp\n
        5. To download the model files from HuggingFace, create a hf_token.txt file in the /tmp/fmbench-read/scripts/ directory containing the Hugging Face token you would like to use. In the command below replace the hf_yourtokenstring with your Hugging Face token.

          echo hf_yourtokenstring > /tmp/fmbench-read/scripts/hf_token.txt\n
        6. Run FMBench with a packaged or a custom config file. This step will also deploy the model on the EC2 instance. The --write-bucket parameter value is just a placeholder and an actual S3 bucket is not required. Skip to the next step if benchmarking for AWS Chips. You could set the --tmp-dir flag to an EFA path instead of /tmp if using a shared path for storing config files and reports.

          fmbench --config-file /tmp/fmbench-read/configs/llama3/8b/config-ec2-llama3-8b.yml --local-mode yes --write-bucket placeholder --tmp-dir /tmp > fmbench.log 2>&1\n
        7. For example, to run FMBench on a llama3-8b-Instruct model on an inf2.48xlarge instance, run the command below. The config file for this example can be viewed here.

          fmbench --config-file /tmp/fmbench-read/configs/llama3/8b/config-ec2-llama3-8b-inf2-48xl.yml --local-mode yes --write-bucket placeholder --tmp-dir /tmp > fmbench.log 2>&1\n
        8. Open a new Terminal and do a tail on fmbench.log to see a live log of the run.

          tail -f fmbench.log\n
        9. All metrics are stored in the /tmp/fmbench-write directory created automatically by the fmbench package. Once the run completes all files are copied locally in a results-* folder as usual.

        "},{"location":"benchmarking_on_ec2.html#benchmarking-on-an-instance-type-with-nvidia-gpu-and-the-triton-inference-server","title":"Benchmarking on an instance type with NVIDIA GPU and the Triton inference server","text":"
        1. Follow steps in the Benchmarking on an instance type with NVIDIA GPUs or AWS Chips section to install FMBench but do not run any benchmarking tests yet.

        2. Once FMBench is installed then install the following additional dependencies for Triton.

          cd ~\ngit clone https://github.com/triton-inference-server/tensorrtllm_backend.git  --branch v0.12.0\n# Update the submodules\ncd tensorrtllm_backend\n# Install git-lfs if needed\napt-get update && apt-get install git-lfs -y --no-install-recommends\ngit lfs install\ngit submodule update --init --recursive\n
        3. Now you are ready to run benchmarking with Triton. For example for benchmarking Llama3-8b model on a g5.12xlarge use the following command:

          fmbench --config-file /tmp/fmbench-read/configs/llama3/8b/config-llama3-8b-g5.12xl-tp-2-mc-max-triton-ec2.yml --local-mode yes --write-bucket placeholder --tmp-dir /tmp > fmbench.log 2>&1\n
        "},{"location":"benchmarking_on_ec2.html#benchmarking-on-an-instance-type-with-aws-chips-and-the-triton-inference-server","title":"Benchmarking on an instance type with AWS Chips and the Triton inference server","text":"

        As of 2024-09-26 this has been tested on a trn1.32xlarge instance

        1. Connect to your instance using any of the options in EC2 (SSH/EC2 Connect), run the following in the EC2 terminal. This command installs Anaconda on the instance which is then used to create a new conda environment for FMBench. See instructions for downloading anaconda here. (Note: Configure the storage of your EC2 instance to 500GB for this test)

          # Install Docker and Git using the YUM package manager\nsudo yum install docker git -y\n\n# Start the Docker service\nsudo systemctl start docker\n\n# Download the Miniconda installer for Linux\nwget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh\nbash Miniconda3-latest-Linux-x86_64.sh -b  # Run the Miniconda installer in batch mode (no manual intervention)\nrm -f Miniconda3-latest-Linux-x86_64.sh    # Remove the installer script after installation\neval \"$(/home/$USER/miniconda3/bin/conda shell.bash hook)\" # Initialize conda for bash shell\nconda init  # Initialize conda, adding it to the shell\n
        2. Setup the fmbench_python311 conda environment.

          # Create a new conda environment named 'fmbench_python311' with Python 3.11 and ipykernel\nconda create --name fmbench_python311 -y python=3.11 ipykernel\n\n# Activate the newly created conda environment\nsource activate fmbench_python311\n\n# Upgrade pip and install the fmbench package\npip install -U fmbench\n
        3. First we need to build the required docker image for triton, and push it locally. To do this, curl the Triton Dockerfile and the script to build and push the triton image locally:

              # curl the docker file for triton\n    curl -o ./Dockerfile_triton https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/src/fmbench/scripts/triton/Dockerfile_triton\n\n    # curl the script that builds and pushes the triton image locally\n    curl -o build_and_push_triton.sh https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/src/fmbench/scripts/triton/build_and_push_triton.sh\n\n    # Make the triton build and push script executable, and run it\n    chmod +x build_and_push_triton.sh\n    ./build_and_push_triton.sh\n
          - Now wait until the docker image is saved locally and then follow the instructions below to start a benchmarking test.

        4. Create local directory structure needed for FMBench and copy all publicly available dependencies from the AWS S3 bucket for FMBench. This is done by running the copy_s3_content.sh script available as part of the FMBench repo. Replace /tmp in the command below with a different path if you want to store the config files and the FMBench generated data in a different directory.

          curl -s https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/copy_s3_content.sh | sh -s -- /tmp\n
        5. To download the model files from HuggingFace, create a hf_token.txt file in the /tmp/fmbench-read/scripts/ directory containing the Hugging Face token you would like to use. In the command below replace the hf_yourtokenstring with your Hugging Face token.

          echo hf_yourtokenstring > /tmp/fmbench-read/scripts/hf_token.txt\n
        6. Run FMBench with a packaged or a custom config file. This step will also deploy the model on the EC2 instance. The --write-bucket parameter value is just a placeholder and an actual S3 bucket is not required. You could set the --tmp-dir flag to an EFA path instead of /tmp if using a shared path for storing config files and reports.

          fmbench --config-file /tmp/fmbench-read/configs/llama3/8b/config-llama3-8b-trn1-32xlarge-triton-vllm.yml --local-mode yes --write-bucket placeholder --tmp-dir /tmp > fmbench.log 2>&1\n
        7. Open a new Terminal and do a tail on fmbench.log to see a live log of the run.

          tail -f fmbench.log\n
        8. All metrics are stored in the /tmp/fmbench-write directory created automatically by the fmbench package. Once the run completes all files are copied locally in a results-* folder as usual.

        "},{"location":"benchmarking_on_ec2.html#benchmarking-on-an-cpu-instance-type-with-amd-processors","title":"Benchmarking on an CPU instance type with AMD processors","text":"

        As of 2024-08-27 this has been tested on a m7a.16xlarge instance

        1. Connect to your instance using any of the options in EC2 (SSH/EC2 Connect), run the following in the EC2 terminal. This command installs Anaconda on the instance which is then used to create a new conda environment for FMBench. See instructions for downloading anaconda here

          # Install Docker and Git using the YUM package manager\nsudo yum install docker git -y\n\n# Start the Docker service\nsudo systemctl start docker\n\n# Download the Miniconda installer for Linux\nwget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh\nbash Miniconda3-latest-Linux-x86_64.sh -b  # Run the Miniconda installer in batch mode (no manual intervention)\nrm -f Miniconda3-latest-Linux-x86_64.sh    # Remove the installer script after installation\neval \"$(/home/$USER/miniconda3/bin/conda shell.bash hook)\" # Initialize conda for bash shell\nconda init  # Initialize conda, adding it to the shell\n
        2. Setup the fmbench_python311 conda environment.

          # Create a new conda environment named 'fmbench_python311' with Python 3.11 and ipykernel\nconda create --name fmbench_python311 -y python=3.11 ipykernel\n\n# Activate the newly created conda environment\nsource activate fmbench_python311\n\n# Upgrade pip and install the fmbench package\npip install -U fmbench\n
        3. Build the vllm container for serving the model.

          1. \ud83d\udc49 The vllm container we are building locally is going to be referenced in the FMBench config file.

          2. The container being built is for CPU only (GPU support might be added in the future).

            # Clone the vLLM project repository from GitHub\ngit clone https://github.com/vllm-project/vllm.git\n\n# Change the directory to the cloned vLLM project\ncd vllm\n\n# Build a Docker image using the provided Dockerfile for CPU, with a shared memory size of 4GB\nsudo docker build -f Dockerfile.cpu -t vllm-cpu-env --shm-size=4g .\n
        4. Create local directory structure needed for FMBench and copy all publicly available dependencies from the AWS S3 bucket for FMBench. This is done by running the copy_s3_content.sh script available as part of the FMBench repo. Replace /tmp in the command below with a different path if you want to store the config files and the FMBench generated data in a different directory.

          curl -s https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/copy_s3_content.sh | sh -s -- /tmp\n
        5. To download the model files from HuggingFace, create a hf_token.txt file in the /tmp/fmbench-read/scripts/ directory containing the Hugging Face token you would like to use. In the command below replace the hf_yourtokenstring with your Hugging Face token.

          echo hf_yourtokenstring > /tmp/fmbench-read/scripts/hf_token.txt\n
        6. Before running FMBench, add the current user to the docker group. Run the following commands to run Docker without needing to use sudo each time.

          sudo usermod -a -G docker $USER\nnewgrp docker\n
        7. Install docker-compose.

          DOCKER_CONFIG=${DOCKER_CONFIG:-$HOME/.docker}\nmkdir -p $DOCKER_CONFIG/cli-plugins\nsudo curl -L https://github.com/docker/compose/releases/latest/download/docker-compose-$(uname -s)-$(uname -m) -o $DOCKER_CONFIG/cli-plugins/docker-compose\nsudo chmod +x $DOCKER_CONFIG/cli-plugins/docker-compose\ndocker compose version\n
        8. Run FMBench with a packaged or a custom config file. This step will also deploy the model on the EC2 instance. The --write-bucket parameter value is just a placeholder and an actual S3 bucket is not required. You could set the --tmp-dir flag to an EFA path instead of /tmp if using a shared path for storing config files and reports.

          fmbench --config-file /tmp/fmbench-read/configs/llama3/8b/config-ec2-llama3-8b-m7a-16xlarge.yml --local-mode yes --write-bucket placeholder --tmp-dir /tmp > fmbench.log 2>&1\n
        9. Open a new Terminal and do a tail on fmbench.log to see a live log of the run.

          tail -f fmbench.log\n
        10. All metrics are stored in the /tmp/fmbench-write directory created automatically by the fmbench package. Once the run completes all files are copied locally in a results-* folder as usual.

        "},{"location":"benchmarking_on_ec2.html#benchmarking-on-an-cpu-instance-type-with-intel-processors","title":"Benchmarking on an CPU instance type with Intel processors","text":"

        As of 2024-08-27 this has been tested on c5.18xlarge and m5.16xlarge instances

        1. Connect to your instance using any of the options in EC2 (SSH/EC2 Connect), run the following in the EC2 terminal. This command installs Anaconda on the instance which is then used to create a new conda environment for FMBench. See instructions for downloading anaconda here

          # Install Docker and Git using the YUM package manager\nsudo yum install docker git -y\n\n# Start the Docker service\nsudo systemctl start docker\n\n# Download the Miniconda installer for Linux\nwget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh\nbash Miniconda3-latest-Linux-x86_64.sh -b # Run the Miniconda installer in batch mode (no manual intervention)\nrm -f Miniconda3-latest-Linux-x86_64.sh    # Remove the installer script after installation\neval \"$(/home/$USER/miniconda3/bin/conda shell.bash hook)\" # Initialize conda for bash shell\nconda init  # Initialize conda, adding it to the shell\n
        2. Setup the fmbench_python311 conda environment.

          # Create a new conda environment named 'fmbench_python311' with Python 3.11 and ipykernel\nconda create --name fmbench_python311 -y python=3.11 ipykernel\n\n# Activate the newly created conda environment\nsource activate fmbench_python311\n\n# Upgrade pip and install the fmbench package\npip install -U fmbench\n
        3. Build the vllm container for serving the model.

          1. \ud83d\udc49 The vllm container we are building locally is going to be referenced in the FMBench config file.

          2. The container being built is for CPU only (GPU support might be added in the future).

            # Clone the vLLM project repository from GitHub\ngit clone https://github.com/vllm-project/vllm.git\n\n# Change the directory to the cloned vLLM project\ncd vllm\n\n# Build a Docker image using the provided Dockerfile for CPU, with a shared memory size of 12GB\nsudo docker build -f Dockerfile.cpu -t vllm-cpu-env --shm-size=12g .\n
        4. Create local directory structure needed for FMBench and copy all publicly available dependencies from the AWS S3 bucket for FMBench. This is done by running the copy_s3_content.sh script available as part of the FMBench repo. Replace /tmp in the command below with a different path if you want to store the config files and the FMBench generated data in a different directory.

          curl -s https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/copy_s3_content.sh | sh -s -- /tmp\n
        5. To download the model files from HuggingFace, create a hf_token.txt file in the /tmp/fmbench-read/scripts/ directory containing the Hugging Face token you would like to use. In the command below replace the hf_yourtokenstring with your Hugging Face token.

          echo hf_yourtokenstring > /tmp/fmbench-read/scripts/hf_token.txt\n
        6. Before running FMBench, add the current user to the docker group. Run the following commands to run Docker without needing to use sudo each time.

          sudo usermod -a -G docker $USER\nnewgrp docker\n
        7. Install docker-compose.

          DOCKER_CONFIG=${DOCKER_CONFIG:-$HOME/.docker}\nmkdir -p $DOCKER_CONFIG/cli-plugins\nsudo curl -L https://github.com/docker/compose/releases/latest/download/docker-compose-$(uname -s)-$(uname -m) -o $DOCKER_CONFIG/cli-plugins/docker-compose\nsudo chmod +x $DOCKER_CONFIG/cli-plugins/docker-compose\ndocker compose version\n
        8. Run FMBench with a packaged or a custom config file. This step will also deploy the model on the EC2 instance. The --write-bucket parameter value is just a placeholder and an actual S3 bucket is not required. You could set the --tmp-dir flag to an EFA path instead of /tmp if using a shared path for storing config files and reports.

          fmbench --config-file /tmp/fmbench-read/configs/llama3/8b/config-ec2-llama3-8b-c5-18xlarge.yml --local-mode yes --write-bucket placeholder --tmp-dir /tmp > fmbench.log 2>&1\n
        9. Open a new Terminal and do a tail on fmbench.log to see a live log of the run.

          tail -f fmbench.log\n
        10. All metrics are stored in the /tmp/fmbench-write directory created automatically by the fmbench package. Once the run completes all files are copied locally in a results-* folder as usual.

        "},{"location":"benchmarking_on_eks.html","title":"Benchmark models on EKS","text":"

        You can use FMBench to benchmark models hosted on EKS. This can be done in one of two ways:

        • Deploy the model on your EKS cluster independently of FMBench and then benchmark it through the Bring your own endpoint mode.
        • Deploy the model on your EKS cluster through FMBench and then benchmark it.

        The steps for deploying the model on your EKS cluster are described below.

        \ud83d\udc49 EKS cluster creation itself is not a part of the FMBench functionality; the cluster needs to exist before you run the following steps. Steps for cluster creation are provided in this file but it would be best to consult the DoEKS repo on GitHub for comprehensive instructions.

        1. Add the following IAM policies to your existing FMBench Role:

          1. AmazonEKSClusterPolicy: This policy provides Kubernetes the permissions it requires to manage resources on your behalf.

          2. AmazonEKS_CNI_Policy: This policy provides the Amazon VPC CNI Plugin (amazon-vpc-cni-k8s) the permissions it requires to modify the IP address configuration on your EKS worker nodes. This permission set allows the CNI to list, describe, and modify Elastic Network Interfaces on your behalf.

          3. AmazonEKSWorkerNodePolicy: This policy allows Amazon EKS worker nodes to connect to Amazon EKS Clusters.

        2. Once the EKS cluster is available you can either use the following two config files or create your own, using these files as examples, to run benchmarking for these models. These config files require that the EKS cluster has been created as per the steps in these instructions.

          1. config-llama3-8b-eks-inf2.yml: Deploy Llama3 on Trn1/Inf2 instances.

          2. config-mistral-7b-eks-inf2.yml: Deploy Mistral 7b on Trn1/Inf2 instances.

          For more information about the blueprints used by FMBench to deploy these models, view: DoEKS docs gen-ai.

        3. Run the Llama3-8b benchmarking using the command below (replace the config file as needed for a different model). This will first deploy the model on your EKS cluster and then run benchmarking on the deployed model.

          fmbench --config-file https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/src/fmbench/configs/llama3/8b/config-llama3-8b-eks-inf2.yml > fmbench.log 2>&1\n
        4. As the model is getting deployed you might want to run the following kubectl commands to monitor the deployment progress. Set the model_namespace to llama3 or mistral or a different model as appropriate.

          1. kubectl get pods -n <model_namespace> -w: Watch the pods in the model specific namespace.
          2. kubectl -n karpenter get pods: Get the pods in the karpenter namespace.
          3. kubectl describe pod -n <model_namespace> <pod-name>: Describe a specific pod in the model specific namespace to view the live logs.
        "},{"location":"benchmarking_on_sagemaker.html","title":"Benchmark models on SageMaker","text":"

        Choose any config file from the model specific folders, for example the Llama3 folder for the Llama3 family of models. These configuration files also include instructions for FMBench to first deploy the model on SageMaker using your configured instance type and inference parameters of choice and then run the benchmarking. Here is an example for benchmarking the Llama3-8b model on ml.inf2.24xlarge and ml.g5.12xlarge instances.

        fmbench --config-file https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/src/fmbench/configs/llama3/8b/config-llama3-8b-inf2-g5.yml > fmbench.log 2>&1\n
        "},{"location":"build.html","title":"Building the FMBench Python package","text":"

        If you would like to build a dev version of FMBench for your own development and testing purposes, the following steps describe how to do that.

        1. Clone the FMBench repo from GitHub.

        2. Make any code changes as needed.

        3. Install poetry.

          pip install poetry mkdocs-material mknotebooks\n
        4. Change directory to the FMBench repo directory and run poetry build.

          poetry build\n
        5. The .whl file is generated in the dist folder. Install the .whl in your current Python environment.

          pip install dist/fmbench-X.Y.Z-py3-none-any.whl\n
        6. Run FMBench as usual through the FMBench CLI command.

        7. You may have added new config files as part of your work; to make sure these files are called out in the manifest.txt, run the following command. This command will overwrite the existing manifest.txt and manifest.md files. Both of these files need to be committed to the repo. Reach out to the maintainers of this repo so that they can add new or modified config files to the blogs bucket (the CloudFormation stack would fail if a new file is added to the manifest but is not available for download through the S3 bucket).

          python create_manifest.py\n
        8. To create updated documentation run the following command. You need to be added as a contributor to the FMBench repo to be able to publish to the website; this command will not work otherwise.

          mkdocs gh-deploy\n
        "},{"location":"byo_dataset.html","title":"Bring your own dataset","text":"

        By default FMBench uses the LongBench dataset for testing the models, but this is not the only dataset you can test with. You may want to test with other datasets available on HuggingFace or use your own datasets for testing. You can do this by converting your dataset to the JSON lines format. We provide a code sample for converting any HuggingFace dataset into JSON lines format and uploading it to the S3 bucket used by FMBench in the bring_your_own_dataset notebook. Follow the steps described in the notebook to bring your own dataset for testing with FMBench.
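        As a rough illustration of what the notebook does, the sketch below converts a slice of a Hugging Face dataset to JSON lines and uploads it to the FMBench read bucket. The dataset id, slice size, bucket name and prefix are assumptions; the bring_your_own_dataset notebook remains the authoritative reference.

```python
# Sketch only -- the bring_your_own_dataset notebook is the authoritative
# version. Dataset id, slice, bucket and prefix below are assumptions.
import boto3
from datasets import load_dataset

ds = load_dataset("Open-Orca/OpenOrca", split="train[:1000]")
local_file = "openorca.jsonl"
ds.to_json(local_file, lines=True)  # one JSON object per line

s3 = boto3.client("s3")
s3.upload_file(local_file, "<read-bucket-name>", f"source_data/{local_file}")
```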

        "},{"location":"byo_dataset.html#support-for-open-orca-dataset","title":"Support for Open-Orca dataset","text":"

        For support for the Open-Orca dataset and the corresponding prompts for Llama3, Llama2 and Mistral, see:

        1. bring_your_own_dataset.ipynb
        2. prompt templates
        3. Llama3 config file with OpenOrca
        "},{"location":"byo_rest_predictor.html","title":"Bring your own REST Predictor (data-on-eks version)","text":"

        FMBench now provides an example of bringing your own endpoint as a REST Predictor for benchmarking. View this script as an example. This script is an inference file for the NousResearch/Llama-2-13b-chat-hf model deployed on an Amazon EKS cluster using Ray Serve. The model is deployed via data-on-eks which is a comprehensive resource for scaling your data and machine learning workloads on Amazon EKS and unlocking the power of Gen AI. Using data-on-eks, you can harness the capabilities of AWS Trainium, AWS Inferentia and NVIDIA GPUs to scale and optimize your Gen AI workloads and benchmark those models on FMBench with ease.

        "},{"location":"byoe.html","title":"Bring your own endpoint (a.k.a. support for external endpoints)","text":"

        If you have an endpoint deployed on say Amazon EKS or Amazon EC2 or have your models hosted on a fully-managed service such as Amazon Bedrock, you can still bring your endpoint to FMBench and run tests against your endpoint. To do this you need to do the following:

        1. Create a derived class from the FMBenchPredictor abstract class and provide an implementation for the constructor, the get_predictions method and the endpoint_name property. See SageMakerPredictor for an example. Save this file locally as, say, my_custom_predictor.py (a minimal sketch is shown after this list).

        2. Upload your new Python file (my_custom_predictor.py) for your custom FMBench predictor to your FMBench read bucket and the scripts prefix specified in the s3_read_data section (read_bucket and scripts_prefix).

        3. Edit the configuration file you are using for your FMBench for the following:

          • Skip the deployment step by setting the 2_deploy_model.ipynb step under run_steps to no.
          • Set the inference_script under any experiment in the experiments section for which you want to use your new custom inference script to point to your new Python file (my_custom_predictor.py) that contains your custom predictor.
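        To make step 1 concrete, here is a minimal sketch of what my_custom_predictor.py could look like for a REST endpoint listening on port 8080. The import path, constructor arguments, payload fields and return format are assumptions; use SageMakerPredictor in the FMBench repo as the authoritative reference for the interface.

```python
# my_custom_predictor.py -- hypothetical sketch. The module path of
# FMBenchPredictor, the constructor arguments and the return format are
# assumptions; see SageMakerPredictor in the FMBench repo for the real interface.
import requests

from fmbench.scripts.fmbench_predictor import FMBenchPredictor  # assumed import path


class MyCustomPredictor(FMBenchPredictor):
    def __init__(self, endpoint_name: str, inference_spec: dict | None = None):
        self._endpoint_name = endpoint_name
        self._inference_spec = inference_spec or {}

    def get_predictions(self, payload: dict) -> dict:
        """Send the prompt to the REST endpoint and return the generated text."""
        response = requests.post(
            f"http://{self._endpoint_name}:8080/generate",
            json={"inputs": payload["inputs"], **self._inference_spec},
            timeout=300,
        )
        response.raise_for_status()
        return {"generated_text": response.json().get("generated_text", "")}

    @property
    def endpoint_name(self) -> str:
        return self._endpoint_name
```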
        "},{"location":"ec2.html","title":"Run FMBench on Amazon EC2","text":"

        For some enterprise scenarios it might be desirable to run FMBench directly on an EC2 instance with no dependency on S3. Here are the steps to do this:

        1. Have a t3.xlarge (or larger) instance in the Running state. Make sure that the instance has at least 50GB of disk space and that the IAM role associated with your EC2 instance has the AmazonSageMakerFullAccess policy associated with it and sagemaker.amazonaws.com added to its Trust relationships.

          {\n    \"Effect\": \"Allow\",\n    \"Principal\": {\n        \"Service\": \"sagemaker.amazonaws.com\"\n    },\n    \"Action\": \"sts:AssumeRole\"\n}\n

        2. Setup the fmbench_python311 conda environment. This step requires conda to be installed on the EC2 instance, see instructions for downloading Anaconda.

          conda create --name fmbench_python311 -y python=3.11 ipykernel\nsource activate fmbench_python311;\npip install -U fmbench\n
        3. Create local directory structure needed for FMBench and copy all publicly available dependencies from the AWS S3 bucket for FMBench. This is done by running the copy_s3_content.sh script available as part of the FMBench repo.

          curl -s https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/copy_s3_content.sh | sh\n
        4. Run FMBench with a quickstart config file.

          fmbench --config-file /tmp/fmbench-read/configs/llama2/7b/config-llama2-7b-g5-quick.yml --local-mode yes > fmbench.log 2>&1\n
        5. Open a new Terminal and navigate to the foundation-model-benchmarking-tool directory and do a tail on fmbench.log to see a live log of the run.

          tail -f fmbench.log\n
        6. All metrics are stored in the /tmp/fmbench-write directory created automatically by the fmbench package. Once the run completes all files are copied locally in a results-* folder as usual.

        "},{"location":"features.html","title":"FMBench features","text":"

        Support for Model Evaluations: FMBench now adds support for evaluating candidate models using Majority Voting with a Panel of LLM Evaluators. Customers can now use FMBench to evaluate model accuracy across open-source and custom datasets, thus FMBench now enables customers to not only measure performance (inference latency, cost, throughput) but also model accuracy.

        Native support for LLM compilation and deployment on AWS Silicon: FMBench now supports end-to-end compilation and model deployment on AWS Silicon. Customers no longer have to wait for models to be available for AWS Chips via SageMaker JumpStart and neither do they have to go through the process of compiling the model to Neuron themselves, FMBench does it all for them. We can simply put the relevant configuration options in the FMBench config file and it will compile and deploy the model on SageMaker (config) or EC2 (config).

        Website for better user experience: FMBench has a website now along with an introduction video. The website is fully searchable to ease common tasks such as installation, finding the right config file, benchmarking on various hosting platforms (EC2, EKS, Bedrock, Neuron, Docker), model evaluation, etc. This website was created based on feedback from several internal teams and external customers.

        Native support for all AWS generative AI services: FMBench now benchmarks and evaluates any Foundation Model (FM) deployed on any AWS Generative AI service, be it Amazon SageMaker, Amazon Bedrock, Amazon EKS, or Amazon EC2. We initially built FMBench for SageMaker, and later extended it to Bedrock and then based on customer requests extended it to support models on EKS and EC2 as well. See list of config files supported out of the box, you can use these config files either as is or as templates for creating your own custom config.

        "},{"location":"gettingstarted.html","title":"Getting started with FMBench","text":"

        FMBench is available as a Python package on PyPi and is run as a command line tool once it is installed. All data that includes metrics, reports and results are stored in an Amazon S3 bucket.

        While technically you can run FMBench on any AWS compute, practically speaking we either run it on a SageMaker Notebook or on EC2. Both of these options are described below.

        Intro Video

        "},{"location":"gettingstarted.html#fmbench-in-a-client-server-configuration-on-amazon-ec2","title":"FMBench in a client-server configuration on Amazon EC2","text":"

        Oftentimes a platform team would like to have a set of LLM endpoints deployed in an account available permanently for data science teams or application teams to benchmark performance and accuracy for their specific use-case. They can take advantage of a special client-server configuration for FMBench where it can be used to deploy models on EC2 instances in one AWS account (called the server account) and run tests against these endpoints from FMBench deployed on EC2 instances in another AWS account (called the client AWS account).

        This has the advantage that every team that wants to benchmark a set of LLMs does not first have to deploy the LLMs; a platform team can do that for them and have these LLMs available for a longer duration as these teams do their benchmarking, for example for their specific datasets and their specific cost and performance criteria. Using FMBench in this way makes the process simpler for both teams: the platform team can use FMBench to easily deploy the models with full control over the configuration of the serving stack without having to write any LLM deployment code for EC2, and the data science teams or application teams can test with different datasets, performance criteria and inference parameters. As long as the security groups have an inbound rule to allow access to the model endpoint (typically TCP port 8080), an FMBench installation in the client AWS account should be able to access an endpoint in the server AWS account.
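        As a quick way to confirm the networking is in place before starting a run, you can check from the client account that the endpoint port is reachable; the private IP below is hypothetical and 8080 is the typical endpoint port mentioned above.

```python
# Hypothetical reachability check run from the client AWS account; replace the
# host with the private IP (or DNS name) of the serving instance in the server account.
import socket

host, port = "10.0.1.25", 8080
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    sock.settimeout(5)
    reachable = sock.connect_ex((host, port)) == 0
print("endpoint port reachable" if reachable else "endpoint port NOT reachable")
```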

        "},{"location":"manifest.html","title":"Files","text":"

        Here is a listing of the various configuration files available out-of-the-box with FMBench. Click on any link to view a file. You can use these files as-is or use them as templates to create a custom configuration file for your use-case of interest.

        bedrock \u251c\u2500\u2500 bedrock/config-bedrock-all-anthropic-models-longbench-data.yml \u251c\u2500\u2500 bedrock/config-bedrock-anthropic-models-OpenOrca.yml \u251c\u2500\u2500 bedrock/config-bedrock-claude.yml \u251c\u2500\u2500 bedrock/config-bedrock-evals-only-conc-1.yml \u251c\u2500\u2500 bedrock/config-bedrock-haiku-sonnet-majority-voting.yml \u251c\u2500\u2500 bedrock/config-bedrock-llama3-1-70b-streaming.yml \u251c\u2500\u2500 bedrock/config-bedrock-llama3-1-8b-streaming.yml \u251c\u2500\u2500 bedrock/config-bedrock-llama3-1-no-streaming.yml \u251c\u2500\u2500 bedrock/config-bedrock-llama3-streaming.yml \u251c\u2500\u2500 bedrock/config-bedrock-models-OpenOrca.yml \u251c\u2500\u2500 bedrock/config-bedrock-titan-text-express.yml \u2514\u2500\u2500 bedrock/config-bedrock.yml bert \u2514\u2500\u2500 bert/config-distilbert-base-uncased.yml byoe \u2514\u2500\u2500 byoe/config-model-byo-sagemaker-endpoint.yml eks_manifests \u251c\u2500\u2500 eks_manifests/llama3-ray-service.yaml \u2514\u2500\u2500 eks_manifests/mistral-ray-service.yaml gemma \u2514\u2500\u2500 gemma/config-gemma-2b-g5.yml llama2 \u251c\u2500\u2500 llama2/13b \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama2/13b/config-bedrock-sagemaker-llama2.yml \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama2/13b/config-byo-rest-ep-llama2-13b.yml \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama2/13b/config-llama2-13b-inf2-g5-p4d.yml \u2502\u00a0\u00a0 \u2514\u2500\u2500 llama2/13b/config-llama2-13b-inf2-g5.yml \u251c\u2500\u2500 llama2/70b \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama2/70b/config-ec2-llama2-70b.yml \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama2/70b/config-llama2-70b-g5-p4d-tgi.yml \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama2/70b/config-llama2-70b-g5-p4d-trt.yml \u2502\u00a0\u00a0 \u2514\u2500\u2500 llama2/70b/config-llama2-70b-inf2-g5.yml \u2514\u2500\u2500 llama2/7b \u251c\u2500\u2500 llama2/7b/config-llama2-7b-byo-sagemaker-endpoint.yml \u251c\u2500\u2500 llama2/7b/config-llama2-7b-g4dn-g5-trt.yml \u251c\u2500\u2500 llama2/7b/config-llama2-7b-g5-no-s3-quick.yml \u251c\u2500\u2500 llama2/7b/config-llama2-7b-g5-quick.yml \u2514\u2500\u2500 llama2/7b/config-llama2-7b-inf2-g5.yml llama3 \u251c\u2500\u2500 llama3/70b \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama3/70b/config-bedrock.yml -> ../../bedrock/config-bedrock.yml \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama3/70b/config-ec2-llama3-70b-instruct.yml \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama3/70b/config-ec2-neuron-llama3-70b-inf2-48xl-deploy-sm.yml \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama3/70b/config-llama3-70b-instruct-g5-48xl.yml \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama3/70b/config-llama3-70b-instruct-g5-p4d.yml \u2502\u00a0\u00a0 \u2514\u2500\u2500 llama3/70b/config-llama3-70b-instruct-p4d.yml \u2514\u2500\u2500 llama3/8b \u251c\u2500\u2500 llama3/8b/config-bedrock.yml \u251c\u2500\u2500 llama3/8b/config-ec2-llama3-8b-c5-18xlarge.yml \u251c\u2500\u2500 llama3/8b/config-ec2-llama3-8b-inf2-48xl.yml \u251c\u2500\u2500 llama3/8b/config-ec2-llama3-8b-m5-16xlarge.yml \u251c\u2500\u2500 llama3/8b/config-ec2-llama3-8b-m7a-16xlarge.yml \u251c\u2500\u2500 llama3/8b/config-ec2-llama3-8b-m7a-24xlarge.yml \u251c\u2500\u2500 llama3/8b/config-ec2-llama3-8b-m7i-12xlarge.yml \u251c\u2500\u2500 llama3/8b/config-ec2-llama3-8b-neuron-trn1-32xl-tp16-sm.yml \u251c\u2500\u2500 config-llama3-8b-trn1-32xl-tp16-bs-4-ec2.yml \u251c\u2500\u2500 llama3/8b/config-ec2-llama3-8b.yml \u251c\u2500\u2500 llama3/8b/config-ec2-neuron-llama3-8b-inf2-24xl-deploy-sm.yml \u251c\u2500\u2500 
llama3/8b/config-ec2-neuron-llama3-8b-inf2-48xl-deploy-sm.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-eks-inf2.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-g5-streaming.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-g5.12xl-tp-2-mc-max-djl-ec2.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-g5.12xl-tp-2-mc-max-triton-ec2.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-g5.12xl-tp-4-mc-max-djl-ec2.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-g5.12xl-tp-4-mc-max-triton-ec2.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-inf2-24xl-tp=8-bs=4-byoe.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-inf2-48xl-tp=8-bs=4-byoe.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-inf2-g5-byoe-w-openorca.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-inf2-g5.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-all.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-g5-12xl-4-instances.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-g5-12xl.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-g5-24xl.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-g5-2xl.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-g5-48xl.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-g5-p4d.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-g6-12xl.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-g6-24xl.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-g6-48xl.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-p4d-djl-lmi-dist.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-p4d-djl-vllm.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-p5-djl-lmi-dist.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-trn1-32xl-tp-16-bs-4-byoe.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-trn1-32xl-tp-8-bs-4-byoe.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-trn1-32xlarge-triton-vllm.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-trn1.yml \u251c\u2500\u2500 llama3/8b/llama3-8b-inf2-24xl-byoe-g5-12xl.yml \u251c\u2500\u2500 llama3/8b/llama3-8b-inf2-48xl-byoe-g5-24xl.yml \u2514\u2500\u2500 llama3/8b/llama3-8b-trn1-32xl-byoe-g5-24xl.yml llama3.1 \u251c\u2500\u2500 llama3.1/70b \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama3.1/70b/config-ec2-llama3-1-70b-inf2-48xl-deploy-ec2.yml \u2502\u00a0\u00a0 \u2514\u2500\u2500 llama3.1/70b/config-ec2-llama3-1-70b-inf2-deploy-sm.yml \u2514\u2500\u2500 llama3.1/8b \u251c\u2500\u2500 llama3.1/8b/client-config-ec2-llama3-1-8b.yml \u251c\u2500\u2500 llama3.1/8b/config-ec2-llama3-1-8b-inf2-48xl-deploy-ec2-tp24-bs12.yml \u251c\u2500\u2500 llama3.1/8b/config-ec2-llama3-1-8b-inf2-48xl-deploy-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-ec2-llama3-1-8b-inf2.yml \u251c\u2500\u2500 llama3.1/8b/config-ec2-llama3-1-8b-p4-tp-2-mc-max.yml \u251c\u2500\u2500 llama3.1/8b/config-ec2-llama3-1-8b-p4-tp-4-mc-max.yml \u251c\u2500\u2500 llama3.1/8b/config-ec2-llama3-1-8b-p4-tp-8-mc-max.yml \u251c\u2500\u2500 llama3.1/8b/config-ec2-llama3-1-8b-p5-tp-2-mc-max.yml \u251c\u2500\u2500 llama3.1/8b/config-ec2-llama3-1-8b-tp-8-mc-auto-p5.yml \u251c\u2500\u2500 llama3.1/8b/config-ec2-llama3-1-8b-trn1-32xl-deploy-ec2-tp32-bs8.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.12xl-tp-2-mc-auto-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.12xl-tp-2-mc-max-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.12xl-tp-4-mc-max-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.24xl-tp-2-mc-auto-ec2.yml 
\u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.24xl-tp-2-mc-max-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.24xl-tp-4-mc-max-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.2xl-tp-1-mc-max-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.48xl-tp-2-mc-auto-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.48xl-tp-2-mc-max-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.48xl-tp-4-mc-max-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.48xl-tp-8-mc-max-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-inf2-48xl-deploy-tp-24-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-inf2-48xl-deploy-tp-8-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-trn1-32xl-deploy-tp-8-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-trn32xl-triton-vllm.yml \u2514\u2500\u2500 llama3.1/8b/server-config-ec2-llama3-1-8b-inf2-48xl-deploy-ec2.yml mistral \u251c\u2500\u2500 mistral/config-mistral-7b-eks-inf2.yml \u251c\u2500\u2500 mistral/config-mistral-7b-tgi-g5.yml \u251c\u2500\u2500 mistral/config-mistral-7b-trn1-32xl-triton.yml \u251c\u2500\u2500 mistral/config-mistral-instruct-AWQ-p4d.yml \u251c\u2500\u2500 mistral/config-mistral-instruct-AWQ-p5-byo-ep.yml \u251c\u2500\u2500 mistral/config-mistral-instruct-AWQ-p5.yml \u251c\u2500\u2500 mistral/config-mistral-instruct-p4d.yml \u251c\u2500\u2500 mistral/config-mistral-instruct-v1-p5-trtllm.yml \u251c\u2500\u2500 mistral/config-mistral-instruct-v2-p4d-lmi-dist.yml \u251c\u2500\u2500 mistral/config-mistral-instruct-v2-p4d-trtllm.yml \u251c\u2500\u2500 mistral/config-mistral-instruct-v2-p5-lmi-dist.yml \u251c\u2500\u2500 mistral/config-mistral-instruct-v2-p5-trtllm.yml \u251c\u2500\u2500 mistral/config-mistral-trn1-32xl-deploy-ec2-tp32.yml \u2514\u2500\u2500 mistral/config-mistral-v3-inf2-48xl-deploy-ec2-tp24.yml model_eval_all_info.yml phi \u2514\u2500\u2500 phi/config-phi-3-g5.yml pricing.yml

        "},{"location":"mm_copies.html","title":"Running multiple model copies on Amazon EC2","text":"

        It is possible to run multiple copies of a model if the tensor parallelism degree and the number of GPUs/Neuron cores on the instance allow it. For example, if a model can fit into 2 GPU devices and there are 8 devices available, then we could run 4 copies of the model on that instance. Some inference containers, such as the DJL Serving LMI, automatically start multiple copies of the model within the same inference container for the scenario described in the example above. However, it is also possible to do this ourselves by running multiple containers and a load balancer through a Docker compose file. FMBench now supports this functionality by adding a single parameter called model_copies in the configuration file.

        For example, here is a snippet from the config-ec2-llama3-1-8b-p4-tp-2-mc-max config file. The new parameters are model_copies, tp_degree and shm_size in the inference_spec section. Note that the tp_degree in the inference_spec and option.tensor_parallel_degree in the serving.properties section should be set to the same value.

            inference_spec:\n      # this should match one of the sections in the inference_parameters section above\n      parameter_set: ec2_djl\n      # how many copies of the model, \"1\", \"2\",..max\n      # set to 1 in the code if not configured,\n      # max: FMBench figures out the max number of model containers to be run\n      #      based on TP degree configured and number of neuron cores/GPUs available.\n      #      For example, if TP=2, GPUs=8 then FMBench will start 4 containers and 1 load balancer,\n      # auto: only supported if the underlying inference container would automatically \n      #       start multiple copies of the model internally based on TP degree and neuron cores/GPUs\n      #       available. In this case only a single container is created, no load balancer is created.\n      #       The DJL serving containers supports auto.  \n      model_copies: max\n      # if you set the model_copies parameter then it is mandatory to set the \n      # tp_degree, shm_size, model_loading_timeout parameters\n      tp_degree: 2\n      shm_size: 12g\n      model_loading_timeout: 2400\n    # modify the serving properties to match your model and requirements\n    serving.properties: |\n      engine=MPI\n      option.tensor_parallel_degree=2\n      option.max_rolling_batch_size=256\n      option.model_id=meta-llama/Meta-Llama-3.1-8B-Instruct\n      option.rolling_batch=lmi-dist\n
        "},{"location":"mm_copies.html#considerations-while-setting-the-model_copies-parameter","title":"Considerations while setting the model_copies parameter","text":"
        1. The model_copies parameter is an EC2-only parameter, which means that you cannot use it when deploying models on SageMaker, for example.

        2. If you are looking for the best (lowest) inference latency, you might get better results by setting tp_degree and option.tensor_parallel_degree to the total number of GPUs/Neuron cores available on your EC2 instance and model_copies to max, auto or 1. In other words, the model is sharded across all accelerators and only one copy of the model can run on that instance (therefore setting model_copies to max, auto or 1 all result in the same thing, i.e. a single copy of the model running on that EC2 instance).

        3. If you are looking for the best (highest) transaction throughput while keeping inference latency within a given latency budget, you might want to configure tp_degree and option.tensor_parallel_degree to the smallest number of GPUs/Neuron cores on which the model can run (for example, for Llama3.1-8b that would be 2 GPUs or 4 Neuron cores) and set model_copies to max. Let us understand this with an example: say you want to run Llama3.1-8b on a p4de.24xlarge instance type and you set tp_degree and option.tensor_parallel_degree to 2 and model_copies to max. FMBench will start 4 containers (as the p4de.24xlarge has 8 GPUs) and an Nginx load balancer that round-robins the incoming requests to these 4 containers. In the case of the DJL serving LMI you can achieve similar results by setting model_copies to auto, in which case FMBench will start a single container (and no load balancer, since there is only one container) and the DJL serving container will internally start 4 copies of the model within the same container and route requests to these copies internally. Theoretically you should expect the same performance, but in our testing we have seen better performance with model_copies set to max and an external (Nginx) container doing the load balancing.
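        The arithmetic behind the max setting in this example can be sanity-checked with the trivial sketch below; this is not FMBench code, just the calculation it performs, using the values from the example above.

          # Not FMBench code, only the arithmetic behind "model_copies: max" for this example.
          NUM_ACCELERATORS=8   # GPUs on a p4de.24xlarge
          TP_DEGREE=2          # tensor parallel degree set in the config file
          echo "model copies started by FMBench: $(( NUM_ACCELERATORS / TP_DEGREE ))"   # prints 4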

        "},{"location":"neuron.html","title":"Benchmark foundation models for AWS Chips","text":"

        You can use FMBench for benchmarking foundation models on AWS Chips: Trainium 1 and Inferentia 2. This can be done on Amazon SageMaker, Amazon EKS or Amazon EC2. FMs need to be compiled for Neuron before they can be deployed on AWS Chips; this is made easier by SageMaker JumpStart, which provides many FMs as JumpStart Models that can be deployed on SageMaker directly. You can also compile models for Neuron yourself or have FMBench do it for you. All of these options are described below.

        "},{"location":"neuron.html#benchmarking-for-aws-chips-on-sagemaker","title":"Benchmarking for AWS Chips on SageMaker","text":"
        1. Several FMs are available through SageMaker JumpStart already compiled for Neuron and ready to deploy. See this link for more details.

        2. You can compile the model outside of FMBench using instructions available here and in the Neuron documentation, deploy it on SageMaker and use FMBench in the bring-your-own-endpoint mode; see this config file for an example.

        3. You can have FMBench compile and deploy the model on SageMaker for you. See this Llama3-8b config file or this Llama3.1-70b config file for examples. Search this website for \"inf2\" or \"trn1\" to find other examples. In this case FMBench will download the model from Hugging Face (you need to provide your HuggingFace token in the /tmp/fmbench-read/scripts/hf_token.txt file; the file simply contains the token without any formatting), compile it for Neuron, upload the compiled model to S3 (you specify the bucket in the config file) and then deploy the model to a SageMaker endpoint.
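        For example, the token file can be created as shown below; the token value is a placeholder, the path is the one FMBench expects.

          # Place your Hugging Face token (just the token, no quotes or extra formatting)
          # where FMBench expects it; hf_xxxx... is a placeholder for your own token.
          echo "hf_xxxxxxxxxxxxxxxxxxxx" > /tmp/fmbench-read/scripts/hf_token.txt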

        "},{"location":"neuron.html#benchmarking-for-aws-chips-on-ec2","title":"Benchmarking for AWS Chips on EC2","text":"

        You may want to benchmark models hosted directly on EC2. In this case both FMBench and the model run on the same EC2 instance, and FMBench deploys the model for you on that instance. See this Llama3.1-70b file or this Llama3-8b file for examples. In this case FMBench will download the model from Hugging Face (you need to provide your HuggingFace token in the /tmp/fmbench-read/scripts/hf_token.txt file; the file simply contains the token without any formatting), pull the inference container from the ECR repo and then run the container with the downloaded model; a local endpoint is exposed that FMBench then uses to run inference.
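        As a sketch, a run on an EC2 Trainium instance could look like the command below; it assumes FMBench is already installed, the Hugging Face token file is in place (see the previous section), and the config files follow the /tmp/fmbench-read layout. Pick one of the configs from the manifest that matches your instance type.

          # Sketch: benchmark a Neuron-compiled model locally on the EC2 instance.
          # The config file is one of the Neuron EC2 configs listed in the manifest above.
          fmbench --config-file /tmp/fmbench-read/configs/llama3.1/8b/config-llama3.1-8b-trn1-32xl-deploy-tp-8-ec2.yml \
              --local-mode yes --write-bucket placeholder > fmbench.log 2>&1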

        "},{"location":"quickstart.html","title":"Quickstart - run FMBench on SageMaker Notebook","text":"
        1. Each FMBench run works with a configuration file that contains the information about the model, the deployment steps, and the tests to run. A typical FMBench workflow involves either directly using an already provided config file from the configs folder in the FMBench GitHub repo or editing an already provided config file as per your own requirements (say you want to try benchmarking on a different instance type, or a different inference container etc.).

          \ud83d\udc49 A simple config file with key parameters annotated is included in this repo, see config-llama2-7b-g5-quick.yml. This file benchmarks performance of Llama2-7b on an ml.g5.xlarge instance and an ml.g5.2xlarge instance. You can use this config file as it is for this Quickstart.

        2. Launch the AWS CloudFormation template included in this repository using one of the buttons from the table below. The CloudFormation template creates the following resources within your AWS account: Amazon S3 buckets, an AWS IAM role and an Amazon SageMaker Notebook with this repository cloned. A read S3 bucket is created which contains all the files (configuration files, datasets) required to run FMBench, and a write S3 bucket is created which will hold the metrics and reports generated by FMBench. The CloudFormation stack takes about 5 minutes to create.

        AWS Region Link us-east-1 (N. Virginia) us-west-2 (Oregon) us-gov-west-1 (GovCloud West)
        1. Once the CloudFormation stack is created, navigate to SageMaker Notebooks and open the fmbench-notebook.

        2. On the fmbench-notebook open a Terminal and run the following commands.

          conda create --name fmbench_python311 -y python=3.11 ipykernel\nsource activate fmbench_python311;\npip install -U fmbench\n

        3. Now you are ready to run fmbench with the following command line. We will use a sample config file placed in the S3 bucket by the CloudFormation stack for a quick first run.

          1. We benchmark performance for the Llama2-7b model on a ml.g5.xlarge and a ml.g5.2xlarge instance type, using the huggingface-pytorch-tgi-inference inference container. This test would take about 30 minutes to complete and cost about $0.20.

          2. It uses a simple relationship of 750 words equals 1000 tokens; to get a more accurate representation of token counts use the Llama2 tokenizer (instructions are provided in the next section). It is strongly recommended that for more accurate results on token throughput you use a tokenizer specific to the model you are testing rather than the default tokenizer. See instructions provided later in this document on how to use a custom tokenizer.

            account=`aws sts get-caller-identity | jq .Account | tr -d '\"'`\nregion=`aws configure get region`\nfmbench --config-file s3://sagemaker-fmbench-read-${region}-${account}/configs/llama2/7b/config-llama2-7b-g5-quick.yml > fmbench.log 2>&1\n
          3. Open another terminal window and do a tail -f on the fmbench.log file to see all the traces being generated at runtime.

            tail -f fmbench.log\n
          4. \ud83d\udc49 For streaming support on SageMaker and Bedrock checkout these config files:

            1. config-llama3-8b-g5-streaming.yml
            2. config-bedrock-llama3-streaming.yml
        4. The generated reports and metrics are available in the sagemaker-fmbench-write-<replace_w_your_aws_region>-<replace_w_your_aws_account_id> bucket. The metrics and report files are also downloaded locally into the results directory (created by FMBench), and the benchmarking report is available as a markdown file called report.md in that directory. You can view the rendered Markdown report in the SageMaker notebook itself or download the metrics and report files to your machine for offline analysis.
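        For offline analysis, one way (shown below as a sketch) to pull everything down from the write bucket is with aws s3 sync; the bucket name follows the naming pattern created by the CloudFormation stack.

          # Sketch: download the generated metrics and reports locally for offline analysis.
          account=`aws sts get-caller-identity | jq .Account | tr -d '"'`
          region=`aws configure get region`
          aws s3 sync s3://sagemaker-fmbench-write-${region}-${account}/ ./fmbench-results/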

        "},{"location":"quickstart.html#fmbench-on-govcloud","title":"FMBench on GovCloud","text":"

        No special steps are required for running FMBench on GovCloud. The CloudFormation link for us-gov-west-1 has been provided in the section above.

        1. Not all models available via Bedrock or other services may be available in GovCloud. The following commands show how to run FMBench to benchmark the Amazon Titan Text Express model in the GovCloud. See the Amazon Bedrock GovCloud page for more details.

          account=`aws sts get-caller-identity | jq .Account | tr -d '\"'`\nregion=`aws configure get region`\nfmbench --config-file s3://sagemaker-fmbench-read-${region}-${account}/configs/bedrock/config-bedrock-titan-text-express.yml > fmbench.log 2>&1\n
        "},{"location":"releases.html","title":"Releases","text":""},{"location":"releases.html#207","title":"2.0.7","text":"
        1. Support Triton-TensorRT for GPU instances and Triton-vllm for AWS Chips.
        2. Misc. bug fixes.
        "},{"location":"releases.html#206","title":"2.0.6","text":"
        1. Run multiple model copies with the DJL serving container and an Nginx load balancer on Amazon EC2.
        2. Config files for Llama3.1-8b on g5, p4de and p5 Amazon EC2 instance types.
        3. Better analytics for creating internal leaderboards.
        "},{"location":"releases.html#205","title":"2.0.5","text":"
        1. Support for Intel CPU based instances such as c5.18xlarge and m5.16xlarge.
        "},{"location":"releases.html#204","title":"2.0.4","text":"
        1. Support for AMD CPU based instances such as m7a.
        "},{"location":"releases.html#203","title":"2.0.3","text":"
        1. Support for an EFA directory for benchmarking on EC2.
        "},{"location":"releases.html#202","title":"2.0.2","text":"
        1. Code cleanup, minor bug fixes and report improvements.
        "},{"location":"releases.html#200","title":"2.0.0","text":"
        1. \ud83d\udea8 Model evaluations done by a Panel of LLM Evaluators \ud83d\udea8
        "},{"location":"releases.html#v1052","title":"v1.0.52","text":"
        1. Compile for AWS Chips (Trainium, Inferentia) and deploy to SageMaker directly through FMBench.
        2. Llama3.1-8b and Llama3.1-70b config files for AWS Chips (Trainium, Inferentia).
        3. Misc. bug fixes.
        "},{"location":"releases.html#v1051","title":"v1.0.51","text":"
        1. FMBench has a website now. Rework the README file to make it lightweight.
        2. Llama3.1 config files for Bedrock.
        "},{"location":"releases.html#v1050","title":"v1.0.50","text":"
        1. Llama3-8b on Amazon EC2 inf2.48xlarge config file.
        2. Update to new version of DJL LMI (0.28.0).
        "},{"location":"releases.html#v1049","title":"v1.0.49","text":"
        1. Streaming support for Amazon SageMaker and Amazon Bedrock.
        2. Per-token latency metrics such as time to first token (TTFT) and mean time per-output token (TPOT).
        3. Misc. bug fixes.
        "},{"location":"releases.html#v1048","title":"v1.0.48","text":"
        1. Faster result file download at the end of a test run.
        2. Phi-3-mini-4k-instruct configuration file.
        3. Tokenizer and misc. bug fixes.
        "},{"location":"releases.html#v1047","title":"v1.0.47","text":"
        1. Run FMBench as a Docker container.
        2. Bug fixes for GovCloud support.
        3. Updated README for EKS cluster creation.
        "},{"location":"releases.html#v1046","title":"v1.0.46","text":"
        1. Native model deployment support for EC2 and EKS (i.e. you can now deploy and benchmark models on EC2 and EKS).
        2. FMBench is now available in GovCloud.
        3. Update to latest version of several packages.
        "},{"location":"releases.html#v1045","title":"v1.0.45","text":"
        1. Analytics for results across multiple runs.
        2. Llama3-70b config files for g5.48xlarge instances.
        "},{"location":"releases.html#v1044","title":"v1.0.44","text":"
        1. Endpoint metrics (CPU/GPU utilization, memory utilization, model latency) and invocation metrics (including errors) for SageMaker Endpoints.
        2. Llama3-8b config files for g6 instances.
        "},{"location":"releases.html#v1042","title":"v1.0.42","text":"
        1. Config file for running Llama3-8b on all instance types except p5.
        2. Fix bug with business summary chart.
        3. Fix bug with deploying model using a DJL DeepSpeed container in the no S3 dependency mode.
        "},{"location":"releases.html#v1040","title":"v1.0.40","text":"
        1. Make it easier to run FMBench on Amazon EC2 in the no-Amazon-S3-dependency mode.
        "},{"location":"releases.html#v1039","title":"v1.0.39","text":"
        1. Add an internal FMBench website.
        "},{"location":"releases.html#v1038","title":"v1.0.38","text":"
        1. Support for running FMBench on Amazon EC2 without any dependency on Amazon S3.
        2. Llama3-8b-Instruct config file for ml.p5.48xlarge.
        "},{"location":"releases.html#v1037","title":"v1.0.37","text":"
        1. g5/p4d/inf2/trn1 specific config files for Llama3-8b-Instruct.
          1. p4d config file for both vllm and lmi-dist.
        "},{"location":"releases.html#v1036","title":"v1.0.36","text":"
        1. Fix bug at higher concurrency levels (20 and above).
        2. Support for instance count > 1.
        "},{"location":"releases.html#v1035","title":"v1.0.35","text":"
        1. Support for Open-Orca dataset and corresponding prompts for Llama3, Llama2 and Mistral.
        "},{"location":"releases.html#v1034","title":"v1.0.34","text":"
        1. Don't delete endpoints for the bring your own endpoint case.
        2. Fix bug with business summary chart.
        "},{"location":"releases.html#v1032","title":"v1.0.32","text":"
        1. Report enhancements: New business summary chart, config file embedded in the report, version numbering and others.

        2. Additional config files: Meta Llama3 on Inf2, Mistral instruct with lmi-dist on p4d and p5 instances.

        "},{"location":"resources.html","title":"Resources","text":""},{"location":"resources.html#pending-enhancements","title":"Pending enhancements","text":"

        View the ISSUES on GitHub and add any that you think would be a beneficial addition to this benchmarking harness.

        "},{"location":"resources.html#security","title":"Security","text":"

        See CONTRIBUTING for more information.

        "},{"location":"resources.html#license","title":"License","text":"

        This library is licensed under the MIT-0 License. See the LICENSE file.

        "},{"location":"results.html","title":"Results","text":"

        Depending upon the experiments in the config file, the FMBench run may take a few minutes to several hours. Once the run completes, you can find the report and metrics in the local results-* folder in the directory from where FMBench was run. The report and metrics are also written to the write S3 bucket set in the config file.
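        For example (a minimal sketch), from the directory where FMBench was run:

          # List the local results folders created by FMBench and view the generated report.
          ls -d results-*
          less results-*/report.md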

        Here is a screenshot of the report.md file generated by FMBench.

        "},{"location":"run_as_container.html","title":"Run FMBench as a Docker container","text":"

        You can now run FMBench on any platform where you can run a Docker container, for example on an EC2 VM, SageMaker Notebook etc. The advantage is that you do not have to install anything locally, so no conda installs needed anymore. Here are the steps to do that.

        1. Create local directory structure needed for FMBench and copy all publicly available dependencies from the AWS S3 bucket for FMBench. This is done by running the copy_s3_content.sh script available as part of the FMBench repo. You can place model specific tokenizers and any new configuration files you create in the /tmp/fmbench-read directory that is created after running the following command.

          curl -s https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/copy_s3_content.sh | sh\n
        2. That's it! You are now ready to run the container.

          # set the config file path to point to the config file of interest\nCONFIG_FILE=https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/src/fmbench/configs/llama2/7b/config-llama2-7b-g5-quick.yml\ndocker run -v $(pwd)/fmbench:/app \\\n  -v /tmp/fmbench-read:/tmp/fmbench-read \\\n  -v /tmp/fmbench-write:/tmp/fmbench-write \\\n  aarora79/fmbench:v1.0.47 \\\n \"fmbench --config-file ${CONFIG_FILE} --local-mode yes --write-bucket placeholder > fmbench.log 2>&1\"\n
        3. The above command will create an fmbench directory inside the current working directory. This directory contains the fmbench.log file and the results-* folder that is created once the run finishes.
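        You can monitor a containerized run the same way as a local run; based on the volume mount shown above, the log is available on the host at fmbench/fmbench.log.

          # The container writes its log into the mounted fmbench directory on the host.
          tail -f fmbench/fmbench.log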

        "},{"location":"website.html","title":"Create a website for FMBench reports","text":"

        When you use FMBench as a tool for benchmarking your foundation models you would soon want to have an easy way to view all the reports in one place and search through the results, for example, \"Llama3.1-8b results on trn1.32xlarge\". An FMBench website provides a simple way of viewing these results.

        Here are the steps to setup a website using mkdocs and nginx. The steps below generate a self-signed certificate for SSL and use username and password for authentication. It is strongly recommended that you use a valid SSL cert and a better authentication mechanism than username and password for your FMBench website.

        1. Start an Amazon EC2 machine which will host the FMBench website. A t3.xlarge machine with an Ubuntu AMI say ubuntu/images/hvm-ssd-gp3/ubuntu-noble-24.04-amd64-server-20240801 and 50GB storage is good enough. Allow SSH and TCP port 443 traffic from anywhere into that machine.

        2. SSH into that machine and install conda.

          wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh\nbash Miniconda3-latest-Linux-x86_64.sh -b  # Run the Miniconda installer in batch mode (no manual intervention)\nrm -f Miniconda3-latest-Linux-x86_64.sh    # Remove the installer script after installation\neval \"$(/home/$USER/miniconda3/bin/conda shell.bash hook)\" # Initialize conda for bash shell\nconda init  # Initialize conda, adding it to the shell  \n
        3. Install docker-compose.

          sudo apt-get update\nsudo apt-get install --reinstall docker.io -y\nsudo apt-get install -y docker-compose\nsudo usermod -a -G docker $USER\nnewgrp docker\ndocker compose version \n
        4. Setup the fmbench_python311 conda environment and clone FMBench repo.

          conda create --name fmbench_python311 -y python=3.11 ipykernel\nsource activate fmbench_python311\npip install -U fmbench mkdocs mkdocs-material mknotebooks\ngit clone https://github.com/aws-samples/foundation-model-benchmarking-tool.git\n
        5. Get the FMBench results data from Amazon S3 or whichever storage system you used to store all the results.

          curl \"https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip\" -o \"awscliv2.zip\"\nsudo apt-get install unzip -y\nunzip awscliv2.zip\nsudo ./aws/install\nFMBENCH_S3_BUCKET=your-fmbench-s3-bucket-name-here\naws s3 sync s3://$FMBENCH_S3_BUCKET $HOME/fmbench_data --exclude \"*.json\"\n
        6. Create a directory for the FMBench website contents.

          mkdir $HOME/fmbench_site\nmkdir $HOME/fmbench_site/ssl\n
          1. Setup SSL certs (we strongly encourage you not to use self-signed certs; this step is just for demo purposes, so get SSL certs the same way you get them for your current production workloads).

          sudo openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout $HOME/fmbench_site/ssl/nginx-selfsigned.key -out $HOME/fmbench_site/ssl/nginx-selfsigned.crt\n
        7. Create an .htpasswd file. The FMBench website will use fmbench_admin as the username and the password that you enter as part of the command below to allow login to the website.

          sudo apt-get install apache2-utils -y\nhtpasswd -c $HOME/fmbench_site/.htpasswd fmbench_admin\n
        8. Create the mkdocs.yml file for the website.

          cd foundation-model-benchmarking-tool\ncp website/index.md $HOME/fmbench_data/\ncp -r img $HOME/fmbench_data/\npython website/create_fmbench_website.py\nmkdocs build -f website/mkdocs.yml --site-dir $HOME/fmbench_site/site\n
        9. Update the nginx.conf file. Note the hostname that is printed out below; the FMBench website will be served at this address.

          TOKEN=`curl -X PUT \"http://169.254.169.254/latest/api/token\" -H \"X-aws-ec2-metadata-token-ttl-seconds: 21600\"`\nHOSTNAME=`curl -H \"X-aws-ec2-metadata-token: $TOKEN\" http://169.254.169.254/latest/meta-data/public-hostname`\necho \"hostname is: $HOSTNAME\"\nsed \"s/__HOSTNAME__/$HOSTNAME/g\" website/nginx.conf.template > $HOME/fmbench_site/nginx.conf\n
        10. Serve the website.

          docker run --name fmbench-nginx -d -p 80:80 -p 443:443   -v $HOME/fmbench_site/site:/usr/share/nginx/html   -v $HOME/fmbench_site/nginx.conf:/etc/nginx/nginx.conf   -v $HOME/fmbench_site/ssl:/etc/nginx/ssl   -v $HOME/fmbench_site/.htpasswd:/etc/nginx/.htpasswd   nginx\n
        11. Open a web browser and navigate to the hostname you noted in the step above, for example https://<your-ec2-hostname>.us-west-2.compute.amazonaws.com, ignore the security warnings if you used a self-signed SSL cert (replace this with a cert that you would normally use in your production websites) and then enter the username and password (the username would be fmbench_admin and password would be what you had set when running the htpasswd command). You should see a website as shown in the screenshot below.

        "},{"location":"workflow.html","title":"Workflow for FMBench","text":"

        The workflow for FMBench is as follows:

        Create configuration file\n        |\n        |-----> Deploy model on SageMaker/Use models on Bedrock/Bring your own endpoint\n                    |\n                    |-----> Run inference against deployed endpoint(s)\n                                     |\n                                     |------> Create a benchmarking report\n
        1. Create a dataset of different prompt sizes and select one or more such datasets for running the tests.

          1. Currently FMBench supports datasets from LongBench and filters out individual items from the dataset based on their size in tokens (for example, prompts of less than 500 tokens, between 500 and 1000 tokens, and so on). Alternatively, you can download the folder from this link to load the data.
        2. Deploy any model that is deployable on SageMaker on any supported instance type (g5, p4d, Inf2).

          1. Models can either be available via SageMaker JumpStart (list available here) or be models not available via JumpStart but still deployable on SageMaker through the low-level boto3 (Python) SDK (Bring Your Own Script).
          2. Model deployment is completely configurable in terms of the inference container to use, environment variables to set, serving.properties file to provide (for inference containers such as DJL that use it) and instance type to use.
        3. Benchmark FM performance in terms of inference latency, transactions per minute and dollar cost per transaction for any FM that can be deployed on SageMaker.

          1. Tests are run for each combination of the configured concurrency levels (i.e. the number of transactions, or inference requests, sent to the endpoint in parallel) and datasets. For example, run multiple datasets of, say, prompt sizes between 3000 and 4000 tokens at concurrency levels of 1, 2, 4, 6, 8, etc., so as to test how many transactions of what token length the endpoint can handle while still maintaining an acceptable level of inference latency.
        4. Generate a report that compares and contrasts the performance of the model over different test configurations and stores the reports in an Amazon S3 bucket.

          1. The report is generated in the Markdown format and consists of plots, tables and text that highlight the key results and provide an overall recommendation on the best combination of instance type and serving stack to use for the model under test for a dataset of interest.
          2. The report is created as an artifact of reproducible research so that anyone having access to the model, instance type and serving stack can run the code and recreate the same results and report.
        5. Multiple configuration files that can be used as reference for benchmarking new models and instance types.

        "},{"location":"misc/ec2_instance_creation_steps.html","title":"Create an EC2 instance suitable for an LMI (Large Model Inference)","text":"

        Follow the steps below to create an EC2 instance for hosting a model in an LMI.

        1. On the homepage of AWS Console go to \u2018EC2\u2019 - it is likely in recently visited:

        2. If not found, go to the search bar at the top of the page. Type ec2 into the search box and click the entry that pops up with the name EC2:

        3. Click \u201cInstances\u201d:

        4. Click \"Launch Instances\":

        5. Type in a name for your instance (it is recommended to include your alias in the name), and then scroll down. Search for \u2018deep learning ami\u2019 in the box. Select the one that says Deep Learning OSS Nvidia Driver AMI GPU PyTorch for a GPU instance type, or Deep Learning AMI Neuron (Ubuntu 22.04) for an Inferentia/Trainium instance type. Your version number might be different.

        6. Name your instance FMBenchInstance.

        7. Add a fmbench-version tag to your instance.

        8. Scroll down to Instance Type. For large model inference, the g5.12xlarge is recommended.

        1. Make a key pair by clicking Create new key pair. Give it a name, keep all settings as is, and then click \u201cCreate key pair\u201d.

        2. Skip over Network settings (leave it as it is), going straight to Configure storage. 45 GB, the suggested amount, is not nearly enough, and using that will cause the LMI docker container to download for an arbitrarily long time and then error out. Change it to 100 GB or more:

        3. Create an IAM role for your instance called FMBenchEC2Role. Attach the following permission policies: AmazonSageMakerFullAccess and AmazonBedrockFullAccess. (A CLI sketch for creating this role is shown after the trust policy below.)

          Edit the trust policy to be the following:

          {\n    \"Version\": \"2012-10-17\",\n    \"Statement\": [\n        {\n            \"Effect\": \"Allow\",\n            \"Principal\": {\n                \"Service\": \"ec2.amazonaws.com\"\n            },\n            \"Action\": \"sts:AssumeRole\"\n        },\n        {\n            \"Effect\": \"Allow\",\n            \"Principal\": {\n                \"Service\": \"sagemaker.amazonaws.com\"\n            },\n            \"Action\": \"sts:AssumeRole\"\n        },\n        {\n            \"Effect\": \"Allow\",\n            \"Principal\": {\n                \"Service\": \"bedrock.amazonaws.com\"\n            },\n            \"Action\": \"sts:AssumeRole\"\n        }\n    ]\n}\n
          Select this role in the IAM instance profile setting of your instance.
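          If you prefer the CLI over the console, here is a sketch; it assumes the trust policy above has been saved locally as trust-policy.json and uses the role name from this guide.

            # Sketch: create FMBenchEC2Role from the trust policy above, attach the two
            # managed policies, and expose the role to EC2 as an instance profile.
            aws iam create-role --role-name FMBenchEC2Role \
                --assume-role-policy-document file://trust-policy.json
            aws iam attach-role-policy --role-name FMBenchEC2Role \
                --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
            aws iam attach-role-policy --role-name FMBenchEC2Role \
                --policy-arn arn:aws:iam::aws:policy/AmazonBedrockFullAccess
            aws iam create-instance-profile --instance-profile-name FMBenchEC2Role
            aws iam add-role-to-instance-profile --instance-profile-name FMBenchEC2Role \
                --role-name FMBenchEC2Role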

        4. Then, we\u2019re done with the settings of the instance. Click Launch Instance to finish. You can connect to your EC2 instance using any of these options.

        "},{"location":"misc/eks_cluster-creation_steps.html","title":"EKS cluster creation steps","text":"

        The steps below create an EKS cluster called trainium-inferentia.

        1. Before we begin, ensure you have all the prerequisites in place to make the deployment process smooth and hassle-free. Ensure that you have installed the following tools on your machine: aws-cli, kubectl and terraform. We use the DoEKS repository as a guide to deploy the cluster infrastructure in an AWS account.

        2. Ensure that your account has enough Inf2 on-demand vCPUs, as most of the DoEKS blueprints use this specific instance type. To increase the service quota, navigate to the service quota page for the region you are in: service quota. Then select Services in the left side menu and search for Amazon Elastic Compute Cloud (Amazon EC2). This brings up the service quota page; here, search for inf and there should be an option for Running On-Demand Inf instances. Increase this quota to 300.
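          If you prefer to do this from the CLI, a sketch is shown below; the quota code is not hard-coded here, the first command looks it up and <quota-code> is a placeholder you substitute.

            # Sketch: find the quota code for "Running On-Demand Inf instances" and request 300.
            aws service-quotas list-service-quotas --service-code ec2 \
                --query "Quotas[?contains(QuotaName, 'Inf')].[QuotaName,QuotaCode]" --output table
            # Substitute the QuotaCode printed above for <quota-code>.
            aws service-quotas request-service-quota-increase --service-code ec2 \
                --quota-code <quota-code> --desired-value 300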

        3. Clone the DoEKS repository

          git clone https://github.com/awslabs/data-on-eks.git\n
        4. Ensure that the region names are correct in the variables.tf file before running the cluster creation script.

        5. Ensure that the ELB to be created will be external facing. Change the helm value from internal to internet-facing here.

        6. Ensure that the IAM role you are using has the permissions needed to create the cluster. While we expect the following set of permissions to work, the current recommendation is to also add the AdministratorAccess permission to the IAM role. At a later date you can remove AdministratorAccess and experiment with cluster creation without it.

          1. Attach the following managed policies: AmazonEKSClusterPolicy, AmazonEKS_CNI_Policy, and AmazonEKSWorkerNodePolicy.
          2. In addition to the managed policies add the following as inline policy. Replace your-account-id with the actual value of the AWS account id you are using.

            {\n\"Version\": \"2012-10-17\",\n\"Statement\": [\n    {\n        \"Sid\": \"VisualEditor0\",\n        \"Effect\": \"Allow\",\n        \"Action\": [\n            \"ec2:CreateVpc\",\n            \"ec2:DeleteVpc\"\n        ],\n        \"Resource\": [\n            \"arn:aws:ec2:*:your-account-id:ipv6pool-ec2/*\",\n            \"arn:aws:ec2::your-account-id:ipam-pool/*\",\n            \"arn:aws:ec2:*:your-account-id:vpc/*\"\n        ]\n    },\n    {\n        \"Sid\": \"VisualEditor1\",\n        \"Effect\": \"Allow\",\n        \"Action\": [\n            \"ec2:ModifyVpcAttribute\",\n            \"ec2:DescribeVpcAttribute\"\n        ],\n        \"Resource\": \"arn:aws:ec2:*:<your-account-id>:vpc/*\"\n    },\n    {\n        \"Sid\": \"VisualEditor2\",\n        \"Effect\": \"Allow\",\n        \"Action\": \"ec2:AssociateVpcCidrBlock\",\n        \"Resource\": [\n            \"arn:aws:ec2:*:your-account-id:ipv6pool-ec2/*\",\n            \"arn:aws:ec2::your-account-id:ipam-pool/*\",\n            \"arn:aws:ec2:*:your-account-id:vpc/*\"\n        ]\n    },\n    {\n        \"Sid\": \"VisualEditor3\",\n        \"Effect\": \"Allow\",\n        \"Action\": [\n            \"ec2:DescribeSecurityGroupRules\",\n            \"ec2:DescribeNatGateways\",\n            \"ec2:DescribeAddressesAttribute\"\n        ],\n        \"Resource\": \"*\"\n    },\n    {\n        \"Sid\": \"VisualEditor4\",\n        \"Effect\": \"Allow\",\n        \"Action\": [\n            \"ec2:CreateInternetGateway\",\n            \"ec2:RevokeSecurityGroupEgress\",\n            \"ec2:CreateRouteTable\",\n            \"ec2:CreateSubnet\"\n        ],\n        \"Resource\": [\n            \"arn:aws:ec2:*:your-account-id:security-group/*\",\n            \"arn:aws:ec2:*:your-account-id:internet-gateway/*\",\n            \"arn:aws:ec2:*:your-account-id:subnet/*\",\n            \"arn:aws:ec2:*:your-account-id:route-table/*\",\n            \"arn:aws:ec2::your-account-id:ipam-pool/*\",\n            \"arn:aws:ec2:*:your-account-id:vpc/*\"\n        ]\n    },\n    {\n        \"Sid\": \"VisualEditor5\",\n        \"Effect\": \"Allow\",\n        \"Action\": [\n            \"ec2:AttachInternetGateway\",\n            \"ec2:AssociateRouteTable\"\n        ],\n        \"Resource\": [\n            \"arn:aws:ec2:*:your-account-id:vpn-gateway/*\",\n            \"arn:aws:ec2:*:your-account-id:internet-gateway/*\",\n            \"arn:aws:ec2:*:your-account-id:subnet/*\",\n            \"arn:aws:ec2:*:your-account-id:route-table/*\",\n            \"arn:aws:ec2:*:your-account-id:vpc/*\"\n        ]\n    },\n    {\n        \"Sid\": \"VisualEditor6\",\n        \"Effect\": \"Allow\",\n        \"Action\": \"ec2:AllocateAddress\",\n        \"Resource\": [\n            \"arn:aws:ec2:*:your-account-id:ipv4pool-ec2/*\",\n            \"arn:aws:ec2:*:your-account-id:elastic-ip/*\"\n        ]\n    },\n    {\n        \"Sid\": \"VisualEditor7\",\n        \"Effect\": \"Allow\",\n        \"Action\": \"ec2:ReleaseAddress\",\n        \"Resource\": \"arn:aws:ec2:*:your-account-id:elastic-ip/*\"\n    },\n    {\n        \"Sid\": \"VisualEditor8\",\n        \"Effect\": \"Allow\",\n        \"Action\": \"ec2:CreateNatGateway\",\n        \"Resource\": [\n            \"arn:aws:ec2:*:your-account-id:subnet/*\",\n            \"arn:aws:ec2:*:your-account-id:natgateway/*\",\n            \"arn:aws:ec2:*:your-account-id:elastic-ip/*\"\n        ]\n    }\n]\n}\n
            1. Add the Role ARN and name here in the variables.tf file by updating these lines. Move the structure inside the default list and replace the role ARN and name values with the values for the role you are using.

        7. Navigate into the ai-ml/trainium-inferentia/ directory and run install.sh script.

          cd data-on-eks/ai-ml/trainium-inferentia/\n./install.sh\n

          Note: This step takes about 12-15 minutes to deploy the EKS infrastructure and cluster in the AWS account. To view more details on cluster creation, view an example here: Deploy Llama3 on EKS in the prerequisites section.

        8. After the cluster is created, navigate to the Karpenter EC2 node IAM role called karpenter-trainium-inferentia-XXXXXXXXXXXXXXXXXXXXXXXXX. Attach the following inline policy to the role:

          {\n    \"Version\": \"2012-10-17\",\n    \"Statement\": [\n        {\n            \"Sid\": \"Statement1\",\n            \"Effect\": \"Allow\",\n            \"Action\": [\n                \"iam:CreateServiceLinkedRole\"\n            ],\n            \"Resource\": \"*\"\n        }\n    ]\n}\n
        "},{"location":"misc/the-diy-version-w-gory-details.html","title":"The diy version w gory details","text":""},{"location":"misc/the-diy-version-w-gory-details.html#the-diy-version-with-gory-details","title":"The DIY version (with gory details)","text":"

        Follow the prerequisites below to set up your environment before running the code:

        1. Python 3.11: Setup a Python 3.11 virtual environment and install FMBench.

          python -m venv .fmbench\nsource .fmbench/bin/activate\npip install fmbench\n
        2. S3 buckets for test data, scripts, and results: Create two buckets within your AWS account:

          • Read bucket: This bucket contains tokenizer files, prompt template, source data and deployment scripts stored in a directory structure as shown below. FMBench needs to have read access to this bucket.

            s3://<read-bucket-name>\n    \u251c\u2500\u2500 source_data/\n    \u251c\u2500\u2500 source_data/<source-data-file-name>.json\n    \u251c\u2500\u2500 prompt_template/\n    \u251c\u2500\u2500 prompt_template/prompt_template.txt\n    \u251c\u2500\u2500 scripts/\n    \u251c\u2500\u2500 scripts/<deployment-script-name>.py\n    \u251c\u2500\u2500 tokenizer/\n    \u251c\u2500\u2500 tokenizer/tokenizer.json\n    \u251c\u2500\u2500 tokenizer/config.json\n
            • The details of the bucket structure are as follows:

              1. Source Data Directory: Create a source_data directory that stores the dataset you want to benchmark with. FMBench uses Q&A datasets from the LongBench dataset or alternatively from this link. Support for bring your own dataset will be added soon.

                • Download the different files specified in the LongBench dataset into the source_data directory. Following is a good list to get started with:

                  • 2wikimqa
                  • hotpotqa
                  • narrativeqa
                  • triviaqa

                  Store these files in the source_data directory.
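                  For example (a sketch using the placeholders from the bucket structure above), uploading a downloaded file to the read bucket looks like this:

                    # Upload a downloaded dataset file into the read bucket's source_data directory.
                    # Replace <read-bucket-name> and <source-data-file-name> with your own values.
                    aws s3 cp <source-data-file-name>.json s3://<read-bucket-name>/source_data/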

              2. Prompt Template Directory: Create a prompt_template directory that contains a prompt_template.txt file. This .txt file contains the prompt template that your specific model supports. FMBench already supports the prompt template compatible with Llama models.

              3. Scripts Directory: FMBench also supports a bring your own script (BYOS) mode for deploying models that are not natively available via SageMaker JumpStart i.e. anything not included in this list. Here are the steps to use BYOS.

                1. Create a Python script to deploy your model on a SageMaker endpoint. This script needs to have a deploy function that 2_deploy_model.ipynb can invoke. See p4d_hf_tgi.py for reference.

                2. Place your deployment script in the scripts directory in your read bucket. If your script deploys a model directly from HuggingFace and needs to have access to a HuggingFace auth token, then create a file called hf_token.txt and put the auth token in that file. The .gitignore file in this repo has rules to not commit the hf_token.txt to the repo. Today, FMBench provides inference scripts for:

                  • All SageMaker Jumpstart Models
                  • Text-Generation-Inference (TGI) container supported models
                  • Deep Java Library DeepSpeed container supported models

                  Deployment scripts for the options above are available in the scripts directory, you can use these as reference for creating your own deployment scripts as well.

              4. Tokenizer Directory: Place the tokenizer.json, config.json and any other files required for your model's tokenizer in the tokenizer directory. The tokenizer for your model should be compatible with the tokenizers package. FMBench uses AutoTokenizer.from_pretrained to load the tokenizer. As an example, to use the Llama 2 tokenizer for counting prompt and generation tokens for the Llama 2 family of models: accept the license via the meta approval form, download the tokenizer.json and config.json files from the Hugging Face website, and place them in the tokenizer directory.

          • Write bucket: All prompt payloads, model endpoint and metrics generated by FMBench are stored in this bucket. FMBench requires write permissions to store the results in this bucket. No directory structure needs to be pre-created in this bucket, everything is created by FMBench at runtime.

            ```{.bash}\ns3://<write-bucket-name>\n    \u251c\u2500\u2500 <test-name>\n    \u251c\u2500\u2500 <test-name>/data\n    \u251c\u2500\u2500 <test-name>/data/metrics\n    \u251c\u2500\u2500 <test-name>/data/models\n    \u251c\u2500\u2500 <test-name>/data/prompts\n```"}]} \ No newline at end of file diff --git a/sitemap.xml.gz b/sitemap.xml.gz index 2db7c08b..084fc14c 100755 Binary files a/sitemap.xml.gz and b/sitemap.xml.gz differ