
Benchmarking on an instance type with NVIDIA GPU and the Triton inference server

  1. Follow the steps in the Benchmarking on an instance type with NVIDIA GPUs or AWS Chips section to install FMBench, but do not run any benchmarking tests yet.

  2. Once FMBench is installed, install the following additional dependencies for Triton.

     cd ~
     git clone https://github.com/triton-inference-server/tensorrtllm_backend.git --branch v0.12.0
     # Update the submodules
     cd tensorrtllm_backend
     # Install git-lfs if needed
     apt-get update && apt-get install git-lfs -y --no-install-recommends
     git lfs install
     git submodule update --init --recursive

  3. Now you are ready to run benchmarking with Triton. For example, to benchmark the Llama3-8b model on a g5.12xlarge, use the following command:

    fmbench --config-file /tmp/fmbench-read/configs/llama3/8b/config-llama3-8b-g5.12xl-tp-2-mc-max-triton-ec2.yml --local-mode yes --write-bucket placeholder --tmp-dir /tmp > fmbench.log 2>&1
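    If the benchmarking run does not start cleanly, one optional sanity check (paths assumed from step 2 above, not an FMBench requirement) is to confirm that the Triton backend repository was cloned with its submodules and LFS objects in place:

    cd ~/tensorrtllm_backend
    git submodule status | head -5   # submodules should show pinned commit hashes
    git lfs ls-files | head -5       # should list LFS-tracked files if git-lfs was set up correctly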
     
    Benchmark foundation models on AWS

    FMBench is a Python package for running performance benchmarks for any Foundation Model (FM) deployed on any AWS Generative AI service, be it Amazon SageMaker, Amazon Bedrock, Amazon EKS, or Amazon EC2. The FMs can be deployed on these platforms directly through FMBench or, if they are already deployed, they can be benchmarked through the Bring your own endpoint mode supported by FMBench.

    Here are some salient features of FMBench:

    1. Highly flexible: it allows any combination of instance types (g5, p4d, p5, Inf2), inference containers (DeepSpeed, TensorRT, HuggingFace TGI and others), and parameters such as tensor parallelism and rolling batch, as long as those are supported by the underlying platform.

    2. Benchmark any model: it can be used to benchmark open-source models, third-party models, and proprietary models trained by enterprises on their own data.

    3. Run anywhere: it can be run on any AWS platform where we can run Python, such as Amazon EC2, Amazon SageMaker, or even the AWS CloudShell. It is important to run this tool on an AWS platform so that internet round-trip time does not get included in the end-to-end response latency.

    The need for benchmarking

    Customers often wonder which AWS service is best for running FMs for their specific use-case and price performance requirements. While model evaluation metrics are available on several leaderboards (HELM, LMSys), price performance comparisons can be notoriously hard to find and even harder to trust. In such a scenario, we think it is best to run performance benchmarking yourself, either on your own dataset or on a similar (task-wise, prompt-size-wise) open-source dataset such as LongBench or QMSum. This is the problem that FMBench solves.

    FMBench: an open-source Python package for FM benchmarking on AWS

    FMBench runs inference requests against endpoints that are either deployed through FMBench itself (as in the case of SageMaker), available as a fully-managed endpoint (as in the case of Bedrock), or available as a bring-your-own endpoint. Metrics such as inference latency, transactions per minute, error rates, and cost per transaction are captured and presented in the form of a Markdown report containing explanatory text, tables, and figures. The figures and tables in the report provide insights into what might be the best serving stack (instance type, inference container, and configuration parameters) for a given FM for a given use-case.

    The following figure gives an example of the price performance numbers, including inference latency, transactions per minute, and concurrency level, for running the Llama2-13b model on different instance types available on SageMaker using prompts for a Q&A task created from the LongBench dataset; these prompts are between 3000 and 3840 tokens in length. Note that the numbers are hidden in this figure but you would be able to see them when you run FMBench yourself.

    The following table (also included in the report) provides information about the best available instance type for that experiment1.

    Information                    Value
    experiment_name                llama2-13b-inf2.24xlarge
    payload_file                   payload_en_3000-3840.jsonl
    instance_type                  ml.inf2.24xlarge
    concurrency                    **
    error_rate                     **
    prompt_token_count_mean        3394
    prompt_token_throughput        2400
    completion_token_count_mean    31
    completion_token_throughput    15
    latency_mean                   **
    latency_p50                    **
    latency_p95                    **
    latency_p99                    **
    transactions_per_minute        **
    price_per_txn                  **

    1 ** values hidden on purpose, these are available when you run the tool yourself.

    The report also includes latency vs prompt size charts for different concurrency levels. As expected, inference latency increases as prompt size increases, but what is interesting to note is that the increase is much larger at higher concurrency levels (and this behavior varies with instance types).

    Determine the optimal model for your generative AI workload

    Use FMBench to determine model accuracy using a panel of LLM evaluators (PoLL [1]). Here is one of the plots generated by FMBench to help answer the accuracy question for various FMs on Amazon Bedrock (the model ids in the charts have been blurred out on purpose; you can find them in the actual plot generated when running FMBench).

    References

    [1] Pat Verga et al., "Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models", arXiv:2404.18796, 2024.

    Model evaluations using panel of LLM evaluators

    FMBench release 2.0.0 adds support for evaluating candidate models using Majority Voting with a Panel of LLM Evaluators (PoLL). It gathers quantitative metrics such as Cosine Similarity and overall majority voting accuracy metrics to measure the similarity and accuracy of model responses compared to the ground truth.

    Accuracy is defined as the percentage of responses generated by the LLM that match the ground truth included in the dataset (as a separate column). In order to determine if an LLM-generated response matches the ground truth, we ask other LLMs, called the evaluator LLMs, to compare the LLM output and the ground truth and provide a verdict on whether the LLM-generated response is correct or not given the ground truth. Here is the link to the Anthropic Claude 3 Sonnet model prompt being used as an evaluator (or a judge model). A combination of the cosine similarity and the LLM evaluator verdict decides if the LLM-generated response is correct or incorrect. Finally, a single LLM evaluator could be biased or have inaccuracies, so instead of relying on the judgement of a single evaluator, we rely on the majority vote of 3 different LLM evaluators. By default we use the Anthropic Claude 3 Sonnet, Meta Llama3-70b and the Cohere Command R plus model as LLM evaluators. See Pat Verga et al., "Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models", arXiv:2404.18796, 2024, for more details on using a Panel of LLM Evaluators (PoLL).

    Evaluation Flow
    1. Provide a dataset that includes ground truth responses for each sample. FMBench uses the LongBench dataset by default.

    2. Configure the candidate models to be evaluated in the FMBench config file. See this config file for an example that runs evaluations for multiple models available via Amazon Bedrock. Running evaluations only requires the following two changes to the config file:

      • Set the 4_get_evaluations.ipynb: yes, see this line.
      • Set the ground_truth_col_key: answers and question_col_key: input parameters, see this line. The values of ground_truth_col_key and question_col_key are set to the names of the columns in the dataset that contain the ground truth and the question, respectively. (A quick way to verify these settings is shown in the snippet after this list.)
    3. Run FMBench, which will:

    4. Fetch the inference results containing the model responses

    5. Calculate quantitative metrics (Cosine Similarity)

    6. Use a Panel of LLM Evaluators to compare each model response to the ground truth

    7. Each LLM evaluator will provide a binary verdict (correct/incorrect) and an explanation

    8. Validate the LLM evaluations using Cosine Similarity thresholds

    9. Categorize the final evaluation for each response as correctly correct, correctly incorrect, or needs further evaluation

    10. Review the FMBench report to analyze the evaluation results and compare the performance of the candidate models. The report contains tables and charts that provide insights into model accuracy.
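    As a rough illustration of the configuration changes called out in step 2 above, you could copy a packaged config file and confirm the evaluation-related settings before a run. The file path below is an assumption based on the copy_s3_content.sh layout used elsewhere in these docs; the key names are the ones listed in step 2.

      # make a local copy of a packaged config file (path assumed)
      cp /tmp/fmbench-read/configs/bedrock/config-bedrock-llama3.yml my-eval-config.yml

      # the evaluation-related settings should read:
      #   4_get_evaluations.ipynb: yes
      #   ground_truth_col_key: answers
      #   question_col_key: input
      grep -nE "4_get_evaluations.ipynb|ground_truth_col_key|question_col_key" my-eval-config.yml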

    By leveraging ground truth data and a Panel of LLM Evaluators, FMBench provides a comprehensive and efficient way to assess the quality of generative AI models. The majority voting approach, combined with quantitative metrics, enables a robust evaluation that reduces bias and latency while maintaining consistency across responses.

    Advanced

    Beyond running FMBench with the configuration files provided, you may want to try out bringing your own dataset or endpoint to FMBench.

    Generate downstream summarized reports for further analysis

    You can use several results from various FMBench runs to generate a summarized report of all runs based on your cost, latency, and concurrency budgets. This report helps answer the following question:

    "What is the minimum number of instances N, of the most cost-optimal instance type T, that are needed to serve a real-time workload W while keeping the average transaction latency under L seconds?"

    W := {R transactions per minute, average prompt token length P, average generation token length G}
    • With this summarized report, we test the following hypothesis: at the low end of total requests per minute, smaller instances that provide good inference latency at low concurrencies suffice (said another way, the larger, more expensive instances are overkill at this stage); but as requests per minute increase, there comes an inflection point beyond which so many of the smaller instances would be required that it becomes more economical to use fewer instances of the larger, more expensive type.

    An example report that gets generated is as follows:

    Summary for payload: payload_en_x-y
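    To make the arithmetic behind the summary table concrete, the instance count for a given workload comes down to a ceiling division, and the cost is the count times the per-instance price. The sketch below uses made-up numbers; the workload rate, per-instance throughput, and hourly price are all assumptions, not FMBench outputs.

      workload_rpm=1000      # R: requests per minute the workload must sustain (assumed)
      per_instance_tpm=220   # transactions per minute one instance sustained in a run (assumed)
      hourly_price=10        # price per instance-hour in dollars (assumed)

      # ceil(workload_rpm / per_instance_tpm) instances are needed to serve the workload
      instances=$(( (workload_rpm + per_instance_tpm - 1) / per_instance_tpm ))
      echo "instances needed: ${instances}, hourly cost: $(( instances * hourly_price )) dollars"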
    • The metrics below in the table are examples and do not represent any specific model or instance type. This table can be used to analyze the cost and instance-maintenance tradeoffs for your use case. For example, instance_type_1 costs 10 dollars and requires 1 instance to host model_1 until it can handle 100 requests per minute. As the requests scale to 1,000 requests per minute, 5 instances are required and cost 50 dollars. As the requests scale to 10,000 requests per minute, the number of instances to maintain scales to 30, and the cost becomes 450 dollars.

    • On the other hand, instance_type_2 is more costly, with a price of $499 for 10,000 requests per minute to host the same model, but only requires 22 instances to maintain, which is 8 fewer than when the model is hosted on instance_type_1.

    • Based on these summaries, users can make decisions based on their use case priorities. For a real-time and latency-sensitive application, a user might select instance_type_2 to host model_1 since the user would have to maintain 8 fewer instances than when hosting the model on instance_type_1. Hosting the model on instance_type_2 would also maintain the p_95 latency (0.5s), which is half that of instance_type_1 (p_95 latency: 1s), even though it costs more than instance_type_1. On the other hand, if the application is cost sensitive, and the user is flexible enough to maintain more instances at a higher latency, they might want to shift gears to using instance_type_1.

    • Note: Based on varying needs for prompt size, cost, and latency, the table might change.

    experiment_name  instance_type    concurrency  latency_p95  transactions_per_minute  instance_count_and_cost_1_rpm  instance_count_and_cost_10_rpm  instance_count_and_cost_100_rpm  instance_count_and_cost_1000_rpm  instance_count_and_cost_10000_rpm
    model_1          instance_type_1  1            1.0          _                        (1, 10)                        (1, 10)                         (1, 10)                          (5, 50)                           (30, 450)
    model_1          instance_type_2  1            0.5          _                        (1, 10)                        (1, 20)                         (1, 20)                          (6, 47)                           (22, 499)

    FMBench Heatmap

    This step also generates a heatmap that contains information about each instance and how much it costs, with a per requests-per-minute (rpm) breakdown. The default breakdown is [1 rpm, 10 rpm, 100 rpm, 1000 rpm, 10000 rpm]. View an example of a heatmap below. The model name and instance type are masked but can be generated for your specific use case/requirements.

    Steps to run analytics
    1. Clone the FMBench repo from GitHub.

    2. Place all of the result-{model-id}-... folders that are generated from various runs in the top level directory.

    3. Run the following command to generate downstream analytics and summarized tables. Replace x, y, z and model_id with the latency, concurrency thresholds, payload file of interest (for example payload_en_1000-2000.jsonl) and the model_id respectively. The model_id would have to be appended to the results-{model-id} folders so the analytics.py file can generate a report for all of those respective result folders.

      python analytics/analytics.py --latency-threshold x --concurrency-threshold y  --payload-file z --model-id model_id\n
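      For instance, a run with a 2-second latency threshold, a concurrency threshold of 10, the payload file mentioned above, and a model id matching the suffix on your results-{model-id} folders might look like the following; all of the values here are illustrative placeholders, not recommendations.

      python analytics/analytics.py \
        --latency-threshold 2 \
        --concurrency-threshold 10 \
        --payload-file payload_en_1000-2000.jsonl \
        --model-id llama3-8b-instruct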
    "},{"location":"announcement.html","title":"Release 2.0 announcement","text":"

    We are excited to share news about a major FMBench release: FMBench 2.0 supports model evaluations through a panel of LLM evaluators 🎉. With the recent feature additions to FMBench we are already seeing increased interest from customers and hope to reach even more customers and have an even greater impact. Check out all the latest and greatest features from FMBench on the FMBench website.

    Support for Model Evaluations: FMBench now adds support for evaluating candidate models using Majority Voting with a Panel of LLM Evaluators. Customers can now use FMBench to evaluate model accuracy across open-source and custom datasets; thus FMBench now enables customers to measure not only performance (inference latency, cost, throughput) but also model accuracy.

    Native support for LLM compilation and deployment on AWS Silicon: FMBench now supports end-to-end compilation and model deployment on AWS Silicon. Customers no longer have to wait for models to be available for AWS Chips via SageMaker JumpStart, nor do they have to go through the process of compiling the model to Neuron themselves; FMBench does it all for them. Simply put the relevant configuration options in the FMBench config file and it will compile and deploy the model on SageMaker (config) or EC2 (config).

    Website for better user experience: FMBench has a website now along with an introduction video. The website is fully searchable to ease common tasks such as installation, finding the right config file, benchmarking on various hosting platforms (EC2, EKS, Bedrock, Neuron, Docker), model evaluation, etc. This website was created based on feedback from several internal teams and external customers.

    Native support for all AWS generative AI services: FMBench now benchmarks and evaluates any Foundation Model (FM) deployed on any AWS Generative AI service, be it Amazon SageMaker, Amazon Bedrock, Amazon EKS, or Amazon EC2. We initially built FMBench for SageMaker, later extended it to Bedrock, and then, based on customer requests, extended it to support models on EKS and EC2 as well. See the list of config files supported out of the box; you can use these config files either as-is or as templates for creating your own custom config.

    Benchmark models deployed on different AWS Generative AI services

    FMBench comes packaged with configuration files for benchmarking models on different AWS Generative AI services.

    Full list of benchmarked models

    Model EC2 g5 EC2 p4 EC2 p5 EC2 Inf2/Trn1 SageMaker g4dn/g5/p3 SageMaker Inf2/Trn1 SageMaker P4 SageMaker P5 Bedrock On-demand throughput Bedrock provisioned throughput Anthropic Claude-3 Sonnet ✅ ✅ Anthropic Claude-3 Haiku ✅ Mistral-7b-instruct ✅ ✅ ✅ ✅ ✅ Mistral-7b-AWQ ✅ Mixtral-8x7b-instruct ✅ Llama3.1-8b instruct ✅ ✅ ✅ ✅ ✅ ✅ ✅ Llama3.1-70b instruct ✅ ✅ ✅ Llama3-8b instruct ✅ ✅ ✅ ✅ ✅ ✅ ✅ Llama3-70b instruct ✅ ✅ ✅ ✅ ✅ Llama2-13b chat ✅ ✅ ✅ ✅ Llama2-70b chat ✅ ✅ ✅ ✅ Amazon Titan text lite ✅ Amazon Titan text express ✅ Cohere Command text ✅ Cohere Command light text ✅ AI21 J2 Mid ✅ AI21 J2 Ultra ✅ Gemma-2b ✅ Phi-3-mini-4k-instruct ✅ distilbert-base-uncased ✅

    Benchmark models on Bedrock

    Choose any config file from the bedrock folder and either run these directly or use them as templates for creating new config files specific to your use-case. Here is an example for benchmarking the Llama3 models on Bedrock.

    fmbench --config-file https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/src/fmbench/configs/bedrock/config-bedrock-llama3.yml > fmbench.log 2>&1

    Benchmark models on EC2

    You can use FMBench to benchmark models hosted on EC2. This can be done in one of two ways:

    • Deploy the model on your EC2 instance independently of FMBench and then benchmark it through the Bring your own endpoint mode.
    • Deploy the model on your EC2 instance through FMBench and then benchmark it.

    The steps for deploying the model on your EC2 instance are described below.

    👉 In this configuration both the model being benchmarked and FMBench are deployed on the same EC2 instance.

    Create a new EC2 instance suitable for hosting an LMI as per the steps described here. Note that you will need to select the correct AMI based on your instance type, this is called out in the instructions.

    The steps for benchmarking on different types of EC2 instances (GPU/CPU/Neuron) and different inference containers differ slightly. These are all described below.

    Benchmarking options on EC2
    • Benchmarking on an instance type with NVIDIA GPUs or AWS Chips
    • Benchmarking on an instance type with NVIDIA GPU and the Triton inference server
    • Benchmarking on an instance type with AWS Chips and the Triton inference server
    • Benchmarking on a CPU instance type with AMD processors
    • Benchmarking on a CPU instance type with Intel processors

    • Benchmarking the Triton inference server

    Benchmarking on an instance type with NVIDIA GPUs or AWS Chips
    1. Connect to your instance using any of the options in EC2 (SSH/EC2 Connect), run the following in the EC2 terminal. This command installs Anaconda on the instance which is then used to create a new conda environment for FMBench.

      wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
      bash Miniconda3-latest-Linux-x86_64.sh -b  # Run the Miniconda installer in batch mode (no manual intervention)
      rm -f Miniconda3-latest-Linux-x86_64.sh    # Remove the installer script after installation
      eval "$(/home/$USER/miniconda3/bin/conda shell.bash hook)" # Initialize conda for bash shell
      conda init  # Initialize conda, adding it to the shell
    2. Install docker-compose.

      sudo apt-get update
      sudo apt-get install --reinstall docker.io -y
      sudo apt-get install -y docker-compose
      docker compose version
    3. Setup the fmbench_python311 conda environment.

      conda create --name fmbench_python311 -y python=3.11 ipykernel
      source activate fmbench_python311
      pip install -U fmbench
    4. Create local directory structure needed for FMBench and copy all publicly available dependencies from the AWS S3 bucket for FMBench. This is done by running the copy_s3_content.sh script available as part of the FMBench repo. Replace /tmp in the command below with a different path if you want to store the config files and the FMBench generated data in a different directory.

      curl -s https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/copy_s3_content.sh | sh -s -- /tmp\n
    5. To download the model files from HuggingFace, create a hf_token.txt file in the /tmp/fmbench-read/scripts/ directory containing the Hugging Face token you would like to use. In the command below replace the hf_yourtokenstring with your Hugging Face token.

      echo hf_yourtokenstring > /tmp/fmbench-read/scripts/hf_token.txt\n
    6. Run FMBench with a packaged or a custom config file. This step will also deploy the model on the EC2 instance. The --write-bucket parameter value is just a placeholder and an actual S3 bucket is not required. Skip to the next step if benchmarking for AWS Chips. You could set the --tmp-dir flag to an EFA path instead of /tmp if using a shared path for storing config files and reports.

      fmbench --config-file /tmp/fmbench-read/configs/llama3/8b/config-ec2-llama3-8b.yml --local-mode yes --write-bucket placeholder --tmp-dir /tmp > fmbench.log 2>&1\n
    7. For example, to run FMBench on a llama3-8b-Instruct model on an inf2.48xlarge instance, run the command below. The config file for this example can be viewed here.

      fmbench --config-file /tmp/fmbench-read/configs/llama3/8b/config-ec2-llama3-8b-inf2-48xl.yml --local-mode yes --write-bucket placeholder --tmp-dir /tmp > fmbench.log 2>&1\n
    8. Open a new Terminal and do a tail on fmbench.log to see a live log of the run.

      tail -f fmbench.log\n
    9. All metrics are stored in the /tmp/fmbench-write directory created automatically by the fmbench package. Once the run completes all files are copied locally in a results-* folder as usual.
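    Once the run above completes, a quick way to confirm that the metrics and report were produced is to list the output locations; the folder name follows the results-* pattern mentioned in step 9, while the exact file names inside vary by config file.

      ls /tmp/fmbench-write | head   # intermediate metrics written during the run
      ls -d results-*                # final output folder(s) copied locally
      ls results-*/ | head           # report and supporting files for the run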

    Benchmarking on an instance type with NVIDIA GPU and the Triton inference server
    1. No special procedure is needed; just follow the steps in the Benchmarking on an instance type with NVIDIA GPUs or AWS Chips section and then run FMBench with a config file for Triton. For example, to benchmark the Llama3-8b model on a g5.12xlarge, use the following command (after completing the steps for setting up FMBench).

      fmbench --config-file /tmp/fmbench-read/configs/llama3/8b/config-llama3-8b-g5.12xl-tp-2-mc-max-triton-ec2.yml --local-mode yes --write-bucket placeholder --tmp-dir /tmp > fmbench.log 2>&1

    Benchmarking on an instance type with AWS Chips and the Triton inference server

    As of 2024-09-26 this has been tested on a trn1.32xlarge instance

    1. Connect to your instance using any of the options in EC2 (SSH/EC2 Connect), run the following in the EC2 terminal. This command installs Anaconda on the instance which is then used to create a new conda environment for FMBench. See instructions for downloading anaconda here. (Note: Configure the storage of your EC2 instance to 500GB for this test)

      # Install Docker and Git using the YUM package manager
      sudo yum install docker git -y

      # Start the Docker service
      sudo systemctl start docker

      # Download the Miniconda installer for Linux
      wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
      bash Miniconda3-latest-Linux-x86_64.sh -b  # Run the Miniconda installer in batch mode (no manual intervention)
      rm -f Miniconda3-latest-Linux-x86_64.sh    # Remove the installer script after installation
      eval "$(/home/$USER/miniconda3/bin/conda shell.bash hook)" # Initialize conda for bash shell
      conda init  # Initialize conda, adding it to the shell
    2. Setup the fmbench_python311 conda environment.

      # Create a new conda environment named 'fmbench_python311' with Python 3.11 and ipykernel
      conda create --name fmbench_python311 -y python=3.11 ipykernel

      # Activate the newly created conda environment
      source activate fmbench_python311

      # Upgrade pip and install the fmbench package
      pip install -U fmbench
    3. First we need to build the required docker image for triton, and push it locally. To do this, curl the Triton Dockerfile and the script to build and push the triton image locally:

      # curl the docker file for triton
      curl -o ./Dockerfile_triton https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/src/fmbench/scripts/triton/Dockerfile_triton

      # curl the script that builds and pushes the triton image locally
      curl -o build_and_push_triton.sh https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/src/fmbench/scripts/triton/build_and_push_triton.sh

      # Make the triton build and push script executable, and run it
      chmod +x build_and_push_triton.sh
      ./build_and_push_triton.sh
      - Now wait until the docker image is saved locally and then follow the instructions below to start a benchmarking test.

    4. Create local directory structure needed for FMBench and copy all publicly available dependencies from the AWS S3 bucket for FMBench. This is done by running the copy_s3_content.sh script available as part of the FMBench repo. Replace /tmp in the command below with a different path if you want to store the config files and the FMBench generated data in a different directory.

      curl -s https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/copy_s3_content.sh | sh -s -- /tmp\n
    5. To download the model files from HuggingFace, create a hf_token.txt file in the /tmp/fmbench-read/scripts/ directory containing the Hugging Face token you would like to use. In the command below replace the hf_yourtokenstring with your Hugging Face token.

      echo hf_yourtokenstring > /tmp/fmbench-read/scripts/hf_token.txt\n
    6. Run FMBench with a packaged or a custom config file. This step will also deploy the model on the EC2 instance. The --write-bucket parameter value is just a placeholder and an actual S3 bucket is not required. You could set the --tmp-dir flag to an EFA path instead of /tmp if using a shared path for storing config files and reports.

      fmbench --config-file /tmp/fmbench-read/configs/llama3/8b/config-llama3-8b-trn1-32xlarge-triton-vllm.yml --local-mode yes --write-bucket placeholder --tmp-dir /tmp > fmbench.log 2>&1\n
    7. Open a new Terminal and do a tail on fmbench.log to see a live log of the run.

      tail -f fmbench.log\n
    8. All metrics are stored in the /tmp/fmbench-write directory created automatically by the fmbench package. Once the run completes all files are copied locally in a results-* folder as usual.
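    If the run in step 6 cannot find a local Triton image, it is worth confirming that build_and_push_triton.sh from step 3 completed. The exact repository and tag name are set inside that script, so the filter below is only a guess:

      docker images | grep -i triton || echo "no local image matching 'triton' found yet"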

    Benchmarking on a CPU instance type with AMD processors

    As of 2024-08-27 this has been tested on a m7a.16xlarge instance

    1. Connect to your instance using any of the options in EC2 (SSH/EC2 Connect), run the following in the EC2 terminal. This command installs Anaconda on the instance which is then used to create a new conda environment for FMBench. See instructions for downloading anaconda here

      # Install Docker and Git using the YUM package manager
      sudo yum install docker git -y

      # Start the Docker service
      sudo systemctl start docker

      # Download the Miniconda installer for Linux
      wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
      bash Miniconda3-latest-Linux-x86_64.sh -b  # Run the Miniconda installer in batch mode (no manual intervention)
      rm -f Miniconda3-latest-Linux-x86_64.sh    # Remove the installer script after installation
      eval "$(/home/$USER/miniconda3/bin/conda shell.bash hook)" # Initialize conda for bash shell
      conda init  # Initialize conda, adding it to the shell
    2. Setup the fmbench_python311 conda environment.

      # Create a new conda environment named 'fmbench_python311' with Python 3.11 and ipykernel
      conda create --name fmbench_python311 -y python=3.11 ipykernel

      # Activate the newly created conda environment
      source activate fmbench_python311

      # Upgrade pip and install the fmbench package
      pip install -U fmbench
    3. Build the vllm container for serving the model.

      1. 👉 The vllm container we are building locally is going to be referenced in the FMBench config file.

      2. The container being built is for CPU only (GPU support might be added in the future).

        # Clone the vLLM project repository from GitHub
        git clone https://github.com/vllm-project/vllm.git

        # Change the directory to the cloned vLLM project
        cd vllm

        # Build a Docker image using the provided Dockerfile for CPU, with a shared memory size of 4GB
        sudo docker build -f Dockerfile.cpu -t vllm-cpu-env --shm-size=4g .
    4. Create local directory structure needed for FMBench and copy all publicly available dependencies from the AWS S3 bucket for FMBench. This is done by running the copy_s3_content.sh script available as part of the FMBench repo. Replace /tmp in the command below with a different path if you want to store the config files and the FMBench generated data in a different directory.

      curl -s https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/copy_s3_content.sh | sh -s -- /tmp\n
    5. To download the model files from HuggingFace, create a hf_token.txt file in the /tmp/fmbench-read/scripts/ directory containing the Hugging Face token you would like to use. In the command below replace the hf_yourtokenstring with your Hugging Face token.

      echo hf_yourtokenstring > /tmp/fmbench-read/scripts/hf_token.txt\n
    6. Before running FMBench, add the current user to the docker group. Run the following commands to run Docker without needing to use sudo each time.

      sudo usermod -a -G docker $USER
      newgrp docker
    7. Install docker-compose.

      DOCKER_CONFIG=${DOCKER_CONFIG:-$HOME/.docker}
      mkdir -p $DOCKER_CONFIG/cli-plugins
      sudo curl -L https://github.com/docker/compose/releases/latest/download/docker-compose-$(uname -s)-$(uname -m) -o $DOCKER_CONFIG/cli-plugins/docker-compose
      sudo chmod +x $DOCKER_CONFIG/cli-plugins/docker-compose
      docker compose version
    8. Run FMBench with a packaged or a custom config file. This step will also deploy the model on the EC2 instance. The --write-bucket parameter value is just a placeholder and an actual S3 bucket is not required. You could set the --tmp-dir flag to an EFA path instead of /tmp if using a shared path for storing config files and reports.

      fmbench --config-file /tmp/fmbench-read/configs/llama3/8b/config-ec2-llama3-8b-m7a-16xlarge.yml --local-mode yes --write-bucket placeholder --tmp-dir /tmp > fmbench.log 2>&1\n
    9. Open a new Terminal and do a tail on fmbench.log to see a live log of the run.

      tail -f fmbench.log\n
    10. All metrics are stored in the /tmp/fmbench-write directory created automatically by the fmbench package. Once the run completes all files are copied locally in a results-* folder as usual.

    Benchmarking on a CPU instance type with Intel processors

    As of 2024-08-27 this has been tested on c5.18xlarge and m5.16xlarge instances

    1. Connect to your instance using any of the options in EC2 (SSH/EC2 Connect), run the following in the EC2 terminal. This command installs Anaconda on the instance which is then used to create a new conda environment for FMBench. See instructions for downloading anaconda here

      # Install Docker and Git using the YUM package manager
      sudo yum install docker git -y

      # Start the Docker service
      sudo systemctl start docker

      # Download the Miniconda installer for Linux
      wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
      bash Miniconda3-latest-Linux-x86_64.sh -b # Run the Miniconda installer in batch mode (no manual intervention)
      rm -f Miniconda3-latest-Linux-x86_64.sh    # Remove the installer script after installation
      eval "$(/home/$USER/miniconda3/bin/conda shell.bash hook)" # Initialize conda for bash shell
      conda init  # Initialize conda, adding it to the shell
    2. Setup the fmbench_python311 conda environment.

      # Create a new conda environment named 'fmbench_python311' with Python 3.11 and ipykernel
      conda create --name fmbench_python311 -y python=3.11 ipykernel

      # Activate the newly created conda environment
      source activate fmbench_python311

      # Upgrade pip and install the fmbench package
      pip install -U fmbench
    3. Build the vllm container for serving the model.

      1. 👉 The vllm container we are building locally is going to be referenced in the FMBench config file.

      2. The container being built is for CPU only (GPU support might be added in the future).

        # Clone the vLLM project repository from GitHub
        git clone https://github.com/vllm-project/vllm.git

        # Change the directory to the cloned vLLM project
        cd vllm

        # Build a Docker image using the provided Dockerfile for CPU, with a shared memory size of 12GB
        sudo docker build -f Dockerfile.cpu -t vllm-cpu-env --shm-size=12g .
    4. Create local directory structure needed for FMBench and copy all publicly available dependencies from the AWS S3 bucket for FMBench. This is done by running the copy_s3_content.sh script available as part of the FMBench repo. Replace /tmp in the command below with a different path if you want to store the config files and the FMBench generated data in a different directory.

      curl -s https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/copy_s3_content.sh | sh -s -- /tmp\n
    5. To download the model files from HuggingFace, create a hf_token.txt file in the /tmp/fmbench-read/scripts/ directory containing the Hugging Face token you would like to use. In the command below replace the hf_yourtokenstring with your Hugging Face token.

      echo hf_yourtokenstring > /tmp/fmbench-read/scripts/hf_token.txt\n
    6. Before running FMBench, add the current user to the docker group. Run the following commands to run Docker without needing to use sudo each time.

      sudo usermod -a -G docker $USER
      newgrp docker
    7. Install docker-compose.

      DOCKER_CONFIG=${DOCKER_CONFIG:-$HOME/.docker}
      mkdir -p $DOCKER_CONFIG/cli-plugins
      sudo curl -L https://github.com/docker/compose/releases/latest/download/docker-compose-$(uname -s)-$(uname -m) -o $DOCKER_CONFIG/cli-plugins/docker-compose
      sudo chmod +x $DOCKER_CONFIG/cli-plugins/docker-compose
      docker compose version
    8. Run FMBench with a packaged or a custom config file. This step will also deploy the model on the EC2 instance. The --write-bucket parameter value is just a placeholder and an actual S3 bucket is not required. You could set the --tmp-dir flag to an EFA path instead of /tmp if using a shared path for storing config files and reports.

      fmbench --config-file /tmp/fmbench-read/configs/llama3/8b/config-ec2-llama3-8b-c5-18xlarge.yml --local-mode yes --write-bucket placeholder --tmp-dir /tmp > fmbench.log 2>&1\n
    9. Open a new Terminal and do a tail on fmbench.log to see a live log of the run.

      tail -f fmbench.log\n
    10. All metrics are stored in the /tmp/fmbench-write directory created automatically by the fmbench package. Once the run completes all files are copied locally in a results-* folder as usual.

    Benchmark models on EKS

    You can use FMBench to benchmark models hosted on EKS. This can be done in one of two ways:

    • Deploy the model on your EKS cluster independently of FMBench and then benchmark it through the Bring your own endpoint mode.
    • Deploy the model on your EKS cluster through FMBench and then benchmark it.

    The steps for deploying the model on your EKS cluster are described below.

    👉 EKS cluster creation itself is not a part of the FMBench functionality; the cluster needs to exist before you run the following steps. Steps for cluster creation are provided in this file, but it is best to consult the DoEKS repo on GitHub for comprehensive instructions.

    1. Add the following IAM policies to your existing FMBench Role:

      1. AmazonEKSClusterPolicy: This policy provides Kubernetes the permissions it requires to manage resources on your behalf.

      2. AmazonEKS_CNI_Policy: This policy provides the Amazon VPC CNI Plugin (amazon-vpc-cni-k8s) the permissions it requires to modify the IP address configuration on your EKS worker nodes. This permission set allows the CNI to list, describe, and modify Elastic Network Interfaces on your behalf.

      3. AmazonEKSWorkerNodePolicy: This policy allows Amazon EKS worker nodes to connect to Amazon EKS Clusters.
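      For example, these managed policies could be attached from the AWS CLI; the role name below is a placeholder for your actual FMBench role:

        aws iam attach-role-policy --role-name <your-fmbench-role> --policy-arn arn:aws:iam::aws:policy/AmazonEKSClusterPolicy
        aws iam attach-role-policy --role-name <your-fmbench-role> --policy-arn arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
        aws iam attach-role-policy --role-name <your-fmbench-role> --policy-arn arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy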

    2. Once the EKS cluster is available you can use either of the following two config files, or create your own using these files as examples, to run benchmarking for these models. These config files require that the EKS cluster has been created as per the steps in these instructions.

      1. config-llama3-8b-eks-inf2.yml: Deploy Llama3 on Trn1/Inf2 instances.

      2. config-mistral-7b-eks-inf2.yml: Deploy Mistral 7b on Trn1/Inf2 instances.

      For more information about the blueprints used by FMBench to deploy these models, view: DoEKS docs gen-ai.

    3. Run the Llama3-8b benchmarking using the command below (replace the config file as needed for a different model). This will first deploy the model on your EKS cluster and then run benchmarking on the deployed model.

      fmbench --config-file https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/src/fmbench/configs/llama3/8b/config-llama3-8b-eks-inf2.yml > fmbench.log 2>&1
    4. As the model is getting deployed you might want to run the following kubectl commands to monitor the deployment progress. Set the model_namespace to llama3 or mistral or a different model as appropriate.

      1. kubectl get pods -n <model_namespace> -w: Watch the pods in the model specific namespace.
      2. kubectl -n karpenter get pods: Get the pods in the karpenter namespace.
      3. kubectl describe pod -n <model_namespace> <pod-name>: Describe a specific pod in the model-specific namespace to view the live logs.
    "},{"location":"benchmarking_on_sagemaker.html","title":"Benchmark models on SageMaker","text":"

    Choose any config file from the model specific folders, for example the Llama3 folder for Llama3 family of models. These configuration files also include instructions for FMBench to first deploy the model on SageMaker using your configured instance type and inference parameters of choice and then run the benchmarking. Here is an example for benchmarking Llama3-8b model on an ml.inf2.24xlarge and ml.g5.12xlarge instance.

    fmbench --config-file https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/src/fmbench/configs/llama3/8b/config-llama3-8b-inf2-g5.yml > fmbench.log 2>&1
    "},{"location":"build.html","title":"Building the FMBench Python package","text":"

    If you would like to build a dev version of FMBench for your own development and testing purposes, the following steps describe how to do that.

    1. Clone the FMBench repo from GitHub.

    2. Make any code changes as needed.

    3. Install poetry.

      pip install poetry mkdocs-material mknotebooks
    4. Change directory to the FMBench repo directory and run poetry build.

      poetry build
    5. The .whl file is generated in the dist folder. Install the .whl in your current Python environment.

      pip install dist/fmbench-X.Y.Z-py3-none-any.whl
    6. Run FMBench as usual through the FMBench CLI command.

    7. You may have added new config files as part of your work; to make sure these files are called out in manifest.txt, run the following command. This command will overwrite the existing manifest.txt and manifest.md files, and both files need to be committed to the repo. Reach out to the maintainers of this repo so that they can add new or modified config files to the blogs bucket (the CloudFormation stack would fail if a new file is added to the manifest but is not available for download through the S3 bucket).

      python create_manifest.py
    8. To publish updated documentation run the following command. You need to be added as a contributor to the FMBench repo to be able to publish to the website, so this command will not work for you otherwise.

      mkdocs gh-deploy
    "},{"location":"byo_dataset.html","title":"Bring your own dataset","text":"

    By default FMBench uses the LongBench dataset for testing the models, but this is not the only dataset you can test with. You may want to test with other datasets available on HuggingFace or use your own datasets for testing. You can do this by converting your dataset to the JSON lines format. We provide a code sample for converting any HuggingFace dataset into JSON lines format and uploading it to the S3 bucket used by FMBench in the bring_your_own_dataset notebook. Follow the steps described in the notebook to bring your own dataset for testing with FMBench.
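    As a minimal sketch of what the notebook does (the dataset name and output file below are only examples, and depending on your datasets version loading LongBench may require trust_remote_code=True):

      # convert a HuggingFace dataset split to JSON lines for use with FMBench
      from datasets import load_dataset

      ds = load_dataset("THUDM/LongBench", "2wikimqa", split="test")
      ds.to_json("2wikimqa.jsonl", lines=True)  # writes one JSON object per line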

    "},{"location":"byo_dataset.html#support-for-open-orca-dataset","title":"Support for Open-Orca dataset","text":"

    For support for the Open-Orca dataset and the corresponding prompts for Llama3, Llama2 and Mistral, see:

    1. bring_your_own_dataset.ipynb
    2. prompt templates
    3. Llama3 config file with OpenOrca
    "},{"location":"byo_rest_predictor.html","title":"Bring your own REST Predictor (data-on-eks version)","text":"

    FMBench now provides an example of bringing your own endpoint as a REST Predictor for benchmarking. View this script as an example. This script is an inference file for the NousResearch/Llama-2-13b-chat-hf model deployed on an Amazon EKS cluster using Ray Serve. The model is deployed via data-on-eks which is a comprehensive resource for scaling your data and machine learning workloads on Amazon EKS and unlocking the power of Gen AI. Using data-on-eks, you can harness the capabilities of AWS Trainium, AWS Inferentia and NVIDIA GPUs to scale and optimize your Gen AI workloads and benchmark those models on FMBench with ease.

    "},{"location":"byoe.html","title":"Bring your own endpoint (a.k.a. support for external endpoints)","text":"

    If you have an endpoint deployed on say Amazon EKS or Amazon EC2 or have your models hosted on a fully-managed service such as Amazon Bedrock, you can still bring your endpoint to FMBench and run tests against your endpoint. To do this you need to do the following:

    1. Create a derived class from the FMBenchPredictor abstract class and provide implementations for the constructor, the get_predictions method and the endpoint_name property. See SageMakerPredictor for an example. Save this file locally as, say, my_custom_predictor.py (a minimal sketch is shown after this list).

    2. Upload your new Python file (my_custom_predictor.py) for your custom FMBench predictor to your FMBench read bucket and the scripts prefix specified in the s3_read_data section (read_bucket and scripts_prefix).

    3. Edit the configuration file you are using for your FMBench for the following:

      • Skip the deployment step by setting the 2_deploy_model.ipynb step under run_steps to no.
      • Set the inference_script under any experiment in the experiments section for which you want to use your new custom inference script to point to your new Python file (my_custom_predictor.py) that contains your custom predictor.
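    Referring back to step 1 above, here is a minimal, hedged sketch of a custom predictor. Only the FMBenchPredictor base class, the get_predictions method and the endpoint_name property come from the description above; the import path, constructor signature, payload fields and return format are assumptions that should be verified against the SageMakerPredictor example.

      # my_custom_predictor.py -- illustrative sketch only
      import json
      import requests  # assumes a plain REST endpoint, e.g. a model served on EC2/EKS

      from fmbench.scripts.fmbench_predictor import FMBenchPredictor  # assumed module path


      class MyCustomPredictor(FMBenchPredictor):
          def __init__(self, endpoint_name: str, inference_spec: dict, metadata: dict = None):
              # store whatever your endpoint needs; this parameter list is an assumption
              self._endpoint_name = endpoint_name
              self._inference_spec = inference_spec or {}

          def get_predictions(self, payload: dict) -> dict:
              # send the prompt to the endpoint and return the response;
              # the payload and response shapes are assumptions
              response = requests.post(self._endpoint_name,
                                       json={"inputs": payload.get("inputs")},
                                       timeout=300)
              response.raise_for_status()
              return json.loads(response.text)

          @property
          def endpoint_name(self) -> str:
              return self._endpoint_name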
    "},{"location":"ec2.html","title":"Run FMBench on Amazon EC2","text":"

    For some enterprise scenarios it might be desirable to run FMBench directly on an EC2 instance with no dependency on S3. Here are the steps to do this:

    1. Have a t3.xlarge (or larger) instance in the Running state. Make sure that the instance has at least 50GB of disk space, that the IAM role associated with your EC2 instance has the AmazonSageMakerFullAccess policy attached to it, and that sagemaker.amazonaws.com is added to its Trust relationships.

      {
          "Effect": "Allow",
          "Principal": {
              "Service": "sagemaker.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
      }

    2. Setup the fmbench_python311 conda environment. This step requires conda to be installed on the EC2 instance; see instructions for downloading Anaconda.

      conda create --name fmbench_python311 -y python=3.11 ipykernel
      source activate fmbench_python311;
      pip install -U fmbench
    3. Create local directory structure needed for FMBench and copy all publicly available dependencies from the AWS S3 bucket for FMBench. This is done by running the copy_s3_content.sh script available as part of the FMBench repo.

      curl -s https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/copy_s3_content.sh | sh
    4. Run FMBench with a quickstart config file.

      fmbench --config-file /tmp/fmbench-read/configs/llama2/7b/config-llama2-7b-g5-quick.yml --local-mode yes > fmbench.log 2>&1
    5. Open a new Terminal and navigate to the foundation-model-benchmarking-tool directory and do a tail on fmbench.log to see a live log of the run.

      tail -f fmbench.log
    6. All metrics are stored in the /tmp/fmbench-write directory created automatically by the fmbench package. Once the run completes all files are copied locally in a results-* folder as usual.

    "},{"location":"features.html","title":"FMBench features","text":"

    Support for Model Evaluations: FMBench now adds support for evaluating candidate models using Majority Voting with a Panel of LLM Evaluators. Customers can now use FMBench to evaluate model accuracy across open-source and custom datasets, thus FMBench now enables customers to not only measure performance (inference latency, cost, throughput) but also model accuracy.

    Native support for LLM compilation and deployment on AWS Silicon: FMBench now supports end-to-end compilation and model deployment on AWS Silicon. Customers no longer have to wait for models to be available for AWS Chips via SageMaker JumpStart, and neither do they have to go through the process of compiling the model to Neuron themselves; FMBench does it all for them. They can simply put the relevant configuration options in the FMBench config file and it will compile and deploy the model on SageMaker (config) or EC2 (config).

    Website for better user experience: FMBench has a website now along with an introduction video. The website is fully searchable to ease common tasks such as installation, finding the right config file, benchmarking on various hosting platforms (EC2, EKS, Bedrock, Neuron, Docker), model evaluation, etc. This website was created based on feedback from several internal teams and external customers.

    Native support for all AWS generative AI services: FMBench now benchmarks and evaluates any Foundation Model (FM) deployed on any AWS Generative AI service, be it Amazon SageMaker, Amazon Bedrock, Amazon EKS, or Amazon EC2. We initially built FMBench for SageMaker, and later extended it to Bedrock and then based on customer requests extended it to support models on EKS and EC2 as well. See list of config files supported out of the box, you can use these config files either as is or as templates for creating your own custom config.

    "},{"location":"gettingstarted.html","title":"Getting started with FMBench","text":"

    FMBench is available as a Python package on PyPI and is run as a command line tool once it is installed. All data, including metrics, reports and results, is stored in an Amazon S3 bucket.

    While technically you can run FMBench on any AWS compute, practically speaking we either run it on a SageMaker Notebook or on EC2. Both of these options are described below.

    Intro Video

    "},{"location":"gettingstarted.html#fmbench-in-a-client-server-configuration-on-amazon-ec2","title":"FMBench in a client-server configuration on Amazon EC2","text":"

    Oftentimes a platform team would like to have a set of LLM endpoints deployed in an account and available permanently for data science or application teams to benchmark performance and accuracy for their specific use-cases. They can take advantage of a special client-server configuration for FMBench, where FMBench is used to deploy models on EC2 instances in one AWS account (called the server account) and tests are run against these endpoints from FMBench deployed on EC2 instances in another AWS account (called the client account).

    This has the advantage that every team that wants to benchmark a set of LLMs does not first have to deploy the LLMs themselves; a platform team can do that for them and keep these LLMs available for a longer duration while the other teams run their benchmarks, for example against their specific datasets and for their specific cost and performance criteria. Using FMBench in this way makes the process simpler for both sides: the platform team can use FMBench to easily deploy the models with full control over the configuration of the serving stack without having to write any LLM deployment code for EC2, and the data science or application teams can test with different datasets, performance criteria and inference parameters. As long as the security groups have an inbound rule that allows access to the model endpoint (typically TCP port 8080), an FMBench installation in the client AWS account should be able to access an endpoint in the server AWS account.
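    For example, assuming the server account's endpoint instances share a security group, an inbound rule like the following (the security group ID and client CIDR are placeholders) would let the client account's FMBench installation reach the endpoints:

      aws ec2 authorize-security-group-ingress \
        --group-id sg-0123456789abcdef0 \
        --protocol tcp \
        --port 8080 \
        --cidr 10.0.0.0/16   # CIDR range used by the client account's FMBench instances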

    "},{"location":"manifest.html","title":"Files","text":"

    Here is a listing of the various configuration files available out-of-the-box with FMBench. Click on any link to view a file. You can use these files as-is or use them as templates to create a custom configuration file for your use-case of interest.

    bedrock \u251c\u2500\u2500 bedrock/config-bedrock-all-anthropic-models-longbench-data.yml \u251c\u2500\u2500 bedrock/config-bedrock-anthropic-models-OpenOrca.yml \u251c\u2500\u2500 bedrock/config-bedrock-claude.yml \u251c\u2500\u2500 bedrock/config-bedrock-evals-only-conc-1.yml \u251c\u2500\u2500 bedrock/config-bedrock-haiku-sonnet-majority-voting.yml \u251c\u2500\u2500 bedrock/config-bedrock-llama3-1-70b-streaming.yml \u251c\u2500\u2500 bedrock/config-bedrock-llama3-1-8b-streaming.yml \u251c\u2500\u2500 bedrock/config-bedrock-llama3-1-no-streaming.yml \u251c\u2500\u2500 bedrock/config-bedrock-llama3-streaming.yml \u251c\u2500\u2500 bedrock/config-bedrock-models-OpenOrca.yml \u251c\u2500\u2500 bedrock/config-bedrock-titan-text-express.yml \u2514\u2500\u2500 bedrock/config-bedrock.yml bert \u2514\u2500\u2500 bert/config-distilbert-base-uncased.yml byoe \u2514\u2500\u2500 byoe/config-model-byo-sagemaker-endpoint.yml eks_manifests \u251c\u2500\u2500 eks_manifests/llama3-ray-service.yaml \u2514\u2500\u2500 eks_manifests/mistral-ray-service.yaml gemma \u2514\u2500\u2500 gemma/config-gemma-2b-g5.yml llama2 \u251c\u2500\u2500 llama2/13b \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama2/13b/config-bedrock-sagemaker-llama2.yml \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama2/13b/config-byo-rest-ep-llama2-13b.yml \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama2/13b/config-llama2-13b-inf2-g5-p4d.yml \u2502\u00a0\u00a0 \u2514\u2500\u2500 llama2/13b/config-llama2-13b-inf2-g5.yml \u251c\u2500\u2500 llama2/70b \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama2/70b/config-ec2-llama2-70b.yml \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama2/70b/config-llama2-70b-g5-p4d-tgi.yml \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama2/70b/config-llama2-70b-g5-p4d-trt.yml \u2502\u00a0\u00a0 \u2514\u2500\u2500 llama2/70b/config-llama2-70b-inf2-g5.yml \u2514\u2500\u2500 llama2/7b \u251c\u2500\u2500 llama2/7b/config-llama2-7b-byo-sagemaker-endpoint.yml \u251c\u2500\u2500 llama2/7b/config-llama2-7b-g4dn-g5-trt.yml \u251c\u2500\u2500 llama2/7b/config-llama2-7b-g5-no-s3-quick.yml \u251c\u2500\u2500 llama2/7b/config-llama2-7b-g5-quick.yml \u2514\u2500\u2500 llama2/7b/config-llama2-7b-inf2-g5.yml llama3 \u251c\u2500\u2500 llama3/70b \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama3/70b/config-bedrock.yml -> ../../bedrock/config-bedrock.yml \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama3/70b/config-ec2-llama3-70b-instruct.yml \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama3/70b/config-ec2-neuron-llama3-70b-inf2-48xl-deploy-sm.yml \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama3/70b/config-llama3-70b-instruct-g5-48xl.yml \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama3/70b/config-llama3-70b-instruct-g5-p4d.yml \u2502\u00a0\u00a0 \u2514\u2500\u2500 llama3/70b/config-llama3-70b-instruct-p4d.yml \u2514\u2500\u2500 llama3/8b \u251c\u2500\u2500 llama3/8b/config-bedrock.yml \u251c\u2500\u2500 llama3/8b/config-ec2-llama3-8b-c5-18xlarge.yml \u251c\u2500\u2500 llama3/8b/config-ec2-llama3-8b-inf2-48xl.yml \u251c\u2500\u2500 llama3/8b/config-ec2-llama3-8b-m5-16xlarge.yml \u251c\u2500\u2500 llama3/8b/config-ec2-llama3-8b-m7a-16xlarge.yml \u251c\u2500\u2500 llama3/8b/config-ec2-llama3-8b-m7a-24xlarge.yml \u251c\u2500\u2500 llama3/8b/config-ec2-llama3-8b-m7i-12xlarge.yml \u251c\u2500\u2500 llama3/8b/config-ec2-llama3-8b-neuron-trn1-32xl-tp16-sm.yml \u251c\u2500\u2500 config-llama3-8b-trn1-32xl-tp16-bs-4-ec2.yml \u251c\u2500\u2500 llama3/8b/config-ec2-llama3-8b.yml \u251c\u2500\u2500 llama3/8b/config-ec2-neuron-llama3-8b-inf2-24xl-deploy-sm.yml \u251c\u2500\u2500 
llama3/8b/config-ec2-neuron-llama3-8b-inf2-48xl-deploy-sm.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-eks-inf2.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-g5-streaming.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-g5.12xl-tp-2-mc-max-djl-ec2.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-g5.12xl-tp-2-mc-max-triton-ec2.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-g5.12xl-tp-4-mc-max-djl-ec2.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-g5.12xl-tp-4-mc-max-triton-ec2.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-inf2-24xl-tp=8-bs=4-byoe.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-inf2-48xl-tp=8-bs=4-byoe.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-inf2-g5-byoe-w-openorca.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-inf2-g5.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-all.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-g5-12xl-4-instances.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-g5-12xl.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-g5-24xl.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-g5-2xl.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-g5-48xl.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-g5-p4d.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-g6-12xl.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-g6-24xl.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-g6-48xl.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-p4d-djl-lmi-dist.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-p4d-djl-vllm.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-p5-djl-lmi-dist.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-trn1-32xl-tp-16-bs-4-byoe.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-trn1-32xl-tp-8-bs-4-byoe.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-trn1-32xlarge-triton-vllm.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-trn1.yml \u251c\u2500\u2500 llama3/8b/llama3-8b-inf2-24xl-byoe-g5-12xl.yml \u251c\u2500\u2500 llama3/8b/llama3-8b-inf2-48xl-byoe-g5-24xl.yml \u2514\u2500\u2500 llama3/8b/llama3-8b-trn1-32xl-byoe-g5-24xl.yml llama3.1 \u251c\u2500\u2500 llama3.1/70b \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama3.1/70b/config-ec2-llama3-1-70b-inf2-48xl-deploy-ec2.yml \u2502\u00a0\u00a0 \u2514\u2500\u2500 llama3.1/70b/config-ec2-llama3-1-70b-inf2-deploy-sm.yml \u2514\u2500\u2500 llama3.1/8b \u251c\u2500\u2500 llama3.1/8b/client-config-ec2-llama3-1-8b.yml \u251c\u2500\u2500 llama3.1/8b/config-ec2-llama3-1-8b-inf2-48xl-deploy-ec2-tp24-bs12.yml \u251c\u2500\u2500 llama3.1/8b/config-ec2-llama3-1-8b-inf2-48xl-deploy-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-ec2-llama3-1-8b-inf2.yml \u251c\u2500\u2500 llama3.1/8b/config-ec2-llama3-1-8b-p4-tp-2-mc-max.yml \u251c\u2500\u2500 llama3.1/8b/config-ec2-llama3-1-8b-p4-tp-4-mc-max.yml \u251c\u2500\u2500 llama3.1/8b/config-ec2-llama3-1-8b-p4-tp-8-mc-max.yml \u251c\u2500\u2500 llama3.1/8b/config-ec2-llama3-1-8b-p5-tp-2-mc-max.yml \u251c\u2500\u2500 llama3.1/8b/config-ec2-llama3-1-8b-tp-8-mc-auto-p5.yml \u251c\u2500\u2500 llama3.1/8b/config-ec2-llama3-1-8b-trn1-32xl-deploy-ec2-tp32-bs8.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.12xl-tp-2-mc-auto-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.12xl-tp-2-mc-max-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.12xl-tp-4-mc-max-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.24xl-tp-2-mc-auto-ec2.yml 
\u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.24xl-tp-2-mc-max-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.24xl-tp-4-mc-max-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.2xl-tp-1-mc-max-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.48xl-tp-2-mc-auto-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.48xl-tp-2-mc-max-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.48xl-tp-4-mc-max-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.48xl-tp-8-mc-max-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-inf2-48xl-deploy-tp-24-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-inf2-48xl-deploy-tp-8-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-trn1-32xl-deploy-tp-8-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-trn32xl-triton-vllm.yml \u2514\u2500\u2500 llama3.1/8b/server-config-ec2-llama3-1-8b-inf2-48xl-deploy-ec2.yml mistral \u251c\u2500\u2500 mistral/config-mistral-7b-eks-inf2.yml \u251c\u2500\u2500 mistral/config-mistral-7b-tgi-g5.yml \u251c\u2500\u2500 mistral/config-mistral-7b-trn1-32xl-triton.yml \u251c\u2500\u2500 mistral/config-mistral-instruct-AWQ-p4d.yml \u251c\u2500\u2500 mistral/config-mistral-instruct-AWQ-p5-byo-ep.yml \u251c\u2500\u2500 mistral/config-mistral-instruct-AWQ-p5.yml \u251c\u2500\u2500 mistral/config-mistral-instruct-p4d.yml \u251c\u2500\u2500 mistral/config-mistral-instruct-v1-p5-trtllm.yml \u251c\u2500\u2500 mistral/config-mistral-instruct-v2-p4d-lmi-dist.yml \u251c\u2500\u2500 mistral/config-mistral-instruct-v2-p4d-trtllm.yml \u251c\u2500\u2500 mistral/config-mistral-instruct-v2-p5-lmi-dist.yml \u251c\u2500\u2500 mistral/config-mistral-instruct-v2-p5-trtllm.yml \u251c\u2500\u2500 mistral/config-mistral-trn1-32xl-deploy-ec2-tp32.yml \u2514\u2500\u2500 mistral/config-mistral-v3-inf2-48xl-deploy-ec2-tp24.yml model_eval_all_info.yml phi \u2514\u2500\u2500 phi/config-phi-3-g5.yml pricing.yml

    "},{"location":"mm_copies.html","title":"Running multiple model copies on Amazon EC2","text":"

    It is possible to run multiple copies of a model if the tensor parallelism degree and the number of GPUs/Neuron cores on the instance allow it. For example, if a model fits into 2 GPU devices and there are 8 devices available, then we could run 4 copies of the model on that instance. Some inference containers, such as the DJL Serving LMI, automatically start multiple copies of the model within the same inference container for the scenario described above. However, it is also possible to do this ourselves by running multiple containers and a load balancer through a Docker compose file. FMBench now supports this functionality through a single parameter called model_copies in the configuration file.

    For example, here is a snippet from the config-ec2-llama3-1-8b-p4-tp-2-mc-max config file. The new parameters are model_copies, tp_degree and shm_size in the inference_spec section. Note that the tp_degree in the inference_spec and option.tensor_parallel_degree in the serving.properties section should be set to the same value.

      inference_spec:
        # this should match one of the sections in the inference_parameters section above
        parameter_set: ec2_djl
        # how many copies of the model, "1", "2",..max
        # set to 1 in the code if not configured,
        # max: FMBench figures out the max number of model containers to be run
        #      based on TP degree configured and number of neuron cores/GPUs available.
        #      For example, if TP=2, GPUs=8 then FMBench will start 4 containers and 1 load balancer,
        # auto: only supported if the underlying inference container would automatically
        #       start multiple copies of the model internally based on TP degree and neuron cores/GPUs
        #       available. In this case only a single container is created, no load balancer is created.
        #       The DJL serving containers supports auto.
        model_copies: max
        # if you set the model_copies parameter then it is mandatory to set the
        # tp_degree, shm_size, model_loading_timeout parameters
        tp_degree: 2
        shm_size: 12g
        model_loading_timeout: 2400
      # modify the serving properties to match your model and requirements
      serving.properties: |
        engine=MPI
        option.tensor_parallel_degree=2
        option.max_rolling_batch_size=256
        option.model_id=meta-llama/Meta-Llama-3.1-8B-Instruct
        option.rolling_batch=lmi-dist
    "},{"location":"mm_copies.html#considerations-while-setting-the-model_copies-parameter","title":"Considerations while setting the model_copies parameter","text":"
    1. The model_copies parameter is an EC2 only parameter, which means that you cannot use it when deploying models on SageMaker for example.

    2. If you are looking for the best (lowest) inference latency, then you might get better results by setting tp_degree and option.tensor_parallel_degree to the total number of GPUs/Neuron cores available on your EC2 instance and model_copies to max, auto or 1. In other words, the model is sharded across all accelerators and only one copy of the model can run on that instance (therefore setting model_copies to max, auto or 1 all result in the same thing, i.e. a single copy of the model running on that EC2 instance).

    3. If you are looking for the best (highest) transaction throughput while keeping the inference latency within a given latency budget, then you might want to configure tp_degree and option.tensor_parallel_degree to the least number of GPUs/Neuron cores on which the model can run (for example, for Llama3.1-8b that would be 2 GPUs or 4 Neuron cores) and set model_copies to max. Let us understand this with an example: say you want to run Llama3.1-8b on a p4de.24xlarge instance type; if you set tp_degree and option.tensor_parallel_degree to 2 and model_copies to max, FMBench will start 4 containers (as the p4de.24xlarge has 8 GPUs) and an Nginx load balancer that will round-robin the incoming requests to these 4 containers. In the case of the DJL serving LMI you can achieve similar results by setting model_copies to auto, in which case FMBench will start a single container (and no load balancer, since there is only one container) and the DJL serving container will internally start 4 copies of the model within the same container and route the requests to these copies internally. Theoretically you should expect the same performance, but in our testing we have seen better performance with model_copies set to max and an external (Nginx) container doing the load balancing. The two settings are contrasted below.
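    To make the two strategies above concrete, here is an illustrative (not exhaustive) pair of settings for the Llama3.1-8b on p4de.24xlarge example; only the relevant keys are shown and everything else in the config file stays the same.

      # throughput-oriented: shard across 2 GPUs, let FMBench run 4 copies behind an Nginx load balancer
      inference_spec:
        model_copies: max
        tp_degree: 2
        shm_size: 12g
        model_loading_timeout: 2400
      serving.properties: |
        option.tensor_parallel_degree=2

    versus

      # latency-oriented: shard the model across all 8 GPUs, a single copy of the model
      inference_spec:
        model_copies: 1
        tp_degree: 8
        shm_size: 12g
        model_loading_timeout: 2400
      serving.properties: |
        option.tensor_parallel_degree=8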

    "},{"location":"neuron.html","title":"Benchmark foundation models for AWS Chips","text":"

    You can use FMBench for benchmarking foundation models on AWS Chips: Trainium 1 and Inferentia 2. This can be done on Amazon SageMaker, Amazon EKS or Amazon EC2. FMs need to be compiled for Neuron before they can be deployed on AWS Chips; this is made easier by SageMaker JumpStart, which provides most of the FMs as JumpStart Models that can be deployed on SageMaker directly. You can also compile models for Neuron yourself or have FMBench do it for you. All of these options are described below.

    "},{"location":"neuron.html#benchmarking-for-aws-chips-on-sagemaker","title":"Benchmarking for AWS Chips on SageMaker","text":"
    1. Several FMs are available through SageMaker JumpStart already compiled for Neuron and ready to deploy. See this link for more details.

    2. You can compile the model outside of FMBench using instructions available here and on the Neuron documentation, deploy on SageMaker and use FMBench in the bring your own endpoint mode, see this config file for an example.

    3. You can have FMBench compile and deploy the model on SageMaker for you. See this Llama3-8b config file or this Llama3.1-70b config file for examples; search this website for "inf2" or "trn1" to find others. In this case FMBench will download the model from Hugging Face (you need to provide your HuggingFace token in the /tmp/fmbench-read/scripts/hf_token.txt file; the file simply contains the token without any formatting), compile it for Neuron, upload the compiled model to S3 (you specify the bucket in the config file) and then deploy the model to a SageMaker endpoint.

    "},{"location":"neuron.html#benchmarking-for-aws-chips-on-ec2","title":"Benchmarking for AWS Chips on EC2","text":"

    You may want to benchmark models hosted directly on EC2. In this case both FMBench and the model run on the same EC2 instance, and FMBench deploys the model for you on that instance. See this Llama3.1-70b file or this Llama3-8b file for examples. In this case FMBench will download the model from Hugging Face (you need to provide your HuggingFace token in the /tmp/fmbench-read/scripts/hf_token.txt file; the file simply contains the token without any formatting), pull the inference container from the ECR repo and then run the container with the downloaded model; a local endpoint is exposed that FMBench then uses to run inference.

    "},{"location":"quickstart.html","title":"Quickstart - run FMBench on SageMaker Notebook","text":"
    1. Each FMBench run works with a configuration file that contains the information about the model, the deployment steps, and the tests to run. A typical FMBench workflow involves either directly using an already provided config file from the configs folder in the FMBench GitHub repo or editing an already provided config file as per your own requirements (say you want to try benchmarking on a different instance type, or a different inference container etc.).

      👉 A simple config file with key parameters annotated is included in this repo, see config-llama2-7b-g5-quick.yml. This file benchmarks performance of Llama2-7b on an ml.g5.xlarge instance and an ml.g5.2xlarge instance. You can use this config file as it is for this Quickstart.

    2. Launch the AWS CloudFormation template included in this repository using one of the buttons from the table below. The CloudFormation template creates the following resources within your AWS account: Amazon S3 buckets, an AWS IAM role and an Amazon SageMaker Notebook with this repository cloned. A read S3 bucket is created which contains all the files (configuration files, datasets) required to run FMBench, and a write S3 bucket is created which will hold the metrics and reports generated by FMBench. The CloudFormation stack takes about 5 minutes to create.

    AWS Region / Link: us-east-1 (N. Virginia), us-west-2 (Oregon), us-gov-west-1 (GovCloud West)
    1. Once the CloudFormation stack is created, navigate to SageMaker Notebooks and open the fmbench-notebook.

    2. On the fmbench-notebook open a Terminal and run the following commands.

      conda create --name fmbench_python311 -y python=3.11 ipykernel
      source activate fmbench_python311;
      pip install -U fmbench

    3. Now you are ready to run fmbench with the following command line. We will use a sample config file placed in the S3 bucket by the CloudFormation stack for a quick first run.

      1. We benchmark performance for the Llama2-7b model on a ml.g5.xlarge and a ml.g5.2xlarge instance type, using the huggingface-pytorch-tgi-inference inference container. This test would take about 30 minutes to complete and cost about $0.20.

      2. It uses a simple relationship of 750 words being equal to 1000 tokens; to get a more accurate representation of token counts use the Llama2 tokenizer (instructions are provided in the next section). It is strongly recommended that for more accurate results on token throughput you use a tokenizer specific to the model you are testing rather than the default tokenizer. See the instructions provided later in this document on how to use a custom tokenizer, and the short sketch at the end of this section.

        account=`aws sts get-caller-identity | jq .Account | tr -d '"'`
        region=`aws configure get region`
        fmbench --config-file s3://sagemaker-fmbench-read-${region}-${account}/configs/llama2/7b/config-llama2-7b-g5-quick.yml > fmbench.log 2>&1
      3. Open another terminal window and do a tail -f on the fmbench.log file to see all the traces being generated at runtime.

        tail -f fmbench.log
      4. 👉 For streaming support on SageMaker and Bedrock check out these config files:

        1. config-llama3-8b-g5-streaming.yml
        2. config-bedrock-llama3-streaming.yml
    4. The generated reports and metrics are available in the sagemaker-fmbench-write-<replace_w_your_aws_region>-<replace_w_your_aws_account_id> bucket. The metrics and report files are also downloaded locally into the results directory (created by FMBench), and the benchmarking report is available as a markdown file called report.md in that directory. You can view the rendered Markdown report in the SageMaker notebook itself or download the metrics and report files to your machine for offline analysis.
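    As referenced in step 3 above, here is a hedged sketch of how you could compare the default 750-words-per-1000-tokens heuristic with a model-specific tokenizer (the model id below is just an example of a publicly available tokenizer; use the tokenizer files for the model you are actually benchmarking):

      from transformers import AutoTokenizer

      # any tokenizer you have access to works the same way
      tokenizer = AutoTokenizer.from_pretrained("NousResearch/Llama-2-13b-chat-hf")
      prompt = "Summarize the following passage in two sentences: ..."
      n_words = len(prompt.split())
      print(f"heuristic estimate : {int(n_words * 1000 / 750)} tokens")
      print(f"tokenizer count    : {len(tokenizer.encode(prompt))} tokens")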

    "},{"location":"quickstart.html#fmbench-on-govcloud","title":"FMBench on GovCloud","text":"

    No special steps are required for running FMBench on GovCloud. The CloudFormation link for us-gov-west-1 has been provided in the section above.

    1. Not all models available via Bedrock or other services may be available in GovCloud. The following commands show how to run FMBench to benchmark the Amazon Titan Text Express model in the GovCloud. See the Amazon Bedrock GovCloud page for more details.

      account=`aws sts get-caller-identity | jq .Account | tr -d '"'`
      region=`aws configure get region`
      fmbench --config-file s3://sagemaker-fmbench-read-${region}-${account}/configs/bedrock/config-bedrock-titan-text-express.yml > fmbench.log 2>&1
    "},{"location":"releases.html","title":"Releases","text":""},{"location":"releases.html#207","title":"2.0.7","text":"
    1. Support Triton-TensorRT for GPU instances and Triton-vllm for AWS Chips.
    2. Misc. bug fixes.
    "},{"location":"releases.html#206","title":"2.0.6","text":"
    1. Run multiple model copies with the DJL serving container and an Nginx load balancer on Amazon EC2.
    2. Config files for Llama3.1-8b on g5, p4de and p5 Amazon EC2 instance types.
    3. Better analytics for creating internal leaderboards.
    "},{"location":"releases.html#205","title":"2.0.5","text":"
    1. Support for Intel CPU based instances such as c5.18xlarge and m5.16xlarge.
    "},{"location":"releases.html#204","title":"2.0.4","text":"
    1. Support for AMD CPU based instances such as m7a.
    "},{"location":"releases.html#203","title":"2.0.3","text":"
    1. Support for an EFA directory for benchmarking on EC2.
    "},{"location":"releases.html#202","title":"2.0.2","text":"
    1. Code cleanup, minor bug fixes and report improvements.
    "},{"location":"releases.html#200","title":"2.0.0","text":"
    1. 🚨 Model evaluations done by a Panel of LLM Evaluators 🚨
    "},{"location":"releases.html#v1052","title":"v1.0.52","text":"
    1. Compile for AWS Chips (Trainium, Inferentia) and deploy to SageMaker directly through FMBench.
    2. Llama3.1-8b and Llama3.1-70b config files for AWS Chips (Trainium, Inferentia).
    3. Misc. bug fixes.
    "},{"location":"releases.html#v1051","title":"v1.0.51","text":"
    1. FMBench has a website now. Rework the README file to make it lightweight.
    2. Llama3.1 config files for Bedrock.
    "},{"location":"releases.html#v1050","title":"v1.0.50","text":"
    1. Llama3-8b on Amazon EC2 inf2.48xlarge config file.
    2. Update to new version of DJL LMI (0.28.0).
    "},{"location":"releases.html#v1049","title":"v1.0.49","text":"
    1. Streaming support for Amazon SageMaker and Amazon Bedrock.
    2. Per-token latency metrics such as time to first token (TTFT) and mean time per-output token (TPOT).
    3. Misc. bug fixes.
    "},{"location":"releases.html#v1048","title":"v1.0.48","text":"
    1. Faster result file download at the end of a test run.
    2. Phi-3-mini-4k-instruct configuration file.
    3. Tokenizer and misc. bug fixes.
    "},{"location":"releases.html#v1047","title":"v1.0.47","text":"
    1. Run FMBench as a Docker container.
    2. Bug fixes for GovCloud support.
    3. Updated README for EKS cluster creation.
    "},{"location":"releases.html#v1046","title":"v1.0.46","text":"
    1. Native model deployment support for EC2 and EKS (i.e. you can now deploy and benchmark models on EC2 and EKS).
    2. FMBench is now available in GovCloud.
    3. Update to latest version of several packages.
    "},{"location":"releases.html#v1045","title":"v1.0.45","text":"
    1. Analytics for results across multiple runs.
    2. Llama3-70b config files for g5.48xlarge instances.
    "},{"location":"releases.html#v1044","title":"v1.0.44","text":"
    1. Endpoint metrics (CPU/GPU utilization, memory utilization, model latency) and invocation metrics (including errors) for SageMaker Endpoints.
    2. Llama3-8b config files for g6 instances.
    "},{"location":"releases.html#v1042","title":"v1.0.42","text":"
    1. Config file for running Llama3-8b on all instance types except p5.
    2. Fix bug with business summary chart.
    3. Fix bug with deploying model using a DJL DeepSpeed container in the no S3 dependency mode.
    "},{"location":"releases.html#v1040","title":"v1.0.40","text":"
    1. Make it easy to run FMBench on Amazon EC2 without any dependency on Amazon S3.
    "},{"location":"releases.html#v1039","title":"v1.0.39","text":"
    1. Add an internal FMBench website.
    "},{"location":"releases.html#v1038","title":"v1.0.38","text":"
    1. Support for running FMBench on Amazon EC2 without any dependency on Amazon S3.
    2. Llama3-8b-Instruct config file for ml.p5.48xlarge.
    "},{"location":"releases.html#v1037","title":"v1.0.37","text":"
    1. g5/p4d/inf2/trn1 specific config files for Llama3-8b-Instruct.
      1. p4d config file for both vllm and lmi-dist.
    "},{"location":"releases.html#v1036","title":"v1.0.36","text":"
    1. Fix bug at higher concurrency levels (20 and above).
    2. Support for instance count > 1.
    "},{"location":"releases.html#v1035","title":"v1.0.35","text":"
    1. Support for Open-Orca dataset and corresponding prompts for Llama3, Llama2 and Mistral.
    "},{"location":"releases.html#v1034","title":"v1.0.34","text":"
    1. Don't delete endpoints for the bring your own endpoint case.
    2. Fix bug with business summary chart.
    "},{"location":"releases.html#v1032","title":"v1.0.32","text":"
    1. Report enhancements: New business summary chart, config file embedded in the report, version numbering and others.

    2. Additional config files: Meta Llama3 on Inf2, Mistral instruct with lmi-dist on p4d and p5 instances.

    "},{"location":"resources.html","title":"Resources","text":""},{"location":"resources.html#pending-enhancements","title":"Pending enhancements","text":"

    View the ISSUES on GitHub and add any that you think would be a beneficial addition to this benchmarking harness.

    "},{"location":"resources.html#security","title":"Security","text":"

    See CONTRIBUTING for more information.

    "},{"location":"resources.html#license","title":"License","text":"

    This library is licensed under the MIT-0 License. See the LICENSE file.

    "},{"location":"results.html","title":"Results","text":"

    Depending upon the experiments in the config file, the FMBench run may take a few minutes to several hours. Once the run completes, you can find the report and metrics in the local results-* folder in the directory from which FMBench was run. The report and metrics are also written to the write S3 bucket set in the config file.

    Here is a screenshot of the report.md file generated by FMBench.

    "},{"location":"run_as_container.html","title":"Run FMBench as a Docker container","text":"

    You can now run FMBench on any platform where you can run a Docker container, for example on an EC2 VM, SageMaker Notebook etc. The advantage is that you do not have to install anything locally, so no conda installs needed anymore. Here are the steps to do that.

    1. Create local directory structure needed for FMBench and copy all publicly available dependencies from the AWS S3 bucket for FMBench. This is done by running the copy_s3_content.sh script available as part of the FMBench repo. You can place model specific tokenizers and any new configuration files you create in the /tmp/fmbench-read directory that is created after running the following command.

      curl -s https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/copy_s3_content.sh | sh
    2. That's it! You are now ready to run the container.

      # set the config file path to point to the config file of interest
      CONFIG_FILE=https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/src/fmbench/configs/llama2/7b/config-llama2-7b-g5-quick.yml
      docker run -v $(pwd)/fmbench:/app \
        -v /tmp/fmbench-read:/tmp/fmbench-read \
        -v /tmp/fmbench-write:/tmp/fmbench-write \
        aarora79/fmbench:v1.0.47 \
       "fmbench --config-file ${CONFIG_FILE} --local-mode yes --write-bucket placeholder > fmbench.log 2>&1"
    3. The above command will create an fmbench directory inside the current working directory. This directory contains the fmbench.log file and the results-* folder that is created once the run finishes.

    "},{"location":"website.html","title":"Create a website for FMBench reports","text":"

    When you use FMBench as a tool for benchmarking your foundation models you would soon want to have an easy way to view all the reports in one place and search through the results, for example, \"Llama3.1-8b results on trn1.32xlarge\". An FMBench website provides a simple way of viewing these results.

    Here are the steps to setup a website using mkdocs and nginx. The steps below generate a self-signed certificate for SSL and use username and password for authentication. It is strongly recommended that you use a valid SSL cert and a better authentication mechanism than username and password for your FMBench website.

    1. Start an Amazon EC2 machine which will host the FMBench website. A t3.xlarge machine with an Ubuntu AMI say ubuntu/images/hvm-ssd-gp3/ubuntu-noble-24.04-amd64-server-20240801 and 50GB storage is good enough. Allow SSH and TCP port 443 traffic from anywhere into that machine.

    2. SSH into that machine and install conda.

      wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
      bash Miniconda3-latest-Linux-x86_64.sh -b  # Run the Miniconda installer in batch mode (no manual intervention)
      rm -f Miniconda3-latest-Linux-x86_64.sh    # Remove the installer script after installation
      eval "$(/home/$USER/miniconda3/bin/conda shell.bash hook)" # Initialize conda for bash shell
      conda init  # Initialize conda, adding it to the shell
    3. Install docker-compose.

      sudo apt-get update
      sudo apt-get install --reinstall docker.io -y
      sudo apt-get install -y docker-compose
      sudo usermod -a -G docker $USER
      newgrp docker
      docker compose version
    4. Setup the fmbench_python311 conda environment and clone FMBench repo.

      conda create --name fmbench_python311 -y python=3.11 ipykernel
      source activate fmbench_python311
      pip install -U fmbench mkdocs mkdocs-material mknotebooks
      git clone https://github.com/aws-samples/foundation-model-benchmarking-tool.git
    5. Get the FMBench results data from Amazon S3 or whichever storage system you used to store all the results.

      curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
      sudo apt-get install unzip -y
      unzip awscliv2.zip
      sudo ./aws/install
      FMBENCH_S3_BUCKET=your-fmbench-s3-bucket-name-here
      aws s3 sync s3://$FMBENCH_S3_BUCKET $HOME/fmbench_data --exclude "*.json"
    6. Create a directory for the FMBench website contents.

      mkdir $HOME/fmbench_site
      mkdir $HOME/fmbench_site/ssl
      1. Setup SSL certs (we strongly encourage you to not use self-signed certs, this step here is just for demo purposes, get SSL certs the same way you get them for your current production workloads).

      sudo openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout $HOME/fmbench_site/ssl/nginx-selfsigned.key -out $HOME/fmbench_site/ssl/nginx-selfsigned.crt
    7. Create an .htpasswd file. The FMBench website will use fmbench_admin as the username and the password that you enter as part of the command below to allow login to the website.

      sudo apt-get install apache2-utils -y
      htpasswd -c $HOME/fmbench_site/.htpasswd fmbench_admin
    8. Create the mkdocs.yml file for the website.

      cd foundation-model-benchmarking-tool
      cp website/index.md $HOME/fmbench_data/
      cp -r img $HOME/fmbench_data/
      python website/create_fmbench_website.py
      mkdocs build -f website/mkdocs.yml --site-dir $HOME/fmbench_site/site
    9. Update the nginx.conf file. Note the hostname that is printed out below; the FMBench website will be served at this address.

      TOKEN=`curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600"`
      HOSTNAME=`curl -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/public-hostname`
      echo "hostname is: $HOSTNAME"
      sed "s/__HOSTNAME__/$HOSTNAME/g" website/nginx.conf.template > $HOME/fmbench_site/nginx.conf
    10. Serve the website.

      docker run --name fmbench-nginx -d -p 80:80 -p 443:443 \
        -v $HOME/fmbench_site/site:/usr/share/nginx/html \
        -v $HOME/fmbench_site/nginx.conf:/etc/nginx/nginx.conf \
        -v $HOME/fmbench_site/ssl:/etc/nginx/ssl \
        -v $HOME/fmbench_site/.htpasswd:/etc/nginx/.htpasswd \
        nginx
    11. Open a web browser and navigate to the hostname you noted in the step above, for example https://<your-ec2-hostname>.us-west-2.compute.amazonaws.com. Ignore the security warnings if you used a self-signed SSL cert (replace this with a cert that you would normally use on your production websites), and then enter the username and password (the username is fmbench_admin and the password is what you set when running the htpasswd command). You should see a website as shown in the screenshot below.

    "},{"location":"workflow.html","title":"Workflow for FMBench","text":"

    The workflow for FMBench is as follows:

    Create configuration file
        |
        |-----> Deploy model on SageMaker/Use models on Bedrock/Bring your own endpoint
                    |
                    |-----> Run inference against deployed endpoint(s)
                                     |
                                     |------> Create a benchmarking report
    1. Create a dataset of different prompt sizes and select one or more such datasets for running the tests.

      1. Currently FMBench supports datasets from LongBench and filters out individual items from the dataset based on their size in tokens (for example, prompts less than 500 tokens, between 500 and 1000 tokens, and so on). Alternatively, you can download the folder from this link to load the data.
    2. Deploy any model that is deployable on SageMaker on any supported instance type (g5, p4d, Inf2).

      1. Models can either be available via SageMaker JumpStart (list available here) or not available via JumpStart but still deployable on SageMaker through the low-level boto3 (Python) SDK (Bring Your Own Script).
      2. Model deployment is completely configurable in terms of the inference container to use, environment variables to set, serving.properties file to provide (for inference containers such as DJL that use it) and instance type to use.
    3. Benchmark FM performance in terms of inference latency, transactions per minute and dollar cost per transaction for any FM that can be deployed on SageMaker.

      1. Tests are run for each combination of the configured concurrency levels (i.e. the number of transactions, or inference requests, sent to the endpoint in parallel) and datasets. For example, run multiple datasets of, say, prompt sizes between 3000 and 4000 tokens at concurrency levels of 1, 2, 4, 6, 8 etc. to test how many transactions of what token length the endpoint can handle while still maintaining an acceptable level of inference latency.
    4. Generate a report that compares and contrasts the performance of the model over different test configurations and stores the reports in an Amazon S3 bucket.

      1. The report is generated in Markdown format and consists of plots, tables and text that highlight the key results and provide an overall recommendation on the best combination of instance type and serving stack to use for the model under test for a dataset of interest.
      2. The report is created as an artifact of reproducible research so that anyone having access to the model, instance type and serving stack can run the code and recreate the same results and report.
    5. Multiple configuration files that can be used as reference for benchmarking new models and instance types.

    "},{"location":"misc/ec2_instance_creation_steps.html","title":"Create an EC2 instance suitable for an LMI (Large Model Inference)","text":"

    Follow the steps below to create an EC2 instance for hosting a model in an LMI.

    1. On the homepage of the AWS Console go to 'EC2' - it is likely in recently visited:

    2. If not found, go to the search bar on the top of the page. Type ec2 into the search box and click the entry that pops up with the name EC2:

    3. Click "Instances":

    4. Click "Launch Instances":

    5. Type in a name for your instance (recommended to include your alias in the name), and then scroll down. Search for 'deep learning ami' in the box. Select the one that says Deep Learning OSS Nvidia Driver AMI GPU PyTorch for a GPU instance type, or select Deep Learning AMI Neuron (Ubuntu 22.04) for an Inferentia/Trainium instance type. Your version number might be different.

    6. Name your instance FMBenchInstance.

    7. Add a fmbench-version tag to your instance.

    8. Scroll down to Instance Type. For large model inference, the g5.12xlarge is recommended.

    1. Make a key pair by clicking Create new key pair. Give it a name, keep all settings as is, and then click "Create key pair".

    2. Skip over Network settings (leave it as it is), going straight to Configure storage. 45 GB, the suggested amount, is not nearly enough, and using that will cause the LMI docker container to download for an arbitrarily long time and then error out. Change it to 100 GB or more:

    3. Create an IAM role for your instance called FMBenchEC2Role. Attach the following permission policies: AmazonSageMakerFullAccess, AmazonBedrockFullAccess.

      Edit the trust policy to be the following:

      {
          "Version": "2012-10-17",
          "Statement": [
              {
                  "Effect": "Allow",
                  "Principal": {
                      "Service": "ec2.amazonaws.com"
                  },
                  "Action": "sts:AssumeRole"
              },
              {
                  "Effect": "Allow",
                  "Principal": {
                      "Service": "sagemaker.amazonaws.com"
                  },
                  "Action": "sts:AssumeRole"
              },
              {
                  "Effect": "Allow",
                  "Principal": {
                      "Service": "bedrock.amazonaws.com"
                  },
                  "Action": "sts:AssumeRole"
              }
          ]
      }
      Select this role in the IAM instance profile setting of your instance.

    4. Then, we're done with the settings of the instance. Click Launch Instance to finish. You can connect to your EC2 instance using any of these options.

    "},{"location":"misc/eks_cluster-creation_steps.html","title":"EKS cluster creation steps","text":"

    The steps below create an EKS cluster called trainium-inferentia.

    1. Before we begin, ensure you have all the prerequisites in place to make the deployment process smooth and hassle-free. Ensure that you have installed the following tools on your machine: aws-cli, kubectl and terraform. We use the DoEKS repository as a guide to deploy the cluster infrastructure in an AWS account.

    2. Ensure that your account has enough Inf2 on-demand vCPUs, as most of the DoEKS blueprints utilize this specific instance. To increase the service quota navigate to the Service Quotas page for the region you are in. Then select services under the left side menu and search for Amazon Elastic Compute Cloud (Amazon EC2). This will bring up the service quota page; here search for inf and there should be an option for Running On-Demand Inf instances. Increase this quota to 300.
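      If you prefer the CLI, the same increase can be requested with the commands below (the quota code is looked up first because it varies; treat the JMESPath filter as a convenience, not an exact recipe):

        # find the quota code for "Running On-Demand Inf instances"
        aws service-quotas list-service-quotas --service-code ec2 \
          --query "Quotas[?contains(QuotaName, 'Inf')]"

        # request the increase, replacing <quota-code> with the code returned above
        aws service-quotas request-service-quota-increase --service-code ec2 \
          --quota-code <quota-code> --desired-value 300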

    3. Clone the DoEKS repository

      git clone https://github.com/awslabs/data-on-eks.git
    4. Ensure that the region names are correct in variables.tf file before running the cluster creation script.

    5. Ensure that the ELB to be created would be external facing. Change the helm value from internal to internet-facing here.

    6. Ensure that the IAM role you are using has the permissions needed to create the cluster. While we expect the following set of permissions to work, the current recommendation is to also add the AdministratorAccess permission to the IAM role. At a later date you could remove the AdministratorAccess policy and experiment with cluster creation without it.

      1. Attach the following managed policies: AmazonEKSClusterPolicy, AmazonEKS_CNI_Policy, and AmazonEKSWorkerNodePolicy.
      2. In addition to the managed policies add the following as inline policy. Replace your-account-id with the actual value of the AWS account id you are using.

        {\n\"Version\": \"2012-10-17\",\n\"Statement\": [\n    {\n        \"Sid\": \"VisualEditor0\",\n        \"Effect\": \"Allow\",\n        \"Action\": [\n            \"ec2:CreateVpc\",\n            \"ec2:DeleteVpc\"\n        ],\n        \"Resource\": [\n            \"arn:aws:ec2:*:your-account-id:ipv6pool-ec2/*\",\n            \"arn:aws:ec2::your-account-id:ipam-pool/*\",\n            \"arn:aws:ec2:*:your-account-id:vpc/*\"\n        ]\n    },\n    {\n        \"Sid\": \"VisualEditor1\",\n        \"Effect\": \"Allow\",\n        \"Action\": [\n            \"ec2:ModifyVpcAttribute\",\n            \"ec2:DescribeVpcAttribute\"\n        ],\n        \"Resource\": \"arn:aws:ec2:*:<your-account-id>:vpc/*\"\n    },\n    {\n        \"Sid\": \"VisualEditor2\",\n        \"Effect\": \"Allow\",\n        \"Action\": \"ec2:AssociateVpcCidrBlock\",\n        \"Resource\": [\n            \"arn:aws:ec2:*:your-account-id:ipv6pool-ec2/*\",\n            \"arn:aws:ec2::your-account-id:ipam-pool/*\",\n            \"arn:aws:ec2:*:your-account-id:vpc/*\"\n        ]\n    },\n    {\n        \"Sid\": \"VisualEditor3\",\n        \"Effect\": \"Allow\",\n        \"Action\": [\n            \"ec2:DescribeSecurityGroupRules\",\n            \"ec2:DescribeNatGateways\",\n            \"ec2:DescribeAddressesAttribute\"\n        ],\n        \"Resource\": \"*\"\n    },\n    {\n        \"Sid\": \"VisualEditor4\",\n        \"Effect\": \"Allow\",\n        \"Action\": [\n            \"ec2:CreateInternetGateway\",\n            \"ec2:RevokeSecurityGroupEgress\",\n            \"ec2:CreateRouteTable\",\n            \"ec2:CreateSubnet\"\n        ],\n        \"Resource\": [\n            \"arn:aws:ec2:*:your-account-id:security-group/*\",\n            \"arn:aws:ec2:*:your-account-id:internet-gateway/*\",\n            \"arn:aws:ec2:*:your-account-id:subnet/*\",\n            \"arn:aws:ec2:*:your-account-id:route-table/*\",\n            \"arn:aws:ec2::your-account-id:ipam-pool/*\",\n            \"arn:aws:ec2:*:your-account-id:vpc/*\"\n        ]\n    },\n    {\n        \"Sid\": \"VisualEditor5\",\n        \"Effect\": \"Allow\",\n        \"Action\": [\n            \"ec2:AttachInternetGateway\",\n            \"ec2:AssociateRouteTable\"\n        ],\n        \"Resource\": [\n            \"arn:aws:ec2:*:your-account-id:vpn-gateway/*\",\n            \"arn:aws:ec2:*:your-account-id:internet-gateway/*\",\n            \"arn:aws:ec2:*:your-account-id:subnet/*\",\n            \"arn:aws:ec2:*:your-account-id:route-table/*\",\n            \"arn:aws:ec2:*:your-account-id:vpc/*\"\n        ]\n    },\n    {\n        \"Sid\": \"VisualEditor6\",\n        \"Effect\": \"Allow\",\n        \"Action\": \"ec2:AllocateAddress\",\n        \"Resource\": [\n            \"arn:aws:ec2:*:your-account-id:ipv4pool-ec2/*\",\n            \"arn:aws:ec2:*:your-account-id:elastic-ip/*\"\n        ]\n    },\n    {\n        \"Sid\": \"VisualEditor7\",\n        \"Effect\": \"Allow\",\n        \"Action\": \"ec2:ReleaseAddress\",\n        \"Resource\": \"arn:aws:ec2:*:your-account-id:elastic-ip/*\"\n    },\n    {\n        \"Sid\": \"VisualEditor8\",\n        \"Effect\": \"Allow\",\n        \"Action\": \"ec2:CreateNatGateway\",\n        \"Resource\": [\n            \"arn:aws:ec2:*:your-account-id:subnet/*\",\n            \"arn:aws:ec2:*:your-account-id:natgateway/*\",\n            \"arn:aws:ec2:*:your-account-id:elastic-ip/*\"\n        ]\n    }\n]\n}\n
        1. Add the Role ARN and name here in the variables.tf file by updating these lines. Move the structure inside the default list and replace the role ARN and name values with the values for the role you are using.

    7. Navigate into the ai-ml/trainium-inferentia/ directory and run install.sh script.

      cd data-on-eks/ai-ml/trainium-inferentia/\n./install.sh\n

      Note: This step takes about 12-15 minutes to deploy the EKS infrastructure and cluster in the AWS account. To view more details on cluster creation, view an example here: Deploy Llama3 on EKS in the prerequisites section.

    8. After the cluster is created, navigate to the Karpenter EC2 node IAM role called karpenter-trainium-inferentia-XXXXXXXXXXXXXXXXXXXXXXXXX. Attach the following inline policy to the role:

      {\n    \"Version\": \"2012-10-17\",\n    \"Statement\": [\n        {\n            \"Sid\": \"Statement1\",\n            \"Effect\": \"Allow\",\n            \"Action\": [\n                \"iam:CreateServiceLinkedRole\"\n            ],\n            \"Resource\": \"*\"\n        }\n    ]\n}\n
    "},{"location":"misc/the-diy-version-w-gory-details.html","title":"The diy version w gory details","text":""},{"location":"misc/the-diy-version-w-gory-details.html#the-diy-version-with-gory-details","title":"The DIY version (with gory details)","text":"

    Follow the prerequisites below to set up your environment before running the code:

    1. Python 3.11: Setup a Python 3.11 virtual environment and install FMBench.

      python -m venv .fmbench\nsource .fmbench/bin/activate\npip install fmbench\n
    2. S3 buckets for test data, scripts, and results: Create two buckets within your AWS account:

      • Read bucket: This bucket contains tokenizer files, prompt template, source data and deployment scripts stored in a directory structure as shown below. FMBench needs to have read access to this bucket.

        s3://<read-bucket-name>\n    \u251c\u2500\u2500 source_data/\n    \u251c\u2500\u2500 source_data/<source-data-file-name>.json\n    \u251c\u2500\u2500 prompt_template/\n    \u251c\u2500\u2500 prompt_template/prompt_template.txt\n    \u251c\u2500\u2500 scripts/\n    \u251c\u2500\u2500 scripts/<deployment-script-name>.py\n    \u251c\u2500\u2500 tokenizer/\n    \u251c\u2500\u2500 tokenizer/tokenizer.json\n    \u251c\u2500\u2500 tokenizer/config.json\n
        • The details of the bucket structure are as follows:

          1. Source Data Directory: Create a source_data directory that stores the dataset you want to benchmark with. FMBench uses Q&A datasets from the LongBench dataset or alternatively from this link. Support for bringing your own dataset will be added soon.

            • Download the different files specified in the LongBench dataset into the source_data directory. Following is a good list to get started with:

              • 2wikimqa
              • hotpotqa
              • narrativeqa
              • triviaqa

              Store these files in the source_data directory.
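              If you prefer to pull these subsets programmatically rather than downloading the files by hand, the following is a minimal sketch of one way to do it, assuming the datasets Python package and that the LongBench subsets are published on the Hugging Face Hub under THUDM/LongBench; the repository id, split name and output directory are assumptions, so verify them against the LongBench documentation.

```python
# Sketch only: pull a few LongBench subsets and write them into source_data/.
# The repo id "THUDM/LongBench", the "test" split and the output directory are
# assumptions; older datasets versions (or trust_remote_code=True) may be needed.
from pathlib import Path

from datasets import load_dataset

subsets = ["2wikimqa", "hotpotqa", "narrativeqa", "triviaqa"]
out_dir = Path("source_data")
out_dir.mkdir(exist_ok=True)

for name in subsets:
    ds = load_dataset("THUDM/LongBench", name, split="test")
    # One JSON object per line, e.g. source_data/2wikimqa.json
    ds.to_json(str(out_dir / f"{name}.json"), lines=True)
```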

          2. Prompt Template Directory: Create a prompt_template directory that contains a prompt_template.txt file. This .txt file contains the prompt template that your specific model supports. FMBench already supports the prompt template compatible with Llama models.

          3. Scripts Directory: FMBench also supports a bring your own script (BYOS) mode for deploying models that are not natively available via SageMaker JumpStart, i.e., anything not included in this list. Here are the steps to use BYOS.

            1. Create a Python script to deploy your model on a SageMaker endpoint. This script needs to have a deploy function that 2_deploy_model.ipynb can invoke. See p4d_hf_tgi.py for reference; a minimal illustrative sketch is also shown after the list of supported inference scripts below.

            2. Place your deployment script in the scripts directory in your read bucket. If your script deploys a model directly from HuggingFace and needs to have access to a HuggingFace auth token, then create a file called hf_token.txt and put the auth token in that file. The .gitignore file in this repo has rules to not commit the hf_token.txt to the repo. Today, FMBench provides inference scripts for:

              • All SageMaker Jumpstart Models
              • Text-Generation-Inference (TGI) container supported models
              • Deep Java Library DeepSpeed container supported models

              Deployment scripts for the options above are available in the scripts directory; you can use these as references for creating your own deployment scripts as well.
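              To make the expected shape of a BYOS deployment script concrete, here is a minimal sketch. Only the existence of a deploy function comes from the step above; the parameter names, container versions and return value are assumptions, so treat p4d_hf_tgi.py in the scripts directory as the authoritative reference.

```python
# Hypothetical BYOS deployment script (e.g. scripts/my_model_deploy.py).
# The deploy() function name comes from the docs above; its arguments and
# return value are assumptions, not FMBench's actual contract.
from sagemaker.huggingface import HuggingFaceModel


def deploy(experiment_config: dict, role_arn: str) -> dict:
    """Deploy a Hugging Face model on a SageMaker endpoint and return its name."""
    model = HuggingFaceModel(
        role=role_arn,
        env={"HF_MODEL_ID": experiment_config["model_id"]},
        transformers_version="4.37.0",
        pytorch_version="2.1.0",
        py_version="py310",
    )
    predictor = model.deploy(
        initial_instance_count=1,
        instance_type=experiment_config["instance_type"],
        endpoint_name=experiment_config["endpoint_name"],
    )
    return {"endpoint_name": predictor.endpoint_name}
```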

          4. Tokenizer Directory: Place the tokenizer.json, config.json and any other files required for your model's tokenizer in the tokenizer directory. The tokenizer for your model should be compatible with the tokenizers package. FMBench uses AutoTokenizer.from_pretrained to load the tokenizer. As an example, to use the Llama 2 tokenizer for counting prompt and generation tokens for the Llama 2 family of models: accept the license here (meta approval form), download the tokenizer.json and config.json files from the Hugging Face website and place them in the tokenizer directory. A quick sanity-check sketch follows.
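            As a quick sanity check that the files you placed in the tokenizer directory will load, you can open them with AutoTokenizer.from_pretrained (which, per the note above, is what FMBench uses) and count the tokens in a prompt; the local path below is an assumption, so point it at a local copy of your tokenizer directory.

```python
# Sanity-check sketch: load the tokenizer files (tokenizer.json, config.json)
# from a local copy of the tokenizer directory and count tokens in a prompt.
# The "./tokenizer" path is an assumption.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("./tokenizer")

prompt = "What is the capital of France?"
print(f"prompt token count: {len(tokenizer.encode(prompt))}")
```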

      • Write bucket: All prompt payloads, model endpoint information and metrics generated by FMBench are stored in this bucket. FMBench requires write permissions to store the results in this bucket. No directory structure needs to be pre-created in this bucket; everything is created by FMBench at runtime.

        ```{.bash}
        s3://<write-bucket-name>
            ├── 
            ├── /data
            ├── /data/metrics
            ├── /data/models
            ├── /data/prompts
        ```
        "},{"location":"index.html","title":"Benchmark foundation models on AWS","text":"

        FMBench is a Python package for running performance benchmarks for any Foundation Model (FM) deployed on any AWS Generative AI service, be it Amazon SageMaker, Amazon Bedrock, Amazon EKS, or Amazon EC2. The FMs can be deployed on these platforms directly through FMBench, or, if they are already deployed, they can be benchmarked through the Bring your own endpoint mode supported by FMBench.

        Here are some salient features of FMBench:

        1. Highly flexible: it allows for using any combination of instance types (g5, p4d, p5, Inf2), inference containers (DeepSpeed, TensorRT, HuggingFace TGI and others) and parameters such as tensor parallelism, rolling batch, etc., as long as those are supported by the underlying platform.

        2. Benchmark any model: it can be used to benchmark open-source models, third party models, and proprietary models trained by enterprises on their own data.

        3. Run anywhere: it can be run on any AWS platform where we can run Python, such as Amazon EC2, Amazon SageMaker, or even the AWS CloudShell. It is important to run this tool on an AWS platform so that internet round trip time does not get included in the end-to-end response time latency.

        "},{"location":"index.html#the-need-for-benchmarking","title":"The need for benchmarking","text":"

        Customers often wonder which AWS service is best for running FMs for their specific use-case and their specific price performance requirements. While model evaluation metrics are available on several leaderboards (HELM, LMSys), price performance comparisons can be notoriously hard to find and even harder to trust. In such a scenario, we think it is best to be able to run performance benchmarking yourself, either on your own dataset or on similar (task-wise, prompt-size-wise) open-source datasets such as LongBench or QMSum. This is the problem that FMBench solves.

        "},{"location":"index.html#fmbench-an-open-source-python-package-for-fm-benchmarking-on-aws","title":"FMBench: an open-source Python package for FM benchmarking on AWS","text":"

        FMBench runs inference requests against endpoints that are either deployed through FMBench itself (as in the case of SageMaker), available as a fully-managed endpoint (as in the case of Bedrock), or provided as a bring your own endpoint. Metrics such as inference latency, transactions per minute, error rates and cost per transaction are captured and presented in the form of a Markdown report containing explanatory text, tables and figures. The figures and tables in the report provide insights into what might be the best serving stack (instance type, inference container and configuration parameters) for a given FM for a given use-case.
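        To make the reported metrics concrete, the sketch below rolls a handful of hypothetical per-request records into latency percentiles, transactions per minute and cost per transaction. The numbers and field names are made up for illustration; this is not FMBench's internal code.

```python
# Illustration only -- not FMBench's implementation. Rolls hypothetical
# per-request records into the kind of metrics described above.
import statistics

records = [  # hypothetical per-request results
    {"latency_s": 1.8, "ok": True},
    {"latency_s": 2.1, "ok": True},
    {"latency_s": 2.4, "ok": False},
    {"latency_s": 1.9, "ok": True},
]
test_window_minutes = 2.0        # assumed length of the test window
instance_cost_per_hour = 10.0    # hypothetical instance price in USD

latencies = [r["latency_s"] for r in records]
successes = sum(r["ok"] for r in records)

metrics = {
    "latency_mean": statistics.mean(latencies),
    "latency_p95": statistics.quantiles(latencies, n=20)[18],  # 95th percentile
    "error_rate": 1 - successes / len(records),
    "transactions_per_minute": successes / test_window_minutes,
    "price_per_txn": (instance_cost_per_hour / 60) * test_window_minutes / successes,
}
print(metrics)
```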

        The following figure gives an example of the price performance numbers that include inference latency, transactions per minute and concurrency level for running the Llama2-13b model on different instance types available on SageMaker, using prompts for a Q&A task created from the LongBench dataset; these prompts are between 3000 and 3840 tokens in length. Note that the numbers are hidden in this figure but you would be able to see them when you run FMBench yourself.

        The following table (also included in the report) provides information about the best available instance type for that experiment1.

        Information | Value
        --- | ---
        experiment_name | llama2-13b-inf2.24xlarge
        payload_file | payload_en_3000-3840.jsonl
        instance_type | ml.inf2.24xlarge
        concurrency | **
        error_rate | **
        prompt_token_count_mean | 3394
        prompt_token_throughput | 2400
        completion_token_count_mean | 31
        completion_token_throughput | 15
        latency_mean | **
        latency_p50 | **
        latency_p95 | **
        latency_p99 | **
        transactions_per_minute | **
        price_per_txn | **

        1 ** values hidden on purpose, these are available when you run the tool yourself.

        The report also includes latency vs. prompt size charts for different concurrency levels. As expected, inference latency increases as prompt size increases, but what is interesting to note is that the increase is much steeper at higher concurrency levels (and this behavior varies with instance types).

        "},{"location":"index.html#determine-the-optimal-model-for-your-generative-ai-workload","title":"Determine the optimal model for your generative AI workload","text":"

        Use FMBench to determine model accuracy using a panel of LLM evaluators (PoLL [1]). Here is one of the plots generated by FMBench to help answer the accuracy question for various FMs on Amazon Bedrock (the model ids in the charts have been blurred out on purpose, you can find them in the actual plot generated on running FMBench).

        "},{"location":"index.html#references","title":"References","text":"

        [1] Pat Verga et al., \"Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models\", arXiv:2404.18796, 2024.

        "},{"location":"accuracy.html","title":"Model evaluations using panel of LLM evaluators","text":"

        FMBench release 2.0.0 adds support for evaluating candidate models using Majority Voting with a Panel of LLM Evaluators (PoLL). It gathers quantitative metrics such as Cosine Similarity and overall majority voting accuracy metrics to measure the similarity and accuracy of model responses compared to the ground truth.

        Accuracy is defined as the percentage of responses generated by the LLM that match the ground truth included in the dataset (as a separate column). In order to determine if an LLM generated response matches the ground truth, we ask other LLMs, called the evaluator LLMs, to compare the LLM output and the ground truth and provide a verdict on whether the LLM generated response is correct or not given the ground truth. Here is the link to the Anthropic Claude 3 Sonnet model prompt being used as an evaluator (or a judge model). A combination of the cosine similarity and the LLM evaluator verdict decides if the LLM generated response is correct or incorrect. Finally, a single LLM evaluator could be biased or have inaccuracies, so instead of relying on the judgement of a single evaluator, we rely on the majority vote of 3 different LLM evaluators. By default we use the Anthropic Claude 3 Sonnet, Meta Llama3-70b and the Cohere Command R plus model as LLM evaluators. See Pat Verga et al., \"Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models\", arXiv:2404.18796, 2024, for more details on using a Panel of LLM Evaluators (PoLL).
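        The sketch below illustrates the decision logic just described, combining a panel majority vote with a cosine similarity check. The 0.5 threshold, the verdict strings and the category names are assumptions chosen for illustration; this is not FMBench's actual implementation.

```python
# Illustration of the PoLL idea described above -- not FMBench's code.
# Threshold, verdict strings and category names are assumptions.
from collections import Counter


def final_category(evaluator_verdicts: list[str], cosine_similarity: float,
                   threshold: float = 0.5) -> str:
    """Combine the majority vote of the evaluator LLMs with cosine similarity."""
    majority = Counter(evaluator_verdicts).most_common(1)[0][0]
    if majority == "correct" and cosine_similarity >= threshold:
        return "correctly correct"
    if majority == "incorrect" and cosine_similarity < threshold:
        return "correctly incorrect"
    return "needs further evaluation"


# Verdicts from three evaluator LLMs for a single candidate-model response.
print(final_category(["correct", "correct", "incorrect"], cosine_similarity=0.82))
```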

        "},{"location":"accuracy.html#evaluation-flow","title":"Evaluation Flow","text":"
        1. Provide a dataset that includes ground truth responses for each sample. FMBench uses the LongBench dataset by default.

        2. Configure the candidate models to be evaluated in the FMBench config file. See this config file for an example that runs evaluations for multiple models available via Amazon Bedrock. Running evaluations only requires the following two changes to the config file:

          • Set 4_get_evaluations.ipynb: yes, see this line.
          • Set the ground_truth_col_key: answers and question_col_key: input parameters, see this line. The values of ground_truth_col_key and question_col_key are set to the names of the columns in the dataset that contain the ground truth and the question respectively.
        3. Run FMBench, which will:

          • Fetch the inference results containing the model responses
          • Calculate quantitative metrics (Cosine Similarity)
          • Use a Panel of LLM Evaluators to compare each model response to the ground truth
          • Have each LLM evaluator provide a binary verdict (correct/incorrect) and an explanation
          • Validate the LLM evaluations using Cosine Similarity thresholds
          • Categorize the final evaluation for each response as correctly correct, correctly incorrect, or needs further evaluation

        4. Review the FMBench report to analyze the evaluation results and compare the performance of the candidate models. The report contains tables and charts that provide insights into model accuracy.

        By leveraging ground truth data and a Panel of LLM Evaluators, FMBench provides a comprehensive and efficient way to assess the quality of generative AI models. The majority voting approach, combined with quantitative metrics, enables a robust evaluation that reduces bias and latency while maintaining consistency across responses.

        "},{"location":"advanced.html","title":"Advanced","text":"

        Beyond running FMBench with the configuration files provided, you may want to try bringing your own dataset or endpoint to FMBench.

        "},{"location":"analytics.html","title":"Generate downstream summarized reports for further analysis","text":"

        You can use the results from several FMBench runs to generate a summarized report of all runs based on your cost, latency, and concurrency budgets. This report helps answer the following question:

        \u201cWhat is the minimum number of instances N, of the most cost optimal instance type T, that are needed to serve a real-time workload W while keeping the average transaction latency under L seconds?\u201d

        W := {R transactions per minute, average prompt token length P, average generation token length G}\n
        • With this summarized report, we test the following hypothesis: at the low end of the total number of requests per minute, smaller instances that provide good inference latency at low concurrencies suffice (said another way, the larger more expensive instances are overkill at this stage), but as the number of requests per minute increases there comes an inflection point beyond which so many of the smaller instances would be required that it becomes more economical to use fewer of the larger, more expensive instances. The toy calculation below illustrates this.
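        The per-instance throughput and hourly prices in this sketch are made-up assumptions, not benchmark results; it only shows the shape of the inflection point: the smaller instance is cheaper at low request rates, while past a certain rate the larger instance needs far fewer copies and becomes the more economical choice.

```python
# Toy numbers only -- they do not come from any FMBench run.
import math

instance_types = {
    # name: (requests per minute one instance can sustain, hourly cost in USD)
    "smaller_instance": (20, 2.0),
    "larger_instance": (150, 12.0),
}

for workload_rpm in (10, 100, 1_000, 10_000):
    summary = {}
    for name, (rpm_per_instance, hourly_cost) in instance_types.items():
        count = math.ceil(workload_rpm / rpm_per_instance)
        summary[name] = (count, round(count * hourly_cost, 2))
    print(workload_rpm, summary)
```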
        "},{"location":"analytics.html#an-example-report-that-gets-generated-is-as-follows","title":"An example report that gets generated is as follows:","text":""},{"location":"analytics.html#summary-for-payload-payload_en_x-y","title":"Summary for payload: payload_en_x-y","text":"
        • The metrics in the table below are examples and do not represent any specific model or instance type. The table can be used to analyze cost and the number of instances to maintain for your use case. For example, instance_type_1 costs 10 dollars and requires 1 instance to host model_1 up to 100 requests per minute. As the requests scale to 1,000 requests per minute, 5 instances are required at a cost of 50 dollars. As the requests scale to 10,000 requests per minute, the number of instances to maintain scales to 30, and the cost becomes 450 dollars.

        • On the other hand, instance_type_2 is more costly, with a price of $499 at 10,000 requests per minute to host the same model, but it only requires 22 instances to maintain, which is 8 fewer than when the model is hosted on instance_type_1.

        • Based on these summaries, users can make decisions based on their use case priorities. For a real-time and latency sensitive application, a user might select instance_type_2 to host model_1 since the user would have to maintain 8 fewer instances than when hosting the model on instance_type_1. Hosting the model on instance_type_2 would also maintain a p_95 latency of 0.5s, which is half that of instance_type_1 (p_95 latency: 1s), even though it costs more than instance_type_1. On the other hand, if the application is cost sensitive and the user is flexible about maintaining more instances at a higher latency, they might want to shift gears to using instance_type_1.

        • Note: Based on varying needs for prompt size, cost, and latency, the table might change.

        experiment_name | instance_type | concurrency | latency_p95 | transactions_per_minute | instance_count_and_cost_1_rpm | instance_count_and_cost_10_rpm | instance_count_and_cost_100_rpm | instance_count_and_cost_1000_rpm | instance_count_and_cost_10000_rpm
        --- | --- | --- | --- | --- | --- | --- | --- | --- | ---
        model_1 | instance_type_1 | 1 | 1.0 | _ | (1, 10) | (1, 10) | (1, 10) | (5, 50) | (30, 450)
        model_1 | instance_type_2 | 1 | 0.5 | _ | (1, 10) | (1, 20) | (1, 20) | (6, 47) | (22, 499)
        "},{"location":"analytics.html#fmbench-heatmap","title":"FMBench Heatmap","text":"

        This step also generates a heatmap that contains information about each instance and how much it costs, with a per requests-per-minute (rpm) breakdown. The default breakdown is [1 rpm, 10 rpm, 100 rpm, 1000 rpm, 10000 rpm]. View an example of a heatmap below. The model name and instance type are masked in the example, but the heatmap can be generated for your specific use case/requirements.

        "},{"location":"analytics.html#steps-to-run-analytics","title":"Steps to run analytics","text":"
        1. Clone the FMBench repo from GitHub.

        2. Place all of the result-{model-id}-... folders that are generated from various runs in the top level directory.

        3. Run the following command to generate downstream analytics and summarized tables. Replace x, y, z and model_id with the latency threshold, the concurrency threshold, the payload file of interest (for example payload_en_1000-2000.jsonl) and the model_id respectively. The model_id has to be appended to the results-{model-id} folders so that analytics.py can generate a report for all of those result folders.

          python analytics/analytics.py --latency-threshold x --concurrency-threshold y  --payload-file z --model-id model_id\n
        "},{"location":"announcement.html","title":"Release 2.0 announcement","text":"

        We are excited to share news about a major FMBench release: we now have release 2.0 of FMBench, which supports model evaluations through a panel of LLM evaluators \ud83c\udf89. With the recent feature additions to FMBench we are already seeing increased interest from customers and hope to reach even more customers and have an even greater impact. Check out all the latest and greatest features from FMBench on the FMBench website.

        Support for Model Evaluations: FMBench now adds support for evaluating candidate models using Majority Voting with a Panel of LLM Evaluators. Customers can now use FMBench to evaluate model accuracy across open-source and custom datasets, thus FMBench now enables customers to not only measure performance (inference latency, cost, throughput) but also model accuracy.

        Native support for LLM compilation and deployment on AWS Silicon: FMBench now supports end-to-end compilation and model deployment on AWS Silicon. Customers no longer have to wait for models to be available for AWS Chips via SageMaker JumpStart and neither do they have to go through the process of compiling the model to Neuron themselves, FMBench does it all for them. We can simply put the relevant configuration options in the FMBench config file and it will compile and deploy the model on SageMaker (config) or EC2 (config).

        Website for better user experience: FMBench has a website now along with an introduction video. The website is fully searchable to ease common tasks such as installation, finding the right config file, benchmarking on various hosting platforms (EC2, EKS, Bedrock, Neuron, Docker), model evaluation, etc. This website was created based on feedback from several internal teams and external customers.

        Native support for all AWS generative AI services: FMBench now benchmarks and evaluates any Foundation Model (FM) deployed on any AWS Generative AI service, be it Amazon SageMaker, Amazon Bedrock, Amazon EKS, or Amazon EC2. We initially built FMBench for SageMaker, and later extended it to Bedrock and then based on customer requests extended it to support models on EKS and EC2 as well. See list of config files supported out of the box, you can use these config files either as is or as templates for creating your own custom config.

        "},{"location":"benchmarking.html","title":"Benchmark models deployed on different AWS Generative AI services","text":"

        FMBench comes packaged with configuration files for benchmarking models on different AWS Generative AI services.

        "},{"location":"benchmarking.html#full-list-of-benchmarked-models","title":"Full list of benchmarked models","text":"Model EC2 g5 EC2 p4 EC2 p5 EC2 Inf2/Trn1 SageMaker g4dn/g5/p3 SageMaker Inf2/Trn1 SageMaker P4 SageMaker P5 Bedrock On-demand throughput Bedrock provisioned throughput Anthropic Claude-3 Sonnet \u2705 \u2705 Anthropic Claude-3 Haiku \u2705 Mistral-7b-instruct \u2705 \u2705 \u2705 \u2705 \u2705 Mistral-7b-AWQ \u2705 Mixtral-8x7b-instruct \u2705 Llama3.1-8b instruct \u2705 \u2705 \u2705 \u2705 \u2705 \u2705 \u2705 Llama3.1-70b instruct \u2705 \u2705 \u2705 Llama3-8b instruct \u2705 \u2705 \u2705 \u2705 \u2705 \u2705 \u2705 Llama3-70b instruct \u2705 \u2705 \u2705 \u2705 \u2705 Llama2-13b chat \u2705 \u2705 \u2705 \u2705 Llama2-70b chat \u2705 \u2705 \u2705 \u2705 Amazon Titan text lite \u2705 Amazon Titan text express \u2705 Cohere Command text \u2705 Cohere Command light text \u2705 AI21 J2 Mid \u2705 AI21 J2 Ultra \u2705 Gemma-2b \u2705 Phi-3-mini-4k-instruct \u2705 distilbert-base-uncased \u2705"},{"location":"benchmarking_on_bedrock.html","title":"Benchmark models on Bedrock","text":"

        Choose any config file from the bedrock folder and either run these directly or use them as templates for creating new config files specific to your use-case. Here is an example for benchmarking the Llama3 models on Bedrock.

        fmbench --config-file https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/src/fmbench/configs/bedrock/config-bedrock-llama3.yml > fmbench.log 2>&1\n
        "},{"location":"benchmarking_on_ec2.html","title":"Benchmark models on EC2","text":"

        You can use FMBench to benchmark models hosted on EC2. This can be done in one of two ways:

        • Deploy the model on your EC2 instance independently of FMBench and then benchmark it through the Bring your own endpoint mode.
        • Deploy the model on your EC2 instance through FMBench and then benchmark it.

        The steps for deploying the model on your EC2 instance are described below.

        \ud83d\udc49 In this configuration both the model being benchmarked and FMBench are deployed on the same EC2 instance.

        Create a new EC2 instance suitable for hosting an LMI as per the steps described here. Note that you will need to select the correct AMI based on your instance type, this is called out in the instructions.

        The steps for benchmarking on different types of EC2 instances (GPU/CPU/Neuron) and different inference containers differ slightly. These are all described below.

        "},{"location":"benchmarking_on_ec2.html#benchmarking-options-on-ec2","title":"Benchmarking options on EC2","text":"
        • Benchmarking on an instance type with NVIDIA GPUs or AWS Chips
        • Benchmarking on an instance type with NVIDIA GPU and the Triton inference server
        • Benchmarking on an instance type with AWS Chips and the Triton inference server
        • Benchmarking on a CPU instance type with AMD processors
        • Benchmarking on a CPU instance type with Intel processors

        • Benchmarking the Triton inference server

        "},{"location":"benchmarking_on_ec2.html#benchmarking-on-an-instance-type-with-nvidia-gpus-or-aws-chips","title":"Benchmarking on an instance type with NVIDIA GPUs or AWS Chips","text":"
        1. Connect to your instance using any of the options in EC2 (SSH/EC2 Connect), run the following in the EC2 terminal. This command installs Anaconda on the instance which is then used to create a new conda environment for FMBench.

          wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh\nbash Miniconda3-latest-Linux-x86_64.sh -b  # Run the Miniconda installer in batch mode (no manual intervention)\nrm -f Miniconda3-latest-Linux-x86_64.sh    # Remove the installer script after installation\neval \"$(/home/$USER/miniconda3/bin/conda shell.bash hook)\" # Initialize conda for bash shell\nconda init  # Initialize conda, adding it to the shell  \n
        2. Install docker-compose.

          sudo apt-get update\nsudo apt-get install --reinstall docker.io -y\nsudo apt-get install -y docker-compose\ndocker compose version \n
        3. Setup the fmbench_python311 conda environment.

          conda create --name fmbench_python311 -y python=3.11 ipykernel\nsource activate fmbench_python311;\npip install -U fmbench\n
        4. Create local directory structure needed for FMBench and copy all publicly available dependencies from the AWS S3 bucket for FMBench. This is done by running the copy_s3_content.sh script available as part of the FMBench repo. Replace /tmp in the command below with a different path if you want to store the config files and the FMBench generated data in a different directory.

          curl -s https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/copy_s3_content.sh | sh -s -- /tmp\n
        5. To download the model files from HuggingFace, create a hf_token.txt file in the /tmp/fmbench-read/scripts/ directory containing the Hugging Face token you would like to use. In the command below replace the hf_yourtokenstring with your Hugging Face token.

          echo hf_yourtokenstring > /tmp/fmbench-read/scripts/hf_token.txt\n
        6. Run FMBench with a packaged or a custom config file. This step will also deploy the model on the EC2 instance. The --write-bucket parameter value is just a placeholder and an actual S3 bucket is not required. Skip to the next step if benchmarking for AWS Chips. You could set the --tmp-dir flag to an EFA path instead of /tmp if using a shared path for storing config files and reports.

          fmbench --config-file /tmp/fmbench-read/configs/llama3/8b/config-ec2-llama3-8b.yml --local-mode yes --write-bucket placeholder --tmp-dir /tmp > fmbench.log 2>&1\n
        7. For example, to run FMBench on a llama3-8b-Instruct model on an inf2.48xlarge instance, run the command below. The config file for this example can be viewed here.

          fmbench --config-file /tmp/fmbench-read/configs/llama3/8b/config-ec2-llama3-8b-inf2-48xl.yml --local-mode yes --write-bucket placeholder --tmp-dir /tmp > fmbench.log 2>&1\n
        8. Open a new Terminal and do a tail on fmbench.log to see a live log of the run.

          tail -f fmbench.log\n
        9. All metrics are stored in the /tmp/fmbench-write directory created automatically by the fmbench package. Once the run completes all files are copied locally in a results-* folder as usual.

        "},{"location":"benchmarking_on_ec2.html#benchmarking-on-an-instance-type-with-nvidia-gpu-and-the-triton-inference-server","title":"Benchmarking on an instance type with NVIDIA GPU and the Triton inference server","text":"
        1. Follow steps in the Benchmarking on an instance type with NVIDIA GPUs or AWS Chips section to install FMBench but do not run any benchmarking tests yet.

        2. Once FMBench is installed then install the following additional dependencies for Triton.

          cd ~\ngit clone https://github.com/triton-inference-server/tensorrtllm_backend.git  --branch v0.12.0\n# Update the submodules\ncd tensorrtllm_backend\n# Install git-lfs if needed\napt-get update && apt-get install git-lfs -y --no-install-recommends\ngit lfs install\ngit submodule update --init --recursive\n
        3. Now you are ready to run benchmarking with Triton. For example for benchmarking Llama3-8b model on a g5.12xlarge use the following command:

          fmbench --config-file /tmp/fmbench-read/configs/llama3/8b/config-llama3-8b-g5.12xl-tp-2-mc-max-triton-ec2.yml --local-mode yes --write-bucket placeholder --tmp-dir /tmp > fmbench.log 2>&1\n
        "},{"location":"benchmarking_on_ec2.html#benchmarking-on-an-instance-type-with-aws-chips-and-the-triton-inference-server","title":"Benchmarking on an instance type with AWS Chips and the Triton inference server","text":"

        As of 2024-09-26 this has been tested on a trn1.32xlarge instance

        1. Connect to your instance using any of the options in EC2 (SSH/EC2 Connect), run the following in the EC2 terminal. This command installs Anaconda on the instance which is then used to create a new conda environment for FMBench. See instructions for downloading anaconda here. (Note: Configure the storage of your EC2 instance to 500GB for this test)

          # Install Docker and Git using the YUM package manager\nsudo yum install docker git -y\n\n# Start the Docker service\nsudo systemctl start docker\n\n# Download the Miniconda installer for Linux\nwget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh\nbash Miniconda3-latest-Linux-x86_64.sh -b  # Run the Miniconda installer in batch mode (no manual intervention)\nrm -f Miniconda3-latest-Linux-x86_64.sh    # Remove the installer script after installation\neval \"$(/home/$USER/miniconda3/bin/conda shell.bash hook)\" # Initialize conda for bash shell\nconda init  # Initialize conda, adding it to the shell\n
        2. Setup the fmbench_python311 conda environment.

          # Create a new conda environment named 'fmbench_python311' with Python 3.11 and ipykernel\nconda create --name fmbench_python311 -y python=3.11 ipykernel\n\n# Activate the newly created conda environment\nsource activate fmbench_python311\n\n# Upgrade pip and install the fmbench package\npip install -U fmbench\n
        3. First we need to build the required docker image for triton, and push it locally. To do this, curl the Triton Dockerfile and the script to build and push the triton image locally:

              # curl the docker file for triton\n    curl -o ./Dockerfile_triton https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/src/fmbench/scripts/triton/Dockerfile_triton\n\n    # curl the script that builds and pushes the triton image locally\n    curl -o build_and_push_triton.sh https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/src/fmbench/scripts/triton/build_and_push_triton.sh\n\n    # Make the triton build and push script executable, and run it\n    chmod +x build_and_push_triton.sh\n    ./build_and_push_triton.sh\n
          - Now wait until the docker image is saved locally and then follow the instructions below to start a benchmarking test.

        4. Create local directory structure needed for FMBench and copy all publicly available dependencies from the AWS S3 bucket for FMBench. This is done by running the copy_s3_content.sh script available as part of the FMBench repo. Replace /tmp in the command below with a different path if you want to store the config files and the FMBench generated data in a different directory.

          curl -s https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/copy_s3_content.sh | sh -s -- /tmp\n
        5. To download the model files from HuggingFace, create a hf_token.txt file in the /tmp/fmbench-read/scripts/ directory containing the Hugging Face token you would like to use. In the command below replace the hf_yourtokenstring with your Hugging Face token.

          echo hf_yourtokenstring > /tmp/fmbench-read/scripts/hf_token.txt\n
        6. Run FMBench with a packaged or a custom config file. This step will also deploy the model on the EC2 instance. The --write-bucket parameter value is just a placeholder and an actual S3 bucket is not required. You could set the --tmp-dir flag to an EFA path instead of /tmp if using a shared path for storing config files and reports.

          fmbench --config-file /tmp/fmbench-read/configs/llama3/8b/config-llama3-8b-trn1-32xlarge-triton-vllm.yml --local-mode yes --write-bucket placeholder --tmp-dir /tmp > fmbench.log 2>&1\n
        7. Open a new Terminal and do a tail on fmbench.log to see a live log of the run.

          tail -f fmbench.log\n
        8. All metrics are stored in the /tmp/fmbench-write directory created automatically by the fmbench package. Once the run completes all files are copied locally in a results-* folder as usual.

        "},{"location":"benchmarking_on_ec2.html#benchmarking-on-an-cpu-instance-type-with-amd-processors","title":"Benchmarking on an CPU instance type with AMD processors","text":"

        As of 2024-08-27 this has been tested on a m7a.16xlarge instance

        1. Connect to your instance using any of the options in EC2 (SSH/EC2 Connect), run the following in the EC2 terminal. This command installs Anaconda on the instance which is then used to create a new conda environment for FMBench. See instructions for downloading anaconda here

          # Install Docker and Git using the YUM package manager\nsudo yum install docker git -y\n\n# Start the Docker service\nsudo systemctl start docker\n\n# Download the Miniconda installer for Linux\nwget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh\nbash Miniconda3-latest-Linux-x86_64.sh -b  # Run the Miniconda installer in batch mode (no manual intervention)\nrm -f Miniconda3-latest-Linux-x86_64.sh    # Remove the installer script after installation\neval \"$(/home/$USER/miniconda3/bin/conda shell.bash hook)\" # Initialize conda for bash shell\nconda init  # Initialize conda, adding it to the shell\n
        2. Setup the fmbench_python311 conda environment.

          # Create a new conda environment named 'fmbench_python311' with Python 3.11 and ipykernel\nconda create --name fmbench_python311 -y python=3.11 ipykernel\n\n# Activate the newly created conda environment\nsource activate fmbench_python311\n\n# Upgrade pip and install the fmbench package\npip install -U fmbench\n
        3. Build the vllm container for serving the model.

          1. \ud83d\udc49 The vllm container we are building locally is going to be referenced in the FMBench config file.

          2. The container being built is for CPU only (GPU support might be added in the future).

            # Clone the vLLM project repository from GitHub\ngit clone https://github.com/vllm-project/vllm.git\n\n# Change the directory to the cloned vLLM project\ncd vllm\n\n# Build a Docker image using the provided Dockerfile for CPU, with a shared memory size of 4GB\nsudo docker build -f Dockerfile.cpu -t vllm-cpu-env --shm-size=4g .\n
        4. Create local directory structure needed for FMBench and copy all publicly available dependencies from the AWS S3 bucket for FMBench. This is done by running the copy_s3_content.sh script available as part of the FMBench repo. Replace /tmp in the command below with a different path if you want to store the config files and the FMBench generated data in a different directory.

          curl -s https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/copy_s3_content.sh | sh -s -- /tmp\n
        5. To download the model files from HuggingFace, create a hf_token.txt file in the /tmp/fmbench-read/scripts/ directory containing the Hugging Face token you would like to use. In the command below replace the hf_yourtokenstring with your Hugging Face token.

          echo hf_yourtokenstring > /tmp/fmbench-read/scripts/hf_token.txt\n
        6. Before running FMBench, add the current user to the docker group. Run the following commands to run Docker without needing to use sudo each time.

          sudo usermod -a -G docker $USER\nnewgrp docker\n
        7. Install docker-compose.

          DOCKER_CONFIG=${DOCKER_CONFIG:-$HOME/.docker}\nmkdir -p $DOCKER_CONFIG/cli-plugins\nsudo curl -L https://github.com/docker/compose/releases/latest/download/docker-compose-$(uname -s)-$(uname -m) -o $DOCKER_CONFIG/cli-plugins/docker-compose\nsudo chmod +x $DOCKER_CONFIG/cli-plugins/docker-compose\ndocker compose version\n
        8. Run FMBench with a packaged or a custom config file. This step will also deploy the model on the EC2 instance. The --write-bucket parameter value is just a placeholder and an actual S3 bucket is not required. You could set the --tmp-dir flag to an EFA path instead of /tmp if using a shared path for storing config files and reports.

          fmbench --config-file /tmp/fmbench-read/configs/llama3/8b/config-ec2-llama3-8b-m7a-16xlarge.yml --local-mode yes --write-bucket placeholder --tmp-dir /tmp > fmbench.log 2>&1\n
        9. Open a new Terminal and do a tail on fmbench.log to see a live log of the run.

          tail -f fmbench.log\n
        10. All metrics are stored in the /tmp/fmbench-write directory created automatically by the fmbench package. Once the run completes all files are copied locally in a results-* folder as usual.

        "},{"location":"benchmarking_on_ec2.html#benchmarking-on-an-cpu-instance-type-with-intel-processors","title":"Benchmarking on an CPU instance type with Intel processors","text":"

        As of 2024-08-27 this has been tested on c5.18xlarge and m5.16xlarge instances

        1. Connect to your instance using any of the options in EC2 (SSH/EC2 Connect), run the following in the EC2 terminal. This command installs Anaconda on the instance which is then used to create a new conda environment for FMBench. See instructions for downloading anaconda here

          # Install Docker and Git using the YUM package manager\nsudo yum install docker git -y\n\n# Start the Docker service\nsudo systemctl start docker\n\n# Download the Miniconda installer for Linux\nwget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh\nbash Miniconda3-latest-Linux-x86_64.sh -b # Run the Miniconda installer in batch mode (no manual intervention)\nrm -f Miniconda3-latest-Linux-x86_64.sh    # Remove the installer script after installation\neval \"$(/home/$USER/miniconda3/bin/conda shell.bash hook)\" # Initialize conda for bash shell\nconda init  # Initialize conda, adding it to the shell\n
        2. Setup the fmbench_python311 conda environment.

          # Create a new conda environment named 'fmbench_python311' with Python 3.11 and ipykernel\nconda create --name fmbench_python311 -y python=3.11 ipykernel\n\n# Activate the newly created conda environment\nsource activate fmbench_python311\n\n# Upgrade pip and install the fmbench package\npip install -U fmbench\n
        3. Build the vllm container for serving the model.

          1. \ud83d\udc49 The vllm container we are building locally is going to be referenced in the FMBench config file.

          2. The container being built is for CPU only (GPU support might be added in the future).

            # Clone the vLLM project repository from GitHub\ngit clone https://github.com/vllm-project/vllm.git\n\n# Change the directory to the cloned vLLM project\ncd vllm\n\n# Build a Docker image using the provided Dockerfile for CPU, with a shared memory size of 12GB\nsudo docker build -f Dockerfile.cpu -t vllm-cpu-env --shm-size=12g .\n
        4. Create local directory structure needed for FMBench and copy all publicly available dependencies from the AWS S3 bucket for FMBench. This is done by running the copy_s3_content.sh script available as part of the FMBench repo. Replace /tmp in the command below with a different path if you want to store the config files and the FMBench generated data in a different directory.

          curl -s https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/copy_s3_content.sh | sh -s -- /tmp\n
        5. To download the model files from HuggingFace, create a hf_token.txt file in the /tmp/fmbench-read/scripts/ directory containing the Hugging Face token you would like to use. In the command below replace the hf_yourtokenstring with your Hugging Face token.

          echo hf_yourtokenstring > /tmp/fmbench-read/scripts/hf_token.txt\n
        6. Before running FMBench, add the current user to the docker group. Run the following commands to run Docker without needing to use sudo each time.

          sudo usermod -a -G docker $USER\nnewgrp docker\n
        7. Install docker-compose.

          DOCKER_CONFIG=${DOCKER_CONFIG:-$HOME/.docker}\nmkdir -p $DOCKER_CONFIG/cli-plugins\nsudo curl -L https://github.com/docker/compose/releases/latest/download/docker-compose-$(uname -s)-$(uname -m) -o $DOCKER_CONFIG/cli-plugins/docker-compose\nsudo chmod +x $DOCKER_CONFIG/cli-plugins/docker-compose\ndocker compose version\n
        8. Run FMBench with a packaged or a custom config file. This step will also deploy the model on the EC2 instance. The --write-bucket parameter value is just a placeholder and an actual S3 bucket is not required. You could set the --tmp-dir flag to an EFA path instead of /tmp if using a shared path for storing config files and reports.

          fmbench --config-file /tmp/fmbench-read/configs/llama3/8b/config-ec2-llama3-8b-c5-18xlarge.yml --local-mode yes --write-bucket placeholder --tmp-dir /tmp > fmbench.log 2>&1\n
        9. Open a new Terminal and do a tail on fmbench.log to see a live log of the run.

          tail -f fmbench.log\n
        10. All metrics are stored in the /tmp/fmbench-write directory created automatically by the fmbench package. Once the run completes all files are copied locally in a results-* folder as usual.

        "},{"location":"benchmarking_on_eks.html","title":"Benchmark models on EKS","text":"

        You can use FMBench to benchmark models hosted on EKS. This can be done in one of two ways:

        • Deploy the model on your EKS cluster independently of FMBench and then benchmark it through the Bring your own endpoint mode.
        • Deploy the model on your EKS cluster through FMBench and then benchmark it.

        The steps for deploying the model on your EKS cluster are described below.

        \ud83d\udc49 EKS cluster creation itself is not a part of the FMBench functionality; the cluster needs to exist before you run the following steps. Steps for cluster creation are provided in this file but it would be best to consult the DoEKS repo on GitHub for comprehensive instructions.

        1. Add the following IAM policies to your existing FMBench Role:

          1. AmazonEKSClusterPolicy: This policy provides Kubernetes the permissions it requires to manage resources on your behalf.

          2. AmazonEKS_CNI_Policy: This policy provides the Amazon VPC CNI Plugin (amazon-vpc-cni-k8s) the permissions it requires to modify the IP address configuration on your EKS worker nodes. This permission set allows the CNI to list, describe, and modify Elastic Network Interfaces on your behalf.

          3. AmazonEKSWorkerNodePolicy: This policy allows Amazon EKS worker nodes to connect to Amazon EKS Clusters.

        2. Once the EKS cluster is available you can either use the following two config files or create your own, using these files as examples, to run benchmarking for these models. These config files require that the EKS cluster has been created as per the steps in these instructions.

          1. config-llama3-8b-eks-inf2.yml: Deploy Llama3 on Trn1/Inf2 instances.

          2. config-mistral-7b-eks-inf2.yml: Deploy Mistral 7b on Trn1/Inf2 instances.

          For more information about the blueprints used by FMBench to deploy these models, view: DoEKS docs gen-ai.

        3. Run the Llama3-8b benchmarking using the command below (replace the config file as needed for a different model). This will first deploy the model on your EKS cluster and then run benchmarking on the deployed model.

          fmbench --config-file https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/src/fmbench/configs/llama3/8b/config-llama3-8b-eks-inf2.yml > fmbench.log 2>&1\n
        4. As the model is getting deployed you might want to run the following kubectl commands to monitor the deployment progress. Set the model_namespace to llama3 or mistral or a different model as appropriate.

          1. kubectl get pods -n <model_namespace> -w: Watch the pods in the model specific namespace.
          2. kubectl -n karpenter get pods: Get the pods in the karpenter namespace.
          3. kubectl describe pod -n <model_namespace> <pod-name>: Describe a specific pod in the model specific namespace to view the live logs.
        "},{"location":"benchmarking_on_sagemaker.html","title":"Benchmark models on SageMaker","text":"

        Choose any config file from the model specific folders, for example the Llama3 folder for the Llama3 family of models. These configuration files also include instructions for FMBench to first deploy the model on SageMaker using your configured instance type and inference parameters of choice and then run the benchmarking. Here is an example for benchmarking the Llama3-8b model on ml.inf2.24xlarge and ml.g5.12xlarge instances.

        fmbench --config-file https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/src/fmbench/configs/llama3/8b/config-llama3-8b-inf2-g5.yml > fmbench.log 2>&1\n
        "},{"location":"build.html","title":"Building the FMBench Python package","text":"

        If you would like to build a dev version of FMBench for your own development and testing purposes, the following steps describe how to do that.

        1. Clone the FMBench repo from GitHub.

        2. Make any code changes as needed.

        3. Install poetry.

          pip install poetry mkdocs-material mknotebooks\n
        4. Change directory to the FMBench repo directory and run poetry build.

          poetry build\n
        5. The .whl file is generated in the dist folder. Install the .whl in your current Python environment.

          pip install dist/fmbench-X.Y.Z-py3-none-any.whl\n
        6. Run FMBench as usual through the FMBench CLI command.

        7. You may have added new config files as part of your work; to make sure these files are called out in the manifest.txt, run the following command. This command will overwrite the existing manifest.txt and manifest.md files. Both of these files need to be committed to the repo. Reach out to the maintainers of this repo so that they can add new or modified config files to the blogs bucket (the CloudFormation stack would fail if a new file is added to the manifest but is not available for download through the S3 bucket).

          python create_manifest.py\n
        8. To create updated documentation run the following command. You need to be added as a contributor to the FMBench repo to be able to publish to the website; this command will not work otherwise.

          mkdocs gh-deploy\n
        "},{"location":"byo_dataset.html","title":"Bring your own dataset","text":"

        By default FMBench uses the LongBench dataset for testing the models, but this is not the only dataset you can test with. You may want to test with other datasets available on HuggingFace or use your own datasets for testing. You can do this by converting your dataset to the JSON lines format. We provide a code sample for converting any HuggingFace dataset into JSON lines format and uploading it to the S3 bucket used by FMBench in the bring_your_own_dataset notebook. Follow the steps described in the notebook to bring your own dataset for testing with FMBench.
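        As a rough illustration of what the notebook does, the sketch below converts a slice of a Hugging Face dataset to JSON lines and uploads it to the FMBench read bucket. The dataset id, slice size, bucket name and prefix are assumptions; the bring_your_own_dataset notebook remains the authoritative reference.

```python
# Sketch only -- the bring_your_own_dataset notebook is the authoritative
# version. Dataset id, slice, bucket and prefix below are assumptions.
import boto3
from datasets import load_dataset

ds = load_dataset("Open-Orca/OpenOrca", split="train[:1000]")
local_file = "openorca.jsonl"
ds.to_json(local_file, lines=True)  # one JSON object per line

s3 = boto3.client("s3")
s3.upload_file(local_file, "<read-bucket-name>", f"source_data/{local_file}")
```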

        "},{"location":"byo_dataset.html#support-for-open-orca-dataset","title":"Support for Open-Orca dataset","text":"

        For support for the Open-Orca dataset and the corresponding prompts for Llama3, Llama2 and Mistral, see:

        1. bring_your_own_dataset.ipynb
        2. prompt templates
        3. Llama3 config file with OpenOrca
        "},{"location":"byo_rest_predictor.html","title":"Bring your own REST Predictor (data-on-eks version)","text":"

        FMBench now provides an example of bringing your own endpoint as a REST Predictor for benchmarking. View this script as an example. This script is an inference file for the NousResearch/Llama-2-13b-chat-hf model deployed on an Amazon EKS cluster using Ray Serve. The model is deployed via data-on-eks which is a comprehensive resource for scaling your data and machine learning workloads on Amazon EKS and unlocking the power of Gen AI. Using data-on-eks, you can harness the capabilities of AWS Trainium, AWS Inferentia and NVIDIA GPUs to scale and optimize your Gen AI workloads and benchmark those models on FMBench with ease.

        "},{"location":"byoe.html","title":"Bring your own endpoint (a.k.a. support for external endpoints)","text":"

        If you have an endpoint deployed on say Amazon EKS or Amazon EC2 or have your models hosted on a fully-managed service such as Amazon Bedrock, you can still bring your endpoint to FMBench and run tests against your endpoint. To do this you need to do the following:

        1. Create a derived class from the FMBenchPredictor abstract class and provide an implementation for the constructor, the get_predictions method and the endpoint_name property. See SageMakerPredictor for an example. Save this file locally as, say, my_custom_predictor.py (a minimal sketch is shown after this list).

        2. Upload your new Python file (my_custom_predictor.py) for your custom FMBench predictor to your FMBench read bucket and the scripts prefix specified in the s3_read_data section (read_bucket and scripts_prefix).

        3. Edit the configuration file you are using for your FMBench for the following:

          • Skip the deployment step by setting the 2_deploy_model.ipynb step under run_steps to no.
          • Set the inference_script under any experiment in the experiments section for which you want to use your new custom inference script to point to your new Python file (my_custom_predictor.py) that contains your custom predictor.
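        To make step 1 concrete, here is a minimal sketch of what my_custom_predictor.py could look like for a REST endpoint listening on port 8080. The import path, constructor arguments, payload fields and return format are assumptions; use SageMakerPredictor in the FMBench repo as the authoritative reference for the interface.

```python
# my_custom_predictor.py -- hypothetical sketch. The module path of
# FMBenchPredictor, the constructor arguments and the return format are
# assumptions; see SageMakerPredictor in the FMBench repo for the real interface.
import requests

from fmbench.scripts.fmbench_predictor import FMBenchPredictor  # assumed import path


class MyCustomPredictor(FMBenchPredictor):
    def __init__(self, endpoint_name: str, inference_spec: dict | None = None):
        self._endpoint_name = endpoint_name
        self._inference_spec = inference_spec or {}

    def get_predictions(self, payload: dict) -> dict:
        """Send the prompt to the REST endpoint and return the generated text."""
        response = requests.post(
            f"http://{self._endpoint_name}:8080/generate",
            json={"inputs": payload["inputs"], **self._inference_spec},
            timeout=300,
        )
        response.raise_for_status()
        return {"generated_text": response.json().get("generated_text", "")}

    @property
    def endpoint_name(self) -> str:
        return self._endpoint_name
```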
        "},{"location":"ec2.html","title":"Run FMBench on Amazon EC2","text":"

        For some enterprise scenarios it might be desirable to run FMBench directly on an EC2 instance with no dependency on S3. Here are the steps to do this:

        1. Have a t3.xlarge (or larger) instance in the Running state. Make sure that the instance has at least 50GB of disk space and that the IAM role associated with your EC2 instance has the AmazonSageMakerFullAccess policy associated with it and sagemaker.amazonaws.com added to its Trust relationships.

          {\n    \"Effect\": \"Allow\",\n    \"Principal\": {\n        \"Service\": \"sagemaker.amazonaws.com\"\n    },\n    \"Action\": \"sts:AssumeRole\"\n}\n

        2. Setup the fmbench_python311 conda environment. This step requires conda to be installed on the EC2 instance, see instructions for downloading Anaconda.

          conda create --name fmbench_python311 -y python=3.11 ipykernel\nsource activate fmbench_python311;\npip install -U fmbench\n
        3. Create local directory structure needed for FMBench and copy all publicly available dependencies from the AWS S3 bucket for FMBench. This is done by running the copy_s3_content.sh script available as part of the FMBench repo.

          curl -s https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/copy_s3_content.sh | sh\n
        4. Run FMBench with a quickstart config file.

          fmbench --config-file /tmp/fmbench-read/configs/llama2/7b/config-llama2-7b-g5-quick.yml --local-mode yes > fmbench.log 2>&1\n
        5. Open a new Terminal and navigate to the foundation-model-benchmarking-tool directory and do a tail on fmbench.log to see a live log of the run.

          tail -f fmbench.log\n
        6. All metrics are stored in the /tmp/fmbench-write directory created automatically by the fmbench package. Once the run completes all files are copied locally in a results-* folder as usual.

        "},{"location":"features.html","title":"FMBench features","text":"

        Support for Model Evaluations: FMBench now adds support for evaluating candidate models using Majority Voting with a Panel of LLM Evaluators. Customers can now use FMBench to evaluate model accuracy across open-source and custom datasets, thus FMBench now enables customers to not only measure performance (inference latency, cost, throughput) but also model accuracy.

        Native support for LLM compilation and deployment on AWS Silicon: FMBench now supports end-to-end compilation and model deployment on AWS Silicon. Customers no longer have to wait for models to be available for AWS Chips via SageMaker JumpStart and neither do they have to go through the process of compiling the model to Neuron themselves, FMBench does it all for them. We can simply put the relevant configuration options in the FMBench config file and it will compile and deploy the model on SageMaker (config) or EC2 (config).

        Website for better user experience: FMBench has a website now along with an introduction video. The website is fully searchable to ease common tasks such as installation, finding the right config file, benchmarking on various hosting platforms (EC2, EKS, Bedrock, Neuron, Docker), model evaluation, etc. This website was created based on feedback from several internal teams and external customers.

        Native support for all AWS generative AI services: FMBench now benchmarks and evaluates any Foundation Model (FM) deployed on any AWS Generative AI service, be it Amazon SageMaker, Amazon Bedrock, Amazon EKS, or Amazon EC2. We initially built FMBench for SageMaker, and later extended it to Bedrock and then based on customer requests extended it to support models on EKS and EC2 as well. See list of config files supported out of the box, you can use these config files either as is or as templates for creating your own custom config.

        "},{"location":"gettingstarted.html","title":"Getting started with FMBench","text":"

        FMBench is available as a Python package on PyPi and is run as a command line tool once it is installed. All data that includes metrics, reports and results are stored in an Amazon S3 bucket.

        While technically you can run FMBench on any AWS compute, practically speaking we either run it on a SageMaker Notebook or on EC2. Both of these options are described below.

        Intro Video

        "},{"location":"gettingstarted.html#fmbench-in-a-client-server-configuration-on-amazon-ec2","title":"FMBench in a client-server configuration on Amazon EC2","text":"

        Oftentimes a platform team would like to have a set of LLM endpoints deployed in an account available permanently for data science teams or application teams to benchmark performance and accuracy for their specific use-case. They can take advantage of a special client-server configuration for FMBench where it can be used to deploy models on EC2 instances in one AWS account (called the server account) and run tests against these endpoints from FMBench deployed on EC2 instances in another AWS account (called the client AWS account).

        This has the advantage that every team that wants to benchmark a set of LLMs does not first have to deploy the LLMs; a platform team can do that for them and have these LLMs available for a longer duration as these teams do their benchmarking, for example for their specific datasets and their specific cost and performance criteria. Using FMBench in this way makes the process simpler for both teams: the platform team can use FMBench to easily deploy the models with full control over the configuration of the serving stack without having to write any LLM deployment code for EC2, and the data science teams or application teams can test with different datasets, performance criteria and inference parameters. As long as the security groups have an inbound rule to allow access to the model endpoint (typically TCP port 8080), an FMBench installation in the client AWS account should be able to access an endpoint in the server AWS account.
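        As a quick way to confirm the networking is in place before starting a run, you can check from the client account that the endpoint port is reachable; the private IP below is hypothetical and 8080 is the typical endpoint port mentioned above.

```python
# Hypothetical reachability check run from the client AWS account; replace the
# host with the private IP (or DNS name) of the serving instance in the server account.
import socket

host, port = "10.0.1.25", 8080
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
    sock.settimeout(5)
    reachable = sock.connect_ex((host, port)) == 0
print("endpoint port reachable" if reachable else "endpoint port NOT reachable")
```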

        "},{"location":"manifest.html","title":"Files","text":"

        Here is a listing of the various configuration files available out-of-the-box with FMBench. Click on any link to view a file. You can use these files as-is or use them as templates to create a custom configuration file for your use-case of interest.

        bedrock \u251c\u2500\u2500 bedrock/config-bedrock-all-anthropic-models-longbench-data.yml \u251c\u2500\u2500 bedrock/config-bedrock-anthropic-models-OpenOrca.yml \u251c\u2500\u2500 bedrock/config-bedrock-claude.yml \u251c\u2500\u2500 bedrock/config-bedrock-evals-only-conc-1.yml \u251c\u2500\u2500 bedrock/config-bedrock-haiku-sonnet-majority-voting.yml \u251c\u2500\u2500 bedrock/config-bedrock-llama3-1-70b-streaming.yml \u251c\u2500\u2500 bedrock/config-bedrock-llama3-1-8b-streaming.yml \u251c\u2500\u2500 bedrock/config-bedrock-llama3-1-no-streaming.yml \u251c\u2500\u2500 bedrock/config-bedrock-llama3-streaming.yml \u251c\u2500\u2500 bedrock/config-bedrock-models-OpenOrca.yml \u251c\u2500\u2500 bedrock/config-bedrock-titan-text-express.yml \u2514\u2500\u2500 bedrock/config-bedrock.yml bert \u2514\u2500\u2500 bert/config-distilbert-base-uncased.yml byoe \u2514\u2500\u2500 byoe/config-model-byo-sagemaker-endpoint.yml eks_manifests \u251c\u2500\u2500 eks_manifests/llama3-ray-service.yaml \u2514\u2500\u2500 eks_manifests/mistral-ray-service.yaml gemma \u2514\u2500\u2500 gemma/config-gemma-2b-g5.yml llama2 \u251c\u2500\u2500 llama2/13b \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama2/13b/config-bedrock-sagemaker-llama2.yml \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama2/13b/config-byo-rest-ep-llama2-13b.yml \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama2/13b/config-llama2-13b-inf2-g5-p4d.yml \u2502\u00a0\u00a0 \u2514\u2500\u2500 llama2/13b/config-llama2-13b-inf2-g5.yml \u251c\u2500\u2500 llama2/70b \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama2/70b/config-ec2-llama2-70b.yml \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama2/70b/config-llama2-70b-g5-p4d-tgi.yml \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama2/70b/config-llama2-70b-g5-p4d-trt.yml \u2502\u00a0\u00a0 \u2514\u2500\u2500 llama2/70b/config-llama2-70b-inf2-g5.yml \u2514\u2500\u2500 llama2/7b \u251c\u2500\u2500 llama2/7b/config-llama2-7b-byo-sagemaker-endpoint.yml \u251c\u2500\u2500 llama2/7b/config-llama2-7b-g4dn-g5-trt.yml \u251c\u2500\u2500 llama2/7b/config-llama2-7b-g5-no-s3-quick.yml \u251c\u2500\u2500 llama2/7b/config-llama2-7b-g5-quick.yml \u2514\u2500\u2500 llama2/7b/config-llama2-7b-inf2-g5.yml llama3 \u251c\u2500\u2500 llama3/70b \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama3/70b/config-bedrock.yml -> ../../bedrock/config-bedrock.yml \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama3/70b/config-ec2-llama3-70b-instruct.yml \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama3/70b/config-ec2-neuron-llama3-70b-inf2-48xl-deploy-sm.yml \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama3/70b/config-llama3-70b-instruct-g5-48xl.yml \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama3/70b/config-llama3-70b-instruct-g5-p4d.yml \u2502\u00a0\u00a0 \u2514\u2500\u2500 llama3/70b/config-llama3-70b-instruct-p4d.yml \u2514\u2500\u2500 llama3/8b \u251c\u2500\u2500 llama3/8b/config-bedrock.yml \u251c\u2500\u2500 llama3/8b/config-ec2-llama3-8b-c5-18xlarge.yml \u251c\u2500\u2500 llama3/8b/config-ec2-llama3-8b-inf2-48xl.yml \u251c\u2500\u2500 llama3/8b/config-ec2-llama3-8b-m5-16xlarge.yml \u251c\u2500\u2500 llama3/8b/config-ec2-llama3-8b-m7a-16xlarge.yml \u251c\u2500\u2500 llama3/8b/config-ec2-llama3-8b-m7a-24xlarge.yml \u251c\u2500\u2500 llama3/8b/config-ec2-llama3-8b-m7i-12xlarge.yml \u251c\u2500\u2500 llama3/8b/config-ec2-llama3-8b-neuron-trn1-32xl-tp16-sm.yml \u251c\u2500\u2500 config-llama3-8b-trn1-32xl-tp16-bs-4-ec2.yml \u251c\u2500\u2500 llama3/8b/config-ec2-llama3-8b.yml \u251c\u2500\u2500 llama3/8b/config-ec2-neuron-llama3-8b-inf2-24xl-deploy-sm.yml \u251c\u2500\u2500 
llama3/8b/config-ec2-neuron-llama3-8b-inf2-48xl-deploy-sm.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-eks-inf2.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-g5-streaming.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-g5.12xl-tp-2-mc-max-djl-ec2.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-g5.12xl-tp-2-mc-max-triton-ec2.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-g5.12xl-tp-4-mc-max-djl-ec2.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-g5.12xl-tp-4-mc-max-triton-ec2.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-inf2-24xl-tp=8-bs=4-byoe.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-inf2-48xl-tp=8-bs=4-byoe.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-inf2-g5-byoe-w-openorca.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-inf2-g5.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-all.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-g5-12xl-4-instances.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-g5-12xl.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-g5-24xl.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-g5-2xl.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-g5-48xl.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-g5-p4d.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-g6-12xl.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-g6-24xl.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-g6-48xl.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-p4d-djl-lmi-dist.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-p4d-djl-vllm.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-instruct-p5-djl-lmi-dist.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-trn1-32xl-tp-16-bs-4-byoe.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-trn1-32xl-tp-8-bs-4-byoe.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-trn1-32xlarge-triton-vllm.yml \u251c\u2500\u2500 llama3/8b/config-llama3-8b-trn1.yml \u251c\u2500\u2500 llama3/8b/llama3-8b-inf2-24xl-byoe-g5-12xl.yml \u251c\u2500\u2500 llama3/8b/llama3-8b-inf2-48xl-byoe-g5-24xl.yml \u2514\u2500\u2500 llama3/8b/llama3-8b-trn1-32xl-byoe-g5-24xl.yml llama3.1 \u251c\u2500\u2500 llama3.1/70b \u2502\u00a0\u00a0 \u251c\u2500\u2500 llama3.1/70b/config-ec2-llama3-1-70b-inf2-48xl-deploy-ec2.yml \u2502\u00a0\u00a0 \u2514\u2500\u2500 llama3.1/70b/config-ec2-llama3-1-70b-inf2-deploy-sm.yml \u2514\u2500\u2500 llama3.1/8b \u251c\u2500\u2500 llama3.1/8b/client-config-ec2-llama3-1-8b.yml \u251c\u2500\u2500 llama3.1/8b/config-ec2-llama3-1-8b-inf2-48xl-deploy-ec2-tp24-bs12.yml \u251c\u2500\u2500 llama3.1/8b/config-ec2-llama3-1-8b-inf2-48xl-deploy-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-ec2-llama3-1-8b-inf2.yml \u251c\u2500\u2500 llama3.1/8b/config-ec2-llama3-1-8b-p4-tp-2-mc-max.yml \u251c\u2500\u2500 llama3.1/8b/config-ec2-llama3-1-8b-p4-tp-4-mc-max.yml \u251c\u2500\u2500 llama3.1/8b/config-ec2-llama3-1-8b-p4-tp-8-mc-max.yml \u251c\u2500\u2500 llama3.1/8b/config-ec2-llama3-1-8b-p5-tp-2-mc-max.yml \u251c\u2500\u2500 llama3.1/8b/config-ec2-llama3-1-8b-tp-8-mc-auto-p5.yml \u251c\u2500\u2500 llama3.1/8b/config-ec2-llama3-1-8b-trn1-32xl-deploy-ec2-tp32-bs8.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.12xl-tp-2-mc-auto-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.12xl-tp-2-mc-max-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.12xl-tp-4-mc-max-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.24xl-tp-2-mc-auto-ec2.yml 
\u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.24xl-tp-2-mc-max-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.24xl-tp-4-mc-max-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.2xl-tp-1-mc-max-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.48xl-tp-2-mc-auto-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.48xl-tp-2-mc-max-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.48xl-tp-4-mc-max-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.48xl-tp-8-mc-max-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-g5.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-inf2-48xl-deploy-tp-24-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-inf2-48xl-deploy-tp-8-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-trn1-32xl-deploy-tp-8-ec2.yml \u251c\u2500\u2500 llama3.1/8b/config-llama3.1-8b-trn32xl-triton-vllm.yml \u2514\u2500\u2500 llama3.1/8b/server-config-ec2-llama3-1-8b-inf2-48xl-deploy-ec2.yml mistral \u251c\u2500\u2500 mistral/config-mistral-7b-eks-inf2.yml \u251c\u2500\u2500 mistral/config-mistral-7b-tgi-g5.yml \u251c\u2500\u2500 mistral/config-mistral-7b-trn1-32xl-triton.yml \u251c\u2500\u2500 mistral/config-mistral-instruct-AWQ-p4d.yml \u251c\u2500\u2500 mistral/config-mistral-instruct-AWQ-p5-byo-ep.yml \u251c\u2500\u2500 mistral/config-mistral-instruct-AWQ-p5.yml \u251c\u2500\u2500 mistral/config-mistral-instruct-p4d.yml \u251c\u2500\u2500 mistral/config-mistral-instruct-v1-p5-trtllm.yml \u251c\u2500\u2500 mistral/config-mistral-instruct-v2-p4d-lmi-dist.yml \u251c\u2500\u2500 mistral/config-mistral-instruct-v2-p4d-trtllm.yml \u251c\u2500\u2500 mistral/config-mistral-instruct-v2-p5-lmi-dist.yml \u251c\u2500\u2500 mistral/config-mistral-instruct-v2-p5-trtllm.yml \u251c\u2500\u2500 mistral/config-mistral-trn1-32xl-deploy-ec2-tp32.yml \u2514\u2500\u2500 mistral/config-mistral-v3-inf2-48xl-deploy-ec2-tp24.yml model_eval_all_info.yml phi \u2514\u2500\u2500 phi/config-phi-3-g5.yml pricing.yml

        "},{"location":"mm_copies.html","title":"Running multiple model copies on Amazon EC2","text":"

        It is possible to run multiple copies of a model if the tensor parallelism degree and the number of GPUs/Neuron cores on the instance allow it. For example, if a model can fit into 2 GPU devices and there are 8 devices available, then we could run 4 copies of the model on that instance. Some inference containers, such as the DJL Serving LMI, automatically start multiple copies of the model within the same inference container for the scenario described in the example above. However, it is also possible to do this ourselves by running multiple containers and a load balancer through a Docker compose file. FMBench now supports this functionality by adding a single parameter called model_copies in the configuration file.

        For example, here is a snippet from the config-ec2-llama3-1-8b-p4-tp-2-mc-max config file. The new parameters are model_copies, tp_degree and shm_size in the inference_spec section. Note that the tp_degree in the inference_spec and option.tensor_parallel_degree in the serving.properties section should be set to the same value.

            inference_spec:\n      # this should match one of the sections in the inference_parameters section above\n      parameter_set: ec2_djl\n      # how many copies of the model, \"1\", \"2\",..max\n      # set to 1 in the code if not configured,\n      # max: FMBench figures out the max number of model containers to be run\n      #      based on TP degree configured and number of neuron cores/GPUs available.\n      #      For example, if TP=2, GPUs=8 then FMBench will start 4 containers and 1 load balancer,\n      # auto: only supported if the underlying inference container would automatically \n      #       start multiple copies of the model internally based on TP degree and neuron cores/GPUs\n      #       available. In this case only a single container is created, no load balancer is created.\n      #       The DJL serving containers supports auto.  \n      model_copies: max\n      # if you set the model_copies parameter then it is mandatory to set the \n      # tp_degree, shm_size, model_loading_timeout parameters\n      tp_degree: 2\n      shm_size: 12g\n      model_loading_timeout: 2400\n    # modify the serving properties to match your model and requirements\n    serving.properties: |\n      engine=MPI\n      option.tensor_parallel_degree=2\n      option.max_rolling_batch_size=256\n      option.model_id=meta-llama/Meta-Llama-3.1-8B-Instruct\n      option.rolling_batch=lmi-dist\n
        "},{"location":"mm_copies.html#considerations-while-setting-the-model_copies-parameter","title":"Considerations while setting the model_copies parameter","text":"
        1. The model_copies parameter is an EC2-only parameter, which means that you cannot use it when deploying models on SageMaker, for example.

        2. If you are looking for the best (lowest) inference latency, you might get better results by setting tp_degree and option.tensor_parallel_degree to the total number of GPUs/Neuron cores available on your EC2 instance and model_copies to max, auto or 1. In other words, the model is sharded across all accelerators and only one copy of the model can run on that instance (therefore setting model_copies to max, auto or 1 all result in the same thing, i.e. a single copy of the model running on that EC2 instance).

        3. If you are looking for the best (highest) transaction throughput while keeping inference latency within a given latency budget, you might want to configure tp_degree and option.tensor_parallel_degree to the smallest number of GPUs/Neuron cores on which the model can run (for example, for Llama3.1-8b that would be 2 GPUs or 4 Neuron cores) and set model_copies to max. Let us understand this with an example: say you want to run Llama3.1-8b on a p4de.24xlarge instance type and you set tp_degree and option.tensor_parallel_degree to 2 and model_copies to max. FMBench will start 4 containers (as the p4de.24xlarge has 8 GPUs) and an Nginx load balancer that round-robins the incoming requests to these 4 containers. In the case of the DJL serving LMI you can achieve similar results by setting model_copies to auto, in which case FMBench will start a single container (and no load balancer, since there is only one container) and the DJL serving container will internally start 4 copies of the model within the same container and route requests to these copies internally. Theoretically you should expect the same performance, but in our testing we have seen better performance with model_copies set to max and an external (Nginx) container doing the load balancing.
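        The arithmetic behind the max setting in this example can be sanity-checked with the trivial sketch below; this is not FMBench code, just the calculation it performs, using the values from the example above.

          # Not FMBench code, only the arithmetic behind "model_copies: max" for this example.
          NUM_ACCELERATORS=8   # GPUs on a p4de.24xlarge
          TP_DEGREE=2          # tensor parallel degree set in the config file
          echo "model copies started by FMBench: $(( NUM_ACCELERATORS / TP_DEGREE ))"   # prints 4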

        "},{"location":"neuron.html","title":"Benchmark foundation models for AWS Chips","text":"

        You can use FMBench for benchmarking foundation models on AWS Chips: Trainium 1 and Inferentia 2. This can be done on Amazon SageMaker, Amazon EKS or Amazon EC2. FMs need to be compiled for Neuron before they can be deployed on AWS Chips; this is made easier by SageMaker JumpStart, which provides many FMs as JumpStart Models that can be deployed on SageMaker directly. You can also compile models for Neuron yourself or have FMBench do it for you. All of these options are described below.

        "},{"location":"neuron.html#benchmarking-for-aws-chips-on-sagemaker","title":"Benchmarking for AWS Chips on SageMaker","text":"
        1. Several FMs are available through SageMaker JumpStart already compiled for Neuron and ready to deploy. See this link for more details.

        2. You can compile the model outside of FMBench using instructions available here and in the Neuron documentation, deploy it on SageMaker and use FMBench in the bring-your-own-endpoint mode; see this config file for an example.

        3. You can have FMBench compile and deploy the model on SageMaker for you. See this Llama3-8b config file or this Llama3.1-70b config file for examples. Search this website for \"inf2\" or \"trn1\" to find other examples. In this case FMBench will download the model from Hugging Face (you need to provide your HuggingFace token in the /tmp/fmbench-read/scripts/hf_token.txt file; the file simply contains the token without any formatting), compile it for Neuron, upload the compiled model to S3 (you specify the bucket in the config file) and then deploy the model to a SageMaker endpoint.
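        For example, the token file can be created as shown below; the token value is a placeholder, the path is the one FMBench expects.

          # Place your Hugging Face token (just the token, no quotes or extra formatting)
          # where FMBench expects it; hf_xxxx... is a placeholder for your own token.
          echo "hf_xxxxxxxxxxxxxxxxxxxx" > /tmp/fmbench-read/scripts/hf_token.txt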

        "},{"location":"neuron.html#benchmarking-for-aws-chips-on-ec2","title":"Benchmarking for AWS Chips on EC2","text":"

        You may want to benchmark models hosted directly on EC2. In this case both FMBench and the model run on the same EC2 instance, and FMBench deploys the model for you on that instance. See this Llama3.1-70b file or this Llama3-8b file for examples. In this case FMBench will download the model from Hugging Face (you need to provide your HuggingFace token in the /tmp/fmbench-read/scripts/hf_token.txt file; the file simply contains the token without any formatting), pull the inference container from the ECR repo and then run the container with the downloaded model; a local endpoint is exposed that FMBench then uses to run inference.
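        As a sketch, a run on an EC2 Trainium instance could look like the command below; it assumes FMBench is already installed, the Hugging Face token file is in place (see the previous section), and the config files follow the /tmp/fmbench-read layout. Pick one of the configs from the manifest that matches your instance type.

          # Sketch: benchmark a Neuron-compiled model locally on the EC2 instance.
          # The config file is one of the Neuron EC2 configs listed in the manifest above.
          fmbench --config-file /tmp/fmbench-read/configs/llama3.1/8b/config-llama3.1-8b-trn1-32xl-deploy-tp-8-ec2.yml \
              --local-mode yes --write-bucket placeholder > fmbench.log 2>&1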

        "},{"location":"quickstart.html","title":"Quickstart - run FMBench on SageMaker Notebook","text":"
        1. Each FMBench run works with a configuration file that contains the information about the model, the deployment steps, and the tests to run. A typical FMBench workflow involves either directly using an already provided config file from the configs folder in the FMBench GitHub repo or editing an already provided config file as per your own requirements (say you want to try benchmarking on a different instance type, or a different inference container etc.).

          \ud83d\udc49 A simple config file with key parameters annotated is included in this repo, see config-llama2-7b-g5-quick.yml. This file benchmarks performance of Llama2-7b on an ml.g5.xlarge instance and an ml.g5.2xlarge instance. You can use this config file as it is for this Quickstart.

        2. Launch the AWS CloudFormation template included in this repository using one of the buttons from the table below. The CloudFormation template creates the following resources within your AWS account: Amazon S3 buckets, an AWS IAM role and an Amazon SageMaker Notebook with this repository cloned. A read S3 bucket is created which contains all the files (configuration files, datasets) required to run FMBench, and a write S3 bucket is created which will hold the metrics and reports generated by FMBench. The CloudFormation stack takes about 5 minutes to create.

        AWS Region Link us-east-1 (N. Virginia) us-west-2 (Oregon) us-gov-west-1 (GovCloud West)
        1. Once the CloudFormation stack is created, navigate to SageMaker Notebooks and open the fmbench-notebook.

        2. On the fmbench-notebook open a Terminal and run the following commands.

          conda create --name fmbench_python311 -y python=3.11 ipykernel\nsource activate fmbench_python311;\npip install -U fmbench\n

        3. Now you are ready to run fmbench with the following command line. We will use a sample config file placed in the S3 bucket by the CloudFormation stack for a quick first run.

          1. We benchmark performance for the Llama2-7b model on a ml.g5.xlarge and a ml.g5.2xlarge instance type, using the huggingface-pytorch-tgi-inference inference container. This test would take about 30 minutes to complete and cost about $0.20.

          2. It uses a simple relationship of 750 words equals 1000 tokens; to get a more accurate representation of token counts use the Llama2 tokenizer (instructions are provided in the next section). It is strongly recommended that for more accurate results on token throughput you use a tokenizer specific to the model you are testing rather than the default tokenizer. See instructions provided later in this document on how to use a custom tokenizer.

            account=`aws sts get-caller-identity | jq .Account | tr -d '\"'`\nregion=`aws configure get region`\nfmbench --config-file s3://sagemaker-fmbench-read-${region}-${account}/configs/llama2/7b/config-llama2-7b-g5-quick.yml > fmbench.log 2>&1\n
          3. Open another terminal window and do a tail -f on the fmbench.log file to see all the traces being generated at runtime.

            tail -f fmbench.log\n
          4. \ud83d\udc49 For streaming support on SageMaker and Bedrock checkout these config files:

            1. config-llama3-8b-g5-streaming.yml
            2. config-bedrock-llama3-streaming.yml
        4. The generated reports and metrics are available in the sagemaker-fmbench-write-<replace_w_your_aws_region>-<replace_w_your_aws_account_id> bucket. The metrics and report files are also downloaded locally into the results directory (created by FMBench), and the benchmarking report is available as a markdown file called report.md in that directory. You can view the rendered Markdown report in the SageMaker notebook itself or download the metrics and report files to your machine for offline analysis.
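        For offline analysis, one way (shown below as a sketch) to pull everything down from the write bucket is with aws s3 sync; the bucket name follows the naming pattern created by the CloudFormation stack.

          # Sketch: download the generated metrics and reports locally for offline analysis.
          account=`aws sts get-caller-identity | jq .Account | tr -d '"'`
          region=`aws configure get region`
          aws s3 sync s3://sagemaker-fmbench-write-${region}-${account}/ ./fmbench-results/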

        "},{"location":"quickstart.html#fmbench-on-govcloud","title":"FMBench on GovCloud","text":"

        No special steps are required for running FMBench on GovCloud. The CloudFormation link for us-gov-west-1 has been provided in the section above.

        1. Not all models available via Bedrock or other services may be available in GovCloud. The following commands show how to run FMBench to benchmark the Amazon Titan Text Express model in the GovCloud. See the Amazon Bedrock GovCloud page for more details.

          account=`aws sts get-caller-identity | jq .Account | tr -d '\"'`\nregion=`aws configure get region`\nfmbench --config-file s3://sagemaker-fmbench-read-${region}-${account}/configs/bedrock/config-bedrock-titan-text-express.yml > fmbench.log 2>&1\n
        "},{"location":"releases.html","title":"Releases","text":""},{"location":"releases.html#207","title":"2.0.7","text":"
        1. Support Triton-TensorRT for GPU instances and Triton-vllm for AWS Chips.
        2. Misc. bug fixes.
        "},{"location":"releases.html#206","title":"2.0.6","text":"
        1. Run multiple model copies with the DJL serving container and an Nginx load balancer on Amazon EC2.
        2. Config files for Llama3.1-8b on g5, p4de and p5 Amazon EC2 instance types.
        3. Better analytics for creating internal leaderboards.
        "},{"location":"releases.html#205","title":"2.0.5","text":"
        1. Support for Intel CPU based instances such as c5.18xlarge and m5.16xlarge.
        "},{"location":"releases.html#204","title":"2.0.4","text":"
        1. Support for AMD CPU based instances such as m7a.
        "},{"location":"releases.html#203","title":"2.0.3","text":"
        1. Support for an EFA directory for benchmarking on EC2.
        "},{"location":"releases.html#202","title":"2.0.2","text":"
        1. Code cleanup, minor bug fixes and report improvements.
        "},{"location":"releases.html#200","title":"2.0.0","text":"
        1. \ud83d\udea8 Model evaluations done by a Panel of LLM Evaluators \ud83d\udea8
        "},{"location":"releases.html#v1052","title":"v1.0.52","text":"
        1. Compile for AWS Chips (Trainium, Inferentia) and deploy to SageMaker directly through FMBench.
        2. Llama3.1-8b and Llama3.1-70b config files for AWS Chips (Trainium, Inferentia).
        3. Misc. bug fixes.
        "},{"location":"releases.html#v1051","title":"v1.0.51","text":"
        1. FMBench has a website now. Rework the README file to make it lightweight.
        2. Llama3.1 config files for Bedrock.
        "},{"location":"releases.html#v1050","title":"v1.0.50","text":"
        1. Llama3-8b on Amazon EC2 inf2.48xlarge config file.
        2. Update to new version of DJL LMI (0.28.0).
        "},{"location":"releases.html#v1049","title":"v1.0.49","text":"
        1. Streaming support for Amazon SageMaker and Amazon Bedrock.
        2. Per-token latency metrics such as time to first token (TTFT) and mean time per-output token (TPOT).
        3. Misc. bug fixes.
        "},{"location":"releases.html#v1048","title":"v1.0.48","text":"
        1. Faster result file download at the end of a test run.
        2. Phi-3-mini-4k-instruct configuration file.
        3. Tokenizer and misc. bug fixes.
        "},{"location":"releases.html#v1047","title":"v1.0.47","text":"
        1. Run FMBench as a Docker container.
        2. Bug fixes for GovCloud support.
        3. Updated README for EKS cluster creation.
        "},{"location":"releases.html#v1046","title":"v1.0.46","text":"
        1. Native model deployment support for EC2 and EKS (i.e. you can now deploy and benchmark models on EC2 and EKS).
        2. FMBench is now available in GovCloud.
        3. Update to latest version of several packages.
        "},{"location":"releases.html#v1045","title":"v1.0.45","text":"
        1. Analytics for results across multiple runs.
        2. Llama3-70b config files for g5.48xlarge instances.
        "},{"location":"releases.html#v1044","title":"v1.0.44","text":"
        1. Endpoint metrics (CPU/GPU utilization, memory utilization, model latency) and invocation metrics (including errors) for SageMaker Endpoints.
        2. Llama3-8b config files for g6 instances.
        "},{"location":"releases.html#v1042","title":"v1.0.42","text":"
        1. Config file for running Llama3-8b on all instance types except p5.
        2. Fix bug with business summary chart.
        3. Fix bug with deploying model using a DJL DeepSpeed container in the no S3 dependency mode.
        "},{"location":"releases.html#v1040","title":"v1.0.40","text":"
        1. Make it easier to run FMBench on Amazon EC2 in the no-Amazon-S3-dependency mode.
        "},{"location":"releases.html#v1039","title":"v1.0.39","text":"
        1. Add an internal FMBench website.
        "},{"location":"releases.html#v1038","title":"v1.0.38","text":"
        1. Support for running FMBench on Amazon EC2 without any dependency on Amazon S3.
        2. Llama3-8b-Instruct config file for ml.p5.48xlarge.
        "},{"location":"releases.html#v1037","title":"v1.0.37","text":"
        1. g5/p4d/inf2/trn1 specific config files for Llama3-8b-Instruct.
          1. p4d config file for both vllm and lmi-dist.
        "},{"location":"releases.html#v1036","title":"v1.0.36","text":"
        1. Fix bug at higher concurrency levels (20 and above).
        2. Support for instance count > 1.
        "},{"location":"releases.html#v1035","title":"v1.0.35","text":"
        1. Support for Open-Orca dataset and corresponding prompts for Llama3, Llama2 and Mistral.
        "},{"location":"releases.html#v1034","title":"v1.0.34","text":"
        1. Don't delete endpoints for the bring your own endpoint case.
        2. Fix bug with business summary chart.
        "},{"location":"releases.html#v1032","title":"v1.0.32","text":"
        1. Report enhancements: New business summary chart, config file embedded in the report, version numbering and others.

        2. Additional config files: Meta Llama3 on Inf2, Mistral instruct with lmi-dist on p4d and p5 instances.

        "},{"location":"resources.html","title":"Resources","text":""},{"location":"resources.html#pending-enhancements","title":"Pending enhancements","text":"

        View the ISSUES on GitHub and add any that you think would be a beneficial addition to this benchmarking harness.

        "},{"location":"resources.html#security","title":"Security","text":"

        See CONTRIBUTING for more information.

        "},{"location":"resources.html#license","title":"License","text":"

        This library is licensed under the MIT-0 License. See the LICENSE file.

        "},{"location":"results.html","title":"Results","text":"

        Depending upon the experiments in the config file, the FMBench run may take a few minutes to several hours. Once the run completes, you can find the report and metrics in the local results-* folder in the directory from where FMBench was run. The report and metrics are also written to the write S3 bucket set in the config file.
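        For example (a minimal sketch), from the directory where FMBench was run:

          # List the local results folders created by FMBench and view the generated report.
          ls -d results-*
          less results-*/report.md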

        Here is a screenshot of the report.md file generated by FMBench.

        "},{"location":"run_as_container.html","title":"Run FMBench as a Docker container","text":"

        You can now run FMBench on any platform where you can run a Docker container, for example on an EC2 VM, SageMaker Notebook etc. The advantage is that you do not have to install anything locally, so no conda installs needed anymore. Here are the steps to do that.

        1. Create local directory structure needed for FMBench and copy all publicly available dependencies from the AWS S3 bucket for FMBench. This is done by running the copy_s3_content.sh script available as part of the FMBench repo. You can place model specific tokenizers and any new configuration files you create in the /tmp/fmbench-read directory that is created after running the following command.

          curl -s https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/copy_s3_content.sh | sh\n
        2. That's it! You are now ready to run the container.

          # set the config file path to point to the config file of interest\nCONFIG_FILE=https://raw.githubusercontent.com/aws-samples/foundation-model-benchmarking-tool/main/src/fmbench/configs/llama2/7b/config-llama2-7b-g5-quick.yml\ndocker run -v $(pwd)/fmbench:/app \\\n  -v /tmp/fmbench-read:/tmp/fmbench-read \\\n  -v /tmp/fmbench-write:/tmp/fmbench-write \\\n  aarora79/fmbench:v1.0.47 \\\n \"fmbench --config-file ${CONFIG_FILE} --local-mode yes --write-bucket placeholder > fmbench.log 2>&1\"\n
        3. The above command will create an fmbench directory inside the current working directory. This directory contains the fmbench.log file and the results-* folder that is created once the run finishes.
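        You can monitor a containerized run the same way as a local run; based on the volume mount shown above, the log is available on the host at fmbench/fmbench.log.

          # The container writes its log into the mounted fmbench directory on the host.
          tail -f fmbench/fmbench.log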

        "},{"location":"website.html","title":"Create a website for FMBench reports","text":"

        When you use FMBench as a tool for benchmarking your foundation models you would soon want to have an easy way to view all the reports in one place and search through the results, for example, \"Llama3.1-8b results on trn1.32xlarge\". An FMBench website provides a simple way of viewing these results.

        Here are the steps to setup a website using mkdocs and nginx. The steps below generate a self-signed certificate for SSL and use username and password for authentication. It is strongly recommended that you use a valid SSL cert and a better authentication mechanism than username and password for your FMBench website.

        1. Start an Amazon EC2 machine which will host the FMBench website. A t3.xlarge machine with an Ubuntu AMI say ubuntu/images/hvm-ssd-gp3/ubuntu-noble-24.04-amd64-server-20240801 and 50GB storage is good enough. Allow SSH and TCP port 443 traffic from anywhere into that machine.

        2. SSH into that machine and install conda.

          wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh\nbash Miniconda3-latest-Linux-x86_64.sh -b  # Run the Miniconda installer in batch mode (no manual intervention)\nrm -f Miniconda3-latest-Linux-x86_64.sh    # Remove the installer script after installation\neval \"$(/home/$USER/miniconda3/bin/conda shell.bash hook)\" # Initialize conda for bash shell\nconda init  # Initialize conda, adding it to the shell  \n
        3. Install docker-compose.

          sudo apt-get update\nsudo apt-get install --reinstall docker.io -y\nsudo apt-get install -y docker-compose\nsudo usermod -a -G docker $USER\nnewgrp docker\ndocker compose version \n
        4. Setup the fmbench_python311 conda environment and clone FMBench repo.

          conda create --name fmbench_python311 -y python=3.11 ipykernel\nsource activate fmbench_python311\npip install -U fmbench mkdocs mkdocs-material mknotebooks\ngit clone https://github.com/aws-samples/foundation-model-benchmarking-tool.git\n
        5. Get the FMBench results data from Amazon S3 or whichever storage system you used to store all the results.

          curl \"https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip\" -o \"awscliv2.zip\"\nsudo apt-get install unzip -y\nunzip awscliv2.zip\nsudo ./aws/install\nFMBENCH_S3_BUCKET=your-fmbench-s3-bucket-name-here\naws s3 sync s3://$FMBENCH_S3_BUCKET $HOME/fmbench_data --exclude \"*.json\"\n
        6. Create a directory for the FMBench website contents.

          mkdir $HOME/fmbench_site\nmkdir $HOME/fmbench_site/ssl\n
          1. Setup SSL certs (we strongly encourage you not to use self-signed certs; this step is just for demo purposes, so get SSL certs the same way you get them for your current production workloads).

          sudo openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout $HOME/fmbench_site/ssl/nginx-selfsigned.key -out $HOME/fmbench_site/ssl/nginx-selfsigned.crt\n
        7. Create an .htpasswd file. The FMBench website will use fmbench_admin as the username and the password that you enter as part of the command below to allow login to the website.

          sudo apt-get install apache2-utils -y\nhtpasswd -c $HOME/fmbench_site/.htpasswd fmbench_admin\n
        8. Create the mkdocs.yml file for the website.

          cd foundation-model-benchmarking-tool\ncp website/index.md $HOME/fmbench_data/\ncp -r img $HOME/fmbench_data/\npython website/create_fmbench_website.py\nmkdocs build -f website/mkdocs.yml --site-dir $HOME/fmbench_site/site\n
        9. Update the nginx.conf file. Note the hostname that is printed out below; the FMBench website will be served at this address.

          TOKEN=`curl -X PUT \"http://169.254.169.254/latest/api/token\" -H \"X-aws-ec2-metadata-token-ttl-seconds: 21600\"`\nHOSTNAME=`curl -H \"X-aws-ec2-metadata-token: $TOKEN\" http://169.254.169.254/latest/meta-data/public-hostname`\necho \"hostname is: $HOSTNAME\"\nsed \"s/__HOSTNAME__/$HOSTNAME/g\" website/nginx.conf.template > $HOME/fmbench_site/nginx.conf\n
        10. Serve the website.

          docker run --name fmbench-nginx -d -p 80:80 -p 443:443   -v $HOME/fmbench_site/site:/usr/share/nginx/html   -v $HOME/fmbench_site/nginx.conf:/etc/nginx/nginx.conf   -v $HOME/fmbench_site/ssl:/etc/nginx/ssl   -v $HOME/fmbench_site/.htpasswd:/etc/nginx/.htpasswd   nginx\n
        11. Open a web browser and navigate to the hostname you noted in the step above, for example https://<your-ec2-hostname>.us-west-2.compute.amazonaws.com, ignore the security warnings if you used a self-signed SSL cert (replace this with a cert that you would normally use in your production websites) and then enter the username and password (the username would be fmbench_admin and password would be what you had set when running the htpasswd command). You should see a website as shown in the screenshot below.

        "},{"location":"workflow.html","title":"Workflow for FMBench","text":"

        The workflow for FMBench is as follows:

        Create configuration file\n        |\n        |-----> Deploy model on SageMaker/Use models on Bedrock/Bring your own endpoint\n                    |\n                    |-----> Run inference against deployed endpoint(s)\n                                     |\n                                     |------> Create a benchmarking report\n
        1. Create a dataset of different prompt sizes and select one or more such datasets for running the tests.

          1. Currently FMBench supports datasets from LongBench and filters out individual items from the dataset based on their size in tokens (for example, prompts of less than 500 tokens, between 500 and 1000 tokens, and so on). Alternatively, you can download the folder from this link to load the data.
        2. Deploy any model that is deployable on SageMaker on any supported instance type (g5, p4d, Inf2).

          1. Models can either be available via SageMaker JumpStart (list available here) or be models not available via JumpStart but still deployable on SageMaker through the low-level boto3 (Python) SDK (Bring Your Own Script).
          2. Model deployment is completely configurable in terms of the inference container to use, environment variables to set, serving.properties file to provide (for inference containers such as DJL that use it) and instance type to use.
        3. Benchmark FM performance in terms of inference latency, transactions per minute and dollar cost per transaction for any FM that can be deployed on SageMaker.

          1. Tests are run for each combination of the configured concurrency levels (i.e. the number of transactions, or inference requests, sent to the endpoint in parallel) and datasets. For example, run multiple datasets of, say, prompt sizes between 3000 and 4000 tokens at concurrency levels of 1, 2, 4, 6, 8, etc., so as to test how many transactions of what token length the endpoint can handle while still maintaining an acceptable level of inference latency.
        4. Generate a report that compares and contrasts the performance of the model over different test configurations and stores the reports in an Amazon S3 bucket.

          1. The report is generated in the Markdown format and consists of plots, tables and text that highlight the key results and provide an overall recommendation on the best combination of instance type and serving stack to use for the model under test for a dataset of interest.
          2. The report is created as an artifact of reproducible research so that anyone having access to the model, instance type and serving stack can run the code and recreate the same results and report.
        5. Multiple configuration files that can be used as reference for benchmarking new models and instance types.

        "},{"location":"misc/ec2_instance_creation_steps.html","title":"Create an EC2 instance suitable for an LMI (Large Model Inference)","text":"

        Follow the steps below to create an EC2 instance for hosting a model in an LMI.

        1. On the homepage of AWS Console go to \u2018EC2\u2019 - it is likely in recently visited:

        2. If not found, go to the search bar at the top of the page. Type ec2 into the search box and click the entry that pops up with the name EC2:

        3. Click \u201cInstances\u201d:

        4. Click \"Launch Instances\":

        5. Type in a name for your instance (it is recommended to include your alias in the name), and then scroll down. Search for \u2018deep learning ami\u2019 in the box. Select the one that says Deep Learning OSS Nvidia Driver AMI GPU PyTorch for a GPU instance type, or Deep Learning AMI Neuron (Ubuntu 22.04) for an Inferentia/Trainium instance type. Your version number might be different.

        6. Name your instance FMBenchInstance.

        7. Add a fmbench-version tag to your instance.

        8. Scroll down to Instance Type. For large model inference, the g5.12xlarge is recommended.

        1. Make a key pair by clicking Create new key pair. Give it a name, keep all settings as is, and then click \u201cCreate key pair\u201d.

        2. Skip over Network settings (leave it as it is), going straight to Configure storage. 45 GB, the suggested amount, is not nearly enough, and using that will cause the LMI docker container to download for an arbitrarily long time and then error out. Change it to 100 GB or more:

        3. Create an IAM role for your instance called FMBenchEC2Role. Attach the following permission policies: AmazonSageMakerFullAccess and AmazonBedrockFullAccess. (A CLI sketch for creating this role is shown after the trust policy below.)

          Edit the trust policy to be the following:

          {\n    \"Version\": \"2012-10-17\",\n    \"Statement\": [\n        {\n            \"Effect\": \"Allow\",\n            \"Principal\": {\n                \"Service\": \"ec2.amazonaws.com\"\n            },\n            \"Action\": \"sts:AssumeRole\"\n        },\n        {\n            \"Effect\": \"Allow\",\n            \"Principal\": {\n                \"Service\": \"sagemaker.amazonaws.com\"\n            },\n            \"Action\": \"sts:AssumeRole\"\n        },\n        {\n            \"Effect\": \"Allow\",\n            \"Principal\": {\n                \"Service\": \"bedrock.amazonaws.com\"\n            },\n            \"Action\": \"sts:AssumeRole\"\n        }\n    ]\n}\n
          Select this role in the IAM instance profile setting of your instance.
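          If you prefer the CLI over the console, here is a sketch; it assumes the trust policy above has been saved locally as trust-policy.json and uses the role name from this guide.

            # Sketch: create FMBenchEC2Role from the trust policy above, attach the two
            # managed policies, and expose the role to EC2 as an instance profile.
            aws iam create-role --role-name FMBenchEC2Role \
                --assume-role-policy-document file://trust-policy.json
            aws iam attach-role-policy --role-name FMBenchEC2Role \
                --policy-arn arn:aws:iam::aws:policy/AmazonSageMakerFullAccess
            aws iam attach-role-policy --role-name FMBenchEC2Role \
                --policy-arn arn:aws:iam::aws:policy/AmazonBedrockFullAccess
            aws iam create-instance-profile --instance-profile-name FMBenchEC2Role
            aws iam add-role-to-instance-profile --instance-profile-name FMBenchEC2Role \
                --role-name FMBenchEC2Role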

        4. Then, we\u2019re done with the settings of the instance. Click Launch Instance to finish. You can connect to your EC2 instance using any of these options.

        "},{"location":"misc/eks_cluster-creation_steps.html","title":"EKS cluster creation steps","text":"

        The steps below create an EKS cluster called trainium-inferentia.

        1. Before we begin, ensure you have all the prerequisites in place to make the deployment process smooth and hassle-free. Ensure that you have installed the following tools on your machine: aws-cli, kubectl and terraform. We use the DoEKS repository as a guide to deploy the cluster infrastructure in an AWS account.

        2. Ensure that your account has enough Inf2 on-demand vCPUs, as most of the DoEKS blueprints use this specific instance type. To increase the service quota, navigate to the service quota page for the region you are in: service quota. Then select Services in the left side menu and search for Amazon Elastic Compute Cloud (Amazon EC2). This brings up the service quota page; here, search for inf and there should be an option for Running On-Demand Inf instances. Increase this quota to 300.
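          If you prefer to do this from the CLI, a sketch is shown below; the quota code is not hard-coded here, the first command looks it up and <quota-code> is a placeholder you substitute.

            # Sketch: find the quota code for "Running On-Demand Inf instances" and request 300.
            aws service-quotas list-service-quotas --service-code ec2 \
                --query "Quotas[?contains(QuotaName, 'Inf')].[QuotaName,QuotaCode]" --output table
            # Substitute the QuotaCode printed above for <quota-code>.
            aws service-quotas request-service-quota-increase --service-code ec2 \
                --quota-code <quota-code> --desired-value 300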

        3. Clone the DoEKS repository

          git clone https://github.com/awslabs/data-on-eks.git\n
        4. Ensure that the region names are correct in the variables.tf file before running the cluster creation script.

        5. Ensure that the ELB to be created will be external facing. Change the helm value from internal to internet-facing here.

        6. Ensure that the IAM role you are using has the permissions needed to create the cluster. While we expect the following set of permissions to work, the current recommendation is to also add the AdministratorAccess permission to the IAM role. At a later date you can remove AdministratorAccess and experiment with cluster creation without it.

          1. Attach the following managed policies: AmazonEKSClusterPolicy, AmazonEKS_CNI_Policy, and AmazonEKSWorkerNodePolicy.
          2. In addition to the managed policies add the following as inline policy. Replace your-account-id with the actual value of the AWS account id you are using.

            {\n\"Version\": \"2012-10-17\",\n\"Statement\": [\n    {\n        \"Sid\": \"VisualEditor0\",\n        \"Effect\": \"Allow\",\n        \"Action\": [\n            \"ec2:CreateVpc\",\n            \"ec2:DeleteVpc\"\n        ],\n        \"Resource\": [\n            \"arn:aws:ec2:*:your-account-id:ipv6pool-ec2/*\",\n            \"arn:aws:ec2::your-account-id:ipam-pool/*\",\n            \"arn:aws:ec2:*:your-account-id:vpc/*\"\n        ]\n    },\n    {\n        \"Sid\": \"VisualEditor1\",\n        \"Effect\": \"Allow\",\n        \"Action\": [\n            \"ec2:ModifyVpcAttribute\",\n            \"ec2:DescribeVpcAttribute\"\n        ],\n        \"Resource\": \"arn:aws:ec2:*:<your-account-id>:vpc/*\"\n    },\n    {\n        \"Sid\": \"VisualEditor2\",\n        \"Effect\": \"Allow\",\n        \"Action\": \"ec2:AssociateVpcCidrBlock\",\n        \"Resource\": [\n            \"arn:aws:ec2:*:your-account-id:ipv6pool-ec2/*\",\n            \"arn:aws:ec2::your-account-id:ipam-pool/*\",\n            \"arn:aws:ec2:*:your-account-id:vpc/*\"\n        ]\n    },\n    {\n        \"Sid\": \"VisualEditor3\",\n        \"Effect\": \"Allow\",\n        \"Action\": [\n            \"ec2:DescribeSecurityGroupRules\",\n            \"ec2:DescribeNatGateways\",\n            \"ec2:DescribeAddressesAttribute\"\n        ],\n        \"Resource\": \"*\"\n    },\n    {\n        \"Sid\": \"VisualEditor4\",\n        \"Effect\": \"Allow\",\n        \"Action\": [\n            \"ec2:CreateInternetGateway\",\n            \"ec2:RevokeSecurityGroupEgress\",\n            \"ec2:CreateRouteTable\",\n            \"ec2:CreateSubnet\"\n        ],\n        \"Resource\": [\n            \"arn:aws:ec2:*:your-account-id:security-group/*\",\n            \"arn:aws:ec2:*:your-account-id:internet-gateway/*\",\n            \"arn:aws:ec2:*:your-account-id:subnet/*\",\n            \"arn:aws:ec2:*:your-account-id:route-table/*\",\n            \"arn:aws:ec2::your-account-id:ipam-pool/*\",\n            \"arn:aws:ec2:*:your-account-id:vpc/*\"\n        ]\n    },\n    {\n        \"Sid\": \"VisualEditor5\",\n        \"Effect\": \"Allow\",\n        \"Action\": [\n            \"ec2:AttachInternetGateway\",\n            \"ec2:AssociateRouteTable\"\n        ],\n        \"Resource\": [\n            \"arn:aws:ec2:*:your-account-id:vpn-gateway/*\",\n            \"arn:aws:ec2:*:your-account-id:internet-gateway/*\",\n            \"arn:aws:ec2:*:your-account-id:subnet/*\",\n            \"arn:aws:ec2:*:your-account-id:route-table/*\",\n            \"arn:aws:ec2:*:your-account-id:vpc/*\"\n        ]\n    },\n    {\n        \"Sid\": \"VisualEditor6\",\n        \"Effect\": \"Allow\",\n        \"Action\": \"ec2:AllocateAddress\",\n        \"Resource\": [\n            \"arn:aws:ec2:*:your-account-id:ipv4pool-ec2/*\",\n            \"arn:aws:ec2:*:your-account-id:elastic-ip/*\"\n        ]\n    },\n    {\n        \"Sid\": \"VisualEditor7\",\n        \"Effect\": \"Allow\",\n        \"Action\": \"ec2:ReleaseAddress\",\n        \"Resource\": \"arn:aws:ec2:*:your-account-id:elastic-ip/*\"\n    },\n    {\n        \"Sid\": \"VisualEditor8\",\n        \"Effect\": \"Allow\",\n        \"Action\": \"ec2:CreateNatGateway\",\n        \"Resource\": [\n            \"arn:aws:ec2:*:your-account-id:subnet/*\",\n            \"arn:aws:ec2:*:your-account-id:natgateway/*\",\n            \"arn:aws:ec2:*:your-account-id:elastic-ip/*\"\n        ]\n    }\n]\n}\n
            1. Add the Role ARN and name here in the variables.tf file by updating these lines. Move the structure inside the default list and replace the role ARN and name values with the values for the role you are using.

        7. Navigate into the ai-ml/trainium-inferentia/ directory and run install.sh script.

          cd data-on-eks/ai-ml/trainium-inferentia/\n./install.sh\n

          Note: This step takes about 12-15 minutes to deploy the EKS infrastructure and cluster in the AWS account. To view more details on cluster creation, view an example here: Deploy Llama3 on EKS in the prerequisites section.

        8. After the cluster is created, navigate to the Karpenter EC2 node IAM role called karpenter-trainium-inferentia-XXXXXXXXXXXXXXXXXXXXXXXXX. Attach the following inline policy to the role:

          {\n    \"Version\": \"2012-10-17\",\n    \"Statement\": [\n        {\n            \"Sid\": \"Statement1\",\n            \"Effect\": \"Allow\",\n            \"Action\": [\n                \"iam:CreateServiceLinkedRole\"\n            ],\n            \"Resource\": \"*\"\n        }\n    ]\n}\n
        "},{"location":"misc/the-diy-version-w-gory-details.html","title":"The diy version w gory details","text":""},{"location":"misc/the-diy-version-w-gory-details.html#the-diy-version-with-gory-details","title":"The DIY version (with gory details)","text":"

        Follow the prerequisites below to set up your environment before running the code:

        1. Python 3.11: Setup a Python 3.11 virtual environment and install FMBench.

          python -m venv .fmbench\nsource .fmbench/bin/activate\npip install fmbench\n
        2. S3 buckets for test data, scripts, and results: Create two buckets within your AWS account:

          • Read bucket: This bucket contains tokenizer files, prompt template, source data and deployment scripts stored in a directory structure as shown below. FMBench needs to have read access to this bucket.

            s3://<read-bucket-name>\n    \u251c\u2500\u2500 source_data/\n    \u251c\u2500\u2500 source_data/<source-data-file-name>.json\n    \u251c\u2500\u2500 prompt_template/\n    \u251c\u2500\u2500 prompt_template/prompt_template.txt\n    \u251c\u2500\u2500 scripts/\n    \u251c\u2500\u2500 scripts/<deployment-script-name>.py\n    \u251c\u2500\u2500 tokenizer/\n    \u251c\u2500\u2500 tokenizer/tokenizer.json\n    \u251c\u2500\u2500 tokenizer/config.json\n
            • The details of the bucket structure are as follows:

              1. Source Data Directory: Create a source_data directory that stores the dataset you want to benchmark with. FMBench uses Q&A datasets from the LongBench dataset or alternatively from this link. Support for bring your own dataset will be added soon.

                • Download the different files specified in the LongBench dataset into the source_data directory. Following is a good list to get started with:

                  • 2wikimqa
                  • hotpotqa
                  • narrativeqa
                  • triviaqa

                  Store these files in the source_data directory.
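                  For example (a sketch using the placeholders from the bucket structure above), uploading a downloaded file to the read bucket looks like this:

                    # Upload a downloaded dataset file into the read bucket's source_data directory.
                    # Replace <read-bucket-name> and <source-data-file-name> with your own values.
                    aws s3 cp <source-data-file-name>.json s3://<read-bucket-name>/source_data/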

              2. Prompt Template Directory: Create a prompt_template directory that contains a prompt_template.txt file. This .txt file contains the prompt template that your specific model supports. FMBench already supports the prompt template compatible with Llama models.

              3. Scripts Directory: FMBench also supports a bring your own script (BYOS) mode for deploying models that are not natively available via SageMaker JumpStart i.e. anything not included in this list. Here are the steps to use BYOS.

                1. Create a Python script to deploy your model on a SageMaker endpoint. This script needs to have a deploy function that 2_deploy_model.ipynb can invoke. See p4d_hf_tgi.py for reference.

                2. Place your deployment script in the scripts directory in your read bucket. If your script deploys a model directly from HuggingFace and needs to have access to a HuggingFace auth token, then create a file called hf_token.txt and put the auth token in that file. The .gitignore file in this repo has rules to not commit the hf_token.txt to the repo. Today, FMBench provides inference scripts for:

                  • All SageMaker Jumpstart Models
                  • Text-Generation-Inference (TGI) container supported models
                  • Deep Java Library DeepSpeed container supported models

                  Deployment scripts for the options above are available in the scripts directory, you can use these as reference for creating your own deployment scripts as well.

              4. Tokenizer Directory: Place the tokenizer.json, config.json and any other files required for your model's tokenizer in the tokenizer directory. The tokenizer for your model should be compatible with the tokenizers package. FMBench uses AutoTokenizer.from_pretrained to load the tokenizer. As an example, to use the Llama 2 tokenizer for counting prompt and generation tokens for the Llama 2 family of models: accept the license via the meta approval form, download the tokenizer.json and config.json files from the Hugging Face website, and place them in the tokenizer directory.

          • Write bucket: All prompt payloads, model endpoint and metrics generated by FMBench are stored in this bucket. FMBench requires write permissions to store the results in this bucket. No directory structure needs to be pre-created in this bucket, everything is created by FMBench at runtime.

            ```{.bash}\ns3://<write-bucket-name>\n    \u251c\u2500\u2500 <test-name>\n    \u251c\u2500\u2500 <test-name>/data\n    \u251c\u2500\u2500 <test-name>/data/metrics\n    \u251c\u2500\u2500 <test-name>/data/models\n    \u251c\u2500\u2500 <test-name>/data/prompts\n```"}]} \ No newline at end of file diff --git a/sitemap.xml.gz b/sitemap.xml.gz index 2db7c08b..084fc14c 100755 Binary files a/sitemap.xml.gz and b/sitemap.xml.gz differ