Benchmark script (#889)
Working on multiple AWS nodes and my Ubuntu box. We can add conda
support later, for now, I just want to get it running on AWS to avoid
manual steps taking extra time.
rengolin authored Feb 25, 2024
1 parent 6f4b13b commit 92efef4
Showing 3 changed files with 174 additions and 1 deletion.
85 changes: 85 additions & 0 deletions scripts/benchmarks/README.md
@@ -0,0 +1,85 @@
# Benchmarking TPP-MLIR

This directory contains the script that we used in the paper to extract performance numbers.

It is the same process we use in our CI.

## Machines

For our paper, to achieve reproducible numbers, we have used the following AWS instances:

| Node | Vendor | Arch | Series | Cores | Threads | Memory | Disk |
| ---- | ------ | ---- | ------ | ----- | ------- | ------ | ---- |
| c6i.8xlarge | Intel | Ice Lake | Xeon 3 | 16 | 32 | 64GB | 300GB |
| c6a.8xlarge | AMD | Milan | Zen 3 | 16 | 32 | 64GB | 300GB |
| c7i.8xlarge | Intel | Sapphire Rapids | Xeon 4 | 16 | 32 | 64GB | 300GB |
| c7a.8xlarge | AMD | Genoa | Zen 4 | 16 | 16 | 32GB | 300GB |
| c7g.8xlarge | Arm | Graviton 3 | Neoverse V1 | 16 | 16 | 32GB | 300GB |

All instances above use Amazon Linux, which is free on AWS.

## Benchmarks

The benchmarking script in this directory installs the necessary requirements, builds LLVM and TPP-MLIR, and runs the following set of benchmarks:
* Base: Comparison between `libxsmm-dnn` "hand-coded" kernels and `tpp-mlir` compiler-generated code.
* PyTorch: Comparison of the compiler's performance on TensorFlow-like generated IR against equivalent models extracted from PyTorch.
* OpenMP: Scalability analysis of the `libxsmm-dnn` and `tpp-mlir` runs above on 2, 4, 8 and 16 threads.

## Execution

### Reserve the node

Use the AWS interface (web, cli) to reserve one of the nodes above. Make sure you can reach the node in a way that lets you run the scripts (for example, a random public IP, an elastic IP, or console and password). Once the node is running, connect to it and run the commands below.

### Commands

```sh
# First, identify the CPU/OS type and its properties; it's good to keep a log to compare
$ lscpu && free -h && uname -a

# Then install git, and clone the repository
$ sudo dnf install -y git
$ git clone https://github.com/plaidml/tpp-mlir.git
$ cd tpp-mlir

# Finally run the install script
$ ./scripts/benchmarks/build_and_run.sh
```

It's recommended that you run the script under `nohup`, `screen` or `tmux`, so that you can safely disconnect and reconnect later to gather the results.
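As an illustration of the `nohup` option, here is a minimal sketch. The `run_detached` helper and the `benchmark.log` file name are assumptions made for this example, not part of the repository. It also assumes that `sudo` does not prompt for a password (the default on AWS instances), because standard input is detached.

```sh
# Hypothetical helper (not part of tpp-mlir): run a command detached from the
# terminal, sending stdout and stderr to a log file, and print its PID.
run_detached() {
  local log="$1"
  shift
  # nohup keeps the command alive after the SSH session ends;
  # stdin comes from /dev/null so nothing waits on the terminal.
  nohup "$@" > "${log}" 2>&1 < /dev/null &
  echo "$!"
}

# Usage on the benchmark node (log file name is an assumption):
#   run_detached benchmark.log ./scripts/benchmarks/build_and_run.sh
#   tail -f benchmark.log
```

After reconnecting, `tail -f benchmark.log` shows the progress so far. With `screen` or `tmux`, you would instead start a session, run the script directly inside it, and re-attach later.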

## Results

Initially, the script will install packages, download and build LLVM, then use that LLVM to build TPP-MLIR and then use the benchmark driver to run the three benchmarks above.

The output is in `verbose` mode, to give an idea of progress and hint at failures if they do occur. The log messages are prefixed with a timestamp, the machine name and a message type (`DEBUG`, `INFO`). You should not see any `ERROR`s, but there may be `WARNING`s depending on the machine type. Some tests only run on `x86_64` while others only run on `arm`.

Actual output does not have a log prefix and is in the following format:
```
Benchmark: NAME X
benchmark_x_dnn : 104.446 gflops
benchmark_x_mlir : 106.956 gflops
benchmark_x_torch : 96.543 gflops
benchmark_x_omp_2_dnn : 208.185 gflops
benchmark_x_omp_2_mlir : 212.223 gflops
```

Benchmarks with `dnn` in their names are results from the `libxsmm-dnn` executable and denote micro-kernels laid out by hand in C++ code, calling `libxsmm` directly.

Benchmarks with `mlir` in their names are executed by the compiler on some MLIR input. Of those, there are additional tags:
* nothing: This is the default mode of the compiler and the _"fairest"_ comparison against `libxsmm-dnn`.
* `const`: Weights and biases are constant literals on the function, for compile time optimizations.
* `args`: Weights and biases are arguments to the function, so harder to optimize.
* `torch`: Models from PyTorch, instead of our `mlir-gen` _"TensorFlow-like"_ MLIR generator.
* `omp_N`: OpenMP benchmarks on `N` threads.
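Because result lines carry no log prefix and follow the `name : value gflops` shape shown above, they can be separated from the verbose log with a small filter. The sketch below is an assumption-laden example, not part of the repository: `extract_gflops` is a hypothetical helper, and it assumes each result line has exactly four whitespace-separated fields.

```sh
# Hypothetical helper (not part of tpp-mlir): read a saved benchmark log on
# stdin and print "name,value" for every line shaped like
#   benchmark_x_dnn : 104.446 gflops
extract_gflops() {
  # Keep only lines with exactly four fields where the second is ":" and the
  # fourth is "gflops"; prefixed log messages and "Benchmark:" headers are skipped.
  awk 'NF == 4 && $2 == ":" && $4 == "gflops" { print $1 "," $3 }'
}

# Usage (log file name is an assumption):
#   extract_gflops < benchmark.log > results.csv
```

The resulting CSV can then be sorted or grouped by tag (`dnn`, `mlir`, `torch`, `omp_N`) to compare runs across machines.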

## Troubleshooting

The script has been tested on developer machines and multiple AWS instances, so it should _"just work"_. However, it is not robust enough to run multiple times without failure. If something breaks in the middle, the worst-case scenario is to fix the problem, remove everything and start over until the run completes.

There are a number of things you can do to remove _"interim"_ state and re-start:
* If package installation fails, there is no interim state yet, so you can run the script again once the packages are installed correctly.
* If the `build_llvm.sh` script has been called, it will create a directory `~/installs/llvm` before doing anything else. So, if the LLVM build fails, you have to remove that directory before running the script again. You'll also have to remove the zip file (`<hash>.zip`) and the build directory (`llvm`) in the source directory of tpp-mlir.
* If the LLVM build succeeds, then the install will be detected and you won't need to worry about it again.
* If the TPP-MLIR build fails, you can troubleshoot by following the main README and build by hand. The script will always try to build again, but because we use `ninja`, this will take less than a second on an already built directory.
* The benchmark scripts can show warnings or errors in their logs. If that happens, go to the `benchmarks` directory, follow the instructions on how to use the `driver.py` script, and adjust for your system.
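
The LLVM clean-up described above could be sketched as follows. This is a hypothetical helper, not part of the repository. It assumes the zip file sits at the top of the tpp-mlir checkout and matches `*.zip`, which may not hold on your system, so it only lists paths unless `delete` is passed.

```sh
# Hypothetical helper (not part of tpp-mlir): list, and optionally delete, the
# interim LLVM state left behind by a failed build.
#   $1: path to the tpp-mlir checkout
#   $2: "dry-run" (default) or "delete"
#   $3: LLVM install root (default: ~/installs/llvm)
clean_llvm_state() {
  local source_dir="$1"
  local mode="${2:-dry-run}"
  local install_root="${3:-${HOME}/installs/llvm}"
  local path
  # Assumption: the downloaded <hash>.zip lives directly in the checkout.
  for path in "${install_root}" "${source_dir}/llvm" "${source_dir}"/*.zip; do
    [ -e "${path}" ] || continue   # skip missing paths and unmatched globs
    if [ "${mode}" = "delete" ]; then
      rm -rf -- "${path}"
    fi
    echo "${path}"
  done
}

# Usage: inspect first, then delete once the list looks right.
#   clean_llvm_state ~/tpp-mlir
#   clean_llvm_state ~/tpp-mlir delete
```

Defaulting to a dry run is a deliberate choice here: the glob may match unrelated zip files, so it is worth checking the printed paths before removing anything.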
75 changes: 75 additions & 0 deletions scripts/benchmarks/build_and_run.sh
@@ -0,0 +1,75 @@
#!/usr/bin/env bash
#
# This script is meant to be used in a new machine, to build LLVM, TPP-MLIR
# and run all benchmarks. This should work in the same way as our local tests
# and reproduce our numbers on local/cloud machines.

# Include common utils
SCRIPT_DIR=$(realpath "$(dirname "$0")/..")
source "${SCRIPT_DIR}/ci/common.sh"

# Install packages needed
if [ "$(is_linux_distro Ubuntu)" == "YES" ]; then
  sudo apt update && \
  sudo apt install -y build-essential \
    cmake clang lld ninja-build \
    unzip python3-pip libomp-dev git
elif [ "$(is_linux_distro Amazon)" == "YES" ]; then
  sudo dnf install -y cmake clang lld ninja-build \
    unzip python3-pip libomp-devel git
else
  echo "Not Ubuntu or Amazon Linux, tools may not be available"
fi

# Environment used by the scripts
SOURCE_DIR=$(git_root)
export KIND=Release
export COMPILER=clang
export LINKER=lld

# Build LLVM
export LLVMROOT=${HOME}/installs/llvm
export LLVM_VERSION=$(llvm_version)
export LLVM_INSTALL_DIR=${LLVMROOT}/${LLVM_VERSION}
export LLVM_TAR_DIR=${SOURCE_DIR}/llvm
export LLVM_BUILD_DIR=${SOURCE_DIR}/llvm/build
if [ ! -f "${LLVM_INSTALL_DIR}/bin/mlir-opt" ]; then
  ${SCRIPT_DIR}/buildkite/build_llvm.sh
else
  echo "LLVM already built on ${LLVM_INSTALL_DIR}"
fi

# Build TPP-MLIR
export BUILDKITE_BUILD_CHECKOUT_PATH=${SOURCE_DIR}
export BUILD_DIR=${SOURCE_DIR}/build-${COMPILER}
${SCRIPT_DIR}/buildkite/build_tpp.sh

# Run benchmarks
export BUILDKITE_BENCHMARK_NUM_ITER=1000
export BENCH_DIR=${BUILDKITE_BUILD_CHECKOUT_PATH:-.}/benchmarks
export CONFIG_DIR=$(realpath "${BENCH_DIR}/config")
export NUM_ITER=1000

pushd ${BENCH_DIR}

echo " ========= Base Benchmarks ==========="
echo_run ./driver.py -vv \
         -n ${NUM_ITER} \
         -c "${CONFIG_DIR}/base/base.json" \
         --build "${BUILD_DIR}"

echo " ========= PyTorch Benchmarks ==========="
echo_run ./driver.py -vv \
         -n ${NUM_ITER} \
         -c "${CONFIG_DIR}/pytorch/torch_dynamo.json" \
         --build "${BUILD_DIR}"

echo " ========= OpenMP Benchmarks ==========="
for cfg in dnn-fp32 dnn-bf16 mlir-fp32 mlir-bf16; do
  echo_run ./driver.py -vv \
           -n ${NUM_ITER} \
           -c "${CONFIG_DIR}/omp/${cfg}.json" \
           --build "${BUILD_DIR}"
done

popd
15 changes: 14 additions & 1 deletion scripts/ci/common.sh
@@ -84,4 +84,17 @@ wait_for_file() {
echo "."
done
echo "Found"
}
}

# Check if Linux is of a particular distro
is_linux_distro() {
  local NAME="${1}"

  for file in /etc/os-release /etc/lsb-release /etc/redhat-release; do
    if [ -f "${file}" ] && grep -qi "${NAME}" "${file}"; then
      echo "YES"
      return
    fi
  done
  echo "NO"
}
