Scripts and AWS results for perf section of super command doc (#5506)
# Query Performance From `super` Command Doc

These scripts were used to generate the results in the
[Performance](https://zed.brimdata.io/docs/next/commands/super#performance)
section of the [`super` command doc](https://zed.brimdata.io/docs/next/commands/super).
The scripts have been made available to allow for easy reproduction of the
results under different conditions and/or as tested systems evolve.

# Environments

The scripts were written to be easily run in two different environments.

## AWS

As an environment that's available to everyone, the scripts were developed
primarily for use on a "scratch" EC2 instance in [AWS](https://aws.amazon.com/).
Specifically, we chose the [`m6idn.2xlarge`](https://aws.amazon.com/ec2/instance-types/m6i/)
instance, which has the following specifications:

* 8x vCPU
* 32 GB of RAM
* 474 GB NVMe instance SSD

The instance SSD in particular was seen as important to ensure consistent I/O
performance.

Assuming a freshly-created `m6idn.2xlarge` instance running Ubuntu 24.04,
start the run with:
```
curl -s https://raw.githubusercontent.com/brimdata/super/main/scripts/super-cmd-perf/benchmark.sh | bash -xv 2>&1 | tee runlog.txt
```

The run proceeds in three phases:

1. **(AWS only)** The instance SSD is formatted and the required tools & data
   platforms are downloaded/installed
2. Test data is downloaded and loaded into the needed storage formats
3. Queries are executed on all data platforms

As the benchmarks may take a long time to run, the use of [`screen`](https://www.gnu.org/software/screen/)
or a similar "detachable" terminal tool is recommended in case your remote
network connection drops during a run.

## macOS/other

Whereas on [AWS](#aws) the scripts assume they're in a "scratch" environment
where they may format the instance SSD for optimal storage and install required
software, on other systems such as macOS it's assumed the required data
platforms are already installed, so the scripts skip ahead to
downloading/loading test data and then running queries.

For instance, on macOS the needed software can first be installed via:

```
brew install hyperfine datafusion duckdb clickhouse go
go install github.com/brimdata/super/cmd/super@main
```

Then clone the [super repo](https://github.com/brimdata/super.git) and run the
benchmarks:

```
git clone https://github.com/brimdata/super.git
cd super/scripts/super-cmd-perf
./benchmark.sh
```

All test data will remain in this directory.

# Results

Results from the run will accumulate in a subdirectory named for the date/time
when the run started, e.g., `2024-11-19_01:10:30/`. In this directory, summary
reports will be created in files with `.md` and `.csv` extensions, and details
from each individual step in generating the results will be in files ending in
`.out`. If run on AWS using the [`curl` command line shown above](#aws), a
`runlog.txt` file holding the full console output of the entire run will also
be present.
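
For reference, `benchmark.sh` creates this directory name with `date +%F_%T`,
which can be previewed on its own:

```shell
# benchmark.sh names the results directory via date +%F_%T,
# e.g. 2024-11-19_01:10:30 (the date and time the run started).
rundir="$(date +%F_%T)"
echo "$rundir"
```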

An archive of results from our most recent run of the benchmarks on November
26, 2024 can be downloaded [here](https://super-cmd-perf.s3.us-east-2.amazonaws.com/2024-11-26_03-17-25.tgz).

# Debugging

The scripts are configured to exit immediately if failures occur during the
run. If you encounter a failure, look in the results directory for the `.out`
file mentioned last in the console output, as it will contain the detailed
error message from the operation that failed.
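
Another quick way to find the `.out` file of interest is to sort by
modification time. A small sketch, using hypothetical fixture files rather
than real results output:

```shell
# Sketch: print the most recently written .out file in a results
# directory, which is typically the step that failed. The directory
# and file names below are hypothetical stand-ins for real results.
rundir="demo-results"
mkdir -p "$rundir"
echo "step one ok" > "$rundir/step1.out"
sleep 1
echo "error: out of memory" > "$rundir/step2.out"
newest="$(ls -t "$rundir"/*.out | head -n 1)"
cat "$newest"
```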

A problem we encountered when developing the scripts, and that you may also
encounter, is DuckDB running out of memory. Specifically, this happened when
we tried to run the scripts on an Intel-based MacBook with only 16 GB of RAM,
and this is part of why we used an AWS instance with 32 GB of RAM as the
reference platform. On the MacBooks, we found we could work around the memory
problem by telling DuckDB it could use more memory than its default
[80% heuristic for `memory_limit`](https://duckdb.org/docs/configuration/overview.html)
allows. The scripts support an environment variable to make it easy to
increase this value, e.g., we found the scripts ran successfully at 16 GB:

```
$ DUCKDB_MEMORY_LIMIT="16GB" ./benchmark.sh
```

Of course, this ultimately caused swapping on our MacBook and a significant
performance hit, but it at least allowed the scripts to run without failure.
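
Under the hood, `prep-data.sh` turns this variable into a `SET memory_limit`
statement prepended to each DuckDB command. A minimal sketch of that mapping:

```shell
# Mirrors the logic in prep-data.sh: when DUCKDB_MEMORY_LIMIT is set,
# build a "SET memory_limit" clause to prepend to each DuckDB command;
# otherwise leave the command unchanged.
DUCKDB_MEMORY_LIMIT="16GB"
if [ -n "${DUCKDB_MEMORY_LIMIT:-}" ]; then
  limit_clause="SET memory_limit = '${DUCKDB_MEMORY_LIMIT}'; "
else
  limit_clause=""
fi
echo "${limit_clause}COPY (FROM gha) TO 'gha.parquet'"
```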
#!/bin/bash -xv
set -euo pipefail
export RUNNING_ON_AWS_EC2=""

# If we can detect we're running on an AWS EC2 m6idn.2xlarge instance, we'll
# treat it as a scratch host, installing all needed software and using the
# local SSD for best I/O performance.
if command -v dmidecode && [ "$(sudo dmidecode --string system-uuid | cut -c1-3)" == "ec2" ] && [ "$(TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600") && curl -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/instance-type)" == "m6idn.2xlarge" ]; then

  export RUNNING_ON_AWS_EC2=true

  sudo apt-get -y update
  sudo apt-get -y upgrade
  sudo apt-get -y install make gcc unzip hyperfine

  # Prepare local SSD for best I/O performance
  sudo fdisk -l /dev/nvme1n1
  sudo mkfs.ext4 -E discard -F /dev/nvme1n1
  sudo mount /dev/nvme1n1 /mnt
  sudo chown ubuntu:ubuntu /mnt
  sudo chmod 777 /mnt
  echo 'export TMPDIR="/mnt/tmpdir"' >> "$HOME"/.profile
  mkdir /mnt/tmpdir

  # Install ClickHouse
  if ! command -v clickhouse-client > /dev/null 2>&1; then
    sudo apt-get install -y apt-transport-https ca-certificates curl gnupg
    curl -fsSL 'https://packages.clickhouse.com/rpm/lts/repodata/repomd.xml.key' | sudo gpg --dearmor -o /usr/share/keyrings/clickhouse-keyring.gpg
    echo "deb [signed-by=/usr/share/keyrings/clickhouse-keyring.gpg] https://packages.clickhouse.com/deb stable main" | sudo tee \
      /etc/apt/sources.list.d/clickhouse.list
    sudo apt-get update
    sudo DEBIAN_FRONTEND=noninteractive apt-get install -y clickhouse-client
  fi

  # Install DuckDB
  if ! command -v duckdb > /dev/null 2>&1; then
    curl -L -O https://github.com/duckdb/duckdb/releases/download/v1.1.3/duckdb_cli-linux-amd64.zip
    unzip duckdb_cli-linux-amd64.zip
    sudo mv duckdb /usr/local/bin
  fi

  # Install Rust
  curl -L -O https://static.rust-lang.org/dist/rust-1.82.0-x86_64-unknown-linux-gnu.tar.xz
  tar xf rust-1.82.0-x86_64-unknown-linux-gnu.tar.xz
  sudo rust-1.82.0-x86_64-unknown-linux-gnu/install.sh
  # shellcheck disable=SC2016
  echo 'export PATH="$PATH:$HOME/.cargo/bin"' >> "$HOME"/.profile

  # Install DataFusion CLI
  if ! command -v datafusion-cli > /dev/null 2>&1; then
    cargo install datafusion-cli
  fi

  # Install Go
  if ! command -v go > /dev/null 2>&1; then
    curl -L -O https://go.dev/dl/go1.23.3.linux-amd64.tar.gz
    sudo rm -rf /usr/local/go && sudo tar -C /usr/local -xzf go1.23.3.linux-amd64.tar.gz
    # shellcheck disable=SC2016
    echo 'export PATH="$PATH:/usr/local/go/bin:$HOME/go/bin"' >> "$HOME"/.profile
    source "$HOME"/.profile
  fi

  # Install SuperDB
  if ! command -v super > /dev/null 2>&1; then
    git clone https://github.com/brimdata/super.git
    cd super
    make install
  fi

  cd scripts/super-cmd-perf

fi

rundir="$(date +%F_%T)"
mkdir "$rundir"
report="$rundir/report_$rundir.md"

echo -e "|**Software**|**Version**|\n|-|-|" | tee -a "$report"
for software in super duckdb datafusion-cli clickhouse
do
  if ! command -v "$software" > /dev/null; then
    echo "error: \"$software\" not found in PATH"
    exit 1
  fi
  echo "|$software|$("$software" --version)|" | tee -a "$report"
done
echo >> "$report"

# Prepare the test data
./prep-data.sh "$rundir"

# Run the queries and generate the summary report
./run-queries.sh "$rundir"

if [ -n "$RUNNING_ON_AWS_EC2" ]; then
  mv "$HOME/runlog.txt" "$rundir"
fi
#!/bin/bash -xv
set -euo pipefail
pushd "$(cd "$(dirname "$0")" && pwd)"

if [ "$#" -ne 1 ]; then
  echo "Specify results directory string"
  exit 1
fi
rundir="$(pwd)/$1"
mkdir -p "$rundir"

RUNNING_ON_AWS_EC2="${RUNNING_ON_AWS_EC2:-}"
if [ -n "$RUNNING_ON_AWS_EC2" ]; then
  cd /mnt
fi

function run_cmd {
  outputfile="$1"
  shift
  { hyperfine \
      --show-output \
      --warmup 0 \
      --runs 1 \
      --time-unit second \
      "$@" ;
  } \
    > "$outputfile" \
    2>&1
}

mkdir gharchive_gz
cd gharchive_gz
for num in $(seq 0 23)
do
  curl -L -O "https://data.gharchive.org/2023-02-08-${num}.json.gz"
done
cd ..

DUCKDB_MEMORY_LIMIT="${DUCKDB_MEMORY_LIMIT:-}"
if [ -n "$DUCKDB_MEMORY_LIMIT" ]; then
  increase_duckdb_memory_limit='SET memory_limit = '\'"${DUCKDB_MEMORY_LIMIT}"\''; '
else
  increase_duckdb_memory_limit=""
fi

run_cmd \
  "$rundir/duckdb-table-create.out" \
  "duckdb gha.db -c \"${increase_duckdb_memory_limit}CREATE TABLE gha AS FROM read_json('gharchive_gz/*.json.gz', union_by_name=true)\""

run_cmd \
  "$rundir/duckdb-parquet-create.out" \
  "duckdb gha.db -c \"${increase_duckdb_memory_limit}COPY (from gha) TO 'gha.parquet'\""

run_cmd \
  "$rundir/super-bsup-create.out" \
  "super -o gha.bsup gharchive_gz/*.json.gz"

du -h gha.db gha.parquet gha.bsup gharchive_gz

-- Query: count events by type for the duckdb/duckdb repo
-- ('__SOURCE__' is a placeholder for each platform's data source)
SELECT count(),type
FROM '__SOURCE__'
WHERE repo.name='duckdb/duckdb'
GROUP BY type

-- Query: count events by a single actor login
SELECT count()
FROM '__SOURCE__'
WHERE actor.login='johnbieren'

-- Query: count events containing a search string
SELECT count()
FROM '__SOURCE__'
WHERE grep('in case you have any feedback 😊')