
Welcome to the spark-perf wiki. This page lists useful scripts, helper functions, and analysis tools for running spark-perf tests.

Running tests

Automatically testing against multiple Spark versions

In config.py, read the Spark commit or tag to test from an environment variable:

import os

# Read the Spark commit/tag to benchmark from the environment;
# this raises a KeyError if SPARK_COMMIT_ID is unset.
SPARK_COMMIT_ID = os.environ["SPARK_COMMIT_ID"]
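
With that change in place, a one-off run against a single commit can set the variable inline (a sketch; it assumes bin/run is invoked from the spark-perf checkout):

SPARK_COMMIT_ID="origin/tag/v1.1.0" ./bin/run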

To run against multiple commits, use a shell script to repeatedly call bin/run with different environment variables:

#!/usr/bin/env bash

# Note: array elements are separated by spaces, not commas:
versions=( "origin/tag/v1.1.0" "origin/tag/v1.1.1-rc2" "origin/tag/v1.2.0-snapshot1" "origin/branch-1.2" )
for version in "${versions[@]}"
do
  export SPARK_COMMIT_ID="$version"
  ./bin/run
done
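
When comparing many versions it can help to keep each run's console output. A minimal variant of the loop above (assuming bin/run writes its output to stdout/stderr):

#!/usr/bin/env bash

versions=( "origin/tag/v1.1.0" "origin/tag/v1.1.1-rc2" )
for version in "${versions[@]}"
do
  export SPARK_COMMIT_ID="$version"
  # Replace '/' so the ref name is usable as a filename, and tee the
  # output into a per-version log file.
  ./bin/run 2>&1 | tee "run-$(echo "$version" | tr '/' '_').log"
done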

To print the SHAs of every Nth commit between two git tags (useful for bisecting; the command below samples every 50th commit):

git log --oneline origin/branch-1.2...v1.1.0 | awk 'NR == 1 || NR % 50 == 0' | cut -d ' ' -f1
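
That list of SHAs can also drive the test suite directly. A sketch (it assumes the git log command is run in a checkout that has the Spark repo's branches and tags fetched):

#!/usr/bin/env bash

# Run spark-perf against every 50th commit between v1.1.0 and branch-1.2.
for sha in $(git log --oneline origin/branch-1.2...v1.1.0 \
               | awk 'NR == 1 || NR % 50 == 0' | cut -d ' ' -f1)
do
  export SPARK_COMMIT_ID="$sha"
  ./bin/run
done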

Analyzing results

Uploading logs from spark-ec2 clusters to S3

Upgrade to a newer version of the AWS CLI (the aws tool) and configure your AWS credentials:

sudo easy_install --upgrade awscli
aws configure
  # AWS Access Key ID [None]: AKIAIOSFODNN7EXAMPLE
  # AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
  # Default region name [None]:
  # Default output format [None]:
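
For a non-interactive setup (e.g. inside a script), the AWS CLI also reads credentials from environment variables; the values below are the same placeholder keys shown above:

export AWS_ACCESS_KEY_ID="AKIAIOSFODNN7EXAMPLE"
export AWS_SECRET_ACCESS_KEY="wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
export AWS_DEFAULT_REGION="us-east-1"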

Sync the results folder to an S3 bucket:

aws s3 cp --recursive /local/path/to/results/directory s3://bucket-name/resultsdir/
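
If you upload repeatedly as new results come in, aws s3 sync copies only new or changed files, which is usually faster than re-copying the whole directory:

aws s3 sync /local/path/to/results/directory s3://bucket-name/resultsdir/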