The purpose of this package is to document the performance testing of the TOAST package on current and new systems. This includes the installation of dependencies and TOAST itself, as well as running several example workflows.
The TOAST package is a hybrid C++ / Python framework:
https://github.com/hpc4cmb/toast
The required dependencies include several compiled libraries as well as some standard python packages. This toast-bench git repo documents running TOAST benchmarks using a variety of installation methods, from pre-built binaries to running scripts that build all dependencies with custom compiler options.
After installing TOAST (see below), you should have the toast_benchmark.py
script in
your executable search path. Although this script will run without MPI, you should have
the mpi4py
package installed in order to do any useful tests. The
toast_benchmark.py
script will automatically choose one of several pre-set workflow
sizes based on the size and configuration of the job you run. You can see the small
number of commandline options with:
%> toast_benchmark.py --help
usage: toast_benchmark.py [-h] [--node_mem_gb NODE_MEM_GB] [--dry_run DRY_RUN]
Run a TOAST workflow scaled appropriately to the MPI communicator size and available memory.
optional arguments:
-h, --help show this help message and exit
--node_mem_gb NODE_MEM_GB
Use this much memory per node in GB
--dry_run DRY_RUN Comma-separated total_procs,node_procs to simulate.
The --node_mem_gb
option is used to override the amount of RAM to use on each node, in
case the amount detected by psutil
gives incorrect results. The --dry_run
option
will simulate the job setup given the total number of MPI processes and the number of
processes per node. These are specified as two numbers specified by a comma.
Before launching real jobs, it is useful to test the job setup in dry-run mode. You can
do these tests serially. The script will select the workflow size based on your
commandline values to the --dry_run
option and will create the job output directory,
configuration log file, and some input data files. So this is a way of checking that
everything makes sense before submitting the actual job. For example, to simulate the
job setup for running 1024 MPI processes with 16 processes per node (so 64 nodes), on a
system with 90GB of RAM per node, you can do:
%> toast_benchmark.py --node_mem_gb 90 --dry_run 1024,16
TOAST INFO: TOAST version = 2.3.8.dev9
TOAST INFO: Using a maximum of 4 threads per process
TOAST INFO: Running with 1 processes at 2020-10-08 09:35:26.808623
TOAST INFO: DRY RUN simulating 1024 total processes with 16 per node
TOAST INFO: Minimum detected per-node memory available is 57.53 GB
TOAST INFO: Setting per-node available memory to 90.00 GB as requested
TOAST INFO: Job has 64 total nodes
TOAST INFO: Examining 7 possible cases to run:
TOAST INFO: tiny : requires 1 nodes for 16 MPI ranks and 90.0GB per node
TOAST INFO: xsmall : requires 1 nodes for 16 MPI ranks and 90.0GB per node
TOAST INFO: small : requires 1 nodes for 16 MPI ranks and 90.0GB per node
TOAST INFO: medium : requires 5 nodes for 16 MPI ranks and 90.0GB per node
TOAST INFO: large : requires 47 nodes for 16 MPI ranks and 90.0GB per node
TOAST INFO: xlarge : requires 470 nodes for 16 MPI ranks and 90.0GB per node
TOAST INFO: heroic : requires 4700 nodes for 16 MPI ranks and 90.0GB per node
TOAST INFO: Selected case 'large'
TOAST INFO: Using groups of 1 nodes
TOAST INFO: Using 64 detectors for approximately 180 days
TOAST INFO: Generating input schedule file toast_001024_grp-0016p-01n_20201008-09:35:26/inputs/schedule.txt:
TOAST INFO: Adding patch "BICEP"
TOAST INFO: Rectangular format
TOAST INFO: Creating 'toast_001024_grp-0016p-01n_20201008-09:35:26/inputs'
TOAST INFO: Global timer: toast_ground_schedule: 35.72 seconds (1 calls)
TOAST INFO: Generating input map toast_001024_grp-0016p-01n_20201008-09:35:26/inputs/cmb.fits
TOAST INFO: Exit from dry run
You can see that for this configuration, the script would choose the "large" workflow case. The script actually goes through the process of creating an input sky signal map for the simulation and also some number of days of telescope observing schedules. These are the same setup operations that would happen at the beginning of a real run. You can also see the different node counts needed for each of the possible workflow tests assuming the same memory per node and processes per node that you specified.
A good starting point is to begin with a single-node job. Choose how many processes you will be using per node. Things to consider:
-
Most of the parallelism in TOAST is process-level using MPI. There is some limited use of OpenMP and there will soon be support for CUDA / OpenCL for some workflow modules. However, running with more MPI ranks generally leads to better performance at the moment.
-
There is some additional memory overhead on each node, so running nodes "fully packed" with MPI processes may not be possible. You should experiment with different numbers of MPI ranks per node.
-
Make sure to set
OMP_NUM_THREADS
appropriately so that the(MPI ranks per node) X (OpenMP threads)
equals the total number of physical cores on each node.
Here is an example running interactively on cori.nersc.gov
, which uses the SLURM
scheduler:
# 16 ranks per node, each with 4 threads.
# 4 cores left for OS use.
# Depth is 16, due to 4 hyperthreads per core.
%> export OMP_NUM_THREADS=4
%> srun -C knl -q interactive -t 00:30:00 -N 1 -n 16 -c 16 \
toast_benchmark.py
After doing dry-run tests and running very small jobs you can increase the node count to support something like the small / medium workflow cases. At this point you can test the effects of adjusting the number of MPI processes per node. After you have found a configuration that seems the best, increase the node count again to run the larger cases.
At the end of the job a "Science Metric" is reported. This is currently based on the number of detectors, total number of data samples processed and the node-seconds that were used. The different exponents in the metric attempt to weight different aspects of the job in an appropriate way. For example, often it is critical to process a group of hundreds to several thousands of detectors together in a single job. The more observing time we can process in a single job also leads to better quality results. The metric is currently computed as:
Metric = (constant prefactor) x (Number of detectors)^2 x (1.0e-3 * Total samples)^1.2
/ (Run seconds * Number of nodes)
Where the run time only includes the actual science calculations and not the serial job setup portion. From this equation, we can see that "large" nodes would be preferred. This Metric should be further divided by the "node watts" to obtain a relative measure of "Science per Watt".
This benchmark script is testing a single TOAST workflow at different data volumes. There are hundreds of possible workflows, but we have tried to capture the relevant features in this one script. For each TOAST release, there will be improvements to the software and there may also be changes to the benchmark script to make the results more meaningful and realistic.
This raises a critical reminder: any benchmark results should be accompanied by the associated job log file, which includes information about the code version and parameters that were used. Benchmarks should only be compared for the same TOAST release.
There are several ways to install TOAST. Pre-built binary packages are available as pip wheels and also through the conda-forge channel. When running benchmarks on a new or unusual system, you may achieve better performance building TOAST and some or all of the dependencies from scratch. However, it can be useful to run the tests with pre-built packages as a starting point for the performance testing and as a point of comparison.
A full discussion of TOAST installation is in the official documentation. Here we provide a summary. Make sure that you have a Python3 installation that is at least version 3.6.0:
%> python3 --version
Python 3.8.2
For installation of packages from PyPI (the most general method), first create a new
"virtualenv" that will act as a sandbox environment for our installation. You can call
this whatever you like, but for this example we will make a location called toast
in
our home directory:
%> python3 -m venv ${HOME}/toast
Now activate this environment:
%> source ${HOME}/toast/bin/activate
Within this virtualenv, update pip to the latest version. This is needed in order to install more recent wheels from PyPI:
%> python3 -m pip install --upgrade pip
Next, use pip to install toast and its requirements (note that the name of the package is "toast-cmb" on PyPI):
%> python3 -m pip install toast-cmb
Although TOAST does not require MPI, it is needed for any meaningful benchmarking work.
We will install the mpi4py
package, but first need to make sure that we have a working
MPI C compiler. You can read about install options for mpi4py
here. For example, if your MPI
compiler wrapper is called "cc
", then you would do:
# (Substitute the name of your MPI C compiler)
%> MPICC=cc python3 -m pip install --no-cache-dir mpi4py
It is always a good idea to run the TOAST unit tests with your new installation:
# Example: generic Linux:
%> export OMP_NUM_THREADS=1
%> mpirun -np 4 python3 -c 'import toast.tests; toast.tests.run()'
# Example: on cori.nersc.gov with slurm:
%> export OMP_NUM_THREADS=4
%> srun -N 1 -n 4 -C haswell -t 00:30:00 python3 -c 'import toast.tests; toast.tests.run()'
If you are already using the conda
python package manager, then you may find it easier
to install TOAST from the packages in the conda-forge
package repository. Before
doing any conda operations (like creating an environment), you must make sure that conda
has been initialized with the conda init
command or by manually sourcing the shell
snippet to do this. Simply having the conda
command in your search path is not
sufficient. For example, on the cori system at NERSC you would do:
# Load conda command into your path:
module load python
# Actually initialize conda:
conda init
The init
command adds a shell snippet to your ~/.bashrc
. Now reload your shell
resource file or log out and log back in. Now we are ready to use the conda tool.
NOTE: If you don't want to always load the conda software stack in your ~/.bashrc, then you can create your own shell function which does the same thing. For example, at NERSC, you could add this function to ~/.bashrc:
load_conda () {
module load python
conda_prefix=$(dirname $(dirname $(which conda)))
source "${conda_prefix}/etc/profile.d/conda.sh"
}
Then from a fresh shell you can selectively do:
%> load_conda
Only when you want to use this conda installation. Now that our conda tool is initialized, we can install TOAST. Begin by creating a new conda environment:
%> conda create -y -n toast
Now activate this environment and install TOAST:
%> conda activate toast
%> conda install -y -c conda-forge toast
The conda tool currently installs a replacement linker (ld
program) which frequently
breaks compilation on some systems. We are going to remove it from our environment, so
that we can successfully compile mpi4py
in the next step:
%> rm -f $(dirname $(dirname $(which python)))/compiler_compat/ld
For meaningful benchmarking, we need to install mpi4py
.
NOTE: On HPC systems, you should install mpi4py with pip, not conda. This will allow compilation of mpi4py against the system compilers.
# On a laptop or workstation:
%> conda install mpi4py
# On an HPC system with a vendor MPI installation:
# (Substitute the name of your MPI C compiler)
%> MPICC=cc python3 -m pip install --no-cache-dir mpi4py
It is always a good idea to run the TOAST unit tests with your new installation:
# Example: generic Linux:
%> export OMP_NUM_THREADS=1
%> mpirun -np 4 python3 -c 'import toast.tests; toast.tests.run()'
# Example: on cori.nersc.gov with slurm:
%> export OMP_NUM_THREADS=4
%> srun -N 1 -n 4 -C haswell -t 00:30:00 python3 -c 'import toast.tests; toast.tests.run()'
This git repo includes some scripts to install the minimum dependencies needed by TOAST for the benchmarks. However, there are some external requirements that must already exist on the system:
- A C++11 compiler
- CMake (>= 3.12)
- Python3 (>= 3.6)
- Python packages: numpy, scipy, h5py
- MPI and mpi4py (for effective parallelism)
Before installing the remaining dependencies, you should make sure the above tools are loaded into your environment.
WARNING: conda installs its own version of a linker. If you are using a conda environment for your Python, check if the file
compiler_compat/ld
exists in the conda environment prefix. If so, remove it or rename it.
If you are using a conda environment, make sure to install the mpi4py
package with pip
(not conda). Read the documentation of that package to see how to make sure it finds
your intended MPI compiler.
The remaining minimal dependencies can be installed with the included install_deps.sh
script. The defaults in this script use the GNU compilers, but there are comments in
the script about where to make changes for other use cases. You can comment out the
installation of some packages completely if you already have an optimized replacement
installed. There are also several examples (example_deps_*.sh
) included. The script
supports installing:
- OpenBLAS (not needed if using an alternate BLAS/LAPACK)
- FFTW (not needed if using Intel MKL)
- SuiteSparse (for atmosphere simulations)
- libaatm (for atmosphere simulations)
- Python packages: astropy, healpy, ephem
After modifying this script to your needs, just run it:
%> ./install_deps.sh
After installing the dependencies, this script prints out an example shell function
(load_toast
) which can be added to your ~/.bashrc
for easy loading of this software
stack.
After loading the software stack you installed in the previous section, we can install
TOAST to this same location. The included install_toast.sh
script is an example that
sets up the compilers and dependency locations, checks out a version of TOAST from
github, and installs it to the same location as the compiled dependencies. There are
some examples (example_toast_*.sh
) as well. After modifying, run this script:
%> ./install_toast.sh
After installation, it is a good idea to run the unit test suite (see previous sections).
The compiled python extension in TOAST links to external libraries for BLAS/LAPACK and
FFTs. The python Numpy package may use some of the same libraries for its internal
operations. Inside a single process, a shared library can only be loaded once. For
example, if numpy links to libmkl_rt
and so does TOAST, then only one copy of
libmkl_rt
can be loaded- and the library which actually gets loaded depends on the
order of importing the toast
and numpy
python modules!
For example, here are some combinations and the result:
TOAST Built With | Python Using | Result |
---|---|---|
Statically linked OpenBLAS (binary wheels on PyPI) | numpy with any LAPACK | Works since TOAST does not load any external LAPACK. |
Intel compiler and MKL | numpy with MKL | Broken, UNLESS both MKLs are ABI compatible and using the Intel threading layer. |
GCC compiler and MKL | numpy with MKL | Broken, since TOAST uses the MKL GNU threading layer and numpy uses the Intel threading layer (only one can be used in a single process). |
GCC and system OpenBLAS | numpy with MKL | Works: different libraries are dl-opened by TOAST and numpy. |
Intel compiler and MKL | numpy with OpenBLAS (from conda-forge) | Works: different libraries are dl-opened by TOAST and numpy. |
When in doubt, run the toast unit tests with OpenMP enabled (i.e. OMP_NUM_THREADS
>
1). This should exercise code that will fail if there is some incompatibility. After
the unit tests pass, you are ready to run the benchmarks.