Gromacs was picked because it is among a few Spack recipes that has explicit uarch directives.
- Spack built uarch optimized binaries provide best results
- hyperthreading helps
- for Zen2, use the SandyBridge binary, not Zen2
All results for benchMEM, ns/day = higher is better
-
Load balance on and auto pinning is the most performant (those should be the defaults, or -dlb yes -pin auto)
-
uarch optimized binaries are important, esp. for SKL:
Zen2 64 CPU cores, HT off
CPU | Nehalem | SandyBridge | Zen2 | Skylake |
---|---|---|---|---|
uarch | SSE4.2, HT | AVX, HT | AVX2, no HT | AVX512 |
ns/day | 72.26 | 87.12 | 78.91 | N/A |
SKL 32 CPU cores, HT on
CPU | Nehalem | SandyBridge | Zen2 | Skylake |
---|---|---|---|---|
uarch | SSE4.2 | AVX | AVX2 | AVX512 |
ns/day | 35.89 | 41.58 | 44.19 | 51.65 |
- Hyperthreading slower on Zen2 with Zen2 optimized binary, faster with Intel optimized binaries:
Zen 2 HT comparison
Nehalem | SandyBridge | Zen2 MKL | Zen 2 OB | |
---|---|---|---|---|
HT | 72.26 | 87.12 | 76.44 | 76.00 |
no HT | 67.24 | 85.07 | 78.91 | 70.98 |
SKL HT comparison
Nehalem | SandyBridge | Zen2 MKL | Zen 2 OB | |
---|---|---|---|---|
HT | 35.89 | 41.58 | 44.19 | 51.65 |
no HT | 33.93 | 38.46 | 42.61 | 48.87 |
-
OpenBLAS is slower than MKL on Zen2, but, close when HT is on
-
Parallel scaling is about the same on Zen2 and SKL columns 8, 16, 32, 64 cores: Zen2:
Cores | 8 | 16 | 32 | 64 |
---|---|---|---|---|
Ideal sc. | 1 | 2 | 4 | 8 |
Actual sc. | 1.00 | 1.85 | 2.98 | 4.12 |
SKL:
Cores | 8 | 16 | 32 |
---|---|---|---|
Ideal sc. | 1 | 2 | 4 |
Actual sc. | 1.00 | 1.81 | 2.79 |
- MKLs LD_PRELOAD helps on Zen2:
LD_PRELOAD comparison, SandyBridge
off | on |
---|---|
83.90 | 85.07 |
- Build 3 binaries, nehalem, sandybridge and skylake_avx512
- For Zen2 use sandybridge binary
- Hardcode libfakeintel.so into the sandybridge binary
/uufs/chpc.utah.edu/sys/installdir/amdmkl/bin/patchmkl.sh /uufs/chpc.utah.edu/sys/spack/linux-centos7-sandybridge/gcc-10.2.0/gromacs-2020.4-e6jluwefhuwdbcwctqp2gfdiepkh3jsg/bin/gmx_mpi
spack install gromacs%[email protected]+lapack^intel-mkl threads=openmp target=nehalem
spack install gromacs%[email protected]+lapack^intel-mkl threads=openmp target=sandybridge
spack install gromacs%[email protected]+lapack^intel-mkl threads=openmp target=skylake_avx512
NWCHEM was selected because it is expected to heavily utilize accelerated linear algebra.
Runtime in seconds, lower is better.
-
uarch optimized binaries have small effect on execution time (7% on zen2, 5% on skl wrt. nehalem build)
-
MKL is not being used much since LD_PRELOAD trick does not have any effect on runtime.
uarch binary | nehalem | nehldpreload | sb | sbldpreload | skl | sklldpreload | zen2 | zen2ldpreload |
---|---|---|---|---|---|---|---|---|
zen2 node | 1162 | 1156 | 1086 | 1089 | 1112 | 1142 | 1112 | 1142 |
skl node | 2201 | 2235 | 2098 | 2189 |
- Build 3 binaries, nehalem, sandybridge and skylake-avx512
- For Zen2 use sandybridge binary
- No need to do anything with LD_PRELOAD
QE is also expected to have strong linear algebra dependency
Runtime in seconds, lower is better.
-
uarch optimized binaries have no effect on the Zen2 and about 5% improvement on SB and SKL over NEH
-
LD_PRELOAD has a huge effect on the performance, which suggests that most of time is spent in the MKL routines
nehalem | nehldpreload | sb | sbldpreload | skl | sklldpreload | |
---|---|---|---|---|---|---|
zen2_node | 718 | 378 | 715 | 377 | 714 | 377 |
skl_node | 467 | 442 | 443 | 444 |
- Build binaries, nehalem, sandybridge
- For Zen2 use sandybridge binary.
- Hardcode libfakeintel.so into the sandybridge binary
for i in `ls /uufs/chpc.utah.edu/sys/spack/linux-centos7-sandybridge/gcc-8.3.0/quantum-espresso-6.6-nbtlug57rby5ejjinegl62ak4uagpmpj/*.x`; do /uufs/chpc.utah.edu/sys/installdir/amdmkl/bin/patchmkl.sh $i; done
spack install quantum-espresso+openmp+epw+elpa hdf5=parallel ^intel-mkl threads=openmp target=nehalem
spack install quantum-espresso+openmp+epw+elpa hdf5=parallel ^intel-mkl threads=openmp target=sandybridge