Redesigned "caar loop pre-boundary exchange", tuned for Frontier, also faster on Perlmutter GPU #6972
base: master
Conversation
Trey, for clarity, would you mind explaining how you take a line from a timer file and derive the entry in the table? As an example, from this line [timer line omitted] we see that the maximum is 43.766 sec and the average is 1.360819e+04/320 = 42.52559375. The quantity being measured is "time spent in DIRK over the course of the run", where each rank makes 4.608000e+06/320 = 14400.0 calls to DIRK over the course of the run.
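For concreteness, here is a minimal sketch of that arithmetic. The three inputs are hardcoded from the quoted totals (320 ranks); the printed labels are illustrative and not part of the timer file's format:

```cpp
// Reproduces the per-rank averages quoted above from the rank-summed totals.
#include <cstdio>

int main() {
  const double total_calls = 4.608000e+06; // DIRK calls, summed over all ranks
  const double total_time  = 1.360819e+04; // seconds, summed over all ranks
  const double max_time    = 43.766;       // seconds, slowest rank
  const int    nranks      = 320;

  std::printf("calls per rank:    %.1f\n",  total_calls / nranks); // 14400.0
  std::printf("avg time per rank: %.8f s\n", total_time / nranks); // 42.52559375
  std::printf("max (table entry): %.3f s\n", max_time);
  return 0;
}
```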
The times listed above are from the fourth column of numbers in the timer output. [Timer excerpts omitted for each configuration: Frontier GPU (original, new), Perlmutter GPU (original, new), Perlmutter CPU Gnu (original, new), and Perlmutter CPU Intel (original, new).]
Thanks, Trey. That clears things up. Note that in your table you wrote 1.244864e+04 as 1245 rather than 12449, and similarly for the other CPU numbers.
Oops! Fixed.
Would it make sense to break this PR up? It looks like there may be changes not specifically related to the new CAAR design?
I would say that all the changes are related to the new Caar design. The new BFB build files, for example, are only tested for the Caar unit test. I think each change can be traced back to a Caar dependence.
Removing […]. Also, I noticed […]
Still unsure about the double underscore noted above, but I tried the NESAP ne1024 benchmark (without the double underscore) on pm-gpu, and we are seeing around a 9% overall improvement, though a number of other changes may also have had some impact.
Sorry, I missed this comment. Yes, typo. Just pushed a fix. Thanks, @ndkeen!
This pull request attempts to provide Frontier optimizations even better than those used in the 2023 Gordon Bell Climate runs, but with a software architecture that meets the requirements of https://acme-climate.atlassian.net/wiki/x/oICd6, and with additional changes to reduce slowdown on Perlmutter CPUs.
It replaces pull request #6522.
Summary of changes:

- New `struct`s in `SphereOperators.hpp` to use for the new Caar pre-boundary exchange.
- New functions and macros in `SphereOperators.hpp` that allow code that uses implicit parallelism and vector registers on GPUs to add explicit loops and temporary arrays on CPUs (sketched below).
- `#if (WARP_SIZE == 1)` preprocessor directives in `SphereOperators.hpp` to try to minimize CPU-specific code.
- Changes to `zbelow` in `SphereOperators.hpp` to support `Scalar` types with `VECTOR_SIZE > 1` on CPUs (sketched below).
- A new `CaarFunctorImpl.cpp` that implements the new `caar_compute` function and template functions with Kokkos loops. The single source code supports both GPUs and CPUs by relying on functions and macros defined in `SphereOperators.hpp`.
- A modified `CaarFunctorImpl.hpp` with new functions, slight changes to temporary buffers, and an `#if` to turn the new `caar_compute` on and off. If we adopt the new implementation permanently, significant code can be eliminated from this file.
- New `viewAsReal` functions in `ViewUtils.hpp`.
- Removal of `LaunchBounds<512,1>` calls, which I think are incorrect for AMD GPUs, where the Kokkos teams sometimes use 1024 threads (sketched below).
- New `frontier-bfb.cmake`, `frontier-bfb-serial.cmake`, and `pm-cpu-bfb.cmake` files for bit-for-bit unit testing of Caar.

I confirmed that the modified code passes the `caar_ut` unit test, and I ran a single-node NE30 test from Noel Keen on Frontier, Perlmutter GPU, and Perlmutter CPU. Here is a comparison of total "caar compute" times, summed over all MPI tasks on the node (8 on Frontier GPU, 4 on Perlmutter GPU, 128 on Perlmutter CPU). [Table omitted; its columns compared the original code (`#if 0`) with the new code (`#if 1`).]
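To illustrate the single-source idea from the second and third bullets, here is a hypothetical sketch; the macro names (`POINT_LOOP`, `POINT_TEMP`, `POINT`) and the helper `team_point_index()` are invented for illustration and are not the actual `SphereOperators.hpp` definitions:

```cpp
#ifndef WARP_SIZE
#define WARP_SIZE 1 // pretend we are compiling for a CPU
#endif

constexpr int NP = 4; // Gauss points per element edge, as in HOMME

#if (WARP_SIZE == 1)
// CPU: one thread per team, so add an explicit loop and a temporary array.
#define POINT_LOOP(i) for (int i = 0; i < NP * NP; ++i)
#define POINT_TEMP(name) double name[NP * NP]
#define POINT(name, i) name[(i)]
#else
// GPU: NP*NP team threads execute the body implicitly, and the temporary
// lives in a register. team_point_index() is a hypothetical helper that
// returns this thread's point.
#define POINT_LOOP(i) const int i = team_point_index();
#define POINT_TEMP(name) double name
#define POINT(name, i) name
#endif

// A kernel body written once for both architectures.
void scale_points(const double* u, double* out) {
  POINT_TEMP(tmp);
  POINT_LOOP(i) {
    POINT(tmp, i) = 2.0 * u[i]; // stand-in for real sphere-operator math
    out[i] = POINT(tmp, i);
  }
}

int main() {
  double u[NP * NP], out[NP * NP];
  for (int i = 0; i < NP * NP; ++i) u[i] = i;
  scale_points(u, out);
  return out[3] == 6.0 ? 0 : 1;
}
```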
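The `zbelow` bullet concerns packed vertical data: with `VECTOR_SIZE > 1`, the value one level below a given entry may live in the previous vector pack rather than the same one. A hypothetical scalar sketch of that indexing (not the actual `SphereOperators.hpp` code):

```cpp
// NLEV vertical levels stored as NLEV/VECTOR_SIZE packs of VECTOR_SIZE
// doubles. The level below entry (pack p, lane l) is lane l-1 of the same
// pack, or the last lane of pack p-1 when l == 0.
#include <array>

constexpr int VECTOR_SIZE = 2;
constexpr int NLEV = 8;
constexpr int NPACK = NLEV / VECTOR_SIZE;
using Pack = std::array<double, VECTOR_SIZE>;

double zbelow(const Pack packs[NPACK], int p, int l, double bc) {
  if (l > 0) return packs[p][l - 1];
  if (p > 0) return packs[p - 1][VECTOR_SIZE - 1];
  return bc; // boundary value at the end of the column
}

int main() {
  Pack packs[NPACK];
  for (int p = 0; p < NPACK; ++p)
    for (int l = 0; l < VECTOR_SIZE; ++l)
      packs[p][l] = p * VECTOR_SIZE + l; // store the level index as the value
  // Level below level 2 (pack 1, lane 0) is level 1 (pack 0, lane 1).
  return zbelow(packs, 1, 0, -1.0) == 1.0 ? 0 : 1;
}
```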
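On the `LaunchBounds` bullet: `Kokkos::LaunchBounds<512, 1>` promises the compiler that a kernel never launches more than 512 threads per block, so a 1024-thread team, which Kokkos can choose on AMD GPUs, would violate that promise. A sketch of the change, with illustrative names rather than the actual E3SM call sites:

```cpp
#include <Kokkos_Core.hpp>

void launch_sketch(const int nelem) {
  // Before (removed in this PR): caps threads per block at 512.
  // using Policy = Kokkos::TeamPolicy<Kokkos::LaunchBounds<512, 1>>;

  // After: no LaunchBounds, so the compiled bound matches whatever team
  // size Kokkos actually picks, including 1024 threads on AMD GPUs.
  using Policy = Kokkos::TeamPolicy<>;
  Kokkos::parallel_for(
      "caar-like-kernel", Policy(nelem, Kokkos::AUTO),
      KOKKOS_LAMBDA(const Policy::member_type& team) {
        (void)team; // kernel body elided
      });
}

int main(int argc, char** argv) {
  Kokkos::initialize(argc, argv);
  launch_sketch(16);
  Kokkos::finalize();
  return 0;
}
```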
The good news is that the new code is faster on both Frontier and Perlmutter GPUs. The bad news is that it slows down Perlmutter CPUs. In particular, it appears to inhibit whatever optimization the Intel compiler is able to do over the Gnu compiler.