-
Notifications
You must be signed in to change notification settings - Fork 16
A Brief Look at GPU Refactoring and Porting Choices for Climate and Weather Models
Most existing physics packages and many dynamics packages are structured so that a loop over columns wraps around a number of subroutine calls, e.g.:
do icol = 1 , ncol
call microphysics(rho_d(:,icol),temp(:,icol),rho_v(:,icol),...)
call gravity_wave_drag(...)
call radiation(...)
! ...
enddo
! ...
subroutine microphysics(rho_d,temp,rho_v,...)
real(rp), intent(in ) :: rho_d(nlev)
real(rp), intent(inout) :: temp(nlev), rho_v(nlev), ...
do ilev = 1 , nlev
! ...
enddo
endsubroutine
There are a number of potential issues with this that will be discussed in the next section. It creates one giant kernel that may suffer register spillage, may have performance issues with arbitrary nlev
, and may be quite hard to debug and profile. To improve this behavior, you can "push the looping down the callstack", resulting in the following:
call microphysics(rho_d,temp,rho_v,...)
call gravity_wave_drag(...)
call radiation(...)
! ...
subroutine microphysics(rho_d,temp,rho_v,...)
real(rp), intent(in ) :: rho_d(nlev,ncol)
real(rp), intent(inout) :: temp(nlev,ncol), rho_v(nlev,ncol), ...
do icol = 1 , ncol
do ilev = 1 , nlev
! ...
enddo
enddo
endsubroutine
Once the looping is pushed down the callstack:
- You can split the work into as many or as few kernels as you wish.
- You are no longer at the mercy of the number of vertical levels having to be a GPU-friendly number (e.g., multiple of 32) because you can collapse the parallelism into a single dimension.
- You can control register spillage by fissioning kernels however you wish.
- You can easily debug and profile, determining which section of code is failing or performing poorly
- You can isolate vertically dependent work (loop carried dependencies) and handle it separately from other kernels
There are some potential advantages to pushing looping down the callstack, but, it is a lot of effort.
- You have to promote all variables inside subroutines (parameters and most internal variables) to include an extra
ncol
dimension. This is by far the most labor-intensive operation. - It can be done incrementally, but you need a solid and fast-running testing framework to catch bugs.
- This can increase local memory requirements since temporary arrays dimensioned as
nlev
are now dimensioned asnlev x ncol
.
One potential issue with this approach on CPUs is that you may overflow cache where you might have fit in cache before.
The mitigation for this is to allow the user to pass as many or few columns as they wish into the physics routines in a flexible way. The code is very similar to pushing all looping down the callstack. This way, on CPUs, you can pass few columns down, and on GPUs, you can pass many columns down.
The efficacy of pushing looping down the call stack (see the previous section) really boils down to the idea of "one big kernel" versus "many small kernels". There are pros and cons to each.
First, if you choose "one big kernel" in climate and weather model physics, and you want to thread over the vertical dimension, you are almost certainly committing to "hierarchical parallelism". In this case, it is typically two-level parallelism with ncol
being the coarser and outer-level looping; and nlev
being the finer and inner-level looping. Ostensibly, this would thread nlev
within a Streaming Multiprocessor (SM, basically a vector unit) on an Nvidia GPU, and it would thread ncol
across multiple SMs in an embarrassingly parallel fashion.
This requires that nlev
be a GPU-friendly number -- often a multiple of 32.
This also potentially increases complexity because all code in the giant kernel must be on the GPU with no exceptions. Many physics packages setup and use indirect addressing, like to perform output to stdout
and stderr
, and have loop-carried dependencies such as hydrostatic summation or radiation tendency calculations. These can be complex to deal with when you cannot fission the kernel or run anything on the host.
Another con of one big kernel is that you typically have order 1K - 10K lines of code (some times more) in a single kernel. The higher the volume of code (the more lines of code) you have in a single kernel, the more registers the compiler typically needs to allocate. Registers are a highly limited resource on GPUs, leading to two issues:
- More registers means you can place fewer threads inside a single vector unit, which limits your ability to use the whole GPU (i.e., "occupancy")
- You typically end up "spilling" registers, which usually incurs a large performance penalty
Thus, when registers used are large or when registers "spill", you typically see significantly reduced performance.
On GPUs, kernels take a certain amount of time to "launch" (typically 5-20 microseconds, depending on the GPU architecture). Climate applications in particular have such a strong throughput constraint that they strong scale to the point that there is barely enough work to properly use a GPU. In these cases, kernel runtimes are frequently of the same order of launch latencies. Thus, launching many small kernels incurs many launch latency costs whereas launching one large kernel incurs very little launch latency cost.
One pro of the "one big kernel" approach is that it significantly reduces the number of kernel launches, and therefore, it can significantly outperform the "many small kernels" approach in the highest strong scaling limits.
Another potential issue with the "one big kernel" approach is that debugging and profiling can be challenging. If you get an invalid memory address in a kernel that is 10K lines long, it can be extremely difficult to determine where that error is occurring. Also, if the kernel performs very poorly, it can be very hard to determine why and which section of code (if it can be narrowed down to any particular section) is causing the problem.
With many small kernels, you know which kernel is performing poorly, and you know which kernel gives an invalid memory address error. It makes it significantly easier to determine the source of the problem by automatically narrowing it down to a much smaller section of code.
Personally, I believe there is a somewhat optimal middleground between the two approaches.
- Push the looping down the callstack first so that the hard work is already done of promoting variables to include an extra dimension
- Port with many small kernels first to identify bugs and performance problems
- Then merge kernels with few lines of code until you experience too much register pressure or register spillage
I think this workflow will give the fastest results over different problem sizes per GPU, and it will help the debugging efforts of refactoring and porting.
These are certainly not the only choices for a scientific GPU developer, but they are the most common. Fortran + "X" (where X is OpenACC, OpenMP target offload, or do concurrent
) allows the code to remain in Fortran, which is desirable for domain-oriented scientists on many projects.
I cannot stress enough that I don't believe there is a perfect solution to productivity and performance regarding this question. The choice will differ project-by-project, so I want to present the pros and cons of each.
C++ portability libraries are just C++ libraries with vendor code hidden underneath for different hardware backends. It's not a C++ language extension -- just vanilla C++. For more information, please see https://github.com/mrnorman/YAKL/wiki/CPlusPlus-Performance-Portability-For-OpenACC-and-OpenMP-Folks
The libraries I'm most familiar with are (https://github.com/kokkos/kokkos/)[Kokkos] and (https://github.com/mrnorman/YAKL)[YAKL]. They each have "owned" multi-dimensional arrays, parallel_for
kernel launchers for fully independent thread execution, parallel reductions, and data movement routines.
- YAKL: YAKL has Fortran-style multi-dimensional arrays and looping, a limited Fortran intrinsics library, and an automatically used pool allocator for efficient frequent allocations and deallocations with Fortran hooks.
- Kokkos: Kokkos doesn't have YAKL's Fortran behavior capabilities, but it has a lot more access to more detailed GPU hardware aspects like
shared
memory, and it has a larger ecosystem of products as well. Kokkos is a much larger development effort.
If your project does not desire Fortran-like behavior in the resulting user-level code, and it does not do frequent allocations and deallocations, then Kokkos is the better library of choice.
C++ is more pedantic regarding how its data structures are laid out in memory and how they behave in a large number of situations. Fortran is not pedantic about this. For this reason, using class
and struct
objects in GPU code in C++ is significantly better supported overall than you will experience in Fortran. In C++, the user has complete control and understanding of how classes, class members, and class methods behave on the CPU and GPU. The user can manually create constructors, move and copy constructors, and a destructor to make the class behave however they wish.
In general, for non-Nvidia hardware and for non-Nvidia compilers, you need to avoid using %
inside your GPU kernels or directives. This means kernels and data movement statements need to be shielded from classes, type-bound procedures, and sometimes even pointers. For this reason, many projects choose to isolate their Fortran GPU kernel code into a more Fortran 90 style without these newer data structure features.
In Fortran, you're allowed to override the bounds of an array being passed into subroutines. It also does not pass array lower bounds. This means that array index checking is not entirely reliable in a Fortran code. In portable C++ (at least Kokkos and YAKL, but probably the others too), all array metadata is attached to the object itself. Therefore, you cannot implicitly redimension or reshape arrays. This means array index checking is more reliable in portable C++ libraries. You do lose some of the convenience you have in Fortran, such as implied slicing and reshaping at subroutine interfaces, but in exchange for the inconvenience, you get guaranteed safety.
This section is YAKL-specific. Some Fortran codes allocate and deallocate frequently during runtime. YAKL automatically uses a pool allocator for all device allocations, which significantly reduces the cost of allocation and deallocation in C++ YAKL code. In Fortran, in general, one will pay a large penalty for allocations and deallocations during runtime.
Again, C++ is not a salvation from dealing with complex issues. There are benefits over Fortran and detriments with respect to Fortran, but there are also subtle issues that can arise in C++. I will list the main ones below and the resolutions to these issues:
- C++ lambdas do not automatically capture variables by value when they are out of local scope.
- Out-of-local-scope variables are captured by reference (i.e., a CPU reference), which will lead to invalid memory addressing. This happens when using data from your own class in a kernel (i.e.,
this->var
), data from global scope (i.e.::var
), and data from namespace scope (i.e.,mynamespace::var
). -
RESOLUTION: YAKL has a
YAKL_SCOPE()
macro that pulls that variable into local scope so that you can use it in a kernel.
- Out-of-local-scope variables are captured by reference (i.e., a CPU reference), which will lead to invalid memory addressing. This happens when using data from your own class in a kernel (i.e.,
- Calling C++ routines from your own class or using data members from your own class from inside a GPU kernel.
- Any time you call a routine from your own class or use a data member from your own class, you implicitly use the
this->
reference, which is (in general) invalid on the GPU. -
RESOLUTION:
- All class member functions decorated with
YAKL_INLINE
must be declared asstatic
so that it inherently has nothis->
pointer since it belongs to the class itself, not an object of the class.
- All class member functions decorated with
- Any time you call a routine from your own class or use a data member from your own class, you implicitly use the
- You cannot use many
std::
objects and functions on the GPU. This includes things likestd::list
andstd::vector
. So you need to be aware of this and careful. Allstd::
functions that areconstexpr
can be used, since they are done at compile time (e.g.std::is_same
orstd::enable_if
).
Not all Fortran compilers support all directives, and their levels of support can differ widely. The Nvidia compiler appears to have the most robust implementation of OpenACC and OpenMP target offload for Nvidia hardware. For non-Nvidia hardware, you'll need to use Cray or GNU for OpenACC; and Cray, GNU, or Intel for OpenMP target offload. GNU's performance is currently lagging a bit for OpenACC and signficantly for OpenMP target offload. Cray and GNU do not support all parts of the OpenACC standard. OpenMP target offload appears to be the more widely supported standard in the future.
You may also find, particularly in the presence of modern Fortran, that different compilers have different abilities to handle modern constructions, which also informs portability.
If choosing to use Fortran + OpenMP target offload, you'll likely want to avoid use of modern constructs in the kernel code, and you'll need to be prepared for different compilers to have different definitions of where simd
, parallel for
and target teams distribute
parallelism land on different accelerator devices.
There are some considerations regarding using in-language standards-based parallelism approaches like do concurrent
in the latest Fortran standard. These are very attractive options because they are part of the language standard itself, but they do have some limitations.
Because do concurrent
does not have information about separate memory address spaces, it basically relies on some form of Unified Memory. This can cause performance issues because there are many contexts when memory is paged to the device on-demand -- meaning the memory isn't paged to the device until it is requested by a kernel, which leads to large latency costs. If the data is used continuously on the device, this happens less often, though. So the performance penalty is application dependent; i.e., apps that shuffle data between host and device frequently will incur larger costs.
Another issue with do concurrent
is that it has no asynchronous semantics (i.e., asynchronous kernel launches and / or streams). For situations with significant amounts of threading or kernels with many lines of code (where kernel runtime is much greater than launch latencies), this is not a problem. But for situations with very little threading, being able to hide launch latencies with asynchronous kernel launches can be very imporant. This is a situation inevitably encountered in climate and often encountered in weather.
If you favor a hierarchical parallelism approach, the Fortran standard has no statements on where nested do concurrent
loops would land on an accelerator device (just as OpenMP does not either). To be fair, OpenACC does not either, but most if not all compilers place gang looping over SMs and worker and vector looping within SMs.
Another issue is the timeline for supporting do concurrent
constructs for accelerator kernel launching. Currently, the only compiler I'm aware of that supports it is Nvidia compilers on Nvidia hardware.
Further, reduction constructs are not currently in the standard and are waiting to be accepted in a future standard.
Another potential issue with Fortran + directives is that (for reasons I'm not entirely sure of), calling routines from kernels remains a highly buggy feature of implementations of the OpenACC and OpenMP target offload standards. This is something to be aware of especially when using the "one large kernel" approach.
Calling routines from kernels in C++ is much simpler because they are nearly universally inlined, which Fortran compilers typically do not do.
On the GPU, performance is typically about the same between C++ and Fortran + "X". As a general statement, though, C++ code will be slower on the CPU than Fortran code. How much so depends on the compiler, hardware, compiler flags, and Fortran style. For instance, when using -Ofast
, Fortran can be significantly faster (1.5-2x), but when using -O3
, they are often much closer together. Further, with -O3
and Fortran code that allocates frequently during runtime, C++ CPU performance is often the same or a bit faster. Compilers differ widely in performance on the CPU for both Fortran and C++.
Without significant packing and vectorization effort on the C++ code (which often makes it much less readable), you should expect C++ performance to be slower than Fortran on the CPU and about the same on the GPU
Recall that there are a number of C++ portability libraries available. For instance, YAKL can produce user-level code that maintains Fortran's look and feel to a great deal. This can reduce time for porting as well as bug potential. Further, it can increase the chances that domain scientists will feel comfortable reading and innovating the ported codebase. The C++ code could be run for GPU simulations, and the Fortran code could be run for CPU simulations. These can be kept up to date with one another with unit testing to determine when a change is made in one and not the other, causing the codes to diverge.
There are also projects that port to a more extensive library like Kokkos without the look and feel of Fortran, and while some might consider that code less readable, they can still keep it up to date with the original Fortran code with unit testing.
Another question is whether one should expose vertical looping as a parallel dimension. Many modeling centers choose not to do this because of the complexity of dealing with vertical loop-carried dependencies / race conditions, which are numerous in model physics packages. There are costs to not exposing the vertical looping as parallel looping, though.
Climate throughput requirements are so fast that you basically have no choice when running hi-res simulations at 5 Simulated Years Per wallclock Day (SYPD) or faster. These models must be strong scaled significantly to achieve these throughputs for hi-res grids (i.e., less than 50km); and there is so little column looping available per GPU that you need vertical looping.
Weather models have slower throughput requirements, and they might be able to get away without parallelizing the vertical dimension, but as they increase model resolution, the number of columns available per GPU will decrease -- meaning they will likely end up in the same place climate is in now.
It takes some significant effort to expose vertical looping as a parallel dimension because most physics packages have algorithmic dependencies in the vertical dimension. Radiation tendency calculations and hydrostatic summations, for instance, require prefix sums. There are many surface-level fields that need atomic accesses. Identifying these and handling these takes extra care that the user could otherwise ignore when not parallelizing the vertical dimension.