A Brief Look at GPU Refactoring and Porting Choices for Climate and Weather Models
Most existing physics packages and many dynamics packages are structured so that a loop over columns wraps around a number of subroutine calls, e.g.:
```fortran
do icol = 1 , ncol
  call microphysics(rho_d(:,icol),temp(:,icol),rho_v(:,icol),...)
  call gravity_wave_drag(...)
  call radiation(...)
  ! ...
enddo

! ...

subroutine microphysics(rho_d,temp,rho_v,...)
  real(rp), intent(in   ) :: rho_d(nlev)
  real(rp), intent(inout) :: temp(nlev), rho_v(nlev), ...
  do ilev = 1 , nlev
    ! ...
  enddo
endsubroutine
```
There are a number of potential issues with this that will be discussed in the next section. It creates one giant kernel that may suffer register spillage, may have performance issues with an arbitrary `nlev`, and may be quite hard to debug and profile. To improve this behavior, you can "push the looping down the callstack", resulting in the following:
```fortran
call microphysics(rho_d,temp,rho_v,...)
call gravity_wave_drag(...)
call radiation(...)
! ...

subroutine microphysics(rho_d,temp,rho_v,...)
  real(rp), intent(in   ) :: rho_d(nlev,ncol)
  real(rp), intent(inout) :: temp(nlev,ncol), rho_v(nlev,ncol), ...
  do icol = 1 , ncol
    do ilev = 1 , nlev
      ! ...
    enddo
  enddo
endsubroutine
```
Once the looping is pushed down the callstack:
- You can split the work into as many or as few kernels as you wish.
- You are no longer at the mercy of the number of vertical levels having to be a GPU-friendly number (e.g., a multiple of 32), because you can collapse the parallelism into a single dimension (see the sketch after this list).
- You can control register spillage by fissioning kernels however you wish.
- You can easily debug and profile, determining which section of code is failing or performing poorly.
- You can isolate vertically dependent work (loop-carried dependencies) and handle it separately from other kernels.
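As a hypothetical sketch (not from any particular model) of the collapsing mentioned above, the two loop dimensions can be fused into one flat index so that the total thread count, not `nlev` alone, is what matters:

```cpp
#include <Kokkos_Core.hpp>

// Collapse the (icol, ilev) parallelism into a single flat index; 'temp'
// is assumed to be an nlev x ncol device array.
void microphysics(Kokkos::View<double**> temp, int ncol, int nlev) {
  Kokkos::parallel_for("microphysics", ncol * nlev, KOKKOS_LAMBDA(int idx) {
    int icol = idx / nlev;     // recover the column index
    int ilev = idx % nlev;     // recover the vertical index
    temp(ilev, icol) += 1.0;   // placeholder physics
  });
}
```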
Pushing the looping down the callstack has clear advantages, but it is a lot of effort:
- You have to promote all variables inside subroutines (parameters and most internal variables) to include an extra `ncol` dimension. This is by far the most labor-intensive operation.
- It can be done incrementally, but you need a solid and fast-running testing framework to catch bugs.
- This can increase local memory requirements, since temporary arrays dimensioned as `nlev` are now dimensioned as `nlev x ncol`.
One potential issue with this approach on CPUs is that you may overflow cache where you might have fit in cache before.
The mitigation for this is to allow the caller to pass as many or as few columns as they wish into the physics routines in a flexible way. The code is very similar to pushing all looping down the callstack. This way, on CPUs you can pass few columns down, and on GPUs you can pass many columns down.
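A minimal sketch of this flexible chunking, with hypothetical function and parameter names (plain C++ with the vertical index fastest, matching the Fortran layout above):

```cpp
#include <algorithm>

// Physics routine that accepts however many columns the caller passes;
// here 'ncol' is the size of the chunk, not the global column count.
void microphysics(double *rho_d, double *temp, int nlev, int ncol) {
  for (int icol = 0; icol < ncol; icol++)
    for (int ilev = 0; ilev < nlev; ilev++)
      temp[(long)icol*nlev + ilev] += rho_d[(long)icol*nlev + ilev]; // placeholder
}

// Driver: CPUs pass small, cache-friendly chunks; GPUs pass large ones.
void run_physics(double *rho_d, double *temp, int nlev, int ncol, int chunk) {
  for (int c0 = 0; c0 < ncol; c0 += chunk) {
    int n = std::min(chunk, ncol - c0);
    // Each chunk is a contiguous block of 'n' columns starting at column 'c0'
    microphysics(&rho_d[(long)c0*nlev], &temp[(long)c0*nlev], nlev, n);
  }
}
```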
The efficacy of pushing looping down the call stack (see the previous section) really boils down to the idea of "one big kernel" versus "many small kernels". There are pros and cons to each.
First, if you choose "one big kernel" in climate and weather model physics, and you want to thread over the vertical dimension, you are almost certainly committing to "hierarchical parallelism". In this case, it is typically two-level parallelism, with `ncol` being the coarser, outer-level looping and `nlev` being the finer, inner-level looping. Ostensibly, this would thread `nlev` within a Streaming Multiprocessor (SM, basically a vector unit) on an Nvidia GPU, and it would thread `ncol` across multiple SMs in an embarrassingly parallel fashion. This requires that `nlev` be a GPU-friendly number -- often a multiple of 32.
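A minimal Kokkos-style sketch of this two-level pattern, with hypothetical function and array names: columns map to teams (roughly, SMs), and vertical levels map to the threads within each team.

```cpp
#include <Kokkos_Core.hpp>

// Hierarchical (two-level) parallelism: 'ncol' is the coarse, outer level
// and 'nlev' is the fine, inner level threaded within a team.
void physics_step(Kokkos::View<double**> temp, int ncol, int nlev) {
  using team_policy = Kokkos::TeamPolicy<>;
  Kokkos::parallel_for("physics", team_policy(ncol, Kokkos::AUTO),
    KOKKOS_LAMBDA(const team_policy::member_type &team) {
      int icol = team.league_rank();          // coarse level: one column per team
      Kokkos::parallel_for(Kokkos::TeamThreadRange(team, nlev),
        [&] (int ilev) {                      // fine level: vertical levels
          temp(ilev, icol) += 1.0;            // placeholder physics
        });
    });
}
```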
This also potentially increases complexity because all code in the giant kernel must run on the GPU with no exceptions. Many physics packages set up and use indirect addressing, perform output to `stdout` and `stderr`, and have loop-carried dependencies such as hydrostatic summation or radiation tendency calculations. These can be complex to deal with when you cannot fission the kernel or run anything on the host.
Another con of one big kernel is that you typically have on the order of 1K-10K lines of code (sometimes more) in a single kernel. The more lines of code you have in a single kernel, the more registers the compiler typically needs to allocate. Registers are a highly limited resource on GPUs, leading to two issues:
- More registers means you can place fewer threads inside a single vector unit, which limits your ability to use the whole GPU (i.e., "occupancy")
- You typically end up "spilling" registers, which usually incurs a large performance penalty
Thus, when register usage is high or registers "spill", you typically see significantly reduced performance.
On GPUs, kernels take a certain amount of time to "launch" (typically 5-20 microseconds, depending on the GPU architecture). Climate applications in particular have such a strong throughput constraint that they strong scale to the point that there is barely enough work to properly use a GPU. In these cases, kernel runtimes are frequently of the same order as launch latencies. Thus, launching many small kernels incurs many launch latency costs, whereas launching one large kernel incurs very little launch latency cost. For example, at 10 microseconds per launch, 100 small kernels add a full millisecond of launch overhead per invocation of the physics.
One pro of the "one big kernel" approach is that it significantly reduces the number of kernel launches, and therefore, it can significantly outperform the "many small kernels" approach in the highest strong scaling limits.
Another potential issue with the "one big kernel" approach is that debugging and profiling can be challenging. If you get an invalid memory address in a kernel that is 10K lines long, it can be extremely difficult to determine where that error is occurring. Also, if the kernel performs very poorly, it can be very hard to determine why and which section of code (if it can be narrowed down to any particular section) is causing the problem.
With many small kernels, you know which kernel is performing poorly, and you know which kernel gives an invalid memory address error. It makes it significantly easier to determine the source of the problem by automatically narrowing it down to a much smaller section of code.
Personally, I believe there is a somewhat optimal middle ground between the two approaches.
- Push the looping down the callstack first, so that the hard work of promoting variables to include an extra dimension is already done
- Port with many small kernels first to identify bugs and performance problems
- Then merge kernels with few lines of code until you experience too much register pressure or register spillage
I think this workflow will give the fastest results over different problem sizes per GPU, and it will help the debugging efforts of refactoring and porting.
These are certainly not the only choices for a scientific GPU developer, but they are the most common. Fortran + "X" (where X is OpenACC, OpenMP target offload, or `do concurrent`) allows the code to remain in Fortran, which is desirable for domain-oriented scientists on many projects.
I cannot stress enough that I don't believe there is a perfect solution to productivity and performance regarding this question. The choice will differ project-by-project, so I want to present the pros and cons of each.
C++ portability libraries are just C++ libraries with vendor code hidden underneath for different hardware backends. It's not a C++ language extension -- just vanilla C++. For more information, please see https://github.com/mrnorman/YAKL/wiki/CPlusPlus-Performance-Portability-For-OpenACC-and-OpenMP-Folks
The libraries I'm most familiar with are [Kokkos](https://github.com/kokkos/kokkos/) and [YAKL](https://github.com/mrnorman/YAKL). They each have "owned" multi-dimensional arrays, `parallel_for` kernel launchers for fully independent thread execution, parallel reductions, and data movement routines. A minimal sketch of this pattern appears after the comparison below.
- YAKL: YAKL has Fortran-style multi-dimensional arrays and looping, a limited Fortran intrinsics library, and an automatically used pool allocator (with Fortran hooks) for efficient frequent allocations and deallocations.
- Kokkos: Kokkos doesn't have YAKL's Fortran behavior capabilities, but it has much more access to detailed GPU hardware aspects like `shared` memory, and it has a larger ecosystem of products as well. Kokkos is a much larger development effort.
If your project does not desire Fortran-like behavior in the resulting user-level code, and it does not do frequent allocations and deallocations, then Kokkos is the better library of choice.
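As an illustration of the shared pattern, here is a minimal Kokkos sketch (hypothetical function and array names; YAKL's `parallel_for` is similar in spirit) of an owned array plus a kernel launch over fully independent iterations:

```cpp
#include <Kokkos_Core.hpp>

// An "owned" multi-dimensional array plus a parallel_for kernel launch
// over fully independent (ilev, icol) iterations.
void scale_temp(int nlev, int ncol) {
  Kokkos::View<double**> temp("temp", nlev, ncol);   // owned device array
  Kokkos::parallel_for("scale_temp",
    Kokkos::MDRangePolicy<Kokkos::Rank<2>>({0, 0}, {nlev, ncol}),
    KOKKOS_LAMBDA(int ilev, int icol) {
      temp(ilev, icol) *= 2.0;                       // placeholder work
    });
  // Moving data back to the host would use Kokkos::deep_copy(...)
}
```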
C++ is more pedantic regarding how its data structures are laid out in memory and how they behave in a large number of situations. Fortran is not pedantic about this. For this reason, using `class` and `struct` objects in GPU code is significantly better supported in C++ than in Fortran. In C++, the user has complete control over, and understanding of, how classes, class members, and class methods behave on the CPU and GPU. The user can manually create constructors, move and copy constructors, and a destructor to make the class behave however they wish.
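A tiny sketch (a hypothetical type, not from any library) of the kind of copy control this enables:

```cpp
#include <cstddef>

// A hand-rolled device-array handle: the user-written copy constructor makes
// copies shallow, so a by-value lambda capture aliases the same device data
// rather than triggering a deep copy or carrying an invalid host pointer.
struct DeviceArray {
  double      *ptr = nullptr;   // device pointer (allocation handled elsewhere)
  std::size_t  n   = 0;

  DeviceArray() = default;
  DeviceArray(double *p, std::size_t len) : ptr(p), n(len) {}
  DeviceArray(const DeviceArray &rhs) : ptr(rhs.ptr), n(rhs.n) {}  // shallow copy
  ~DeviceArray() {}  // deliberately does not free; real libraries (Kokkos Views,
                     // YAKL Arrays) add reference counting on top of this idea
};
```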
In general, for non-Nvidia hardware and non-Nvidia compilers, you need to avoid using `%` inside your GPU kernels or directives. This means kernels and data movement statements need to be shielded from classes, type-bound procedures, and sometimes even pointers. For this reason, many projects choose to isolate their Fortran GPU kernel code into a more Fortran 90 style without these newer data structure features.
In Fortran, you're allowed to override the bounds of an array being passed into a subroutine, and lower bounds are not passed across the interface. This means array index checking is not entirely reliable in Fortran code. In portable C++ (at least Kokkos and YAKL, and probably the others too), all array metadata is attached to the object itself, so you cannot implicitly redimension or reshape arrays. This makes array index checking more reliable in portable C++ libraries. You do lose some of the convenience you have in Fortran, such as implied slicing and reshaping at subroutine interfaces, but in exchange for the inconvenience, you get guaranteed safety.
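For example, a Kokkos View carries its own extents, so out-of-range accesses can be trapped (build Kokkos with `-DKokkos_ENABLE_DEBUG_BOUNDS_CHECK=ON`):

```cpp
#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char **argv) {
  Kokkos::initialize(argc, argv);
  {
    Kokkos::View<double*, Kokkos::HostSpace> v("v", 10);
    printf("extent = %zu\n", v.extent(0));  // the metadata travels with the object
    // v(42) = 1.0;  // with bounds checking enabled, this aborts with the view's
                     // label and the offending index instead of corrupting memory
  }
  Kokkos::finalize();
}
```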
This section is YAKL-specific. Some Fortran codes allocate and deallocate frequently during runtime. YAKL automatically uses a pool allocator for all device allocations, which significantly reduces the cost of allocation and deallocation in C++ YAKL code. In Fortran, in general, one will pay a large penalty for allocations and deallocations during runtime.
Again, C++ is hardly a salvation from dealing with complex issues. There are benefits over Fortran, but there are also subtle issues that can arise in C++. I will list the main ones below, along with their resolutions; a combined sketch follows the list.
- C++ lambdas do not automatically capture variables by value when they are out of local scope.
  - Out-of-local-scope variables are captured by reference (i.e., a CPU reference), which will lead to invalid memory addressing. This happens when using data from your own class in a kernel (i.e., `this->var`), data from global scope (i.e., `::var`), and data from namespace scope (i.e., `mynamespace::var`).
  - RESOLUTION: YAKL has a `YAKL_SCOPE()` macro that pulls the variable into local scope so that you can use it in a kernel.
- Calling C++ routines from your own class or using data members from your own class from inside a GPU kernel.
  - Any time you call a routine from your own class or use a data member from your own class, you implicitly use the `this->` pointer, which is (in general) invalid on the GPU.
  - RESOLUTION: Make such routines `static` so they carry no `this->` pointer, and pull data members into local copies (e.g., with the `YAKL_SCOPE()` macro above) before launching the kernel.
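Below is a minimal sketch of both pitfalls and their resolutions in one place. The class, member, and namespace names are hypothetical, and Kokkos is used for the launch; YAKL users would reach for `YAKL_SCOPE()` at the marked lines.

```cpp
#include <Kokkos_Core.hpp>

namespace config { double gravity = 9.81; }   // namespace-scope data

// Hypothetical class illustrating both capture pitfalls and their fixes.
struct Microphysics {
  double coef;                 // class data member
  Kokkos::View<double*> q;     // device array with shallow-copy semantics

  Microphysics(int n) : coef(2.0), q("q", n) {}

  void run() {
    // PITFALL: writing 'coef' or 'config::gravity' directly inside the lambda
    // reaches through the host 'this' pointer / a host reference, which is
    // invalid on the device.
    // RESOLUTION: pull everything into local copies before the launch
    // (YAKL users would write something like YAKL_SCOPE(coef, this->coef);).
    double coef_loc = coef;
    double grav_loc = config::gravity;
    auto   q_loc    = q;       // shallow copy: same device allocation
    Kokkos::parallel_for("update_q", q_loc.extent(0),
      KOKKOS_LAMBDA(int i) { q_loc(i) = coef_loc * q_loc(i) + grav_loc; });
  }
};
```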
A final option is maintaining both Fortran and C++ codes (Fortran for CPU, C++ for GPU). Test for answer differences between the two codebases, and mirror changes between them when required.