Skip to content

CPlusPlus Performance Portability For OpenACC and OpenMP Folks

Matt Norman edited this page Sep 9, 2019 · 22 revisions

Table of Contents

Performance portability is a bit of a unicorn in the real world. We all want something that works optimally on all architectures with a single source, but in the real world, we have to make compromises. The goal should be to find the route with the fewest limitations and the most flexibility. When it comes to portability, the "kernel" model has a lot of advantages. This is the model used in Nvidia's CUDA language, for instance.

The Kernel Approach

The gist of the kernel approach is to give the following information:

  1. A kernel that performs the work of a single thread
  2. The number of threads to run

You basically just say, "Here's what I want to do for a given thread, and here's how many threads I have." With this information, a compiler and / or library can then launch those threads in whatever manner it wants. If you're on the CPU, you can wrap the kernel in a simple for loop. If you're on the GPU you can launch it as a GPU kernel, splitting the parallelism both within and between the GPU's vector units. If you're on the Intel MIC, you can tile the parallelism for appropriate caching.

As far as I can tell, the kernel approach is the most portable avenue we currently have for general computations. You can, of course, use tuned libraries like cuBLAS and cuFFT for specific computations and create Domain Specific Languages for slightly less specific cases. But the kernel approach allows you to define the "kernel" however you want for truly generic computations.

A well-coded directives-based approach like OpenACC or OpenMP offload will do this anyway. The kernel code is simply the code inside a set of loops, and the number of threads for the kernel is the product of the loop dimensions. Consider the codes below, for instance

! These loops will create nx*ny*nz threads
!$acc parallel loop collapse(3)
do k= 1 , nz
  do j = 1 , ny
    do i = 1 , nx
      c(i,j,k) = a(i,j,k) + b(i,j,k)  ! This is the kernel code
    enddo
  enddo
enddo
// These loops will create nx*ny*nz threads
#pragma acc parallel loop collapse(3)
for (int k=0; k<nz; k++) {
  for (int j=0; j<ny; j++) {
    for (int i=0; i<nx; i++) {
      c[k][j][i] = a[k][j][i] + b[k][j][i]  // This is the kernel code
    }
  }
}

In the C++ kernel-based approach, though, you can explicitly launch the kernel code on different backends yourself rather than having to rely on the compiler to do the right thing or waiting on them to implement the specific backend you want and how you want it. Compiler development is usually substantially slower because they have to be more general than a project's development needs to be.

How To Express a Kernel in C++

To launch some kernel code on multiple possible backends, you must be able to pass the kernel code as a function parameter. In C++, this essentially means it needs to be a class object, which is basically just a C struct with functions and inheritance. It's true that you can pass functions as pointers in C and C++, but that's an unsafe and probably less performant approach. There is no checking for parameters of function pointers, and usually a C++ compiler can inline the kernel's function when it's a class object, but never when it's a function pointer. That's not to mention the fact that you cannot use a CPU function pointer on the GPU because they use different memory address spaces.

So, you're passing a class object that "does something". That implies you're using a class member function. The launcher you're passing your kernel to needs to know which function to call. While you could just agree on a common function name (e.g., snuffaluffagus()) for all of your kernel class objects to use, a more convenient thing to do is to simply overload the parentheses, (), operator. The benefit of this is that when you use the parentheses operator on the class object, it looks just like a function call. A C++ class that overloads the parentheses operator is called a "functor". An example of a functor that adds two arrays, a and b, and stores the result in c is below:

class AddKernel {
float *a, *b, *c;
public:
  AddKernel(float *aIn, float *bIn, float *cIn) {
    a = aIn;
    b = bIn;
    c = cIn;
  }
  void operator() (int const i) {
    c[i] = a[i] + b[i];
  }
};

For those less familiar with C++, the function with no return type that has the same name as the class is called the constructor, and in this case, it initializes the pointers so the operator() function can operate on them. Notice that the operator() function takes a single parameter: the thread ID; and the function does something specific to that thread ID. You would then pass an instance of this class to a launcher function that would run it with something like the following on the CPU in serial:

function launcher(int nThreads, AddKernel &kernel) {
  for (int i=0; i<nThreads; i++) {
    kernel(i);
  }
}

Running this kernel on the GPU would look more like:

__global__ void cudaKernel(int nThreads, AddKernel kernel) {
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < nThreads) {
    f( i );
  }
}
int const vectorSize = 128;
function launcher(int nThreads, AddKernel &kernel) {
  cudaKernel <<< (nIter-1)/vectorSize+1 , vectorSize >>> ( nThreads , kernel );
}

The point is that we can take this kernel in the form of a functor and run it on multiple backends pretty easily. This is the gist of C++ performance portability.

C++ "Lambda"s

When your functor uses a lot of variables, it can be a real pain to have to explicitly create functors for every single kernel, declaring and initializing those variables in each functor, and having to explicitly create functor objects. That's a lot of wasted code. Enter the C++ "Lambda." A Lambda is a convenient way to create a functor without all of the tedium. I've heard it called "syntactic sugar" for creating functors that are only used once or have a small body of code. The real benefit of a Lambda is the "capture" clause (e.g., [var1,var2,&var3]), which automatically defines and initializes variables from the surrounding context in the functor for you. All Lambdas are "anonymous" functors because the classes are never named since they're only used once. It would be nice if you could name a Lambda in C++ so when you profile the code, the function name is sensible. A note of warning, Lambda function names in the symbol table tend to be pretty unwieldy.

As an example, the AddKernel functor you saw before could be expressed as the following as a Lambda:

auto myKernel = [a,b,c] (int i) { c[i] = a[i] + b[i]; };

The auto keyword, by the way, is just a convenience, where the compiler will assign the object type based on compile-time deduction. It the only way I know of to store a lambda object to a variable, though often, people just declare the lambda in place as a function parameter to the kernel launcher.

Notice the Lambda is only one line of code instead of twelve for this simple case. It becomes more drastic the more variables you use in the capture clause. Lambdas are also nice because they allow you to keep the kernel's code within the context it was originally in, grabbing variables from that context and creating a functor on the fly. If you have five successive kernels (i.e., five separate loops) in a given function, they can all stay in that function together like they originally were. This often makes the code easier to read and follow.

C++ Templates

To be honest, I'm not a fan of "meta"-templating because eventually the code becomes too hard to follow. But judicious use of templates can be quite nice. A C++ template directs the compiler to generate a function on the fly based on how it is called. The example launchers you saw above are not very useful because they only work with one functor object type. We can change them to use templates to then work with any kernel functor as follows:

template <class F> void parallel_for( int const nIter , F &f ) { 
  for (int i=0; i<nIter; i++) {
    f(i);
  }
}

When you call parallel_for with the AddKernel functor, the compiler will generate a version of this function with that object type. When you next call it with another functor (say from a Lambda), the compiler will generate yet another version of this launcher function. It's a convenience that allows you to write the launcher code only once for any kernel.

From Loops To Kernels: An Example

Consider the following CPU code, which computes a Runge-Kutta update based on weights and previous states and tendencies:

inline void applyTendencies(realArr &state2, real const c0, realArr const &state0,
                                             real const c1, realArr const &state1,
                                             real const ct, realArr const &tend,
                                             Domain const &dom) {
  for (int l=0; l<numState; l++) {
    for (int k=0; k<dom.nz; k++) {
      for (int j=0; j<dom.ny; j++) {
        for (int i=0; i<dom.nx; i++) {
          state2(l,hs+k,hs+j,hs+i) = c0 * state0(l,hs+k,hs+j,hs+i) +
                                     c1 * state1(l,hs+k,hs+j,hs+i) +
                                     ct * dom.dt * tend(l,k,j,i);
        }
      }
    }
  }
}

In Fortran, if you were to port this to an accelerator using OpenACC or OpenMP offload, this function would look like:

subroutine applyTendencies(state2, c0, state0, c1, state1, ct, tend)
  real, intent(  out) :: state2(:,:,:,:)
  real, intent(in   ) :: c0, c1, ct
  real, intent(in   ) :: state0(:,:,:,:), state1(:,:,:,:), tend(:,:,:,:)
#ifdef _USE_OPENACC
  !$acc parallel loop collapse(4)
#elif defined(_USE_OPENMP)
  !$omp target teams distribute parallel for simd collapse(4)
#endif
  do l=1,numState
    do k=1,nz
      do j=1,ny
        do i=1,nx
          state2(i,j,k,l) = c0 *    state0(i,j,k,l) + &
                            c1 *    state1(i,j,k,l) + &
                            ct * dt * tend(i,j,k,l)
        enddo
      enddo
    enddo
  enddo
end subroutine applyTendencies

Once this is transformed into a Lambda-based kernel launcher using the YAKL library, this code becomes:

inline void applyTendencies(realArr &state2, real const c0, realArr const &state0,
                                             real const c1, realArr const &state1,
                                             real const ct, realArr const &tend,
                                             Domain const &dom) {
  // for (int l=0; l<numState; l++) {
  //   for (int k=0; k<dom.nz; k++) {
  //     for (int j=0; j<dom.ny; j++) {
  //       for (int i=0; i<dom.nx; i++) {
  yakl::parallel_for( numState*dom.nz*dom.ny*dom.nx , YAKL_LAMBDA (int iGlob) {
    int l, k, j, i;
    unpackIndices(iGlob,numState,dom.nz,dom.ny,dom.nx,l,k,j,i);
    state2(l,hs+k,hs+j,hs+i) = c0 * state0(l,hs+k,hs+j,hs+i) +
                               c1 * state1(l,hs+k,hs+j,hs+i) +
                               ct * dom.dt * tend(l,k,j,i);
  }); 
}

I like to leave the for loops commented to better describe what's going on in the kernel launch itself, since the syntax still seems kind of ugly to me. This code works both on the CPU and the GPU with a single source, and yes, it shamelessly copies the Kokkos syntax because it's a good syntax.

Some Notes for Fortran Users

Hopefully the code above gives a good example of what something written in Fortran directives would look like in a C++ kernel-based approach. It always begs the question, though: Why bother moving to C++? I've programmed quite a bit in both contexts, and here are my thoughts:

  1. The OpenACC and OpenMP runtimes often have a difficult time dealing with Fortran data structures that are even remotely modern (pointers, ISO C bindings, derived types with allocatables, etc.). In my experience, this is the second-buggiest part of directives-based approaches, second only to function call directives.
  2. In directives-based approaches, you're relying on the compiler to do everything for you when you want to move to a new hardware back-end (like Intel MIC or AMD GPUs), and this can take a lot of time for compiler developers to implement. When new architectures and programming models emerge, mature Fortran implementations lag behind C++ implementations by usually a year or two (and sometimes significantly more than that).
  3. C++ compilers do a much better job of inlining function calls than Fortran compilers do. In Fortran, you often have to use the !$acc routine(...) clause for function calls, regardless of potential inlining. Not only does this tend to perform badly, but it's also in my experience the buggiest feature of OpenACC.
  4. In C++ and CUDA, you can use CUDA Managed Memory to shuffle data back and forth for you just by changing the allocation of your data buffer in your Array classes. In Fortran, you must rely on the compiler to do this for you under the hood, and thus far, only Nvidia's PGI compiler does this for you using -ta=nvidia,managed.
  5. C++, being used significantly more in mainstream industry than Fortran, naturally has much more mature compilers with fewer bugs than you find in Fortran compiler implementations. Fortran compilers just have fewer people working on them.
  6. C++ templating usually means you have to write significantly less code, and the compiler generates a lot of the code for you. In Fortran, for a generic function with multiple data types, you have to explicitly create a Fortran interface block that maps a generic function to multiple explicitly implemented functions based on multiple data types (which is really cumbersome).
  7. One (valid) criticism of C++ is the lack of a native multi-dimensional array class, but it's not that hard to create one yourself or borrow an already existing one made for GPU kernel launching (e.g., a Kokkos View or YAKL's Array). Also, there's a particularly strong benefit to a C++ Array class. In C++, the metadata of your multi-dimensional array is tied to the object itself whereas in Fortran, you can easily override an array's metadata in a routine's header, which is unsafe. This makes array bounds checking in Fortran subjective and error prone. In C++, you can add in your own array bounds checking in the Array class (turned on or off by a preprocessor macro), and the Array's metadata never changes in C++, making it safer and more reliable in the bounds checking.
  8. C++ has static analysis tools to catch bugs and undefined behavior, where as Fortran does not (to my knowledge).
  9. Fortran pointers are ... special.
    1. They allow too much flexibility (mainly pointers into strided memory), and the compiler can't know ahead of time when they're strided and when they aren't. For this reason, they use some form of indirect addressing by default, and they perform slower than automatic or allocated Fortran arrays.
    2. When you pass a Fortran pointer to a function, that function can accept it as a pointer, an array with implied bounds, or an array with specified bounds. In each of these cases, the compiler must choose what meta data to discard and how to discard it. I've seen some bugs in Fortran OpenACC implementations related to this being done wrongly.
    3. Further, Fortran array metadata is handled entirely implicitly by the compiler (the user has no say over this), and the Fortran standard doesn't specify how it should be handled. In C++ Array classes, you determine and handle the metadata yourself and have control over it. Keeping track of this metadata properly has been a large source of problems in OpenACC runtime implementations in Fortran. In C++, it's handled by copy and move constructors you write yourself and always seems to work in my experience.
    4. In Fortran, the pointer address is hidden from you unless you hack it out with c_loc(), meaning it's not really a pointer. Because of this, pointers are dereferenced by default in Fortran. So directives-based approaches have to do special things with regard to the pointer address versus pointer contents on the CPU and GPU, and it can get confusing. In C++, it's explicitly exposed and handled by you, so you at least know what's going on.

GPU Limitations and the Pains They Cause

From this point on, things are going to get a little more into the weeds. From a user's perspective, the most important things to get from this section is that when running on the GPU, (1) you must capture by value in your lambdas (i.e., [=] (int i) {...}); and (2) because you must capture by value, you must use an array / data class provided to you by the C++ library you're using (Kokkos, Raja, YAKL, etc.) unless you know what you're doing (more on this in the next section).

Everything is fine when you're running on the CPU, but things get uglier when you try to run on the GPU. The primary reason is that you have two separate memory address spaces: (1) main system memory and (2) GPU memory. Pointers in CPU memory are not valid on the GPU, and vice versa. So, if you create an object on the CPU and then try to use it on the GPU, you'll get a segmentation fault or bus error. Because of this, we have to introduce something new that works on single and multiple memory address spaces.

Shallow Copy Arrays

This, to me, is the crux of getting things working in a generic kernel launching approach in C++. If you want things to work on the GPU, and you cannot address an object created on the CPU, then you need to copy the object to the GPU. To accommodate this, it's easiest to create your own Array class, preferably a multi-dimensional Array class since most science uses arrays with multiple dimensions. This Array class needs Array metadata that is stored in simple compile-time known arrays (not pointers). Only the Array class's data should be a pointer that is allocated. If you want to use this Array on the GPU, allocate it with cudaMalloc(). If you want to use it on the CPU, use malloc(). If you want to use it on both without having to copy explicitly, use cudaMallocManaged().

When you pass any data class you use in your kernel functors to the GPU, you must do so by value, not by reference, again because you cannot use CPU pointers on the GPU. To do this properly, you usually have to explicitly override the class's default move and copy constructors to copy the class's metadata explicitly but simply move the data pointer itself (essentially sharing it between the original and the GPU copy). However, you must be careful to only allow the last used copy of the object to deallocate that data pointer. This can be done either by keeping track of a reference count shared by all objects (more complicated, but always works) or by keeping track of the original object's pointer (simpler, but sometimes fails).

When you share the data pointer, this is considered a "shallow" copy because the full object isn't being replicated but only its metadata. If you want to do a so-called "deep" copy, which copies the data from one allocated buffer to another, you'll need to create an explicit routine for this rather than using the = operator or copy and move constructors (à la the Kokkos View API).

Function Inlining

If you call a function from a GPU kernel, unless the called function does a lot of work over most of the GPU's thread indices, it's not worth the performance hit. GPU function calls flush the register page and cache, requiring fetches from DRAM for all data inside the called function. In the case of relatively small function calls, you'll want to define that function upon definition rather than waiting so that the compiler can inline it for sake of performance.

Private Variables

To define private variables that every thread on the GPU has a copy of, you must define an array with a size that is known at compile time because the GPU cannot handle variable length arrays on the stack. It must know the size of the stack frame for any GPU kernel at compile time. On the CPU, most compilers have extensions for putting Variable Length Arrays (VLAs) on the function's stack frame, but this will give a fatal compile-time error when using nvcc. If you have VLAs in your code, you have three options:

  1. Fix the array size as a compile time constant and keep the private variables.
  2. Fix a maximum private variable array length, and basically use partially filled arrays.
  3. Allocate a globally indexed array instead of private variables, and pass it to your kernel, having each thread index its piece of the global array.

Class Data Members

In C++, the whole point of a class is that the object has its own data to operate on, so it should be no surprise that you will often access your own class data in a GPU kernel. However, again, you cannot access your class's this pointer because it is pointing to CPU memory. Therefore, when you're creating C++ Lambda kernels to pass to a launcher, you need to avoid using the this pointer. While in C++17, you're allowed to create Labmdas that capture the *this object by value, this isn't the most efficient approach. The best thing to do in this case is to create a local reference in the class function containing the Lambda. An example of doing this with YAKL is below:

class Exchange {
...
  int nPack;
  real *haloSendBufW_cpu;
  real *haloSendBufE_cpu;
...
  inline void haloPackN_x(Domain const &dom, realArr const &a, int const n) {
    auto &haloSendBufW = this->haloSendBufW;
    auto &haloSendBufE = this->haloSendBufE;
    auto &nPack        = this->nPack       ;
    yakl::parallel_for( n*dom.nz*dom.ny*hs , YAKL_LAMBDA (int iGlob) {
      int v, k, j, ii;
      unpackIndices(iGlob,n,dom.nz,dom.ny,hs,v,k,j,ii);
      int nGlob = dom.nz*dom.ny*hs;
      haloSendBufW(nPack*nGlob+iGlob) = a(v,hs+k,hs+j,hs    +ii);
      haloSendBufE(nPack*nGlob+iGlob) = a(v,hs+k,hs+j,dom.nx+ii);
    });
    nPack = nPack + n;
  }
...
};

The lines starting with auto are the ones that create the local references that allow this code to work on the GPU. YAKL_LAMBDA is a macro that expands to [=] __host__ __device__ just like the KOKKOS_LAMBDA does. The nvcc compiler uses this code to create an anonymous C++ functor whose operator() member function has the __host__ __device__ attribute, meaning it can be run on either the host or the device.

Further Resources

A good place to look is the Kokkos and RAJA wikis:

AMD ROCm's HC language and Khronos's SYCLE also implement similar approaches:

Clone this wiki locally