Matt Norman edited this page Dec 10, 2021 · 35 revisions

YAKL (Yet Another Kernel Launcher)

A Simple C++ Framework for Performance Portability and Fortran Code Porting

Author: Matt Norman (Oak Ridge National Laboratory) - mrnorman.github.io

Contributors:

  • Matt Norman (Oak Ridge National Laboratory)
  • Isaac Lyngaas (Oak Ridge National Laboratory)
  • Abhishek Bagusetty (Argonne National Laboratory)
  • Mark Berrill (Oak Ridge National Laboratory)

Overview

YAKL (like Kokkos and RAJA) is a portable C++ library that lets developers conveniently target different hardware backends such as CUDA, HIP, and SYCL from a single source. YAKL, Kokkos, and RAJA are all plain C++ libraries, and the code is pure C++ without any language extensions. For more information about portable C++ libraries, particularly from the perspective of using directives, please read this article.

The YAKL API is similar to Kokkos in many ways, but it is considerably simpler and its arrays and parallel loops behave much more like Fortran. YAKL currently has backends for:

  • CPUs (serial)
  • CPU OpenMP threading
  • CUDA
  • HIP
  • SYCL
  • OpenMP offload (in progress)

What does YAKL provide?

  • Multi-dimensional dynamically allocated arrays in Fortran and C styles
  • Multi-dimensional statically defined arrays in Fortran and C styles
  • Kernel launchers to launch code in parallel over threads on different hardware backends
  • Various methods of transferring data between host and device memory spaces
  • Basic atomic operations (add, min, and max) using hardware atomics when available
  • Efficient reductions via convenient syntax patterned after Fortran's sum(), minval(), and maxval() using vendor libraries
  • Synchronization via a fence() function
  • Pool allocator that is automatically turned on for all device allocations in separate memory address spaces
  • Fortran bindings for YAKL allocators and YAKL init and finalize
  • Limited Fortran intrinsics library
  • Classes to handle scalars that need to be read after being written to in a parallel kernel
  • NetCDF and Parallel NetCDF I/O routines using YAKL's multi-dimensional Arrays
  • Automated timers for YAKL's parallel_for calls using the General Purpose Timing Library (GPTL)
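
To make the kernel-launcher item in the list above more concrete, here is a minimal serial sketch in plain C++. It does not use YAKL: `Extents4`, `toy_parallel_for`, and `count_visits` are hypothetical names invented for illustration. The sketch assumes that a launcher's job is to flatten a loop nest into a single index range and decode each flat index back into loop indices before calling the user's lambda. A real backend would distribute that flat range over threads instead of walking it with a `for` loop.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a 4-D loop-bounds object (not a YAKL type).
struct Extents4 {
  std::array<int,4> n;  // loop extents, slowest-varying first
  std::size_t size() const {
    return static_cast<std::size_t>(n[0]) * n[1] * n[2] * n[3];
  }
};

// Hypothetical serial launcher: calls f(l,k,j,i) once per index tuple.
template <class F>
void toy_parallel_for(Extents4 const &b, F const &f) {
  for (int e : b.n) assert(e >= 0);  // extents must be non-negative
  for (std::size_t flat = 0; flat < b.size(); ++flat) {
    // Decode the flat index; the last extent varies fastest.
    std::size_t rem = flat;
    int i = static_cast<int>(rem % b.n[3]); rem /= b.n[3];
    int j = static_cast<int>(rem % b.n[2]); rem /= b.n[2];
    int k = static_cast<int>(rem % b.n[1]); rem /= b.n[1];
    int l = static_cast<int>(rem);
    f(l, k, j, i);
  }
}

// Records how many times each (l,k,j,i) tuple is visited.
std::vector<int> count_visits(int nl, int nk, int nj, int ni) {
  std::vector<int> visits(static_cast<std::size_t>(nl) * nk * nj * ni, 0);
  toy_parallel_for(Extents4{{nl, nk, nj, ni}},
                   [&](int l, int k, int j, int i) {
    visits[((static_cast<std::size_t>(l) * nk + k) * nj + j) * ni + i] += 1;
  });
  return visits;
}
```

Because every iteration is independent of the others, the same lambda body could be executed in any order, which is what allows a launcher to run it in parallel.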

Example YAKL Code

The following shows the same section of code written three ways: Fortran with OpenACC, parallel YAKL C++ with Fortran-style arrays, and parallel YAKL C++ with C-style arrays:

OpenACC Fortran Code

real stateTend      (nx  ,ny,nz,numState)
real stateFluxLimits(nx+1,ny,nz,numState)

!$acc parallel loop collapse(4)
do l = 1 , numState
  do k = 1 , nz
    do j = 1 , ny
      do i = 1 , nx
        stateTend(i,j,k,l) = - ( stateFluxLimits(i+1,j,k,l) - &
                                 stateFluxLimits(i  ,j,k,l) ) / dx
      enddo
    enddo
  enddo
enddo

Portable C++ Code (Fortran-style YAKL Arrays)

typedef yakl::Array<float,4,yakl::memDevice,yakl::styleFortran> real4d;
using yakl::fortran::parallel_for;
using yakl::fortran::Bounds;

real4d stateTend      ("stateTend"      ,nx  ,ny,nz,numState);
real4d stateFluxLimits("stateFluxLimits",nx+1,ny,nz,numState);

// do l = 1 , numState
//   do k = 1 , nz
//     do j = 1 , ny
//       do i = 1 , nx
parallel_for( Bounds<4>(numState,nz,ny,nx) ,
              YAKL_LAMBDA(int l, int k, int j, int i) { 
  stateTend(i,j,k,l) = - ( stateFluxLimits(i+1,j,k,l) -
                           stateFluxLimits(i  ,j,k,l) ) / dx;
});

Portable C++ Code (C-style YAKL Arrays)

typedef yakl::Array<float,4,yakl::memDevice,yakl::styleC> real4d;
using yakl::c::parallel_for;
using yakl::c::Bounds;

real4d stateTend      ("stateTend"      ,numState,nz,ny,nx  );
real4d stateFluxLimits("stateFluxLimits",numState,nz,ny,nx+1);

// for (int l=0; l < numState; l++) {
//   for (int k=0; k < nz; k++) {
//     for (int j=0; j < ny; j++) {
//       for (int i=0; i < nx; i++) {
parallel_for( Bounds<4>(numState,nz,ny,nx) ,
              YAKL_LAMBDA(int l, int k, int j, int i) { 
  stateTend(l,k,j,i) = - ( stateFluxLimits(l,k,j,i+1) -
                           stateFluxLimits(l,k,j,i  ) ) / dx;
});  
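
The two C++ versions above differ only in index order and index base. As a rough illustration of why, the sketch below computes flat memory offsets for a 4-D array under the two conventions: Fortran-style (1-based, leftmost index varies fastest) and C-style (0-based, rightmost index varies fastest). The helpers `offset_fortran`, `offset_c`, and `same_layout` are hypothetical names, and this is plain C++ written for illustration, not YAKL's internal implementation.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: offset of (i,j,k,l) in an array dimensioned
// (n1,n2,n3,n4), Fortran-style: 1-based, leftmost index fastest.
std::size_t offset_fortran(int i, int j, int k, int l,
                           int n1, int n2, int n3, int n4) {
  assert(i >= 1 && i <= n1 && j >= 1 && j <= n2);
  assert(k >= 1 && k <= n3 && l >= 1 && l <= n4);
  std::size_t s1 = n1, s2 = n2, s3 = n3;
  return (i - 1) + s1 * ((j - 1) + s2 * ((k - 1) + s3 * (l - 1)));
}

// Hypothetical helper: offset of (a,b,c,d) in an array dimensioned
// (n1,n2,n3,n4), C-style: 0-based, rightmost index fastest.
std::size_t offset_c(int a, int b, int c, int d,
                     int n1, int n2, int n3, int n4) {
  assert(a >= 0 && a < n1 && b >= 0 && b < n2);
  assert(c >= 0 && c < n3 && d >= 0 && d < n4);
  std::size_t s2 = n2, s3 = n3, s4 = n4;
  return ((a * s2 + b) * s3 + c) * s4 + d;
}

// Checks that Fortran-style (i,j,k,l) in (nx,ny,nz,numState) and
// C-style (l-1,k-1,j-1,i-1) in (numState,nz,ny,nx) share one offset.
bool same_layout(int nx, int ny, int nz, int numState) {
  for (int l = 1; l <= numState; ++l)
    for (int k = 1; k <= nz; ++k)
      for (int j = 1; j <= ny; ++j)
        for (int i = 1; i <= nx; ++i)
          if (offset_fortran(i, j, k, l, nx, ny, nz, numState) !=
              offset_c(l - 1, k - 1, j - 1, i - 1, numState, nz, ny, nx))
            return false;
  return true;
}
```

Under these assumptions, `i` is the fastest-varying index in both versions, which is why the Fortran-style arrays are dimensioned `(nx,ny,nz,numState)` while the C-style arrays are dimensioned `(numState,nz,ny,nx)`.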