
parallel_for and Bounds (C and Fortran Styles), and synchronization

namespace yakl {
namespace [c | fortran] {
template <class F> void parallel_for( int num_threads , F const &functor );
template <class F> void parallel_for( Bounds const &bounds , F const &functor );
template <class F> void parallel_for( char const *label , int num_threads , F const &functor );
template <class F> void parallel_for( char const *label , Bounds const &bounds , F const &functor );
// In Bounds, left to right ordering always matches outer to inner loop ordering
// Meaning the right-most loop bounds match the fastest varying / innermost loop
// You can use initializer lists {b1,b2} to pass to std::vector arguments
template <int DIMS> Bounds<DIMS>( int d0 [ , int d1 , ...] );
template <int DIMS> Bounds<DIMS>( std::vector<int> d0 [ , std::vector<int> d1 , ...] );
}
}

parallel_for and Bounds (C-style)

YAKL's parallel_for is the main workhorse for dispatching code onto parallel threads, and the Bounds class is how you specify the implied (potentially multi-dimensional) loops to thread over. To pass the body of a set of loops to a parallel_for call, the most convenient approach by far is a C++ "lambda" expression. If you're a C++ novice and don't know what that is, just see the examples below for how to turn loops into a parallel_for call. If you're a Fortran developer, see the Fortran-style examples below. If you're more advanced in C++ and want to create your own functor, feel free to do so and pass an instance of it to parallel_for in place of the lambda expressions below. The examples immediately below use C-style behavior.

Code launched by a parallel_for call is always launched on the device. To run code on the host, please use traditional [nested] for loop expressions.

Important: All parallel_for calls are asynchronous by default relative to CPU work. To synchronize with the host, you must use the yakl::fence() function (see below).

for (int i=0; i < nx; i++) {
  a(i) = 1;
}
// This will become the following:
// In C-style, an integer bound nx implies iteration from 0,...,nx-1
yakl::c::parallel_for( nx , YAKL_LAMBDA (int i) {
  a(i) = 1;
});
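
If you'd rather not use a lambda, you can also pass your own functor to parallel_for. The following is only a sketch: the functor name SetToOne is made up for illustration, and it assumes a is a one-dimensional C-style device Array (e.g., yakl::Array<float,1,yakl::memDevice>).

// Hypothetical hand-written functor equivalent to the lambda in the example above
struct SetToOne {
  yakl::Array<float,1,yakl::memDevice> a;   // copied by value into the functor, like a lambda capture
  YAKL_INLINE void operator() (int i) const { a(i) = 1; }
};
yakl::c::parallel_for( nx , SetToOne{a} );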


for (int j=0; j < ny; j++) {
for (int i=0; i < nx; i++) {
  a(j,i) = 1;
}
}
// This will become:
yakl::c::parallel_for( yakl::c::Bounds<2>(ny,nx) , YAKL_LAMBDA (int j, int i) {
  a(j,i) = 1;
});


for (int j=-1; j <= ny+1; j++) {
for (int i=0; i < nx; i++) {
  a(j,i) = 1;
}
}
// This will become the following.
// Two entries for a bound means lower, upper (INCLUSIVE!)
yakl::c::parallel_for( yakl::c::Bounds<2>({-1,ny+1},nx) , YAKL_LAMBDA (int j, int i) {
  a(j,i) = 1;
});


for (int j=-1; j <= ny+1; j+=2) {
for (int i=0; i < nx; i++) {
  a(j,i) = 1;
}
}
// This will become the following.
// Three entries for a bound means lower, upper (INCLUSIVE), stride
// Strides must always be positive. No negative iterations at present
// If ordering matters, it's likely you have some kind of scan and cannot use parallel_for anyway
yakl::c::parallel_for( yakl::c::Bounds<2>({-1,ny+1,2},nx) , YAKL_LAMBDA (int j, int i) {
  a(j,i) = 1;
});

Fortran-style parallel_for and Bounds

Fortran-style parallel_for and Bounds behave differently from the C-style versions. An integer bound nx passed to a Fortran-style Bounds iterates, by default, from 1,...,nx.

// do i = 1 , nx
//   a(i) = 1;
// enddo
// This will become the following:
// In fortran-style, an integer bound, nx, implies iteration from 1,...,nx
yakl::fortran::parallel_for( nx , YAKL_LAMBDA (int i) {
  a(i) = 1;
});


// do j = 1 , ny
// do i = 1 , nx
//   a(i,j) = 1;
// enddo
// enddo
// This will become:
yakl::fortran::parallel_for( yakl::fortran::Bounds<2>(ny,nx) , YAKL_LAMBDA (int j, int i) {
  a(i,j) = 1;
});


// do j = -1,ny+1
// do i = 1,nx
//   a(i,j) = 1;
// enddo
// enddo
// This will become the following.
// Two entries for a bound means lower, upper (INCLUSIVE!)
yakl::fortran::parallel_for( yakl::fortran::Bounds<2>({-1,ny+1},nx) , YAKL_LAMBDA (int j, int i) {
  a(i,j) = 1;
});


// do j = -1 , ny+1 , 2
// do i = 1 , nx
//   a(i,j) = 1;
// enddo
// enddo
// This will become the following.
// Three entries for a bound means lower, upper (INCLUSIVE), stride
// Strides must always be positive. No negative iterations at present
// If ordering matters, it's likely you have some kind of scan and cannot use parallel_for anyway
yakl::fortran::parallel_for( yakl::fortran::Bounds<2>({-1,ny+1,2},nx) , YAKL_LAMBDA (int j, int i) {
  a(i,j) = 1;
});

Labeled parallel_for

If you plan to use YAKL's automated timers or its CUDA NVTX wrappers (e.g., for profiling with nvprof), it is beneficial to label your parallel_for calls. These are the parallel_for overloads that take a string label as their first argument (see the signatures at the top of this page).
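
For example, the 2-D C-style launch from earlier could be labeled like this (the label string itself is arbitrary):

yakl::c::parallel_for( "Initialize a" , yakl::c::Bounds<2>(ny,nx) , YAKL_LAMBDA (int j, int i) {
  a(j,i) = 1;
});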

Readability with using

Tired of typing yakl::fortran:: over and over? Use a C++ using declaration:

using yakl::fortran::parallel_for;
using yakl::fortran::Bounds;
// Now it can look much cleaner:
parallel_for( Bounds<3>(nz,ny,nx) , YAKL_LAMBDA (int k, int j, int i) {
  arr(i,j,k) = 1;
});

YAKL_LAMBDA and YAKL_DEVICE_LAMBDA

YAKL_LAMBDA creates a lambda-generated functor that is valid on the host and the device. In CUDA and HIP, for instance, it is a C++ macro that expands to include __host__ __device__. The YAKL_DEVICE_LAMBDA macro creates a lambda that is only valid on the device.
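
Usage is identical to YAKL_LAMBDA. For instance, a kernel that only ever needs to run on the device could be written as the following sketch:

yakl::c::parallel_for( nx , YAKL_DEVICE_LAMBDA (int i) {
  a(i) = 2*a(i);
});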

Calling functions from kernels: YAKL_INLINE and YAKL_DEVICE_INLINE

You can call functions from parallel_for kernels. Any routine you want to call from a kernel should be prefaced with YAKL_INLINE if you want that routine to be available on the host and device, and YAKL_DEVICE_INLINE if you want that routine to be available only on the device.

YAKL_INLINE float sum(float x, float y) { return x+y; }
parallel_for( nx , YAKL_LAMBDA (int i) { c(i) = sum( a(i) , b(i) ); });

Synchronization

Asynchronous operations on the device are synchronized with the host by using the yakl::fence() function. This corresponds to cudaDeviceSynchronize() in CUDA and similar in HIP.
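
For example, a minimal sketch of synchronizing after an asynchronous launch:

// Launch a kernel asynchronously on the device
yakl::c::parallel_for( nx , YAKL_LAMBDA (int i) {
  a(i) = 1;
});
// Block the host until all outstanding device work (including the launch above) completes
yakl::fence();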

Can parallel_for be nested in another parallel_for?

No, YAKL does not support this.

How do you run just one thread on the device?

parallel_for( 1 , YAKL_LAMBDA (int dummy) {
  // work
});

Does YAKL Support Hierarchical Parallelism?

YAKL does not support hierarchical or multi-level parallelism on the device, and there are no current plans to support this. However, there are plans to integrate Kokkos parallel dispatch with YAKL Array objects and YAKL intrinsics. At that point, it will be possible for the developer to use the Kokkos parallel dispatch, which does include hierarchical parallelism.
