
parallel_for and Bounds (C and Fortran Styles), and Synchronization

namespace yakl {
namespace [c | fortran] {
template <class F> void parallel_for( int num_threads , F const &functor );
template <class F> void parallel_for( Bounds const &bounds , F const &functor );
template <class F> void parallel_for( char const *label , int num_threads , F const &functor );
template <class F> void parallel_for( char const *label , Bounds const &bounds , F const &functor );
// In Bounds, left-to-right ordering always matches outer-to-inner loop ordering,
// meaning the right-most loop bound matches the fastest-varying / innermost loop.
// You can pass initializer lists {b1,b2} for the std::vector arguments.
template <int DIMS> Bounds<DIMS>( int d0 [ , int d1 , ...] );
template <int DIMS> Bounds<DIMS>( std::vector<int> d0 [ , std::vector<int> d1 , ...] );
}
}

parallel_for and Bounds (C-style)

YAKL's parallel_for is the main workhorse for dispatching code across parallel threads, and the Bounds class is how you specify the implied (potentially multi-dimensional) loops to thread over. To pass the body of a set of loops to a parallel_for call, it is by far most convenient to use a C++ "lambda" expression. If you're a C++ novice and don't know what that is, just see the examples below for how to turn loops into a parallel_for call. If you're a Fortran developer, see the Fortran-style examples below. If you're more experienced with C++ and want to create your own functor, feel free to do so and pass an instance of it to parallel_for in place of the lambda expressions below. The examples immediately below use C-style behavior.
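If you do want the functor route, a minimal sketch might look like the following. It assumes a one-dimensional C-style device Array named a; the class name FillOnes and the element type float are hypothetical and chosen only for illustration:

// Functor equivalent of the first lambda example below.
// The call operator is marked YAKL_INLINE so it is callable on the device.
class FillOnes {
  yakl::Array<float,1,yakl::memDevice,yakl::styleC> a;
public:
  FillOnes( yakl::Array<float,1,yakl::memDevice,yakl::styleC> const &a_in ) : a(a_in) { }
  YAKL_INLINE void operator() (int i) const { a(i) = 1; }
};

yakl::c::parallel_for( nx , FillOnes(a) );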

Code launched by a parallel_for call is always launched on the device. To run code on the host, please use traditional [nested] for loops.

Important: All parallel_for calls are asynchronous by default relative to CPU work. To synchronize with the host, you must use the yakl::fence() function (see below).

for (int i=0; i < nx; i++) {
  a(i) = 1;
}
// This will become the following:
// In c-style, an integer bound, nx, implies iteration from 0,...,nx-1
yakl::c::parallel_for( nx , YAKL_LAMBDA (int i) {
  a(i) = 1;
});


for (int j=0; j < ny; j++) {
for (int i=0; i < nx; i++) {
  a(j,i) = 1;
}
}
// This will become:
yakl::c::parallel_for( yakl::c::Bounds<2>(ny,nx) , YAKL_LAMBDA (int j, int i) {
  a(j,i) = 1;
});


for (int j=-1; j <= ny+1; j++) {
for (int i=0; i < nx; i++) {
  a(j,i) = 1;
}
}
// This will become the following.
// Two entries for a bound means lower, upper (INCLUSIVE!)
yakl::c::parallel_for( yakl::c::Bounds<2>({-1,ny+1},nx) , YAKL_LAMBDA (int j, int i) {
  a(j,i) = 1;
});


for (int j=-1; j <= ny+1; j+=2) {
for (int i=0; i < nx; i++) {
  a(j,i) = 1;
}
}
// This will become the following.
// Three entries for a bound means lower, upper (INCLUSIVE), stride
// Strides must always be positive. No negative iterations at present
// If ordering matters, it's likely you have some kind of scan and cannot use parallel_for anyway
yakl::c::parallel_for( yakl::c::Bounds<2>({-1,ny+1,2},nx) , YAKL_LAMBDA (int j, int i) {
  a(j,i) = 1;
});

Fortran-style parallel_for and Bounds

Fortran-style parallel_for and Bounds behavior differs from C-style: an integer, nx, passed as a Fortran-style bound iterates, by default, from 1,...,nx.

// do i = 1 , nx
//   a(i) = 1;
// enddo
// This will become the following:
// In fortran-style, an integer bound, nx, implies iteration from 1,...,nx
yakl::fortran::parallel_for( nx , YAKL_LAMBDA (int i) {
  a(i) = 1;
});


// do j = 1 , ny
// do i = 1 , nx
//   a(i,j) = 1;
// enddo
// enddo
// This will become:
yakl::fortran::parallel_for( yakl::fortran::Bounds<2>(ny,nx) , YAKL_LAMBDA (int j, int i) {
  a(i,j) = 1;
});


// do j = -1,ny+1
// do i = 1 , nx
//   a(i,j) = 1;
// enddo
// enddo
// This will become the following.
// Two entries for a bound means lower, upper (INCLUSIVE!)
yakl::fortran::parallel_for( yakl::fortran::Bounds<2>({-1,ny+1},nx) , YAKL_LAMBDA (int j, int i) {
  a(i,j) = 1;
});


// do j = -1 , ny+1 , 2
// do i = 1 , nx
//   a(i,j) = 1;
// enddo
// enddo
// This will become the following.
// Three entries for a bound means lower, upper (INCLUSIVE), stride
// Strides must always be positive. No negative iterations at present
// If ordering matters, it's likely you have some kind of scan and cannot use parallel_for anyway
yakl::fortran::parallel_for( yakl::fortran::Bounds<2>({-1,ny+1,2},nx) , YAKL_LAMBDA (int j, int i) {
  a(i,j) = 1;
});

Labeled parallel_for

If you plan to eventually use YAKL's automated timers or its CUDA NVTX wrappers in nvprof, it is beneficial to label your parallel_for calls. These are the parallel_for overloads that begin with a string label (see the signatures at the top of this page).
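For example, a sketch of the labeled form (the label text "initialize_a" is arbitrary):

// Identical to the earlier 2-D example, but with a label that shows up in YAKL's timers and NVTX ranges
yakl::c::parallel_for( "initialize_a" , yakl::c::Bounds<2>(ny,nx) , YAKL_LAMBDA (int j, int i) {
  a(j,i) = 1;
});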

Readability with using

Tired of typing yakl::fortran:: over and over? Use a C++ using declaration:

using yakl::fortran::parallel_for;
using yakl::fortran::Bounds;
// Now it can look much cleaner:
parallel_for( Bounds<3>(nz,ny,nx) , YAKL_LAMBDA (int k, int j, int i) {
  arr(i,j,k) = 1;
});

YAKL_LAMBDA and YAKL_DEVICE_LAMBDA

YAKL_LAMBDA creates a lambda-generated functor that is valid on the host or the device. In CUDA and HIP, for instance, it is a C++ macro that expands to include __host__ __device__ decoration. The YAKL_DEVICE_LAMBDA macro creates a lambda that is valid only on the device.
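Usage is the same as in the earlier examples; a minimal sketch that swaps in YAKL_DEVICE_LAMBDA, assuming the kernel body never needs to run on the host:

// Same launch pattern as before, but the lambda is only callable on the device
yakl::c::parallel_for( nx , YAKL_DEVICE_LAMBDA (int i) {
  a(i) = 1;
});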

Calling functions from kernels: YAKL_INLINE and YAKL_DEVICE_INLINE

You can call functions from parallel_for kernels. Any routine you want to call from a kernel should be prefaced with YAKL_INLINE if you want it to be available on both the host and the device, or with YAKL_DEVICE_INLINE if you want it to be available only on the device.

YAKL_INLINE float sum(float x, float y) { return x+y; }
parallel_for( nx , YAKL_LAMBDA (int i) { c(i) = sum( a(i) , b(i) ); });
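Similarly, a sketch using YAKL_DEVICE_INLINE (the function name diff is hypothetical); such a routine can then only be called from device code:

// diff() is callable only from device code, e.g., from inside the lambda below
YAKL_DEVICE_INLINE float diff(float x, float y) { return x-y; }
parallel_for( nx , YAKL_LAMBDA (int i) { c(i) = diff( a(i) , b(i) ); });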

Synchronization

Asynchronous operations on the device are synchronized with the host by calling the yakl::fence() function. This corresponds to cudaDeviceSynchronize() in CUDA and hipDeviceSynchronize() in HIP.
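For example, a typical pattern is to fence before the host reads results produced on the device (a sketch reusing the array a from the earlier examples):

yakl::c::parallel_for( nx , YAKL_LAMBDA (int i) {
  a(i) = 1;
});
yakl::fence();  // Block the host until all outstanding device work (the kernel above) completes
// The host can now safely copy or inspect the results of the kernel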

Can parallel_for be nested in another parallel_for?

No, YAKL does not support this.

How do you run just one thread on the device?

parallel_for( 1 , YAKL_LAMBDA (int dummy) {
  //work
});

Does YAKL Support Hierarchical Parallelism?

Yes! See the documentation here: Hierarchical Parallelism
