parallel_for and Bounds (C and Fortran Styles), and Synchronization
namespace yakl {
namespace [c | fortran] {
  template <class F> void parallel_for( int num_threads , F const &functor );
  template <class F> void parallel_for( Bounds const &bounds , F const &functor );
  template <class F> void parallel_for( char const *label , int num_threads , F const &functor );
  template <class F> void parallel_for( char const *label , Bounds const &bounds , F const &functor );

  // In Bounds, left-to-right ordering always matches outer-to-inner loop ordering,
  // meaning the right-most loop bounds match the fastest-varying / innermost loop.
  // You can use initializer lists {b1,b2} to pass to the std::vector arguments.
  template <int DIMS> Bounds<DIMS>( int d0 [ , int d1 , ...] );
  template <int DIMS> Bounds<DIMS>( std::vector<int> d0 [ , std::vector<int> d1 , ...] );
}
}
YAKL's parallel_for is the main workhorse for dispatching code across parallel threads, and the Bounds class is how you specify the implied (potentially multi-dimensional) loops to thread over. To pass the body of a set of loops to a parallel_for call, it is by far most convenient to use a C++ lambda expression. If you're a C++ novice and don't know what that is, just see the examples below for how to turn loops into a parallel_for call. If you're a Fortran developer, see the Fortran-style examples further down; the examples immediately below use C-style behavior. If you're more advanced in C++ and want to create your own functor, feel free to do that and pass an instance of it to parallel_for in place of the lambda expressions below, as in the sketch that follows.
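For instance, here is a minimal sketch of a hand-written functor equivalent to the first lambda example below. The Array template parameters and the SetToOne name are illustrative assumptions, not part of any YAKL example:

// Captures the Array by value, just as YAKL_LAMBDA captures by value
class SetToOne {
  yakl::Array<float,1,yakl::memDevice,yakl::styleC> a;
public:
  SetToOne( yakl::Array<float,1,yakl::memDevice,yakl::styleC> const &a_in ) : a(a_in) { }
  YAKL_INLINE void operator() ( int i ) const { a(i) = 1; }
};

yakl::c::parallel_for( nx , SetToOne(a) );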
Code launched by a parallel_for call always runs on the device. To run code on the host, please use traditional [nested] for loop expressions.
Important: All parallel_for calls are asynchronous by default relative to CPU work. To synchronize with the host, you must use the yakl::fence() function (see below).
for (int i=0; i < nx; i++) {
a(i) = 1;
}
// This will become the following:
// In c-style, an integer bound, nx, implies iteration from 0,...,nx-1
yakl::c::parallel_for( nx , YAKL_LAMBDA (int i) {
a(i) = 1;
});
for (int j=0; j < ny; j++) {
for (int i=0; i < nx; i++) {
a(j,i) = 1;
}
}
// This will become:
yakl::c::parallel_for( yakl::c::Bounds<2>(ny,nx) , YAKL_LAMBDA (int j, int i) {
a(j,i) = 1;
});
for (int j=-1; j <= ny+1; j++) {
for (int i=0; i < nx; i++) {
a(j,i) = 1;
}
}
// This will become the following.
// Two entries for a bound means lower, upper (INCLUSIVE!)
yakl::c::parallel_for( yakl::c::Bounds<2>({-1,ny+1},nx) , YAKL_LAMBDA (int j, int i) {
a(j,i) = 1;
});
for (int j=-1; j <= ny+1; j+=2) {
for (int i=0; i < nx; i++) {
a(j,i) = 1;
}
}
// This will become the following.
// Three entries for a bound means lower, upper (INCLUSIVE), stride
// Strides must always be positive. No negative iterations at present
// If ordering matters, it's likely you have some kind of scan and cannot use parallel_for anyway
yakl::c::parallel_for( yakl::c::Bounds<2>({-1,ny+1,2},nx) , YAKL_LAMBDA (int j, int i) {
a(j,i) = 1;
});
Fortran-style parallel_for and Bounds behavior differs from C-style. An integer bound, nx, passed to a Fortran-style Bounds iterates, by default, from 1,...,nx:
// do i = 1 , nx
// a(i) = 1;
// enddo
// This will become the following:
// In fortran-style, an integer bound, nx, implies iteration from 1,...,nx
yakl::fortran::parallel_for( nx , YAKL_LAMBDA (int i) {
a(i) = 1;
});
// do j = 1 , ny
//   do i = 1 , nx
//     a(i,j) = 1;
//   enddo
// enddo
// This will become:
yakl::fortran::parallel_for( yakl::fortran::Bounds<2>(ny,nx) , YAKL_LAMBDA (int j, int i) {
a(i,j) = 1;
});
// do j = -1 , ny+1
//   do i = 1 , nx
//     a(i,j) = 1;
//   enddo
// enddo
// This will become the following.
// Two entries for a bound means lower, upper (INCLUSIVE!)
yakl::fortran::parallel_for( yakl::fortran::Bounds<2>({-1,ny+1},nx) , YAKL_LAMBDA (int j, int i) {
a(i,j) = 1;
});
// do j = -1 , ny+1 , 2
//   do i = 1 , nx
//     a(i,j) = 1;
//   enddo
// enddo
// This will become the following.
// Three entries for a bound means lower, upper (INCLUSIVE), stride
// Strides must always be positive. No negative iterations at present
// If ordering matters, it's likely you have some kind of scan and cannot use parallel_for anyway
yakl::fortran::parallel_for( yakl::fortran::Bounds<2>({-1,ny+1,2},nx) , YAKL_LAMBDA (int j, int i) {
a(i,j) = 1;
});
If you plan to eventually use YAKL's automated timers or its CUDA NVTX wrappers in nvprof, it is beneficial to label your parallel_for calls. These are the parallel_for overloads that begin with a string label (see the signatures at the top of this page).
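For example, a minimal sketch using a labeled overload, reusing the earlier one-dimensional kernel; the label string "init_a" is a hypothetical name, and it is what shows up in the timers and NVTX ranges:

yakl::c::parallel_for( "init_a" , nx , YAKL_LAMBDA (int i) {
a(i) = 1;
});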
Tired of typing yakl::fortran:: over and over? Use C++ using declarations:
using yakl::fortran::parallel_for;
using yakl::fortran::Bounds;
// Now it can look much cleaner:
parallel_for( Bounds<3>(nz,ny,nx) , YAKL_LAMBDA (int k, int j, int i) {
arr(i,j,k) = 1;
});
YAKL_LAMBDA creates a lambda-generated functor that is valid on the host and the device. In CUDA and HIP, for instance, it is a C++ macro that expands to include __host__ __device__. The YAKL_DEVICE_LAMBDA macro creates a functor that is valid only on the device.
You can call functions from parallel_for kernels. Any routine you want to call from a kernel should be prefaced with YAKL_INLINE if you want it available on both the host and the device, or with YAKL_DEVICE_INLINE if you want it available only on the device.
YAKL_INLINE float sum(float x, float y) { return x+y; }

parallel_for( nx , YAKL_LAMBDA (int i) {
c(i) = sum( a(i) , b(i) );
});
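A device-only routine works the same way with YAKL_DEVICE_INLINE. This sketch pairs it with YAKL_DEVICE_LAMBDA; the scale helper here is hypothetical:

YAKL_DEVICE_INLINE float scale(float x) { return 2.0f * x; }

parallel_for( nx , YAKL_DEVICE_LAMBDA (int i) {
c(i) = scale( a(i) );
});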
Asynchronous operations on the device are synchronized with the host by using the yakl::fence() function. This corresponds to cudaDeviceSynchronize() in CUDA and hipDeviceSynchronize() in HIP.
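As a minimal sketch of the typical pattern (assuming a is a device Array created elsewhere):

parallel_for( nx , YAKL_LAMBDA (int i) {
a(i) = 1;
});
yakl::fence(); // block the host until the kernel above has completed
// The host may now safely read the results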
If you need to run something on the device in a single thread, you can launch a parallel_for over a single iteration:
parallel_for( 1 , YAKL_LAMBDA (int dummy) {
//work
});
What about hierarchical parallelism? No, YAKL does not support hierarchical or multi-level parallelism on the device, and there are no current plans to support it. However, there are plans to integrate Kokkos parallel dispatch with YAKL Array objects and YAKL intrinsics. At that point, it will be possible for the developer to use the Kokkos parallel dispatch, which does include hierarchical parallelism.