Hierarchical Parallelism

YAKL supports two-level hierarchical parallelism with yakl::parallel_outer, yakl::parallel_inner, yakl::fence_inner, and yakl::single_inner. Be warned that the syntax is necessarily more complex than that of vanilla parallel_for, and there are computational concerns that YAKL can no longer shield the user from. Please see a small example below:

using yakl::c::parallel_outer;
using yakl::c::parallel_inner;
using yakl::c::single_inner;
using yakl::c::Bounds;
using yakl::fence_inner;
// nz must be known at compile time here; the largest inner loop below has nz+2
// iterations, so the LaunchConfig vector length must be at least nz+2
parallel_outer( "blah" , Bounds<2>(ny,nx) , YAKL_LAMBDA (int j, int i, InnerHandler handler ) {
  // Load data
  parallel_inner( nz , [&] (int k) {
    w_velocity(j,i,1+k) = state(idW,k,j,i);
  } , handler );
  fence_inner( handler );
  // Boundary conditions
  single_inner( [&] () {
    w_velocity(j,i,0   ) = 0;
    w_velocity(j,i,nz+1) = 0;
  } , handler );
  fence_inner( handler );
  // Compute momentum from velocity and pre-computed density
  parallel_inner( nz+2 , [&] (int k) {
    w_momentum(j,i,k) = w_velocity(j,i,k) * density(j,i,k);
  } , handler );
} , LaunchConfig<nz+2>() );

Note that on the serial CPU backend, parallel_outer translates to simple for loops, and on the OpenMP CPU threading backend, it translates to OpenMP-threaded loops. parallel_inner translates to simple for loops on both CPU backends, meaning the inner level is never threaded on the CPU. fence_inner is a no-op on both CPU backends.
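Conceptually, on the serial CPU backend the example above behaves roughly like the following plain loops (a sketch for intuition only, not YAKL's actual generated code):

for (int j=0; j < ny; j++) {
  for (int i=0; i < nx; i++) {
    // parallel_inner becomes a simple loop
    for (int k=0; k < nz; k++) { w_velocity(j,i,1+k) = state(idW,k,j,i); }
    // fence_inner is a no-op on CPU backends
    // the single_inner body simply runs once
    w_velocity(j,i,0   ) = 0;
    w_velocity(j,i,nz+1) = 0;
    for (int k=0; k < nz+2; k++) { w_momentum(j,i,k) = w_velocity(j,i,k) * density(j,i,k); }
  }
}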

parallel_outer

The yakl::c::parallel_outer and yakl::fortran::parallel_outer functions distribute parallel threads over a CUDA "grid" or an OpenACC "gang". This is the coarsest level of parallelism on GPUs. The function has the following signature:

template <class F, int N, bool simple, int VecLen=YAKL_DEFAULT_VECTOR_LEN, bool B4B = false>
inline void parallel_outer( char const * str , Bounds<N,simple> const &bounds , F const &f ,
                            LaunchConfig<VecLen,B4B> config = LaunchConfig<>() );

TL;DR

Call parallel_outer as, e.g.,

parallel_outer("mylabel",Bounds<2>(ny,nx),YAKL_LAMBDA (int j, int i, InnerHandler handler) {
  // My lambda code
} , LaunchConfig<inner_loop_size>() );

OR

parallel_outer("mylabel",Bounds<2>(ny,nx),YAKL_LAMBDA (int j, int i, InnerHandler handler) {
  // My lambda code
} , LaunchConfig<MAX_INNER_LOOP_SIZE>().set_inner_size(actual_inner_loop_size) );

Ensure that actual_inner_loop_size <= MAX_INNER_LOOP_SIZE.

Always accept an InnerHandler object as the final parameter, after your outer loop indices, in the functor or lambda you pass to parallel_outer.

The lambda you pass to parallel_outer must be created with YAKL_LAMBDA.

Full parallel_outer description

As with the other parallel_for APIs, the leading string label str is optional and can be omitted. As always, you can pass an integer type or a loop bound expression {lower_bound,upper_bound,stride} in place of a Bounds or SimpleBounds object. f is the functor or lambda you're distributing across threads. The LaunchConfig object gives YAKL important information about the kernel launch.
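For instance, since the label is optional, a call like this minimal sketch is also valid:

parallel_outer( Bounds<2>(ny,nx) , YAKL_LAMBDA (int j, int i, InnerHandler handler) {
  // My lambda code
} , LaunchConfig<inner_loop_size>() );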

The first template parameter of LaunchConfig tells YAKL the maximum size of the inner loop. As a template parameter, it must be known at compile time; YAKL requires this mostly for computational efficiency. It is highly recommended that the user set this to a compile-time-known value that bounds the size of the inner parallel loop from above. If the user does not set this template parameter, the maximum bound defaults to the default vector length for the given architecture, which may not be as large as the user wants. In CUDA, you will get a runtime failure at kernel launch if the inner loop size exceeds the VecLen template parameter; with other backends, for now, you get a silently wrong answer that will be very difficult to track down.

There are two ways to tell YAKL the inner loop size. You can pass LaunchConfig<max_vector_length>(), which automatically sets the inner loop size to max_vector_length for you. Or you can pass LaunchConfig<max_vector_length>().set_inner_size(actual_vector_length) to parallel_outer, making sure that actual_vector_length <= max_vector_length.
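As a sketch, suppose the inner loop extent nz is only known at runtime but is guaranteed never to exceed a compile-time constant (NZ_MAX here is a hypothetical user-chosen bound):

int constexpr NZ_MAX = 128;   // hypothetical compile-time upper bound on nz
parallel_outer( "mylabel" , Bounds<2>(ny,nx) , YAKL_LAMBDA (int j, int i, InnerHandler handler) {
  parallel_inner( nz , [&] (int k) {
    // My lambda code
  } , handler );
} , LaunchConfig<NZ_MAX>().set_inner_size(nz) );   // requires nz <= NZ_MAX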

The InnerHandler object you accept in the functor or lambda you pass to parallel_outer is needed by some of the backends, and therefore it must be used with all backends for portability. Pass this object to every inner routine, as shown in the example above.

parallel_inner

The yakl::c::parallel_inner and yakl::fortran::parallel_inner functions distribute parallel threads over a CUDA "block" or an OpenACC "vector" (necessarily nested within a parallel_outer function). This is the finest level of parallelism on GPUs. It launches the passed lambda or functor over all inner-loop threads in parallel, with no guarantee of execution order. The function has the following signature:

template <class F, int N, bool simple>
inline void parallel_inner( Bounds<N,simple> const &bounds , F const &f , InnerHandler handler );

Call parallel_inner as, e.g.,

parallel_inner(Bounds<2>(nz,nvars), [&] (int k, int ll) {
  // My lambda code
} , handler );

The object handler is an InnerHandler object passed into the lambda called by parallel_outer.

Important!!!: Please use [&] to create your C++ lambda rather than YAKL_LAMBDA. This code already executes on the device, so a simple capture-by-reference lambda is what you need. Most backends will give compile-time warnings or errors if you attempt to use YAKL_LAMBDA to create the lambda expression passed to parallel_inner.

As always, you can pass an integer type or a loop bound expression {lower_bound,upper_bound,stride} in place of a Bounds or SimpleBounds object.
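For example, a strided inner loop might look like the following sketch (assuming the same {lower_bound,upper_bound,stride} conventions as parallel_for in your chosen namespace):

// Sketch: inner loop over every other k
parallel_inner( {1,nz-1,2} , [&] (int k) {
  // My lambda code
} , handler );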

single_inner

The yakl::c::single_inner and yakl::fortran::single_inner functions execute a single thread at the same parallelism level as parallel_inner (necessarily nested within a parallel_outer function). single_inner does not synchronize threads before or after it runs. Without a fence_inner(handler) before and after, the other threads at the inner level continue executing in parallel with no guarantee of order. Therefore, single_inner should nearly always be surrounded by fence_inner() calls to ensure the single thread executes in isolation. The function has the following signature:

template <class F>
inline void single_inner( F const &f , InnerHandler handler );

Call single_inner as, e.g.,

yakl::fence_inner(handler);
single_inner( [&] () {
  // My lambda code
} , handler );
yakl::fence_inner(handler);

As with parallel_inner, the user must pass an InnerHandler object. The lambda passed to single_inner must accept zero parameters, and again, the lambda must be created with [&] and not YAKL_LAMBDA.

fence_inner

The YAKL inline void fence_inner( InnerHandler handler ) routine synchronizes threads within the inner level of parallelism (a CUDA "block" or OpenACC "vector"). It is equivalent to __syncthreads() in CUDA and HIP, for instance.
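As a sketch, fence_inner is what guards a read-after-write dependence between two inner loops (the array names here are hypothetical):

parallel_inner( nz , [&] (int k) {
  tmp(j,i,k) = state(j,i,k);     // all inner threads write tmp
} , handler );
fence_inner( handler );          // make the writes to tmp visible to all inner threads
parallel_inner( nz , [&] (int k) {
  out(j,i,k) = tmp(j,i,k) * 2;   // now safe to read tmp
} , handler );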

fence_inner must be nested within a parallel_outer call; the GPU backends will likely produce errors if you call it anywhere else.

Inner loop reductions

These are under development and are not available yet.
