
Floating point determinism / bitwise reproducibility


How to achieve bit-for-bit reproducibility with YAKL

YAKL supports floating point determinism, also called bit-for-bit reproducibility, for GPU runs; it is fully validated for the HIP and CUDA backends. There are only two places where bit-for-bit reproducibility can potentially fail with YAKL: sum reductions and atomicAdd. In these cases we often have no control over the order of operations, and floating point arithmetic is not associative.
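As a quick standalone illustration (plain C++, independent of YAKL), summing the same three numbers in two different orders rounds to two different doubles:

#include <iostream>
#include <iomanip>

int main() {
  double a = 0.1, b = 0.2, c = 0.3;
  // Same operands, different grouping, different rounded results
  std::cout << std::setprecision(17) << (a + b) + c << "\n";  // 0.60000000000000009
  std::cout << std::setprecision(17) << a + (b + c) << "\n";  // 0.59999999999999998
}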

It turns out that YAKL's reliance on CUB and hipCUB for reductions is beneficial in this regard, as each of those libraries guarantees run-to-run determinism on the same GPU.
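For example, a device-side reduction written with YAKL's intrinsics is already deterministic from run to run on the same GPU. This is a minimal sketch that assumes the using declarations from the full example below:

int constexpr n = 1024*16;
Array<double,1,memDevice,styleC> data("data",n);
parallel_for( n , YAKL_LAMBDA (int i) { data(i) = yakl::Random(i).genFP<double>(); });
double total = yakl::intrinsics::sum(data);  // dispatches to CUB / hipCUB: same bits on every run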

For atomics, the developer needs to be involved, but the effort is low. First, every parallel_for launch that contains an atomicAdd call needs an extra parameter, yakl::DefaultLaunchConfigB4b(), added to the end of the parallel_for call. This tells YAKL that if the YAKL_B4B macro is passed in from the compiler, the kernel should be run serially on the host rather than in parallel on the GPU. If YAKL_B4B is not defined, the kernel runs efficiently in parallel on the GPU as normal, without bitwise reproducibility. An example is below:

#include "YAKL.h"

using yakl::Array;
using yakl::styleC;
using yakl::memDevice;
using yakl::c::parallel_for;
using yakl::c::Bounds;
using yakl::c::SimpleBounds;
using yakl::COLON;

typedef double real;
int constexpr n = 1024*16;
Array<real,1,memDevice,styleC> data("data",n);

parallel_for( n , YAKL_LAMBDA (int i) {
  data(i) = yakl::Random(i).genFP<real>();  // Rand # in [0,1]
});

for (int k=0; k < 10; k++) {
  yakl::ScalarLiveOut<real> sum(0.);  // Scalars written to in kernels must be
                                      // allocated on the device via ScalarLiveOut

  // yakl::intrinsics::sum(data) is clearly the more sensible choice for this
  // situation, but this is for demonstration only
  parallel_for( n , YAKL_LAMBDA (int i) {
    yakl::atomicAdd( sum() , data(i) );
  } , yakl::DefaultLaunchConfigB4b() );

  std::cout << std::scientific << std::setprecision(18) << sum.hostRead() << "\n";
}

If you compile this code with -DYAKL_B4B passed in YAKL_CUDA_FLAGS, YAKL_HIP_FLAGS, or YAKL_SYCL_FLAGS, then every call of the kernel gives the exact same floating point result. If you compile for the GPU without that flag, you will get a slightly different result on each call.

What is going on under the hood?

A kernel will only run on the host if the YAKL_B4B macro is defined and yakl::DefaultLaunchConfigB4b() is passed at the end of that kernel's parallel_for call. The yakl::DefaultLaunchConfigB4b() argument carries a hidden template parameter that tells parallel_for to run the kernel serially on the host whenever YAKL_B4B is defined.
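A simplified, hypothetical sketch of that dispatch (not YAKL's actual source; the names LaunchConfigB4bSketch, parallel_for_sketch, and launch_on_device are made up for illustration) looks roughly like this:

#include <cstdio>

// A tag type carried as a template/overload parameter lets parallel_for choose
// between a serial host loop and the normal device launch at compile time.
struct LaunchConfigB4bSketch {};

template <class F> void launch_on_device(int n, F const &f) {
  // stand-in for the real CUDA/HIP/SYCL kernel launch
  for (int i=0; i < n; i++) { f(i); }
}

template <class F>
void parallel_for_sketch(int n, F const &f, LaunchConfigB4bSketch) {
#ifdef YAKL_B4B
  // Bit-for-bit path: one thread, one fixed order of operations
  for (int i=0; i < n; i++) { f(i); }
#else
  // Normal path: parallel launch, order of atomic updates is not fixed
  launch_on_device(n, f);
#endif
}

int main() {
  double sum = 0;
  parallel_for_sketch(4, [&](int i) { sum += 0.1*i; }, LaunchConfigB4bSketch());
  std::printf("%.17f\n", sum);
}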

When you define YAKL_B4B, YAKL automatically turns on "Managed Memory" (for CUDA and HIP) or "Shared" memory (for SYCL), which allows device data to be accessible on the host. This is how we can run code with yakl::memDevice Array objects on the host when needed.
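Conceptually, the effect is similar to choosing between cudaMallocManaged and cudaMalloc at allocation time. This is a sketch of the idea only, not YAKL's allocator; alloc_device_bytes is a made-up helper:

#include <cstddef>
#include <cuda_runtime.h>

// With YAKL_B4B the allocation is reachable from both host and device, so a
// "device" Array can also be read and written by a serial host loop.
void *alloc_device_bytes(std::size_t bytes) {
  void *ptr = nullptr;
#ifdef YAKL_B4B
  cudaMallocManaged(&ptr, bytes);  // unified/managed memory: valid on host and device
#else
  cudaMalloc(&ptr, bytes);         // device-only: dereferencing on the host would fault
#endif
  return ptr;
}

int main() {
  double *p = static_cast<double*>(alloc_device_bytes(16*sizeof(double)));
  cudaFree(p);  // cudaFree releases either kind of allocation
}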

Gotchas

  1. Since -DYAKL_B4B turns on Managed memory, you will no longer be able to detect cases where device data is mistakenly used on the host. So please keep one test with -DYAKL_B4B enabled to check for bitwise differences from the baseline, and another test without that flag to make sure you don't hit segfaults or YAKL error messages from those situations.
  2. You cannot run a YAKL_DEVICE_LAMBDA on the host. So please do not add yakl::DefaultLaunchConfigB4b() to the end of a parallel_for call that uses YAKL_DEVICE_LAMBDA; use YAKL_LAMBDA instead (see the sketch below).
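The sketch below illustrates gotcha 2, reusing the sum and data objects from the example above; the second launch is shown only as the pattern to avoid:

// OK: YAKL_LAMBDA can run on the device or, with YAKL_B4B defined, serially on the host
parallel_for( n , YAKL_LAMBDA (int i) {
  yakl::atomicAdd( sum() , data(i) );
} , yakl::DefaultLaunchConfigB4b() );

// NOT OK: YAKL_DEVICE_LAMBDA can only run on the device, so it must not be
// combined with yakl::DefaultLaunchConfigB4b()
parallel_for( n , YAKL_DEVICE_LAMBDA (int i) {
  yakl::atomicAdd( sum() , data(i) );
} , yakl::DefaultLaunchConfigB4b() );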