Floating point determinism (bitwise reproducibility)
YAKL supports floating point determinism, also called bit-for-bit reproducibility, for GPU runs; this is fully validated for the HIP and CUDA backends. There are only two places where bit-for-bit reproducibility can potentially fail with YAKL: sum reductions and atomicAdd. These are cases where we often have no control over the order of operations, and floating point arithmetic is not associative.
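As a tiny, standalone illustration (not YAKL-specific) of why the order of operations matters, reassociating a floating point sum can change the result:

#include <iostream>
int main() {
  double a = 1.0e16, b = -1.0e16, c = 1.0;
  std::cout << ((a + b) + c) << "\n";  // prints 1: a and b cancel first, so c survives
  std::cout << (a + (b + c)) << "\n";  // prints 0: c is rounded away when added to b
}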
YAKL's reliance on CUB and hipCUB for reductions turns out to be beneficial in this regard: each of those libraries guarantees deterministic results from one run to the next on the same GPU.
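In practice this means sum reductions need no developer intervention. The minimal sketch below assumes YAKL has been initialized and that data is a filled device Array like the one in the full example further down:

// YAKL's device reductions are backed by CUB / hipCUB, so repeated runs on the
// same GPU produce the same bits with no extra effort from the developer.
double total = yakl::intrinsics::sum( data );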
For atomics, the developer will need to be involved, but the effort is low. Every parallel_for launch that contains an atomicAdd call needs an extra parameter, yakl::DefaultLaunchConfigB4b(), added to the end of the parallel_for call. This tells YAKL that if the YAKL_B4B macro is passed in from the compiler, the kernel should be run serially on the host rather than in parallel on the GPU. If the YAKL_B4B macro is not defined, the kernel runs efficiently in parallel on the GPU as normal, without bitwise reproducibility. An example is below:
#include <iostream>
#include <iomanip>
#include "YAKL.h"
using yakl::Array;
using yakl::styleC;
using yakl::memDevice;
using yakl::c::parallel_for;
typedef double real;

int main() {
  yakl::init();
  {
    int constexpr n = 1024*16;
    Array<real,1,memDevice,styleC> data("data",n);
    // Fill the array with pseudo-random values in [0,1]
    parallel_for( n , YAKL_LAMBDA (int i) {
      data(i) = yakl::Random(i).genFP<real>();
    });
    for (int k=0; k < 10; k++) {
      // Scalars written to in kernels must be allocated on the device via ScalarLiveOut
      yakl::ScalarLiveOut<real> sum(0.);
      // yakl::intrinsics::sum(data) is clearly the more sensible choice for this
      // situation, but this is for demonstration only
      parallel_for( n , YAKL_LAMBDA (int i) {
        yakl::atomicAdd( sum() , data(i) );
      } , yakl::DefaultLaunchConfigB4b() );
      std::cout << std::scientific << std::setprecision(18) << sum.hostRead() << "\n";
    }
  }
  yakl::finalize();
}
If you compile this code with -DYAKL_B4B passed in the YAKL_CUDA_FLAGS, YAKL_HIP_FLAGS, or YAKL_SYCL_FLAGS, then each call of the kernel will give the exact same floating point result. If you compile without the flag on a GPU, you will find that you get a different result with each call.
A kernel will run on the host only if the YAKL_B4B macro is defined and yakl::DefaultLaunchConfigB4b() is added at the end of the parallel_for call for that kernel. When you specify yakl::DefaultLaunchConfigB4b(), a hidden template parameter tells parallel_for that, if YAKL_B4B is defined, the kernel should be run serially on the host.
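To make that mechanism concrete, here is a rough sketch of the idea, not YAKL's actual implementation: the launch-config type carries a compile-time flag, and the launcher branches on it only when YAKL_B4B is defined. launch_on_device below is a hypothetical stand-in for the normal GPU launch path.

template <bool B4B = false>
struct LaunchConfigSketch {};

template <class F> void launch_on_device(int n, F const &f);  // hypothetical GPU launch

template <class F, bool B4B>
void parallel_for_sketch(int n, F const &f, LaunchConfigSketch<B4B>) {
#ifdef YAKL_B4B
  if constexpr (B4B) {
    // Serial host loop: a fixed order of operations, hence bitwise-reproducible sums
    for (int i = 0; i < n; i++) { f(i); }
    return;
  }
#endif
  launch_on_device(n, f);  // otherwise, the usual parallel GPU launch
}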
When you define YAKL_B4B, YAKL automatically turns on "Managed Memory" (for CUDA and HIP) or "Shared" memory (for SYCL), which allows device data to be accessed on the host. This is how code with yakl::memDevice Array objects can be run on the host when needed.
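As a small illustration (assuming the data Array from the example above), reading a device Array directly on the host is only safe when that managed/shared memory path is active:

#ifdef YAKL_B4B
  yakl::fence();                    // make sure prior kernels have finished
  // Managed / shared memory is on, so the host can read device data directly
  std::cout << data(0) << "\n";
#endif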
- Since -DYAKL_B4B turns on Managed memory, you will no longer be able to detect issues where you try to use device data on the host. So please have one test with -DYAKL_B4B enabled to check for bitwise differences from the baseline, but also have another test without that flag to ensure you don't encounter segfaults or YAKL error messages from these situations.
- You cannot run a YAKL_DEVICE_LAMBDA on the host. So please do not add yakl::DefaultLaunchConfigB4b() to the end of a parallel_for call that uses YAKL_DEVICE_LAMBDA. You'll need to use YAKL_LAMBDA instead (see the sketch after this list).
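For clarity, here is a minimal sketch of that last rule, reusing the sum and data variables from the earlier example:

// Fine: a YAKL_LAMBDA can run on either the device or, with -DYAKL_B4B, the host
parallel_for( n , YAKL_LAMBDA (int i) {
  yakl::atomicAdd( sum() , data(i) );
} , yakl::DefaultLaunchConfigB4b() );

// Not allowed: a YAKL_DEVICE_LAMBDA cannot execute on the host, so it must not be
// combined with yakl::DefaultLaunchConfigB4b()
// parallel_for( n , YAKL_DEVICE_LAMBDA (int i) {
//   yakl::atomicAdd( sum() , data(i) );
// } , yakl::DefaultLaunchConfigB4b() );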