[STF] reduce access mode #2830

Merged
merged 111 commits on Dec 10, 2024

Commits (111)
ebd96bd
Experiment to start introducing a reduction access mode used in kerne…
caugonnet Nov 15, 2024
4dbbc92
Add a trait to count the number of reductions required in a tuple of …
caugonnet Nov 15, 2024
7826839
WIP: create a new scalar<T> interface which can be used in a reductio…
caugonnet Nov 18, 2024
ff8f4e7
WIP ! Introduce owning_container_of trait class
caugonnet Nov 18, 2024
21535ea
WIP: save progress here, lots of hardcoded things and we need to move…
caugonnet Nov 20, 2024
c32d5b6
WIP : first prototype working...
caugonnet Nov 20, 2024
1446e4e
Proper initialization of shared memory buffers, and add another example
caugonnet Nov 20, 2024
ee65592
Some cleanups and renaming of classes for better clarity
caugonnet Nov 20, 2024
0cda63c
clang-format
caugonnet Nov 20, 2024
388e98e
workaround some false unused captured variable warning
caugonnet Nov 21, 2024
a5716f7
Fix various C++ errors, and do not use the I variable
caugonnet Nov 21, 2024
c198f07
Rework the CFD example to use reductions, and generalize the transfer…
caugonnet Nov 21, 2024
c22c3b7
clang-format
caugonnet Nov 21, 2024
eef3139
Implement transfer_host (name subject to change !) directly in the co…
caugonnet Nov 22, 2024
349b4c7
clang-format
caugonnet Nov 22, 2024
1c1ee34
Make it possible to either accumulate a reduction result with an exis…
caugonnet Nov 22, 2024
58c1b6f
Implement a set of predefined reducers
caugonnet Nov 22, 2024
f4bee86
clang-format
caugonnet Nov 22, 2024
c4191eb
move the definition of do_init and no_init
caugonnet Nov 22, 2024
ebd416d
update word count example
caugonnet Nov 22, 2024
7953e43
Code simplification to facilitate the transition to ::cuda::std::tuple
caugonnet Nov 22, 2024
2fcf0bb
Use ::cuda::std::tuple for reduction variables
caugonnet Nov 22, 2024
f87fd05
use proper type for the size of buffers
caugonnet Nov 22, 2024
c59e3c8
clang-format
caugonnet Nov 22, 2024
8c1be55
remove unused variables
caugonnet Nov 23, 2024
4f38bde
fix buffer size
caugonnet Nov 23, 2024
384510d
add missing typename
caugonnet Nov 23, 2024
ff95d9f
Add missing typename
caugonnet Nov 23, 2024
2946b3a
Add maybe_unused for variables currently unused in a WIP code
caugonnet Nov 23, 2024
4986fa5
clang-format
caugonnet Nov 23, 2024
41cfe8f
add a doxygen comment
caugonnet Nov 25, 2024
53c059c
Add missing constructors
caugonnet Nov 26, 2024
5f61b1a
Code cleanup
caugonnet Nov 26, 2024
23b8feb
remove dead code
caugonnet Nov 26, 2024
6b75b5e
task_dep_op_none should just be a tag type, there is no need to imple…
caugonnet Nov 26, 2024
0fb4110
Remove dead code
caugonnet Nov 26, 2024
10d78e6
Remove unused template parameter
caugonnet Nov 26, 2024
1795649
Slightly simpler count_type trait
caugonnet Nov 26, 2024
44b2bff
clang-format
caugonnet Nov 26, 2024
cf0976b
Add a small unit test to test count_type_v
caugonnet Nov 27, 2024
7e45e99
Do not define both no_init and do_init types anymore, just expose no_…
caugonnet Nov 27, 2024
6f11026
sort examples in cmake
caugonnet Nov 27, 2024
9ddd6e0
clang-format
caugonnet Nov 27, 2024
6323168
Simplify redux_vars
andralex Nov 27, 2024
3e789f6
Use ::std::monostate instead of EmptyType
andralex Nov 27, 2024
cd4f07d
Simplify redux_vars
andralex Nov 27, 2024
1b0c9ed
clang-format
caugonnet Dec 2, 2024
f6fac05
Add a missing doxygen comment
caugonnet Dec 2, 2024
4ca01bf
Replace 01-axpy-reduce.cu with 09-dot-reduce.cu which is a more meani…
caugonnet Dec 2, 2024
cd342a3
clang-format
caugonnet Dec 2, 2024
4d11ef7
fix word count example
caugonnet Dec 3, 2024
14798ae
Minimize copying of dependencies
andralex Dec 3, 2024
40e9b4b
- Fix how we load data in shared memory during the finalization kerne…
caugonnet Dec 4, 2024
bccfd2d
clang-format
caugonnet Dec 4, 2024
b549262
Example to compute pi using Monte Carlo method
caugonnet Dec 4, 2024
5e2a88b
Add a unit test to ensure the reduce access mode works
caugonnet Dec 4, 2024
f908c48
clang-format
caugonnet Dec 4, 2024
2c63730
Not all ascii chars between A and z are alphanumerical chars
caugonnet Dec 4, 2024
f6f9be9
remove dead code
caugonnet Dec 4, 2024
3fb8332
minor cleanups
caugonnet Dec 4, 2024
f826de9
Not all ascii chars between A and z are alphanumerical chars
caugonnet Dec 4, 2024
b029b36
no need for type alias when we use it once only
caugonnet Dec 4, 2024
49e8669
Fix pi test
andralex Dec 4, 2024
2d0c13d
Move reduction operator and init flag to task_dep, step 1
andralex Dec 4, 2024
7f3e4c2
Add a new test to check that the scalar interface works as expected (…
caugonnet Dec 5, 2024
563d3ef
Fully implement the scalar interface
caugonnet Dec 5, 2024
25424d6
fix potentially uninialized variable warnings
caugonnet Dec 5, 2024
1ede4a0
fix unused variable warning
caugonnet Dec 5, 2024
4362a3a
Add a test to ensure we properly deal with empty shapes in parallel_f…
caugonnet Dec 5, 2024
543b423
clang-format
caugonnet Dec 5, 2024
550220f
Implement the CUDA kernel for reduction with empty shapes
caugonnet Dec 5, 2024
d1cb05d
Move reduction operator and init flag to task_dep, step 2: parallel_f…
andralex Dec 5, 2024
3e51d0d
Move reduction operator and init flag to task_dep, step 3: make paral…
andralex Dec 5, 2024
ef86fe4
Fix the finalize kernel if there are more threads than items
caugonnet Dec 6, 2024
1c9cc01
clang-format
caugonnet Dec 6, 2024
872d81e
Implementation of the reduce access mode for CUDA graphs
caugonnet Dec 6, 2024
5223e0d
Test empty shapes with reductions on both stream and graphs
caugonnet Dec 6, 2024
234065b
Move reduction operator and init flag to task_dep, step 4: eliminate …
andralex Dec 6, 2024
1b170c4
clang-format
caugonnet Dec 6, 2024
120aac0
fix parallel_for on host
caugonnet Dec 6, 2024
96bd78d
Disable nvrtc workaround (#1116)
miscco Dec 6, 2024
3bddb50
Tighten overloading of context::parallel_for
andralex Dec 7, 2024
cd6d1e1
clang-format
andralex Dec 7, 2024
96f22f5
Optimize loop function by hoisting lambda definition out of the loop …
andralex Dec 7, 2024
7fb06e9
No need for SelectType
andralex Dec 7, 2024
5f40c75
A few more improvements
andralex Dec 7, 2024
a9489f9
Fix build
andralex Dec 7, 2024
d65f64d
Documentation for reduce()
caugonnet Dec 8, 2024
6cf4fea
Improve doc for reduce()
caugonnet Dec 8, 2024
91ea8a5
Rename transfer_host in wait
caugonnet Dec 8, 2024
cf50f84
doxygen blocks for reducer operators
caugonnet Dec 8, 2024
9f61096
Add missing doxygen blocks or make them more accurate
caugonnet Dec 8, 2024
14216b7
Remove commented code
caugonnet Dec 8, 2024
d0222b5
remove printf
caugonnet Dec 8, 2024
e53539c
Add sanity checks to detect unimplemented uses of reduce()
caugonnet Dec 8, 2024
4d2f92b
Fix a logic error
caugonnet Dec 8, 2024
e7844fe
remove maybe_unused that is not needed anymore
caugonnet Dec 8, 2024
c5660b2
Properly handle reduce on a CUDA graph that is not executed by device 0
caugonnet Dec 8, 2024
288cf77
Reimplement pagerank using a reduce access mode
caugonnet Dec 9, 2024
677f6ce
No need to atomicMaxFloat when using a reduce(reducer::maxval<float>{})
caugonnet Dec 9, 2024
5994828
use references in calculating_pagerank
caugonnet Dec 9, 2024
f106fb9
Add a missing doxygen block for scalar<T>
caugonnet Dec 9, 2024
4460e8d
Remove count_type_v and count_type which are not used anymore
caugonnet Dec 9, 2024
ca32d55
replace an atomic add by a reduction
caugonnet Dec 9, 2024
64c118c
Simpler scalar implementation with a struct
caugonnet Dec 10, 2024
de26df0
Comment to clarify get_owning_container_of
caugonnet Dec 10, 2024
44ebce3
Remove useless ctor
caugonnet Dec 10, 2024
9f27ca5
fix spelling issue
caugonnet Dec 10, 2024
c6e8fa1
clang-format
caugonnet Dec 10, 2024
a6024ca
Explain how we statically dispatch between the different task_dep(_un…
caugonnet Dec 10, 2024
7624ae4
Do provide constructors for scalar<T>
caugonnet Dec 10, 2024
55 changes: 55 additions & 0 deletions cudax/examples/stf/09-dot-reduce.cu
@@ -0,0 +1,55 @@
//===----------------------------------------------------------------------===//
//
// Part of CUDASTF in CUDA C++ Core Libraries,
// under the Apache License v2.0 with LLVM Exceptions.
// See https://llvm.org/LICENSE.txt for license information.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
// SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES.
//
//===----------------------------------------------------------------------===//

/**
* @file
*
* @brief Implementation of the DOT kernel using a reduce access mode
*
*/

#include <cuda/experimental/stf.cuh>

using namespace cuda::experimental::stf;

int main()
{
const size_t N = 16;
double X[N], Y[N];

double ref_res = 0.0;

for (size_t i = 0; i < N; i++)
{
X[i] = cos(double(i));
Y[i] = sin(double(i));

// Compute the reference result of the DOT product of X and Y
ref_res += X[i] * Y[i];
}

context ctx;
auto lX = ctx.logical_data(X);
auto lY = ctx.logical_data(Y);

auto lsum = ctx.logical_data(shape_of<scalar<double>>());

/* Compute sum(x_i * y_i)*/
ctx.parallel_for(lY.shape(), lX.read(), lY.read(), lsum.reduce(reducer::sum<double>{}))
->*[] __device__(size_t i, auto dX, auto dY, double& sum) {
sum += dX(i) * dY(i);
};

double res = ctx.wait(lsum);

ctx.finalize();

_CCCL_ASSERT(fabs(res - ref_res) < 0.0001, "Invalid result");
}
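
The file above is the core pattern this PR introduces: declare a logical scalar from shape_of<scalar<double>>(), hand it to parallel_for through the new reduce() access mode, and read it back synchronously with ctx.wait(). The pattern composes with the other predefined reducers added by this PR; the following minimal sketch (not part of the diff) computes a maximum, assuming reducer::maxval instantiates for double as it does for float in pagerank.cu below:

#include <cuda/experimental/stf.cuh>

using namespace cuda::experimental::stf;

int main()
{
  const size_t N = 16;
  double X[N];
  for (size_t i = 0; i < N; i++)
  {
    X[i] = sin(double(i));
  }

  context ctx;
  auto lX = ctx.logical_data(X);

  // A logical scalar defined only by its shape; the runtime owns the storage
  auto lmax = ctx.logical_data(shape_of<scalar<double>>());

  // Each device thread folds its element into the reduction through a plain
  // reference; no hand-written atomics or shared-memory trees are involved
  ctx.parallel_for(lX.shape(), lX.read(), lmax.reduce(reducer::maxval<double>{}))
      ->*[] __device__(size_t i, auto dX, double& m) {
        m = ::std::max(m, dX(i));
      };

  // wait() blocks until the reduced value is available and returns it
  double res = ctx.wait(lmax);

  ctx.finalize();

  _CCCL_ASSERT(res <= 1.0, "Invalid result");
}
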
11 changes: 7 additions & 4 deletions cudax/examples/stf/CMakeLists.txt
@@ -16,17 +16,18 @@ set(stf_example_sources

# Examples which rely on code generation (parallel_for or launch)
set(stf_example_codegen_sources
01-axpy-parallel_for.cu
01-axpy-launch.cu
01-axpy-parallel_for.cu
binary_fhe.cu
09-dot-reduce.cu
cfd.cu
custom_data_interface.cu
fdtd_mgpu.cu
frozen_data_init.cu
graph_algorithms/degree_centrality.cu
graph_algorithms/jaccard.cu
graph_algorithms/pagerank.cu
graph_algorithms/tricount.cu
graph_algorithms/jaccard.cu
fdtd_mgpu.cu
heat.cu
heat_mgpu.cu
jacobi.cu
@@ -35,11 +36,13 @@ set(stf_example_codegen_sources
launch_sum.cu
launch_sum_cub.cu
logical_gates_composition.cu
mandelbrot.cu
parallel_for_2D.cu
pi.cu
scan.cu
mandelbrot.cu
standalone-launches.cu
word_count.cu
word_count_reduce.cu
)

# Examples using CUBLAS, CUSOLVER...
162 changes: 20 additions & 142 deletions cudax/examples/stf/cfd.cu
@@ -188,86 +188,18 @@ void jacobistepvort(
};
}

template <typename T>
T transfer_host(context& ctx, logical_data<slice<T>>& ldata)
double deltasq(context& ctx, logical_data<slice<double, 2>> lnewarr, logical_data<slice<double, 2>> loldarr)
{
T out;
auto ldsq = ctx.logical_data(shape_of<scalar<double>>()).set_symbol("tmp_accumulator");

bool is_graph = ctx.is_graph_ctx();

if (is_graph)
{
ctx.host_launch(ldata.read()).set_symbol("transfer_host")->*[&](auto data) {
out = data(0);
};

/* This forces the completion of the host callback, so that the host
* thread can use the content for dynamic control flow */
cudaStreamSynchronize(ctx.task_fence());
}
else
{
ctx.task(exec_place::host, ldata.read()).set_symbol("transfer_host")->*[&](cudaStream_t stream, auto data) {
cuda_safe_call(cudaStreamSynchronize(stream));
out = data(0);
};
}

return out;
}

double
deltasq(context& ctx, logical_data<slice<double, 2>> lnewarr, logical_data<slice<double, 2>> loldarr, int m, int n)
{
auto ldsq = ctx.logical_data(shape_of<slice<double>>({1})).set_symbol("tmp_accumulator");

//
// for (i = 1; i <= m; i++) {
// for (j = 1; j <= n; j++) {
// double tmp = newarr[i * (m + 2) + j] - oldarr[i * (m + 2) + j];
// dsq += tmp * tmp;
// }
// }

auto spec = con(con<128>(hw_scope::thread));
ctx.launch(spec, ldsq.write(), lnewarr.read(), loldarr.read()).set_symbol("deltasq")->*
[m, n] __device__(auto th, auto dsq, auto newarr, auto oldarr) {
if (th.rank() == 0)
{
dsq(0) = 0.0;
}
th.sync();

// Each thread computes the sum of elements assigned to it
double local_sum = 0.0;
for (auto [i, j] :
th.apply_partition(box<2>({1, m + 1}, {1, n + 1}), std::tuple<blocked_partition, cyclic_partition>()))
{
double tmp = newarr(i, j) - oldarr(i, j);
local_sum += tmp * tmp;
}

auto ti = th.inner();

__shared__ double block_sum[th.static_width(1)];
block_sum[ti.rank()] = local_sum;

for (size_t s = ti.size() / 2; s > 0; s /= 2)
{
if (ti.rank() < s)
{
block_sum[ti.rank()] += block_sum[ti.rank() + s];
}
ti.sync();
}

if (ti.rank() == 0)
{
atomicAdd(&dsq(0), block_sum[0]);
}
};
ctx.parallel_for(lnewarr.shape(), ldsq.reduce(reducer::sum<double>{}), lnewarr.read(), loldarr.read())
.set_symbol("deltasq")
->*[] __device__(size_t i, size_t j, auto& dsq, auto newarr, auto oldarr) {
double tmp = newarr(i, j) - oldarr(i, j);
dsq += tmp * tmp;
};

return transfer_host(ctx, ldsq);
return ctx.wait(ldsq);
}

void boundarypsi(context& ctx, logical_data<slice<double, 2>> lpsi, int m, int /*n*/, int b, int h, int w)
@@ -422,44 +354,14 @@ int main(int argc, char** argv)
boundarypsi(ctx, lpsi, m, n, b, h, w);

// compute normalisation factor for error
auto lbnorm = ctx.logical_data(shape_of<slice<double>>({1})).set_symbol("bnorm");
auto lbnorm = ctx.logical_data(shape_of<scalar<double>>()).set_symbol("bnorm");

nvtxRangePush("Compute_Normalization");

// bnorm += psi * psi
auto spec = con(con<32>());
ctx.launch(spec, lbnorm.write(), lpsi.read()).set_symbol("Compute_Normalization")
->*[] __device__(auto th, auto bnorm, auto psi) {
if (th.rank() == 0)
{
bnorm(0) = 0.0;
}
th.sync();
// Each thread computes the sum of elements assigned to it
double local_sum = 0.0;
for (auto [i, j] : th.apply_partition(shape(psi)))
{
local_sum += psi(i, j) * psi(i, j);
}

auto ti = th.inner();

__shared__ double block_sum[th.static_width(1)];
block_sum[ti.rank()] = local_sum;

for (size_t s = ti.size() / 2; s > 0; s /= 2)
{
if (ti.rank() < s)
{
block_sum[ti.rank()] += block_sum[ti.rank() + s];
}
ti.sync();
}

if (ti.rank() == 0)
{
atomicAdd(&bnorm(0), block_sum[0]);
}
// bnorm = psi * psi
ctx.parallel_for(lpsi.shape(), lpsi.read(), lbnorm.reduce(reducer::sum<double>{}))
->*[] __device__(size_t i, size_t j, auto psi, auto& bnorm) {
bnorm += psi(i, j) * psi(i, j);
};

if (!irrotational)
@@ -468,37 +370,13 @@ int main(int argc, char** argv)
boundaryzet(ctx, lzet, lpsi, m, n);

// update normalisation
ctx.launch(spec, lbnorm.rw(), lzet.read()).set_symbol("Compute_Normalization")
->*[] __device__(auto th, auto bnorm, auto zet) {
// Each thread computes the sum of elements assigned to it
double local_sum = 0.0;
for (auto [i, j] : th.apply_partition(shape(zet)))
{
local_sum += zet(i, j) * zet(i, j);
}

auto ti = th.inner();

__shared__ double block_sum[th.static_width(1)];
block_sum[ti.rank()] = local_sum;

for (size_t s = ti.size() / 2; s > 0; s /= 2)
{
if (ti.rank() < s)
{
block_sum[ti.rank()] += block_sum[ti.rank() + s];
}
ti.sync();
}

if (ti.rank() == 0)
{
atomicAdd(&bnorm(0), block_sum[0]);
}
ctx.parallel_for(lzet.shape(), lzet.read(), lbnorm.reduce(reducer::sum<double>{}, no_init{}))
->*[] __device__(size_t i, size_t j, auto zet, auto& bnorm_zet) {
bnorm_zet += zet(i, j) * zet(i, j);
};
}

double bnorm = transfer_host(ctx, lbnorm);
double bnorm = ctx.wait(lbnorm);
bnorm = sqrt(bnorm);

// begin iterative Jacobi loop
@@ -525,11 +403,11 @@ int main(int argc, char** argv)
bool compute_error = (iter == numiter) || (checkerr && (iter % printfreq == 0));
if (compute_error)
{
error = deltasq(ctx, lpsitmp, lpsi, m, n);
error = deltasq(ctx, lpsitmp, lpsi);

if (!irrotational)
{
error += deltasq(ctx, lzettmp, lzet, m, n);
error += deltasq(ctx, lzettmp, lzet);
}

error = sqrt(error);
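
One detail in the cfd.cu changes deserves a callout: the second normalisation pass uses reduce(reducer::sum<double>{}, no_init{}), which accumulates the zet contribution on top of the value lbnorm already holds instead of restarting from the sum identity, mirroring the old rw() + atomicAdd pattern. A minimal standalone sketch of that two-pass idiom (arrays A and B are hypothetical, not from the PR):

#include <cuda/experimental/stf.cuh>

using namespace cuda::experimental::stf;

int main()
{
  const size_t N = 16;
  double A[N], B[N];
  for (size_t i = 0; i < N; i++)
  {
    A[i] = cos(double(i));
    B[i] = sin(double(i));
  }

  context ctx;
  auto lA   = ctx.logical_data(A);
  auto lB   = ctx.logical_data(B);
  auto lsum = ctx.logical_data(shape_of<scalar<double>>());

  // First pass: the reduction starts from the operator's identity (0 for sum)
  ctx.parallel_for(lA.shape(), lA.read(), lsum.reduce(reducer::sum<double>{}))
      ->*[] __device__(size_t i, auto dA, double& s) {
        s += dA(i) * dA(i);
      };

  // Second pass: no_init{} accumulates on top of the value produced above,
  // so the final scalar holds sum(A*A) + sum(B*B)
  ctx.parallel_for(lB.shape(), lB.read(), lsum.reduce(reducer::sum<double>{}, no_init{}))
      ->*[] __device__(size_t i, auto dB, double& s) {
        s += dB(i) * dB(i);
      };

  double total = ctx.wait(lsum);
  ctx.finalize();

  _CCCL_ASSERT(total > 0.0, "Invalid result");
}
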
57 changes: 16 additions & 41 deletions cudax/examples/stf/graph_algorithms/pagerank.cu
@@ -21,22 +21,6 @@

using namespace cuda::experimental::stf;

/**
* @brief Performs an atomic maximum operation on floating-point numbers by reinterpreting them as integers.
*
* @param address Pointer to the float value that will be updated.
* @param val The float value to compare and possibly set at the address.
* @return The old value at the address (reinterpreted as a float).
*/
__device__ float atomicMaxFloat(float* address, float val)
{
int* address_as_int = (int*) address;
int old = *address_as_int;
int new_val = __float_as_int(val);
atomicMax(address_as_int, new_val);
return __int_as_float(old);
}

/**
* @brief Calculates the PageRank for a given vertex.
*
@@ -49,10 +33,10 @@ __device__ float atomicMaxFloat(float* address, float val)
*/
__device__ void calculating_pagerank(
int idx,
slice<const int> loffsets,
slice<const int> lnonzeros,
slice<const float> lpage_rank,
slice<float> lnew_page_rank,
const slice<const int>& loffsets,
const slice<const int>& lnonzeros,
const slice<const float>& lpage_rank,
slice<float>& lnew_page_rank,
float init_rank)
{
float rank_sum = 0.0;
@@ -77,7 +61,6 @@ int main()
int num_vertices = offsets.size() - 1;
float init_rank = 1.0f / num_vertices;
float tolerance = 1e-6f;
float max_diff = 0.0f;
int NITER = 100;

// output pageranks for each vertex
@@ -88,34 +71,26 @@
auto lnonzeros = ctx.logical_data(&nonzeros[0], nonzeros.size());
auto lpage_rank = ctx.logical_data(&page_rank[0], page_rank.size());
auto lnew_page_rank = ctx.logical_data(&new_page_rank[0], new_page_rank.size());
auto lmax_diff = ctx.logical_data(&max_diff, {1});
auto lmax_diff = ctx.logical_data(shape_of<scalar<float>>());

for (int iter = 0; iter < NITER; ++iter)
{
// Calculate Current Iteration PageRank
ctx.parallel_for(box(num_vertices), loffsets.read(), lnonzeros.read(), lpage_rank.rw(), lnew_page_rank.rw())
->*[init_rank] __device__(size_t idx, auto loffsets, auto lnonzeros, auto lpage_rank, auto lnew_page_rank) {
ctx.parallel_for(
box(num_vertices),
loffsets.read(),
lnonzeros.read(),
lpage_rank.rw(),
lnew_page_rank.rw(),
lmax_diff.reduce(reducer::maxval<float>{}))
->*[init_rank] __device__(
size_t idx, auto loffsets, auto lnonzeros, auto lpage_rank, auto lnew_page_rank, auto& max_diff) {
calculating_pagerank(idx, loffsets, lnonzeros, lpage_rank, lnew_page_rank, init_rank);
};

// Calculate Current Iteration Error
ctx.parallel_for(box(1), lmax_diff.write())->*[] __device__(size_t, auto lmax_diff) {
lmax_diff(0) = 0.0f;
};

// Calculate Current Iteration Error
ctx.parallel_for(box(num_vertices), lpage_rank.read(), lnew_page_rank.read(), lmax_diff.rw())
->*[] __device__(size_t idx, auto lpage_rank, auto lnew_page_rank, auto lmax_diff) {
atomicMaxFloat(lmax_diff.data_handle(), fabs(lnew_page_rank[idx] - lpage_rank[idx]));
max_diff = ::std::max(max_diff, lnew_page_rank[idx] - lpage_rank[idx]);
};

// Reduce Error and Check for Convergence
bool converged;
ctx.task(exec_place::host, lmax_diff.read())->*[tolerance, &converged](cudaStream_t s, auto max_diff) {
cuda_safe_call(cudaStreamSynchronize(s));
converged = (max_diff(0) < tolerance);
};

bool converged = (ctx.wait(lmax_diff) < tolerance);
if (converged)
{
break;
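
The pagerank.cu rewrite shows the other payoff of the reduce access mode: the hand-rolled atomicMaxFloat helper is deleted, and the convergence metric comes out of the same parallel_for that updates the ranks. A minimal sketch of that fused update-plus-error shape (the 0.5f update rule and the 1e-6f tolerance are placeholders, not the PR's logic):

#include <cuda/experimental/stf.cuh>

#include <vector>

using namespace cuda::experimental::stf;

int main()
{
  const size_t N = 1024;
  ::std::vector<float> vals(N, 1.0f);

  context ctx;
  auto lvals = ctx.logical_data(&vals[0], vals.size());
  auto ldiff = ctx.logical_data(shape_of<scalar<float>>());

  // A single kernel updates the data and feeds the convergence metric:
  // each thread folds |updated - old| into the max-reduction, no atomics
  ctx.parallel_for(box(N), lvals.rw(), ldiff.reduce(reducer::maxval<float>{}))
      ->*[] __device__(size_t i, auto v, float& max_diff) {
        float updated = 0.5f * v(i);
        max_diff = ::std::max(max_diff, fabsf(updated - v(i)));
        v(i) = updated;
      };

  // The host decides whether to iterate again based on the reduced value
  bool converged = (ctx.wait(ldiff) < 1e-6f);
  ctx.finalize();

  return converged ? 0 : 1;
}
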