[Thrust] Perform asynchronous allocations by default for the `par_nosync` policy #4204

base: main
Conversation
Perform asynchronous allocations by default for the `par_nosync` policy. This will make algorithms (like scans) that don't have a computation-dependent result but do temporary allocation properly asynchronous under `par_nosync`.
🟨 CI finished in 1h 31m: Pass: 54%/97 | Total: 1d 21h | Avg: 28m 18s | Max: 1h 15m | Hits: 75%/57652

| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| | stdpar |
| | python |
| | CCCL C Parallel Library |
| | Catch2Helper |
Modifications in project or dependencies?

| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| +/- | stdpar |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 97)

| # | Runner |
|---|---|
| 68 | linux-amd64-cpu16 |
| 9 | windows-amd64-cpu16 |
| 6 | linux-arm64-cpu16 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 3 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
| 2 | linux-amd64-gpu-rtx2080-latest-1 |
I believe there is no reason not to use asynchronous allocations. `par_nosync` relates to synchronization at the end of the algorithm, and we should strive to make the internals as efficient as possible.
```cpp
void* ptr;
cudaError_t status = cudaMallocAsync(&ptr, sizeof(T) * n, nullptr);

if (status != cudaSuccess)
{
  cudaGetLastError(); // Clear the CUDA global error state.

  // That didn't work. We could be somewhere where async allocation isn't
  // supported, like Windows, so try again with cudaMalloc.
  status = cudaMalloc(&ptr, sizeof(T) * n);

  if (status != cudaSuccess)
  {
    throw system::detail::bad_alloc(cuda_category().message(status).c_str());
  }
}
```
That should be a compile-time switch, not a runtime one.
What do you mean? `status` is a runtime variable.
```cpp
template <typename Pointer>
_CCCL_HOST void return_temporary_buffer(par_nosync_t&, Pointer ptr, ::cuda::std::ptrdiff_t n)
{
  void* void_ptr = reinterpret_pointer_cast<void*>(ptr);
```
In libcu++ we have `__voidify`, which does the proper cast-to-`void*` thing.
I think the point here is to cast a `thrust::device_pointer<T>` into its underlying `T*`. Since `T*` converts to `void*` automatically, we should be able to just use `thrust::raw_pointer_cast(ptr)` here.
Bernhard is right, the intent is to go from `device_pointer` to `T*`. `reinterpret_pointer_cast` is what we use elsewhere in the `*_temporary_buffer` machinery, which is why we used it. I also think he's right that we can just use `raw_pointer_cast` here.
Co-authored-by: Bernhard Manfred Gruber <[email protected]>
Ideally, we need a way to test the fallback cases of synchronous malloc/free, which I'm told isn't available everywhere. I'll check with the driver team and confirm the platforms where this is the case.
🟨 CI finished in 1h 28m: Pass: 55%/97 | Total: 1d 22h | Avg: 28m 59s | Max: 1h 13m | Hits: 75%/57652

| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| | stdpar |
| | python |
| | CCCL C Parallel Library |
| | Catch2Helper |
Modifications in project or dependencies?

| | Project |
|---|---|
| | CCCL Infrastructure |
| | libcu++ |
| +/- | CUB |
| +/- | Thrust |
| | CUDA Experimental |
| +/- | stdpar |
| +/- | python |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper |
🏃 Runner counts (total jobs: 97)

| # | Runner |
|---|---|
| 68 | linux-amd64-cpu16 |
| 9 | windows-amd64-cpu16 |
| 6 | linux-arm64-cpu16 |
| 6 | linux-amd64-gpu-rtxa6000-latest-1 |
| 3 | linux-amd64-gpu-h100-latest-1 |
| 3 | linux-amd64-gpu-rtx4090-latest-1 |
| 2 | linux-amd64-gpu-rtx2080-latest-1 |
This will make algorithms (like scans) that don't have a computation-dependent result but do temporary allocation properly asynchronous under `par_nosync`.

This is implemented by adding overloads of `(get|return)_temporary_buffer` for `par_nosync` and its corresponding stream-attached form. The overloads just use `cuda(Malloc|Free)Async` directly for now. It falls back to `cuda(Malloc|Free)` in the case that async allocation fails, because I'm told that async allocation doesn't work on some platforms (notably the Windows TCC driver).

When libcu++ memory resources are non-experimental, instead of specializing and using `cuda(Malloc|Free)Async` directly, we should check whether the memory resource attached to the policy supports async, and for `par_nosync` we should default to using the memory resource's async APIs. We'd also have to give `par_nosync` a default policy.

However, for the time being, this works and allows much broader use of `par_nosync`.

Implements #4117.
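To make the benefit concrete, here is a sketch (untested here; it needs a CUDA device and the CCCL headers) of the kind of call that this PR improves: an `exclusive_scan` under `par_nosync` attached to a stream. The scan needs a temporary buffer, and with this change that allocation no longer implies a synchronizing `cudaMalloc`. The stream handling and sizes are illustrative choices, not taken from the PR.

```cuda
#include <thrust/device_vector.h>
#include <thrust/execution_policy.h>
#include <thrust/scan.h>

int main()
{
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  thrust::device_vector<int> d(1 << 20, 1);

  // exclusive_scan returns no computation-dependent result to the host, so
  // under par_nosync.on(stream) the whole call, including the temporary
  // allocation, can now run asynchronously on the stream.
  thrust::exclusive_scan(thrust::cuda::par_nosync.on(stream),
                         d.begin(), d.end(), d.begin());

  cudaStreamSynchronize(stream); // the caller synchronizes when results are needed
  cudaStreamDestroy(stream);
  return 0;
}
```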