Inspecting alpaka's implementations while thinking about zero-copying as part of #1820, I wondered whether alpaka actually supports copying buffers between two GPUs on, e.g., the CUDA backend. Searching alpaka for API calls like cudaMemcpyPeer and cudaDeviceEnablePeerAccess only points me to the documentation here, which says that cudaDeviceEnablePeerAccess is "automatically done when required", but the API is never called inside the alpaka codebase. So I wondered whether CUDA just does that automatically as part of cudaMemcpy when the source and destination are on different GPUs, and whether that is a feature of CUDA that is always present or one that requires some minimum CUDA version or compute architecture. Does anyone know?
Independently, we should have tests and an example demonstrating such a scenario. This concerns all backends, not just CUDA.
By the way, cudaMemcpy and cudaMemcpyAsync cannot copy memory allocated by cudaMallocAsync across different devices. NVIDIA's recommendation is to use cudaMemcpyPeer or cudaMemcpyPeerAsync instead.
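For illustration, a minimal sketch of that recommendation; the device indices, buffer size, and stream setup are made up for the example, and error checking is omitted:

```cpp
#include <cstddef>
#include <cuda_runtime.h>

int main()
{
    std::size_t const bytes = 1u << 20;

    // Stream-ordered allocation from device 0's default memory pool.
    cudaSetDevice(0);
    cudaStream_t s0;
    cudaStreamCreate(&s0);
    void* src = nullptr;
    cudaMallocAsync(&src, bytes, s0);

    // Stream-ordered allocation from device 1's default memory pool.
    cudaSetDevice(1);
    cudaStream_t s1;
    cudaStreamCreate(&s1);
    void* dst = nullptr;
    cudaMallocAsync(&dst, bytes, s1);
    cudaStreamSynchronize(s1); // dst must exist before it is used on another stream

    // cudaMemcpy/cudaMemcpyAsync would not accept these pool allocations for a
    // cross-device copy; cudaMemcpyPeerAsync names both devices explicitly.
    cudaSetDevice(0);
    cudaMemcpyPeerAsync(dst, 1 /*dstDevice*/, src, 0 /*srcDevice*/, bytes, s0);
    cudaStreamSynchronize(s0);

    cudaFreeAsync(src, s0);
    cudaFreeAsync(dst, s1);
    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    return 0;
}
```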
I removed the explicit peer copies a while back in #1400 because cudaMemcpy* handles this automatically, so there was no need to fiddle around with peer copies anymore. It looks like I forgot to remove this from the documentation.
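For reference, a minimal sketch of the behavior that made the explicit peer copies unnecessary, assuming a 64-bit system with unified virtual addressing and at least two CUDA devices (error checking omitted):

```cpp
#include <cstddef>
#include <cuda_runtime.h>

int main()
{
    std::size_t const bytes = 1u << 20;

    float* src = nullptr;
    float* dst = nullptr;

    cudaSetDevice(0);
    cudaMalloc(&src, bytes);

    cudaSetDevice(1);
    cudaMalloc(&dst, bytes);

    // With unified virtual addressing the runtime infers the owning device of
    // each pointer, so a plain cudaMemcpy with cudaMemcpyDefault performs the
    // cross-device copy without any prior cudaDeviceEnablePeerAccess call.
    cudaMemcpy(dst, src, bytes, cudaMemcpyDefault);

    // The explicit variant names both devices and likewise needs no
    // cudaDeviceEnablePeerAccess beforehand.
    cudaMemcpyPeer(dst, 1 /*dstDevice*/, src, 0 /*srcDevice*/, bytes);

    cudaSetDevice(0);
    cudaFree(src);
    cudaSetDevice(1);
    cudaFree(dst);
    return 0;
}
```

Whether such a copy goes directly over the peer-to-peer path or is staged through host memory depends on the platform; cudaDeviceCanAccessPeer can be used to query whether direct peer access between two devices is possible.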