Enabling concurrency using CUDA streams


There are four types of concurrency on a GPU:

  1. CPU/GPU concurrency
  2. Kernel processing concurrency
  3. Memcpy/kernel processing concurrency
  4. Multi-GPU concurrency

In implementing strip clustering on the GPU, we have enabled the first three types of concurrency. Here we describe each of them and how it is used in strip clustering.

CPU/GPU concurrency refers to the synchronization behavior between the CPU and the GPU with respect to GPU calls. Synchronous means the CPU waits for the issued GPU call to complete before moving to the next instruction. Asynchronous means the issued GPU call returns immediately, so the CPU can move to the next instruction without waiting for the call to complete. In CUDA, the following GPU calls are asynchronous, and the rest are not:

  1. Kernel launch
  2. cudaMemcpyAsync
  3. cudaMemsetAsync
  4. cudaMemcpy within the same device (device to device)
  5. cudaMemcpy from host to device for data of 64 kB or less

It is important that all CUDA calls we expect to run concurrently are among those listed above.
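A minimal sketch of CPU/GPU concurrency (the kernel name and sizes are illustrative, not taken from the strip-clustering code): the kernel launch returns immediately, so the CPU reaches the next instruction while the GPU may still be working.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Illustrative kernel: doubles every element of a device array.
__global__ void busyKernel(float* data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= 2.0f;
}

int main() {
  const int n = 1 << 20;
  float* d_data;
  cudaMalloc(&d_data, n * sizeof(float));

  // Asynchronous: the launch returns immediately, so the CPU prints the
  // message below while the GPU is (potentially) still running the kernel.
  busyKernel<<<(n + 255) / 256, 256>>>(d_data, n);
  printf("CPU continues while the kernel runs\n");

  // Explicit synchronization point: the CPU now waits for the GPU.
  cudaDeviceSynchronize();
  cudaFree(d_data);
  return 0;
}
```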

Kernel concurrency refers to multiple kernels running concurrently on the GPU. This is enabled by CUDA streams. A stream is a queue of device work: the host places work in the queue and continues on immediately, and the device schedules work from streams when resources are free. Within a given stream, operations are performed in sequence, while operations in different streams can be performed in parallel.
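A sketch of kernel concurrency under the same caveat (kernelA/kernelB are illustrative names): two independent kernels issued to two different streams may be executed concurrently by the device.

```cuda
#include <cuda_runtime.h>

__global__ void kernelA(float* x) { x[threadIdx.x] += 1.0f; }
__global__ void kernelB(float* y) { y[threadIdx.x] *= 2.0f; }

void launchConcurrently(float* d_x, float* d_y) {
  cudaStream_t s1, s2;
  cudaStreamCreate(&s1);
  cudaStreamCreate(&s2);

  // Each stream is an in-order queue of device work; the two queues are
  // independent, so the device is free to overlap these kernels.
  kernelA<<<1, 256, 0, s1>>>(d_x);
  kernelB<<<1, 256, 0, s2>>>(d_y);

  // Within s1 (or s2) operations run in sequence; across s1 and s2 they
  // may run in parallel when resources allow.
  cudaStreamSynchronize(s1);
  cudaStreamSynchronize(s2);
  cudaStreamDestroy(s1);
  cudaStreamDestroy(s2);
}
```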

Unless otherwise specified, all calls are placed into the default stream, often referred to as "stream 0". The default stream synchronizes with all other streams: an operation placed in it will not start until all preceding operations on the GPU have completed, and no later operation in any stream will start until it has finished. There are two approaches we can use to avoid synchronization between the default and non-default streams (a sketch follows the list):

  1. create non-default streams with the non-blocking flag, e.g., cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking)
  2. for CUDA 7.0 and beyond, use "--default-stream per-thread" compiler flag
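A sketch of approach 1 (the helper function name is ours): streams created with the cudaStreamNonBlocking flag do not synchronize with the legacy default stream. Approach 2 needs no code change; it is the compiler flag shown in the comment.

```cuda
#include <cuda_runtime.h>

// Approach 1: a non-blocking stream neither waits for nor is waited on by
// work placed in the legacy default stream.
cudaStream_t makeNonBlockingStream() {
  cudaStream_t stream;
  cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);
  return stream;
}

// Approach 2 (CUDA 7.0 and beyond) is a build-time switch instead:
//   nvcc --default-stream per-thread file.cu
// Each host thread then gets its own default stream that behaves like a
// regular, non-synchronizing stream.
```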

Memcpy/kernel concurrency is enabled by asynchronous memory copies, streams, and pinned host memory. An asynchronous memory copy returns before the GPU has performed the copy. Because the GPU operates autonomously and can read and write host memory without any operating-system involvement, only pinned memory is eligible for asynchronous memcpy. Thus it is important that we use cudaMallocHost for allocating pinned host memory.
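A sketch of the resulting copy/kernel pipeline (buffer sizes and the kernel are illustrative): the host returns immediately after queuing all three operations, and the copies can overlap with work placed in other streams.

```cuda
#include <cuda_runtime.h>

__global__ void process(float* d, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) d[i] += 1.0f;
}

void overlappedPipeline(int n) {
  float *h_buf, *d_buf;
  cudaMallocHost(&h_buf, n * sizeof(float));  // pinned host memory
  cudaMalloc(&d_buf, n * sizeof(float));

  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Queued in one stream: ordered with respect to each other, but the host
  // does not block, and other streams can overlap with these copies.
  cudaMemcpyAsync(d_buf, h_buf, n * sizeof(float),
                  cudaMemcpyHostToDevice, stream);
  process<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
  cudaMemcpyAsync(h_buf, d_buf, n * sizeof(float),
                  cudaMemcpyDeviceToHost, stream);

  cudaStreamSynchronize(stream);
  cudaStreamDestroy(stream);
  cudaFree(d_buf);
  cudaFreeHost(h_buf);
}
```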

Here we list the operations that block concurrency. We should watch out for these operations when trying to ensure concurrency.

  1. cudaDeviceSynchronize()
  2. cudaStreamSynchronize()
  3. cudaEventSynchronize()
  4. cudaMemcpy
  5. The CUDA_LAUNCH_BLOCKING environment variable being set
  6. Memory allocation on device or host
  7. Destroying objects such as CUDA streams and CUDA events
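One consequence of items 6 and 7, sketched below with illustrative names: keep the blocking operations (allocation, stream creation, destruction) outside the per-event code path, so the hot path contains only asynchronous, stream-ordered calls.

```cuda
#include <cuda_runtime.h>

// Allocated once and reused, so no blocking allocation happens per event.
struct Workspace {
  float* d_buf;
  cudaStream_t stream;
};

Workspace makeWorkspace(size_t n) {
  Workspace w;
  cudaMalloc(&w.d_buf, n * sizeof(float));  // blocking, but done only once
  cudaStreamCreateWithFlags(&w.stream, cudaStreamNonBlocking);
  return w;
}

void processEvent(Workspace& w, const float* h_in, size_t n) {
  // Only asynchronous calls here (h_in must point to pinned memory), so
  // many events can be processed concurrently without stalling each other.
  cudaMemcpyAsync(w.d_buf, h_in, n * sizeof(float),
                  cudaMemcpyHostToDevice, w.stream);
  // ... kernels launched into w.stream ...
}
```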

CUDA events work by asynchronously submitting a command to the GPU via cudaEventRecord(); when the preceding GPU commands have completed, it causes the GPU to write a known value to a 32-bit memory location. We then use cudaEventQuery() or cudaEventSynchronize() to examine that 32-bit value; the CPU will wait at cudaEventSynchronize() until completion before proceeding. cudaEventRecord() and cudaEventSynchronize()/cudaEventQuery() can be used together to enable interstream synchronization. Note that the stream parameter to cudaEventRecord() is for interstream synchronization, not for timing; when using CUDA events for timing, it is best to record them in the NULL stream.
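A sketch of interstream synchronization with an event (the producer/consumer names are illustrative): the stream parameter of cudaEventRecord() ties the event to the producing stream, and cudaStreamWaitEvent() makes the consuming stream wait on the device side, without blocking the CPU.

```cuda
#include <cuda_runtime.h>

__global__ void produce(float* d) { d[threadIdx.x] = (float)threadIdx.x; }
__global__ void consume(float* d) { d[threadIdx.x] *= 2.0f; }

void producerConsumer(float* d_buf) {
  cudaStream_t s1, s2;
  cudaEvent_t done;
  cudaStreamCreate(&s1);
  cudaStreamCreate(&s2);
  cudaEventCreate(&done);

  produce<<<1, 256, 0, s1>>>(d_buf);
  cudaEventRecord(done, s1);  // completes once produce() has finished in s1

  // s2 will not run consume() until the event recorded in s1 completes;
  // the CPU does not block here.
  cudaStreamWaitEvent(s2, done, 0);
  consume<<<1, 256, 0, s2>>>(d_buf);

  cudaStreamSynchronize(s2);
  cudaEventDestroy(done);
  cudaStreamDestroy(s1);
  cudaStreamDestroy(s2);
}
```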

Ref:

  1. Nicholas Wilt, The CUDA Handbook: A Comprehensive Guide to GPU Programming, https://learning.oreilly.com/library/view/the-cuda-handbook/9780133261516/ch06.html
  2. GPU Pro Tip: CUDA 7 Streams Simplify Concurrency, https://devblogs.nvidia.com/gpu-pro-tip-cuda-7-streams-simplify-concurrency/