Is deadlock possible when using Gemm with SplitKSerial=true? #317

uiemUI · 2021-09-12T08:56:33Z


Gemm in Cutlass implements the SplitKSerial reduction using semaphores.
According to the implementation code below, the k-th threadblock has to wait until the (k-1)-th threadblock releases the lock. However, the order in which threadblocks are scheduled is undefined. If the device does not have enough resources and SMs to run all the required blocks at once, there is no guarantee that the (k-1)-th block is executing or has completed while the k-th block is executing, in which case a deadlock may occur.

    // Wait on the semaphore - this latency may have been covered by iterator construction
    if (kSplitKSerial && params.grid_tiled_shape.k() > 1) {
        
      // For subsequent threadblocks, the source matrix is held in the 'D' tensor.
      if (threadblock_tile_offset.k()) {
        iterator_C = iterator_D;
      }

      semaphore.wait(threadblock_tile_offset.k());

      __threadfence();
    }

    // Execute the epilogue operator to update the destination tensor.
    epilogue(output_op, iterator_D, accumulators, iterator_C); 
    
    //
    // Release the semaphore
    //

    if (kSplitKSerial && params.grid_tiled_shape.k() > 1) {
      
      int lock = 0;
      if (params.grid_tiled_shape.k() == threadblock_tile_offset.k() + 1) {

        // The final threadblock resets the semaphore for subsequent grids.
        lock = 0;
      }
      else {
        // Otherwise, the semaphore is incremented
        lock = threadblock_tile_offset.k() + 1;
      }

      __threadfence();
      semaphore.release(lock);
    }
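
The concern above can be made concrete with a small host-side model. This is only a sketch under stated assumptions, not CUTLASS code and not a description of real hardware scheduling: the function name, the dispatch order, and the number of resident slots are all hypothetical, and a resident block is assumed to leave only when it finishes. Block k finishes only when the semaphore value equals k, mirroring semaphore.wait(k) followed by semaphore.release(k + 1).

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Toy model (an assumption, not CUTLASS or hardware behavior).
// dispatch_order must be a permutation of 0..split_k-1 for one output tile.
// At most resident_slots blocks are resident at once, and a resident block
// leaves only when it finishes. Returns true if every block finishes.
bool serial_split_k_completes(const std::vector<int>& dispatch_order,
                              std::size_t resident_slots) {
  assert(resident_slots > 0);
  const int split_k = static_cast<int>(dispatch_order.size());
  std::deque<int> pending(dispatch_order.begin(), dispatch_order.end());
  std::vector<int> resident;
  int lock = 0;      // semaphore word, initially 0
  int finished = 0;  // number of blocks that have released the semaphore

  while (finished < split_k) {
    // Fill free slots in dispatch order.
    while (resident.size() < resident_slots && !pending.empty()) {
      resident.push_back(pending.front());
      pending.pop_front();
    }
    // Look for a resident block whose wait(k) is satisfied.
    bool progressed = false;
    for (std::size_t i = 0; i < resident.size(); ++i) {
      if (resident[i] == lock) {
        // The epilogue would run here, then release(k + 1),
        // or release(0) for the final slice.
        lock = (resident[i] + 1 == split_k) ? 0 : resident[i] + 1;
        resident.erase(resident.begin() + static_cast<std::ptrdiff_t>(i));
        ++finished;
        progressed = true;
        break;
      }
    }
    // Every resident block is spinning and none can be swapped out.
    if (!progressed) return false;
  }
  return true;
}
```

In this model, serial_split_k_completes({3, 2, 1, 0}, 1) reports a deadlock, because block 3 holds the only slot while waiting for block 2. The same order with 4 slots completes, because all blocks are resident together. Whether real hardware can produce such a dispatch order is exactly the open question of this thread.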

Peter9606 · 2021-09-13T01:45:03Z


As you can see, the semaphore wait is essentially a memory load instruction, and the loop condition inside the wait function (state != status) depends on the loaded data. That dependency gives the scheduler a chance to switch to another warp or threadblock, so a deadlock won't happen.

1 reply

uiemUI Sep 13, 2021
Author

The scheduler may not be able to switch to other blocks if the SM does not have enough resources. I referenced the link and used the following code on my GPU (a Tesla T4 with 40 SMs and a maximum of 1024 threads per multiprocessor) to run a test.

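#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Assumed upper bound on the SM count; the original post does not show this definition.
#define MAX_SM 128

// Assumed error-checking helper; the original post does not show this definition.
#define CUDA_CHECK(call)                                              \
  do {                                                                \
    cudaError_t err_ = (call);                                        \
    if (err_ != cudaSuccess) {                                        \
      fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_));  \
      exit(1);                                                        \
    }                                                                 \
  } while (0)

__device__ __forceinline__ uint32_t __smid() {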
  uint32_t smid;
  asm volatile("mov.u32 %0, %%smid ;" : "=r"(smid));
  return smid;
}

__device__ __forceinline__ int acquire_load(int *addr) {
  int load_val;
  asm volatile("ld.global.acquire.gpu.b32 %0, [%1];\n"
               : "=r"(load_val)
               : "l"(addr));
  return load_val;
}


__device__ volatile int blocks_completed = 0;

__device__ int first_SM[MAX_SM];

__global__ void tkernel(int num_blocks, int num_SMs) {
  if (threadIdx.x == 0) {
    int my_SM = __smid();
    printf(" smid %d , block id %d\n",my_SM,blockIdx.x);
    int im_not_first = atomicCAS(first_SM + my_SM, 0, 1);
    if (!im_not_first) {
      while (acquire_load((int *)(&blocks_completed)) <1)
        ;
    }
    atomicAdd((int *)&blocks_completed, 1);
  }
}

int main(int argc, char *argv[]){
  unsigned my_dev = 0;
  if (argc > 1) my_dev = atoi(argv[1]);
  CUDA_CHECK(cudaSetDevice(my_dev));
  int tot_SM = 0;
  CUDA_CHECK(cudaDeviceGetAttribute(&tot_SM, cudaDevAttrMultiProcessorCount, my_dev));
  if (tot_SM > MAX_SM) {printf("program configuration error\n"); return 1;}
  printf("running on device %d, with %d SMs\n", my_dev, tot_SM);
  int temp[MAX_SM];
  for (int i = 0; i < MAX_SM; i++) temp[i] = 0;
  cudaMemcpyToSymbol(first_SM, temp, MAX_SM*sizeof(int));
  tkernel<<<tot_SM+1, 1024>>>(tot_SM+1, tot_SM);
  CUDA_CHECK(cudaDeviceSynchronize());
  printf("finish test\n");
}

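The first block on each SM waits until any block completes. When each block has 1024 threads, only one block fits on an SM at a time, so all 40 SMs are occupied by waiting blocks and the remaining block can never be scheduled. A deadlock occurs, and the scheduler does not switch blocks.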
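
The arithmetic behind this test can be sketched as follows. The helper names are hypothetical, and the model considers only the thread limit per SM. Registers, shared memory, and the per-SM block limit are ignored, so the result is an upper bound on residency rather than a real occupancy calculation.

```cpp
#include <cassert>

// Thread-limited residency only (assumption: registers, shared memory and
// the per-SM block limit are ignored, so this is an upper bound).
int resident_blocks_per_sm(int max_threads_per_sm, int threads_per_block) {
  assert(threads_per_block > 0);
  return max_threads_per_sm / threads_per_block;
}

// True when every block of the grid can be resident at the same time under
// the thread limit alone. If this is false, a block that spins on another
// block's result may wait for a block that cannot be scheduled.
bool grid_fits(int num_sms, int max_threads_per_sm,
               int threads_per_block, int grid_blocks) {
  return grid_blocks <=
         num_sms * resident_blocks_per_sm(max_threads_per_sm, threads_per_block);
}
```

With the numbers from the test (40 SMs, 1024 threads per SM, 1024 threads per block, 41 blocks), grid_fits returns false: there are 40 slots for 41 blocks. With 512 threads per block there are 80 slots and the grid fits.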

hwu36 · 2021-09-17T04:34:09Z

Maintainer

There are two ways to do split-k. One uses a semaphore, the other uses atomic add. The semaphore guarantees deterministic results. Atomic add is faster.
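
The determinism point comes from floating-point addition not being associative. A minimal host-side sketch, with a hypothetical helper name and made-up partial sums: a semaphore fixes the accumulation order across runs, while atomic adds can land in any order.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sums the partial results in the given order. A semaphore-ordered reduction
// always uses the order 0, 1, 2, ...; an atomic-add reduction has no fixed
// order, so any permutation may occur.
float reduce_in_order(const std::vector<float>& partials,
                      const std::vector<int>& order) {
  float acc = 0.0f;
  for (int k : order) {
    assert(k >= 0 && static_cast<std::size_t>(k) < partials.size());
    acc += partials[static_cast<std::size_t>(k)];
  }
  return acc;
}
```

For the partial sums {1e8f, 1.0f, -1e8f}, the order {0, 1, 2} gives 0 under IEEE-754 single precision, because 1e8f + 1.0f rounds back to 1e8f, while the order {0, 2, 1} gives 1. The same inputs can therefore produce different outputs when the order is not fixed.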

Cutlass kernels usually use at most 8 warps (256 threads) per threadblock, so a deadlock caused by the thread count won't happen.

0 replies
