From a8a8890521cbe530b55db963dbaf9e39a2071104 Mon Sep 17 00:00:00 2001
From: Julian Arnesino
Date: Thu, 28 Nov 2024 14:50:51 -0300
Subject: [PATCH] Clarifications in the post "Parallel Programming with CUDA"

---
 ...ming-with-CUDA-Parallel-programs-optimized-for-GPU.md | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/_posts/2024-11-27-Parallel-Programming-with-CUDA-Parallel-programs-optimized-for-GPU.md b/_posts/2024-11-27-Parallel-Programming-with-CUDA-Parallel-programs-optimized-for-GPU.md
index e46cea1..546d50e 100644
--- a/_posts/2024-11-27-Parallel-Programming-with-CUDA-Parallel-programs-optimized-for-GPU.md
+++ b/_posts/2024-11-27-Parallel-Programming-with-CUDA-Parallel-programs-optimized-for-GPU.md
@@ -397,13 +397,12 @@ Say the grid size is $G$.
 
 As soon as we load our data to shared memory in the block, we can make threads add all the elements outside the first G elements of the array.
 That way, when the blocks start working, we know we have an array of size G.
 
-This comes with two benefits:
-1. It can handle arrays bigger than the grid.
-   - The array is reduced to size G when loading the elements to shared memory.
-2. It provides maximum memory coalescing.
-   - We sum the array until it is reduced to size G by looping with a grid-stride.
+This comes with three benefits:
+1. **It can handle arrays bigger than the grid.** The array is reduced to size G when loading the elements to shared memory.
+2. **It provides maximum memory coalescing.** We sum the array until it is reduced to size G by looping with a grid-stride.
    Therefore, memory access between warps is unit-stride every time.
    This means all accesses from threads in the same warp are going to be in consecutive addresses.
+3. **It amortizes the cost of creating and destroying threads** that would come from launching more threads than the grid size.
 
 The following is an implementation of that approach:
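
For reference, here is a minimal sketch of the approach the patched section describes: a grid-stride load that folds every element beyond the first G into shared memory, followed by a block-level tree reduction. The kernel name `reduceSum` and the final `atomicAdd` combine are illustrative assumptions, not necessarily how the post's own implementation (which follows the patched text in the original file) is written.

```cuda
#include <cuda_runtime.h>

// Assumes blockDim.x is a power of two and *out is zero-initialized.
__global__ void reduceSum(const float *in, float *out, int n) {
    extern __shared__ float sdata[];

    // Grid-stride load: with G = gridDim.x * blockDim.x total threads,
    // every element beyond the first G is summed here, so the data that
    // reaches shared memory is already reduced to size G. Consecutive
    // threads read consecutive addresses, so each warp's accesses are
    // coalesced on every iteration.
    int tid = threadIdx.x;
    int g = gridDim.x * blockDim.x;
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += g)
        sum += in[i];
    sdata[tid] = sum;
    __syncthreads();

    // Standard shared-memory tree reduction within the block.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // One partial sum per block; atomicAdd combines them into *out.
    // (A second launch over the per-block results would also work.)
    if (tid == 0)
        atomicAdd(out, sdata[0]);
}
```

A launch such as `reduceSum<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out, n)` creates a grid of `blocks * threads` threads, so the same threads are reused across the whole array instead of launching one thread per element.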