From a8a8890521cbe530b55db963dbaf9e39a2071104 Mon Sep 17 00:00:00 2001
From: Julian Arnesino
Date: Thu, 28 Nov 2024 14:50:51 -0300
Subject: [PATCH] Clarifications in the post "Parallel Programming with CUDA"

---
 ...ming-with-CUDA-Parallel-programs-optimized-for-GPU.md | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/_posts/2024-11-27-Parallel-Programming-with-CUDA-Parallel-programs-optimized-for-GPU.md b/_posts/2024-11-27-Parallel-Programming-with-CUDA-Parallel-programs-optimized-for-GPU.md
index e46cea1..546d50e 100644
--- a/_posts/2024-11-27-Parallel-Programming-with-CUDA-Parallel-programs-optimized-for-GPU.md
+++ b/_posts/2024-11-27-Parallel-Programming-with-CUDA-Parallel-programs-optimized-for-GPU.md
@@ -397,13 +397,12 @@ Say the grid size is $G$.
 
 As soon as we load our data to shared memory in the block, we can make threads add all the elements outside the first G elements of the array.
 That way, when the blocks start working, we know we have an array of size G.
 
-This comes with two benefits:
-1. It can handle arrays bigger than the grid.
-   - The array is reduced to size G when loading the elements to shared memory.
-2. It provides maximum memory coalescing.
-   - We sum the array until it is reduced to size G by looping with a grid-stride.
+This comes with three benefits:
+1. **It can handle arrays bigger than the grid.** The array is reduced to size G when loading the elements to shared memory.
+2. **It provides maximum memory coalescing.** We sum the array until it is reduced to size G by looping with a grid-stride.
    Therefore, memory access between warps is unit-stride every time.
    This means all accesses from threads in the same warp are going to be in consecutive addresses.
+3. **It amortizes the cost of creating and destroying threads** that would come from launching more threads than the grid size.
 
 The following is an implementation of that approach:
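
For reference, here is a minimal sketch of the approach the patched section describes: a grid-stride load that folds every element beyond the first G into shared memory, followed by a block-level tree reduction. The kernel name `reduceSum` and the final `atomicAdd` combine are illustrative assumptions, not necessarily how the post's own implementation (which follows the patched text in the original file) is written.

```cuda
#include <cuda_runtime.h>

// Assumes blockDim.x is a power of two and *out is zero-initialized.
__global__ void reduceSum(const float *in, float *out, int n) {
    extern __shared__ float sdata[];

    // Grid-stride load: with G = gridDim.x * blockDim.x total threads,
    // every element beyond the first G is summed here, so the data that
    // reaches shared memory is already reduced to size G. Consecutive
    // threads read consecutive addresses, so each warp's accesses are
    // coalesced on every iteration.
    int tid = threadIdx.x;
    int g = gridDim.x * blockDim.x;
    float sum = 0.0f;
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += g)
        sum += in[i];
    sdata[tid] = sum;
    __syncthreads();

    // Standard shared-memory tree reduction within the block.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // One partial sum per block; atomicAdd combines them into *out.
    // (A second launch over the per-block results would also work.)
    if (tid == 0)
        atomicAdd(out, sdata[0]);
}
```

A launch such as `reduceSum<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out, n)` creates a grid of `blocks * threads` threads, so the same threads are reused across the whole array instead of launching one thread per element.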