Clarifications in the post "Parallel Programming with CUDA"
jarnesino committed Nov 28, 2024
1 parent 090013b commit a8a8890
Showing 1 changed file with 4 additions and 5 deletions.
@@ -397,13 +397,12 @@ Say the grid size is $G$.
 As soon as we load our data to shared memory in the block, we can make threads add all the elements outside the first G elements of the array.
 That way, when the blocks start working, we know we have an array of size G.
 
-This comes with two benefits:
-1. It can handle arrays bigger than the grid.
-- The array is reduced to size G when loading the elements to shared memory.
-2. It provides maximum memory coalescing.
-- We sum the array until it is reduced to size G by looping with a grid-stride.
+This comes with three benefits:
+1. **It can handle arrays bigger than the grid.** The array is reduced to size G when loading the elements to shared memory.
+2. **It provides maximum memory coalescing.** We sum the array until it is reduced to size G by looping with a grid-stride.
 Therefore, memory access between warps is unit-stride every time.
 This means all accesses from threads in the same warp are going to be in consecutive addresses.
+3. **It amortizes the cost of the creation and destruction of threads** that would come from launching more threads than the grid size.
 
 The following is an implementation of that approach:
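The implementation itself is truncated in this diff view. As a rough sketch of the approach described above (the kernel name `reduceSum`, the output buffer `blockSums`, and a power-of-two block size are illustrative assumptions, not the post's actual code), a grid-stride reduction kernel could look like this:

```cuda
// Sketch only: illustrative names, not the post's actual implementation.
// Assumes blockDim.x is a power of two and the kernel is launched with
// blockDim.x * sizeof(float) bytes of dynamic shared memory.
__global__ void reduceSum(const float *input, float *blockSums, int n) {
    extern __shared__ float shared[];

    unsigned int tid = threadIdx.x;
    unsigned int gridSize = blockDim.x * gridDim.x;  // G: total threads in the grid

    // Grid-stride loop: while loading into shared memory, each thread also
    // accumulates every G-th element beyond the first G, so consecutive
    // threads in a warp always read consecutive addresses (coalesced) and
    // the array is effectively reduced to size G before the tree step.
    float sum = 0.0f;
    for (unsigned int i = blockIdx.x * blockDim.x + tid; i < n; i += gridSize)
        sum += input[i];
    shared[tid] = sum;
    __syncthreads();

    // Standard tree reduction over the block's shared-memory slice.
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            shared[tid] += shared[tid + s];
        __syncthreads();
    }

    // Thread 0 publishes this block's partial sum; the per-block results
    // can then be combined by a second, smaller launch or on the host.
    if (tid == 0)
        blockSums[blockIdx.x] = shared[0];
}
```

A launch would pass the dynamic shared-memory size explicitly, e.g. `reduceSum<<<numBlocks, blockSize, blockSize * sizeof(float)>>>(d_input, d_blockSums, n);`.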
