md"""
## Exercise 3 — **Advanced data transfer optimisations (part 2)**
"""
#md # 👉 See [Logistics](/logistics/#submission) for submission details.
md"""
The goal of this exercise is to:
- learn how to control registers for storing intermediate results on-chip;
- learn how to use shared memory (on-chip) to communicate between threads.
Prerequisites:
- the introduction notebook *Benchmarking memory copy and establishing peak memory access performance* ([`l6_1-gpu-memcopy.ipynb`](https://github.com/eth-vaw-glaciology/course-101-0250-00/blob/main/slide-notebooks/notebooks/l6_1-gpu-memcopy.ipynb))
- the *Data transfer optimisation notebook* ([`lecture6_ex1.ipynb`](https://github.com/eth-vaw-glaciology/course-101-0250-00/blob/main/exercise-notebooks/notebooks/lecture6_ex1.ipynb))
- the second *Data transfer optimisation notebook* (Exercise 2 [`lecture10_ex2.ipynb`](https://github.com/eth-vaw-glaciology/course-101-0250-00/blob/main/exercise-notebooks/notebooks/lecture10_ex2.ipynb))
[*This content is distributed under MIT licence. Authors: S. Omlin (CSCS), L. Räss (ETHZ).*](https://github.com/eth-vaw-glaciology/course-101-0250-00/blob/main/LICENSE.md)
"""
md"""
### Getting started
👉 Download the [`lecture10_ex3.ipynb`](https://github.com/eth-vaw-glaciology/course-101-0250-00/blob/main/exercise-notebooks/notebooks/lecture10_ex3.ipynb) notebook and edit it.
"""
md"""
We will again use the packages `CUDA`, `BenchmarkTools` and `Plots` to create a little performance laboratory:
"""
] activate .
#-
] instantiate
#-
#src #using IJulia
using CUDA
using BenchmarkTools
using Plots
md"""
In the previous notebook ([`lecture10_ex2.ipynb`](https://github.com/eth-vaw-glaciology/course-101-0250-00/blob/main/exercise-notebooks/notebooks/lecture10_ex2.ipynb)), you learned how to explicitly control part of the on-chip memory usage, using so-called "shared memory". We will now learn how to control a second kind of fast on-chip memory: registers. To this end, we will implement the `cumsum!` function on the GPU - for the sake of simplicity, we will only write it for 3-D arrays.
Here is the documentation of the function `cumsum!`:
```julia-repl
help?> CUDA.cumsum!
cumsum!(B, A; dims::Integer)
Cumulative sum of A along the dimension dims, storing the result in B. See
also cumsum.
```
And here is a small example of how `cumsum!` works for 3-D arrays:
"""
A = CUDA.ones(4,4,4)
B = CUDA.zeros(4,4,4)
cumsum!(B, A; dims=1)
cumsum!(B, A; dims=2)
cumsum!(B, A; dims=3)
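md"""
As a small sanity check of the example above: since `A` contains only ones, the cumulative sum along the first dimension gives `1, 2, 3, 4` in every column (a minimal illustration, not required for the exercise):
"""
cumsum!(B, A; dims=1)                     # recompute the dims=1 case
Array(B)[:, 1, 1] ≈ [1.0, 2.0, 3.0, 4.0]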
md"""
For benchmarking activities, we will again allocate large arrays, closely matching the number of grid points of the array size found to perform best in the introduction notebook (you can modify the values if they are not right for you):
"""
nx = ny = nz = 512
A = CUDA.rand(Float64, nx, ny, nz);
B = CUDA.zeros(Float64, nx, ny, nz);
md"""
Moreover, we also preallocate an array to store reference results obtained from `CUDA.cumsum!` for verification.
"""
B_ref = CUDA.zeros(Float64, nx, ny, nz);
md"""
Now, we are set up to get started.
If we only consider main memory access, then the `cumsum!` will ideally need to read one array (`A`) and write one array (`B`). No other main memory access should be required. Let us therefore create an ad hoc adaptation of the effective memory throughput metric for iterative PDE solvers (presented in the second notebook) to this problem. We will call the metric `T_eff_cs` and compute it as follows:
`T_eff_cs = 2*1/1e9*nx*ny*nz*sizeof(Float64)/t_it`
The upper bound of `T_eff_cs` is given by the maximum throughput achievable when copying exactly one array to another. For this specific case, you might have measured in our benchmarks in the introduction notebook a slightly lower performance (measured 550 GB/s with the Tesla P100 GPU) than what we established as `T_peak`. Thus, we will consider this measured value to be the upper bound of `T_eff_cs`.
We will now adapt the corresponding 2-D memory copy kernel in the most straightforward way to work for 3-D arrays. Then, we will modify this memory copy code in small steps until we have a well-performing `cumsum!` GPU function.
Here is the adapted memory copy kernel from the introduction notebook and the `T_tot` we achieve with it (for the first two dimensions, we use again the thread/block configuration that we found best in the introduction notebook and we use one thread in the third dimension; you can modify the values if they are not right for you):
"""
#nb # > 💡 Note: The usage of the variables `A` and `B` is reversed in comparison with the previous notebooks in order to match the documentation of `cumsum!`.
#md # \note{The usage of the variables `A` and `B` is reversed in comparison with the previous notebooks in order to match the documentation of `cumsum!`.}
function memcopy!(B, A)
ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
iy = (blockIdx().y-1) * blockDim().y + threadIdx().y
iz = (blockIdx().z-1) * blockDim().z + threadIdx().z
@inbounds B[ix,iy,iz] = A[ix,iy,iz]
return nothing
end
#-
threads = (32, 8, 1)
blocks = nx.÷threads
t_it = @belapsed begin @cuda blocks=$blocks threads=$threads memcopy!($B, $A); synchronize() end
T_tot = 2*1/1e9*nx*ny*nz*sizeof(Float64)/t_it
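md"""
Since this throughput formula recurs in every task below, one could wrap it in a small helper. This is merely a convenience sketch (the hypothetical helper name `T_throughput` and its use of the globals `nx`, `ny`, `nz` are assumptions; the tasks below keep writing the formula out explicitly):
"""
T_throughput(t_it) = 2*1/1e9*nx*ny*nz*sizeof(Float64)/t_it  # GB/s: one array read + one array written per iteration
T_throughput(t_it)                                          # reproduces the `T_tot` computed just above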
md"""
Let us first implement the cumulative sum over the third dimension (index `iz`). The first step is to read in `A` (and write `B`) in a way that will later allow us to easily do the required cumulative sums - an inherently serial operation. However, we want to try to avoid a serious degradation of the memory throughput when doing so.
"""
md"""
### Task 1 (grid-stride loops)
Modify the above `memcopy!` kernel to read in `A` and write `B` in a serial manner with respect to the third dimension (index `iz`), i.e., with parallelization only over the first two dimensions (indices `ix` and `iy`). Launch the kernel with the same thread configuration as above and adapt the block configuration for the third dimension correctly. Verify the correctness of your kernel. Then, compute `T_tot` and explain the measured performance.
"""
#nb # > 💡 Hint: You need to launch the kernel with only one thread and one block in the third dimension and, inside the kernel, do a loop over the third dimension. Use `iz` as the loop index, as it will replace the previous `iz` computed from the thread location in the grid.
#nb # >
#nb # > 💡 Note: The operator `≈` allows you to check whether two arrays contain the same values (within a tolerance). Use this to verify your memory copy kernel.
#md # \note{You need to launch the kernel with only one thread and one block in the third dimension and, inside the kernel, do a loop over the third dimension. Use `iz` as the loop index, as it will replace the previous `iz` computed from the thread location in the grid.}
#md # \note{The operator `≈` allows you to check whether two arrays contain the same values (within a tolerance). Use this to verify your memory copy kernel.}
#-
#hint=## hint
#hint=function memcopy!(B, A)
#hint= ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
#hint= iy = (blockIdx().y-1) * blockDim().y + threadIdx().y
#hint= for iz #...
#hint= #...
#hint= end
#hint= return nothing
#hint=end
#hint=
#hint=threads = (32, 8, 1)
#hint=blocks = (nx÷threads[1], ny÷threads[2], 1)
#hint=#-
#hint=## Verification
#hint=B .= 0.0;
#hint=@cuda #...
#hint=B ≈ A
#hint=#-
#hint=## Performance
#hint=t_it = @belapsed begin #...# end
#hint=T_tot = 2*1/1e9*nx*ny*nz*sizeof(Float64)/t_it
#sol=function memcopy!(B, A)
#sol= ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
#sol= iy = (blockIdx().y-1) * blockDim().y + threadIdx().y
#sol= for iz = 1:size(A,3)
#sol= @inbounds B[ix,iy,iz] = A[ix,iy,iz]
#sol= end
#sol= return nothing
#sol=end
#sol=
#sol=threads = (32, 8, 1)
#sol=blocks = (nx÷threads[1], ny÷threads[2], 1)
#sol=#-
#sol=## Verification
#sol=B .= 0.0;
#sol=@cuda blocks=blocks threads=threads memcopy!(B, A); synchronize()
#sol=B ≈ A
#sol=#-
#sol=## Performance
#sol=t_it = @belapsed begin @cuda blocks=$blocks threads=$threads memcopy!($B, $A); synchronize() end
#sol=T_tot = 2*1/1e9*nx*ny*nz*sizeof(Float64)/t_it
md"""
Great! You just implemented a so-called [grid-stride loop](https://developer.nvidia.com/blog/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/). It exhibits a good memory access pattern and it makes it easy to reuse, e.g., intermediate results from previous loop iterations. You probably observe a `T_tot` that is a bit below the one measured in the previous experiment (measured 520 GB/s with the Tesla P100 GPU). There is certainly room to improve this memory copy kernel a bit, but we will consider it sufficient in order to continue with this exercise.
"""
md"""
### Task 2 (controlling registers)
Write a kernel `cumsum_dim3!` which computes the cumulative sum over the 3rd dimension of a 3-D array. To this end, modify the memory copy kernel from Task 1. Verify the correctness of your kernel against `CUDA.cumsum!`. Then, compute `T_eff_cs` as defined above, explain the measured performance and compare it to the one obtained with the generic `CUDA.cumsum!`.
"""
#nb # > 💡 Hint: Define a variable `cumsum_iz` before the loop and initialize it to 0.0 in order to accumulate the sum.
#nb # >
#nb # > 💡 Note: The operator `≈` allows you to check whether two arrays contain the same values (within a tolerance). Use this to verify your results against `CUDA.cumsum!` (remember that for the verification, we already preallocated an array `B_ref` at the beginning, which you can use now).
#md # \note{Define a variable `cumsum_iz` before the loop and initialize it to 0.0 in order to accumulate the sum.}
#md # \note{The operator `≈` allows you to check whether two arrays contain the same values (within a tolerance). Use this to verify your results against `CUDA.cumsum!` (remember that for the verification, we already preallocated an array `B_ref` at the beginning, which you can use now).}
#-
#hint=## hint
#hint=function cumsum_dim3!(B, A)
#hint= ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
#hint= iy = (blockIdx().y-1) * blockDim().y + threadIdx().y
#hint= cumsum_iz = 0.0
#hint= for iz #...
#hint= #...
#hint= end
#hint= return nothing
#hint=end
#hint=
#hint=## Verification
#hint=@cuda #...
#hint=CUDA.cumsum!(B_ref, A; dims=3);
#hint=B ≈ B_ref
#hint=#-
#hint=## Performance
#hint=t_it = @belapsed begin @cuda #...# end
#hint=T_eff_cs = 2*1/1e9*nx*ny*nz*sizeof(Float64)/t_it
#hint=#-
#hint=t_it = @belapsed begin CUDA.cumsum!($B, $A; dims=3); synchronize() end
#hint=T_eff_cs = 2*1/1e9*nx*ny*nz*sizeof(Float64)/t_it
#sol=function cumsum_dim3!(B, A)
#sol= ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
#sol= iy = (blockIdx().y-1) * blockDim().y + threadIdx().y
#sol= cumsum_iz = 0.0
#sol= for iz = 1:size(A,3)
#sol= @inbounds cumsum_iz += A[ix,iy,iz]
#sol= @inbounds B[ix,iy,iz] = cumsum_iz
#sol= end
#sol= return nothing
#sol=end
#sol=
#sol=## Verification
#sol=@cuda blocks=blocks threads=threads cumsum_dim3!(B, A); synchronize()
#sol=CUDA.cumsum!(B_ref, A; dims=3);
#sol=B ≈ B_ref
#sol=#-
#sol=## Performance
#sol=t_it = @belapsed begin @cuda blocks=$blocks threads=$threads cumsum_dim3!($B, $A); synchronize() end
#sol=T_eff_cs = 2*1/1e9*nx*ny*nz*sizeof(Float64)/t_it
#sol=#-
#sol=t_it = @belapsed begin CUDA.cumsum!($B, $A; dims=3); synchronize() end
#sol=T_eff_cs = 2*1/1e9*nx*ny*nz*sizeof(Float64)/t_it
md"""
You should observe no significant difference between `T_eff_cs` measured here and `T_tot` computed in Task 1 (measured 520 GB/s with the Tesla P100 GPU). In fact, scalar variables defined in a kernel, like `cumsum_iz`, are stored in registers as long as there are enough available (if not, so-called [register spilling](https://developer.download.nvidia.com/CUDA/training/register_spilling.pdf) to main memory takes place). As access to registers is extremely fast, the summation added in this task certainly did not reduce the measured performance. It is, once again, the memory copy speed that completely determines the performance of the kernel, because we have successfully controlled the use of registers! If you are curious, the short introspection sketch below shows how to query the register usage of a compiled kernel.
We will now implement the cumulative sum over the 2nd dimension (index `iy`). As for the previous implementation, the first step is to read in `A` (and write `B`) in a way that will later allow us to easily do the required cumulative sums without exhibiting significant memory throughput degradation.
"""
md"""
### Task 3 (kernel loops)
Modify the `memcopy!` kernel given in the beginning to read in `A` and write `B` in a serial manner with respect to the second dimension (index `iy`), i.e., with parallelization only over the first and the last dimensions (indices `ix` and `iz`). Launch the kernel with the same number of threads as in Task 1, however, place them all in the first dimension (i.e., use one thread in the second and third dimensions); adapt the block configuration correctly. Verify the correctness of your kernel. Then, compute `T_tot` and explain the measured performance.
"""
#-
#hint=## hint
#hint=function memcopy!(B, A)
#hint= #...
#hint= return nothing
#hint=end
#hint=
#hint=threads = (256, 1, 1)
#hint=blocks = #...
#hint=#-
#hint=## Verification
#hint=B .= 0.0;
#hint=@cuda #...
#hint=B ≈ A
#hint=#-
#hint=## Performance
#hint=t_it = @belapsed begin @cuda #...# end
#hint=T_tot = 2*1/1e9*nx*ny*nz*sizeof(Float64)/t_it
#sol=function memcopy!(B, A)
#sol= ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
#sol= iz = (blockIdx().z-1) * blockDim().z + threadIdx().z
#sol= for iy = 1:size(A,2)
#sol= @inbounds B[ix,iy,iz] = A[ix,iy,iz]
#sol= end
#sol= return nothing
#sol=end
#sol=
#sol=threads = (256, 1, 1)
#sol=blocks = (nx÷threads[1], 1, nz÷threads[3])
#sol=#-
#sol=## Verification
#sol=B .= 0.0;
#sol=@cuda blocks=blocks threads=threads memcopy!(B, A); synchronize()
#sol=B ≈ A
#sol=#-
#sol=## Performance
#sol=t_it = @belapsed begin @cuda blocks=$blocks threads=$threads memcopy!($B, $A); synchronize() end
#sol=T_tot = 2*1/1e9*nx*ny*nz*sizeof(Float64)/t_it
md"""
You should observe no big difference between `T_tot` measured here and `T_tot` computed in Task 1 (measured 519 GB/s with the Tesla P100 GPU), as this kernel also accesses memory in large contiguous chunks.
"""
md"""
### Task 4 (controlling registers)
Write a kernel `cumsum_dim2!` which computes the cumulative sum over the 2nd dimension of a 3-D array. To this end, modify the memory copy kernel from Task 3. Verify the correctness of your kernel against `CUDA.cumsum!`. Then, compute `T_eff_cs` as defined above, explain the measured performance and compare it to the one obtained with the generic `CUDA.cumsum!`.
"""
#-
#hint=## hint
#hint=function cumsum_dim2!(B, A)
#hint= ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
#hint= iz = (blockIdx().z-1) * blockDim().z + threadIdx().z
#hint= cumsum_iy = 0.0
#hint= #...
#hint= return nothing
#hint=end
#hint=
#hint=## Verification
#hint=@cuda #...
#hint=CUDA.cumsum!(B_ref, A; dims=2);
#hint=B ≈ B_ref
#hint=#-
#hint=## Performance
#hint=t_it = @belapsed begin #...# end
#hint=T_eff_cs = 2*1/1e9*nx*ny*nz*sizeof(Float64)/t_it
#hint=#-
#hint=t_it = @belapsed #...
#hint=T_eff_cs = 2*1/1e9*nx*ny*nz*sizeof(Float64)/t_it
#sol=function cumsum_dim2!(B, A)
#sol= ix = (blockIdx().x-1) * blockDim().x + threadIdx().x
#sol= iz = (blockIdx().z-1) * blockDim().z + threadIdx().z
#sol= cumsum_iy = 0.0
#sol= for iy = 1:size(A,2)
#sol= @inbounds cumsum_iy += A[ix,iy,iz]
#sol= @inbounds B[ix,iy,iz] = cumsum_iy
#sol= end
#sol= return nothing
#sol=end
#sol=
#sol=## Verification
#sol=@cuda blocks=blocks threads=threads cumsum_dim2!(B, A); synchronize()
#sol=CUDA.cumsum!(B_ref, A; dims=2);
#sol=B ≈ B_ref
#sol=#-
#sol=## Performance
#sol=t_it = @belapsed begin @cuda blocks=$blocks threads=$threads cumsum_dim2!($B, $A); synchronize() end
#sol=T_eff_cs = 2*1/1e9*nx*ny*nz*sizeof(Float64)/t_it
#sol=#-
#sol=t_it = @belapsed begin CUDA.cumsum!($B, $A; dims=2); synchronize() end
#sol=T_eff_cs = 2*1/1e9*nx*ny*nz*sizeof(Float64)/t_it
md"""
Again, you should observe no significant difference between `T_eff_cs` measured here and `T_tot` computed in Task 3 (measured 519 GB/s with the Tesla P100 GPU), as you probably expected.
We will now implement the cumulative sum over the 1st dimension (index `ix`). As for the previous implementations, the first step is to read in `A` (and write `B`) in a way that will later allow us to easily do the required cumulative sums without exhibiting significant memory throughput degradation.
"""
md"""
### Task 5 (kernel loops)
Modify the `memcopy!` kernel given in the beginning to read in `A` and write `B` in a serial manner with respect to the first dimension (index `ix`), i.e., with parallelization only over the second and third dimensions (indices `iy` and `iz`). Launch the kernel with the same number of threads as in Task 1, however, place them all in the second dimension (i.e., use one thread in the first and third dimensions); adapt the block configuration correctly. Verify the correctness of your kernel. Then, think about what performance you expect, compute `T_tot` and explain the measured performance.
"""
#-
#hint=## hint
#hint=function memcopy!(B, A)
#hint= #...
#hint= return nothing
#hint=end
#hint=
#hint=threads = (1, 256, 1)
#hint=blocks = #...
#hint=#-
#hint=## Verification
#hint=#...
#hint=#-
#hint=## Performance
#hint=t_it = #...
#hint=T_tot = #...
#sol=function memcopy!(B, A)
#sol= iy = (blockIdx().y-1) * blockDim().y + threadIdx().y
#sol= iz = (blockIdx().z-1) * blockDim().z + threadIdx().z
#sol= for ix = 1:size(A,1)
#sol= @inbounds B[ix,iy,iz] = A[ix,iy,iz]
#sol= end
#sol= return nothing
#sol=end
#sol=
#sol=threads = (1, 256, 1)
#sol=blocks = (1, ny÷threads[2], nz÷threads[3])
#sol=#-
#sol=## Verification
#sol=B .= 0.0;
#sol=@cuda blocks=blocks threads=threads memcopy!(B, A); synchronize()
#sol=B ≈ A
#sol=#-
#sol=## Performance
#sol=t_it = @belapsed begin @cuda blocks=$blocks threads=$threads memcopy!($B, $A); synchronize() end
#sol=T_tot = 2*1/1e9*nx*ny*nz*sizeof(Float64)/t_it
md"""
You likely observe a `T_tot` that is an order of magnitude or more below the `T_tot` measured in Tasks 1 and 3 (measured 36 GB/s with the Tesla P100 GPU) because, in contrast to the previous kernels, this kernel accesses memory non-contiguously, with a large stride (of `nx` numbers) between each requested number.
There is obviously no point in creating a `cumsum!` kernel out of this `memcopy!` kernel: the performance could only be the same or worse (even though worse is not to be expected). Hence, we will try to improve this memory copy kernel by accessing multiple contiguous numbers from memory, which we can achieve by launching more than one thread in the first dimension and adapting the loop accordingly.
"""
md"""
### Task 6 (kernel loops)
Modify the `memcopy!` kernel from Task 5 to enable reading in 32 numbers at a time in the first dimension (index `ix`) rather than one number at a time as before. Launch the kernel with just 32 threads, all placed in the first dimension; adapt the block configuration if you need to. Verify the correctness of your kernel. Then, think about what performance you expect now, compute `T_tot` and explain the measured performance.
"""
#nb # > 💡 Hint: You could hardcode the kernel to read 32 numbers at a time, but we prefer to write it more generally, allowing the kernel to be launched with a different number of threads in the first dimension (we do not want to allow more than one block in this dimension though).
#md # \note{You could hardcode the kernel to read 32 numbers at a time, but we prefer to write it more generally, allowing the kernel to be launched with a different number of threads in the first dimension (we do not want to allow more than one block in this dimension though).}
#-
#hint=## hint
#hint=function memcopy!(B, A)
#hint= #...
#hint= return nothing
#hint=end
#hint=
#hint=threads = #...
#hint=blocks = #...
#hint=#-
#hint=## Verification
#hint=#...
#hint=#-
#hint=## Performance
#hint=t_it = #...
#hint=T_tot = #...
#sol=function memcopy!(B, A)
#sol= iy = (blockIdx().y-1) * blockDim().y + threadIdx().y
#sol= iz = (blockIdx().z-1) * blockDim().z + threadIdx().z
#sol= for ix_offset = 0 : blockDim().x : size(A,1)-1
#sol= ix = threadIdx().x + ix_offset
#sol= @inbounds B[ix,iy,iz] = A[ix,iy,iz]
#sol= end
#sol= return nothing
#sol=end
#sol=
#sol=threads = (32, 1, 1)
#sol=blocks = (1, ny÷threads[2], nz÷threads[3])
#sol=#-
#sol=## Verification
#sol=B .= 0.0;
#sol=@cuda blocks=blocks threads=threads memcopy!(B, A); synchronize()
#sol=B ≈ A
#sol=#-
#sol=## Performance
#sol=t_it = @belapsed begin @cuda blocks=$blocks threads=$threads memcopy!($B, $A); synchronize() end
#sol=T_tot = 2*1/1e9*nx*ny*nz*sizeof(Float64)/t_it
md"""
`T_tot` should now be similar to the one obtained in Tasks 1 and 3, or even a bit better (measured 534 GB/s with the Tesla P100 GPU), thanks to the additional concurrency compared to the other `memcopy!` versions. We are therefore ready to implement the cumulative sum over the 1st dimension.
"""
md"""
### Task 7 (communication via shared memory)
Write a kernel `cumsum_dim1!` which computes the cumulative sum over the 1st dimension of a 3-D array. To this end, modify the memory copy kernel from Task 6. Verify the correctness of your kernel against `CUDA.cumsum!`. Then, compute `T_eff_cs` as defined above, explain the measured performance and compare it to the one obtained with the generic `CUDA.cumsum!`.
"""
#nb # > 💡 Hint: If you read the data into shared memory, then you can compute the cumulative sum, e.g., with the first thread.
#md # \note{If you read the data into shared memory, then you can compute the cumulative sum, e.g., with the first thread.}
#-
#hint=## hint
#hint=function cumsum_dim1!(B, A)
#hint= iy = (blockIdx().y-1) * blockDim().y + threadIdx().y
#hint= iz = (blockIdx().z-1) * blockDim().z + threadIdx().z
#hint= tx = threadIdx().x
#hint= shmem = @cuDynamicSharedMem(eltype(A), blockDim().x)
#hint= cumsum_ix = 0.0
#hint= #...
#hint= return nothing
#hint=end
#hint=
#hint=## Verification
#hint=#...
#hint=#-
#hint=## Performance
#hint=t_it = #...
#hint=T_eff_cs = #...
#hint=#-
#hint=t_it = @belapsed begin CUDA.cumsum!($B, $A; dims=1); synchronize() end
#hint=T_eff_cs = 2*1/1e9*nx*ny*nz*sizeof(Float64)/t_it
#sol=function cumsum_dim1!(B, A)
#sol= iy = (blockIdx().y-1) * blockDim().y + threadIdx().y
#sol= iz = (blockIdx().z-1) * blockDim().z + threadIdx().z
#sol= tx = threadIdx().x
#sol= shmem = @cuDynamicSharedMem(eltype(A), blockDim().x)
#sol= cumsum_ix = 0.0
#sol= for ix_offset = 0 : blockDim().x : size(A,1)-1
#sol= ix = threadIdx().x + ix_offset
#sol= @inbounds shmem[tx] = A[ix,iy,iz] # Read the x-dimension chunk into shared memory.
#sol= sync_threads()
#sol= if tx == 1 # Compute the cumsum only with the first thread, accessing only shared memory
#sol= for i = 1:blockDim().x
#sol= @inbounds cumsum_ix += shmem[i]
#sol= @inbounds shmem[i] = cumsum_ix
#sol= end
#sol= end
#sol= sync_threads()
#sol= @inbounds B[ix,iy,iz] = shmem[tx] # Write the x-dimension chunk to main memory.
#sol= end
#sol= return nothing
#sol=end
#sol=
#sol=## Verification
#sol=@cuda blocks=blocks threads=threads shmem=prod(threads)*sizeof(Float64) cumsum_dim1!(B, A); synchronize()
#sol=CUDA.cumsum!(B_ref, A; dims=1);
#sol=B ≈ B_ref
#sol=#-
#sol=## Performance
#sol=t_it = @belapsed begin @cuda blocks=$blocks threads=$threads shmem=prod($threads)*sizeof(Float64) cumsum_dim1!($B, $A); synchronize() end
#sol=T_eff_cs = 2*1/1e9*nx*ny*nz*sizeof(Float64)/t_it
#sol=#-
#sol=t_it = @belapsed begin CUDA.cumsum!($B, $A; dims=1); synchronize() end
#sol=T_eff_cs = 2*1/1e9*nx*ny*nz*sizeof(Float64)/t_it
md"""
`T_eff_cs` is probably significantly lower than the one obtained for the cumulative sums over the other dimensions, but it is still quite good if we keep in mind the `T_tot` achieved with the first memory copy approach in Task 5. A good strategy for approaching an optimal implementation would certainly be to use warp-level functions (and, if needed, a more complex summation algorithm); a small sketch of a warp-level scan follows below.
"""