CPU GPU Performance Portability
When you refactor codes to run efficiently on GPUs, you often find that your performance on CPUs goes down. This is not always the case if the CPU code was not written efficiently to begin with, but if the CPU code has been tuned at all for caching and vectorization, GPU refactoring nearly always degrades its performance. There are two main reasons this happens:
- GPU-optimized code often touches DRAM more frequently than CPU-optimized code and uses cache less efficiently
- GPU-optimized code often vectorizes poorly on CPUs
The primary task when refactoring code for GPUs is exposing enough threading to keep the GPU busy. The Nvidia V100 GPU, for instance, has about 5K cores distributed among 80 vector units, called Streaming Multiprocessors (SMs). A GPU attains efficiency less by keeping data local in cache and more by hiding the latency of main GPU memory. It hides that latency by pipelining memory accesses and computations: one thread issues a memory fetch, then the SM switches to the next thread, which issues its own fetch. Once the SM has cycled through all of the threads, it returns to the first, and if that thread's memory has arrived, it executes the operation that depends on it. By switching among lightweight threads, the pipelining effectively hides the latency of fetching memory.
What this means is that you effectively need at least 512 threads per SM, or roughly 40K threads per GPU. In practice, most kernels need significantly more threads than this so that the kernel's runtime is large enough (nominally > 5 microseconds) to amortize kernel launch overhead. Exposing this much threading usually means collapsing more than one loop together. In weather and climate codes, this usually requires us to expose all of the threading in tightly nested loops, which end up looking like:
do ie=1,nelems        !Loop over elements
  do k=1,nlev         !Loop over vertical levels
    do j=1,np         !Loop over y-direction basis functions
      do i=1,np       !Loop over x-direction basis functions
        ! operation_1
      enddo
    enddo
  enddo
enddo
do ie=1,nelems        !Loop over elements
  do k=1,nlev         !Loop over vertical levels
    do j=1,np         !Loop over y-direction basis functions
      do i=1,np       !Loop over x-direction basis functions
        ! operation_2
      enddo
    enddo
  enddo
enddo
...
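The directives themselves are not shown in the listing above; as a minimal sketch (assuming OpenACC here, though OpenMP target offload with a collapse clause works the same way), the collapsed loops would typically be offloaded as:
!$acc parallel loop collapse(4)
do ie=1,nelems
  do k=1,nlev
    do j=1,np
      do i=1,np
        ! operation_1: all nelems*nlev*np*np iterations become independent GPU threads
      enddo
    enddo
  enddo
enddo
The collapse clause is what turns the four loops into the single large pool of threads the SMs need in order to hide memory latency.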
When we do this, we typically end up breaking caching. A cache-conscious CPU code keeps the loops inside each operation much smaller and runs the operations sequentially inside a larger loop, e.g.:
do ie=1,nelems        !Loop over elements
  do k=1,nlev         !Loop over vertical levels
    call operation_1(...)
    call operation_2(...)
    call operation_3(...)
  enddo
enddo

subroutine operation_[1|2|3](...)
  do j = 1 , np
    do i = 1 , np
      ...
    enddo
  enddo
end subroutine
...
The code above will cache well on the CPU if np is sufficiently small, because successive operations reuse data that has already been fetched into cache and has not yet been evicted. The GPU code, however, which must expose all of the threading for each kernel, caches poorly on the CPU because the loops are large enough to evict data from cache before execution reaches operation_2.
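To put rough numbers on that (these sizes are illustrative, not taken from the codes above): with np = 4 and 8-byte reals, one (ie,k) slice of a field is only 4 x 4 x 8 = 128 bytes, so the handful of fields that operation_1 touches are still resident in L1/L2 cache when operation_2 and operation_3 run. In the collapsed form, each operation streams all nelems x nlev x np x np values of every field it touches before the next operation starts, which is typically far more than any cache level can hold.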
We ran into the same situation in another code, where we needed to expose the number of model instances as an independent dimension in order to create more GPU threads. That code was originally:
do icrm=1,ncrms
  call operation_1()
  call operation_2()
  call operation_3()
enddo

subroutine operation_[1|2|3] (...)
  do k=1,nz
    do j=1,ny
      do i=1,nx
        ...
      enddo
    enddo
  enddo
end subroutine
And after GPU refactoring, it became:
call operation_1()
call operation_2()
call operation_3()

subroutine operation_[1|2|3] (...)
  do icrm = 1 , ncrms
    do k=1,nz
      do j=1,ny
        do i=1,nx
          ...
        enddo
      enddo
    enddo
  enddo
end subroutine
In both of these cases, caching on the CPU has been harmed, leading to performance degradation.
To resolve this issue, the fix in each case was to pass a flexible amount of looping down into the subroutines. While this cannot always overcome every caching issue, it certainly helps. For instance, in the element-based code, we can pass a flexible range of elements and vertical levels down the callstack:
do ietile=1,(nelems-1)/elemTileSize+1    !Loop over element tiles
  do ktile=1,(nlev-1)/levTileSize+1      !Loop over vertical level tiles
    ie1 = (ietile-1)*elemTileSize + 1
    ie2 = min(nelems,ietile*elemTileSize)
    k1  = (ktile-1)*levTileSize + 1
    k2  = min(nlev,ktile*levTileSize)
    call operation_1(ie1,ie2,k1,k2,...)
    call operation_2(ie1,ie2,k1,k2,...)
    call operation_3(ie1,ie2,k1,k2,...)
  enddo
enddo

subroutine operation_[1|2|3](ie1,ie2,k1,k2,...)
  do ie = ie1,ie2
    do k = k1,k2
      do j = 1 , np
        do i = 1 , np
          ...
        enddo
      enddo
    enddo
  enddo
end subroutine
...
This will solve the caching problems because we can always set levTileSize and elemTileSize each to 1 to get back to CPU caching. We might pay an overhead in looping, but it will be negligible if the workload is significant enough. For the other code, we would write the code as follows:
do icrmtile = 1 , (ncrms-1)/crmTileSize + 1
  icrm1 = (icrmtile-1)*crmTileSize + 1
  icrm2 = min(ncrms,icrmtile*crmTileSize)
  call operation_1(icrm1,icrm2,...)
  call operation_2(icrm1,icrm2,...)
  call operation_3(icrm1,icrm2,...)
enddo

subroutine operation_[1|2|3] (icrm1,icrm2,...)
  do icrm = icrm1,icrm2
    do k=1,nz
      do j=1,ny
        do i=1,nx
          ...
        enddo
      enddo
    enddo
  enddo
end subroutine
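Neither listing shows how the tile sizes get chosen; as a minimal sketch (run_on_gpu is a hypothetical switch, and none of this is prescribed by the original codes), the selection might look like:
logical :: run_on_gpu   ! hypothetical flag; how it gets set is up to the build/config system

if (run_on_gpu) then
  elemTileSize = nelems   ! one tile spanning everything: maximal threading for the GPU
  levTileSize  = nlev
  crmTileSize  = ncrms
else
  elemTileSize = 1        ! tiny tiles: each (ie,k) or icrm working set stays in cache
  levTileSize  = 1
  crmTileSize  = 1
endif
Intermediate tile sizes are also possible, but the essential point is that a single source can now be tuned for either architecture by changing only the tile sizes.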
There is also the issue, when exposing threading, of having to promote variables to include extra dimensions, essentially turning some temporary variables into globally indexed arrays, which increases overall memory requirements. However, you only pull into cache what you ask for, so the promotion should not increase the overall number of touches to DRAM; caching behavior remains the number one constraint on that.
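As a brief sketch of that promotion (the array name tmp and the sizes are made up for illustration), a temporary that used to be private to one (ie,k) iteration gains the collapsed dimensions so that every thread owns its own slice:
integer, parameter :: np = 4, nlev = 72, nelems = 1024   ! illustrative sizes only
! CPU-style: a small temporary, reused every (ie,k) iteration, that lives in cache
!   real :: tmp(np,np)
! GPU-style: promoted so each collapsed (ie,k) thread indexes its own slice
real, allocatable :: tmp(:,:,:,:)
allocate(tmp(np,np,nlev,nelems))
Each thread still pulls only its own tmp(:,:,k,ie) slice into cache, which is why the promotion increases the memory footprint but not the DRAM traffic.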
Additional considerations for CPU / GPU performance portability:
- Give and take of kernel sizes: large enough to reduce launch overheads, small enough to avoid register pressure.
- Dangers of fixing a dimension size such as the number of vertical levels.
- If you're on the GPU, you're already vectorized. The problem on the CPU is getting onto the vector unit; common obstacles are:
  - if-statements
  - non-inlined function calls
- Splitting kernels at if-statements, which brings:
  - kernel launch overheads for the small kernels created by fissioning
  - caching problems associated with kernel fissioning on the CPU
- Manual tiling and pushing the inner tile loop inside if-statements; the challenge of loop leftovers.
- Alignment in allocations (aligned_alloc).
- With no if-statements, allowing a template parameter to enable #pragma ivdep on the CPU for loop.
- Enforcing a mandatory aligned chunk datatype (HOMMEXX).