-
Notifications
You must be signed in to change notification settings - Fork 11
Matrix edges
Each workgroup is assigned a macro tile of shape macro-A
x macro-B
in C
to work on, where C
is m
x n
. We need to consider the case where m
% macro-A
is non-zero or n
% macro-B
is non-zero. One approach is to pad the memory of edge macro tiles of C
with zeros, and then perform GEMM as if the matrix were of dimensions ceil(m
/macro-A
) x ceil(n
/macro-B
). This is the approach used by
Matsumoto et al., (2012) , by Garg and Hendren, (2014), and in certain cases by Nugteren and Codreanu, (2015) in CLBlast. One drawback of this approach is the requirement for additional GPU memory and the data copying in global memory.
A second approach involves branching between the workitems (threads) in the edge workgroups. The idea here is to have workitems always check if the element C
they are processing is indeed an element of C
. This approach is not optimal however, as branching between threads hurts parallelism and can result in poor performance.
We use an approach similar to the pointer shift approach proposed in Nath et al. (2011). In our approach, the macro tile which an edge workgroup processes is redefined so that it is contained entirely within C
. The approach is illustrated in the figure below. Workgroups 2,5 and 8 are naively assigned macro tiles which spill over the bottom of C
. They are instead assigned macro tiles slightly higher up. Workgroups 9 and 10 have their assigned macro tiles shifted left, while 11 is shifted both up and left.
Note that after the shift, the macro tiles of 2,3,8,9,10 and 11 overlap with 1,4,6 and 7. Therefore, when 2,3,8,9,10 and 11 write their final results to C
at the end, they must only write if not already written by 1,4,6 or 7. This branching is less costly than the branching in computing (the second approach above), as it is only done once at the end of the threads execution.
The remaining edge case to consider is when k
% UNR
is non-zero. We handle this case by branching between threads on the load tile. Instead of reading an out of bounds element of A
or B
into local memory, a thread places a zero in local memory. This branching seems unavoidable.