GEMM API
(TODO : link to updated documentation)
The GEMM API is in the file gemm.hpp. It provides two functions,
template <typename T>
GemmStatus xgemm(...)
and
template <typename T>
GemmStatus gemm0(...)
The parameters are all described in gemm.hpp. gemm0 is very similar to clBLAS GEMM, except that it takes only one command queue. xgemm has four parameters that are non-standard in GEMM APIs, which we discuss further here: three pertain to workspace and one to kernel caching.
The first three are cl_mem w, size_t w_offset and size_t w_size, which describe the workspace memory. If w_size = 0, GEMM is performed in place. xgemm never allocates GPU memory. If w_size > 0, then xgemm might use the memory buffer w. w must be at least (w_offset + w_size)*sizeof(T) bytes in size. xgemm will not use the first w_offset*sizeof(T) bytes. Currently only a few problems are faster when workspace is provided, and never by more than 10%. These are generally cases where the leading dimension of A or B is a large multiple of 2.
If no workspace memory is provided, pass w = nullptr, w_offset = 0 and w_size = 0.
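The size rules above can be written out as a short sketch. The helper names min_workspace_bytes and untouched_prefix_bytes are hypothetical and are not part of gemm.hpp; they only restate the arithmetic from the text, assuming w_offset and w_size are counted in elements of type T.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not in gemm.hpp): the minimum size, in bytes, that the
// buffer w must have when w_offset and w_size are given in elements of type T.
template <typename T>
std::size_t min_workspace_bytes(std::size_t w_offset, std::size_t w_size)
{
  return (w_offset + w_size) * sizeof(T);
}

// Hypothetical helper (not in gemm.hpp): the number of bytes at the start of w
// that xgemm leaves untouched.
template <typename T>
std::size_t untouched_prefix_bytes(std::size_t w_offset)
{
  return w_offset * sizeof(T);
}
```

For example, with T = float (4 bytes on common platforms), w_offset = 16 and w_size = 1000, the buffer must hold at least 1016 elements, and the first 16 of them are never written by xgemm. With no workspace, both quantities are zero.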
The fourth new GEMM API parameter is int ID. This can safely be passed as any negative number (for example ID = -1). For small GEMM geometries (like m = n = k = 128), however, a small acceleration (~15%) can be obtained by passing a certain positive ID. Read on for more information.
Consider the GEMM geometry lda = ldb = ldc = m = n = k = 1000, isColMajor = tA = tB = false. The first time xgemm is called on this (or any) particular GEMM problem, xgemm must:
1. determine good kernel parameters by searching for the nearest neighbor (nearest GEMM problem) in its private problem cache.
2. generate OpenCL kernel string(s) from the kernel parameters and GEMM geometry.
3. build/compile the OpenCL kernel string(s) to a binary.
4. run the binary (GEMM is run on the device).
(1), (2) and especially (3) all take time, so performance on the first call to xgemm will be poor. Subsequent calls for the same GEMM geometry skip (1), (2) and (3), so only the first call is very slow.
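The build-once behaviour described above can be illustrated with a toy cache. This is a sketch under stated assumptions, not the library's implementation: Geometry and KernelCache are hypothetical names, the "binary" is just a string, and a counter stands in for the slow search, generate and compile steps.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <tuple>

// Hypothetical stand-in for a GEMM geometry; the real parameters are in gemm.hpp.
struct Geometry
{
  bool isColMajor, tA, tB;
  unsigned lda, ldb, ldc, m, n, k;

  // Lexicographic ordering so that Geometry can be used as a std::map key.
  bool operator<(const Geometry& rhs) const
  {
    return std::tie(isColMajor, tA, tB, lda, ldb, ldc, m, n, k) <
           std::tie(rhs.isColMajor, rhs.tA, rhs.tB, rhs.lda, rhs.ldb, rhs.ldc,
                    rhs.m, rhs.n, rhs.k);
  }
};

// Toy cache: builds a "binary" the first time a geometry is seen and reuses
// it on every later request for the same geometry.
class KernelCache
{
  public:
  const std::string& get(const Geometry& g)
  {
    auto found = binaries.find(g);
    if (found == binaries.end())
    {
      ++n_builds;  // stands in for the slow steps: search, generate, compile
      found = binaries.emplace(g, "binary for m=" + std::to_string(g.m)).first;
    }
    return found->second;  // the fast path goes straight to the cached binary
  }

  int n_builds = 0;  // how many times the slow path was taken

  private:
  std::map<Geometry, std::string> binaries;
};
```

Requesting the same Geometry twice leaves n_builds at 1, while a request for a different geometry increments it again, which mirrors why only the first call per geometry is slow.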
The first call to xgemm with a new geometry must have a negative ID. xgemm returns a GemmStatus object which has a field ID. This ID identifies where the binary for this (GEMM geometry, device) pair is cached. Subsequent calls to xgemm for the now-cached GEMM problem can pass this ID. This accelerates xgemm, as xgemm does not need to determine the ID itself from the GEMM geometry parameters.
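The calling pattern for ID might look like the following mock. MockGemm, MockGeometry and MockStatus are hypothetical stand-ins, since the real xgemm parameter list is given only in gemm.hpp; the numbering of IDs from 1 is also an assumption made for this sketch. The mock does no GEMM at all and only models the lookup that a valid ID lets the callee skip.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for GemmStatus: only the ID field is modelled.
struct MockStatus
{
  int ID;
};

// Hypothetical, reduced geometry used as the cache key in this mock.
struct MockGeometry
{
  unsigned m, n, k;
  bool operator==(const MockGeometry& rhs) const
  {
    return m == rhs.m && n == rhs.n && k == rhs.k;
  }
};

class MockGemm
{
  public:
  // ID < 0 : find (or create) the cache entry from the geometry.
  // ID > 0 : trust the caller and skip the geometry lookup.
  MockStatus run(const MockGeometry& g, int ID)
  {
    if (ID < 0)
    {
      ++n_lookups;
      ID = find_or_add(g);
    }
    // In this mock, a valid ID is a 1-based index into the cache.
    assert(ID >= 1 && static_cast<std::size_t>(ID) <= cached.size());
    return MockStatus{ID};
  }

  int n_lookups = 0;  // how many times the geometry lookup was needed

  private:
  int find_or_add(const MockGeometry& g)
  {
    for (std::size_t i = 0; i < cached.size(); ++i)
    {
      if (cached[i] == g)
      {
        return static_cast<int>(i) + 1;
      }
    }
    cached.push_back(g);
    return static_cast<int>(cached.size());
  }

  std::vector<MockGeometry> cached;
};
```

A caller would pass ID = -1 on the first call for a geometry, store the ID from the returned status, and pass that stored ID on later calls for the same geometry.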
An example of xgemm being used is in examples/apiexample1.cpp. Another is apibench/gemmbench.cpp.