This repository has been archived by the owner on Feb 27, 2024. It is now read-only.

GEMM API

James Newling edited this page Sep 12, 2017 · 8 revisions

(TODO : link to updated documentation)

The GEMM API is declared in the file gemm.hpp. It provides two functions:

template <typename T>
GemmStatus xgemm(...)

and

template <typename T>
GemmStatus gemm0(...)

The parameters are all described in gemm.hpp. gemm0 is very similar to clBLAS GEMM, except that it takes only 1 command queue. xgemm has 4 parameters that are non-standard in GEMM APIs, which we discuss further here: 3 pertain to workspace and 1 to kernel caching.

Workspace

The first three are cl_mem w, size_t w_offset and size_t w_size, which describe the workspace memory. xgemm never allocates GPU memory itself. If w_size = 0, GEMM is performed in place, using no workspace. If w_size > 0, xgemm may use the memory buffer w, which must be at least (w_offset + w_size)*sizeof(T) bytes in size. xgemm will not use the first w_offset*sizeof(T) bytes of w. Currently only a few problems are faster when workspace is provided, and never by more than 10%. These are generally cases where the leading dimension of A or B is a large multiple of 2.

If no workspace memory is provided, pass w = nullptr, w_offset = 0 and w_size = 0.
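The size rules above can be checked on the host before calling xgemm. The following is a minimal sketch, assuming sizes are counted in elements of T as the formulas above suggest. WorkspaceSpec and the three helper functions are hypothetical names introduced for illustration; they are not part of gemm.hpp.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type (not part of gemm.hpp): the two size-related
// workspace arguments, both counted in elements of T.
struct WorkspaceSpec
{
  std::size_t w_offset;  // elements at the start of w that xgemm will not use
  std::size_t w_size;    // elements after w_offset that xgemm may use
};

// Minimum size in bytes that buffer w must have:
// (w_offset + w_size)*sizeof(T).
template <typename T>
std::size_t min_workspace_bytes(const WorkspaceSpec& ws)
{
  return (ws.w_offset + ws.w_size) * sizeof(T);
}

// Number of leading bytes of w that are left untouched:
// w_offset*sizeof(T).
template <typename T>
std::size_t untouched_prefix_bytes(const WorkspaceSpec& ws)
{
  return ws.w_offset * sizeof(T);
}

// True if a buffer of buffer_bytes bytes is large enough for ws.
template <typename T>
bool workspace_fits(std::size_t buffer_bytes, const WorkspaceSpec& ws)
{
  return buffer_bytes >= min_workspace_bytes<T>(ws);
}
```

For example, with w_offset = 16 and w_size = 1000 and T = float, the buffer must hold at least 1016*sizeof(float) bytes, of which the first 16*sizeof(float) are never touched. The no-workspace case (w_offset = 0, w_size = 0) requires 0 bytes, consistent with passing w = nullptr.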

ID

The fourth non-standard parameter is int ID. It can always safely be passed as any negative number (for example ID = -1). For small GEMM geometries (such as m = n = k = 128), however, a small speed-up (~15%) can be obtained by passing a particular positive ID, as explained in the rest of this section.

Consider the GEMM geometry lda = ldb = ldc = m = n = k = 1000, isColMajor = tA = tB = false. The first time xgemm is called on this (or any) particular GEMM problem, it must:

  1. determine good kernel parameters by searching for the nearest neighbor (nearest GEMM problem) in its private problem cache.
  2. generate OpenCL kernel string(s) from the kernel parameters and GEMM geometry.
  3. build/compile the OpenCL kernel string(s) to a binary.
  4. run the binary (GEMM is run on the device)

(1), (2) and especially (3) all take time, so performance on the first call to xgemm will be poor. Subsequent calls for the same GEMM geometry skip (1), (2) and (3), so it is only the first call that is very slow.

The first call to xgemm with a new geometry must pass a negative ID. xgemm returns a GemmStatus object which has a field ID. This ID identifies where the binary for this combination of GEMM geometry and device is cached. Subsequent calls to xgemm for the now-cached GEMM problem can pass this ID. This accelerates xgemm, as it no longer needs to determine the ID itself from the GEMM geometry parameters.
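The calling pattern described above can be illustrated with a small host-only mock. This is a sketch under stated assumptions, not the library's implementation: Geometry, MockStatus and MockGemmCache are hypothetical stand-ins for the real types in gemm.hpp, no OpenCL is involved, and the choice that IDs start at 1 is made only to match the "positive ID" wording above.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <tuple>
#include <vector>

// Hypothetical stand-in for the GEMM geometry (not the type in gemm.hpp).
struct Geometry
{
  bool isColMajor, tA, tB;
  std::size_t lda, ldb, ldc, m, n, k;
};

// Strict weak ordering so that Geometry can be used as a std::map key.
inline bool operator<(const Geometry& a, const Geometry& b)
{
  return std::tie(a.isColMajor, a.tA, a.tB, a.lda, a.ldb, a.ldc, a.m, a.n, a.k) <
         std::tie(b.isColMajor, b.tA, b.tB, b.lda, b.ldb, b.ldc, b.m, b.n, b.k);
}

// Hypothetical stand-in for GemmStatus: only the ID field comes from the text.
struct MockStatus
{
  bool built;  // true if the mock "search, generate, build" steps ran
  int ID;      // identifies the cache slot for this geometry
};

// Mock cache: one slot per geometry, holding a run counter instead of a binary.
class MockGemmCache
{
public:
  MockStatus run(const Geometry& gg, int ID)
  {
    // Fast path: a valid positive ID points straight at the cache slot,
    // so no lookup by geometry is needed. (Assumption: IDs start at 1.)
    if (ID > 0 && static_cast<std::size_t>(ID) <= slots.size())
    {
      ++slots[static_cast<std::size_t>(ID) - 1];
      return MockStatus{false, ID};
    }

    // No usable ID: determine the slot from the geometry parameters.
    auto found = id_of.find(gg);
    if (found != id_of.end())
    {
      ++slots[static_cast<std::size_t>(found->second) - 1];
      return MockStatus{false, found->second};
    }

    // First call for this geometry: stands in for steps (1), (2) and (3).
    slots.push_back(1);
    int new_id = static_cast<int>(slots.size());
    id_of.emplace(gg, new_id);
    return MockStatus{true, new_id};
  }

  // Number of times the kernel in slot ID has been run.
  int run_count(int ID) const { return slots.at(static_cast<std::size_t>(ID) - 1); }

private:
  std::map<Geometry, int> id_of;  // geometry -> ID
  std::vector<int> slots;         // slot i holds the run count for ID i + 1
};
```

A caller would pass ID = -1 on the first call for a geometry, store the ID from the returned status, and pass that stored ID on later calls for the same geometry. Passing a negative ID again still works in the mock, at the cost of the lookup by geometry.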

An example of xgemm being used is in examples/apiexample1.cpp. Another is apibench/gemmbench.cpp.
