Our kernels are based on the x64 template library BESTLA.
Limited by the graph framework, we only add kernels that accept float tensors as both input and output.
input dtype | output dtype | compute type | compute ISA |
---|---|---|---|
float32 | float32 | float32 | AVX2 |
float32 | float32 | float32 | AVX512F |
float32<sup>1</sup> | float32<sup>2</sup> | int8 | AVX512_VNNI |
float32<sup>1</sup> | float32<sup>2</sup> | int8 | AVX_VNNI |
float32<sup>1</sup> | float32<sup>2</sup> | int8 | AMX_INT8 |
float32/bf16 | float32/bf16 | bf16 | AMX_BF16 |
float32/fp16 | float32/fp16 | fp16 | AVX512_FP16 |
<sup>1</sup>: per-batch and per-K group-wise dynamic quantization of the input tensor, where the per-K group size follows the quantization group size of the weight tensor; both symmetric and asymmetric quantization are supported.

<sup>2</sup>: per-batch dynamic dequantization of the output tensor.
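The per-K group-wise dynamic quantization described above can be sketched in a few lines of numpy. This is an illustration of the scheme, not the library's actual kernel code; the function names and the choice of a symmetric int8 mapping are assumptions for the example.

```python
import numpy as np

def quantize_per_group_sym(x: np.ndarray, group_size: int = 128):
    """Symmetric per-batch, per-K group-wise dynamic int8 quantization (sketch).

    x has shape (batch, K); each row is split into groups of `group_size`
    along K, and every group gets its own dynamically computed scale.
    """
    batch, k = x.shape
    assert k % group_size == 0, "K must be a multiple of the group size"
    groups = x.reshape(batch, k // group_size, group_size)
    # one scale per (row, group): map the group's max magnitude to the int8 range
    scales = np.abs(groups).max(axis=-1, keepdims=True) / 127.0
    scales = np.where(scales == 0, 1.0, scales)  # guard against all-zero groups
    q = np.clip(np.round(groups / scales), -128, 127).astype(np.int8)
    return q, scales

def dequantize_per_group(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Per-batch dynamic dequantization back to float32."""
    return (q.astype(np.float32) * scales).reshape(q.shape[0], -1)

x = np.random.randn(2, 256).astype(np.float32)
q, s = quantize_per_group_sym(x, group_size=128)
x_hat = dequantize_per_group(q, s)
print(np.abs(x - x_hat).max())  # round-trip error is bounded by half a scale step
```

With group_size=128 every 128 consecutive K elements share one scale; group size -1 (per-channel) corresponds to a single group spanning the whole K dimension.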
dtype | algo | group size |
---|---|---|
int4 | symmetric int8 truncated quant<sup>2</sup> | multiplier of 8, -1<sup>1</sup> |
int4 | symmetric int4 full range<sup>3</sup> | multiplier of 8, -1<sup>1</sup> |
int4 | asymmetric int4 full range<sup>3</sup> | multiplier of 8, -1<sup>1</sup> |
int8 | symmetric | multiplier of 8, -1<sup>1</sup> |
fp4 | | multiplier of 8 |
nf4 | | multiplier of 8 |
<sup>1</sup>: group size = -1 means per-channel quantization on the output channel (i.e., the group size equals the input channel size).

<sup>2</sup>: truncated quant keeps only the high 4 bits of the int8 quantization result for model saving and computation.

<sup>3</sup>: full range is a quantization method that also utilizes the -8 value of the int4 range, compared with the normal int4 range [-7, 7].
NOTE: AMX_INT8 requires the group size to be aligned to 128 for best hardware efficiency.
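The two int4 algorithms above differ only in how the 4-bit codes are produced. A minimal numpy sketch under the definitions in the footnotes (function names are illustrative, not the library's API):

```python
import numpy as np

def quant_int4_truncated(x: np.ndarray):
    """Symmetric int8 quantization, then keep only the high 4 bits (sketch)."""
    scale = np.abs(x).max() / 127.0
    q8 = np.round(x / scale).astype(np.int8)   # values in [-127, 127]
    q4 = q8 >> 4                               # arithmetic shift: codes in [-8, 7]
    return q4, scale * 16                      # dequant scale absorbs the shift

def quant_int4_fullrange(x: np.ndarray):
    """Symmetric int4 that also uses the -8 code of the [-8, 7] range (sketch)."""
    scale = np.abs(x).max() / 8.0              # max magnitude maps onto 8
    q4 = np.clip(np.round(x / scale), -8, 7).astype(np.int8)
    return q4, scale

rng = np.random.default_rng(0)
x = rng.standard_normal(256).astype(np.float32)
q_t, s_t = quant_int4_truncated(x)
q_f, s_f = quant_int4_fullrange(x)
err_trunc = np.abs(x - q_t.astype(np.float32) * s_t).max()
err_full = np.abs(x - q_f.astype(np.float32) * s_f).max()
```

Both schemes store 4-bit codes in [-8, 7]; the truncated variant reuses the int8 quantization pipeline, while full range computes the scale directly against the int4 range.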
We support three kinds of kernel fusion for transformer models: QKV, MHA (multi-head attention), and FFN (feed-forward network) fusion.
fusion type | models | runtime ISA |
---|---|---|
QKV | GPT-J, LLaMA | AMX_INT8, AVX512_VNNI, AVX_VNNI |
FFN | GPT-J, LLaMA, BLOOM, ChatGLM, Falcon, MPT | AMX_INT8, AVX512_VNNI, AVX512F, AMX_BF16, AVX_VNNI, AVX2 |
MHA | see the fused-attention doc for details | |
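The idea behind QKV fusion can be shown with plain numpy: instead of three separate GEMMs for the query, key, and value projections, concatenate the three weight matrices along the output dimension and run one larger GEMM. The shapes below are illustrative, not the kernels' actual implementation:

```python
import numpy as np

seq, hidden = 8, 64  # hypothetical sequence length and hidden size
rng = np.random.default_rng(0)
x = rng.standard_normal((seq, hidden)).astype(np.float32)
wq, wk, wv = (rng.standard_normal((hidden, hidden)).astype(np.float32)
              for _ in range(3))

# unfused: three separate projection GEMMs
q, k, v = x @ wq, x @ wk, x @ wv

# fused: one GEMM against the concatenated weight, then split the output
w_qkv = np.concatenate([wq, wk, wv], axis=1)   # (hidden, 3*hidden)
q2, k2, v2 = np.split(x @ w_qkv, 3, axis=1)

assert np.allclose(q, q2) and np.allclose(k, k2) and np.allclose(v, v2)
```

A single large GEMM amortizes the cost of reloading the activation tile across the three projections, which is why the fusion pays off on the ISAs listed above.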
codename | weight config | runtime ISA |
---|---|---|
Sapphire Rapids | any int4, group size=-1, compute type=int8 | AMX_INT8 |
Ice Lake, Cascade Lake, Cooper Lake, Tiger Lake, Rocket Lake | any int4, group size=-1, compute type=int8 | AVX512_VNNI |
Skylake | any 4bits, group size=-1, compute type=fp32 | AVX512F |
Alder Lake (12th Gen), Raptor Lake (13th and 14th Gen) | any 4bits, group size=-1, compute type=int8 | AVX_VNNI |
Older architectures (before 12th Gen) | any 4bits, group size=-1, compute type=fp32 | AVX2 |
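Selecting the runtime ISA from the table above amounts to checking CPU feature flags in priority order. A minimal sketch, assuming the Linux `/proc/cpuinfo` flag names (`amx_int8`, `avx512_vnni`, `avx_vnni`, `avx512f`, `avx2`); the function names and the exact priority order are assumptions for illustration:

```python
def detect_best_isa(flags: set) -> str:
    """Pick the fastest supported ISA from a set of CPU feature flags (sketch)."""
    priority = [
        ("amx_int8", "AMX_INT8"),       # Sapphire Rapids
        ("avx512_vnni", "AVX512_VNNI"), # Ice Lake .. Rocket Lake
        ("avx_vnni", "AVX_VNNI"),       # Alder Lake / Raptor Lake
        ("avx512f", "AVX512F"),         # Skylake
        ("avx2", "AVX2"),               # older architectures
    ]
    for flag, isa in priority:
        if flag in flags:
            return isa
    return "reference"  # no vector ISA detected

def cpu_flags(path: str = "/proc/cpuinfo") -> set:
    """Parse the feature-flag line of /proc/cpuinfo into a set (Linux only)."""
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

print(detect_best_isa({"avx2", "avx512f", "avx512_vnni"}))  # AVX512_VNNI
```

On an Ice Lake machine, `detect_best_isa(cpu_flags())` would land on AVX512_VNNI, matching the table's recommended configuration.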