Add AVX512 implementation of GEMM - Q4_Kx8 #12829

Merged: 2 commits merged into ggml-org:master on Apr 15, 2025

Conversation

Srihari-mcw (Collaborator) commented:

GCC, Linux:

Q4_K_M Model:

| model | size | params | backend | threads | test | t/s | speedup | commit id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_K_M | 3.80 GiB | 6.74 B | CPU | 6 | pp 512 | 66.83 ± 0.10 | - | 3f9da22 (base commit) |
| llama 7B Q4_K_M | 3.80 GiB | 6.74 B | CPU | 6 | pp 512 | 84.00 ± 0.15 | 25.70% | 559b050 (updated commit) |
| llama 7B Q4_K_M | 3.80 GiB | 6.74 B | CPU | 6 | tg 128 | 14.02 ± 0.00 | - | 3f9da22 (base commit) |
| llama 7B Q4_K_M | 3.80 GiB | 6.74 B | CPU | 6 | tg 128 | 14.01 ± 0.00 | -0.07% | 559b050 (updated commit) |

Q4_K_S Model:

| model | size | params | backend | threads | test | t/s | speedup | commit id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_K_S | 3.59 GiB | 6.74 B | CPU | 6 | pp 512 | 73.66 ± 0.23 | - | 3f9da22 (base commit) |
| llama 7B Q4_K_S | 3.59 GiB | 6.74 B | CPU | 6 | pp 512 | 98.35 ± 0.27 | 33.52% | 559b050 (updated commit) |
| llama 7B Q4_K_S | 3.59 GiB | 6.74 B | CPU | 6 | tg 128 | 14.78 ± 0.00 | - | 3f9da22 (base commit) |
| llama 7B Q4_K_S | 3.59 GiB | 6.74 B | CPU | 6 | tg 128 | 14.77 ± 0.00 | -0.10% | 559b050 (updated commit) |

GCC Version = 12.3

The models were quantized from the Meta Llama 2 7B model and tested: https://huggingface.co/meta-llama/Llama-2-7b
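
For reference, a sketch of how such numbers are typically gathered with the stock llama.cpp tools. The exact commands used for this PR are not stated above, so the model paths and invocation below are assumptions, not the commands actually run:

```sh
# Assumed commands, not taken from the PR: quantize the F16 GGUF to Q4_K_M,
# then benchmark prompt processing (pp 512) and token generation (tg 128)
# with 6 threads, matching the table rows above.
./build/bin/llama-quantize ./models/llama-2-7b/ggml-model-f16.gguf \
    ./models/llama-2-7b/ggml-model-Q4_K_M.gguf Q4_K_M
./build/bin/llama-bench -m ./models/llama-2-7b/ggml-model-Q4_K_M.gguf -p 512 -n 128 -t 6
```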

The PR was tested on an AMD Granite Ridge 9600X, which supports the following flags by default:

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

Further, the perplexity was measured with the Q4_K_S model and found to be unchanged across the two commits. The results are tabulated below:

| model | perplexity (final estimate PPL) | commit id |
| --- | --- | --- |
| llama 7B Q4_K_S | 5.8887 ± 0.03281 | 3f9da22 (base commit) |
| llama 7B Q4_K_S | 5.8887 ± 0.03281 | 559b050 (updated commit) |
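
For context, perplexity figures such as these are usually produced with the llama-perplexity tool. The command below is an assumed example; the text corpus and paths are not specified in the PR:

```sh
# Assumed command, not taken from the PR: compute perplexity for the Q4_K_S
# model over a reference text file (e.g. the wikitext-2 raw test set).
./build/bin/llama-perplexity -m ./models/llama-2-7b/ggml-model-Q4_K_S.gguf \
    -f wikitext-2-raw/wiki.test.raw -t 6
```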

On Apr 8, 2025, the github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning).
Nor7th commented on Apr 9, 2025:

Hi, may I ask how to test the correctness of such a kernel optimization? That is, is there a way to verify that the GEMM/GEMV results produced by the SIMD implementation are correct?

Srihari-mcw (Collaborator, Author) replied:

@Nor7th, we initially compare the individual GEMM and GEMV output values against the mul_mat outputs produced before the changes were introduced. Once there is sufficient evidence that the values do not diverge meaningfully, we take the perplexity measurements documented above.

Further, if no precision differences are observed at all, the llama-cli output tends to be identical. For example, with the AVX2 and AVX512 outputs of the function targeted by this optimization, the output remains the same across different prompts and seeds. Thanks.
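
For illustration, a minimal sketch of the element-wise comparison described above. The output buffers are hypothetical placeholders for running the same mul_mat through the reference path and through the new AVX512 GEMM; nothing here is ggml API:

```cpp
// Minimal sketch (hypothetical setup, not ggml API): run the same matrix
// multiplication through the reference path and the optimized AVX512 path,
// then report the largest element-wise divergence between the two outputs.
#include <cmath>
#include <cstdio>
#include <vector>

// Largest absolute difference between two equally sized output buffers.
static float max_abs_diff(const std::vector<float> & ref, const std::vector<float> & opt) {
    float max_diff = 0.0f;
    for (size_t i = 0; i < ref.size() && i < opt.size(); ++i) {
        max_diff = std::fmax(max_diff, std::fabs(ref[i] - opt[i]));
    }
    return max_diff;
}

int main() {
    // Placeholders: fill these with the outputs of the unmodified mul_mat and
    // of the new Q4_Kx8 AVX512 GEMM for the same inputs.
    std::vector<float> reference_output; // e.g. from the pre-change code path
    std::vector<float> optimized_output; // e.g. from the AVX512 kernel
    printf("max abs diff: %g\n", max_abs_diff(reference_output, optimized_output));
    return 0;
}
```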

Srihari-mcw changed the title from "Add AVX512 implementation of GEMM - q4kx8" to "Add AVX512 implementation of GEMM - Q4_Kx8" on Apr 11, 2025.
ggerganov merged commit eccc7a1 into ggml-org:master on Apr 15, 2025.
47 of 51 checks passed