Add AVX512 implementation of GEMM - Q4_Kx8 #12829

Merged: 2 commits merged into ggml-org:master on Apr 15, 2025

Conversation

Srihari-mcw (Collaborator) commented:

GCC, Linux:

Q4_K_M Model:

| model | size | params | backend | threads | test | t/s | speedup | commit id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_K_M | 3.80 GiB | 6.74 B | CPU | 6 | pp 512 | 66.83 ± 0.10 | - | 3f9da22 (base commit) |
| llama 7B Q4_K_M | 3.80 GiB | 6.74 B | CPU | 6 | pp 512 | 84.00 ± 0.15 | 25.70% | 559b050 (updated commit) |
| llama 7B Q4_K_M | 3.80 GiB | 6.74 B | CPU | 6 | tg 128 | 14.02 ± 0.00 | - | 3f9da22 (base commit) |
| llama 7B Q4_K_M | 3.80 GiB | 6.74 B | CPU | 6 | tg 128 | 14.01 ± 0.00 | -0.07% | 559b050 (updated commit) |

Q4_K_S Model:

| model | size | params | backend | threads | test | t/s | speedup | commit id |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B Q4_K_S | 3.59 GiB | 6.74 B | CPU | 6 | pp 512 | 73.66 ± 0.23 | - | 3f9da22 (base commit) |
| llama 7B Q4_K_S | 3.59 GiB | 6.74 B | CPU | 6 | pp 512 | 98.35 ± 0.27 | 33.52% | 559b050 (updated commit) |
| llama 7B Q4_K_S | 3.59 GiB | 6.74 B | CPU | 6 | tg 128 | 14.78 ± 0.00 | - | 3f9da22 (base commit) |
| llama 7B Q4_K_S | 3.59 GiB | 6.74 B | CPU | 6 | tg 128 | 14.77 ± 0.00 | -0.10% | 559b050 (updated commit) |

GCC Version = 12.3

The models were quantized from the Meta Llama 2 7B model and tested: https://huggingface.co/meta-llama/Llama-2-7b
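
For reference, a sketch of how such numbers are typically gathered with the stock llama.cpp tools. The exact commands used for this PR are not stated above, so the model paths and invocation below are assumptions, not the commands actually run:

```sh
# Assumed commands, not taken from the PR: quantize the F16 GGUF to Q4_K_M,
# then benchmark prompt processing (pp 512) and token generation (tg 128)
# with 6 threads, matching the table rows above.
./build/bin/llama-quantize ./models/llama-2-7b/ggml-model-f16.gguf \
    ./models/llama-2-7b/ggml-model-Q4_K_M.gguf Q4_K_M
./build/bin/llama-bench -m ./models/llama-2-7b/ggml-model-Q4_K_M.gguf -p 512 -n 128 -t 6
```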

The PR was tested on an AMD Granite Ridge 9600X, which supports the following flags by default:

system_info: n_threads = 6 (n_threads_batch = 6) / 12 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

Further, the perplexity was measured with the Q4_K_S model and found to be unchanged across the two commits. The results are tabulated below:

| model | perplexity (final estimate PPL) | commit id |
| --- | --- | --- |
| llama 7B Q4_K_S | 5.8887 ± 0.03281 | 3f9da22 (base commit) |
| llama 7B Q4_K_S | 5.8887 ± 0.03281 | 559b050 (updated commit) |
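
For context, perplexity figures such as these are usually produced with the llama-perplexity tool. The command below is an assumed example; the text corpus and paths are not specified in the PR:

```sh
# Assumed command, not taken from the PR: compute perplexity for the Q4_K_S
# model over a reference text file (e.g. the wikitext-2 raw test set).
./build/bin/llama-perplexity -m ./models/llama-2-7b/ggml-model-Q4_K_S.gguf \
    -f wikitext-2-raw/wiki.test.raw -t 6
```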

On Apr 8, 2025, the github-actions bot added the ggml label (changes relating to the ggml tensor library for machine learning).
Nor7th commented on Apr 9, 2025:

Hi, may I ask how to test the correctness of such a kernel optimization? That is, is there a way to verify that the GEMM/GEMV results produced by the SIMD implementation are correct?

Srihari-mcw (Collaborator, Author) replied:

@Nor7th, we initially compare the individual GEMM and GEMV output values against the mul_mat outputs produced before the changes were introduced. Once there is sufficient evidence that the values do not diverge meaningfully, we take the perplexity measurements documented above.

Further, if no precision differences are observed at all, the llama-cli output tends to be identical. For example, with the AVX2 and AVX512 outputs of the function targeted by this optimization, the output remains the same across different prompts and seeds. Thanks.
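
For illustration, a minimal sketch of the element-wise comparison described above. The output buffers are hypothetical placeholders for running the same mul_mat through the reference path and through the new AVX512 GEMM; nothing here is ggml API:

```cpp
// Minimal sketch (hypothetical setup, not ggml API): run the same matrix
// multiplication through the reference path and the optimized AVX512 path,
// then report the largest element-wise divergence between the two outputs.
#include <cmath>
#include <cstdio>
#include <vector>

// Largest absolute difference between two equally sized output buffers.
static float max_abs_diff(const std::vector<float> & ref, const std::vector<float> & opt) {
    float max_diff = 0.0f;
    for (size_t i = 0; i < ref.size() && i < opt.size(); ++i) {
        max_diff = std::fmax(max_diff, std::fabs(ref[i] - opt[i]));
    }
    return max_diff;
}

int main() {
    // Placeholders: fill these with the outputs of the unmodified mul_mat and
    // of the new Q4_Kx8 AVX512 GEMM for the same inputs.
    std::vector<float> reference_output; // e.g. from the pre-change code path
    std::vector<float> optimized_output; // e.g. from the AVX512 kernel
    printf("max abs diff: %g\n", max_abs_diff(reference_output, optimized_output));
    return 0;
}
```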

Srihari-mcw changed the title from "Add AVX512 implementation of GEMM - q4kx8" to "Add AVX512 implementation of GEMM - Q4_Kx8" on Apr 11, 2025.
ggerganov merged commit eccc7a1 into ggml-org:master on Apr 15, 2025.
47 of 51 checks passed