
PR: Refine ggml-hexagon backend (Qualcomm Hexagon NPU backend) for latest ggml, whisper.cpp, llama.cpp #12326

Open · wants to merge 104 commits into master from pr_to_upstream

Conversation

@zhouwg (Contributor) commented Mar 11, 2025

  • I have read the contributing guidelines
  • Self-reported review complexity:
    * [ ] Low
    * [x] Medium (the code on the ARM-AP side is of medium complexity; the code on the cDSP side (hexagon-kernels) is of high complexity)
    * [ ] High
  • Testing Done
    * [x] test-backend-ops and llama-cli through HWACCEL_QNN on a Qualcomm Snapdragon 8Gen3 equipped Android phone
    * [x] test-backend-ops through HWACCEL_CDSP on Qualcomm Snapdragon 8Gen3 and 8Elite (aka 8Gen4) equipped Android phones (set enable_rpc_ion_mempool to 1 in scripts/ggml-hexagon.cfg before running "./scripts/build-run-android.sh run_testops")
    * [x] llama-cli through HWACCEL_CDSP on Qualcomm Snapdragon 8Gen3 and 8Elite (aka 8Gen4) equipped Android phones

PR Description

this PR is a continued effort of my original PR #6869 from 04/2024, focused on the final mission:

  • how to utilize the Qualcomm Hexagon NPU maximally with the highly well-designed and highly compact ggml machine learning framework.

the full (and TL;DR) description of this PR can be found at my forked llama.cpp project: zhouwg#30.

Features

  • provide a concise reference implementation of HWACCEL_QNN in this PR: offload ggml ops to QNN.

  • provide a very fast approach (HWACCEL_CDSP), closely analogous to Intel's ggml-sycl or Qualcomm's ggml-opencl, in this PR: offload ggml ops to the Hexagon cDSP directly.

  • the Hexagon NPU performance of the HWACCEL_QNN approach and the HWACCEL_CDSP approach can be easily compared.

  • dynamic adjustment of running parameters through ggml-hexagon.cfg (this idea comes from @ngxson in his draft AI-dedicated PR; more parameters can be added in this configuration file). See the illustrative fragment after this list.

  • probe/detect the Snapdragon SoC information at runtime; accordingly, the code might/should run well on the following Qualcomm DSP versions:
    #v68 --- Snapdragon 888
    #v69 --- Snapdragon 8 Gen1
    #v73 --- Snapdragon 8 Gen2
    #v75 --- Snapdragon 8 Gen3 (verified)
    #v79 --- Snapdragon 8 Elite (aka 8 Gen4) (verified)
    (screenshots: runtime SoC detection logs)

  • provide the big picture of the ggml-hexagon backend in this PR for further or other related dev activities in this great pure-tech community.
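For reference, here is an illustrative fragment of scripts/ggml-hexagon.cfg showing the two parameters discussed in this PR; the exact file layout is an assumption, not verbatim from the repository:

    # illustrative fragment of scripts/ggml-hexagon.cfg (layout is an assumption)
    # hwaccel approach: 0 = HWACCEL_QNN, 2 = HWACCEL_CDSP
    hwaccel_approach = 2
    # cDSP RPC ION memory pool: 0 = disabled, 1 = enabled
    enable_rpc_ion_mempool = 1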

How to build ggml-hexagon source code for Android and verify the ggml-hexagon backend on a Snapdragon-based phone

Ubuntu 20.04 or 22.04 is validated and recommended as the host machine (other Linux distributions, a Linux VM, or WSL on Windows 10/11 might also work):

  • use build-run-android.sh to download the Android NDK and Qualcomm QNN SDK automatically; the Qualcomm Hexagon SDK must be obtained with a Qualcomm developer account and cannot be downloaded automatically by this script.

  • we need an adb-connected Android smartphone running on one of the Qualcomm SoCs below:

    SM8450 (Snapdragon 8 Gen 1+)
    SM8550 (Snapdragon 8 Gen 2)
    SM8650 (Snapdragon 8 Gen 3)
    SM8750-AB (Snapdragon 8 Elite) (aka Snapdragon 8 Gen 4)

  git clone https://github.com/zhouwg/ggml-hexagon
  cd ggml-hexagon
  git checkout pr_to_upstream

 ./scripts/build-run-android.sh 
Usage:
  ./scripts/build-run-android.sh help
  ./scripts/build-run-android.sh print_oplist
  ./scripts/build-run-android.sh build
  ./scripts/build-run-android.sh updateqnnlib
  ./scripts/build-run-android.sh run_testops
  ./scripts/build-run-android.sh run_testop          [ADD/MUL_MAT]
  ./scripts/build-run-android.sh run_llamacli
  ./scripts/build-run-android.sh run_llamabench
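
A typical first run could look like the following (a sketch assuming the Hexagon SDK is already installed and a phone is connected via adb; the subcommands are taken from the usage text above, and the comments describe what each step appears to do):

    ./scripts/build-run-android.sh build          # download the Android NDK and QNN SDK on first run, then build
    ./scripts/build-run-android.sh updateqnnlib   # push the QNN runtime libraries to the phone
    ./scripts/build-run-android.sh run_testops    # run test-backend-ops on the phone
    adb logcat | grep ggml-hexagon                # inspect the backend's log output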


from the log output of "adb logcat | grep ggml-hexagon" we can verify that this backend works as expected.

Hexagon NPU Performance

the test phones are a Snapdragon 8 Gen3 Android phone and a Snapdragon 8 Elite (aka 8 Gen4) Android phone; the test model is qwen1_5-1_8b-chat-q4_0.gguf. The following is a simple comparison between the HWACCEL_QNN approach and the HWACCEL_CDSP approach:

case-1: GGML_OP_ADD performance comparison between QNN-NPU and cDSP (a self-made ggml-dsp-ut program is used for the performance comparison between QNN-NPU and cDSP):

(screenshots: GGML_OP_ADD performance, QNN-NPU vs. cDSP)

case-2: GGML_OP_MUL_MAT performance comparison between QNN-NPU and cDSP
mulmat through HWACCEL_CDSP (offload mulmat to the cDSP directly):

./scripts/build-run-android.sh run_testop MUL_MAT

mulmat through HWACCEL_QNN (offload mulmat to QNN-NPU) (set hwaccel_approach to 0 --- the hwaccel approach through QNN --- in scripts/ggml-hexagon.cfg and then run):

./scripts/build-run-android.sh run_testop MUL_MAT

from "adb logcat | grep ggml-hexagon" we can clearly see the performance difference of mulmat between HWACCEL_QNN and HWACCEL_CDSP: the NPU performance is really good, and the HWACCEL_CDSP approach is much faster than the QNN solution even when the cDSP RPC ION memory pool is disabled.

Big picture of ggml-hexagon backend

there are three technical approaches to implementing the ggml-hexagon backend for Qualcomm's Hexagon NPU:

  • general approach through the Qualcomm QNN SDK: offload ggml ops to QNN (QNN internally transfers them to the Hexagon cDSP)
  • general approach through the Qualcomm Hexagon SDK: offload ggml ops to the Hexagon cDSP directly, exactly analogous to Qualcomm's ggml-opencl or Intel's ggml-sycl.
  • special approach through the Qualcomm QNN SDK: map the entire ggml cgraph to a single QNN graph. This technical approach of "mapping the entire ggml computational graph to a single QNN graph" was already identified in 04/2024.
    enum hwaccel_approach_type {
        HWACCEL_QNN             = 0, // C API; before 03/11/2025; not easy, because the QNN SDK is a black-box/heavyweight SDK with many, many tricks
        HWACCEL_QNN_SINGLEGRAPH = 1, // C API; before 03/18/2025; very hard, because the mechanism is a black box and the workload is massive
        HWACCEL_CDSP            = 2, // C and assembly APIs; after 03/24/2025; hard, but we can do anything on the cDSP directly, because the Hexagon SDK is a very lightweight/thin SDK that lets us operate the hardware directly
        HWACCEL_SYCL            = 3, // personal proposal/assumption; a general and modern C++ API; N/A at the moment, because an essential adaptation layer would have to be provided by Qualcomm
    };
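
To make the role of this enum concrete, here is a minimal dispatch sketch; the ggmlhexagon_* helper names are hypothetical illustrations, not the actual functions in this PR:

    // Minimal sketch only: the ggmlhexagon_* helpers are hypothetical names,
    // not the actual functions in this PR. struct ggml_tensor is from ggml.h.
    static int ggmlhexagon_compute(int approach, struct ggml_tensor * op) {
        switch (approach) {
            case HWACCEL_QNN:  // offload the op to QNN; QNN internally lowers it to the cDSP
                return ggmlhexagon_compute_via_qnn(op);
            case HWACCEL_CDSP: // offload the op to the Hexagon cDSP directly through FastRPC
                return ggmlhexagon_compute_via_cdsp(op);
            default:           // HWACCEL_QNN_SINGLEGRAPH / HWACCEL_SYCL: not available
                return -1;
        }
    }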

the technical details of the "special approach through QNN" can be found at my forked llama.cpp project: zhouwg#24.
10+ reasons why I think HWACCEL_CDSP is the correct direction can be found at my forked llama.cpp project: zhouwg#28.

Todo tasks

  • fully understand/document the technical details of qidl; otherwise we are trying to build a city on sand, regardless of the technical approach. qidl is a binary tool that generates the very complicated and hard-to-customize bridge-layer code between the ARM-AP and the cDSP. I personally think the mechanism of qidl/FastRPC is very similar to the mechanism of a TEE, and the TEE's mechanism is much more flexible for developers. I also personally think that the bridge-layer code generated by qidl has a great impact on NPU performance in the HWACCEL_CDSP approach at the moment (my understanding might be incorrect; help/guidance/patches from domain experts are greatly appreciated). A workaround is to manually modify the important data structure "struct ggml_tensor" in ggml.h, but I don't think that is a sensible/acceptable approach. This is a P0 task.

  • implement a highly optimized (an exquisite algorithm with HVX SIMD instructions and HVX multithreading) q6_k mulmat in the hexagon-kernels on the cDSP side; qwen1_5-1_8b-chat-q4_0.gguf needs the q6_k mulmat. This is a P0 task, and AI experts must be involved in it. A minimal scalar sketch of what such a kernel computes is shown after this list.
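
To make the scope of a hexagon-kernel concrete, here is a scalar reference sketch of the simplest case (element-wise GGML_OP_ADD on contiguous f32 data). This is plain C and deliberately naive; a real hexagon-kernel would replace the loop with HVX SIMD intrinsics and split the work across HVX threads, and the function name is illustrative:

    // Scalar reference only; a production hexagon-kernel would use HVX SIMD
    // intrinsics and HVX multithreading instead of this plain loop.
    static void ggmldsp_add_f32(const float * src0, const float * src1,
                                float * dst, int n) {
        for (int i = 0; i < n; i++) {
            dst[i] = src0[i] + src1[i]; // element-wise GGML_OP_ADD
        }
    }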

Acknowledgement

  1. the implementation of HWACCEL_QNN is mainly ported/reverse-engineered from executorch (the implementation of the QNN backend in executorch comes from Qualcomm). The implementation of HWACCEL_CDSP borrows some code from Qualcomm's Hexagon SDK. One more important thing: I got breakthrough help from @chiwwang at Qualcomm Technologies Inc/Qualcomm Innovation Center in 04/2024. All in all, all the fundamental techniques behind this topic (a dedicated ggml/llama.cpp backend for Qualcomm's Hexagon NPU) come from Qualcomm.
  2. huge thanks to the excellent maintainers and original authors of ggml & llama.cpp; I learned so much from ggml & llama.cpp: their open-minded spirit and standout contributions are a great public good for the open-source community and our planet. One more important thing: the tiny ggml-dsp on the Hexagon cDSP side (aka the existing implementation of the hexagon kernels on the cDSP side; I'm not an AI expert, so this was the practical way for me) is completely ported/borrowed from the original ggml.
  3. huge thanks to @max-krasnyansky, a senior staff technical expert from Qualcomm headquarters, who gave important/valuable/breakthrough guidance on direction on 03/18/2025: QNN is not the right solution here.

Conclusion

after spending so much effort on the ggml-hexagon backend, I personally think:

  • some of the work in the hexagon-kernels is beyond my skillset at the moment, and AI experts must be involved in the remaining parts of the hexagon-kernels. AI experts only need to focus on the hexagon-kernels: AI experts and other domain experts around the world can help to improve the hexagon-kernels (the various mulmat kernels and norm/rmsnorm/softmax/...) on the cDSP side.
  • some design tricks from FFmpeg or GStreamer might be (or already are) used in ggml's backend subsystem: there can be more than one backend implementation for the same hardware accelerator --- an open-source version from the llama.cpp community and a commercial version from Qualcomm. This policy would be very helpful for the llama.cpp community and independent developers.

@max-krasnyansky, sorry to bother you; I understand your time is valuable, but could you take another look at this PR? ggml-hexagon.cpp v1.00 was released on 03/31/2025. I have tried my best to fix all compiler warnings in ggml-hexagon.cpp (the remaining warnings, including those on the cDSP side, are caused by the Hexagon SDK, which is beyond my skillset), to make the code in ggml-hexagon.cpp clearer, more concise, and better organized, and to make it bug-free (the memory issue/segfault in test-backend-ops is caused by code on the cDSP side and can be avoided by setting enable_rpc_ion_mempool to 1 in scripts/ggml-hexagon.cfg before running "./scripts/build-run-android.sh run_testops") and thread-safe.

[updated on 04/02/2025, 21:40] @max-krasnyansky, @chiwwang, sorry to bother you, but could you help review this PR so that other domain technical experts and AI experts can help improve the hexagon-kernels on the cDSP side?

[updated on 04/02/2025, 22:18] @ggerganov @slaren, sorry to bother you; I understand your time is valuable, but could you change the label of this PR to "Qualcomm NPU" and remove the labels "testing", "script", and "build"? Thanks so much!

@zhouwg (Contributor, Author) commented Mar 11, 2025

why "mapping the entire ggml's computational graph to QNN graph"(the second technical approach in above post) is not practical in ggml-qnn backend

  1. the general approach to "utilize the Hexagon NPU maximally in the QNN NPU backend" in Qualcomm's QNN Sample
    https://github.com/kantv-ai/kantv/blob/01a49bb2d3dc963d920ce7b6827524c8e6c0a70a/core/ggml/qnnsample/QnnSampleMain.cpp#L434-L484
    we can clearly see that there is a prebuilt binary module file (xxxxx.so) generated by one of Qualcomm's dedicated tools (which they call "qnn-pytorch-converter", "qnn-tensorflow-converter", or "qnn-tflite-converter"); this binary module file is converted from a very complicated C++ source file which is also generated by Qualcomm's dedicated tool. An example of such a C++ source file:
    https://github.com/kantv-ai/kantv/blob/kantv-poc-with-qnn/core/ggml/jni/Inception_v3.cpp

the key function in this complicated C++ source file is:
https://github.com/kantv-ai/kantv/blob/kantv-poc-with-qnn/core/ggml/jni/Inception_v3.cpp#L20634

we can clearly see that an ideal or expected QNN graph (a single QNN graph with many, many graph nodes) is generated/composed in this function. We can then understand that the code in QnnSampleMain.cpp is just routine or skeleton code. In this case, we clearly know that the single QNN graph was generated by Qualcomm's dedicated tool.

  2. the approach to "utilize the Hexagon NPU maximally in the QNN NPU backend" in Qualcomm's Genie (Generative AI Inference Extensions) software stack from Qualcomm's latest QNN SDK (2.32.0.250228, as of 03/11/2025)

https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-100/introduction.html

(screenshots: the Genie software stack from the QNN SDK documentation)

after tracking all the relevant code in the QNN SDK, we can clearly see that the core process of offloading inference to the NPU (HTP) backend is 90%-99% the same as the general approach of "utilize the Hexagon NPU maximally in the QNN NPU backend" in Qualcomm's QNN Sample. In this case, we clearly know that the single QNN graph was generated by Qualcomm's dedicated tool.

  3. the approach to "utilize the Hexagon NPU maximally in the QNN NPU backend" in XiaoMi StableDiffusionOnDevice

we can clearly see that a customized binary model, trained and provided by XiaoMi's AI team, is used in this open-source project: they claim a 10x performance gain with NPU inference. At the same time, after tracking the code carefully, we can clearly see that the main logic of this open-source project is 90% the same as Qualcomm's QNN Sample, but we still don't know how that single QNN graph was generated. What should we think at the moment?

  4. the approach to "utilize the Hexagon NPU maximally in the QNN NPU backend" in PowerInfer

this open-source project comes from a famous top Chinese university and can be considered a derived or highly customized project of llama.cpp. One of the highlights of this derived project is that its R&D developers implemented a closed-source QNN backend. Recently I found a highly related project on GitHub with help from a programmer I don't know, @zhuipiaochen. After tracking the code carefully, we can clearly see that the approach to "utilize the Hexagon NPU maximally in the QNN NPU backend" in this interesting project is 90% the same as the approach in Qualcomm's Genie or in Qualcomm's QNN Sample:

the last 3 steps are very similar to offloading 2D/3D matrix multiplication to the QNN backend in this PR. The difference between these two scenarios is that there are only 2 QNN graph nodes in the QNN graph of the 2D/3D mulmat on the QNN backend. In this case, we still don't know how the single QNN graph was generated. What should we think at the moment?

  5. the inference procedure in the existing implementation of llama.cpp
    (screenshot: the inference procedure in ggml-sycl)
    we can clearly see that the inference procedure in ggml-sycl is the typical skeleton of all existing ggml backends. Accordingly, there is a similar code snippet in this PR (the ggml-qnn backend):
    (screenshots: the corresponding code in the ggml-qnn backend)

ok, let me do an interesting experiment with the ggml-qnn backend in this PR:

  • uncomment line 3665 and line 3666 in the function ggml_backend_qnn_graph_compute(ggml_backend_t backend, struct ggml_cgraph * cgraph)
  • modify the configuration file <llama.cpp path>/scripts/ggml-qnn.cfg from
    (screenshot: the original configuration)
    to
    (screenshot: the modified configuration)
  • run the script ./scripts/build-run-android.sh run_llamacli 2 accordingly (this command launches LLM inference on the QNN NPU backend)

what can we see from the logs of adb logcat?

we can clearly see that there is no entire or complete ggml graph available in this function:
(screenshot: log output of the function)

accordingly, the logic or inference procedure in this function is exactly the same as the original/general approach in all ggml backends. This is a limitation of the existing implementation of the inference procedure/inference architecture in llama.cpp.

conclusion:

  • there is NO second technical approach in the ggml-qnn backend, because of the limitation of the existing implementation of llama.cpp.
  • the technical approach in this PR is the general approach and a standard step in all ggml backends, regardless of coding style.

[updated on 21:56, 03/12/2025] the conclusion here is incorrect because the analysis in case 5 is WRONG; the first technical approach in this PR is still meaningful (because all the op functions can be reused in the second technical approach after some minor adjustments), and the second technical approach should be finished in this PR or a similar PR. The analysis in cases 1/2/3/4, however, is completely correct, and the logic of this tech doc holds: Qualcomm provides dedicated binary tools to do LLM model conversion, which is exactly the hard work in the second technical approach of the ggml-qnn backend (composing an ideal QNN graph from the complete ggml cgraph, i.e. mapping the complete ggml cgraph to a single QNN graph). The second technical approach could also be implemented in this PR, but I don't think I can completely finish it because of my limited AI knowledge (there are hundreds of cgraph nodes and about 50+ ops), and real AI experts must be involved in the remaining parts of ggml-qnn. So, good luck to other similar PRs. A sketch of what this mapping implies follows below.
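
To illustrate the shape of the work the second technical approach implies, here is a sketch of walking a ggml cgraph and composing one QNN graph from it. The cgraph fields (n_nodes, nodes, op) are real ggml structures; the qnn_graph_* helpers are hypothetical placeholders for the per-op composition code that would have to be written, not real QNN SDK calls:

    #include "ggml.h"

    // Sketch of "mapping the complete ggml cgraph to a single QNN graph".
    // cgraph->n_nodes / cgraph->nodes[i] / node->op are real ggml fields;
    // the qnn_graph_* helpers are hypothetical placeholders, not QNN SDK APIs.
    static int map_cgraph_to_single_qnn_graph(struct ggml_cgraph * cgraph) {
        void * graph = qnn_graph_begin();                 // hypothetical
        for (int i = 0; i < cgraph->n_nodes; i++) {
            struct ggml_tensor * node = cgraph->nodes[i];
            switch (node->op) {
                case GGML_OP_ADD:     qnn_graph_add_add(graph, node);    break; // hypothetical
                case GGML_OP_MUL_MAT: qnn_graph_add_mulmat(graph, node); break; // hypothetical
                default:              return -1; // every one of the ~50 ops needs its own composition code
            }
        }
        return qnn_graph_finalize_and_execute(graph);     // hypothetical
    }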

I made a wrong analysis in step 5, and a misunderstanding in #12342 which slaren has already explained; the root cause of these two stupid mistakes is that I have very limited knowledge of real hard-core AI tech.

@Dampfinchen commented:

Nice job. NPU support is huge for this project. Do you think it's also possible to make it work on Exynos 2200 and 2400 NPUs?

@zhouwg (Contributor, Author) commented Mar 12, 2025

Nice job. NPU support is huge for this project. Do you think it's also possible to make it work on Exynos 2200 and 2400 NPUs?

thanks for your kind comment.

  1. Qualcomm's Hexagon NPU support is really huge work for this project, even though we now clearly know the principle: Qualcomm provides dedicated binary tools to do LLM model conversion in their dedicated AI software stacks, and some other closed-source implementations also use exactly this approach. So programmers must compose an ideal QNN graph from the complete ggml cgraph manually in the ggml-qnn backend if they choose the second technical approach ("mapping the complete ggml cgraph to a single QNN graph"). There are 800+ cgraph nodes and 50+ ops in qwen1_5-1_8b-chat-q4_0.gguf; accordingly, "(Hexagon) NPU support is huge for this project", and real AI experts must be involved in the remaining parts of ggml-qnn.
  2. I think I can make it (ggml-exynos or ggml-samsung) work on the Exynos 2200 if I can get a suitable phone (I can try to buy one) and the SDK & tech docs (this might not be easy because of the strict IPR policies at some big IT companies, as far as I understand at the moment), following the principle "make it run, then make it right, and finally make it fast"; this is one of my areas of expertise.
