PR: Refine ggml-hexagon backend (Qualcomm Hexagon NPU backend) for latest ggml, whisper.cpp, llama.cpp #12326
why "mapping the entire ggml's computational graph to QNN graph"(the second technical approach in above post) is not practical in ggml-qnn backend
The key function in this complicated C++ source file shows that an ideal or expected QNN graph (a single QNN graph with very many graph nodes) is generated/composed inside this function; the rest of the code in QnnSampleMain.cpp is just routine skeleton code. In this case, we can tell that the single QNN graph was generated by Qualcomm's dedicated tool.
https://docs.qualcomm.com/bundle/publicresource/topics/80-63442-100/introduction.html After tracking all the relevant code in the QNN SDK, we can see that the core process of offloading inference to the NPU (HTP) backend is 90%-99% the same as the general approach of "utilize the Hexagon NPU maximally in the QNN NPU backend" in Qualcomm's QNN sample. In this case, too, the single QNN graph was generated by Qualcomm's dedicated tool.
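To make the "single QNN graph" idea concrete, below is a hedged sketch of the composition pattern through the QNN C API. The interface-table member names (graphCreate, graphAddNode, graphFinalize) come from the QNN SDK, but the helper function, the generated_op_configs container, and the exact signatures are illustrative assumptions, not code from this PR or from Qualcomm's tools.

```cpp
#include <vector>
#include "QnnInterface.h" // QNN SDK header providing the interface-table types

// hedged sketch: how generated model code composes ONE QNN graph with many
// nodes; a human never writes the op-config list by hand, the offline
// model-conversion tool emits it
static void compose_whole_model_graph(
        const QNN_INTERFACE_VER_TYPE      & qnn_interface,        // QNN function table (assumed member names)
        Qnn_ContextHandle_t                 context,
        const std::vector<Qnn_OpConfig_t> & generated_op_configs) // emitted offline by Qualcomm's tool
{
    Qnn_GraphHandle_t graph = nullptr;
    qnn_interface.graphCreate(context, "whole_model_graph", nullptr, &graph);

    // one Qnn_OpConfig_t per layer of the converted model
    for (const Qnn_OpConfig_t & op_config : generated_op_configs) {
        qnn_interface.graphAddNode(graph, op_config);
    }

    // finalize once; afterwards the whole network executes as a single NPU graph
    qnn_interface.graphFinalize(graph, nullptr, nullptr);
}
```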
In that project we can clearly see a customized model which was trained and provided by Xiaomi's AI team, and this customized binary model is used in the open-source project: they claim a 10x performance gain with NPU inference. At the same time, after tracking the code carefully, we can see that the main logic of this open-source project is 90% the same as Qualcomm's QNN sample, but we still don't know how that single QNN graph was generated. What should we conclude at this point?
This open-source project comes from a famous top Chinese university and can be considered a derived or highly customized fork of llama.cpp. One of the highlights of this derived project is that its R&D developers implemented a closed-source QNN backend. Recently I found a highly related project on GitHub with help from a programmer unknown to me, @zhuipiaochen. After tracking the code carefully, we can see that the approach of "utilize the Hexagon NPU maximally in the QNN NPU backend" in this interesting project is 90% the same as the approach in Qualcomm's Genie or in Qualcomm's QNN sample:
The last 3 steps are very similar to offloading 2D/3D matrix multiplication to the QNN backend in this PR; the difference between the two scenarios is that there are only 2 QNN graph nodes in the QNN graph of 2D/3D mulmat on the QNN backend. In this case, we still don't know how the single QNN graph was generated. What should we conclude at this point?
OK, let me run an interesting experiment with the ggml-qnn backend in this PR:
What can we see from the adb logcat output? There is clearly no entire or complete GGML graph inside this function; accordingly, the inference procedure in this function is exactly the same as the original, general approach used by all ggml backends (a sketch of that per-node dispatch pattern follows below). This is a limitation of the existing inference architecture in llama.cpp.
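For reference, here is a minimal sketch of the per-node dispatch pattern that every ggml backend implements. The ggml types and the GGML_STATUS_SUCCESS return value are real ggml API; the function name and the op cases are illustrative placeholders, not the exact code of this PR.

```cpp
#include "ggml.h"
#include "ggml-backend.h"

// minimal sketch of a ggml backend's graph-compute entry point: llama.cpp
// hands the backend a (sub)graph split, never the whole model graph at once
static enum ggml_status ggmlhexagon_graph_compute(ggml_backend_t backend,
                                                  struct ggml_cgraph * cgraph) {
    for (int i = 0; i < cgraph->n_nodes; i++) {
        struct ggml_tensor * node = cgraph->nodes[i];
        switch (node->op) {
            case GGML_OP_ADD:     /* offload this single op to QNN/cDSP */ break;
            case GGML_OP_MUL_MAT: /* offload this single op to QNN/cDSP */ break;
            default:              /* unsupported ops are scheduled onto
                                     other backends' splits */ break;
        }
    }
    return GGML_STATUS_SUCCESS;
}
```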
Conclusion [updated at 21:56, 03/12/2025]: the conclusion here is incorrect because the analysis in case 5 is WRONG. The first tech approach in this PR is still meaningful (all op functions can be reused in the second tech approach after some minor adjustment), and the second tech approach should be finished in this PR or another similar PR. The analysis in cases 1/2/3/4, however, is completely correct, and the logic of this tech doc holds: Qualcomm provides dedicated binary tools for LLM model conversion, which is exactly the hard part of the second tech approach for the ggml-qnn backend (composing an ideal QNN graph according to the complete ggml cgraph, i.e., mapping the complete ggml cgraph to a single QNN graph). The second tech approach could also be implemented in this PR, but I don't think I can finish it completely because of my limited AI knowledge (there are hundreds of cgraph nodes and about 50+ ops), and real AI experts must be involved in the remaining parts of ggml-qnn. So, good luck to other similar PRs. I made a wrong analysis in step 5 and a misunderstanding in #12342, which slaren has already explained; the root cause of these two stupid mistakes is my very limited knowledge of real hard-core AI tech.
Nice job. NPU support is huge for this project. Do you think it's also possible to make it work on the Exynos 2200 and 2400 NPUs?
thanks for your kind comment.
* [ ] Low
* [x] Medium (complexity of the code on the ARM-AP side is medium; complexity of the code on the cDSP side (hexagon-kernels) is high)
* [ ] High
* [x] `test-backend-ops` and `llama-cli` through HWACCEL_QNN on a Qualcomm Snapdragon 8Gen3 equipped Android phone
* [x] `test-backend-ops` through HWACCEL_CDSP on Qualcomm Snapdragon 8Gen3 & 8Elite (aka 8Gen4) equipped Android phones (modify enable_rpc_ion_mempool to 1 in scripts/ggml-hexagon.cfg before running "./scripts/build-run-android.sh run_testops")
* [x] `llama-cli` through HWACCEL_CDSP on Qualcomm Snapdragon 8Gen3 & 8Elite (aka 8Gen4) equipped Android phones

PR Description
This PR is a continuation of my original PR #6869 from 04/2024, focusing on the final mission.
The full and TL;DR description of this PR can be found at my forked llama.cpp project: zhouwg#30.
Features
provide a concise reference implementation of HWACCEL_QNN in this PR: offload ggml ops to QNN.
provide a very fast approach (HWACCEL_CDSP), similar in spirit to Intel's ggml-sycl or Qualcomm's ggml-opencl, in this PR: offload ggml ops to the Hexagon cDSP directly.
the Hexagon NPU performance of the HWACCEL_QNN approach and the HWACCEL_CDSP approach can be easily compared.
dynamic runtime parameter adjustment through ggml-hexagon.cfg (this idea comes from @ngxson's draft AI-dedicated PR; more parameters can be added to this configuration file, see the illustrative snippet after this list).
probe/detect Snapdragon SoC information at runtime; accordingly, the code might/should run well on the following Qualcomm DSP versions:
#v68 --- Snapdragon 888
#v69 --- Snapdragon 8 Gen1
#v73 --- Snapdragon 8 Gen2
#v75 --- Snapdragon 8 Gen3(verified)
#v79 --- Snapdragon 8 Elite(aka 8 Gen4) (verified)
provide the big picture of the ggml-hexagon backend in this PR for further or other related dev activities in this great pure-tech community.
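For illustration, here is a minimal sketch of what scripts/ggml-hexagon.cfg might contain. Only these two keys are mentioned in this PR (hwaccel_approach, where 0 selects the QNN path, and enable_rpc_ion_mempool), so the exact file syntax and any default values are assumptions.

```
# illustrative snippet of scripts/ggml-hexagon.cfg (syntax assumed)
hwaccel_approach       = 0   # 0: hwaccel approach through QNN, per this PR
enable_rpc_ion_mempool = 1   # set to 1 before "./scripts/build-run-android.sh run_testops"
```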
How to build the ggml-hexagon source code for Android and verify the ggml-hexagon backend on a Snapdragon based phone
Ubuntu 20.04/22.04 is validated and recommended as the host machine (other Linux distributions, a Linux VM, or WSL on Windows 10/11 might also work):
use build-run-android.sh to download the Android NDK and the Qualcomm QNN SDK automatically; the Qualcomm Hexagon SDK must be obtained with a Qualcomm developer account and cannot be downloaded automatically by this script.
we will need an adb-connected Android smartphone running on one of the Qualcomm SoCs below:
SM8450 (Snapdragon 8 Gen 1+)
SM8550 (Snapdragon 8 Gen 2)
SM8650 (Snapdragon 8 Gen 3)
SM8750-AB (Snapdragon 8 Elite) (aka Snapdragon 8 Gen 4)
we can confirm that this backend works as expected from the log output of "adb logcat | grep ggml-hexagon".
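Putting the steps above together, these are the only invocations confirmed in this PR (any other subcommands of the script are not shown here):

```sh
# run test-backend-ops on the adb-connected phone
./scripts/build-run-android.sh run_testops

# verify that the ggml-hexagon backend is active
adb logcat | grep ggml-hexagon
```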
Hexagon NPU Performance
the test phones are a Snapdragon 8 Gen3 Android phone and a Snapdragon 8 Elite (aka 8 Gen4) Android phone, and the test model is qwen1_5-1_8b-chat-q4_0.gguf. The following is a simple comparison between the HWACCEL_QNN approach and the HWACCEL_CDSP approach:
case-1: GGML_OP_ADD performance comparison between QNN-NPU and cDSP (a self-made ggml-dsp-ut program was used for the performance comparison between QNN-NPU and cDSP):
case-2: GGML_OP_MUL_MAT performance comparison between QNN-NPU and cDSP
mulmat through HWACCEL_CDSP (offload mulmat to the cDSP directly):
mulmat through HWACCEL_QNN (offload mulmat to QNN-NPU): modify hwaccel_approach to 0 (the hwaccel approach through QNN) in scripts/ggml-hexagon.cfg and then run.
From "adb logcat | grep ggml-hexagon" we can clearly see the performance difference of mulmat between HWACCEL_QNN and HWACCEL_CDSP: the cDSP NPU performance is really good, much faster than the QNN solution even when the cDSP RPC ION memory pool is disabled.
Big picture of ggml-hexagon backend
there are three tech approaches to implementing the ggml-hexagon backend for Qualcomm's Hexagon NPU:
the tech details of "the special approach through QNN" can be found at my forked llama.cpp project: zhouwg#24.
10+ reasons why I think HWACCEL_CDSP is the correct direction can be found at my forked llama.cpp project: zhouwg#28.
Todo tasks
fully understand/depict the tech details of qidl, otherwise we are trying to build a city on sand regardless of tech approach: qidl is a binary tool that generates the very complicated and hard-to-customize bridge-layer code between the ARM-AP and the cDSP. I personally think the mechanism of qidl/FastRPC is very similar to the mechanism of a TEE, and the TEE mechanism is much more flexible for developers. I also personally think that the bridge-layer code generated by qidl has a great impact on NPU performance in the HWACCEL_CDSP approach at the moment (my understanding might be incorrect, and help/guidance/patches from domain experts are greatly appreciated). A workaround is to manually modify the important data structure "struct ggml_tensor" in ggml.h, but I don't think that is an acceptable approach. This is a P0 task.
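To show what qidl actually consumes, here is a hypothetical FastRPC IDL sketch modeled on the Hexagon SDK's well-known calculator example. The interface name, the method, and the parameter types are made up for illustration; exact IDL type names and conventions may differ across Hexagon SDK versions.

```
// ggmlop.idl -- hypothetical sketch, modeled on the Hexagon SDK calculator example
#include "AEEStdDef.idl"
#include "remote.idl"

interface ggmlop : remote_handle64 {
    // qidl compiles this declaration into the ARM-AP stub and the cDSP
    // skeleton; the in/rout direction attributes drive the generated
    // marshalling (buffer copy or ION zero-copy) between the two sides
    long mulmat(in sequence<uint8> src0, in sequence<uint8> src1,
                rout sequence<uint8> dst);
};
```

Every tensor byte that crosses this boundary must be marshalled by the generated bridge code, which is presumably why that code matters so much for NPU performance in the HWACCEL_CDSP approach.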
implement a highly optimized q6_k mulmat (an exquisite algorithm with HVX SIMD instructions and HVX multithreading) in the hexagon-kernels on the cDSP side; qwen1_5-1_8b-chat-q4_0.gguf needs q6_k mulmat. This is a P0 task, and AI experts must be involved in it.
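Not the q6_k kernel itself, but a minimal taste of what HVX SIMD code on the cDSP side looks like: a plain int32 element-wise add, assuming 128-byte-aligned buffers and a build with HVX enabled. HVX_Vector and Q6_Vw_vadd_VwVw come from the Hexagon SDK headers; everything else is illustrative.

```c
#include <stdint.h>
#include <hexagon_types.h>   // HVX_Vector
#include <hexagon_protos.h>  // Q6_* intrinsics

// minimal HVX sketch: element-wise int32 add using 128-byte HVX vectors;
// assumes a/b/dst are 128-byte aligned (a real kernel must also handle
// misalignment, tails, quantized blocks, and multithreading)
static void hvx_add_s32(const int32_t * a, const int32_t * b, int32_t * dst, int n) {
    const int lanes = 128 / (int) sizeof(int32_t); // 32 int32 lanes per HVX vector
    int i = 0;
    for (; i + lanes <= n; i += lanes) {
        HVX_Vector va = *(const HVX_Vector *) (a + i);
        HVX_Vector vb = *(const HVX_Vector *) (b + i);
        *(HVX_Vector *) (dst + i) = Q6_Vw_vadd_VwVw(va, vb); // per-lane 32-bit add
    }
    for (; i < n; i++) {
        dst[i] = a[i] + b[i]; // scalar tail
    }
}
```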
Acknowledgement
Conclusion
after spending too much effort on the ggml-hexagon backend, I personally think:
@max-krasnyansky, sorry to bother you; I understand your time is valuable, but could you take another look at this PR? ggml-hexagon.cpp v1.00 was released on 03/31/2025. I have tried my best to fix all compiler warnings in ggml-hexagon.cpp (other warnings, and the warnings on the cDSP side, are caused by the Hexagon SDK and are beyond my skill set) and to make the code in ggml-hexagon.cpp clearer, more concise, better organized, thread-safe, and bug-free (the memory issue or segmentation fault in test-backend-ops is caused by code on the cDSP side and can be fixed by manually setting enable_rpc_ion_mempool to 1 in scripts/ggml-hexagon.cfg before running "./scripts/build-run-android.sh run_testops").
[updated on 04/02/2025, 21:40] @max-krasnyansky, @chiwwang, sorry to bother you, but could you help review this PR so that other domain technical experts and AI experts can help improve the hexagon-kernels on the cDSP side?
[updated on 04/02/2025, 22:18] @ggerganov @slaren, sorry to bother you; I understand your time is valuable, but could you change the label of this PR to "Qualcomm NPU" and remove the labels "testing", "script", and "build"? Thanks so much!