Inference of RWKV models on Qualcomm HTP (Hexagon Tensor Processor) using the QNN SDK

Features

  • Support for RWKV v5, v6, and (experimentally, WIP) v7 models.
  • Inference of RWKV using the QNN SDK, with Qualcomm CPU, GPU, or HTP (Hexagon Tensor Processor) as the backend.
  • Support for whole-model float16 inference (since Qualcomm HTP cannot do float32 math).
  • Support for INT16 activation and INT8 weight quantized inference (with some key operations kept in float16).
  • Support for INT16 activation and mixed INT4/INT8 weight quantized inference.

Prerequisites

  • Download and install the QNN SDK from the Qualcomm Developer Network.
  • Set up the QNN SDK environment by following the instructions in Qualcomm's documentation.
  • Set the $QNN_SDK_ROOT environment variable to point to the QNN SDK installation directory. By default the SDK is installed under /opt/qcom/aistack/qnn/{version}. (A minimal setup sketch follows this list.)
  • (Optional) Install the AIMET toolkit for AIMET quantization methods: https://quic.github.io/aimet-pages/releases/latest/install/index.html#quick-install
  • This project has been verified with:
    • QNN SDK 2.26.0
    • python==3.10 (as recommended by the QNN SDK documentation)
    • onnx==1.16.1
    • protobuf==3.20.2 (required for both QNN's onnx-converter and onnx==1.16.1 to work properly)
    • torch==2.2.2 (the QNN SDK is only verified against torch==1.13.0, but a newer release is fine here since torch is used only for model conversion and ONNX export; 2.2.2 is the version recommended by the AIMET toolkit)
    • Hardware: Qualcomm Snapdragon SM8650 with HTP v75 (Xiaomi Mi 14)
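
A minimal environment setup sketch (the SDK path/version and the use of bin/envsetup.sh are assumptions based on a typical QNN SDK installation; adjust them to your setup):

$ # Point QNN_SDK_ROOT at your QNN SDK installation
$ export QNN_SDK_ROOT=/opt/qcom/aistack/qnn/2.26.0
$ # Source the SDK environment script if your release ships one
$ source ${QNN_SDK_ROOT}/bin/envsetup.sh
$ # Python dependencies at the versions verified with this project
$ pip install onnx==1.16.1 protobuf==3.20.2 torch==2.2.2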

Usage

1. Convert model weights to a QNN model library file.

Converting a FP16 model

  • convert_model.py: usage: convert_model.py [-h] [--chunks CHUNKS] [--use_qnn_quant] [--act_bitwidth ACT_BITWIDTH] [--weights_bitwidth WEIGHTS_BITWIDTH] [--ext_embedding] model
  • Convert the model: python convert_model.py ../models/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth --chunks 4

Converting an A16W8 model

  • make_calibration_samples.py: usage: make_calibration_samples.py [-h] [--ext_embedding] model output chunks
  • Make calibration samples: python make_calibration_samples.py ../models/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth ./samples_1b6 2
  • Convert the model file: python convert_model.py ../models/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth --chunks 2 --use_qnn_quant --calib_data_path ./samples_1b6
  • The act_bitwidth and weights_bitwidth default to 16 and 8 respectively.
  • Note: Please keep the chunks parameter the same in both scripts.

Converting an A16W4 model

  • make_calibration_samples.py: usage: make_calibration_samples.py [-h] [--ext_embedding] model output chunks
  • Make calibration samples: python make_calibration_samples.py ../models/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth ./samples_1b6 2
  • Convert the model file, adding --linear_param_encodings quant_encodings/RWKV-x060-World-1B6-v2.1-20240328-ctx4096_mse_rwkv_gptq_exceptions_asym_torch_w4.encodings to the conversion command (see the sketch after this list). The quantization encodings are either taken from the pre-calculated ones (GDrive) or generated using AIMET; refer to AIMET_quant.md.
  • Some large Linear modules are quantized to 4-bit weights, while some are kept 8-bit for better accuracy.
  • Note: Please keep the chunks parameter the same in both scripts.
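
A sketch of the full A16W4 conversion command, assuming the same model, chunk count, and calibration samples as in the A16W8 example above (verify the flags against your checkout of convert_model.py):

$ python convert_model.py ../models/RWKV-x060-World-1B6-v2.1-20240328-ctx4096.pth --chunks 2 --use_qnn_quant --calib_data_path ./samples_1b6 --linear_param_encodings quant_encodings/RWKV-x060-World-1B6-v2.1-20240328-ctx4096_mse_rwkv_gptq_exceptions_asym_torch_w4.encodings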

The outputs will be placed in the lib/ directory. The model library contains the weights as well as the functions to prepare the graph. It can either be called on device using the libraries in lib/aarch64-android/, or be prepared on an x86 host machine using lib/x86_64-linux-clang/ to generate an HTP context cache. Qualcomm HTP limits the size of a model library file, which is why the model is split into multiple chunks.
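
For illustration, after converting with --chunks 2 the layout looks roughly like this (the file names follow the chunk naming used in step 2; treat this as a sketch rather than an exact listing):

lib/x86_64-linux-clang/libRWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk1of2.so
lib/x86_64-linux-clang/libRWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk2of2.so
lib/aarch64-android/libRWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk1of2.so
lib/aarch64-android/libRWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk2of2.so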

2. Generate HTP context cache

  • make_context_cache_binary.py: usage: make_context_cache_binary.py [-h] model_lib output_path {SM8650,SM8550,SC8380}
  • Example:
$ python make_context_cache_binary.py ./lib/x86_64-linux-clang/libRWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk1of2.so output/ SM8650
  • The script automatically processes all of the chunks together.
  • The outputs will be output/RWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk1of2.bin and output/RWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk2of2.bin.

3. Run inference on the device

3.1. Running on Qualcomm Snapdragon SM8650 with HTP v75 (Xiaomi Mi 14)

  • Build the demo code: make -C librwkv-qualcomm
  • Push the binary and the HTP context caches to the device:
adb push librwkv-qualcomm/obj/local/arm64-v8a/rwkv-qualcomm-demo /data/local/tmp/
adb push output/RWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk1of2.bin /data/local/tmp/
adb push output/RWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk2of2.bin /data/local/tmp/
  • Push the tokenizer model to the device: adb push assets/brwkv_vocab_v20230424.txt /data/local/tmp/
  • Push the following QNN libraries to /data/local/tmp/ on the device (change the HTP v75 suffix and SDK path to match your device and SDK version; a consolidated push sketch follows the run commands below):
/opt/qcom/aistack/qairt/2.22.6.240515/lib/aarch64-android/libQnnHtpNetRunExtensions.so
/opt/qcom/aistack/qairt/2.22.6.240515/lib/aarch64-android/libQnnSystem.so
/opt/qcom/aistack/qairt/2.22.6.240515/lib/aarch64-android/libQnnHtpV75Stub.so
/opt/qcom/aistack/qairt/2.22.6.240515/lib/hexagon-v75/unsigned/libQnnHtpV75Skel.so
  • If using external embedding, please push onnx/RWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk1of2.emb to /data/local/tmp/rwkv/ too.
  • Finally run the demo code:
adb shell
$ cd /data/local/tmp
$ export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/data/local/tmp
$ # Specify the path to the first model chunk. The second chunk will be loaded automatically.
$ ./rwkv-qualcomm-demo brwkv_vocab_v20230424.txt RWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk1of2.bin
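
A consolidated sketch for pushing the QNN runtime libraries listed above (the SDK path and version are examples taken from that list; substitute your own, and change v75 if your HTP generation differs):

$ QNN_LIBS=/opt/qcom/aistack/qairt/2.22.6.240515/lib
$ adb push ${QNN_LIBS}/aarch64-android/libQnnHtpNetRunExtensions.so /data/local/tmp/
$ adb push ${QNN_LIBS}/aarch64-android/libQnnSystem.so /data/local/tmp/
$ adb push ${QNN_LIBS}/aarch64-android/libQnnHtpV75Stub.so /data/local/tmp/
$ adb push ${QNN_LIBS}/hexagon-v75/unsigned/libQnnHtpV75Skel.so /data/local/tmp/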

3.2. Running on Qualcomm Snapdragon X Elite laptops

  • TODO

Example output:

RWKV v6 1B6 A16W4

130|houji:/data/local/tmp/rwkv $ ./rwkv-qualcomm-demo b_rwkv_vocab_v20230424.txt RWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk1of2.bin
Loading model context binary from RWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk1of2.bin
Reading chunk: RWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk1of2.bin
Buffer size: 719802320
Reading chunk: RWKV-x060-World-1B6-v2.1-20240328-ctx4096_chunk2of2.bin
Buffer size: 586727640
User: 请为我写一首诗。

Assistant: 当然,请告诉我你喜欢什么类型的诗歌。

User: 请写一首描写秋天景色的诗。

Assistant: 秋意渐浓,寒意渐深,
大地已是金黄如火,
落英纷飞,树影绰约,
人心也随之变得清静。
夜空中的繁星在闪闪,
思念似要被所有握住,
但又像是永不消散的孤注,
在这个秋天里如此特别。

请问这首诗符合您需求吗?

Average time per token: 0.0235644s
Average tokens per second: 42.4368

Performance

Running on the Qualcomm Snapdragon SM8650 with HTP v75 (Xiaomi Mi 14)

| Model | Precision | Generation tokens per second | LAMBADA ppl, acc |
|---|---|---|---|
| RWKV v6 1.6B | att-a16w8 + ffn-a16w4 | 42.4368 | 5.09183, 65.4182% |
| RWKV v6 1.6B | a16w8 | 31.6564 | 4.75009, 66.3497% |
| RWKV v6 1.6B | fp16 | 15.0434 | 4.63598, 67.2618% |
| RWKV v6 3B | att-a16w8 + ffn-a16w4 | 21.3172 | 4.46606, 68.8725% |
| RWKV v6 3B | a16w8 | 16.2146 | 3.9039, 71.3647% |

(Currently QNN's INT4 quantization is naive per-channel linear quantization; combined with the INT16 activation quantization, the perplexity is somewhat worse than that of the INT8 models. The LAMBADA accuracy is lower but still acceptable.)

(Experimental) Running with custom WKV kernel

| Model | Precision | Generation tokens per second | LAMBADA ppl, acc |
|---|---|---|---|
| RWKV v6 1.6B | att-a16w8 + ffn-a16w4 | 47.6698 | 5.09183, 65.4182% |
| RWKV v6 7B | a16w4 | 12.9782 | TODO |

Obsolete data from previous versions, kept for comparison:

| Model | Precision | Generation tokens per second | LAMBADA ppl, acc |
|---|---|---|---|
| RWKV v6 1.6B | att-a16w8 + ffn-a16w4 | 32.6703 | 4.65837, 66.7378% |
| RWKV v6 1.6B | a16w8 | 26.0707 | 4.59243, 67.3006% |
| RWKV v6 1.6B | fp16 | 15.0434 | 4.63598, 67.2618% |
| RWKV v6 3B | att-a16w8 + ffn-a16w4 | 17.3968 | 4.46606, 68.8725% |
  • RWKV-5-World-0.4B-v2-20231113-ctx4096, fp16: Average tokens per second: 50.7313
  • RWKV-5-ABC-82M-v1-20230901-ctx1024, fp16: Average tokens per second: 142.286

TODO

  • Add demo code for running inference on the device.
  • Add support for A16W8 quantized inference.
  • Add support for A16W4 quantized inference with AIMET quantization.
  • Add document for running on Snapdragon X Elite laptops.
  • Sequential prefilling on device.
  • Package a library for easy use and integration.