# Running Llama 3/3.1 8B on non-CPU backends

## QNN

Please follow the instructions to deploy Llama 3 8B to an Android smartphone with a Qualcomm SoC.

## MPS

Export:

```shell
python -m examples.models.llama2.export_llama --checkpoint llama3.pt --params params.json -kv --disable_dynamic_shape --mps --use_sdpa_with_kv_cache -d fp32 -qmode 8da4w -G 32 --embedding-quantize 4,32
```

After exporting the MPS model `.pte` file, the iOS LLAMA app can run the model. `--embedding-quantize 4,32` is an optional argument that quantizes the embedding table (4-bit, group size 32) to reduce the model size.
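To illustrate what `--embedding-quantize 4,32` does conceptually, here is a minimal NumPy sketch of symmetric groupwise 4-bit quantization with group size 32. This is a toy illustration under simple assumptions, not ExecuTorch's actual quantization kernel:

```python
import numpy as np

def quantize_4bit_groupwise(weights: np.ndarray, group_size: int = 32):
    """Toy symmetric 4-bit groupwise quantization (illustrative only)."""
    rows, cols = weights.shape
    # Split each row into contiguous groups of `group_size` values.
    w = weights.reshape(rows, cols // group_size, group_size)
    # One scale per group: map the group's max |value| onto the signed 4-bit range.
    scales = np.abs(w).max(axis=-1, keepdims=True) / 7.0
    scales = np.where(scales == 0.0, 1.0, scales)  # avoid divide-by-zero for all-zero groups
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    rows = q.shape[0]
    return (q.astype(np.float32) * scales).reshape(rows, -1)

# Toy embedding table (128 tokens x 64 dims) standing in for the real one.
rng = np.random.default_rng(0)
emb = rng.standard_normal((128, 64)).astype(np.float32)
q, scales = quantize_4bit_groupwise(emb)
recon = dequantize(q, scales)
max_err = np.abs(emb - recon).max()
```

Storing 4-bit codes plus one scale per 32 values is what shrinks the embedding table, at the cost of the small per-group rounding error measured by `max_err` above.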

## CoreML

Export:

```shell
python -m examples.models.llama2.export_llama --checkpoint llama3.pt --params params.json -kv --disable_dynamic_shape --coreml --coreml-ios 18 --coreml-quantize b4w
```

After exporting the CoreML model `.pte` file, please follow the instructions to build the llama runner with the CoreML flags enabled.
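Both export commands above pass `-kv` (and, for MPS, `--use_sdpa_with_kv_cache`) to enable a KV cache, so each decoding step attends over cached keys and values instead of recomputing them for the whole prefix. A minimal single-head NumPy sketch of the idea (illustrative only, not ExecuTorch's implementation):

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

class KVCache:
    """Toy single-head KV cache: store K/V once per token, reuse at every later step."""
    def __init__(self, max_seq_len: int, head_dim: int):
        self.k = np.zeros((max_seq_len, head_dim), dtype=np.float32)
        self.v = np.zeros((max_seq_len, head_dim), dtype=np.float32)
        self.pos = 0

    def update(self, k_t: np.ndarray, v_t: np.ndarray) -> None:
        # Append this step's key/value at the next free slot.
        self.k[self.pos] = k_t
        self.v[self.pos] = v_t
        self.pos += 1

    def attend(self, q_t: np.ndarray) -> np.ndarray:
        # Attend from the current query to all cached positions so far.
        k, v = self.k[:self.pos], self.v[:self.pos]
        scores = softmax(k @ q_t / np.sqrt(len(q_t)))
        return scores @ v

# Three toy decoding steps with random per-token q/k/v projections.
rng = np.random.default_rng(0)
cache = KVCache(max_seq_len=8, head_dim=4)
outs = []
for _ in range(3):
    q_t, k_t, v_t = rng.standard_normal((3, 4)).astype(np.float32)
    cache.update(k_t, v_t)
    outs.append(cache.attend(q_t))
```

The cache turns each decoding step into one row of attention over `pos` cached entries rather than a full re-run over the sequence, which is what makes on-device autoregressive decoding tractable.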

## MTK

Please follow the instructions to deploy Llama 3 8B to an Android phone with a MediaTek chipset.