# Running Llama 3/3.1 8B on non-CPU backends

## QNN

Please follow the instructions to deploy Llama 3 8B to an Android smartphone with a Qualcomm SoC.

## MPS

Export:

```shell
python -m examples.models.llama2.export_llama --checkpoint llama3.pt --params params.json -kv --disable_dynamic_shape --mps --use_sdpa_with_kv_cache -d fp32 -qmode 8da4w -G 32 --embedding-quantize 4,32
```

After exporting the MPS model `.pte` file, the iOS LLAMA app can run the model. `--embedding-quantize 4,32` is an optional argument that quantizes the embedding table (4-bit, group size 32) to reduce the model size.
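To illustrate what `--embedding-quantize 4,32` does conceptually, here is a minimal NumPy sketch of symmetric groupwise 4-bit quantization with group size 32. This is a toy illustration under simple assumptions, not ExecuTorch's actual quantization kernel:

```python
import numpy as np

def quantize_4bit_groupwise(weights: np.ndarray, group_size: int = 32):
    """Toy symmetric 4-bit groupwise quantization (illustrative only)."""
    rows, cols = weights.shape
    # Split each row into contiguous groups of `group_size` values.
    w = weights.reshape(rows, cols // group_size, group_size)
    # One scale per group: map the group's max |value| onto the signed 4-bit range.
    scales = np.abs(w).max(axis=-1, keepdims=True) / 7.0
    scales = np.where(scales == 0.0, 1.0, scales)  # avoid divide-by-zero for all-zero groups
    q = np.clip(np.round(w / scales), -8, 7).astype(np.int8)
    return q, scales

def dequantize(q: np.ndarray, scales: np.ndarray) -> np.ndarray:
    rows = q.shape[0]
    return (q.astype(np.float32) * scales).reshape(rows, -1)

# Toy embedding table (128 tokens x 64 dims) standing in for the real one.
rng = np.random.default_rng(0)
emb = rng.standard_normal((128, 64)).astype(np.float32)
q, scales = quantize_4bit_groupwise(emb)
recon = dequantize(q, scales)
max_err = np.abs(emb - recon).max()
```

Storing 4-bit codes plus one scale per 32 values is what shrinks the embedding table, at the cost of the small per-group rounding error measured by `max_err` above.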

## CoreML

Export:

```shell
python -m examples.models.llama2.export_llama --checkpoint llama3.pt --params params.json -kv --disable_dynamic_shape --coreml --coreml-ios 18 --coreml-quantize b4w
```

After exporting the CoreML model `.pte` file, please follow the instructions to build the llama runner with the CoreML flags enabled.
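Both export commands above pass `-kv` (and, for MPS, `--use_sdpa_with_kv_cache`) to enable a KV cache, so each decoding step attends over cached keys and values instead of recomputing them for the whole prefix. A minimal single-head NumPy sketch of the idea (illustrative only, not ExecuTorch's implementation):

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

class KVCache:
    """Toy single-head KV cache: store K/V once per token, reuse at every later step."""
    def __init__(self, max_seq_len: int, head_dim: int):
        self.k = np.zeros((max_seq_len, head_dim), dtype=np.float32)
        self.v = np.zeros((max_seq_len, head_dim), dtype=np.float32)
        self.pos = 0

    def update(self, k_t: np.ndarray, v_t: np.ndarray) -> None:
        # Append this step's key/value at the next free slot.
        self.k[self.pos] = k_t
        self.v[self.pos] = v_t
        self.pos += 1

    def attend(self, q_t: np.ndarray) -> np.ndarray:
        # Attend from the current query to all cached positions so far.
        k, v = self.k[:self.pos], self.v[:self.pos]
        scores = softmax(k @ q_t / np.sqrt(len(q_t)))
        return scores @ v

# Three toy decoding steps with random per-token q/k/v projections.
rng = np.random.default_rng(0)
cache = KVCache(max_seq_len=8, head_dim=4)
outs = []
for _ in range(3):
    q_t, k_t, v_t = rng.standard_normal((3, 4)).astype(np.float32)
    cache.update(k_t, v_t)
    outs.append(cache.attend(q_t))
```

The cache turns each decoding step into one row of attention over `pos` cached entries rather than a full re-run over the sequence, which is what makes on-device autoregressive decoding tractable.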

## MTK

Please follow the instructions to deploy Llama 3 8B to an Android phone with a MediaTek chipset.