Please follow the instructions to deploy Llama 3 8B to an Android smartphone with Qualcomm SoCs.
To target the MPS backend, export:
python -m examples.models.llama2.export_llama --checkpoint llama3.pt --params params.json -kv --disable_dynamic_shape --mps --use_sdpa_with_kv_cache -d fp32 -qmode 8da4w -G 32 --embedding-quantize 4,32
After exporting the MPS model .pte file, the iOS LLAMA app can run the model. --embedding-quantize 4,32 is an optional argument that quantizes the embeddings to reduce the model size.
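For repeated exports it can help to wrap the same command in a small script with the paths pulled out as variables. This is only a sketch: the checkpoint, params, and output paths are placeholders, the flag summary reflects a reading of the export_llama options, and --output_name is assumed to be supported for naming the generated .pte (drop it to keep the default name).

```bash
#!/usr/bin/env bash
# Sketch: parameterized MPS export (paths and output name are placeholders).
set -euo pipefail

CHECKPOINT=llama3.pt          # Llama 3 8B checkpoint
PARAMS=params.json            # matching params.json
OUTPUT=llama3_mps_8da4w.pte   # hypothetical output file name

# Flag summary (as understood from export_llama):
#   -kv                        enable the KV cache
#   --disable_dynamic_shape    export with static shapes
#   --mps                      delegate to the MPS backend
#   --use_sdpa_with_kv_cache   use the SDPA custom op with the KV cache
#   -d fp32                    dtype override
#   -qmode 8da4w               8-bit dynamic activations, 4-bit grouped weights
#   -G 32                      weight-quantization group size
#   --embedding-quantize 4,32  optional 4-bit embedding quantization, group size 32
python -m examples.models.llama2.export_llama \
  --checkpoint "$CHECKPOINT" --params "$PARAMS" \
  -kv --disable_dynamic_shape --mps --use_sdpa_with_kv_cache \
  -d fp32 -qmode 8da4w -G 32 --embedding-quantize 4,32 \
  --output_name "$OUTPUT"   # assumed flag; omit to use the default name
```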
To target the CoreML backend, export:
python -m examples.models.llama2.export_llama --checkpoint llama3.pt --params params.json -kv --disable_dynamic_shape --coreml --coreml-ios 18 --coreml-quantize b4w
After exporting the CoreML model .pte file, please follow the instructions to build the llama runner with the CoreML flags enabled.
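As a rough sketch of that build-and-run flow, assuming EXECUTORCH_BUILD_COREML is the CMake option that enables the CoreML delegate and that the runner binary is llama_main under examples/models/llama2 (the linked instructions are authoritative for the full flag set), it looks roughly like this; all paths and the prompt are placeholders.

```bash
# Sketch only: configure and build the llama runner with the CoreML delegate
# enabled, then run the exported model. The complete CMake flag set is in the
# linked instructions; EXECUTORCH_BUILD_COREML and the llama_main arguments
# below are assumptions, and the model/tokenizer paths are placeholders.
cmake -DCMAKE_BUILD_TYPE=Release \
      -DEXECUTORCH_BUILD_COREML=ON \
      -Bcmake-out .
cmake --build cmake-out -j8

# Run the exported CoreML .pte with a tokenizer and a prompt.
cmake-out/examples/models/llama2/llama_main \
    --model_path=llama3_coreml_b4w.pte \
    --tokenizer_path=tokenizer.model \
    --prompt="Once upon a time"
```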
Please follow the instructions to deploy Llama 3 8B to an Android phone with a MediaTek chip.