To deploy an LLM on a mobile device, we need a quantized model in GGUF format for CPU inference.
llama.cpp includes a bare-bones iOS app built with SwiftUI. You can find the example here. There is also a lengthy discussion, Performance of llama.cpp on Apple Silicon A-series, in which several models have been benchmarked and instructions are given for running the benchmarks yourself.
The general steps to run the app in the simulator:
git clone https://github.com/ggerganov/llama.cpp
Download Xcode on your Mac from the App Store. Don't forget to install the iOS simulator as well; you will be prompted to install it during Xcode installation.
In the terminal, from the directory where you cloned the repo, run:
cd llama.cpp/examples/llama.swiftui
xed .
This will open llama.swiftui in Xcode.
Select the iOS simulator at the top (iPhone 15 Pro Max in my case).
Click the run icon to build the app and start the iOS simulator.
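The terminal part of the steps above can be sketched as a small script. This is only a sketch: the function names `have_tool` and `open_llama_swiftui` are hypothetical helpers, not part of llama.cpp, and it assumes `git` and Xcode's `xed` command are on your PATH.

```shell
#!/bin/sh
# Sketch of the terminal steps above. Helper names are hypothetical.

# Return 0 if the named command is available on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Clone llama.cpp (unless it is already present) and open the
# SwiftUI example project in Xcode.
open_llama_swiftui() {
    for tool in git xed; do
        if ! have_tool "$tool"; then
            echo "missing required tool: $tool" >&2
            return 1
        fi
    done
    [ -d llama.cpp ] || git clone https://github.com/ggerganov/llama.cpp || return 1
    # Run in a subshell so the caller's working directory is unchanged.
    ( cd llama.cpp/examples/llama.swiftui && xed . )
}
```

Calling `open_llama_swiftui` from the directory where you want the clone should leave you with the project open in Xcode. Selecting the simulator and clicking run still happen in the Xcode window.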
I downloaded and loaded bigcode/starcoderbase-1b in GGUF format, which I quantized myself. The GGUF file can be downloaded from cosmo3769/starcoderbase-1b-GGUF.
I also downloaded and loaded bigcode/starcoderbase-3b in GGUF format, which I quantized as well. The GGUF file can be downloaded from cosmo3769/starcoderbase-3b-GGUF.
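Before loading a downloaded model into the app, it can help to confirm that the file really is GGUF. A GGUF file begins with the four ASCII bytes `GGUF`, so a quick check might look like the sketch below. The function name `is_gguf` is a hypothetical helper, and the check looks only at the magic bytes, not at the rest of the file.

```shell
#!/bin/sh
# Return 0 if the given file starts with the 4-byte GGUF magic ("GGUF").
# This does not validate anything beyond the first four bytes.
is_gguf() {
    [ -f "$1" ] || return 1
    [ "$(head -c 4 "$1")" = "GGUF" ]
}
```

For example, `is_gguf path/to/model.gguf && echo "looks like GGUF"`, where the path is a placeholder for wherever you saved the download.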