
Generation speed issue #26

Open
eagle705 opened this issue Sep 20, 2024 · 2 comments

Comments


eagle705 commented Sep 20, 2024

I loaded the Llama 2 model as in the example successfully, but text generation is really slow.

(screenshot omitted)

[1] I'm not sure whether it uses MPS to accelerate generation. How can I confirm it?
[2] Is there a smaller LLM than 7B?

Here is my environment:

  • MacBook Air / M2 / 16GB / Sonoma 14.5
  • Xcode 15.4
  • ckpt: coreml-projects/Llama-2-7b-chat-coreml

eemilk commented Sep 20, 2024

There are 1B and 3B OpenELM models converted to Core ML:
https://huggingface.co/corenet-community/coreml-OpenELM-1_1B-Instruct
https://huggingface.co/corenet-community/coreml-OpenELM-3B-Instruct

Also, you can try upgrading to macOS 15 Sequoia; it includes a lot of performance optimizations for on-device LLMs.
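
For the question about whether the GPU is being used: one way to check (and to force it) is the compute-units setting on the Core ML model configuration. A minimal sketch, assuming the app loads the compiled model directly with Core ML; the file path is a placeholder, not the exact code from the example app:

```swift
import CoreML

// Ask Core ML to use the CPU and GPU; use .all to also allow the Neural Engine,
// or .cpuOnly to force CPU for comparison.
let config = MLModelConfiguration()
config.computeUnits = .cpuAndGPU

// Placeholder path: point this at the compiled .mlmodelc the app actually uses.
let modelURL = URL(fileURLWithPath: "Llama-2-7b-chat.mlmodelc")

do {
    let model = try MLModel(contentsOf: modelURL, configuration: config)
    // Confirms which compute units the loaded model was configured with.
    print("Loaded with compute units:", model.configuration.computeUnits.rawValue)
} catch {
    print("Failed to load model:", error)
}
```

While generation is running, Xcode's GPU gauge or the Core ML template in Instruments should then show whether the GPU is actually doing work.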


sl5035 commented Jan 9, 2025

Same here. I'm running a 1B model in Xcode on an iPhone 16 Pro Max simulator, and the generation speed is 0.07 tokens/s. I'm not sure whether this is actually the expected generation speed or whether I'm missing something (e.g. GPU acceleration). I have "Prefer Discrete GPU" selected in my simulator's GPU selection.

This is my spec:
MacBook Pro / M2 / 16GB / Sonoma 14.7.

Did upgrading to macOS 15 Sequoia help?
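
As a side note, a rough way to sanity-check the tokens/s number independently of the app's own reporting is to time a fixed number of generation steps. A minimal sketch; `generateNextToken()` is a placeholder for whatever the app's actual per-token generation call is:

```swift
import Foundation

// Placeholder for the app's real per-token generation step.
func generateNextToken() {
    // call into the model here
}

let tokenCount = 20
let start = Date()
for _ in 0..<tokenCount {
    generateNextToken()
}
let elapsed = Date().timeIntervalSince(start)
print(String(format: "%.2f tokens/s", Double(tokenCount) / elapsed))
```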
