
Generation speed issue #26

Open
eagle705 opened this issue Sep 20, 2024 · 2 comments

Comments


eagle705 commented Sep 20, 2024

I loaded the Llama 2 model as in the example successfully, but text generation is really slow.

(screenshot omitted)

[1] I'm not sure whether it uses MPS to accelerate generation. How can I confirm it?
[2] Is there a smaller LLM than 7B?

Here is my environment:

  • MacBook Air / M2 / 16GB / Sonoma 14.5
  • Xcode 15.4
  • ckpt: coreml-projects/Llama-2-7b-chat-coreml

eemilk commented Sep 20, 2024

There are 1B and 3B OpenELM models converted to Core ML:
https://huggingface.co/corenet-community/coreml-OpenELM-1_1B-Instruct
https://huggingface.co/corenet-community/coreml-OpenELM-3B-Instruct

Also, you can try upgrading to macOS 15 Sequoia; it includes a lot of performance optimizations for on-device LLMs.
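
For the question about whether the GPU is being used: one way to check (and to force it) is the compute-units setting on the Core ML model configuration. A minimal sketch, assuming the app loads the compiled model directly with Core ML; the file path is a placeholder, not the exact code from the example app:

```swift
import CoreML

// Ask Core ML to use the CPU and GPU; use .all to also allow the Neural Engine,
// or .cpuOnly to force CPU for comparison.
let config = MLModelConfiguration()
config.computeUnits = .cpuAndGPU

// Placeholder path: point this at the compiled .mlmodelc the app actually uses.
let modelURL = URL(fileURLWithPath: "Llama-2-7b-chat.mlmodelc")

do {
    let model = try MLModel(contentsOf: modelURL, configuration: config)
    // Confirms which compute units the loaded model was configured with.
    print("Loaded with compute units:", model.configuration.computeUnits.rawValue)
} catch {
    print("Failed to load model:", error)
}
```

While generation is running, Xcode's GPU gauge or the Core ML template in Instruments should then show whether the GPU is actually doing work.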


sl5035 commented Jan 9, 2025

Same here. I'm running a 1B model in Xcode on an iPhone 16 Pro Max simulator, and the generation speed is 0.07 tokens/s. I'm not sure whether this is actually the expected generation speed or whether I'm missing something (e.g. GPU acceleration). I have "Prefer Discrete GPU" selected in my simulator's GPU selection.

This is my spec:
MacBook Pro / M2 / 16GB / Sonoma 14.7.

Did upgrading to macOS 15 Sequoia help?
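
As a side note, a rough way to sanity-check the tokens/s number independently of the app's own reporting is to time a fixed number of generation steps. A minimal sketch; `generateNextToken()` is a placeholder for whatever the app's actual per-token generation call is:

```swift
import Foundation

// Placeholder for the app's real per-token generation step.
func generateNextToken() {
    // call into the model here
}

let tokenCount = 20
let start = Date()
for _ in 0..<tokenCount {
    generateNextToken()
}
let elapsed = Date().timeIntervalSince(start)
print(String(format: "%.2f tokens/s", Double(tokenCount) / elapsed))
```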
