
[Question] About CPU performance #1666

Open
mingfeima opened this issue Sep 3, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@mingfeima

Hi, I am an engineer from Intel and I work mostly on the performance optimization of PyTorch on Intel Xeon CPUs (I am also the PyTorch module maintainer for CPU performance). I just came across this amazing project, and the chart in the blog post fast-llama-2-on-cpus-with-sparse-fine-tuning-and-deepsparse says DeepSparse accelerates the sparse-quantized Llama models to 6-8x faster than the dense FP32 baseline.

[Image: throughput chart from the blog post comparing sparse-quantized Llama against the dense FP32 baseline]

The 6-8x speedup of the sparse model over the dense model is a fascinating result. My goal is to check whether there is a chance to further improve the performance with our previous work on LLM optimizations.

I ran the script from https://github.com/neuralmagic/deepsparse?tab=readme-ov-file#try-it-now; however, the hardware profiler shows that hardware efficiency is still not very high: only ~12 cores are in use on average on a 40-core machine, leading to significant synchronization overhead and a very high CPI (cycles per instruction). Maybe I can do something to improve this, but I am not very familiar with this codebase, so I need some guidance here (the script I ran is sketched after the questions below):

  • How can I reproduce the results above?
  • How is the model deployed? With ONNX Runtime?
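
For reference, the script I ran is essentially the following. This is a minimal sketch based on the README's "Try It Now" example; the SparseZoo model stub and the max_new_tokens value are illustrative and may differ from what the README currently lists.

```python
# Minimal sketch of the README "Try It Now" flow; assumes the LLM extras
# are installed (pip install deepsparse[llm]). The SparseZoo stub below is
# illustrative and may not match the README's current example verbatim.
from deepsparse import TextGeneration

pipeline = TextGeneration(
    model="zoo:llama2-7b-gsm8k_llama2_pretrain-pruned60_quantized"
)
output = pipeline("Tell me about sparse inference on CPUs.", max_new_tokens=64)
print(output.generations[0].text)
```

It was during this run that the profiler showed the ~12-core average utilization mentioned above.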

Additionally, are you continuing this sparse fine-tuning work on other models, for example Llama 3? And what about int4?

mingfeima added the enhancement label Sep 3, 2024
@Nafay-0

Nafay-0 commented Oct 21, 2024

Hey @mingfeima, I was curious if you’ve found a solution to the core utilization issue or made any progress with optimizing performance? I’m tackling a similar challenge and would love to hear about any updates or insights you’ve gained!

@mingfeima
Author

> Hey @mingfeima, I was curious if you’ve found a solution to the core utilization issue or made any progress with optimizing performance? I’m tackling a similar challenge and would love to hear about any updates or insights you’ve gained!

I need additional information about how the model is being deployed before I can investigate how to optimize its performance.

@Nafay-0

Nafay-0 commented Oct 28, 2024

Currently I am just trying to run a model on CPU locally and optimize its performance.
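
For anyone hitting the same core-utilization question: DeepSparse's Engine exposes a num_cores argument, which is one knob to check against the ~12-core average reported above. A minimal sketch, assuming a locally exported ONNX model (the path and input shape below are hypothetical placeholders):

```python
# Hedged sketch: pin the DeepSparse engine to an explicit core count via
# the Engine API's num_cores argument. The model path and input shape are
# placeholders; substitute your own exported model and its real inputs.
import numpy as np
from deepsparse import Engine

engine = Engine(model="./model.onnx", batch_size=1, num_cores=40)

# Dummy input for illustration; real models need inputs matching their graph.
inputs = [np.random.rand(1, 3, 224, 224).astype(np.float32)]
outputs = engine.run(inputs)
print(outputs[0].shape)
```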
