Too much memory usage when using Ultra 200V's NPU #12841
Comments
Hi, could you please provide more information about the memory usage (e.g. which model and configuration)?
I'm afraid that we have no plan to open-source it recently.
max-context-len: 1024. I checked this issue on two devices. With Ultra 258V, CPU memory usage is just 4 GB the first time it runs with AutoModel.from_pretrained(), but 9-10 GB when using AutoModel.load_low_bit(). But when I checked it with Ultra 125H, the memory usage was basically the same every time. By contrast, only hundreds of MB to 3 GB of CPU memory is used when using the iGPU.
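To make this kind of comparison reproducible across devices, the process's resident memory can be sampled around the load call. The sketch below is stdlib-only and hypothetical: the load step is a placeholder allocation standing in for AutoModel.load_low_bit(), and note that the `resource` module is POSIX-only (on Windows, `psutil.Process().memory_info().rss` is the usual alternative).

```python
import resource
import sys

def peak_rss_gb() -> float:
    """Peak resident set size of this process, in GB.

    ru_maxrss is reported in KB on Linux and in bytes on macOS.
    """
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        rss /= 1024  # bytes -> KB
    return rss / (1024 ** 2)  # KB -> GB

before = peak_rss_gb()
# Hypothetical load step -- replace with the real call, e.g.:
# model = AutoModel.load_low_bit(save_directory)
buf = bytearray(200 * 1024 * 1024)  # 200 MB stand-in allocation
after = peak_rss_gb()
print(f"peak RSS grew by roughly {after - before:.2f} GB")
```

Running this before and after each of from_pretrained() and load_low_bit() on both devices would show whether the extra 5-6 GB appears during loading or only afterwards.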
Which model is used? We will try to reproduce it first : )
qwen/qwen2.5-7B, with --low-bit set to sym_int4.
Hi, I failed to reproduce this on Ultra 7 258V with the qwen2 example. With …
Before calling load_model_from_file(save_directory):
CPU memory usage: 4 GB
NPU memory usage: 0 GB
After calling load_model_from_file(save_directory):
CPU memory usage: 12.5 GB
NPU memory usage: 8.9 GB
It seems that both the CPU and the NPU allocate the same amount of memory for the model.
Is this an issue, or is it by design?
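For the numbers above, the CPU delta (12.5 − 4 = 8.5 GB) and the NPU delta (8.9 GB) are nearly identical, which is what a duplicated copy of the weights would look like. A tiny hypothetical helper makes the comparison explicit; the 0.5 GB tolerance is an arbitrary choice, not anything from the library:

```python
def memory_deltas(cpu_before, cpu_after, npu_before, npu_after, tol=0.5):
    """Return (cpu_delta, npu_delta, duplicated), all in GB.

    `duplicated` is True when both memories grew by roughly the same
    amount (within `tol` GB), hinting that the model weights may be
    resident on both the CPU and the NPU at once.
    """
    cpu_delta = cpu_after - cpu_before
    npu_delta = npu_after - npu_before
    return cpu_delta, npu_delta, abs(cpu_delta - npu_delta) <= tol

# The figures reported above:
print(memory_deltas(4.0, 12.5, 0.0, 8.9))  # → (8.5, 8.9, True)
```

If the CPU copy is only a staging buffer used while compiling the model for the NPU, it should be possible to release it after load_model_from_file() returns; whether that happens is exactly the question here.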
By the way, is there any plan to open-source bigdl-core-npu/npu-llm-cpp?