Too much memory usage when using Ultra 200V's NPU #12841

Open
shichang00 opened this issue Feb 18, 2025 · 5 comments

@shichang00

Before calling load_model_from_file(save_directory):
CPU memory usage: 4 GB
NPU memory usage: 0 GB

After calling load_model_from_file(save_directory):
CPU memory usage: 12.5 GB
NPU memory usage: 8.9 GB

It seems that both the CPU and the NPU allocated the same amount of memory for the model. Is this an issue, or is it designed this way for a reason?
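For reference, a minimal sketch of how the CPU-side numbers above can be measured, reusing the reporter's load_model_from_file / save_directory names; psutil only sees this process's resident memory, so the NPU figure still has to come from an external tool:

    # Memory probe (sketch): psutil reports the current process's RSS.
    # load_model_from_file / save_directory are the reporter's own names.
    import os
    import psutil

    def rss_gb() -> float:
        # Resident set size of the current process, in GB
        return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3

    print(f"CPU RSS before load: {rss_gb():.1f} GB")
    model = load_model_from_file(save_directory)
    print(f"CPU RSS after load:  {rss_gb():.1f} GB")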

By the way, are there any plans to open-source bigdl-core-npu/npu-llm-cpp?

@plusbang
Contributor

plusbang commented Feb 19, 2025

Hi, could you please provide more information about the memory usage (e.g., which model and which max-context-len / max-prompt-len configuration are used)? Some CPU memory usage is expected because of the embedding and the KV cache.

I'm afraid we have no plans to open-source them in the near term.

@shichang00
Author

> Hi @dockerg, could you please provide more information about the memory usage (e.g., which model and which max-context-len / max-prompt-len configuration are used)? Some CPU memory usage is expected because of the embedding and the KV cache.
>
> I'm afraid we have no plans to open-source them in the near term.

max-context-len: 1024
max-prompt-len: 960

I checked this issue on two devices. With the Ultra 258V, CPU memory usage is just 4 GB the first time I run with AutoModel.from_pretrained(), but 9-10 GB when using AutoModel.load_low_bit().

But when I checked with the Ultra 125H, the memory usage was basically the same both ways.

For comparison, only a few hundred MB to 3 GB of CPU memory is used when running on the iGPU.
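For context, a minimal sketch of the two load paths being compared, assuming the AutoModel API from ipex_llm.transformers.npu_model; the keyword spellings (load_in_low_bit, max_context_len, max_prompt_len) mirror the flag names quoted in this thread and may differ between ipex-llm releases, and model_path / save_directory are placeholder values:

    # Sketch only: kwarg names are assumptions based on this thread's flags.
    from ipex_llm.transformers.npu_model import AutoModel

    model_path = "qwen/qwen2.5-7B"       # model named later in this thread
    save_directory = "./qwen2.5-7b-npu"  # hypothetical save path

    # First run: convert the checkpoint to low-bit and save it.
    model = AutoModel.from_pretrained(
        model_path,
        load_in_low_bit="sym_int4",  # --low-bit value from this thread
        max_context_len=1024,
        max_prompt_len=960,
        trust_remote_code=True,
    )
    model.save_low_bit(save_directory)

    # Later runs: load the converted checkpoint directly; this is the
    # path that shows ~9-10 GB CPU usage on the Ultra 258V above.
    model = AutoModel.load_low_bit(
        save_directory,
        max_context_len=1024,
        max_prompt_len=960,
        trust_remote_code=True,
    )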

@plusbang
Contributor

> max-context-len: 1024
> max-prompt-len: 960
>
> I checked this issue on two devices. With the Ultra 258V, CPU memory usage is just 4 GB the first time I run with AutoModel.from_pretrained(), but 9-10 GB when using AutoModel.load_low_bit().

Which model is used? We will try to reproduce it first : )

@shichang00
Author

> > max-context-len: 1024
> > max-prompt-len: 960
> >
> > I checked this issue on two devices. With the Ultra 258V, CPU memory usage is just 4 GB the first time I run with AutoModel.from_pretrained(), but 9-10 GB when using AutoModel.load_low_bit().
>
> Which model is used? We will try to reproduce it first :)

qwen/qwen2.5-7B, with --low-bit set to sym_int4

@plusbang
Contributor

plusbang commented Feb 19, 2025

> qwen/qwen2.5-7B, with --low-bit set to sym_int4

Hi, I failed to reproduce this on an Ultra 7 258V with the qwen2 example.

With the 32.0.100.3104 NPU driver and ipex-llm==2.2.0b20250218 (no runtime configuration is required), total memory usage during inference (using load_low_bit()) is ~6.2 GB, of which NPU shared memory usage is 4 GB.
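For context, a sketch of the inference step at which such a measurement would be taken, assuming the load_low_bit() path from this thread, the standard Hugging Face generate() API, and that a tokenizer was saved alongside the low-bit checkpoint; the path and prompt are placeholders:

    # Sketch: load the converted checkpoint and run one generation,
    # which is the point where total memory usage is observed.
    import torch
    from transformers import AutoTokenizer
    from ipex_llm.transformers.npu_model import AutoModel

    save_directory = "./qwen2.5-7b-npu"  # hypothetical save path
    model = AutoModel.load_low_bit(save_directory, trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(save_directory, trust_remote_code=True)

    inputs = tokenizer("What is AI?", return_tensors="pt")
    with torch.inference_mode():
        output = model.generate(inputs.input_ids, max_new_tokens=32)
    print(tokenizer.decode(output[0], skip_special_tokens=True))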
