Too much memory usage when using Ultra 200V's NPU #12841

Open
shichang00 opened this issue Feb 18, 2025 · 5 comments

@shichang00

Before calling load_model_from_file(save_directory):
CPU memory usage: 4 GB
NPU memory usage: 0 GB

After calling load_model_from_file(save_directory):
CPU memory usage: 12.5 GB
NPU memory usage: 8.9 GB

It seems that both the CPU and the NPU allocated the same amount of memory for the model. Is this an issue, or is it designed this way for a reason?
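For reference, a minimal sketch of how the CPU-side numbers above can be measured, reusing the reporter's load_model_from_file / save_directory names; psutil only sees this process's resident memory, so the NPU figure still has to come from an external tool:

    # Memory probe (sketch): psutil reports the current process's RSS.
    # load_model_from_file / save_directory are the reporter's own names.
    import os
    import psutil

    def rss_gb() -> float:
        # Resident set size of the current process, in GB
        return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 3

    print(f"CPU RSS before load: {rss_gb():.1f} GB")
    model = load_model_from_file(save_directory)
    print(f"CPU RSS after load:  {rss_gb():.1f} GB")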

By the way, are there any plans to open-source bigdl-core-npu/npu-llm-cpp?

@plusbang
Contributor

plusbang commented Feb 19, 2025

Hi, could you please provide more information about the memory usage (e.g., which model and which max-context-len / max-prompt-len configuration are used)? Some CPU memory usage is expected because of the embedding and the KV cache.

I'm afraid we have no plans to open-source them in the near term.

@shichang00
Author

> Hi @dockerg, could you please provide more information about the memory usage (e.g., which model and which max-context-len / max-prompt-len configuration are used)? Some CPU memory usage is expected because of the embedding and the KV cache.
>
> I'm afraid we have no plans to open-source them in the near term.

max-context-len: 1024
max-prompt-len: 960

I checked this issue on two devices. With the Ultra 258V, CPU memory usage is just 4 GB the first time I run with AutoModel.from_pretrained(), but 9-10 GB when using AutoModel.load_low_bit().

But when I checked with the Ultra 125H, the memory usage was basically the same both ways.

For comparison, only a few hundred MB to 3 GB of CPU memory is used when running on the iGPU.
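For context, a minimal sketch of the two load paths being compared, assuming the AutoModel API from ipex_llm.transformers.npu_model; the keyword spellings (load_in_low_bit, max_context_len, max_prompt_len) mirror the flag names quoted in this thread and may differ between ipex-llm releases, and model_path / save_directory are placeholder values:

    # Sketch only: kwarg names are assumptions based on this thread's flags.
    from ipex_llm.transformers.npu_model import AutoModel

    model_path = "qwen/qwen2.5-7B"       # model named later in this thread
    save_directory = "./qwen2.5-7b-npu"  # hypothetical save path

    # First run: convert the checkpoint to low-bit and save it.
    model = AutoModel.from_pretrained(
        model_path,
        load_in_low_bit="sym_int4",  # --low-bit value from this thread
        max_context_len=1024,
        max_prompt_len=960,
        trust_remote_code=True,
    )
    model.save_low_bit(save_directory)

    # Later runs: load the converted checkpoint directly; this is the
    # path that shows ~9-10 GB CPU usage on the Ultra 258V above.
    model = AutoModel.load_low_bit(
        save_directory,
        max_context_len=1024,
        max_prompt_len=960,
        trust_remote_code=True,
    )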

@plusbang
Contributor

> max-context-len: 1024
> max-prompt-len: 960
>
> I checked this issue on two devices. With the Ultra 258V, CPU memory usage is just 4 GB the first time I run with AutoModel.from_pretrained(), but 9-10 GB when using AutoModel.load_low_bit().

Which model is used? We will try to reproduce it first : )

@shichang00
Author

> > max-context-len: 1024
> > max-prompt-len: 960
> >
> > I checked this issue on two devices. With the Ultra 258V, CPU memory usage is just 4 GB the first time I run with AutoModel.from_pretrained(), but 9-10 GB when using AutoModel.load_low_bit().
>
> Which model is used? We will try to reproduce it first :)

qwen/qwen2.5-7B, with --low-bit set to sym_int4

@plusbang
Contributor

plusbang commented Feb 19, 2025

> qwen/qwen2.5-7B, with --low-bit set to sym_int4

Hi, I failed to reproduce this on an Ultra 7 258V with the qwen2 example.

With the 32.0.100.3104 NPU driver and ipex-llm==2.2.0b20250218 (no runtime configuration is required), total memory usage during inference (using load_low_bit()) is ~6.2 GB, of which NPU shared memory usage is 4 GB.
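For context, a sketch of the inference step at which such a measurement would be taken, assuming the load_low_bit() path from this thread, the standard Hugging Face generate() API, and that a tokenizer was saved alongside the low-bit checkpoint; the path and prompt are placeholders:

    # Sketch: load the converted checkpoint and run one generation,
    # which is the point where total memory usage is observed.
    import torch
    from transformers import AutoTokenizer
    from ipex_llm.transformers.npu_model import AutoModel

    save_directory = "./qwen2.5-7b-npu"  # hypothetical save path
    model = AutoModel.load_low_bit(save_directory, trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(save_directory, trust_remote_code=True)

    inputs = tokenizer("What is AI?", return_tensors="pt")
    with torch.inference_mode():
        output = model.generate(inputs.input_ids, max_new_tokens=32)
    print(tokenizer.decode(output[0], skip_special_tokens=True))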
