feat: add more max_length constraint for resource limit machines #41

fann1993814 · 2023-11-15T07:38:20Z

Hi, @simonJJJ I am so glad to see you update for M1/M2 support, thanks.
So, I can close my previous PR #39

There are some useful features can help resource limit machines for computing.

Modify max_length for initialization into pipeline setting (when pipeline create), because MacBook Air M1's RAM cannot hold original setting (origin model training setting is too long), It will let gpu out of memory easily, also minimize compute space in lower length setting for kv
Scale MEM_SIZE and SCRATCH_SIZE make reasonable for max_length modification.

There are my experiments in this PR.

Experiments Setting

My Env
- MacBook Air M1
- Memory 8G
- MacOS 14.1.1
Commend Parameter
- ./build/bin/main -m qwen7b-ggml.bin -l 128 -v --tiktoken ~/Project/llm/Qwen-7B-Chat/qwen.tiktoken -p hello
CPU (M1 CPU, master branch)

./build/bin/main -m qwen7b-ggml.bin -l 128 -v --tiktoken ~/Project/llm/Qwen-7B-Chat/qwen.tiktoken -p hello
system info: | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
inference config: | max_length = 128 | max_context_length = 512 | top_k = 0 | top_p = 0.5 | temperature = 0.95 | num_threads = 0 |
loaded qwen model from qwen7b-ggml.bin within: 75.465 ms

Hello! How can I help you today?

prompt time: 4385.79 ms / 20 tokens (219.289 ms/token)
output time: 67534 ms / 10 tokens (6753.4 ms/token)
total time: 71919.8 ms

GPU (M1 GPU, master branch)

I cannot run, because OOM.

CPU (M1 CPU, this PR)

./build/bin/main -m qwen7b-ggml.bin -l 128 -v --tiktoken ~/Project/llm/Qwen-7B-Chat/qwen.tiktoken -p hello
system info: | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | METAL = 0 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
inference config: | max_length = 128 | max_context_length = 512 | top_k = 0 | top_p = 0.5 | temperature = 0.95 | num_threads = 0 |
loaded qwen model from qwen7b-ggml.bin within: 80.018 ms

Hello! How can I help you today? Is there something you would like to talk about or learn more about? I'm here to answer any questions you may have.

prompt time: 5553.58 ms / 20 tokens (277.679 ms/token)
output time: 3417.43 ms / 35 tokens (97.64 ms/token)
total time: 8971.01 ms

GPU (M1 GPU with Metal, this PR)

./build/bin/main -m qwen7b-ggml.bin -l 128 -v --tiktoken ~/Project/llm/Qwen-7B-Chat/qwen.tiktoken -p hello
system info: | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | METAL = 1 | BLAS = 1 | SSE3 = 0 | VSX = 0 |
inference config: | max_length = 128 | max_context_length = 512 | top_k = 0 | top_p = 0.5 | temperature = 0.95 | num_threads = 0 |
loaded qwen model from qwen7b-ggml.bin within: 122.671 ms

Hello! How can I help you today?

prompt time: 460.668 ms / 20 tokens (23.033 ms/token)
output time: 811.612 ms / 10 tokens (81.161 ms/token)
total time: 1272.28 ms

Spend Time (Output Time, Lower is better)

CPU(master)	GPU(master)	CPU(this PR)	GPU(this PR)
6753.4 ms/token	OOM	97.64 ms/token	81.161 ms/token

fann1993814 added 2 commits November 15, 2023 15:22

feat: modify max_length and scale parma by max_length

cfb9490

feat: modify max_length and scale parma by max_length

9fae4c6

fann1993814 changed the title ~~feat: add max_length control~~ feat: add more max_length control for resource limit machines Nov 15, 2023

fann1993814 mentioned this pull request Nov 16, 2023

Qwen-7B-Q4_0 works well on Mac M1, but Qwen-7B-Q8_0 cannot work with a ggml-metal error. #42

Open

fann1993814 changed the title ~~feat: add more max_length control for resource limit machines~~ feat: add more max_length constraint for resource limit machines Nov 20, 2023

fann1993814 mentioned this pull request Nov 24, 2023

Support --gpu-layers #45

Closed

Minami-su approved these changes Dec 3, 2023

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add more max_length constraint for resource limit machines #41

feat: add more max_length constraint for resource limit machines #41

fann1993814 commented Nov 15, 2023 •

edited

Loading

feat: add more max_length constraint for resource limit machines #41

Are you sure you want to change the base?

feat: add more max_length constraint for resource limit machines #41

Conversation

fann1993814 commented Nov 15, 2023 • edited Loading

fann1993814 commented Nov 15, 2023 •

edited

Loading