Regression: e0dbec0 (aka #12181) breaks pooled embeddings: mean #12517

Closed
s-u opened this issue Mar 22, 2025 · 2 comments · Fixed by #12545

Comments


s-u commented Mar 22, 2025

Name and Version

Affects all llama.cpp builds since e0dbec0, tested up to

version: 4941 (ba932df)
built with cc (Ubuntu 13.3.0-6ubuntu2-24.04) 13.3.0 for x86_64-linux-gnu

The bug is not present in

version: 4879 (f08f4b3)
built with cc (Ubuntu 13.3.0-6ubuntu2-24.04) 13.3.0 for x86_64-linux-gnu

Operating systems

Linux

Which llama.cpp modules do you know to be affected?

libllama (core library)

Command line

# Can be replicated with any model, here using Llama-3.3
# (-b/-c to reduce memory usage; not relevant to the bug - the model's ctx size works too)
llama-embedding -m Llama-3.3-70B-Instruct-Q6_K-00001-of-00002.gguf -ngl 90 -b 2048 -c 2048 -p 'hello, world' --pooling mean

Problem description & steps to reproduce

Fails in llm_graph_context::build_pooling with:
llama.cpp/ggml/src/ggml.c:2738: GGML_ASSERT(ggml_can_mul_mat(a, b)) failed

Reproduce with any model using llama-embedding --pooling mean, for example:

llama-embedding -m Llama-3.3-70B-Instruct-Q6_K-00001-of-00002.gguf \
   -ngl 90 -b 2048 -c 2048 -p 'hello, world' --pooling mean

The error is due to a shape mismatch between the inp and inp_mean tensors at llama-graph.cpp:1626.
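
For context, mean pooling reduces the per-token embeddings to one vector per sequence by multiplying inp (n_embd x n_tokens) with an averaging matrix inp_mean, so the two operands must share the token dimension; that shared dimension is exactly what GGML_ASSERT(ggml_can_mul_mat(a, b)) checks. A minimal standalone C sketch of that contract (toy sizes and names, not llama.cpp's actual code):

#include <stdio.h>

#define N_EMBD   4   /* toy size; the failing run has n_embd = 8192   */
#define N_TOKENS 3   /* toy size; the failing run has n_tokens = 2048 */
#define N_SEQ    1

int main(void) {
    /* inp: [n_embd x n_tokens] per-token embeddings */
    float inp[N_EMBD][N_TOKENS] = { {1,2,3}, {4,5,6}, {7,8,9}, {10,11,12} };

    /* inp_mean: [n_tokens x n_seq], each column holds 1/seq_len weights;
       the inner N_TOKENS dimension must match inp's - the moral
       equivalent of the ggml_can_mul_mat condition */
    float inp_mean[N_TOKENS][N_SEQ];
    for (int t = 0; t < N_TOKENS; t++)
        inp_mean[t][0] = 1.0f / N_TOKENS;

    /* pooled = inp * inp_mean: [n_embd x n_seq], one mean vector per sequence */
    float pooled[N_EMBD][N_SEQ] = {{0}};
    for (int e = 0; e < N_EMBD; e++)
        for (int s = 0; s < N_SEQ; s++)
            for (int t = 0; t < N_TOKENS; t++)
                pooled[e][s] += inp[e][t] * inp_mean[t][s];

    for (int e = 0; e < N_EMBD; e++)
        printf("pooled[%d] = %g\n", e, pooled[e][0]);  /* prints 2, 5, 8, 11 */
    return 0;
}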

Output of a run with additional debug prints showing the number of elements (nel) and rows (nrow) of inp and inp_mean:

llama_context: KV self size  =  640.00 MiB, K (f16):  320.00 MiB, V (f16):  320.00 MiB
llama_context: pipeline parallelism enabled (n_copies=4)
inp nel = 16777216, nrow = 2048
imp_mean nel = 4194304, nrow = 2048
inp nel = 16777216, nrow = 2048
imp_mean nel = 1, nrow = 1
llama.cpp/ggml/src/ggml.c:2738: GGML_ASSERT(ggml_can_mul_mat(a, b)) failed

The same run with llama.cpp 4879 (f08f4b3), i.e., before e0dbec0 (#12181):

llama_init_from_model: KV self size  =  640.00 MiB, K (f16):  320.00 MiB, V (f16):  320.00 MiB
llama_init_from_model:  CUDA_Host  output buffer size =     0.00 MiB
llama_init_from_model: pipeline parallelism enabled (n_copies=4)
inp nel = 16777216, nrow = 2048
imp_mean nel = 4194304, nrow = 2048
inp nel = 8192, nrow = 1
imp_mean nel = 1, nrow = 1
inp nel = 16777216, nrow = 2048
imp_mean nel = 4194304, nrow = 2048
llama_init_from_model:      CUDA0 compute buffer size =  1600.03 MiB
llama_init_from_model:      CUDA1 compute buffer size =  1664.06 MiB
llama_init_from_model:  CUDA_Host compute buffer size =   192.09 MiB
llama_init_from_model: graph nodes  = 2569
llama_init_from_model: graph splits = 3
common_init_from_params: setting dry_penalty_last_n to ctx_size = 2048
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
inp nel = 16384, nrow = 2
imp_mean nel = 4, nrow = 2
[...]
batch_decode: n_tokens = 3, n_seq = 1
inp nel = 24576, nrow = 3
imp_mean nel = 9, nrow = 3
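
Reading the shapes out of the two runs (my interpretation of the nel/nrow numbers): with n_embd = 8192 and a 2048-token batch,

inp      nel = 16777216 = 8192 x 2048   (n_embd   x n_tokens)
inp_mean nel =  4194304 = 2048 x 2048   (n_tokens x n_tokens)

which is a valid matmul pair. In the good build, when the graph is rebuilt for a smaller batch, both tensors shrink together (e.g. nel = 16384 / nel = 4 for n_tokens = 2). In the broken build, the second graph build keeps inp at the full 8192 x 2048 size while inp_mean collapses to a single element (nel = 1, nrow = 1), so ggml_can_mul_mat rejects the pair and the assert fires.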

First Bad Commit

e0dbec0

@ggerganov
Member

Check if #12545 fixes the issue.


s-u commented Mar 24, 2025

I can confirm that this fixes the issue. Thanks, the fast response is much appreciated!
