
Bug: Server hangs when number of threads used for decoding > number of CPUs it runs on #10397

Open
KevinRSX opened this issue on Nov 19, 2024 · 0 comments
Labels: bug-unconfirmed, medium severity

KevinRSX commented Nov 19, 2024

What happened?

As the title suggests, running the server with more decoding threads than the CPUs it is pinned to causes it to hang. This command hangs (11 threads on 10 CPUs):

taskset -c 1-10 ./llama-server -m models/SmolLM-360M.Q8_0.gguf -t 11

while this one does not (10 threads on 10 CPUs):

taskset -c 1-10 ./llama-server -m models/SmolLM-360M.Q8_0.gguf -t 10

To reproduce, the client can simply be curl, as in the provided example:

curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'

In the failing case, the client receives no response and no error.
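
For reference, a quick way to confirm how many CPUs the affinity mask actually exposes to the process (an illustrative check, not part of the original report; nproc honors sched_getaffinity, so it reflects the taskset restriction):

taskset -c 1-10 nproc
# prints 10; the hang occurs when -t exceeds this value

taskset -c 1-10 grep Cpus_allowed_list /proc/self/status
# Cpus_allowed_list: 1-10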

Name and Version

$ ./llama-cli --version
version: 4126 (d3481e63)
built with cc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-22) for x86_64-redhat-linux

What operating system are you seeing the problem on?

Linux

Relevant log output

Server log in the success case (start-up boilerplate logs truncated)
====================================================================
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv  update_slots: all slots are idle
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 13
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 13, n_tokens = 13, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 13, n_tokens = 13
slot      release: id  0 | task 0 | stop processing: n_past = 140, truncated = 0
slot print_timing: id  0 | task 0 |
prompt eval time =      25.09 ms /    13 tokens (    1.93 ms per token,   518.07 tokens per second)
       eval time =    1270.83 ms /   128 tokens (    9.93 ms per token,   100.72 tokens per second)
      total time =    1295.92 ms /   141 tokens
request: POST /completion 127.0.0.1 200
srv  update_slots: all slots are idle

Server log in the failure case
==============================
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv  update_slots: all slots are idle
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 13
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 13, n_tokens = 13, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 13, n_tokens = 13
(SERVER HANGS HERE AFTER PREFILL IS DONE)