
Bug: Server hangs when number of threads used for decoding > number of CPUs it runs on #10397

Open
KevinRSX opened this issue on Nov 19, 2024 · 0 comments
Labels: bug-unconfirmed, medium severity

KevinRSX commented Nov 19, 2024

What happened?

As the title suggests, running the server with more decoding threads than the CPUs it is pinned to causes it to hang. This command hangs (11 threads on 10 CPUs):

taskset -c 1-10 ./llama-server -m models/SmolLM-360M.Q8_0.gguf -t 11

while this one does not (10 threads on 10 CPUs):

taskset -c 1-10 ./llama-server -m models/SmolLM-360M.Q8_0.gguf -t 10

To reproduce, the client can simply be curl, as in the provided example:

curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "Building a website can be done in 10 simple steps:","n_predict": 128}'

In the failing case, the client receives no response and no error.
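
For reference, a quick way to confirm how many CPUs the affinity mask actually exposes to the process (an illustrative check, not part of the original report; nproc honors sched_getaffinity, so it reflects the taskset restriction):

taskset -c 1-10 nproc
# prints 10; the hang occurs when -t exceeds this value

taskset -c 1-10 grep Cpus_allowed_list /proc/self/status
# Cpus_allowed_list: 1-10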

Name and Version

$ ./llama-cli --version
version: 4126 (d3481e63)
built with cc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-22) for x86_64-redhat-linux

What operating system are you seeing the problem on?

Linux

Relevant log output

Server log in the success case (start-up boilerplate logs truncated)
====================================================================
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv  update_slots: all slots are idle
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 13
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 13, n_tokens = 13, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 13, n_tokens = 13
slot      release: id  0 | task 0 | stop processing: n_past = 140, truncated = 0
slot print_timing: id  0 | task 0 |
prompt eval time =      25.09 ms /    13 tokens (    1.93 ms per token,   518.07 tokens per second)
       eval time =    1270.83 ms /   128 tokens (    9.93 ms per token,   100.72 tokens per second)
      total time =    1295.92 ms /   141 tokens
request: POST /completion 127.0.0.1 200
srv  update_slots: all slots are idle

Server log in the failure case
==============================
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv  update_slots: all slots are idle
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 13
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 13, n_tokens = 13, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 13, n_tokens = 13
(SERVER HANGS HERE AFTER PREFILL IS DONE)