Crash in rpc mode #34

Open
MatheMatrix opened this issue Feb 9, 2025 · 6 comments

MatheMatrix commented Feb 9, 2025

Distro: Rocky Linux 8.4
GPU: 8x NVIDIA GeForce RTX 2080 Ti, Driver Version: 555.42.02, CUDA Version: 12.5

nvidia-smi output
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti     Off |   00000000:08:00.0 Off |                  N/A |
| 30%   35C    P8             23W /  250W |     157MiB /  22528MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 2080 Ti     Off |   00000000:0C:00.0 Off |                  N/A |
| 30%   31C    P8             17W /  250W |     203MiB /  22528MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 2080 Ti     Off |   00000000:10:00.0 Off |                  N/A |
| 30%   32C    P8              7W /  250W |     203MiB /  22528MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 2080 Ti     Off |   00000000:14:00.0 Off |                  N/A |
| 30%   31C    P8              7W /  250W |     203MiB /  22528MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA GeForce RTX 2080 Ti     Off |   00000000:18:00.0 Off |                  N/A |
| 30%   25C    P8              6W /  250W |     203MiB /  22528MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA GeForce RTX 2080 Ti     Off |   00000000:1C:00.0 Off |                  N/A |
| 30%   28C    P8             11W /  250W |     203MiB /  22528MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA GeForce RTX 2080 Ti     Off |   00000000:20:00.0 Off |                  N/A |
| 30%   28C    P8             22W /  250W |     201MiB /  22528MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA GeForce RTX 2080 Ti     Off |   00000000:24:00.0 Off |                  N/A |
| 30%   27C    P8              6W /  250W |     157MiB /  22528MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    469679      C   ...third_party/bin/llama-box/llama-box        154MiB |
|    1   N/A  N/A    449599      C   ...third_party/bin/llama-box/llama-box        200MiB |
|    2   N/A  N/A    449596      C   ...third_party/bin/llama-box/llama-box        200MiB |
|    3   N/A  N/A    449600      C   ...third_party/bin/llama-box/llama-box        200MiB |
|    4   N/A  N/A    449592      C   ...third_party/bin/llama-box/llama-box        200MiB |
|    5   N/A  N/A    449603      C   ...third_party/bin/llama-box/llama-box        200MiB |
|    6   N/A  N/A    449595      C   ...third_party/bin/llama-box/llama-box        198MiB |
|    7   N/A  N/A    449612      C   ...third_party/bin/llama-box/llama-box        154MiB |
+-----------------------------------------------------------------------------------------+

Version:

0.02.559.012 I version    : v0.0.108 (7d23755)
0.02.559.013 I compiler   : cc (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2)
0.02.559.013 I target     : x86_64-redhat-linux
0.02.559.014 I vendor     : llama.cpp 80d0d6b4 (4519), stable-diffusion.cpp 102953d (203)

Issue: I use RPC mode to run DeepSeek-R1-UD-IQ1_S-1.58.gguf across two nodes (each with 8 x RTX 2080 Ti). When I use evalscope to benchmark service performance against llama-box --parallel 2, it always crashes. With llama-box --parallel 1, it's fine...
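
For reference, roughly the same load can be reproduced without evalscope by sending two concurrent streaming requests to the server. The sketch below is only an illustration, not my exact benchmark: the host, port, model alias, and prompt are placeholders taken from the server command line further down.

# Rough stand-in for the evalscope run: two concurrent streaming chat-completion
# requests against the llama-box OpenAI-compatible endpoint. Host, port, model
# alias, and prompt are placeholder assumptions, not the real benchmark config.
import json
import threading
import urllib.request

URL = "http://172.20.10.58:40324/v1/chat/completions"  # adjust to your server

def one_request(i):
    payload = {
        "model": "DeepSeek-R1-UD-IQ1_S-1.58",
        "messages": [{"role": "user", "content": f"request {i}: write a short poem"}],
        "stream": True,
        "temperature": 0.6,
    }
    req = urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for _ in resp:  # drain the SSE stream
            pass
    print(f"request {i} finished")

# Two in-flight requests matches --parallel 2, which is when the crash appears.
threads = [threading.Thread(target=one_request, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()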

These are the kernel messages and the coredump backtrace:

Feb 09 21:37:54 172-20-10-59 kernel: llama-box[463192]: segfault at 10 ip 0000000000d8ca90 sp 00007fcd212de538 error 4 in llama-box[407000+ab9000]
Feb 09 21:37:54 172-20-10-59 kernel: Code: 01 00 00 00 48 39 47 30 77 09 48 8b 57 40 48 39 d0 76 07 44 89 c0 c3 0f 1f 00 48 3b 57 48 41 0f 97 c0 44 89 c0 c3 0f 1f 40 >
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/root/.local/share/pipx/venvs/gpustack/lib/python3.11/site-packages/gpustack/th'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000000d8ca90 in ggml_is_empty ()
[Current thread is 1 (Thread 0x7fcd21300000 (LWP 463192))]
(gdb) bt
#0  0x0000000000d8ca90 in ggml_is_empty ()
#1  0x0000000000ba34bc in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) ()
#2  0x0000000000da1c5d in ggml_backend_graph_compute ()
#3  0x0000000000499a24 in rpcserver::graph_compute(std::vector<unsigned char, std::allocator<unsigned char> > const&, std::vector<unsigned char, std::allocator<unsigned char> >&) ()
#4  0x000000000049b08f in std::thread::_State_impl<std::thread::_Invoker<std::tuple<rpcserver_start(rpcserver_params&)::{lambda()#1}> > >::_M_run() ()
#5  0x0000000000e65d90 in execute_native_thread_routine ()
#6  0x00007fcd2a35a1ca in start_thread (arg=<optimized out>) at pthread_create.c:479
#7  0x00007fcd29fb58d3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

llama-box main process log:

(base) [root@172-20-10-58 ~]# /root/.local/share/pipx/venvs/gpustack/lib/python3.11/site-packages/gpustack/third_party/bin/llama-box/llama-box --host 0.0.0.0 --gpu-layers 62 --parallel 2 --ctx-size 12288 --port 40324 --model /root/DeepSeek-R1-UD-IQ1_S_1.53b/DeepSeek-R1-UD-IQ1_S.gguf --alias DeepSeek-R1-UD-IQ1_S-1.58 --no-mmap --no-warmup --rpc 172.20.10.59:50389,172.20.10.59:50556,172.20.10.59:50195,172.20.10.59:50162,172.20.10.59:50750,172.20.10.59:50883,172.20.10.59:50244  --no-context-shift --no-cache-prompt -n 6144 --metrics
0.00.330.256 I ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
0.00.330.264 I ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
0.00.330.265 I ggml_cuda_init: found 8 CUDA devices:
0.00.337.877 I   Device 0: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
0.00.341.108 I   Device 1: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
0.00.344.309 I   Device 2: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
0.00.348.131 I   Device 3: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
0.00.351.370 I   Device 4: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
0.00.354.593 I   Device 5: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
0.00.357.854 I   Device 6: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
0.00.361.119 I   Device 7: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
0.02.558.960 I
0.02.559.011 I arguments  : /root/.local/share/pipx/venvs/gpustack/lib/python3.11/site-packages/gpustack/third_party/bin/llama-box/llama-box --host 0.0.0.0 --gpu-layers 62 --parallel 2 --ctx-size 12288 --port 40324 --model /root/DeepSeek-R1-UD-IQ1_S_1.53b/DeepSeek-R1-UD-IQ1_S.gguf --alias DeepSeek-R1-UD-IQ1_S-1.58 --no-mmap --no-warmup --rpc 172.20.10.59:50389,172.20.10.59:50556,172.20.10.59:50195,172.20.10.59:50162,172.20.10.59:50750,172.20.10.59:50883,172.20.10.59:50244 --no-context-shift --no-cache-prompt -n 6144 --metrics
0.02.559.012 I version    : v0.0.108 (7d23755)
0.02.559.013 I compiler   : cc (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2)
0.02.559.013 I target     : x86_64-redhat-linux
0.02.559.014 I vendor     : llama.cpp 80d0d6b4 (4519), stable-diffusion.cpp 102953d (203)
0.02.559.029 I system_info: n_threads = 16 (n_threads_batch = 16) / 16 | CUDA : ARCHS = 600,610,700,750,800,860,890,900 | F16 = 1 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 |
0.02.559.029 I
0.02.559.139 I srv                      main: listening, hostname = 0.0.0.0, port = 40324, n_threads = 4 + 2
0.02.560.295 I srv                      main: loading model
0.02.560.301 I srv                load_model: loading model '/root/DeepSeek-R1-UD-IQ1_S_1.53b/DeepSeek-R1-UD-IQ1_S.gguf'
0.02.771.442 I llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 2080 Ti) - 21695 MiB free
0.02.771.469 I llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 2080 Ti) - 21695 MiB free
0.02.771.487 I llama_model_load_from_file_impl: using device CUDA2 (NVIDIA GeForce RTX 2080 Ti) - 21695 MiB free
0.02.771.503 I llama_model_load_from_file_impl: using device CUDA3 (NVIDIA GeForce RTX 2080 Ti) - 21695 MiB free
0.02.771.520 I llama_model_load_from_file_impl: using device CUDA4 (NVIDIA GeForce RTX 2080 Ti) - 21695 MiB free
0.02.771.541 I llama_model_load_from_file_impl: using device CUDA5 (NVIDIA GeForce RTX 2080 Ti) - 21695 MiB free
0.02.771.557 I llama_model_load_from_file_impl: using device CUDA6 (NVIDIA GeForce RTX 2080 Ti) - 21695 MiB free
0.02.771.576 I llama_model_load_from_file_impl: using device CUDA7 (NVIDIA GeForce RTX 2080 Ti) - 21695 MiB free
0.02.772.724 I llama_model_load_from_file_impl: using device RPC[172.20.10.59:50389] (RPC[172.20.10.59:50389]) - 21805 MiB free
0.02.773.414 I llama_model_load_from_file_impl: using device RPC[172.20.10.59:50556] (RPC[172.20.10.59:50556]) - 21805 MiB free
0.02.773.934 I llama_model_load_from_file_impl: using device RPC[172.20.10.59:50195] (RPC[172.20.10.59:50195]) - 21805 MiB free
0.02.774.502 I llama_model_load_from_file_impl: using device RPC[172.20.10.59:50162] (RPC[172.20.10.59:50162]) - 21805 MiB free
0.02.775.174 I llama_model_load_from_file_impl: using device RPC[172.20.10.59:50750] (RPC[172.20.10.59:50750]) - 21805 MiB free
0.02.775.771 I llama_model_load_from_file_impl: using device RPC[172.20.10.59:50883] (RPC[172.20.10.59:50883]) - 21805 MiB free
0.02.776.374 I llama_model_load_from_file_impl: using device RPC[172.20.10.59:50244] (RPC[172.20.10.59:50244]) - 21807 MiB free
0.02.862.160 I llama_model_loader: loaded meta data with 52 key-value pairs and 1025 tensors from /root/DeepSeek-R1-UD-IQ1_S_1.53b/DeepSeek-R1-UD-IQ1_S.gguf (version GGUF V3 (latest))
0.02.862.203 I llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
0.02.862.216 I llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
0.02.862.217 I llama_model_loader: - kv   1:                               general.type str              = model
0.02.862.219 I llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 BF16
0.02.862.220 I llama_model_loader: - kv   3:                       general.quantized_by str              = Unsloth
0.02.862.221 I llama_model_loader: - kv   4:                         general.size_label str              = 256x20B
0.02.862.222 I llama_model_loader: - kv   5:                           general.repo_url str              = https://huggingface.co/unsloth
0.02.862.226 I llama_model_loader: - kv   6:                      deepseek2.block_count u32              = 61
0.02.862.227 I llama_model_loader: - kv   7:                   deepseek2.context_length u32              = 163840
0.02.862.228 I llama_model_loader: - kv   8:                 deepseek2.embedding_length u32              = 7168
0.02.862.229 I llama_model_loader: - kv   9:              deepseek2.feed_forward_length u32              = 18432
0.02.862.230 I llama_model_loader: - kv  10:             deepseek2.attention.head_count u32              = 128
0.02.862.231 I llama_model_loader: - kv  11:          deepseek2.attention.head_count_kv u32              = 128
0.02.862.235 I llama_model_loader: - kv  12:                   deepseek2.rope.freq_base f32              = 10000.000000
0.02.862.237 I llama_model_loader: - kv  13: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
0.02.862.238 I llama_model_loader: - kv  14:                deepseek2.expert_used_count u32              = 8
0.02.862.239 I llama_model_loader: - kv  15:        deepseek2.leading_dense_block_count u32              = 3
0.02.862.240 I llama_model_loader: - kv  16:                       deepseek2.vocab_size u32              = 129280
0.02.862.241 I llama_model_loader: - kv  17:            deepseek2.attention.q_lora_rank u32              = 1536
0.02.862.242 I llama_model_loader: - kv  18:           deepseek2.attention.kv_lora_rank u32              = 512
0.02.862.243 I llama_model_loader: - kv  19:             deepseek2.attention.key_length u32              = 192
0.02.862.253 I llama_model_loader: - kv  20:           deepseek2.attention.value_length u32              = 128
0.02.862.254 I llama_model_loader: - kv  21:       deepseek2.expert_feed_forward_length u32              = 2048
0.02.862.255 I llama_model_loader: - kv  22:                     deepseek2.expert_count u32              = 256
0.02.862.256 I llama_model_loader: - kv  23:              deepseek2.expert_shared_count u32              = 1
0.02.862.257 I llama_model_loader: - kv  24:             deepseek2.expert_weights_scale f32              = 2.500000
0.02.862.258 I llama_model_loader: - kv  25:              deepseek2.expert_weights_norm bool             = true
0.02.862.259 I llama_model_loader: - kv  26:               deepseek2.expert_gating_func u32              = 2
0.02.862.260 I llama_model_loader: - kv  27:             deepseek2.rope.dimension_count u32              = 64
0.02.862.261 I llama_model_loader: - kv  28:                deepseek2.rope.scaling.type str              = yarn
0.02.862.262 I llama_model_loader: - kv  29:              deepseek2.rope.scaling.factor f32              = 40.000000
0.02.862.277 I llama_model_loader: - kv  30: deepseek2.rope.scaling.original_context_length u32              = 4096
0.02.862.279 I llama_model_loader: - kv  31: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.100000
0.02.862.287 I llama_model_loader: - kv  32:                       tokenizer.ggml.model str              = gpt2
0.02.862.289 I llama_model_loader: - kv  33:                         tokenizer.ggml.pre str              = deepseek-v3
0.02.917.634 I llama_model_loader: - kv  34:                      tokenizer.ggml.tokens arr[str,129280]  = ["<|begin▁of▁sentence>", "<...
0.02.937.899 I llama_model_loader: - kv  35:                  tokenizer.ggml.token_type arr[i32,129280]  = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
0.02.990.803 I llama_model_loader: - kv  36:                      tokenizer.ggml.merges arr[str,127741]  = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
0.02.990.814 I llama_model_loader: - kv  37:                tokenizer.ggml.bos_token_id u32              = 0
0.02.990.815 I llama_model_loader: - kv  38:                tokenizer.ggml.eos_token_id u32              = 1
0.02.990.817 I llama_model_loader: - kv  39:            tokenizer.ggml.padding_token_id u32              = 128815
0.02.990.818 I llama_model_loader: - kv  40:               tokenizer.ggml.add_bos_token bool             = true
0.02.990.820 I llama_model_loader: - kv  41:               tokenizer.ggml.add_eos_token bool             = false
0.02.990.823 I llama_model_loader: - kv  42:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
0.02.990.833 I llama_model_loader: - kv  43:               general.quantization_version u32              = 2
0.02.990.835 I llama_model_loader: - kv  44:                          general.file_type u32              = 24
0.02.990.837 I llama_model_loader: - kv  45:                      quantize.imatrix.file str              = DeepSeek-R1.imatrix
0.02.990.839 I llama_model_loader: - kv  46:                   quantize.imatrix.dataset str              = /training_data/calibration_datav3.txt
0.02.990.840 I llama_model_loader: - kv  47:             quantize.imatrix.entries_count i32              = 720
0.02.990.841 I llama_model_loader: - kv  48:              quantize.imatrix.chunks_count i32              = 124
0.02.990.843 I llama_model_loader: - kv  49:                                   split.no u16              = 0
0.02.990.844 I llama_model_loader: - kv  50:                        split.tensors.count i32              = 1025
0.02.990.845 I llama_model_loader: - kv  51:                                split.count u16              = 0
0.02.990.845 I llama_model_loader: - type  f32:  361 tensors
0.02.990.846 I llama_model_loader: - type q4_K:  190 tensors
0.02.990.847 I llama_model_loader: - type q5_K:  116 tensors
0.02.990.847 I llama_model_loader: - type q6_K:  184 tensors
0.02.990.848 I llama_model_loader: - type iq2_xxs:    6 tensors
0.02.990.849 I llama_model_loader: - type iq1_s:  168 tensors
0.02.990.851 I print_info: file format = GGUF V3 (latest)
0.02.990.852 I print_info: file type   = IQ1_S - 1.5625 bpw
0.02.990.856 I print_info: file size   = 130.60 GiB (1.67 BPW)
0.03.199.480 W load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
0.03.199.637 I load: special tokens cache size = 819
0.03.441.353 I load: token to piece cache size = 0.8223 MB
0.03.441.374 I print_info: arch             = deepseek2
0.03.441.376 I print_info: vocab_only       = 0
0.03.441.377 I print_info: n_ctx_train      = 163840
0.03.441.380 I print_info: n_embd           = 7168
0.03.441.381 I print_info: n_layer          = 61
0.03.441.393 I print_info: n_head           = 128
0.03.441.395 I print_info: n_head_kv        = 128
0.03.441.396 I print_info: n_rot            = 64
0.03.441.397 I print_info: n_swa            = 0
0.03.441.397 I print_info: n_embd_head_k    = 192
0.03.441.398 I print_info: n_embd_head_v    = 128
0.03.441.400 I print_info: n_gqa            = 1
0.03.441.402 I print_info: n_embd_k_gqa     = 24576
0.03.441.403 I print_info: n_embd_v_gqa     = 16384
0.03.441.405 I print_info: f_norm_eps       = 0.0e+00
0.03.441.406 I print_info: f_norm_rms_eps   = 1.0e-06
0.03.441.407 I print_info: f_clamp_kqv      = 0.0e+00
0.03.441.407 I print_info: f_max_alibi_bias = 0.0e+00
0.03.441.408 I print_info: f_logit_scale    = 0.0e+00
0.03.441.410 I print_info: n_ff             = 18432
0.03.441.411 I print_info: n_expert         = 256
0.03.441.412 I print_info: n_expert_used    = 8
0.03.441.412 I print_info: causal attn      = 1
0.03.441.414 I print_info: pooling type     = 0
0.03.441.415 I print_info: rope type        = 0
0.03.441.416 I print_info: rope scaling     = yarn
0.03.441.418 I print_info: freq_base_train  = 10000.0
0.03.441.419 I print_info: freq_scale_train = 0.025
0.03.441.420 I print_info: n_ctx_orig_yarn  = 4096
0.03.441.422 I print_info: rope_finetuned   = unknown
0.03.441.422 I print_info: ssm_d_conv       = 0
0.03.441.422 I print_info: ssm_d_inner      = 0
0.03.441.423 I print_info: ssm_d_state      = 0
0.03.441.423 I print_info: ssm_dt_rank      = 0
0.03.441.424 I print_info: ssm_dt_b_c_rms   = 0
0.03.441.424 I print_info: model type       = 671B
0.03.441.426 I print_info: model params     = 671.03 B
0.03.441.427 I print_info: general.name     = DeepSeek R1 BF16
0.03.441.427 I print_info: n_layer_dense_lead   = 3
0.03.441.428 I print_info: n_lora_q             = 1536
0.03.441.429 I print_info: n_lora_kv            = 512
0.03.441.429 I print_info: n_ff_exp             = 2048
0.03.441.430 I print_info: n_expert_shared      = 1
0.03.441.430 I print_info: expert_weights_scale = 2.5
0.03.441.431 I print_info: expert_weights_norm  = 1
0.03.441.431 I print_info: expert_gating_func   = sigmoid
0.03.441.432 I print_info: rope_yarn_log_mul    = 0.1000
0.03.441.434 I print_info: vocab type       = BPE
0.03.441.434 I print_info: n_vocab          = 129280
0.03.441.435 I print_info: n_merges         = 127741
0.03.441.436 I print_info: BOS token        = 0 '<|begin▁of▁sentence|>'
0.03.441.437 I print_info: EOS token        = 1 '<|end▁of▁sentence|>'
0.03.441.437 I print_info: EOT token        = 1 '<|end▁of▁sentence|>'
0.03.441.438 I print_info: PAD token        = 128815 '<|PAD▁TOKEN|>'
0.03.441.439 I print_info: LF token         = 131 'Ä'
0.03.441.440 I print_info: FIM PRE token    = 128801 '<|fim▁begin|>'
0.03.441.440 I print_info: FIM SUF token    = 128800 '<|fim▁hole|>'
0.03.441.441 I print_info: FIM MID token    = 128802 '<|fim▁end|>'
0.03.441.442 I print_info: EOG token        = 1 '<|end▁of▁sentence|>'
0.03.441.442 I print_info: max token length = 256
0.03.520.383 I load_tensors: offloading 61 repeating layers to GPU
0.03.520.393 I load_tensors: offloading output layer to GPU
0.03.520.393 I load_tensors: offloaded 62/62 layers to GPU
0.03.520.399 I load_tensors:          CPU model buffer size =   497.11 MiB
0.03.520.402 I load_tensors:        CUDA0 model buffer size =  5986.67 MiB
0.03.520.405 I load_tensors:        CUDA1 model buffer size =  9869.24 MiB
0.03.520.407 I load_tensors:        CUDA2 model buffer size =  8973.24 MiB
0.03.520.408 I load_tensors:        CUDA3 model buffer size =  8973.24 MiB
0.03.520.410 I load_tensors:        CUDA4 model buffer size =  8973.24 MiB
0.03.520.411 I load_tensors:        CUDA5 model buffer size =  8973.24 MiB
0.03.520.412 I load_tensors:        CUDA6 model buffer size =  8973.24 MiB
0.03.520.413 I load_tensors:        CUDA7 model buffer size =  8973.24 MiB
0.03.520.415 I load_tensors: RPC[172.20.10.59:50389] model buffer size = 11216.55 MiB
0.03.520.416 I load_tensors: RPC[172.20.10.59:50556] model buffer size =  8973.24 MiB
0.03.520.417 I load_tensors: RPC[172.20.10.59:50195] model buffer size =  8973.24 MiB
9.03.583.292 I llama_init_from_model: RPC[172.20.10.59:50195] compute buffer size =  2218.00 MiB
9.03.583.293 I llama_init_from_model: RPC[172.20.10.59:50162] compute buffer size =  2218.00 MiB
9.03.583.294 I llama_init_from_model: RPC[172.20.10.59:50750] compute buffer size =  2218.00 MiB
9.03.583.295 I llama_init_from_model: RPC[172.20.10.59:50883] compute buffer size =  2218.00 MiB
9.03.583.296 I llama_init_from_model: RPC[172.20.10.59:50244] compute buffer size =  2218.00 MiB
9.03.583.297 I llama_init_from_model:  CUDA_Host compute buffer size =    30.01 MiB
9.03.583.299 I llama_init_from_model: graph nodes  = 5025
9.03.583.299 I llama_init_from_model: graph splits = 16
9.03.583.303 I common_init_from_params: setting dry_penalty_last_n to ctx_size = 8192
9.03.584.053 I srv                load_model: prompt caching disabled
9.03.660.275 I srv                load_model: chat template, built_in: true, alias: deepseek3, tool call: unsupported, example:
You are a helpful assistant.

<|User|>Hello.<|Assistant|>Hi there.<|end▁of▁sentence|><|User|>What's the weather like in Paris today?<|Assistant|>
9.03.660.282 I srv                      main: initializing server
9.03.660.284 I srv                      init: initializing slots, n_slots = 4
9.03.660.477 I srv                      main: starting server
10.44.233.238 I srv        log_server_request: rid 87452521950 | POST /v1/chat/completions 172.20.10.59:54880
10.44.233.420 I srv oaicompat_completions_req: rid 87452521950 | {"messages":"[...]","model":"DeepSeek-R1-UD-IQ1_S","stream":true,"temperature":0.6}
10.54.137.487 I srv        log_server_request: rid 87462426199 | POST /v1/chat/completions 172.20.10.59:54738
10.54.137.572 I srv oaicompat_completions_req: rid 87462426199 | {"messages":"[...]","model":"DeepSeek-R1-UD-IQ1_S","stream":true,"temperature":0.6}
15.54.707.063 I srv oaicompat_completions_res: rid 87452521950 | prompt_tokens: 44, completion_tokens: 1572, draft_tokens: 0, ttft: 2685.93ms, tpot: 195.79ms, tps: 5.11, dta: 0.00%
15.54.707.156 I srv       log_server_response: rid 87452521950 | POST /v1/chat/completions 172.20.10.59:54880 | status 200 | cost 310.47s
18.11.247.158 I srv oaicompat_completions_res: rid 87462426199 | prompt_tokens: 44, completion_tokens: 2445, draft_tokens: 0, ttft: 1224.72ms, tpot: 178.24ms, tps: 5.61, dta: 0.00%
18.11.247.230 I srv       log_server_response: rid 87462426199 | POST /v1/chat/completions 172.20.10.59:54738 | status 200 | cost 437.11s
19.10.193.809 I srv        log_server_request: rid 87958482520 | POST /v1/chat/completions 172.25.17.201:38750
19.10.193.911 I srv oaicompat_completions_req: rid 87958482520 | {"messages":"[...]","model":"DeepSeek-R1-UD-IQ1_S"}
19.12.254.839 I srv oaicompat_completions_res: rid 87958482520 | prompt_tokens: 4, completion_tokens: 16, draft_tokens: 0, ttft: 523.08ms, tpot: 95.98ms, tps: 10.42, dta: 0.00%
19.12.254.963 I srv       log_server_response: rid 87958482520 | POST /v1/chat/completions 172.25.17.201:38750 | status 200 | cost 2.06s
19.14.902.932 I srv        log_server_request: rid 87963191644 | POST /v1/chat/completions 172.25.17.201:38766
19.14.903.081 I srv oaicompat_completions_req: rid 87963191644 | {"max_tokens":2048,"messages":"[...]","model":"DeepSeek-R1-UD-IQ1_S","seed":42,"stream":true,"stream_options":{"include_usage":true},"temperature":0.6}
19.14.903.203 I srv        log_server_request: rid 87963191916 | POST /v1/chat/completions 172.25.17.201:38780
19.14.903.294 I srv oaicompat_completions_req: rid 87963191916 | {"max_tokens":2048,"messages":"[...]","model":"DeepSeek-R1-UD-IQ1_S","seed":42,"stream":true,"stream_options":{"include_usage":true},"temperature":0.6}
20.20.154.688 I srv oaicompat_completions_res: rid 87963191916 | prompt_tokens: 26, completion_tokens: 354, draft_tokens: 0, ttft: 1097.28ms, tpot: 181.22ms, tps: 5.52, dta: 0.00%
20.20.154.760 I srv       log_server_response: rid 87963191916 | POST /v1/chat/completions 172.25.17.201:38780 | status 200 | cost 65.25s
20.20.200.360 I srv        log_server_request: rid 88028489072 | POST /v1/chat/completions 172.25.17.201:51324
20.20.200.429 I srv oaicompat_completions_req: rid 88028489072 | {"max_tokens":2048,"messages":"[...]","model":"DeepSeek-R1-UD-IQ1_S","seed":42,"stream":true,"stream_options":{"include_usage":true},"temperature":0.6}
/home/runner/work/llama-box/llama-box/llama.cpp/ggml/src/ggml-rpc/ggml-rpc.cpp:682: GGML_ASSERT(status) failed
No symbol table is loaded.  Use the "file" command.
[New LWP 563152]
[New LWP 563153]
...
[New LWP 566060]
[New LWP 566061]
[New LWP 566062]
[New LWP 566063]
Missing separate debuginfo for /lib64/libcuda.so.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/06/c063009eff89df487eea6c9c93acbfbd36d28d.debug
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007fefdff306a2 in waitpid () from /lib64/libpthread.so.0
No symbol "frame" in current context.
[Inferior 1 (process 563151) detached]
Aborted (core dumped)
(base) [root@172-20-10-58 ~]#
thxCode (Collaborator) commented Feb 10, 2025

link gpustack/gpustack#1137

thxCode (Collaborator) commented Feb 11, 2025

please try with --no-cache-prompt.

MatheMatrix (Author) commented Feb 11, 2025

> please try with --no-cache-prompt.

😢 I have tried, but it still crashes. My arguments:

0.02.559.011 I arguments : /root/.local/share/pipx/venvs/gpustack/lib/python3.11/site-packages/gpustack/third_party/bin/llama-box/llama-box --host 0.0.0.0 --gpu-layers 62 --parallel 2 --ctx-size 12288 --port 40324 --model /root/DeepSeek-R1-UD-IQ1_S_1.53b/DeepSeek-R1-UD-IQ1_S.gguf --alias DeepSeek-R1-UD-IQ1_S-1.58 --no-mmap --no-warmup --rpc 172.20.10.59:50389,172.20.10.59:50556,172.20.10.59:50195,172.20.10.59:50162,172.20.10.59:50750,172.20.10.59:50883,172.20.10.59:50244 --no-context-shift --no-cache-prompt -n 6144 --metrics

thxCode (Collaborator) commented Feb 11, 2025

> 😢 I have tried, but it still crashes. My arguments: [...]

Can you try with the latest version? We should track this against the main branch, since things change quickly in AI development.

MatheMatrix (Author) commented:

Seems better than the previous version 👍, but now the local llama-box crashes 😢

Core was generated by `/root/.local/share/pipx/venvs/gpustack/lib/python3.11/site-packages/gpustack/th'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007faf9133152f in raise () from /lib64/libc.so.6
[Current thread is 1 (Thread 0x7fafb87a7000 (LWP 479365))]
Missing separate debuginfos, use: yum debuginfo-install glibc-2.28-251.el8_10.2.x86_64 libgcc-8.5.0-22.el8_10.x86_64
(gdb) bt
#0  0x00007faf9133152f in raise () from /lib64/libc.so.6
#1  0x00007faf91304e65 in abort () from /lib64/libc.so.6
#2  0x0000000000dc6f53 in ggml_abort ()
#3  0x0000000000dc50f4 in ggml_backend_rpc_graph_compute(ggml_backend*, ggml_cgraph*) ()
#4  0x0000000000dddac4 in ggml_backend_sched_compute_splits(ggml_backend_sched*) ()
#5  0x00000000006176a0 in llama_graph_compute(llama_context&, ggml_cgraph*, int, ggml_threadpool*) ()
#6  0x000000000061c0eb in llama_kv_cache_update_impl(llama_context&) ()
#7  0x000000000061d0a8 in llama_decode_impl(llama_context&, llama_batch) ()
#8  0x000000000061df17 in llama_decode ()
#9  0x0000000000547f76 in server_context::update_slots() ()
#10 0x000000000050356d in server_task_queue::start_loop() ()
#11 0x0000000000439d6c in main ()

Command and version:

0.02.600.960 I arguments  : /root/.local/share/pipx/venvs/gpustack/lib/python3.11/site-packages/gpustack/third_party/bin/llama-box/llama-box --host 0.0.0.0 --gpu-layers 62 --parallel 4 --ctx-size 11264 --port 40324 --model /root/DeepSeek-R1-UD-IQ1_S_1.53b/DeepSeek-R1-UD-IQ1_S.gguf --alias DeepSeek-R1-UD-IQ1_S-1.58 --rpc 172.20.10.59:50903,172.20.10.59:51017,172.20.10.59:50003,172.20.10.59:50628,172.20.10.59:50204,172.20.10.59:50817,172.20.10.59:50098 --no-context-shift -n 4096 --metrics --warmup --no-cache-prompt
0.02.600.962 I version    : v0.0.115 (268f80a)
0.02.600.963 I compiler   : cc (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2)
0.02.600.963 I target     : x86_64-redhat-linux
0.02.600.964 I vendor     : llama.cpp 27d135c9 (4598), stable-diffusion.cpp 102953d (203)
0.02.600.992 I system_info: n_threads = 16 (n_threads_batch = 16) / 16 | CUDA : ARCHS = 600,610,700,750,800,860,890,900 | F16 = 1 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 |
0.02.600.993 I

thxCode (Collaborator) commented Feb 14, 2025

> Seems better than the previous version 👍, but now the local llama-box crashes 😢 [...]

Can you upload the full log and the reproduction steps here? We have closed a similar issue, gpustack/gpustack#1137; can you verify again with gpustack v0.5.1? Remember, since v0.0.117 introduces a new RPC command, you should also upgrade all your agents to v0.5.1.

thxCode self-assigned this Feb 28, 2025