Crash in rpc mode #34

Open
MatheMatrix opened this issue Feb 9, 2025 · 6 comments

MatheMatrix commented Feb 9, 2025

Distro: Rocky Linux 8.4
GPU: 8x NVIDIA GeForce RTX 2080 Ti, Driver Version: 555.42.02, CUDA Version: 12.5

nvidia-smi output
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti     Off |   00000000:08:00.0 Off |                  N/A |
| 30%   35C    P8             23W /  250W |     157MiB /  22528MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA GeForce RTX 2080 Ti     Off |   00000000:0C:00.0 Off |                  N/A |
| 30%   31C    P8             17W /  250W |     203MiB /  22528MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA GeForce RTX 2080 Ti     Off |   00000000:10:00.0 Off |                  N/A |
| 30%   32C    P8              7W /  250W |     203MiB /  22528MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA GeForce RTX 2080 Ti     Off |   00000000:14:00.0 Off |                  N/A |
| 30%   31C    P8              7W /  250W |     203MiB /  22528MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA GeForce RTX 2080 Ti     Off |   00000000:18:00.0 Off |                  N/A |
| 30%   25C    P8              6W /  250W |     203MiB /  22528MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA GeForce RTX 2080 Ti     Off |   00000000:1C:00.0 Off |                  N/A |
| 30%   28C    P8             11W /  250W |     203MiB /  22528MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA GeForce RTX 2080 Ti     Off |   00000000:20:00.0 Off |                  N/A |
| 30%   28C    P8             22W /  250W |     201MiB /  22528MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA GeForce RTX 2080 Ti     Off |   00000000:24:00.0 Off |                  N/A |
| 30%   27C    P8              6W /  250W |     157MiB /  22528MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A    469679      C   ...third_party/bin/llama-box/llama-box        154MiB |
|    1   N/A  N/A    449599      C   ...third_party/bin/llama-box/llama-box        200MiB |
|    2   N/A  N/A    449596      C   ...third_party/bin/llama-box/llama-box        200MiB |
|    3   N/A  N/A    449600      C   ...third_party/bin/llama-box/llama-box        200MiB |
|    4   N/A  N/A    449592      C   ...third_party/bin/llama-box/llama-box        200MiB |
|    5   N/A  N/A    449603      C   ...third_party/bin/llama-box/llama-box        200MiB |
|    6   N/A  N/A    449595      C   ...third_party/bin/llama-box/llama-box        198MiB |
|    7   N/A  N/A    449612      C   ...third_party/bin/llama-box/llama-box        154MiB |
+-----------------------------------------------------------------------------------------+

Version:

0.02.559.012 I version    : v0.0.108 (7d23755)
0.02.559.013 I compiler   : cc (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2)
0.02.559.013 I target     : x86_64-redhat-linux
0.02.559.014 I vendor     : llama.cpp 80d0d6b4 (4519), stable-diffusion.cpp 102953d (203)

Issue: I use RPC mode to run DeepSeek-R1-UD-IQ1_S-1.58.gguf across two nodes (each with 8 x RTX 2080 Ti). When I use evalscope to benchmark service performance against llama-box --parallel 2, it always crashes. With llama-box --parallel 1, it's fine...
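
For reference, roughly the same load can be reproduced without evalscope by sending two concurrent streaming requests to the server. The sketch below is only an illustration, not my exact benchmark: the host, port, model alias, and prompt are placeholders taken from the server command line further down.

# Rough stand-in for the evalscope run: two concurrent streaming chat-completion
# requests against the llama-box OpenAI-compatible endpoint. Host, port, model
# alias, and prompt are placeholder assumptions, not the real benchmark config.
import json
import threading
import urllib.request

URL = "http://172.20.10.58:40324/v1/chat/completions"  # adjust to your server

def one_request(i):
    payload = {
        "model": "DeepSeek-R1-UD-IQ1_S-1.58",
        "messages": [{"role": "user", "content": f"request {i}: write a short poem"}],
        "stream": True,
        "temperature": 0.6,
    }
    req = urllib.request.Request(
        URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        for _ in resp:  # drain the SSE stream
            pass
    print(f"request {i} finished")

# Two in-flight requests matches --parallel 2, which is when the crash appears.
threads = [threading.Thread(target=one_request, args=(i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()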

These are the kernel messages and the coredump backtrace:

Feb 09 21:37:54 172-20-10-59 kernel: llama-box[463192]: segfault at 10 ip 0000000000d8ca90 sp 00007fcd212de538 error 4 in llama-box[407000+ab9000]
Feb 09 21:37:54 172-20-10-59 kernel: Code: 01 00 00 00 48 39 47 30 77 09 48 8b 57 40 48 39 d0 76 07 44 89 c0 c3 0f 1f 00 48 3b 57 48 41 0f 97 c0 44 89 c0 c3 0f 1f 40 >
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/root/.local/share/pipx/venvs/gpustack/lib/python3.11/site-packages/gpustack/th'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x0000000000d8ca90 in ggml_is_empty ()
[Current thread is 1 (Thread 0x7fcd21300000 (LWP 463192))]
(gdb) bt
#0  0x0000000000d8ca90 in ggml_is_empty ()
#1  0x0000000000ba34bc in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*) ()
#2  0x0000000000da1c5d in ggml_backend_graph_compute ()
#3  0x0000000000499a24 in rpcserver::graph_compute(std::vector<unsigned char, std::allocator<unsigned char> > const&, std::vector<unsigned char, std::allocator<unsigned char> >&) ()
#4  0x000000000049b08f in std::thread::_State_impl<std::thread::_Invoker<std::tuple<rpcserver_start(rpcserver_params&)::{lambda()#1}> > >::_M_run() ()
#5  0x0000000000e65d90 in execute_native_thread_routine ()
#6  0x00007fcd2a35a1ca in start_thread (arg=<optimized out>) at pthread_create.c:479
#7  0x00007fcd29fb58d3 in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

llama-box main process log:

(base) [root@172-20-10-58 ~]# /root/.local/share/pipx/venvs/gpustack/lib/python3.11/site-packages/gpustack/third_party/bin/llama-box/llama-box --host 0.0.0.0 --gpu-layers 62 --parallel 2 --ctx-size 12288 --port 40324 --model /root/DeepSeek-R1-UD-IQ1_S_1.53b/DeepSeek-R1-UD-IQ1_S.gguf --alias DeepSeek-R1-UD-IQ1_S-1.58 --no-mmap --no-warmup --rpc 172.20.10.59:50389,172.20.10.59:50556,172.20.10.59:50195,172.20.10.59:50162,172.20.10.59:50750,172.20.10.59:50883,172.20.10.59:50244  --no-context-shift --no-cache-prompt -n 6144 --metrics
0.00.330.256 I ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
0.00.330.264 I ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
0.00.330.265 I ggml_cuda_init: found 8 CUDA devices:
0.00.337.877 I   Device 0: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
0.00.341.108 I   Device 1: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
0.00.344.309 I   Device 2: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
0.00.348.131 I   Device 3: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
0.00.351.370 I   Device 4: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
0.00.354.593 I   Device 5: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
0.00.357.854 I   Device 6: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
0.00.361.119 I   Device 7: NVIDIA GeForce RTX 2080 Ti, compute capability 7.5, VMM: yes
0.02.558.960 I
0.02.559.011 I arguments  : /root/.local/share/pipx/venvs/gpustack/lib/python3.11/site-packages/gpustack/third_party/bin/llama-box/llama-box --host 0.0.0.0 --gpu-layers 62 --parallel 2 --ctx-size 12288 --port 40324 --model /root/DeepSeek-R1-UD-IQ1_S_1.53b/DeepSeek-R1-UD-IQ1_S.gguf --alias DeepSeek-R1-UD-IQ1_S-1.58 --no-mmap --no-warmup --rpc 172.20.10.59:50389,172.20.10.59:50556,172.20.10.59:50195,172.20.10.59:50162,172.20.10.59:50750,172.20.10.59:50883,172.20.10.59:50244 --no-context-shift --no-cache-prompt -n 6144 --metrics
0.02.559.012 I version    : v0.0.108 (7d23755)
0.02.559.013 I compiler   : cc (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2)
0.02.559.013 I target     : x86_64-redhat-linux
0.02.559.014 I vendor     : llama.cpp 80d0d6b4 (4519), stable-diffusion.cpp 102953d (203)
0.02.559.029 I system_info: n_threads = 16 (n_threads_batch = 16) / 16 | CUDA : ARCHS = 600,610,700,750,800,860,890,900 | F16 = 1 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 |
0.02.559.029 I
0.02.559.139 I srv                      main: listening, hostname = 0.0.0.0, port = 40324, n_threads = 4 + 2
0.02.560.295 I srv                      main: loading model
0.02.560.301 I srv                load_model: loading model '/root/DeepSeek-R1-UD-IQ1_S_1.53b/DeepSeek-R1-UD-IQ1_S.gguf'
0.02.771.442 I llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 2080 Ti) - 21695 MiB free
0.02.771.469 I llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 2080 Ti) - 21695 MiB free
0.02.771.487 I llama_model_load_from_file_impl: using device CUDA2 (NVIDIA GeForce RTX 2080 Ti) - 21695 MiB free
0.02.771.503 I llama_model_load_from_file_impl: using device CUDA3 (NVIDIA GeForce RTX 2080 Ti) - 21695 MiB free
0.02.771.520 I llama_model_load_from_file_impl: using device CUDA4 (NVIDIA GeForce RTX 2080 Ti) - 21695 MiB free
0.02.771.541 I llama_model_load_from_file_impl: using device CUDA5 (NVIDIA GeForce RTX 2080 Ti) - 21695 MiB free
0.02.771.557 I llama_model_load_from_file_impl: using device CUDA6 (NVIDIA GeForce RTX 2080 Ti) - 21695 MiB free
0.02.771.576 I llama_model_load_from_file_impl: using device CUDA7 (NVIDIA GeForce RTX 2080 Ti) - 21695 MiB free
0.02.772.724 I llama_model_load_from_file_impl: using device RPC[172.20.10.59:50389] (RPC[172.20.10.59:50389]) - 21805 MiB free
0.02.773.414 I llama_model_load_from_file_impl: using device RPC[172.20.10.59:50556] (RPC[172.20.10.59:50556]) - 21805 MiB free
0.02.773.934 I llama_model_load_from_file_impl: using device RPC[172.20.10.59:50195] (RPC[172.20.10.59:50195]) - 21805 MiB free
0.02.774.502 I llama_model_load_from_file_impl: using device RPC[172.20.10.59:50162] (RPC[172.20.10.59:50162]) - 21805 MiB free
0.02.775.174 I llama_model_load_from_file_impl: using device RPC[172.20.10.59:50750] (RPC[172.20.10.59:50750]) - 21805 MiB free
0.02.775.771 I llama_model_load_from_file_impl: using device RPC[172.20.10.59:50883] (RPC[172.20.10.59:50883]) - 21805 MiB free
0.02.776.374 I llama_model_load_from_file_impl: using device RPC[172.20.10.59:50244] (RPC[172.20.10.59:50244]) - 21807 MiB free
0.02.862.160 I llama_model_loader: loaded meta data with 52 key-value pairs and 1025 tensors from /root/DeepSeek-R1-UD-IQ1_S_1.53b/DeepSeek-R1-UD-IQ1_S.gguf (version GGUF V3 (latest))
0.02.862.203 I llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
0.02.862.216 I llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
0.02.862.217 I llama_model_loader: - kv   1:                               general.type str              = model
0.02.862.219 I llama_model_loader: - kv   2:                               general.name str              = DeepSeek R1 BF16
0.02.862.220 I llama_model_loader: - kv   3:                       general.quantized_by str              = Unsloth
0.02.862.221 I llama_model_loader: - kv   4:                         general.size_label str              = 256x20B
0.02.862.222 I llama_model_loader: - kv   5:                           general.repo_url str              = https://huggingface.co/unsloth
0.02.862.226 I llama_model_loader: - kv   6:                      deepseek2.block_count u32              = 61
0.02.862.227 I llama_model_loader: - kv   7:                   deepseek2.context_length u32              = 163840
0.02.862.228 I llama_model_loader: - kv   8:                 deepseek2.embedding_length u32              = 7168
0.02.862.229 I llama_model_loader: - kv   9:              deepseek2.feed_forward_length u32              = 18432
0.02.862.230 I llama_model_loader: - kv  10:             deepseek2.attention.head_count u32              = 128
0.02.862.231 I llama_model_loader: - kv  11:          deepseek2.attention.head_count_kv u32              = 128
0.02.862.235 I llama_model_loader: - kv  12:                   deepseek2.rope.freq_base f32              = 10000.000000
0.02.862.237 I llama_model_loader: - kv  13: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
0.02.862.238 I llama_model_loader: - kv  14:                deepseek2.expert_used_count u32              = 8
0.02.862.239 I llama_model_loader: - kv  15:        deepseek2.leading_dense_block_count u32              = 3
0.02.862.240 I llama_model_loader: - kv  16:                       deepseek2.vocab_size u32              = 129280
0.02.862.241 I llama_model_loader: - kv  17:            deepseek2.attention.q_lora_rank u32              = 1536
0.02.862.242 I llama_model_loader: - kv  18:           deepseek2.attention.kv_lora_rank u32              = 512
0.02.862.243 I llama_model_loader: - kv  19:             deepseek2.attention.key_length u32              = 192
0.02.862.253 I llama_model_loader: - kv  20:           deepseek2.attention.value_length u32              = 128
0.02.862.254 I llama_model_loader: - kv  21:       deepseek2.expert_feed_forward_length u32              = 2048
0.02.862.255 I llama_model_loader: - kv  22:                     deepseek2.expert_count u32              = 256
0.02.862.256 I llama_model_loader: - kv  23:              deepseek2.expert_shared_count u32              = 1
0.02.862.257 I llama_model_loader: - kv  24:             deepseek2.expert_weights_scale f32              = 2.500000
0.02.862.258 I llama_model_loader: - kv  25:              deepseek2.expert_weights_norm bool             = true
0.02.862.259 I llama_model_loader: - kv  26:               deepseek2.expert_gating_func u32              = 2
0.02.862.260 I llama_model_loader: - kv  27:             deepseek2.rope.dimension_count u32              = 64
0.02.862.261 I llama_model_loader: - kv  28:                deepseek2.rope.scaling.type str              = yarn
0.02.862.262 I llama_model_loader: - kv  29:              deepseek2.rope.scaling.factor f32              = 40.000000
0.02.862.277 I llama_model_loader: - kv  30: deepseek2.rope.scaling.original_context_length u32              = 4096
0.02.862.279 I llama_model_loader: - kv  31: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.100000
0.02.862.287 I llama_model_loader: - kv  32:                       tokenizer.ggml.model str              = gpt2
0.02.862.289 I llama_model_loader: - kv  33:                         tokenizer.ggml.pre str              = deepseek-v3
0.02.917.634 I llama_model_loader: - kv  34:                      tokenizer.ggml.tokens arr[str,129280]  = ["<|begin▁of▁sentence>", "<...
0.02.937.899 I llama_model_loader: - kv  35:                  tokenizer.ggml.token_type arr[i32,129280]  = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
0.02.990.803 I llama_model_loader: - kv  36:                      tokenizer.ggml.merges arr[str,127741]  = ["Ġ t", "Ġ a", "i n", "Ġ Ġ", "h e...
0.02.990.814 I llama_model_loader: - kv  37:                tokenizer.ggml.bos_token_id u32              = 0
0.02.990.815 I llama_model_loader: - kv  38:                tokenizer.ggml.eos_token_id u32              = 1
0.02.990.817 I llama_model_loader: - kv  39:            tokenizer.ggml.padding_token_id u32              = 128815
0.02.990.818 I llama_model_loader: - kv  40:               tokenizer.ggml.add_bos_token bool             = true
0.02.990.820 I llama_model_loader: - kv  41:               tokenizer.ggml.add_eos_token bool             = false
0.02.990.823 I llama_model_loader: - kv  42:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
0.02.990.833 I llama_model_loader: - kv  43:               general.quantization_version u32              = 2
0.02.990.835 I llama_model_loader: - kv  44:                          general.file_type u32              = 24
0.02.990.837 I llama_model_loader: - kv  45:                      quantize.imatrix.file str              = DeepSeek-R1.imatrix
0.02.990.839 I llama_model_loader: - kv  46:                   quantize.imatrix.dataset str              = /training_data/calibration_datav3.txt
0.02.990.840 I llama_model_loader: - kv  47:             quantize.imatrix.entries_count i32              = 720
0.02.990.841 I llama_model_loader: - kv  48:              quantize.imatrix.chunks_count i32              = 124
0.02.990.843 I llama_model_loader: - kv  49:                                   split.no u16              = 0
0.02.990.844 I llama_model_loader: - kv  50:                        split.tensors.count i32              = 1025
0.02.990.845 I llama_model_loader: - kv  51:                                split.count u16              = 0
0.02.990.845 I llama_model_loader: - type  f32:  361 tensors
0.02.990.846 I llama_model_loader: - type q4_K:  190 tensors
0.02.990.847 I llama_model_loader: - type q5_K:  116 tensors
0.02.990.847 I llama_model_loader: - type q6_K:  184 tensors
0.02.990.848 I llama_model_loader: - type iq2_xxs:    6 tensors
0.02.990.849 I llama_model_loader: - type iq1_s:  168 tensors
0.02.990.851 I print_info: file format = GGUF V3 (latest)
0.02.990.852 I print_info: file type   = IQ1_S - 1.5625 bpw
0.02.990.856 I print_info: file size   = 130.60 GiB (1.67 BPW)
0.03.199.480 W load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
0.03.199.637 I load: special tokens cache size = 819
0.03.441.353 I load: token to piece cache size = 0.8223 MB
0.03.441.374 I print_info: arch             = deepseek2
0.03.441.376 I print_info: vocab_only       = 0
0.03.441.377 I print_info: n_ctx_train      = 163840
0.03.441.380 I print_info: n_embd           = 7168
0.03.441.381 I print_info: n_layer          = 61
0.03.441.393 I print_info: n_head           = 128
0.03.441.395 I print_info: n_head_kv        = 128
0.03.441.396 I print_info: n_rot            = 64
0.03.441.397 I print_info: n_swa            = 0
0.03.441.397 I print_info: n_embd_head_k    = 192
0.03.441.398 I print_info: n_embd_head_v    = 128
0.03.441.400 I print_info: n_gqa            = 1
0.03.441.402 I print_info: n_embd_k_gqa     = 24576
0.03.441.403 I print_info: n_embd_v_gqa     = 16384
0.03.441.405 I print_info: f_norm_eps       = 0.0e+00
0.03.441.406 I print_info: f_norm_rms_eps   = 1.0e-06
0.03.441.407 I print_info: f_clamp_kqv      = 0.0e+00
0.03.441.407 I print_info: f_max_alibi_bias = 0.0e+00
0.03.441.408 I print_info: f_logit_scale    = 0.0e+00
0.03.441.410 I print_info: n_ff             = 18432
0.03.441.411 I print_info: n_expert         = 256
0.03.441.412 I print_info: n_expert_used    = 8
0.03.441.412 I print_info: causal attn      = 1
0.03.441.414 I print_info: pooling type     = 0
0.03.441.415 I print_info: rope type        = 0
0.03.441.416 I print_info: rope scaling     = yarn
0.03.441.418 I print_info: freq_base_train  = 10000.0
0.03.441.419 I print_info: freq_scale_train = 0.025
0.03.441.420 I print_info: n_ctx_orig_yarn  = 4096
0.03.441.422 I print_info: rope_finetuned   = unknown
0.03.441.422 I print_info: ssm_d_conv       = 0
0.03.441.422 I print_info: ssm_d_inner      = 0
0.03.441.423 I print_info: ssm_d_state      = 0
0.03.441.423 I print_info: ssm_dt_rank      = 0
0.03.441.424 I print_info: ssm_dt_b_c_rms   = 0
0.03.441.424 I print_info: model type       = 671B
0.03.441.426 I print_info: model params     = 671.03 B
0.03.441.427 I print_info: general.name     = DeepSeek R1 BF16
0.03.441.427 I print_info: n_layer_dense_lead   = 3
0.03.441.428 I print_info: n_lora_q             = 1536
0.03.441.429 I print_info: n_lora_kv            = 512
0.03.441.429 I print_info: n_ff_exp             = 2048
0.03.441.430 I print_info: n_expert_shared      = 1
0.03.441.430 I print_info: expert_weights_scale = 2.5
0.03.441.431 I print_info: expert_weights_norm  = 1
0.03.441.431 I print_info: expert_gating_func   = sigmoid
0.03.441.432 I print_info: rope_yarn_log_mul    = 0.1000
0.03.441.434 I print_info: vocab type       = BPE
0.03.441.434 I print_info: n_vocab          = 129280
0.03.441.435 I print_info: n_merges         = 127741
0.03.441.436 I print_info: BOS token        = 0 '<|begin▁of▁sentence|>'
0.03.441.437 I print_info: EOS token        = 1 '<|end▁of▁sentence|>'
0.03.441.437 I print_info: EOT token        = 1 '<|end▁of▁sentence|>'
0.03.441.438 I print_info: PAD token        = 128815 '<|PAD▁TOKEN|>'
0.03.441.439 I print_info: LF token         = 131 'Ä'
0.03.441.440 I print_info: FIM PRE token    = 128801 '<|fim▁begin|>'
0.03.441.440 I print_info: FIM SUF token    = 128800 '<|fim▁hole|>'
0.03.441.441 I print_info: FIM MID token    = 128802 '<|fim▁end|>'
0.03.441.442 I print_info: EOG token        = 1 '<|end▁of▁sentence|>'
0.03.441.442 I print_info: max token length = 256
0.03.520.383 I load_tensors: offloading 61 repeating layers to GPU
0.03.520.393 I load_tensors: offloading output layer to GPU
0.03.520.393 I load_tensors: offloaded 62/62 layers to GPU
0.03.520.399 I load_tensors:          CPU model buffer size =   497.11 MiB
0.03.520.402 I load_tensors:        CUDA0 model buffer size =  5986.67 MiB
0.03.520.405 I load_tensors:        CUDA1 model buffer size =  9869.24 MiB
0.03.520.407 I load_tensors:        CUDA2 model buffer size =  8973.24 MiB
0.03.520.408 I load_tensors:        CUDA3 model buffer size =  8973.24 MiB
0.03.520.410 I load_tensors:        CUDA4 model buffer size =  8973.24 MiB
0.03.520.411 I load_tensors:        CUDA5 model buffer size =  8973.24 MiB
0.03.520.412 I load_tensors:        CUDA6 model buffer size =  8973.24 MiB
0.03.520.413 I load_tensors:        CUDA7 model buffer size =  8973.24 MiB
0.03.520.415 I load_tensors: RPC[172.20.10.59:50389] model buffer size = 11216.55 MiB
0.03.520.416 I load_tensors: RPC[172.20.10.59:50556] model buffer size =  8973.24 MiB
0.03.520.417 I load_tensors: RPC[172.20.10.59:50195] model buffer size =  8973.24 MiB
9.03.583.292 I llama_init_from_model: RPC[172.20.10.59:50195] compute buffer size =  2218.00 MiB
9.03.583.293 I llama_init_from_model: RPC[172.20.10.59:50162] compute buffer size =  2218.00 MiB
9.03.583.294 I llama_init_from_model: RPC[172.20.10.59:50750] compute buffer size =  2218.00 MiB
9.03.583.295 I llama_init_from_model: RPC[172.20.10.59:50883] compute buffer size =  2218.00 MiB
9.03.583.296 I llama_init_from_model: RPC[172.20.10.59:50244] compute buffer size =  2218.00 MiB
9.03.583.297 I llama_init_from_model:  CUDA_Host compute buffer size =    30.01 MiB
9.03.583.299 I llama_init_from_model: graph nodes  = 5025
9.03.583.299 I llama_init_from_model: graph splits = 16
9.03.583.303 I common_init_from_params: setting dry_penalty_last_n to ctx_size = 8192
9.03.584.053 I srv                load_model: prompt caching disabled
9.03.660.275 I srv                load_model: chat template, built_in: true, alias: deepseek3, tool call: unsupported, example:
You are a helpful assistant.

<|User|>Hello.<|Assistant|>Hi there.<|end▁of▁sentence|><|User|>What's the weather like in Paris today?<|Assistant|>
9.03.660.282 I srv                      main: initializing server
9.03.660.284 I srv                      init: initializing slots, n_slots = 4
9.03.660.477 I srv                      main: starting server
10.44.233.238 I srv        log_server_request: rid 87452521950 | POST /v1/chat/completions 172.20.10.59:54880
10.44.233.420 I srv oaicompat_completions_req: rid 87452521950 | {"messages":"[...]","model":"DeepSeek-R1-UD-IQ1_S","stream":true,"temperature":0.6}
10.54.137.487 I srv        log_server_request: rid 87462426199 | POST /v1/chat/completions 172.20.10.59:54738
10.54.137.572 I srv oaicompat_completions_req: rid 87462426199 | {"messages":"[...]","model":"DeepSeek-R1-UD-IQ1_S","stream":true,"temperature":0.6}
15.54.707.063 I srv oaicompat_completions_res: rid 87452521950 | prompt_tokens: 44, completion_tokens: 1572, draft_tokens: 0, ttft: 2685.93ms, tpot: 195.79ms, tps: 5.11, dta: 0.00%
15.54.707.156 I srv       log_server_response: rid 87452521950 | POST /v1/chat/completions 172.20.10.59:54880 | status 200 | cost 310.47s
18.11.247.158 I srv oaicompat_completions_res: rid 87462426199 | prompt_tokens: 44, completion_tokens: 2445, draft_tokens: 0, ttft: 1224.72ms, tpot: 178.24ms, tps: 5.61, dta: 0.00%
18.11.247.230 I srv       log_server_response: rid 87462426199 | POST /v1/chat/completions 172.20.10.59:54738 | status 200 | cost 437.11s
19.10.193.809 I srv        log_server_request: rid 87958482520 | POST /v1/chat/completions 172.25.17.201:38750
19.10.193.911 I srv oaicompat_completions_req: rid 87958482520 | {"messages":"[...]","model":"DeepSeek-R1-UD-IQ1_S"}
19.12.254.839 I srv oaicompat_completions_res: rid 87958482520 | prompt_tokens: 4, completion_tokens: 16, draft_tokens: 0, ttft: 523.08ms, tpot: 95.98ms, tps: 10.42, dta: 0.00%
19.12.254.963 I srv       log_server_response: rid 87958482520 | POST /v1/chat/completions 172.25.17.201:38750 | status 200 | cost 2.06s
19.14.902.932 I srv        log_server_request: rid 87963191644 | POST /v1/chat/completions 172.25.17.201:38766
19.14.903.081 I srv oaicompat_completions_req: rid 87963191644 | {"max_tokens":2048,"messages":"[...]","model":"DeepSeek-R1-UD-IQ1_S","seed":42,"stream":true,"stream_options":{"include_usage":true},"temperature":0.6}
19.14.903.203 I srv        log_server_request: rid 87963191916 | POST /v1/chat/completions 172.25.17.201:38780
19.14.903.294 I srv oaicompat_completions_req: rid 87963191916 | {"max_tokens":2048,"messages":"[...]","model":"DeepSeek-R1-UD-IQ1_S","seed":42,"stream":true,"stream_options":{"include_usage":true},"temperature":0.6}
20.20.154.688 I srv oaicompat_completions_res: rid 87963191916 | prompt_tokens: 26, completion_tokens: 354, draft_tokens: 0, ttft: 1097.28ms, tpot: 181.22ms, tps: 5.52, dta: 0.00%
20.20.154.760 I srv       log_server_response: rid 87963191916 | POST /v1/chat/completions 172.25.17.201:38780 | status 200 | cost 65.25s
20.20.200.360 I srv        log_server_request: rid 88028489072 | POST /v1/chat/completions 172.25.17.201:51324
20.20.200.429 I srv oaicompat_completions_req: rid 88028489072 | {"max_tokens":2048,"messages":"[...]","model":"DeepSeek-R1-UD-IQ1_S","seed":42,"stream":true,"stream_options":{"include_usage":true},"temperature":0.6}
/home/runner/work/llama-box/llama-box/llama.cpp/ggml/src/ggml-rpc/ggml-rpc.cpp:682: GGML_ASSERT(status) failed
No symbol table is loaded.  Use the "file" command.
[New LWP 563152]
[New LWP 563153]
...
[New LWP 566060]
[New LWP 566061]
[New LWP 566062]
[New LWP 566063]
Missing separate debuginfo for /lib64/libcuda.so.1
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/06/c063009eff89df487eea6c9c93acbfbd36d28d.debug
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
0x00007fefdff306a2 in waitpid () from /lib64/libpthread.so.0
No symbol "frame" in current context.
[Inferior 1 (process 563151) detached]
Aborted (core dumped)
(base) [root@172-20-10-58 ~]#
thxCode (Collaborator) commented Feb 10, 2025

link gpustack/gpustack#1137

thxCode (Collaborator) commented Feb 11, 2025

please try with --no-cache-prompt.

MatheMatrix (Author) commented Feb 11, 2025

> please try with --no-cache-prompt.

😢 I have tried, but it still crashes. My arguments:

0.02.559.011 I arguments : /root/.local/share/pipx/venvs/gpustack/lib/python3.11/site-packages/gpustack/third_party/bin/llama-box/llama-box --host 0.0.0.0 --gpu-layers 62 --parallel 2 --ctx-size 12288 --port 40324 --model /root/DeepSeek-R1-UD-IQ1_S_1.53b/DeepSeek-R1-UD-IQ1_S.gguf --alias DeepSeek-R1-UD-IQ1_S-1.58 --no-mmap --no-warmup --rpc 172.20.10.59:50389,172.20.10.59:50556,172.20.10.59:50195,172.20.10.59:50162,172.20.10.59:50750,172.20.10.59:50883,172.20.10.59:50244 --no-context-shift --no-cache-prompt -n 6144 --metrics

thxCode (Collaborator) commented Feb 11, 2025

> 😢 I have tried, but it still crashes. My arguments: [...]

Can you try with the latest version? We should track this against the main branch, since things change quickly in AI development.

MatheMatrix (Author) commented:

Seems better than the previous version 👍, but now the local llama-box crashes 😢

Core was generated by `/root/.local/share/pipx/venvs/gpustack/lib/python3.11/site-packages/gpustack/th'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007faf9133152f in raise () from /lib64/libc.so.6
[Current thread is 1 (Thread 0x7fafb87a7000 (LWP 479365))]
Missing separate debuginfos, use: yum debuginfo-install glibc-2.28-251.el8_10.2.x86_64 libgcc-8.5.0-22.el8_10.x86_64
(gdb) bt
#0  0x00007faf9133152f in raise () from /lib64/libc.so.6
#1  0x00007faf91304e65 in abort () from /lib64/libc.so.6
#2  0x0000000000dc6f53 in ggml_abort ()
#3  0x0000000000dc50f4 in ggml_backend_rpc_graph_compute(ggml_backend*, ggml_cgraph*) ()
#4  0x0000000000dddac4 in ggml_backend_sched_compute_splits(ggml_backend_sched*) ()
#5  0x00000000006176a0 in llama_graph_compute(llama_context&, ggml_cgraph*, int, ggml_threadpool*) ()
#6  0x000000000061c0eb in llama_kv_cache_update_impl(llama_context&) ()
#7  0x000000000061d0a8 in llama_decode_impl(llama_context&, llama_batch) ()
#8  0x000000000061df17 in llama_decode ()
#9  0x0000000000547f76 in server_context::update_slots() ()
#10 0x000000000050356d in server_task_queue::start_loop() ()
#11 0x0000000000439d6c in main ()

Command and version:

0.02.600.960 I arguments  : /root/.local/share/pipx/venvs/gpustack/lib/python3.11/site-packages/gpustack/third_party/bin/llama-box/llama-box --host 0.0.0.0 --gpu-layers 62 --parallel 4 --ctx-size 11264 --port 40324 --model /root/DeepSeek-R1-UD-IQ1_S_1.53b/DeepSeek-R1-UD-IQ1_S.gguf --alias DeepSeek-R1-UD-IQ1_S-1.58 --rpc 172.20.10.59:50903,172.20.10.59:51017,172.20.10.59:50003,172.20.10.59:50628,172.20.10.59:50204,172.20.10.59:50817,172.20.10.59:50098 --no-context-shift -n 4096 --metrics --warmup --no-cache-prompt
0.02.600.962 I version    : v0.0.115 (268f80a)
0.02.600.963 I compiler   : cc (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2)
0.02.600.963 I target     : x86_64-redhat-linux
0.02.600.964 I vendor     : llama.cpp 27d135c9 (4598), stable-diffusion.cpp 102953d (203)
0.02.600.992 I system_info: n_threads = 16 (n_threads_batch = 16) / 16 | CUDA : ARCHS = 600,610,700,750,800,860,890,900 | F16 = 1 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 |
0.02.600.993 I

thxCode (Collaborator) commented Feb 14, 2025

> Seems better than the previous version 👍, but now the local llama-box crashes 😢 [...]

Can you upload the full log and the reproduction steps here? We have closed a similar issue, gpustack/gpustack#1137; can you verify again with gpustack v0.5.1? Remember, since v0.0.117 introduces a new RPC command, you should also upgrade all your agents to v0.5.1.

thxCode self-assigned this Feb 28, 2025