Llama-3_1-Nemotron-Ultra-253B-v1 support #12843
base: master
Conversation
Thanks a lot for your contribution! I will try Llama-3_1-Nemotron-Ultra-253B-v1 and let you know shortly. I'm currently running convert_hf_to_gguf.py and everything is working great so far.
@ymcki Something unfortunately seems to be broken. The output seems to be just random tokens.
Prompt: What is the meaning of life?
Response: 8!B"(1D<)<,4@3-A'3(<5,72A9.F62AC"%D08);E)6CDHCA0C.HC!%85>8DD(3!=&;48<"802=A,%0,6%D@0/'D<%(11@=&:.F0A)!91.#;2,&;)
Note: I canceled token generation after a while, as it likely would have continued generating garbage until reaching the context size.
Steps to reproduce:
rm -rf llama.cpp
git clone --recursive https://github.com/ymcki/llama.cpp.git
cd llama.cpp
python -m venv venv
venv/bin/pip install -r requirements.txt
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
venv/bin/python convert_hf_to_gguf.py /mradermacher/tmp/quant/Llama-3_1-Nemotron-Ultra-253B-v1 --outfile /dpool/Llama-3_1-Nemotron-Ultra-253B-v1.gguf
cd build/bin
./llama-quantize /dpool/Llama-3_1-Nemotron-Ultra-253B-v1.gguf /bpool/Llama-3_1-Nemotron-Ultra-253B-v1.Q4_K_M.gguf Q4_K_M
./llama-cli -m /bpool/Llama-3_1-Nemotron-Ultra-253B-v1.Q4_K_M.gguf
Log:
root@AI:/apool/llama.cpp/build/bin# ./llama-cli -m /bpool/Llama-3_1-Nemotron-Ultra-253B-v1.Q4_K_M.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
build: 5099 (80af2e33) with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4090) - 23628 MiB free
llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 4090) - 23663 MiB free
llama_model_loader: loaded meta data with 34 key-value pairs and 648 tensors from /bpool/Llama-3_1-Nemotron-Ultra-253B-v1.Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = deci
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Llama_Nemotron_Ultra
llama_model_loader: - kv 3: general.version str = v1
llama_model_loader: - kv 4: general.finetune str = 3_1-Nemotron-Ultra
llama_model_loader: - kv 5: general.basename str = Llama
llama_model_loader: - kv 6: general.size_label str = 253B
llama_model_loader: - kv 7: general.license str = other
llama_model_loader: - kv 8: general.license.name str = nvidia-open-model-license
llama_model_loader: - kv 9: general.license.link str = https://www.nvidia.com/en-us/agreemen...
llama_model_loader: - kv 10: general.tags arr[str,4] = ["nvidia", "llama-3", "pytorch", "tex...
llama_model_loader: - kv 11: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 12: deci.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 13: deci.attention.head_count_kv arr[i32,162] = [8, 8, 8, 8, 8, 8, 8, 8, 8, 0, 0, 0, ...
llama_model_loader: - kv 14: deci.attention.head_count arr[i32,162] = [128, 128, 128, 128, 128, 128, 128, 1...
llama_model_loader: - kv 15: deci.feed_forward_length arr[i32,162] = [5376, 10752, 16128, 16128, 16128, 16...
llama_model_loader: - kv 16: deci.block_count u32 = 162
llama_model_loader: - kv 17: deci.context_length u32 = 131072
llama_model_loader: - kv 18: deci.embedding_length u32 = 16384
llama_model_loader: - kv 19: deci.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 20: deci.attention.key_length u32 = 128
llama_model_loader: - kv 21: deci.attention.value_length u32 = 128
llama_model_loader: - kv 22: deci.vocab_size u32 = 128256
llama_model_loader: - kv 23: deci.rope.dimension_count u32 = 128
llama_model_loader: - kv 24: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 25: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 26: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 27: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 28: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 29: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 30: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 31: tokenizer.chat_template str = {{- bos_token }}{%- if messages[0]['r...
llama_model_loader: - kv 32: general.quantization_version u32 = 2
llama_model_loader: - kv 33: general.file_type u32 = 15
llama_model_loader: - type f32: 147 tensors
llama_model_loader: - type q4_K: 428 tensors
llama_model_loader: - type q6_K: 73 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_K - Medium
print_info: file size = 140.56 GiB (4.76 BPW)
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch = deci
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 16384
print_info: n_layer = 162
print_info: n_head = [128, 128, 128, 128, 128, 128, 128, 128, 128, 0, 0, 0, 0, 128, 128, 128, 128, 128, 0, 0, 0, 0, 0, 0, 128, 128, 128, 0, 0, 0, 0, 0, 128, 128, 128, 128, 0, 0, 0, 128, 128, 128, 0, 128, 0, 0, 0, 0, 0, 0, 128, 128, 128, 128, 0, 0, 0, 0, 0, 128, 128, 128, 128, 0, 0, 0, 0, 0, 128, 128, 128, 128, 0, 0, 0, 0, 0, 128, 128, 128, 128, 0, 0, 0, 0, 0, 128, 128, 128, 128, 0, 0, 128, 128, 128, 128, 0, 0, 128, 0, 0, 0, 0, 0, 0, 0, 0, 128, 0, 0, 0, 0, 0, 128, 128, 0, 128, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 128, 128, 0, 128, 128, 128, 128, 128, 128, 128, 128]
print_info: n_head_kv = [8, 8, 8, 8, 8, 8, 8, 8, 8, 0, 0, 0, 0, 8, 8, 8, 8, 8, 0, 0, 0, 0, 0, 0, 8, 8, 8, 0, 0, 0, 0, 0, 8, 8, 8, 8, 0, 0, 0, 8, 8, 8, 0, 8, 0, 0, 0, 0, 0, 0, 8, 8, 8, 8, 0, 0, 0, 0, 0, 8, 8, 8, 8, 0, 0, 0, 0, 0, 8, 8, 8, 8, 0, 0, 0, 0, 0, 8, 8, 8, 8, 0, 0, 0, 0, 0, 8, 8, 8, 8, 0, 0, 8, 8, 8, 8, 0, 0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 8, 0, 0, 0, 0, 0, 8, 8, 0, 8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 8, 8, 0, 8, 8, 8, 8, 8, 8, 8, 8]
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_swa_pattern = 1
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = [16, 16, 16, 16, 16, 16, 16, 16, 16, 0, 0, 0, 0, 16, 16, 16, 16, 16, 0, 0, 0, 0, 0, 0, 16, 16, 16, 0, 0, 0, 0, 0, 16, 16, 16, 16, 0, 0, 0, 16, 16, 16, 0, 16, 0, 0, 0, 0, 0, 0, 16, 16, 16, 16, 0, 0, 0, 0, 0, 16, 16, 16, 16, 0, 0, 0, 0, 0, 16, 16, 16, 16, 0, 0, 0, 0, 0, 16, 16, 16, 16, 0, 0, 0, 0, 0, 16, 16, 16, 16, 0, 0, 16, 16, 16, 16, 0, 0, 16, 0, 0, 0, 0, 0, 0, 0, 0, 16, 0, 0, 0, 0, 0, 16, 16, 0, 16, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 16, 16, 0, 16, 16, 16, 16, 16, 16, 16, 16]
print_info: n_embd_k_gqa = [1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 0, 0, 0, 0, 1024, 1024, 1024, 1024, 1024, 0, 0, 0, 0, 0, 0, 1024, 1024, 1024, 0, 0, 0, 0, 0, 1024, 1024, 1024, 1024, 0, 0, 0, 1024, 1024, 1024, 0, 1024, 0, 0, 0, 0, 0, 0, 1024, 1024, 1024, 1024, 0, 0, 0, 0, 0, 1024, 1024, 1024, 1024, 0, 0, 0, 0, 0, 1024, 1024, 1024, 1024, 0, 0, 0, 0, 0, 1024, 1024, 1024, 1024, 0, 0, 0, 0, 0, 1024, 1024, 1024, 1024, 0, 0, 1024, 1024, 1024, 1024, 0, 0, 1024, 0, 0, 0, 0, 0, 0, 0, 0, 1024, 0, 0, 0, 0, 0, 1024, 1024, 0, 1024, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1024, 1024, 0, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024]
print_info: n_embd_v_gqa = [1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 0, 0, 0, 0, 1024, 1024, 1024, 1024, 1024, 0, 0, 0, 0, 0, 0, 1024, 1024, 1024, 0, 0, 0, 0, 0, 1024, 1024, 1024, 1024, 0, 0, 0, 1024, 1024, 1024, 0, 1024, 0, 0, 0, 0, 0, 0, 1024, 1024, 1024, 1024, 0, 0, 0, 0, 0, 1024, 1024, 1024, 1024, 0, 0, 0, 0, 0, 1024, 1024, 1024, 1024, 0, 0, 0, 0, 0, 1024, 1024, 1024, 1024, 0, 0, 0, 0, 0, 1024, 1024, 1024, 1024, 0, 0, 1024, 1024, 1024, 1024, 0, 0, 1024, 0, 0, 0, 0, 0, 0, 0, 0, 1024, 0, 0, 0, 0, 0, 1024, 1024, 0, 1024, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1024, 1024, 0, 1024, 1024, 1024, 1024, 1024, 1024, 1024, 1024]
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = [5376, 10752, 16128, 16128, 16128, 16128, 16128, 16128, 21504, 0, 0, 0, 0, 21504, 21504, 21504, 53248, 53248, 0, 0, 0, 0, 0, 0, 53248, 53248, 53248, 0, 0, 0, 0, 0, 53248, 53248, 53248, 26624, 0, 0, 0, 21504, 21504, 21504, 21504, 53248, 53248, 0, 0, 0, 0, 0, 53248, 53248, 53248, 53248, 0, 0, 0, 0, 0, 53248, 53248, 53248, 53248, 0, 0, 0, 0, 0, 53248, 53248, 53248, 53248, 0, 0, 0, 0, 0, 53248, 53248, 53248, 53248, 0, 0, 0, 0, 0, 53248, 37376, 37376, 37376, 0, 0, 32000, 26624, 26624, 26624, 26624, 26624, 26624, 0, 26624, 26624, 26624, 26624, 26624, 26624, 26624, 26624, 0, 0, 0, 0, 0, 32000, 53248, 53248, 53248, 0, 0, 0, 0, 0, 0, 0, 0, 399360, 0, 0, 0, 0, 0, 0, 0, 0, 425984, 0, 0, 0, 0, 0, 0, 0, 0, 343040, 0, 0, 0, 0, 0, 301056, 21504, 21504, 26624, 0, 26624, 26624, 37376, 53248, 53248, 53248, 53248, 26624]
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 405B
print_info: model params = 253.40 B
print_info: general.name = Llama_Nemotron_Ultra
print_info: vocab type = BPE
print_info: n_vocab = 128256
print_info: n_merges = 280147
print_info: BOS token = 128000 '<|begin_of_text|>'
print_info: EOS token = 128009 '<|eot_id|>'
print_info: EOT token = 128009 '<|eot_id|>'
print_info: EOM token = 128008 '<|eom_id|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 128008 '<|eom_id|>'
print_info: EOG token = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 0 repeating layers to GPU
load_tensors: offloaded 0/163 layers to GPU
load_tensors: CPU_Mapped model buffer size = 143937.13 MiB
...................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: CPU output buffer size = 0.49 MiB
init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 162, can_shift = 1
init: CPU KV buffer size = 1024.00 MiB
llama_context: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
llama_context: CUDA0 compute buffer size = 10132.00 MiB
llama_context: CUDA_Host compute buffer size = 40.01 MiB
llama_context: graph nodes = 2399
llama_context: graph splits = 777 (with bs=512), 1 (with bs=1)
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 32
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
main: chat template example:
<|start_header_id|>system<|end_header_id|>
You are a helpful assistant<|eot_id|><|start_header_id|>user<|end_header_id|>
Hello<|eot_id|><|start_header_id|>assistant<|end_header_id|>
Hi there<|eot_id|><|start_header_id|>user<|end_header_id|>
How are you?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
system_info: n_threads = 32 (n_threads_batch = 32) / 64 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |
main: interactive mode on.
sampler seed: 1575017034
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
dry_multiplier = 0.000, dry_base = 1.750, dry_allowed_length = 2, dry_penalty_last_n = 4096
top_k = 40, top_p = 0.950, min_p = 0.050, xtc_probability = 0.000, xtc_threshold = 0.100, typical_p = 1.000, top_n_sigma = -1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> dry -> top-k -> typical -> top-p -> min-p -> xtc -> temp-ext -> dist
generate: n_ctx = 4096, n_batch = 2048, n_predict = -1, n_keep = 1
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- Press Return to return control to the AI.
- To return control without starting a new line, end your input with '/'.
- If you want to submit another line, end your input with '\'.
- Not using system message. To change it, set a different value via -sys PROMPT
> What is the meaning of life?
8!B"(1D<)<,4@*3-A'3(<5,72A9.F62AC"%D08);E)6CDHCA0C.HC!%85>8DD(3!=&;48<"802=A,%0,6%D@0/'D<%(11@=&:.F*0A)!91.#;2,&;))
>
llama_perf_sampler_print: sampling time = 5.23 ms / 131 runs ( 0.04 ms per token, 25052.59 tokens per second)
llama_perf_context_print: load time = 213563.07 ms
llama_perf_context_print: prompt eval time = 4242.12 ms / 17 tokens ( 249.54 ms per token, 4.01 tokens per second)
llama_perf_context_print: eval time = 90563.05 ms / 114 runs ( 794.41 ms per token, 1.26 tokens per second)
llama_perf_context_print: total time = 108716.49 ms / 131 tokens
Interrupted by user
Thanks for your update. I will take a closer look and compare the modeling_*.py files to see if I can spot more differences. Can your llama-cli binary work with the 49B/51B ggufs?
I'm just wondering whether we can rely on ffn.ffn_mult is None, or whether a safer approach would be ffn.no_op is True, to decide if a layer is a dummy? https://huggingface.co/nvidia/Llama-3_1-Nemotron-Ultra-253B-v1/discussions/1
I believe the error may be due to me not skipping this part of the code in the layer loop when we have a dummy layer.
I made a fix to skip this when we have a dummy layer. It doesn't break 51B inference. Can you give this a try? I believe you don't need to re-convert the gguf; just re-compile and run llama-cli. Thanks a lot in advance.
There are 10 layers with ffn_mult 1.95, but all of them have ffn no_op False. I believe you are talking about attention no_op True for some 1.95 layers. Those belong to the attention-free layers, which I specifically handle when n_head == 0.
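For anyone following along, here is a minimal sketch of how the two flags could be combined during conversion. The field names (block_configs, attention.no_op, ffn.no_op, ffn.ffn_mult) are assumed from the model's HF config.json and may differ from what convert_hf_to_gguf.py actually reads:

import json

def classify_block(cfg: dict) -> str:
    """Classify one entry of block_configs from the HF config.json.
    Field names (attention.no_op, ffn.no_op) are assumed from the
    Nemotron/Deci configs and may need adjusting."""
    attn_off = cfg["attention"].get("no_op", False)
    ffn_off = cfg["ffn"].get("no_op", False)
    if attn_off and ffn_off:
        return "dummy"           # new in the 253B model: no attention, no FFN
    if attn_off:
        return "attention-free"  # exported with n_head == 0 in the GGUF metadata
    if ffn_off:
        return "ffn-free"
    return "regular"

with open("config.json") as f:   # the model's HF config
    config = json.load(f)

for i, block in enumerate(config["block_configs"]):
    # For dummy layers ffn_mult should also be None, so either check works,
    # but no_op states the intent explicitly.
    print(i, classify_block(block), block["ffn"].get("ffn_mult"))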
@ymcki Wow, amazing, this fixed it. Thanks a lot for the quick fix! Here is the result of my latest test. The model gave a perfect answer.
Good news. Now bartowski can start making the ggufs. :)
I uploaded the Q4_K_M quants I made for testing to https://huggingface.co/nicoboss/Llama-3_1-Nemotron-Ultra-253B-v1-GGUF in case anyone wants to try out this PR and the Llama-3_1-Nemotron-Ultra-253B-v1 model. Edit: I added Q3_K_M quants to the above repository for those unable to run Q4_K_M. While doing so I also retested convert_hf_to_gguf.py and it still worked perfectly fine.
I can also confirm it works; Q2 quants are uploading to DevQuasar/nvidia.Llama-3_1-Nemotron-Ultra-253B-v1-GGUF.
I can confirm it works; I tested the Q3_K_M that @nicoboss uploaded.
I tried out this branch.
This is a known problem that has been reported before. I don't have the resources to implement this, as I only have one card.
Unfortunately, when I try to run the model in LM Studio, I get this error:
I searched high and low and can't figure out why the committed changes from pull/12843 don't allow it to run. Thoughts?
Do other people have the same error? The 10th layer is the new dummy layer that has no self-attention and no FFN, so the gguf itself should have no ffn_norm weight in it. The reason you are getting this error is that your LM Studio doesn't have the code to skip reading this weight. Please try compiling llama.cpp and running with llama-cli to make sure you have the new code.
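To illustrate the point (this is not the actual loader code, which lives in C++ in src/llama-model.cpp, and the tensor list below is simplified and not exhaustive), here is a small sketch of how the per-layer head_count and feed_forward_length arrays shown in the dump above decide which blk.N.* tensors a loader should even look for:

def expected_tensors(layer: int, n_head: int, n_ff: int) -> list[str]:
    """Tensor names a loader should expect for one block, based on the
    per-layer metadata arrays (deci.attention.head_count and
    deci.feed_forward_length). A dummy layer (n_head == 0 and n_ff == 0)
    contributes nothing at all, so there is no blk.N.ffn_norm.weight
    to read either."""
    names: list[str] = []
    if n_head > 0:  # attention block present
        names += [f"blk.{layer}.attn_norm.weight",
                  f"blk.{layer}.attn_q.weight", f"blk.{layer}.attn_k.weight",
                  f"blk.{layer}.attn_v.weight", f"blk.{layer}.attn_output.weight"]
    if n_ff > 0:    # FFN block present
        names += [f"blk.{layer}.ffn_norm.weight",
                  f"blk.{layer}.ffn_gate.weight", f"blk.{layer}.ffn_up.weight",
                  f"blk.{layer}.ffn_down.weight"]
    return names

# Layer 9 (the 10th layer) has head_count 0 and feed_forward_length 0 in the
# dump above, i.e. it is a dummy layer:
print(expected_tensors(9, n_head=0, n_ff=0))        # -> []
print(expected_tensors(0, n_head=128, n_ff=5376))   # -> full attention + FFN list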
Someone has left a comment with the same issue on my quant: https://huggingface.co/DevQuasar/nvidia.Llama-3_1-Nemotron-Ultra-253B-v1-GGUF/discussions/1 It worked fine for me with llama.cpp built from your branch.
I thought it was merged into main; am I wrong? I recompiled it today (the main version) and no joy.
Well, the status of this PR is "Open", not "Merged". Apparently, it is not merged into main. Now all we need is a llama.cpp god to approve the PR...
I tried your PR @ymcki - great work! It does work; however, offloading to the GPU seems partially broken - using
But overall, I tried using NVIDIA's suggested
I have loaded it on 4 GPUs (2x24 + 1x32 + 1x48 GB), tinkering a lot with -ts, and it seems to work fine with the default split mode. Where do you set thinking = on? (Sorry, new to GGUF)
What do you mean by "offloading to the GPU"? Do you mean "offloading to CPU"? Or do you mean "--split-mode row" can make the VRAM distribution more even?
Can you try "--split-mode row" and see if it makes any difference to the VRAM distribution? Thanks
I get OOM; I guess -sm row is like tensor parallel? Since Q3_K_M is ~110GB and my GPU with the least VRAM is 24GB, it seems to load up to ~90-93GB until a 24GB VRAM GPU goes OOM.
Thanks for the info. It seems "-sm row" will use the main_gpu for small tensors and intermediate results. Supposedly it will run faster, so it doesn't do much for VRAM distribution. Perhaps combining "-sm row" with manually tuning "-ts" can work?
Sorry for the late answer; I still have to test this properly. I tested a bit, and it seems -ts values have to be wildly different when using -sm row. I was testing UD-Q3_K_XL (https://huggingface.co/unsloth/Llama-3_1-Nemotron-Ultra-253B-v1-GGUF/tree/main/UD-Q3_K_XL) but couldn't load all 163 layers on GPU, compared to Q3_K_M. In theory it should fit, but I couldn't find a -ts value that let me load more on either the pair of gpu0/1 (4090 x2) or gpu 2 (5090) while also using just 1.9GB less on gpu 3 (A6000), even though the other 3 combined have 3GB free or so. This seems to happen for that specific model with both the default -sm and row. I probably have to test -sm row on the normal Q3_K_M.
I tried the Q4_K_M on 4x3090 with CPU offloading and it seems to share layers weirdly; I also tried -sm row with -ts and that does not work either, so always OOM.
Q4_K_M itself is 145GB, and the f16 KV cache at 4k context is 1GB. That's way over your VRAM. How much RAM do you have? Maybe you should first try IQ3_XXS, which is 97.6GB? You can download it from
I have created a table of the amount of VRAM needed for each layer. Can someone tell me what "-ts" does exactly? For example, if you have three 3090s and pass "-sm layer" and "-ts 2,3,1", does that mean the first 54 layers go to card 1, the next 81 layers to card 2, and the final 27 layers to card 3?
I'm not sure how -ts works either. I think it is a ratio for how the layers are split, but I have to use very weird values to make it work on Nemotron (e.g. 17.3,16,24,22.5). I think because the layers of Nemotron seem to have different sizes, and some are empty/dummy layers, the load is not even across layers. You can kind of make the GPUs with more VRAM take the larger layers if you load the layers in the 2nd and 4th (last) fraction of the model.
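For what it's worth, my understanding (not the exact llama.cpp algorithm) is that with -sm layer the -ts values act as per-device weights and whole layers are assigned roughly in proportion to them. A toy sketch of that proportional split, assuming equal-sized layers (exactly the assumption Nemotron breaks):

def split_layers(n_layers: int, tensor_split: list[float]) -> list[int]:
    """Toy proportional split: assign n_layers across devices according to
    the -ts ratios, assuming every layer costs the same amount of VRAM.
    Nemotron's layers are not equal-sized (and some are dummies), which is
    why the ratios need so much manual tuning in practice."""
    total = sum(tensor_split)
    counts = [int(n_layers * w / total) for w in tensor_split]
    # hand out any rounding remainder to the largest shares first
    for i in sorted(range(len(counts)), key=lambda i: tensor_split[i], reverse=True):
        if sum(counts) >= n_layers:
            break
        counts[i] += 1
    return counts

# The example from the question above: three cards, "-ts 2,3,1" over 162 layers
print(split_layers(162, [2, 3, 1]))  # -> [54, 81, 27]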
I created a spreadsheet of the exact number of parameters and KV cache size per layer that can help people manually distribute their VRAM. Layer 0 is the tokenizer weight, which I believe is not loaded into VRAM. Layer 163 is the output layer. All dummy layers are skipped, as they don't have any weights. This is an example for IQ3_M 3.66bpw with IQ4_NL KV cache 4.5bpw and a context length of 65536. In this case, suppose you have 5x3090: then layers 1 to 43 go to card 1, layers 44 to 80 to card 2, layers 81 to 125 to card 3, layers 126 to 149 to card 4, and layers 150 to 163 to card 5.
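As a cross-check on those numbers, here is a short sketch of the KV-cache arithmetic. The 1024 MiB figure in the log above implies 64 attention-bearing layers (512 MiB / 8 MiB per layer for K at 4096 context, n_embd_k_gqa = 1024, f16); that layer count and the 4.5 bpw figure for IQ4_NL cache from the comment are assumptions of this sketch:

def kv_cache_mib(n_ctx: int, n_embd_kv: int, bytes_per_elt: float,
                 n_attn_layers: int) -> float:
    """K and V caches each take n_ctx * n_embd_kv * bytes_per_elt per
    attention-bearing layer; dummy and attention-free layers take nothing."""
    per_layer = n_ctx * n_embd_kv * bytes_per_elt / 2**20  # MiB, K or V alone
    return 2 * per_layer * n_attn_layers                   # K + V

# Reproduces the log above: 64 layers with n_embd_k_gqa = 1024, f16, 4096 ctx
print(kv_cache_mib(4096, 1024, 2.0, 64))        # -> 1024.0 MiB

# The spreadsheet scenario: 65536 ctx with a ~4.5 bpw (IQ4_NL-style) KV cache
print(kv_cache_mib(65536, 1024, 4.5 / 8, 64))   # -> 4608.0 MiB across all cards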
Dear all,
I was the person who made the PR for Llama-3_1-Nemotron-51B support.
#10669
I noticed that there is a new Deci model called Llama-3_1-Nemotron-Ultra-253B-v1.
Based on my understanding, in addition to the original three types of layers, it
adds a new type of layer that has neither attention nor FFN, which I call
a dummy layer.
So I modified convert_hf_to_gguf.py and src/llama-model.* to support this dummy
layer.
I tested the code against the original 51B model and it seems to have no error
during conversion and inference.
However, I don't have the resources to test it on the 253B model. Is it possible
for someone here to try this PR and see if it works for the 253B model? Thanks a lot in advance.
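For anyone skimming the PR: conceptually, a dummy layer is just an identity block, so it contributes no tensors to the GGUF and the forward pass simply passes the hidden state through. A minimal conceptual sketch (not the actual C++ graph-building code):

from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Block:
    attention: Optional[Callable[[float], float]] = None  # None when attention is absent
    ffn: Optional[Callable[[float], float]] = None         # None when the FFN is absent

def forward(hidden: float, block: Block) -> float:
    """One transformer block: each sub-block is a residual update when present.
    A dummy layer has neither sub-block and passes the hidden state through."""
    if block.attention is not None:
        hidden = hidden + block.attention(hidden)
    if block.ffn is not None:
        hidden = hidden + block.ffn(hidden)
    return hidden

dummy = Block()             # the new layer type in the 253B model
print(forward(1.0, dummy))  # -> 1.0, unchanged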