I'm not breaking any particularly new ground here, just recording my path and standing on the shoulders of giants - thanks to https://github.com/ggerganov and https://github.com/theycallmeloki for paving the way.
Let's make a place to work and pull the llama.cpp repo.
cd /clusterfs
mkdir Projects
cd Projects/
git clone https://github.com/ggerganov/llama.cpp
In Part 3 of Garrett Mills' instructions (https://glmdev.medium.com/building-a-raspberry-pi-cluster-f5f2446702e8) he walks through installing the MPI compilation dependencies on all nodes:
srun --nodes=3 apt install openmpi-bin openmpi-common libopenmpi3 libopenmpi-dev -y
That article is a great read on the basics of MPI, even if it isn't doing anything fancy. I followed the whole guide, and if you'd like to learn more about parallel programming it's a good way to get your toes wet.
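Before building, it's worth a quick sanity check that OpenMPI actually landed on every node. A minimal sketch, assuming the same three-node partition:

```bash
# Ask each node to report its hostname and the OpenMPI version it sees
srun --nodes=3 --ntasks-per-node=1 bash -c 'echo "$(hostname): $(mpirun --version | head -n 1)"'
```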
To build llama.cpp with MPI bindings turned on, we do the following:
cameron@rp4n1:~$ cd /clusterfs/Projects/llama.cpp/
cameron@rp4n1:/clusterfs/Projects/llama.cpp$ make CC=mpicc CXX=mpicxx LLAMA_MPI=1
Next we need a trained and quantized model.
I'm not redistributing these files, just tracking back for reproducibility: the model files I used were from TheBloke, the q4_0 version of each. I copied them to local storage ( /home/cameron/models ) on each node to reduce the file transfer load per run.
- ~~13b https://huggingface.co/TheBloke/LLaMa-13B-GGML~~
- 65B - https://huggingface.co/TheBloke/llama-65B-GGML
llama.cpp has updated the expected format for model binaries, so we are looking for a .gguf file now:
- https://huggingface.co/TheBloke/Llama-2-70B-GGUF/blob/main/llama-2-70b.Q3_K_M.gguf
- https://huggingface.co/TheBloke/Llama-2-13B-GGUF/blob/main/llama-2-13b.Q5_K_M.gguf
I grabbed these unscientifically. The Q-value represents a degree of compression, or abstraction if you will, making the model files smaller but lossy compared to the less quantized versions. I picked a small model file to see how fast we can get tokens, and a big one to see if we can get high-quality responses. I'll eventually grab others to compare, for instance, different Q-values of the 13b.
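For reference, staging a model onto each node's local storage can be done over Slurm. A rough sketch, assuming the file was first downloaded somewhere on the shared filesystem (the /clusterfs/downloads path is just an illustration):

```bash
# Copy a model from the shared filesystem to local storage on every node,
# so each MPI rank reads from local disk instead of hammering the NFS share
srun --nodes=3 --ntasks-per-node=1 mkdir -p /home/cameron/models
srun --nodes=3 --ntasks-per-node=1 cp /clusterfs/downloads/llama-2-13b.Q5_K_M.gguf /home/cameron/models/
```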
Make a hostfile to pass in the MPI host configuration. This diverges from Garrett Mills' guide to MPI, but I think it's one way in which llama.cpp implements MPI as a secondary priority: instead of Slurm defining our MPI environment, we have to nail it down at invocation with hosts and slots. The hostnames depend on the hostname entries we configured earlier; otherwise enter an IP or FQDN.
Hostfile:
```
rp4n0 slots=1
rp4n1 slots=1
rp4n2 slots=1
```
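Before burning half an hour on a real run, a quick way to confirm the hostfile (and SSH between nodes) is working is to have mpirun do something trivial on every host:

```bash
# Should print each node's hostname once; if this hangs or errors,
# fix SSH/hostfile issues before trying llama.cpp over MPI
mpirun -hostfile /clusterfs/Projects/llama.cpp/hostfile -n 3 hostname
```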
Slurm job:
#!/bin/bash
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=4
cd $SLURM_SUBMIT_DIR
echo "Master node: $(hostname)"
mpirun -hostfile /clusterfs/Projects/llama.cpp/hostfile -n 3 /clusterfs/Projects/llama.cpp/main -m /home/cameron/models/llama-2-13b.Q5_K_M.gguf -p "Raspberry Pi computers are " -n 128
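I saved that as a script in the project directory and submitted it with sbatch; the filename here is my own choice, to match sub_65b.sh below:

```bash
cd /clusterfs/Projects/llama.cpp
sbatch sub_13b.sh   # Slurm writes stdout to slurm-<jobid>.out in this directory
squeue              # check that the job is running across the allocated nodes
```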
2261.08 ms per token, 0.44 tokens per second
Slurm output:
cat slurm-21.out
Master node: rp4n0
main: build = 975 (9ca4abe)
main: seed = 1692035168
main: build = 975 (9ca4abe)
main: seed = 1692035168
main: build = 975 (9ca4abe)
main: seed = 1692035168
llama.cpp: loading model from /home/cameron/models/llama-13b.ggmlv3.q4_0.bin
llama.cpp: loading model from /home/cameron/models/llama-13b.ggmlv3.q4_0.bin
llama.cpp: loading model from /home/cameron/models/llama-13b.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_head_kv = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.11 MB
llama_model_load_internal: mem required = 6983.72 MB (+ 400.00 MB per state)
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_head_kv = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.11 MB
llama_model_load_internal: mem required = 6983.72 MB (+ 400.00 MB per state)
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 5120
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 40
llama_model_load_internal: n_head_kv = 40
llama_model_load_internal: n_layer = 40
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 13824
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 0.11 MB
llama_model_load_internal: mem required = 6983.72 MB (+ 400.00 MB per state)
llama_new_context_with_model: kv self size = 400.00 MB
llama_new_context_with_model: compute buffer total size = 75.35 MB
llama_new_context_with_model: kv self size = 400.00 MB
llama_new_context_with_model: compute buffer total size = 75.35 MB
llama_new_context_with_model: kv self size = 400.00 MB
llama_new_context_with_model: compute buffer total size = 75.35 MB
system_info: n_threads = 4 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0
Raspberry Pi computers are 21st century classics
Raspberry Pi has become a classic in the world of computing, as it was one of the first single-board computers to make waves when it came out back in 2012. In fact, you can say that if Steve Jobs were alive today, he would probably buy hundreds of thousands of Raspberry Pi computers and give them out to people who are considered to be "at risk". The idea is that you'd use the computer for everything from social media to playing Minecraft.
This type of computing is known as “educational computing”, as it was
llama_print_timings: load time = 17766.29 ms
llama_print_timings: sample time = 264.42 ms / 128 runs ( 2.07 ms per token, 484.07 tokens per second)
llama_print_timings: prompt eval time = 10146.71 ms / 8 tokens ( 1268.34 ms per token, 0.79 tokens per second)
llama_print_timings: eval time = 287157.12 ms / 127 runs ( 2261.08 ms per token, 0.44 tokens per second)
llama_print_timings: total time = 297598.22 ms
Slurm job:
sub_65b.sh
#!/bin/bash
#SBATCH --nodes=3
#SBATCH --ntasks-per-node=4
cd $SLURM_SUBMIT_DIR
echo "Master node: $(hostname)"
mpirun -hostfile /clusterfs/Projects/llama.cpp/hostfile -n 3 /clusterfs/Projects/llama.cpp/main -m /home/cameron/models/llama-65b.ggmlv3.q4_0.bin -p "Why do we have to sleep?" -n 128
586930.81 ms per token, 0.00 tokens per second
Running and output ( 16 Aug 10:20 ):
cameron@rp3n0:/clusterfs/Projects$ sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
rp4* up infinite 3 alloc rp4n[0-2]
cameron@rp3n0:/clusterfs/Projects$ cat slurm-22.out
Master node: rp4n0
main: build = 975 (9ca4abe)
main: seed = 1692118354
main: build = 975 (9ca4abe)
main: seed = 1692118354
main: build = 975 (9ca4abe)
main: seed = 1692118354
llama.cpp: loading model from /home/cameron/models/llama-65b.ggmlv3.q4_0.bin
llama.cpp: loading model from /home/cameron/models/llama-65b.ggmlv3.q4_0.bin
llama.cpp: loading model from /home/cameron/models/llama-65b.ggmlv3.q4_0.bin
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 8192
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 64
llama_model_load_internal: n_head_kv = 64
llama_model_load_internal: n_layer = 80
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 22016
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size = 0.21 MB
llama_model_load_internal: mem required = 35026.50 MB (+ 1280.00 MB per state)
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 8192
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 64
llama_model_load_internal: n_head_kv = 64
llama_model_load_internal: n_layer = 80
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 22016
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size = 0.21 MB
llama_model_load_internal: mem required = 35026.50 MB (+ 1280.00 MB per state)
llama_model_load_internal: format = ggjt v3 (latest)
llama_model_load_internal: n_vocab = 32000
llama_model_load_internal: n_ctx = 512
llama_model_load_internal: n_embd = 8192
llama_model_load_internal: n_mult = 256
llama_model_load_internal: n_head = 64
llama_model_load_internal: n_head_kv = 64
llama_model_load_internal: n_layer = 80
llama_model_load_internal: n_rot = 128
llama_model_load_internal: n_gqa = 1
llama_model_load_internal: rnorm_eps = 5.0e-06
llama_model_load_internal: n_ff = 22016
llama_model_load_internal: freq_base = 10000.0
llama_model_load_internal: freq_scale = 1
llama_model_load_internal: ftype = 2 (mostly Q4_0)
llama_model_load_internal: model size = 65B
llama_model_load_internal: ggml ctx size = 0.21 MB
llama_model_load_internal: mem required = 35026.50 MB (+ 1280.00 MB per state)
llama_new_context_with_model: kv self size = 1280.00 MB
llama_new_context_with_model: compute buffer total size = 119.35 MB
system_info: n_threads = 4 / 4 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | VSX = 0 |
sampling: repeat_last_n = 64, repeat_penalty = 1.100000, presence_penalty = 0.000000, frequency_penalty = 0.000000, top_k = 40, tfs_z = 1.000000, top_p = 0.950000, typical_p = 1.000000, temp = 0.800000, mirostat = 0, mirostat_lr = 0.100000, mirostat_ent = 5.000000
generate: n_ctx = 512, n_batch = 512, n_predict = 128, n_keep = 0
llama_new_context_with_model: kv self size = 1280.00 MB
llama_new_context_with_model: compute buffer total size = 119.35 MB
llama_new_context_with_model: kv self size = 1280.00 MB
llama_new_context_with_model: compute buffer total size = 119.35 MB
Why do we have to sleep? That’s something that researchers are still trying to figure out. What they do know, however, is how a lack of quality sleep can affect your health and well-being. This includes more than just feeling tired the next day!
Without enough sleep, you may be at greater risk for heart disease and type 2 diabetes, as well as obesity. And while most people associate being drowsy with safety concerns such as driving while fatigued (see below), a new study has found that sleep deprivation may even put nurses’ patients in jeopardy!
The
llama_print_timings: load time = 1446192.69 ms
llama_print_timings: sample time = 281.36 ms / 128 runs ( 2.20 ms per token, 454.93 tokens per second)
llama_print_timings: prompt eval time = 592147.39 ms / 8 tokens (74018.42 ms per token, 0.01 tokens per second)
llama_print_timings: eval time = 74540213.30 ms / 127 runs (586930.81 ms per token, 0.00 tokens per second)
llama_print_timings: total time = 75132679.76 ms
To Do:
- Move all nodes to 1Gbit network
- Compare advanced math libraries' performance variation - we don't have any AVX extensions, but we could use BLAS, perhaps, to eke out a little more performance (see the sketch after this list)
- Add Jetson Nano and VIM3 nodes, which requires some version mindfulness:
  - Jetson Nano base OS images are still on 18.04, but https://github.com/Qengineering/Jetson-Nano-Ubuntu-20-image exists
  - VIM3 images are on 20.04
  - I could build from source on both, or rebase the Pis to 20.04 and only build the Nano from source, but I'd rather keep it uniform. ( the scheduler has to be on the same version, so I can't run the weirdies as a partition unto themselves controlled by the same controller as the others - I tried )
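On the BLAS point, an untested sketch of what that rebuild might look like; the libopenblas-dev package name and the LLAMA_OPENBLAS flag are assumptions based on the Debian repositories and llama.cpp's Makefile options from around this build:

```bash
# Untested: install OpenBLAS on every node, then rebuild with both MPI and BLAS enabled
srun --nodes=3 apt install libopenblas-dev -y
cd /clusterfs/Projects/llama.cpp
make clean
make CC=mpicc CXX=mpicxx LLAMA_MPI=1 LLAMA_OPENBLAS=1
```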