[Open-to-community] Benchmark bloomz.cpp on different hardware #4

Vaibhavs10 · 2023-03-15T18:13:38Z

Hey hey,

We are working hard to help you unlock the true potential of open-source LLMs. To build better and cater to the majority of hardware, we need your help running benchmarks with bloomz.cpp 🤗

We are looking for the following information:

Hardware information (CPU / RAM / GPU / Threads)
Inference time (time per token)
Memory use

You can do so by following the quickstart steps in the project's README. 💯
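For anyone who wants to script the report, here is a minimal sketch that gathers the hardware details and measures peak memory for a run. It is not part of bloomz.cpp: the helper names `hardware_summary` and `peak_rss_mib` are made up for illustration, it only uses the Python standard library, and the `resource` module is POSIX-only (so it will not work on native Windows). It also assumes `ru_maxrss` is reported in KiB on Linux/FreeBSD and in bytes on macOS.

```python
import os
import platform
import resource
import subprocess
import sys


def hardware_summary():
    """Collect basic machine details for a benchmark report."""
    return {
        "system": platform.system(),        # e.g. Linux, Darwin, FreeBSD
        "machine": platform.machine(),      # e.g. x86_64, arm64
        "processor": platform.processor(),  # may be empty on some systems
        "logical_cpus": os.cpu_count(),     # logical CPU count, may be None
    }


def peak_rss_mib(cmd):
    """Run cmd and return the peak resident set size of child processes in MiB.

    Assumption: ru_maxrss is in KiB on Linux/FreeBSD and in bytes on macOS.
    The value covers all children waited for so far, so call this once per script.
    """
    subprocess.run(cmd, check=True)
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


for key, value in hardware_summary().items():
    print(f"{key}: {value}")
```

To measure a real run, pass the command as a list, for example `peak_rss_mib(["./main", "-m", "models/ggml-model-bloomz-7b1-f16-q4_0.bin", "-p", "Je vais", "-t", "8", "-n", "256"])`, adjusting the model path to whatever you built.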

Ping @NouamaneTazi and @Vaibhavs10 if you have any questions! <3

Happy benchmarking! 🚀

eschaffn · 2023-03-16T02:36:13Z

Is it possible to run this on Windows?

NouamaneTazi · 2023-03-16T07:45:34Z

Good point!
It should be possible with the latest modifications done in llama.cpp. We still need to pull those to this repo.

Feel free to open a PR for that if you'd like @eschaffn 🚀

lapo-luchini · 2023-03-16T18:09:14Z

I didn't expect conversion to need 22 GiB RAM (running on Win64 native python3.11).
I just barely managed. 😅

Quantization used more or less 10 GiB RAM running on WSL Ubuntu / gcc-9.4.0.

bloom_model_quantize: model size  = 30886.16 MB
bloom_model_quantize: quant size  =  4831.16 MB
bloom_model_quantize: hist: 0.000 0.022 0.018 0.031 0.050 0.075 0.102 0.129 0.152 0.128 0.102 0.074 0.049 0.031 0.018 0.021

main: quantize time = 203633.06 ms
main:    total time = 203633.06 ms
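To put the quantization numbers in perspective, a quick back-of-the-envelope sketch using only the figures printed in the log above (the variable names are just for illustration):

```python
# Sizes in MB and time in ms, copied from the bloom_model_quantize log above.
model_size_mb = 30886.16
quant_size_mb = 4831.16
quantize_time_ms = 203633.06

ratio = model_size_mb / quant_size_mb
saved_fraction = 1 - quant_size_mb / model_size_mb
quantize_minutes = quantize_time_ms / 60_000

print(f"compression ratio: {ratio:.2f}x")
print(f"size reduction:    {saved_fraction:.1%}")
print(f"quantize time:     {quantize_minutes:.1f} min")
```

So the q4_0 file is roughly 6.4 times smaller than the size the quantizer reports for the input, and the whole step took a bit under three and a half minutes on this machine.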

Executed at around 5.5 tokens/s on an AMD Ryzen 5 3600:

% make && ./main -m models/ggml-model-bloomz-7b1-f16-q4_0.bin -p 'Translate "Hi, how are you?" in French:' -t 8 -n 256
I llama.cpp build info:
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -mavx -mavx2 -mfma -mf16c -msse3
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS:
I CC:       cc (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
I CXX:      g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0

make: Nothing to be done for 'default'.
main: seed = 1678990037
bloom_model_load: loading model from 'models/ggml-model-bloomz-7b1-f16-q4_0.bin' - please wait ...
bloom_model_load: n_vocab = 250880
bloom_model_load: n_ctx   = 512
bloom_model_load: n_embd  = 4096
bloom_model_load: n_mult  = 1
bloom_model_load: n_head  = 32
bloom_model_load: n_layer = 30
bloom_model_load: f16     = 2
bloom_model_load: n_ff    = 16384
bloom_model_load: n_parts = 1
bloom_model_load: ggml ctx size = 5312.64 MB
bloom_model_load: memory_size =   480.00 MB, n_mem = 15360
bloom_model_load: loading model part 1/1 from 'models/ggml-model-bloomz-7b1-f16-q4_0.bin'
bloom_model_load: ............................................. done
bloom_model_load: model size =  4831.16 MB / num tensors = 366

main: prompt: 'Translate "Hi, how are you?" in French:'
main: number of tokens in prompt = 11
153772 -> 'Translate'
 17959 -> ' "H'
    76 -> 'i'
 98257 -> ', '
 20263 -> 'how'
  1306 -> ' are'
  1152 -> ' you'
  2040 -> '?'
     5 -> '"'
   361 -> ' in'
196427 -> ' French:'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


Translate "Hi, how are you?" in French: Bonjour, comment vas-tu?</s> [end of text]


main: mem per token = 24017564 bytes
main:     load time =  8464.89 ms
main:   sample time =   180.49 ms
main:  predict time =  3074.70 ms / 180.86 ms per token
main:    total time = 12601.97 ms
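The 5.5 tokens/s figure follows from the `predict time` line: 1000 ms divided by 180.86 ms per token. A small sketch for pulling that out of a saved log, assuming the line format shown above (the function name `tokens_per_second` is hypothetical):

```python
import re

# Matches e.g. "main:  predict time =  3074.70 ms / 180.86 ms per token"
PREDICT_RE = re.compile(
    r"predict time\s*=\s*([\d.]+)\s*ms\s*/\s*([\d.]+)\s*ms per token"
)


def tokens_per_second(log_text):
    """Return tokens/s computed from the 'ms per token' value in a log."""
    match = PREDICT_RE.search(log_text)
    if match is None:
        raise ValueError("no 'predict time' line found")
    ms_per_token = float(match.group(2))
    return 1000.0 / ms_per_token


line = "main:  predict time =  3074.70 ms / 180.86 ms per token"
print(f"{tokens_per_second(line):.2f} tokens/s")
```

The same function works on the full log text, since it searches for the first `predict time` line.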

lapo-luchini · 2023-03-16T19:13:31Z

FreeBSD 13 on Intel i7-3770 CPU @ 3.40GHz:
(I had to remove parameters or it would just crash)

% gmake && ./main -m models/ggml-model-bloomz-7b1-f16-q4_0.bin -p 'Translate "Hi, how are you?" in French:' -t 8 -n 256
I llama.cpp build info:
I UNAME_S:  FreeBSD
I UNAME_P:  amd64
I UNAME_M:  amd64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS:
I CC:       FreeBSD clang version 13.0.0 (git@github.com:llvm/llvm-project.git llvmorg-13.0.0-0-gd7b669b3a303)
I CXX:      FreeBSD clang version 13.0.0 (git@github.com:llvm/llvm-project.git llvmorg-13.0.0-0-gd7b669b3a303)

gmake: Nothing to be done for 'default'.
main: seed = 1678993867
bloom_model_load: loading model from 'models/ggml-model-bloomz-7b1-f16-q4_0.bin' - please wait ...
bloom_model_load: n_vocab = 250880
bloom_model_load: n_ctx   = 512
bloom_model_load: n_embd  = 4096
bloom_model_load: n_mult  = 1
bloom_model_load: n_head  = 32
bloom_model_load: n_layer = 30
bloom_model_load: f16     = 2
bloom_model_load: n_ff    = 16384
bloom_model_load: n_parts = 1
bloom_model_load: ggml ctx size = 5312.64 MB
bloom_model_load: memory_size =   480.00 MB, n_mem = 15360
bloom_model_load: loading model part 1/1 from 'models/ggml-model-bloomz-7b1-f16-q4_0.bin'
bloom_model_load: ............................................. done
bloom_model_load: model size =  4831.16 MB / num tensors = 366

main: prompt: 'Translate "Hi, how are you?" in French:'
main: number of tokens in prompt = 11
153772 -> 'Translate'
 17959 -> ' "H'
    76 -> 'i'
 98257 -> ', '
 20263 -> 'how'
  1306 -> ' are'
  1152 -> ' you'
  2040 -> '?'
     5 -> '"'
   361 -> ' in'
196427 -> ' French:'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


Translate "Hi, how are you?" in French: comment vas-tu? "</s> [end of text]


main: mem per token = 24017564 bytes
main:     load time = 25791.55 ms
main:   sample time =   204.70 ms
main:  predict time = 50346.14 ms / 3146.63 ms per token
main:    total time = 89922.15 ms
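As a side note on the `memory_size = 480.00 MB, n_mem = 15360` line in these logs: the numbers can be reproduced from the printed hyperparameters. The sketch below assumes the key/value cache stores one key and one value vector per layer and context position as 32-bit floats, and that "MB" here means 1024 × 1024 bytes. Neither assumption is stated in the log, but together they give exactly the logged values.

```python
# Hyperparameters as printed by bloom_model_load in the logs above.
n_layer = 30
n_ctx = 512
n_embd = 4096

BYTES_PER_ELEMENT = 4  # assumption: float32 cache entries

n_mem = n_layer * n_ctx
cache_bytes = 2 * n_mem * n_embd * BYTES_PER_ELEMENT  # factor 2: keys and values
cache_mib = cache_bytes / (1024 * 1024)

print(f"n_mem = {n_mem}")
print(f"memory_size = {cache_mib:.2f} MB")
```

This also roughly accounts for the gap between `ggml ctx size` (5312.64 MB) and `model size` (4831.16 MB) in the q4_0 log, which is 481.48 MB.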

eschaffn · 2023-03-17T01:18:49Z

Intel I9-13900KS
Nvidia RTX 4090
I gave WSL 28GB of RAM and 50% Disk as swap

Running on Windows 10 with WSL 2 Ubuntu with CUDA

main: seed = 1679015450
bloom_model_load: loading model from './models/ggml-model-bloomz-7b1-f16.bin' - please wait ...
bloom_model_load: n_vocab = 250880
bloom_model_load: n_ctx   = 512
bloom_model_load: n_embd  = 4096
bloom_model_load: n_mult  = 1
bloom_model_load: n_head  = 32
bloom_model_load: n_layer = 30
bloom_model_load: f16     = 1
bloom_model_load: n_ff    = 16384
bloom_model_load: n_parts = 1
bloom_model_load: ggml ctx size = 15927.64 MB
bloom_model_load: memory_size =   480.00 MB, n_mem = 15360
bloom_model_load: loading model part 1/1 from './models/ggml-model-bloomz-7b1-f16.bin'
bloom_model_load: ............................................. done
bloom_model_load: model size = 15446.16 MB / num tensors = 366

main: prompt: 'Je vais'
main: number of tokens in prompt = 2
  5830 -> 'Je'
 17935 -> ' vais'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


Je vais maintenant discuter de quelques des propriétés digitales</s> [end of text]


main: mem per token = 24017564 bytes
main:     load time = 237397.50 ms
main:   sample time =   112.54 ms
main:  predict time =  1832.55 ms / 203.62 ms per token
main:    total time = 240065.61 ms

After quantization:

main: seed = 1679016169
bloom_model_load: loading model from './models/ggml-model-bloomz-7b1-f16-q4_0.bin' - please wait ...
bloom_model_load: n_vocab = 250880
bloom_model_load: n_ctx   = 512
bloom_model_load: n_embd  = 4096
bloom_model_load: n_mult  = 1
bloom_model_load: n_head  = 32
bloom_model_load: n_layer = 30
bloom_model_load: f16     = 2
bloom_model_load: n_ff    = 16384
bloom_model_load: n_parts = 1
bloom_model_load: ggml ctx size = 5312.64 MB
bloom_model_load: memory_size =   480.00 MB, n_mem = 15360
bloom_model_load: loading model part 1/1 from './models/ggml-model-bloomz-7b1-f16-q4_0.bin'
bloom_model_load: ............................................. done
bloom_model_load: model size =  4831.16 MB / num tensors = 366

main: prompt: 'Je vais'
main: number of tokens in prompt = 2
  5830 -> 'Je'
 17935 -> ' vais'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


Je vais supposer que je veux poser a few queries on the server.</s> [end of text]


main: mem per token = 24017564 bytes
main:     load time =  1892.14 ms
main:   sample time =   178.52 ms
main:  predict time =  1956.12 ms / 139.72 ms per token
main:    total time =  4616.26 ms

itakafu · 2023-03-18T00:02:31Z

Tried on my 14-inch MacBook Pro (M2 Max, 96GB memory) running macOS Ventura 13.2.1!

(ml) ~/W/bloomz.cpp ❯❯❯ make && ./main -m models/ggml-model-bloomz-7b1-f16.bin  -p 'Translate "Hi, how are you?" in French:' -t 8 -n 256
I llama.cpp build info:
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.0 (clang-1400.0.29.202)
I CXX:      Apple clang version 14.0.0 (clang-1400.0.29.202)

make: Nothing to be done for `default'.
main: seed = 1679097305
bloom_model_load: loading model from 'models/ggml-model-bloomz-7b1-f16.bin' - please wait ...
bloom_model_load: n_vocab = 250880
bloom_model_load: n_ctx   = 512
bloom_model_load: n_embd  = 4096
bloom_model_load: n_mult  = 1
bloom_model_load: n_head  = 32
bloom_model_load: n_layer = 30
bloom_model_load: f16     = 1
bloom_model_load: n_ff    = 16384
bloom_model_load: n_parts = 1
bloom_model_load: ggml ctx size = 15927.64 MB
bloom_model_load: memory_size =   480.00 MB, n_mem = 15360
bloom_model_load: loading model part 1/1 from 'models/ggml-model-bloomz-7b1-f16.bin'
bloom_model_load: ............................................. done
bloom_model_load: model size = 15446.16 MB / num tensors = 366

main: prompt: 'Translate "Hi, how are you?" in French:'
main: number of tokens in prompt = 11
153772 -> 'Translate'
 17959 -> ' "H'
    76 -> 'i'
 98257 -> ', '
 20263 -> 'how'
  1306 -> ' are'
  1152 -> ' you'
  2040 -> '?'
     5 -> '"'
   361 -> ' in'
196427 -> ' French:'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


Translate "Hi, how are you?" in French: Bonjour!</s> [end of text]


main: mem per token = 24017564 bytes
main:     load time =  4650.54 ms
main:   sample time =    66.74 ms
main:  predict time =   730.79 ms / 56.21 ms per token
main:    total time =  5695.71 ms

After quantization:

(ml) ~/W/bloomz.cpp ❯❯❯ make && ./main -m models/ggml-model-bloomz-7b1-f16-q4_0.bin  -p 'Translate "Hi, how are you?" in French:' -t 8 -n 256
I llama.cpp build info:
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.0 (clang-1400.0.29.202)
I CXX:      Apple clang version 14.0.0 (clang-1400.0.29.202)

make: Nothing to be done for `default'.
main: seed = 1679097027
bloom_model_load: loading model from 'models/ggml-model-bloomz-7b1-f16-q4_0.bin' - please wait ...
bloom_model_load: n_vocab = 250880
bloom_model_load: n_ctx   = 512
bloom_model_load: n_embd  = 4096
bloom_model_load: n_mult  = 1
bloom_model_load: n_head  = 32
bloom_model_load: n_layer = 30
bloom_model_load: f16     = 2
bloom_model_load: n_ff    = 16384
bloom_model_load: n_parts = 1
bloom_model_load: ggml ctx size = 5312.64 MB
bloom_model_load: memory_size =   480.00 MB, n_mem = 15360
bloom_model_load: loading model part 1/1 from 'models/ggml-model-bloomz-7b1-f16-q4_0.bin'
bloom_model_load: ............................................. done
bloom_model_load: model size =  4831.16 MB / num tensors = 366

main: prompt: 'Translate "Hi, how are you?" in French:'
main: number of tokens in prompt = 11
153772 -> 'Translate'
 17959 -> ' "H'
    76 -> 'i'
 98257 -> ', '
 20263 -> 'how'
  1306 -> ' are'
  1152 -> ' you'
  2040 -> '?'
     5 -> '"'
   361 -> ' in'
196427 -> ' French:'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


Translate "Hi, how are you?" in French: Comment vas-tu?</s> [end of text]


main: mem per token = 24017564 bytes
main:     load time =  1545.36 ms
main:   sample time =   117.36 ms
main:  predict time =   738.26 ms / 49.22 ms per token
main:    total time =  2709.12 ms
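To make the results so far easier to compare, here is a rough sketch that tabulates the `ms per token` values from the logs in this thread and converts them to tokens/s. The machine labels are shorthand and the helper names are hypothetical. The runs are only loosely comparable: prompts and generated token counts differ, and the i9-13900KS runs used the prompt 'Je vais'.

```python
# (machine, model format, ms per token) copied from the "predict time" lines above.
runs = [
    ("AMD Ryzen 5 3600", "q4_0", 180.86),
    ("Intel i7-3770", "q4_0", 3146.63),
    ("Intel i9-13900KS", "f16", 203.62),
    ("Intel i9-13900KS", "q4_0", 139.72),
    ("Apple M2 Max", "f16", 56.21),
    ("Apple M2 Max", "q4_0", 49.22),
]


def tokens_per_second(ms_per_token):
    return 1000.0 / ms_per_token


def q4_speedups(rows):
    """Per-token speedup of q4_0 over f16 for machines that reported both."""
    by_key = {(machine, fmt): ms for machine, fmt, ms in rows}
    return {
        machine: by_key[(machine, "f16")] / by_key[(machine, "q4_0")]
        for (machine, fmt) in by_key
        if fmt == "f16" and (machine, "q4_0") in by_key
    }


# Fastest first.
for machine, fmt, ms in sorted(runs, key=lambda row: row[2]):
    print(f"{machine:<18} {fmt:<5} {ms:>8.2f} ms/token {tokens_per_second(ms):>6.2f} tokens/s")

for machine, speedup in q4_speedups(runs).items():
    print(f"{machine}: q4_0 is {speedup:.2f}x faster per token than f16")
```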

barsuna mentioned this issue Mar 26, 2023

Quantization doesn't work with Bloomz 176B #14

Open
