
Quantization doesn't work with Bloomz 176B #14

Open
agemagician opened this issue Mar 18, 2023 · 10 comments

@agemagician

Hello,

I have successfully converted the bloomz 176B model to fp16.
However, quantization doesn't work and throws an error:

./quantize ./models/ggml-model-bloomz-f16.bin ./models/ggml-model-bloomz-f16-q4_0.bin 2
bloom_model_quantize: loading model from './models/ggml-model-bloomz-f16.bin'
bloom_model_quantize: n_vocab = 250880
bloom_model_quantize: n_ctx   = 512
bloom_model_quantize: n_embd  = 14336
bloom_model_quantize: n_mult  = 1
bloom_model_quantize: n_head  = 112
bloom_model_quantize: n_layer = 70
bloom_model_quantize: f16     = 1
terminate called after throwing an instance of 'std::length_error'
  what():  vector::_M_default_append
Aborted (core dumped)

Any idea how this could be fixed?

@agemagician agemagician changed the title Quantization doesn't work with Bloomz Quantization doesn't work with Bloomz 176B Mar 18, 2023
@ZhangYunchenY

Same question.

@NouamaneTazi
Owner

NouamaneTazi commented Mar 22, 2023

Unfortunately I need more details than that :/
Did you try other models like 7B1, and do they work? Do you only get this problem with 176B?

@ZhangYunchenY

Yes, it works for BLOOMZ-560M and BLOOMZ-7B1. I get the same error shown in @agemagician's message.

@NouamaneTazi
Owner

Oh, seeing #15, it seems you have already solved this issue? What was the problem? @agemagician

@ZhangYunchenY

ZhangYunchenY commented Mar 22, 2023

> Oh, seeing #15, it seems you have already solved this issue? What was the problem? @agemagician

I cannot run inference on the 176B FP16 model, even though I have 1 TB of RAM. I get the same error message as @agemagician in #15. It works for 560M and 7B1.

@agemagician
Author

#15 is about running the fp16 model, not the 4-bit model.

@barsuna

barsuna commented Mar 26, 2023

Here is where it seems to crash...

$ g++ -I. -I./examples -g -std=c++11 -fPIC -pthread quantize.cpp ggml.o utils.o -o quantize
$ gdb --args ./quantize ./models/bloom/ggml-model-bloom-f16.bin ./models/bloom/ggml-model-bloomz-f16-q4_0.bin 2
Reading symbols from ./quantize...
(gdb)
(gdb) list 190
185 if (ftype != 0 && ftype != 1) {
186 fprintf(stderr, "%s: unsupported ftype %d for integer quantization\n", func, ftype);
187 return false;
188 }
189
190 if (ftype == 1) {
191 data_f16.resize(nelements);
192 finp.read(reinterpret_cast<char *>(data_f16.data()), nelements * sizeof(ggml_fp16_t));
193 data_f32.resize(nelements);
194 for (int i = 0; i < nelements; ++i) {
(gdb) break 190
Breakpoint 1 at 0x7796: file quantize.cpp, line 190.
(gdb) run
Starting program: ./quantize ./models/bloom/ggml-model-bloom-f16.bin ./models/bloom/ggml-model-bloomz-f16-q4_0.bin 2
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
bloom_model_quantize: loading model from './models/bloom/ggml-model-bloom-f16.bin'
bloom_model_quantize: n_vocab = 250880
bloom_model_quantize: n_ctx = 512
bloom_model_quantize: n_embd = 14336
bloom_model_quantize: n_mult = 1
bloom_model_quantize: n_head = 112
bloom_model_quantize: n_layer = 70
bloom_model_quantize: f16 = 1

Breakpoint 1, bloom_model_quantize (fname_inp="./models/bloom/ggml-model-bloom-f16.bin", fname_out="./models/bloom/ggml-model-bloomz-f16-q4_0.bin", itype=2) at quantize.cpp:190
190 if (ftype == 1) {
(gdb) next
191 data_f16.resize(nelements);
(gdb) frame
#0 bloom_model_quantize (fname_inp="./models/bloom/ggml-model-bloom-f16.bin", fname_out="./models/bloom/ggml-model-bloomz-f16-q4_0.bin", itype=2) at quantize.cpp:191
191 data_f16.resize(nelements);
(gdb) p nelements
$1 = -698351616
(gdb) p data_f16
$2 = std::vector of length 0, capacity 0
(gdb) next
terminate called after throwing an instance of 'std::length_error'
what(): vector::_M_default_append

Program received signal SIGABRT, Aborted.
__GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
50 ../sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb)
(gdb) where
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007ffff7a77859 in __GI_abort () at abort.c:79
#2 0x00007ffff7e72911 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#3 0x00007ffff7e7e38c in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#4 0x00007ffff7e7e3f7 in std::terminate() () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#5 0x00007ffff7e7e6a9 in __cxa_throw () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#6 0x00007ffff7e75326 in std::__throw_length_error(char const*) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#7 0x0000555555560da7 in std::vector<unsigned short, std::allocator<unsigned short> >::_M_check_len (this=0x7fffffffd800, __n=18446744073011200000,
__s=0x5555555b9454 "vector::_M_default_append") at /usr/include/c++/7/bits/stl_vector.h:1505
#8 0x000055555555eecc in std::vector<unsigned short, std::allocator<unsigned short> >::_M_default_append (this=0x7fffffffd800, __n=18446744073011200000)
at /usr/include/c++/7/bits/vector.tcc:568
#9 0x000055555555dd81 in std::vector<unsigned short, std::allocator<unsigned short> >::resize (this=0x7fffffffd800, __new_size=18446744073011200000)
at /usr/include/c++/7/bits/stl_vector.h:692
#10 0x000055555555b7c0 in bloom_model_quantize (fname_inp="./models/bloom/ggml-model-bloom-f16.bin", fname_out="./models/bloom/ggml-model-bloomz-f16-q4_0.bin", itype=2)
at quantize.cpp:191
#11 0x000055555555c49f in main (argc=4, argv=0x7fffffffdf68) at quantize.cpp:316
(gdb)
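
The numbers line up with a 32-bit overflow. The 176B embedding tensor has 250880 x 14336 = 3,596,615,680 elements, which wraps a signed 32-bit int to -698,351,616 (the nelements printed above); converting that negative value to size_t inside resize() then gives 18,446,744,073,011,200,000, exactly the __n in the backtrace. A minimal standalone repro of the failure mode (hypothetical code that mirrors the resize call in quantize.cpp, not the project's actual source):

// repro.cpp -- reproduces the std::length_error seen above.
// Build: g++ -std=c++11 repro.cpp -o repro && ./repro
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    int64_t true_count = (int64_t)250880 * 14336;  // 3,596,615,680
    int32_t nelements  = (int32_t)true_count;      // wraps to -698,351,616
    printf("true count = %lld, int32 nelements = %d\n",
           (long long)true_count, nelements);

    std::vector<uint16_t> data_f16;                // ggml_fp16_t is 16 bits wide
    data_f16.resize(nelements);                    // negative int -> huge size_t,
                                                   // throws std::length_error
}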

@barsuna

barsuna commented Apr 2, 2023

Quantization for 176B works with this commit:
barsuna@2d0e478
Inference is also working.
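
The gist: the per-tensor element count overflows a 32-bit int, so it has to be counted in 64 bits before the resize. A simplified sketch of the idea (illustrative code, not the exact diff):

// Hypothetical helper mirroring the per-tensor element-count loop in
// quantize.cpp; accumulating in 64 bits avoids the int32 wraparound.
#include <cstdint>
#include <cstdio>

static uint64_t tensor_nelements(const int32_t *ne, int n_dims) {
    uint64_t n = 1;
    for (int i = 0; i < n_dims; ++i) {
        n *= (uint64_t)ne[i];
    }
    return n;
}

int main() {
    const int32_t ne[2] = {250880, 14336};  // 176B embedding shape
    uint64_t nelements = tensor_nelements(ne, 2);
    printf("%llu elements\n", (unsigned long long)nelements);  // 3596615680

    // With a 64-bit count, data_f16.resize(nelements) would request the
    // real size (~6.7 GiB of fp16 data) instead of throwing; the allocation
    // is left commented out so the example runs anywhere:
    // std::vector<uint16_t> data_f16(nelements);
}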

@linuxmagic-mp

Can we get a clearer status update? The README isn't clear on whether 176B quantization works now, I am still having a problem with it on bloom.cpp, and I'm not sure where things stand. Any word on whether the patches/fixes will make it into bloom.cpp?

@TheBloke

TheBloke commented Jul 8, 2023

Hi @barsuna

Thank you very much for making your fork to fix quantising with 176B. I recently quantised BloomZ 176B and Bloom Chat 176B to GPTQ and released them to the HF Hub, and today I wanted to do GGML as well. I hit the issue described in this thread, and your fork enabled me to quantise the models.

Unfortunately there appears to be an inference problem. I was wondering if you saw this too, and whether you have any idea what is wrong.

The issue is that it seems to leave words out, or skip over words. Here are some examples testing q4_0 with BloomChat 176B (the issue is the same with BloomZ 176B):

<human>: write a story about llamas\n<bot>: Once upon a time, in the land of Spain there were two small and fluffy creatures known as Llama's.  They lived happily together with their names were Fred and Wilma they way down. For many years,  The one day when the started having a great fun adventure traveling across mountainside through new lands discovering different cultures and new things, but upon arriving to them until 1 night stoped at an o nices or so called Peru.  They met a town that beforested hot them some there new friends along with this amazing of many llamas where they way side.

they road. The next day 2 ther  being
<human>: write a story about llamas\n<bot>: Once upon a time there were two llamas named Mac and Cheese who wanted to get out of their boring home in the farm. They heard some where they could find a new friends on an exciting place called city full of adventures.  The found a train ride.
</s> [end of text]
<human>:tell me about Paris\n<bot>: The City of Light
Paris, often simply known as Paris (UK: /ˈpaɪərz/;[2] US:z; French: [paʁi]), is the capital and most populous city of France. With a country which forms part of Île-de-France region on the northern الباطل Peninsula italy or Normandy layead)

[note also called Pasde lay in with overal-Paris, parisi/ (French pronunciation: [pajونٹیʁis i<sup>jɛ̃]), is capitalregion Paris;[3][4]) and often shortened Parigi basilica a latinu Seine
Paris claiments: Pari),[5] or just

The story outputs start coherent, but then it's like the text suddenly skips forward a few words. The Paris output is half coherent, half not, and again it looks like bits are missing.

Is there any chance you know what is wrong, or could look into fixing it? If so, I will be able to release 176B GGMLs to the HF Hub, and quite a few people would love to try them.

Thanks in advance.
