
FEAT: Adding 1.58bit LLMs training architecture in nanotron #180

Draft · MekkCyber wants to merge 4 commits into main
Conversation

MekkCyber

Implementation of a 1.58bit LLM with Llama, following the paper & handbook released by Microsoft:

https://github.com/microsoft/unilm/blob/master/bitnet/The-Era-of-1-bit-LLMs__Training_Tips_Code_FAQ.pdf
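For context, here is a minimal sketch of the absmean weight quantization and absmax activation quantization described in the handbook's reference snippets. The function names and structure are illustrative and do not necessarily match this PR's BitLinear code, although the `/ w_scale / x_scale` rescaling does show up in the diff discussed further down:

```python
import torch

def weight_quant(w: torch.Tensor):
    # Absmean scheme: scale by the mean absolute value, then
    # round-and-clip the weights to the ternary set {-1, 0, 1}.
    w_scale = 1.0 / w.abs().mean().clamp(min=1e-5)
    w_q = (w * w_scale).round().clamp(-1, 1)
    return w_q, w_scale

def activation_quant(x: torch.Tensor):
    # Absmax scheme: per-token 8-bit quantization over the last dimension.
    x_scale = 127.0 / x.abs().max(dim=-1, keepdim=True).values.clamp(min=1e-5)
    x_q = (x * x_scale).round().clamp(-128, 127)
    return x_q, x_scale

def bitlinear_matmul(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    # Quantize both operands, run the matmul, then undo the scaling,
    # mirroring the "/ w_scale / x_scale" rescaling seen in the diff below.
    x_q, x_scale = activation_quant(x)
    w_q, w_scale = weight_quant(weight)
    return (x_q @ w_q.t()) / w_scale / x_scale
```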

Here are the training results on 25B tokens:

[image: loss_curve]

cc @NouamaneTazi @xrsrke @thomwolf

xrsrke (Member) commented May 23, 2024

Hello. Thanks for the PR. One question: the difference in loss here seems very high. In the paper it should be ~0.1, but here the difference is more than 0.5:

[image]

MekkCyber (Author) commented May 23, 2024

I think it has to do with the batch size. During our latest experiment, we trained the 1.58bit model on 100B tokens, and we managed to get a 2.8 loss after 25B tokens with a batch size of 1024:
[image: lr]
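As a rough sanity check on those numbers, with a hypothetical sequence length of 2048 (the actual sequence length is not stated in this thread), a batch size of 1024 works out to about 2M tokens per optimizer step:

```python
# Back-of-the-envelope: steps needed to reach the 25B-token mark above.
batch_size = 1024            # sequences per step (from the comment above)
seq_len = 2048               # hypothetical; not stated in this thread
tokens_per_step = batch_size * seq_len           # 2,097,152 (~2.1M) tokens
steps_to_25b_tokens = 25_000_000_000 // tokens_per_step
print(tokens_per_step, steps_to_25b_tokens)      # 2097152 11920
```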

MekkCyber changed the title from "Fixes : https://github.com/huggingface/nanotron/issues/114" to "FEAT: Adding 1.58bit LLMs training protocol in nanotron" on Jun 12, 2024
MekkCyber changed the title from "FEAT: Adding 1.58bit LLMs training protocol in nanotron" to "FEAT: Adding 1.58bit LLMs training architecture in nanotron" on Jun 12, 2024
    ) / w_scale / x_scale
else:
    w = self.weight
x_norm = normalize(x, self.in_features)
gau-nernst commented Sep 21, 2024

Shouldn't RMSNorm here have learnable weights?


@MekkCyber May I know why this is marked as resolved? From my understanding of the training tips handbook, the new RMSNorm should have learnable weights, as the usual RMSNorm layers do. Is there a reason you left it out here (as well as in the PR merged in huggingface/transformers)?
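For reference, this is the kind of layer the comment is describing: a Llama-style RMSNorm with a learnable per-channel weight. This is a generic illustration, not the code in this PR:

```python
import torch
from torch import nn

class RMSNormWithWeight(nn.Module):
    # Llama-style RMSNorm: normalize by the root-mean-square over the hidden
    # dimension, then rescale with a learnable per-channel gain.
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))  # learnable gain
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        variance = x.float().pow(2).mean(dim=-1, keepdim=True)
        x_norm = x.float() * torch.rsqrt(variance + self.eps)
        return (self.weight * x_norm).to(x.dtype)
```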

hjc3613 commented Oct 29, 2024

Hi, I fetched this PR and tried to fine-tune Llama 70B using tp=8 or pp=8. Before training, I converted Llama 70B into nanotron format using the method from #174, with pp=1, dp=1, tp=1. But when I start training with pp=1, dp=1, tp=8, I get this error:
[image: error traceback]
When training with pp=1, dp=1, tp=1, which is consistent with the convert config, I run out of memory (OOM).
So how can I fine-tune the 70B model to 1.58bit from scratch? Could you give me some suggestions, please? Thank you!

MekkCyber (Author)
Hey @hjc3613, thanks for the report! You don't have to be consistent with the convert config; it should work, so I will investigate. Can you tell me, on your side, what the content of models/qwen2.5-72b-instruct-nanotron/model/model/decoder/0/pp_block/MLPBitNet is?

hjc3613 commented Oct 30, 2024


Thank you very much!
My requirement is to successfully fine-tune a 70B model to a 1.58bit model, using either Qwen2.5 70B or Llama 70B; both have almost identical architectures. According to the training code, I need to provide a dataset and a base model in nanotron format, which will then be fine-tuned to 1.58bit for inference. However, I only have an HF version of the base model, so following #174, I first convert it to nanotron format. Before conversion, I need to set dp=1, pp=1, tp=1. After conversion, I use the code from this PR (#180) for training. This is what I'm currently doing; if there are any unreasonable or unnecessary steps, please advise. Thank you sincerely. To my understanding, a 70B model must have pp>1 or tp>1, as a single GPU cannot hold the complete model.
