
FEAT: Adding 1.58bit LLMs training architecture in nanotron #180

Draft · MekkCyber wants to merge 4 commits into main
Conversation

MekkCyber

Implementation of a 1.58bit LLM with Llama, following the paper & handbook released by Microsoft:

https://github.com/microsoft/unilm/blob/master/bitnet/The-Era-of-1-bit-LLMs__Training_Tips_Code_FAQ.pdf
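For context, here is a minimal sketch of the absmean weight quantization and absmax activation quantization described in the handbook's reference snippets. The function names and structure are illustrative and do not necessarily match this PR's BitLinear code, although the `/ w_scale / x_scale` rescaling does show up in the diff discussed further down:

```python
import torch

def weight_quant(w: torch.Tensor):
    # Absmean scheme: scale by the mean absolute value, then
    # round-and-clip the weights to the ternary set {-1, 0, 1}.
    w_scale = 1.0 / w.abs().mean().clamp(min=1e-5)
    w_q = (w * w_scale).round().clamp(-1, 1)
    return w_q, w_scale

def activation_quant(x: torch.Tensor):
    # Absmax scheme: per-token 8-bit quantization over the last dimension.
    x_scale = 127.0 / x.abs().max(dim=-1, keepdim=True).values.clamp(min=1e-5)
    x_q = (x * x_scale).round().clamp(-128, 127)
    return x_q, x_scale

def bitlinear_matmul(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    # Quantize both operands, run the matmul, then undo the scaling,
    # mirroring the "/ w_scale / x_scale" rescaling seen in the diff below.
    x_q, x_scale = activation_quant(x)
    w_q, w_scale = weight_quant(weight)
    return (x_q @ w_q.t()) / w_scale / x_scale
```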

Here are the training results on 25B tokens:

[image: loss_curve]

cc @NouamaneTazi @xrsrke @thomwolf

xrsrke (Member) commented May 23, 2024

Hello. Thanks for the PR. One question: the difference in loss here seems very high. In the paper it should be ~0.1, but here the difference is more than 0.5:

[image]

MekkCyber (Author) commented May 23, 2024

I think it has to do with the batch size. During our latest experiment, we trained the 1.58bit model on 100B tokens, and we managed to get a 2.8 loss after 25B tokens with a batch size of 1024:
[image: lr]
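As a rough sanity check on those numbers, with a hypothetical sequence length of 2048 (the actual sequence length is not stated in this thread), a batch size of 1024 works out to about 2M tokens per optimizer step:

```python
# Back-of-the-envelope: steps needed to reach the 25B-token mark above.
batch_size = 1024            # sequences per step (from the comment above)
seq_len = 2048               # hypothetical; not stated in this thread
tokens_per_step = batch_size * seq_len           # 2,097,152 (~2.1M) tokens
steps_to_25b_tokens = 25_000_000_000 // tokens_per_step
print(tokens_per_step, steps_to_25b_tokens)      # 2097152 11920
```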

MekkCyber changed the title from "Fixes : https://github.com/huggingface/nanotron/issues/114" to "FEAT: Adding 1.58bit LLMs training protocol in nanotron" on Jun 12, 2024
MekkCyber changed the title from "FEAT: Adding 1.58bit LLMs training protocol in nanotron" to "FEAT: Adding 1.58bit LLMs training architecture in nanotron" on Jun 12, 2024
    ) / w_scale / x_scale
else:
    w = self.weight
x_norm = normalize(x, self.in_features)
gau-nernst commented Sep 21, 2024

Shouldn't RMSNorm here have learnable weights?


@MekkCyber May I know why this is marked as resolved? From my understanding of the training tips handbook, the new RMSNorm should have learnable weights, as the usual RMSNorm layers do. Is there a reason you left it out here (as well as in the PR merged in huggingface/transformers)?
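For reference, this is the kind of layer the comment is describing: a Llama-style RMSNorm with a learnable per-channel weight. This is a generic illustration, not the code in this PR:

```python
import torch
from torch import nn

class RMSNormWithWeight(nn.Module):
    # Llama-style RMSNorm: normalize by the root-mean-square over the hidden
    # dimension, then rescale with a learnable per-channel gain.
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))  # learnable gain
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        variance = x.float().pow(2).mean(dim=-1, keepdim=True)
        x_norm = x.float() * torch.rsqrt(variance + self.eps)
        return (self.weight * x_norm).to(x.dtype)
```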

hjc3613 commented Oct 29, 2024

Hi, I fetched this PR and tried to fine-tune Llama 70B using tp=8 or pp=8. Before training, I converted Llama 70B into nanotron format using the method from #174, with pp=1, dp=1, tp=1. But when I start training with pp=1, dp=1, tp=8, I get this error:
[image: error traceback]
When training with pp=1, dp=1, tp=1, which is consistent with the convert config, I run out of memory (OOM).
So how can I fine-tune the 70B model to 1.58bit from scratch? Could you give me some suggestions, please? Thank you!

MekkCyber (Author)
Hey @hjc3613, thanks for the report! You don't have to be consistent with the convert config; it should work, so I will investigate. Can you tell me, on your side, what the content of models/qwen2.5-72b-instruct-nanotron/model/model/decoder/0/pp_block/MLPBitNet is?

hjc3613 commented Oct 30, 2024


Thank you very much!
My requirement is to successfully fine-tune a 70B model to a 1.58bit model, using either Qwen2.5 70B or Llama 70B; both have almost identical architectures. According to the training code, I need to provide a dataset and a base model in nanotron format, which will then be fine-tuned to 1.58bit for inference. However, I only have an HF version of the base model, so following #174, I first convert it to nanotron format. Before conversion, I need to set dp=1, pp=1, tp=1. After conversion, I use the code from this PR (#180) for training. This is what I'm currently doing; if there are any unreasonable or unnecessary steps, please advise. Thank you sincerely. To my understanding, a 70B model must have pp>1 or tp>1, as a single GPU cannot hold the complete model.
