Can you make the 50B SlimPajama (your pre-train data) available to the public? #4
Comments
Thanks for the great work! I agree that having the exact dataset would be super useful for pretraining other models of similar size and comparing the results objectively.
Thank you both for the question! I am working on a new setup that will let you reproduce both the data preprocessing and the pretraining with cleaner code and more documentation. Please stay tuned. Meanwhile, if you don't want to wait for my updates, to reproduce my 50B SlimPajama data (not 100% deterministic, see below), you can download the full SlimPajama dataset, tokenize it, and extract the first 50B tokens.
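This is not the repo's actual prepare_slimpajama.py pipeline, just a rough sketch of what those three steps look like. It assumes the SlimPajama shards are zstd-compressed jsonl files with a "text" field, and the directory path, token budget logic, and tokenizer name are placeholders you would replace with your own.

```python
# Rough sketch (not the repo's prepare_slimpajama.py): walk the SlimPajama
# shards in a fixed, sorted order and stop once a 50B-token budget is reached.
# Paths, tokenizer name, and file layout are assumptions for illustration.
import glob
import io
import json

import zstandard as zstd
from transformers import AutoTokenizer

TOKEN_BUDGET = 50_000_000_000  # 50B tokens
tokenizer = AutoTokenizer.from_pretrained("your/llama-style-tokenizer")  # placeholder

# Sorting the shard list keeps the traversal order deterministic, which is the
# kind of ordering issue described in the determinism note below.
shards = sorted(glob.glob("SlimPajama-627B/train/**/*.jsonl.zst", recursive=True))

total_tokens = 0
kept_shards = []
for shard in shards:
    with open(shard, "rb") as fh:
        reader = io.TextIOWrapper(zstd.ZstdDecompressor().stream_reader(fh), encoding="utf-8")
        for line in reader:
            text = json.loads(line)["text"]
            total_tokens += len(tokenizer(text, add_special_tokens=False)["input_ids"])
    kept_shards.append(shard)
    if total_tokens >= TOKEN_BUDGET:
        break

print(f"Reached ~{total_tokens / 1e9:.1f}B tokens after {len(kept_shards)} shards")
```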
And finally, run the same pretraining code I used in this repo. The training command will call the data preparation function with the same seed and shuffling to get the preprocessed data. The reason for using the full SlimPajama data is that I can continue to train beyond the first 50B tokens for potentially better results.

Not 100% deterministic: please keep in mind that although we are using the same random seeds, some randomness can come from the data preprocessing steps. Specifically, I preprocessed the data while the SlimPajama git LFS download was still running. Depending on the list of files available in the local folder at https://github.com/keeeeenw/TinyLlama/blob/main/scripts/prepare_slimpajama.py#L150, you might end up with a different order of preprocessed data and therefore a slightly different set of 50B tokens. To allow myself to reproduce the results 100% deterministically, I saved the preprocessed data locally after the "python scripts/prepare_slimpajama.py" step. That data is 900+ GB, and I don't have a good way to distribute it.

One option is for me to run the training pipeline again to get the 50B-token sample, run the tokenizer in reverse, and save the text that corresponds to those 50B tokens. This would reduce the upload size, remove the dependency on the tokenizer, and cut the unnecessary data for folks who are not interested in continuing to train the model beyond 50B tokens. I could then upload the text to Hugging Face in the same format as the SlimPajama dataset (I'm not sure Hugging Face allows uploading this much data, though it seems doable based on https://discuss.huggingface.co/t/is-there-a-size-limit-for-dataset-hosting/14861/12).
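As an illustration of the reverse-tokenization idea, here is a hedged sketch that decodes a flat array of token ids back into per-document text and writes it out as SlimPajama-style jsonl. The flat uint16 layout, EOS id, file names, and output field names are assumptions for illustration, not the repo's actual on-disk format.

```python
# Hedged sketch of "reverse tokenization": decode packed token ids back to text
# and write SlimPajama-style jsonl. Layout, dtype, EOS id, and file names are
# assumptions, not the repo's actual format.
import json

import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your/llama-style-tokenizer")  # placeholder


def detokenize_bin(bin_path: str, out_path: str, eos_id: int = 2) -> None:
    """Split a flat token-id array on EOS boundaries and decode each document."""
    ids = np.fromfile(bin_path, dtype=np.uint16)          # assumed flat array of token ids
    docs = np.split(ids, np.where(ids == eos_id)[0] + 1)  # one chunk per document
    with open(out_path, "w", encoding="utf-8") as out:
        for doc in docs:
            if doc.size == 0:
                continue
            text = tokenizer.decode(doc.tolist(), skip_special_tokens=True)
            out.write(json.dumps({"text": text}) + "\n")


detokenize_bin("train_chunk_0000.bin", "recovered_chunk_0000.jsonl")  # hypothetical file names
```

Note that re-tokenizing the recovered text with a different tokenizer will not yield exactly 50B tokens, since token boundaries and vocabulary differ between tokenizers.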
Thank you for sharing the pipeline and your thoughts.
The reverse-tokenized data (i.e., in text form) would be valuable, at least in my use case, since I am going to use a very different tokenizer. So when you get the time, I would appreciate it if you could make that 50B-token set available on HF.
Hi,
Very impressive results. Please open-source your 50B subset of pre-training data.