
Can you make the 50B slimpajama (your pre-train data) available to public? #4

Open
sanyalsunny111 opened this issue Apr 16, 2024 · 4 comments


@sanyalsunny111

Hi,

Very impressive results. Please open-source your 50B subset of pre-training data.

@MaveriQ

MaveriQ commented Apr 21, 2024

Thanks for the great work!

I agree that having the exact dataset would be super useful for pretraining other models of similar size and comparing the results objectively.

@keeeeenw
Owner

keeeeenw commented Apr 21, 2024

Thank you both for the question! I am working on a new setup that will allow you to reproduce both the data preprocessing and the pretraining with cleaner code and more documentation. Please stay tuned.

In the meantime, if you don't want to wait for my updates, you can reproduce my 50B SlimPajama data (not 100% deterministically, see below) by downloading the full SlimPajama dataset, tokenizing it, and extracting the first 50B tokens.

Specifically,

# Download the dataset
cd /path/to/dataset
git lfs install
git clone https://huggingface.co/datasets/cerebras/SlimPajama-627B

# Download the tokenizer from https://huggingface.co/TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T/tree/main
# (or your preferred tokenizer if you do not need to reproduce my results).
# You can also call tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-step-50K-105b")
# in Python, which saves it to the local Hugging Face cache (see the Python sketch below).

# Tokenize (I named the destination slim_star_combined, but I do not use the StarCoder dataset)
python scripts/prepare_slimpajama.py --source_path /path/to/SlimPajama --tokenizer_path /path/to/tokenizer --destination_path data/slim_star_combined --split validation --percentage 1.0
python scripts/prepare_slimpajama.py --source_path /path/to/SlimPajama --tokenizer_path /path/to/tokenizer  --destination_path data/slim_star_combined --split train --percentage 1.0
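If you prefer the Python route mentioned in the comment above, here is a minimal sketch (the save path is just an example; adjust it to your setup):

from transformers import AutoTokenizer

# Downloads the tokenizer into the local Hugging Face cache on first use.
tokenizer = AutoTokenizer.from_pretrained("TinyLlama/TinyLlama-1.1B-step-50K-105b")

# Optionally export it to a plain folder so it can be passed as --tokenizer_path.
tokenizer.save_pretrained("/path/to/tokenizer")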

Finally, run the same pretraining code I used in this repo:
https://github.com/keeeeenw/TinyLlama/blob/main/run_e2e_no_wait.sh#L35

The training command calls this function with the same seed and shuffling enabled to load the preprocessed data:
https://github.com/keeeeenw/TinyLlama/blob/main/pretrain/tinyllama.py#L382
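Roughly speaking, the deterministic part works like this (a simplified sketch, not the repo's actual code; the file pattern and seed value are illustrative):

import glob
import random

def ordered_chunks(data_dir, seed=42):
    # Collect the preprocessed chunk files, sort them for a stable starting
    # order, then shuffle with a fixed seed so every run sees the same sequence.
    filenames = sorted(glob.glob(f"{data_dir}/*.bin"))
    random.Random(seed).shuffle(filenames)
    return filenames

# If the set of files on disk differs (for example, because the git LFS download
# was still running during preprocessing), the resulting order, and hence the
# first 50B tokens, can differ even with the same seed.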

The reason for tokenizing the full SlimPajama dataset is that I can continue training beyond the first 50B tokens for potentially better results.

Not 100% deterministic: please keep in mind that although we are using the same random seeds, some randomness could come from the data preprocessing steps. Specifically, I preprocessed the data while the SlimPajama git LFS download was still running. Thus, depending on the list of files available in the local folder for https://github.com/keeeeenw/TinyLlama/blob/main/scripts/prepare_slimpajama.py#L150, you might end up with a different ordering of the preprocessed data and therefore a different set of 50B tokens.

To allow myself to reproduce the results 100% deterministically, I saved the preprocessed data locally after the "python scripts/prepare_slimpajama.py" step. That data is 900+ GB, and I don't have a good way to distribute it. One option is to run the training pipeline again to extract the same 50B-token sample, reverse the tokenization, and save the text that corresponds to those 50B tokens. This would reduce the upload size, remove the dependency on the tokenizer, and avoid shipping data that is unnecessary for anyone not interested in training the model beyond the 50B tokens. I could then upload the text to Hugging Face in the same format as the SlimPajama dataset. (I'm not sure whether Hugging Face allows uploading this much data, but it seems doable based on https://discuss.huggingface.co/t/is-there-a-size-limit-for-dataset-hosting/14861/12.)
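For reference, the reverse-tokenization step could look roughly like this (a sketch only; the shard filename, dtype, and any header in the packed format depend on the preprocessing script, so treat those details as assumptions):

import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T"
)

def detokenize(token_ids):
    # Decode a flat array of token ids back to text; skip_special_tokens
    # drops any BOS/EOS markers inserted while packing the dataset.
    return tokenizer.decode(list(token_ids), skip_special_tokens=True)

# Hypothetical usage (the shard path and uint16 dtype are assumptions):
# ids = np.fromfile("data/slim_star_combined/train_chunk_0000.bin", dtype=np.uint16)
# text = detokenize(ids)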

@MaveriQ

MaveriQ commented Apr 22, 2024

Thank you for sharing the pipeline and your thoughts.

  • I wonder what kind of learning rate schedule you are using. Usually it's a linear or cosine decay with warmup (sketched below), but both require the number of training steps to be specified in advance. When you say you will continue pretraining, are you going for another schedule cycle (from what I know, this is not very common for pretraining), or did you specify a number of training steps beyond 50B tokens right from the start?
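For concreteness, the kind of schedule I mean (a generic sketch, not necessarily what this repo uses):

import math

def cosine_lr_with_warmup(step, warmup_steps, max_steps, max_lr, min_lr=0.0):
    # Linear warmup followed by cosine decay; max_steps has to be fixed
    # up front, which is why extending training mid-run changes the schedule.
    if step < warmup_steps:
        return max_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * min(progress, 1.0)))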

@MaveriQ

MaveriQ commented Apr 22, 2024

The reverse-tokenized data (i.e., in text form) would be valuable, at least in my use case, as I am going to use a very different tokenizer. So when you get the time, I would appreciate it if you could make that 50B-token subset available on HF.
