A bunch of new optimizers and schedules #21

Open · wants to merge 70 commits into base: main

Changes from all commits (70 commits)
ea9f4d0
soap and muon has been added. also black + isort everything
Andron00e Oct 15, 2024
d5153c4
-fix annotations
Andron00e Oct 15, 2024
5dcc69a
soap is ready, muon needs to be done
Andron00e Oct 15, 2024
49b802d
AdEMAMix and Lion is here, Muon still TODO
Andron00e Oct 16, 2024
3d8b24a
AdEMAMix and Lion is here, Muon still TODO
Andron00e Oct 16, 2024
a047786
Schedule-Free AdamW is here, warmup_percent -> warmup_steps / iterations
Andron00e Oct 17, 2024
66b77f5
eval on a fix subset + better lr decay
mpagli Oct 17, 2024
7f158a3
Schedule-Free SGD + AdamW are here
Andron00e Oct 17, 2024
1f67f6d
Schedule-Free SGD + AdamW are here
Andron00e Oct 17, 2024
d3082e2
Merge branch 'soap' into small-upgrades
Andron00e Oct 17, 2024
5fb1d57
Merge pull request #19 from epfml/small-upgrades
Andron00e Oct 17, 2024
7378449
push to wandb team + display grad norm
mpagli Oct 17, 2024
156fdab
Merge branch 'small-upgrades' of https://github.com/epfml/llm-baselin…
mpagli Oct 17, 2024
6d8a47e
Merge pull request #20 from epfml/small-upgrades
Andron00e Oct 17, 2024
2b0e71f
a code for schedules is here
Andron00e Oct 17, 2024
5d43a53
new schedules
Andron00e Oct 17, 2024
a908d96
cos_inf and wsd schedules are here
Andron00e Oct 17, 2024
2e4bc4e
codestyle
Andron00e Oct 17, 2024
c5345d6
--fix in readme
Andron00e Oct 17, 2024
9b2c2e6
fix grad norm display
mpagli Oct 17, 2024
e142dfc
removed warmup_percent argument
Andron00e Oct 17, 2024
513b902
schedule-free fix, added scheduer check in optim/base.py
Andron00e Oct 18, 2024
8e2aa6d
--fix saving of a checkpoint if scheduler==none
Andron00e Oct 19, 2024
eda40bf
fix requirements
Andron00e Oct 20, 2024
3649936
Adam-mini is here
Andron00e Oct 20, 2024
e75b3be
refactoring
mpagli Oct 21, 2024
a624c2a
Merge branch 'soap' into ademamix
mpagli Oct 21, 2024
5876db9
reviewed
Andron00e Oct 21, 2024
5bfbe3a
Merge pull request #22 from epfml/ademamix
Andron00e Oct 21, 2024
94ca273
extra_args removed
Andron00e Oct 21, 2024
a1ce2f4
--fixed scheduler ckpt again, signSGD and Signum are here, codestyle
Andron00e Oct 24, 2024
5ddffa5
updated signum
Andron00e Oct 24, 2024
3311ca8
changed logic a bit, schedules moved from utils to schedules.py; cos_…
Andron00e Oct 31, 2024
1ca7a76
sgdf
Andron00e Nov 4, 2024
d485b74
prodigy
Andron00e Nov 4, 2024
6007a95
add fineweb
mpagli Nov 4, 2024
2e526b8
Merge branch 'soap' of https://github.com/epfml/llm-baselines into soap
mpagli Nov 4, 2024
dc5169c
-- description
Andron00e Nov 4, 2024
7dd576b
log tokens processed
mpagli Nov 5, 2024
8f51763
Merge branch 'soap' of https://github.com/epfml/llm-baselines into soap
mpagli Nov 5, 2024
94a4e57
add fineweb-edu
mpagli Nov 5, 2024
c8e6fd8
finewebedu
Andron00e Nov 5, 2024
99d7af6
sophia is in the process
Andron00e Nov 5, 2024
fd9aa7e
shampoo is here, needs major improvements (very memory- and time- con…
Andron00e Nov 6, 2024
479b56b
shampoo is here, problems with memory due to torch.inverse, needs imp…
Andron00e Nov 6, 2024
c811739
muon is here, still fix shampoo
Andron00e Nov 6, 2024
361213d
--fix sophiag
Andron00e Nov 6, 2024
2b9d0f3
sophiag fixed, test two adamw runs using muon and soap branches, if t…
Andron00e Nov 6, 2024
e7780d6
Merge pull request #24 from epfml/muon
Andron00e Nov 7, 2024
19ff7b1
--adopt todo, --fix sophia
Andron00e Nov 7, 2024
6709eb6
clipped version are here, muon schedules todo
Andron00e Nov 8, 2024
0f05b19
updates in muon
Andron00e Nov 10, 2024
729c820
-- micro
Andron00e Nov 10, 2024
d81d0b2
implemented a scheduler for muon
Andron00e Nov 11, 2024
6a8a159
new scheduler, double decay, test it
Andron00e Nov 11, 2024
c30eeca
minor
Andron00e Nov 12, 2024
82dd1af
adopt is here
Andron00e Nov 18, 2024
755cf59
adopt fix
Andron00e Nov 18, 2024
0c53df3
adopt fix again
Andron00e Nov 18, 2024
6cfd01b
adopt again
Andron00e Nov 19, 2024
d4585ae
mars is here, ready to try
Andron00e Nov 19, 2024
9700a50
--fix info
Andron00e Nov 19, 2024
29ce41b
--small changes in mars train
Andron00e Nov 19, 2024
aaff7fc
adafactor and lamb
Andron00e Nov 21, 2024
7ece053
--fix adopt
Andron00e Nov 23, 2024
664d2df
Merge pull request #25 from epfml/soap
Andron00e Nov 24, 2024
10a2a3a
muon-debug
Andron00e Nov 24, 2024
4f5c061
--signum fix
Andron00e Nov 25, 2024
5d63141
Merge pull request #26 from epfml/muon
Andron00e Nov 25, 2024
c642f63
normalized sgd + sophia removed hardcoded precondition frequency
Andron00e Dec 7, 2024
89 changes: 76 additions & 13 deletions README.md
@@ -1,6 +1,6 @@
# LLM-baselines

A modular codebase to experiment with transformers, inspired by NanoGPT.
A modular codebase to experiment with transformers, inspired by nanoGPT.

## Quickstart

@@ -36,44 +36,104 @@ parser.add_argument('--batch_size', default=32, type=int)
parser.add_argument('--acc_steps', default=4, type=int)
parser.add_argument('--seed', default=0, type=int) # random seed for the parameters
parser.add_argument('--data_seed', default=1337, type=int) # random seed defining the data ordering
parser.add_argument('--eval_interval', default=200, type=int)
parser.add_argument('--full_eval_at', nargs="+", type=int)
parser.add_argument('--eval_batches', default=32, type=int)
parser.add_argument('--device', default='cuda:0', type=str) # see below to run on multiple GPUs
parser.add_argument('--iterations', default=25000, type=int) # total number of training iterations
parser.add_argument('--lr', default=1e-3, type=float)
parser.add_argument('--warmup_percent', default=0.05, type=float) # the total number of warmup steps is iterations * warmup_percent
parser.add_argument('--warmup_steps', default=300, type=int)
parser.add_argument('--lr', default=1e-3, type=float)
parser.add_argument('--wsd_final_lr_scale', default=0.0, type=float) # wsd scheduler
parser.add_argument('--wsd_fract_decay', default=0.1, type=float) # wsd scheduler
parser.add_argument('--decay_type', default='linear', choices=['linear', 'cosine', 'exp', 'miror_cosine', 'square', 'sqrt'])
parser.add_argument('--dd_second_decay_type', default='linear', choices=['linear', 'cosine', 'exp', 'miror_cosine', 'square', 'sqrt'])
parser.add_argument('--dd_first_lr_factor', default=1e-2, type=float)
parser.add_argument('--weight_decay', default=0.1, type=float) # I recommend you keep this value, else instabilities might arise
parser.add_argument('--beta1', default=0.9, type=float) # adam parameter
parser.add_argument('--beta2', default=0.95, type=float) # adam parameter
parser.add_argument('--scheduler', default='cos', choices=['linear', 'cos', 'none'])
parser.add_argument('--opt', default='adamw', choices=['adamw', 'sgd'])
parser.add_argument('--scheduler', default='cos', choices=['linear', 'cos', 'wsd', 'cos_inf', 'none', 'dd'])
parser.add_argument('--cos_inf_steps', default=0, type=int) # cos_inf scheduler
parser.add_argument('--opt', default='adamw', choices=['adamw', 'sgd', 'muon', 'soap', 'ademamix', 'ademamix2', 'lion', 'sf-adamw', 'sf-sgd', 'signsgd', 'signum', 'sgdf', 'prodigy', 'sophiag', 'shampoo', 'adopt', 'clip-adagrad', 'clip-adagrad-delay-eta', 'clip-adam', 'clip-adam-delay-eta', 'mars', 'adafactor', 'lamb', 'normalized-sgd'])
parser.add_argument('--eval_freq', default=200, type=int) # in iterations
parser.add_argument('--results_base_folder', default="./exps", type=str) # where the checkpoints will be saved
parser.add_argument('--grad_clip', default=0.0, type=float) # default value is 1.0 in NanoGPT
parser.add_argument('--grad_clip', default=0.0, type=float) # default value is 1.0 in nanoGPT
parser.add_argument('--momentum', default=0.9, type=float)
parser.add_argument('--shampoo_beta', default=-1.0, type=float)
parser.add_argument('--precondition_frequency', default=10, type=int) #for SOAP and Sophia
parser.add_argument('--max_precond_dim', default=10000, type=int)
parser.add_argument('--merge_dims', default=False, type=bool) # merge dimensions till the product of the dimensions is less than or equal to max_precond_dim
parser.add_argument('--precondition_1d', default=False, type=bool)
parser.add_argument('--normalize_grads', default=False, type=bool)
parser.add_argument('--soap_data_format', default='channels_first', type=str)
parser.add_argument('--correct_bias', default=True, type=bool)
parser.add_argument('--nesterov', default=False, type=bool) # whether to use Nesterov-style momentum
parser.add_argument('--muon_ns_steps', default=5, type=int) # the number of steps to use in the newton schulz, if it is iterative
parser.add_argument('--muon_lr_factor', default=0.02, type=float) # a factor by which to reduce the lr for muon
parser.add_argument('--adema_beta3', default=0.9, type=float) # beta3 in AdEMAMix
parser.add_argument('--adema_alpha', default=2.0, type=float) # alpha in AdEMAMix
parser.add_argument('--adema_beta3_warmup', default=None, type=int) # AdEMAMix hyperparameter
parser.add_argument('--adema_alpha_warmup', default=None, type=int) # AdEMAMix hyperparameter
parser.add_argument('--schedulefree_r', default=0.0, type=float) # schedulefree hyperparameter
parser.add_argument('--weight_lr_power', default=2.0, type=float) # schedulefree hyperparameter
parser.add_argument('--model_sharding', default=None, type=bool) # Adam-mini
parser.add_argument('--adam_mini_verbose', default=False, type=bool) # print all the logs if true
parser.add_argument('--log_interval', default=50, type=int)
parser.add_argument('--dampening', default=0.0, type=float)
parser.add_argument('--prodigy_beta3', default=None, type=float) # coefficients for computing the Prodigy stepsize using running averages
parser.add_argument('--prodigy_decouple', default=True, type=bool) # Use AdamW style decoupled weight decay
parser.add_argument('--prodigy_use_bias_correction', default=False, type=bool)
parser.add_argument('--prodigy_safeguard_warmup', default=False, type=bool) # Remove lr from the denominator of D estimate to avoid issues during warm-up stage. Off by default.
parser.add_argument('--prodigy_fsdp_in_use', default=False, type=bool)
parser.add_argument('--sophia_rho', default=0.04, type=float)
parser.add_argument('--clipping_type', default='no', choices=['no', 'local', 'elementwise']) # for methods with clipping
parser.add_argument('--clipping_eta', default=1.0, type=float)
parser.add_argument('--mars_type', default='mars-adamw', choices=['mars-adamw', 'mars-lion', 'mars-shampoo'],)
parser.add_argument('--mars_vr_gamma', default=0.025, type=float)
parser.add_argument('--mars_is_approx', default=True, type=bool)
parser.add_argument('--mars_lr', default=3e-3, type=float)
parser.add_argument('--mars_beta1', default=0.95, type=float)
parser.add_argument('--mars_beta2', default=0.99, type=float)
parser.add_argument('--adafactor_decay_rate', default=-0.8, type=float)
parser.add_argument('--lamb_use_bias_correction', default=False, type=bool)
# Dataset params
parser.add_argument('--dataset', default='slimpajama', choices=['slimpajama', 'wikitext', "shakespeare-char", 'arxiv', "arxiv2000", "arxiv+wiki", 'openwebtext2'])
parser.add_argument('--dataset', default='slimpajama', choices=['slimpajama', 'wikitext', 'shakespeare-char', 'arxiv', 'arxiv2000', 'arxiv+wiki', 'openwebtext2', 'redpajama', 'redpajamav2', 'slimpajama_chunk1', 'fineweb', 'finewebedu'])
parser.add_argument('--tokenizer', default='gpt2', type=str, choices=['gpt2', 'mistral'])
parser.add_argument('--vocab_size', default=50304, type=int)
parser.add_argument('--data_in_ram', action='store_true') # force the data to RAM, you most likely do not need this
# Model params
parser.add_argument('--model', default='base', choices=['base', 'llama2'])
parser.add_argument('--use_pretrained', default="none", type=str) # 'none', 'gpt-2' or a path to the pretraind model
parser.add_argument('--model', default='base', choices=['base', 'llama', 'test'])
parser.add_argument('--parallel_block', action='store_true')
parser.add_argument('--use_pretrained', default='none', type=str) # 'none', 'gpt2' or a path to the pretrained model
parser.add_argument('--from_dense', action='store_true')
parser.add_argument('--init_std', default=0.02, type=float)
parser.add_argument('--dropout', default=0.0, type=float) # keep to 0 unless in low data regime (e.g. wikitext)
parser.add_argument('--n_head', default=12, type=int)
parser.add_argument('--n_layer', default=12, type=int) # depth in (att + ff) blocks
parser.add_argument('--n_embd', default=768, type=int) # hidden size ...
parser.add_argument('--sequence_length', default=512, type=int)
parser.add_argument('--dtype', default=torch.bfloat16, type=torch.dtype)
parser.add_argument('--dtype', default='bfloat16', type=str, choices=['float32', 'float16', 'bfloat16'],)
parser.add_argument('--bias', default=False, type=bool)
parser.add_argument('--compile', action='store_true') # if true then model is compiled
parser.add_argument('--rmsnorm_eps', default=1e-5, type=float) # used by the llama model
parser.add_argument('--multiple_of', default=256, type=int) # used by the llama model make SwiGLU hidden layer size multiple of large power of 2
parser.add_argument('--n_kv_head', default=None, type=int) # for Adam-mini
# Checkpointing
parser.add_argument('--results_base_folder', default='./exps', type=str)
parser.add_argument('--permanent_ckpt_interval', default=0, type=int)
parser.add_argument('--latest_ckpt_interval', default=0, type=int)
parser.add_argument('--resume_from', default=None, type=str)
parser.add_argument('--resume_from_swa', default=None, type=str)
parser.add_argument('--auto_resume', default=True)
# logging params (WandB)
parser.add_argument('--wandb', action='store_true') # whether to use wandb or not
parser.add_argument('--wandb_project', default="my-project", type=str)
parser.add_argument('--wandb_run_prefix', default="none", type=str) # is added before the autogenerated experiment name
parser.add_argument('--wandb_project', default='my-project', type=str)
parser.add_argument('--wandb_entity', default=None, type=none_or_str) # for the team projects
parser.add_argument('--wandb_run_prefix', default='none', type=str) # is added before the autogenerated experiment name
parser.add_argument('--eval_seq_prefix', default="Once upon a time", type=str) # prefix used to generate sequences
parser.add_argument('--log_dynamics', action='store_true')
# Distributed args
parser.add_argument('--distributed_backend', default=None, type=str, required=False,
choices=distributed.registered_backends()) # distributed backend type (e.g. nccl)
parser.add_argument('--save_checkpoint_freq', default=None, type=int, required=False)
```
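
The wsd-related flags above (`--scheduler wsd`, `--warmup_steps`, `--wsd_fract_decay`, `--wsd_final_lr_scale`, `--decay_type`) follow a warmup-stable-decay pattern. As a rough orientation only — the repo's `schedules.py` is the authoritative implementation — here is a minimal sketch of such a schedule, assuming a linear decay branch; the function name is illustrative:

```python
# Illustrative warmup-stable-decay (wsd) lr multiplier, NOT the exact code
# from schedules.py; the linear decay branch is an assumption.
def wsd_lr_multiplier(step, iterations, warmup_steps, fract_decay=0.1, final_lr_scale=0.0):
    decay_steps = int(fract_decay * iterations)  # length of the final decay phase
    stable_end = iterations - decay_steps        # last step of the constant-lr phase
    if step < warmup_steps:                      # linear warmup from 0 to the peak lr
        return step / max(1, warmup_steps)
    if step < stable_end:                        # stable phase at the peak lr
        return 1.0
    progress = (step - stable_end) / max(1, decay_steps)
    return 1.0 - (1.0 - final_lr_scale) * progress  # decay down to final_lr_scale * lr
```

The learning rate actually used would then be `args.lr * wsd_lr_multiplier(step, ...)`; choosing another `--decay_type` (cosine, sqrt, ...) only changes the shape of the final branch, while the peak stays at `--lr` and the floor at `--wsd_final_lr_scale * lr`.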

## Using WandB

@@ -111,12 +171,15 @@ src/
optim/
utils.py # contains eval and get_batch functions
base.py # training function for the base and llama models
...
distributed/
# code to enable simple distributed training
```

Given the above structure, to add your own model, fork the `./src/models/base.py` file and make your modifications, and fork `./src/optim/base.py` as well if you need a custom training loop or evaluation. You also need to fork the `./src/config/base.py` file to add your own parameters, which implies adding your new config to the mapping `CONFIG_FORMAT_TO_MODULE_MAP` in `./src/config/__init__.py`. To add a new dataset, create a new file in the `data` folder; check `wikitext.py` for the expected format.
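
For instance, a new dataset module could be as small as the following sketch — the function name, signature, and return format here are assumptions for illustration; `wikitext.py` defines what the codebase actually expects:

```python
# src/data/mydataset.py — hypothetical example, not part of the repository.
import numpy as np
import tiktoken


def get_mydataset_data():
    """Tokenize a toy corpus and return train/val splits as uint16 token arrays."""
    enc = tiktoken.get_encoding("gpt2")
    text = "hello world " * 1000  # placeholder corpus; load your real data here
    tokens = np.array(enc.encode_ordinary(text), dtype=np.uint16)
    split = int(0.9 * len(tokens))
    return {"train": tokens[:split], "val": tokens[split:]}
```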

**Note:** we use [black](https://black.readthedocs.io/en/stable/the_black_code_style/current_style.html) and [isort](https://pycqa.github.io/isort/) for all pull requests. Before committing your code, simply run ```black . && isort .``` and you will be fine.

## Multi-GPU training

Given a multi-GPU machine with e.g. 4 GPUs, one can distribute the training using data-parallelism:
11 changes: 6 additions & 5 deletions requirements.txt
@@ -1,10 +1,11 @@
tiktoken
--find-links https://download.pytorch.org/whl/torch_stable.html
torch==2.0.0+cu118
torchaudio==2.0.0+cu118
torchvision==0.15.0+cu118
tqdm==4.65.0
torch
torchaudio
torchvision
tqdm
transformers
wandb
datasets
zstandard
zstandard
numpy==1.22.4