Commit

Merge branch 'main' into improve-documentation
dirkgr authored Nov 26, 2024
2 parents 4e256a9 + 9c677c9 commit 71abc2c
Showing 166 changed files with 63,869 additions and 1,030 deletions.
6 changes: 5 additions & 1 deletion CHANGELOG.md
@@ -14,6 +14,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- `one_in_eight` configuration for activation checkpointing
- New tokenizer in the source instead of from huggingface
- Improved support for GCS
- `torch.compile()` now only compiles each block, not the whole model.
- Support for `torch.compile()` with `dynamic=True`
- The `torch.compile()` state is now reset after every evaluation, because evaluation interferes with the compiled model (see the sketch after this list).
- Added more in-loop evaluation tasks to pick from, mostly for scaling-law experiments.
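
A minimal sketch of what the per-block compilation and post-evaluation reset might look like (the `blocks` attribute name and the wiring are assumptions for illustration, not OLMo's actual code):

```python
from typing import Optional

import torch
import torch._dynamo


def compile_blocks(model: torch.nn.Module, dynamic: Optional[bool] = None) -> None:
    """Compile each transformer block individually instead of the whole model.

    Per-block compilation keeps each compiled graph small, and with
    dynamic=True Inductor tries to emit shape-polymorphic kernels up front,
    so changing sequence lengths do not force recompilation.
    """
    for i, block in enumerate(model.blocks):  # `blocks` is an assumed attribute
        model.blocks[i] = torch.compile(block, dynamic=dynamic)


def reset_after_eval() -> None:
    # Clearing Dynamo's caches after an evaluation pass forces a fresh
    # compile on the next training step, so guards recorded against
    # eval-time shapes cannot linger into training.
    torch._dynamo.reset()
```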


## [v0.5.1](https://github.com/allenai/OLMo/releases/tag/v0.5.1) - 2024-10-17
@@ -55,7 +59,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Swapped in correct flan data mix.
- Fixed a bug where the attention norm, when applied before the attention block, was modifying the residual stream (see the sketch after this list).
- Fixed `OLMo.from_checkpoint()` so that it correctly loads `olmo_core` and `torch_new` style checkpoints.
- Fixed `preserve_rng_state` being incorrectly set to False when doing gradient checkpointing with dropout
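
For context on the attention-norm fix above, a minimal illustrative sketch (not OLMo's actual block code): a norm applied before attention must feed only the attention input, leaving the residual stream itself untouched.

```python
import torch


def block_correct(x: torch.Tensor, norm, attn) -> torch.Tensor:
    # Pre-norm done right: the skip connection carries the original x,
    # and only attention sees the normalized activations.
    return x + attn(norm(x))


def block_buggy(x: torch.Tensor, norm, attn) -> torch.Tensor:
    # The bug: rebinding x to norm(x) means the skip connection now
    # carries norm(x) instead of x, so the norm has modified the
    # residual stream.
    x = norm(x)
    return x + attn(x)
```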


## [v0.4.0](https://github.com/allenai/OLMo/releases/tag/v0.4.0) - 2024-07-11

Large diffs for 24 files are not rendered by default.

1,206 changes: 1,206 additions & 0 deletions configs/annealing/peteish7-weka-anneal-from-928646-50B-nowup-refine-rw.yaml

Large diffs are not rendered by default.

1,381 changes: 1,381 additions & 0 deletions configs/peteish1-google.yaml

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion configs/peteish1-weka.yaml
@@ -84,7 +84,7 @@ save_num_unsharded_checkpoints_to_keep: -1
load_path: null

max_duration: 1ep
-global_train_batch_size: 1024
+global_train_batch_size: 512
device_train_microbatch_size: 4

precision: amp_bf16
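As an aside on how these two settings interact: each rank runs `global_train_batch_size / (device_train_microbatch_size × world_size)` gradient-accumulation micro-steps per optimizer step. A small sketch (the helper name and the 64-GPU figure are hypothetical, not from the config):

```python
def grad_accum_steps(global_batch: int, micro_batch: int, world_size: int) -> int:
    """Micro-batches each rank processes before one optimizer step."""
    per_micro_step = micro_batch * world_size
    # The global batch must divide evenly across ranks and micro-steps.
    assert global_batch % per_micro_step == 0
    return global_batch // per_micro_step


# With the values above on a hypothetical 64-GPU run:
# grad_accum_steps(512, 4, 64) -> 2 micro-steps per optimizer step
```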
1,380 changes: 1,380 additions & 0 deletions configs/peteish13-google.yaml

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion configs/peteish13-s3.yaml
@@ -84,7 +84,7 @@ save_num_unsharded_checkpoints_to_keep: -1
load_path: null

max_duration: 1ep
-global_train_batch_size: 1024
+global_train_batch_size: 2048
device_train_microbatch_size: 2

precision: amp_bf16
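The same arithmetic applies here: with `global_train_batch_size: 2048` and `device_train_microbatch_size: 2`, a hypothetical 256-GPU run would need 2048 / (2 × 256) = 4 micro-steps per optimizer step (see the sketch above).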
1,380 changes: 1,380 additions & 0 deletions configs/peteish13-weka.yaml

Large diffs are not rendered by default.

1,382 changes: 1,382 additions & 0 deletions configs/peteish7-google.yaml

Large diffs are not rendered by default.

11 changes: 11 additions & 0 deletions olmo/config.py
@@ -696,6 +696,17 @@ class CompilerConfig(BaseConfig):
The backend to use.
"""

dynamic: Optional[bool] = None
"""
From the torch docs:
Use dynamic shape tracing. When this is True, we will up-front attempt to generate a kernel that is as dynamic
as possible to avoid recompilations when sizes change. This may not always work as some
operations/optimizations will force specialization; use TORCH_LOGS=dynamic to debug overspecialization. When
this is False, we will NEVER generate dynamic kernels, we will always specialize. By default (None), we
automatically detect if dynamism has occurred and compile a more dynamic kernel upon recompile.
"""


class DistributedStrategy(StrEnum):
ddp = "ddp"
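A minimal sketch of how the new `dynamic` field might be threaded through to `torch.compile()` (the wiring function is my illustration; `backend` and `dynamic` are the config fields shown in the diff above):

```python
import torch


def compile_with_config(block: torch.nn.Module, cfg: "CompilerConfig") -> torch.nn.Module:
    # dynamic=None (the default) lets torch auto-detect dynamism and
    # recompile a more dynamic kernel; dynamic=True requests
    # shape-polymorphic kernels up front; dynamic=False always
    # specializes on the observed shapes.
    return torch.compile(block, backend=cfg.backend, dynamic=cfg.dynamic)
```

As the docstring notes, `TORCH_LOGS=dynamic` can help debug cases where an operation forces specialization despite `dynamic=True`.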