diff --git a/README.md b/README.md
index c53694158..874354206 100644
--- a/README.md
+++ b/README.md
@@ -88,10 +88,6 @@ The quantized model is sensitive to input types and CUDA handling. To avoid pote
 
 Additional tools for evaluating OLMo models are available at the [OLMo Eval](https://github.com/allenai/OLMo-eval) repo.
 
-## Debugging
-
-See [Debugging](https://github.com/allenai/OLMo/blob/main/docs/NOTES.md#debugging).
-
 ## Citing
 
 ```bibtex
diff --git a/docs/Kempner.md b/docs/Kempner.md
deleted file mode 100644
index 4f20c6b97..000000000
--- a/docs/Kempner.md
+++ /dev/null
@@ -1,41 +0,0 @@
-How to run on the Kempner cluster
-===
-
-Getting started
----
-
-1. Get your login. Log into the Kempner login nodes.
-2. Set up directories:
-
-   ```bash
-   ln -s /n/holyscratch01/kempner_lab/Lab/checkpoints
-   ln -s /n/holyscratch01/kempner_lab/Lab/data
-   ln -s /n/holyscratch01/kempner_lab/Lab/logs
-   ```
-
-3. Add this to `~/.bashrc`, then run those commands in your shell:
-
-   ```bash
-   module load ncf
-   module load awscli
-   ```
-
-4. `aws configure`
-5. Install Miniconda with `wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && bash Miniconda3-latest-Linux-x86_64.sh`, then follow the prompts. You'll probably want to log out and back in after this.
-6. `conda create -y -n LLM python=3.10 ipython`
-7. `conda activate LLM`
-8. `conda install -y pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia`
-9. `git clone https://github.com/allenai/LLM.git`
-10. `cd LLM`
-11. `pip install -e .`
-12. Pre-download all the downstream evals. In a Python shell:
-
-    ```python
-    from olmo.eval.downstream import *
-    tokenizer = Tokenizer.from_file("tokenizers/allenai_eleuther-ai-gpt-neox-20b-pii-special.json")
-    for x in label_to_task_map.values():
-        kwargs = {}
-        if isinstance(x, tuple):
-            x, kwargs = x
-        x(tokenizer=tokenizer, **kwargs)
-    ```
diff --git a/docs/LUMI.md b/docs/LUMI.md
deleted file mode 100644
index 0663570f8..000000000
--- a/docs/LUMI.md
+++ /dev/null
@@ -1,213 +0,0 @@
-How to run in LUMI
-==================
-
-Detailed documentation is at https://docs.lumi-supercomputer.eu.
-If you are reading that, keep in mind that it is written for HPC people (who care about Fortran), not ML people (who care about Python).
-
-## Project name
-
-Project names are strings.
-They show up everywhere.
-They always look like `project_123456789`.
-Look up ours in the CSC login system.
-
-## Partitions
-
-LUMI has different partitions.
-The "G" partition is for GPUs.
-For Slurm, we do big runs with `--partition standard-g`.
-We do small runs for debugging on the testing partition, with `--partition small-g`.
-Runs on the small partition have a maximum runtime of 30 minutes, but it seems they don't count against our quota.
-Use `small-g` when you're testing your setup.
-
-## File systems
-
-LUMI has a lot of file systems.
-These are accessible from login nodes as well as compute nodes.
-The names seem arbitrary.
-* Your home directory. 20GB limit. Not particularly fast. Home directories are randomly spread among five different
-  lustre filesystems.
-* `/project/project_123456789`. 50GB limit. I use this for Singularity containers and other build debris, because
-  the LUMI docs told me to do this.
-* `/scratch/project_123456789`. No real limit. CSC recommended we put data there, so that's what I did. Data gets deleted after 90 days of no reads, but that's not enforced right now.
-* `/pfs/lustref1/flash/project_123456789`. No real limit. Very fast. CSC recommended we use this a lot.
-  Data gets deleted after 30 days of no reads, but that's not enforced right now. We should put checkpoints here.
-
-## Custom software
-
-LUMI has two different ways of installing software.
-Conda isn't one of them.
-
-The `module` system puts tools into your environment, like Conda does, but it doesn't have "environments" like
-Conda.
-You just `module load <module name>` whenever you need something.
-There is a [tiny software library](https://lumi-supercomputer.github.io/LUMI-EasyBuild-docs/) of available tools on LUMI.
-If you need something that isn't in the software library, you use EasyBuild to build it.
-This is cumbersome if you have many dependencies, and we do, so we are not leaning into this except for bootstrapping build tools into the environment.
-
-Singularity is the LUMI way of running Docker containers.
-Singularity containers are lighter weight than Docker, somewhere between Conda and Docker.
-When you run in Singularity, the host file system is still visible, but not all drives.
-Some drives have to be manually mapped in with command line flags.
-I have not figured out exactly how this works.
-You can convert a Docker container to a Singularity container with `singularity pull --docker-login docker://my-docker-image`.
-`--docker-login` makes it ask for a password every time.
-There is a way to give it passwords permanently, but I haven't figured out how.
-
-There is a `Dockerfile.lumi` in this repo.
-I converted it to a Singularity image at `/project/project_123456789/containers/llm-lumi_latest.sif`.
-This image contains (should contain) everything we need to run LLM training, and also enough tools to debug a node, or just do general work on the LUMI login nodes.
-For example, it has tools for downloading from S3, which I needed to get the prepared data downloaded.
-This image can be recreated with `singularity pull --docker-login docker://ghcr.io/allenai/llm-lumi:latest` (after building it and uploading it to ghcr.io from a Beaker machine).
-
-## Basic setup to work in LUMI
-
-Here is my `~/.bashrc` file, for your copy and pasting pleasure:
-
-```bash
-# Load LUMI tools
-module load LUMI/22.08 partition/G
-module load systools/22.08
-
-# Allow TERM to make backspace and other keys work properly in the terminal.
-# https://unix.stackexchange.com/questions/43103/backspace-tab-not-working-in-terminal-using-ssh
-export TERM=vt100
-
-# Environment variables
-export PROJECT=project_123456789
-export PROJECT_DIR=/project/$PROJECT
-export SCRATCH_DIR=/scratch/$PROJECT
-export FLASH_DIR=/pfs/lustref1/flash/$PROJECT
-
-# Singularity likes to write giant cache files which blow up your home directory quota, so I put it on the
-# flash drive.
-export SINGULARITY_CACHEDIR=$PROJECT_DIR/singularity_cache.$USER
-export SINGULARITY_TMPDIR=$PROJECT_DIR/singularity_tmp.$USER
-
-# EasyBuild environment variables
-export SBATCH_ACCOUNT=$PROJECT
-export SALLOC_ACCOUNT=$SBATCH_ACCOUNT
-export EBU_USER_PREFIX=/project/$SBATCH_ACCOUNT
-
-# For downloading things from the ai2-llm bucket.
-export AWS_ACCESS_KEY_ID=XXXXXXX
-export AWS_SECRET_ACCESS_KEY=YYYYYYY
-
-# Other API keys for logging and metric tracking.
-export WANDB_API_KEY=XXXXXXX
-```
-
-I `git clone` the repo directly into my home directory, and I'm running from there.
-That seems fine.
-
-## How running stuff works
-
-We run everything with slurm.
-
-### Level 1: `sbatch`
-
-Everything starts with `sbatch script.sh`.
-This uploads `script.sh` to some magical place to be executed whenever the cluster has enough capacity.
-When this time comes, `script.sh` is executed.
-It runs one or more `srun` commands, which put jobs on the compute nodes.
-Those slurm script files are not ordinary bash files.
-They have a bunch of extra directives at the top for slurm to read.
-The login nodes, the node where `script.sh` runs, and all the compute nodes have access to the same shared file systems.
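
For orientation, here is a minimal sketch of what the top of such a script can look like. The directive values are illustrative placeholders, not our production settings; the real scripts under `scripts/lumi/` are the reference:

```bash
#!/bin/bash
#SBATCH --job-name=c4-small          # illustrative name
#SBATCH --account=project_123456789
#SBATCH --partition=standard-g       # or small-g for short debugging runs
#SBATCH --nodes=2                    # illustrative; real runs use many more
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8
#SBATCH --time=00:30:00

# ...followed by one or more srun commands, as described below.
```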
-
-### Level 2: `srun`
-
-In our run script, the `srun` part looks like this:
-```bash
-srun \
-  --distribution=block:block \
-  --kill-on-bad-exit \
-  <rest of the command>
-```
-
- * `--distribution` is about trying to make sure that adjacent ranks are adjacent in the cluster, i.e., making sure
-that ranks 2 and 3 are on the same node, rather than spread across nodes.
- * `--kill-on-bad-exit` makes sure that when one process dies, slurm kills all the others. By default they keep running.
-
-### Level 3: `run_with_environment.sh`
-
-`<rest of the command>` just expands into `scripts/run_with_environment.sh <rest of the command>`.
-This is a script that translates various Slurm environment variables into whatever the Mosaic trainer expects.
-In full, it looks like this:
-
-```bash
-# Prefix our own output with the name of the node that's running.
-export NODENAME=$(hostname -s)
-exec > >(trap "" INT TERM; sed -u "s/^/$NODENAME out: /")
-exec 2> >(trap "" INT TERM; sed -u "s/^/$NODENAME err: /" >&2)
-
-# Set up environment
-export MASTER_ADDR=$(scontrol show hostnames | head -n 1)
-export MASTER_PORT=39591
-export WORLD_SIZE=$SLURM_NTASKS
-export RANK=$SLURM_PROCID
-export LOCAL_WORLD_SIZE=$SLURM_NTASKS_PER_NODE
-export LOCAL_RANK=$SLURM_LOCALID
-export NODE_RANK=$((($RANK - $LOCAL_RANK) / $LOCAL_WORLD_SIZE))
-
-# Delete debris that ROCm sometimes leaves around
-rm -f /dev/shm/rocm_smi_card$LOCAL_RANK
-```
-
-Note that the documentation tells us to set `ROCM_VISIBLE_DEVICES`, but this is wrong.
-Slurm already sets this.
-If we set it again, bad things happen.
-
-### Level 4: `singularity`
-
-`<rest of the command>` expands into this:
-```bash
-  singularity exec \
-    -B"$PROJECT_DIR:$PROJECT_DIR" \
-    -B"$SCRATCH_DIR:$SCRATCH_DIR" \
-    -B"$FLASH_DIR:$FLASH_DIR" \
-    $PROJECT_DIR/containers/llm-lumi_latest.sif \
-    <rest of the command>
-```
-
- * `-B` maps paths from the host into the container. I don't know why some paths have to be mapped, but others (like the home directory) are just there.
- * The actual scripts map some other libraries into the container. Those other libraries are necessary for the Slingshot fast interconnect.
-
-### Level 5: Our own training script
-
-Finally we get to run our own trainer, when `<rest of the command>` expands into `python scripts/train.py configs/c4-small.yaml --run_name=${SLURM_JOB_ID}`.
-
-We're not using the MosaicML launcher, torchrun, or anything else that launches training processes.
-We don't need it, since slurm already launches us.
-
-### Monitoring your runs
-
-* You can see all cluster activity with `squeue`.
-* You can see all of your own cluster activity with `squeue --me`.
-* You can see all of our project's cluster activity with `squeue -A $PROJECT`.
-* You can log into a running node with `scripts/log_into_node.sh <node name>`. This will attach to the node as it runs. When the job finishes or fails, your `bash` will get killed.
-* You can see the logs for a run in `${FLASH_DIR}/logs/${SLURM_JOB_ID}.log`. All nodes write to the same file. E.g. `tail -f $FLASH_DIR/logs/3376668.log`.
-
-### Running an interactive session
-
-This is how you can run an interactive session on a compute node:
-```bash
-srun --account=$PROJECT --partition=small-g --time=00:30:00 --nodes=1 --ntasks-per-node=1 --gpus-per-node=1 --pty bash
-```
-
-## First steps
-
-Thanks for reading all the way down here. Your reward is some first steps.
-
-1. Make sure you are a member of our WandB team: https://wandb.ai/ai2-llm
-2. Make sure you have been invited to Logz.io: https://app.logz.io. You don't _need_ Logz.io, but it is a much nicer way of debugging logs than looking at a text file that 512 processes are writing to at the same time.
-3. Log into LUMI.
-4. Set up your `~/.bashrc` the way I have it above. Log out and back in to make it take effect.
-5. `git clone` the OLMo repo.
-6. Edit `scripts/lumi/c4-small-on-lumi.sh`:
-   7. Set the maximum run time (`--time`) to 15 minutes.
-   8. Set the `--partition` to `small-g`.
-   9. Set `--nodes` to `2`.
-10. Kick off your first slurm job: `sbatch scripts/lumi/c4-small-on-lumi.sh`. This will give you a slurm job id.
-11. Look at the logs as they come in: `less $FLASH_DIR/logs/<job id>.log`. Press `F` to "follow" the logs as they come in.
-12. Look at wandb. There will be a run whose name is the slurm job id. You can rename the run if you like.
-13. Run `squeue --me` and see your run in the list there.
-14. Wait for a few batches, then kill the run with `scancel <job id>`.
diff --git a/docs/NOTES.md b/docs/NOTES.md
deleted file mode 100644
index 181b8d93f..000000000
--- a/docs/NOTES.md
+++ /dev/null
@@ -1,126 +0,0 @@
-# OLMo: Open Language Model
-
-## Setup
-
-After cloning this repository, first install the latest [PyTorch](https://pytorch.org) according to the official instructions relevant to your environment. Then install the remaining dependencies and code base by running:
-
-```
-pip install -e .
-```
-
-## Running LM pre-training jobs
-
-Our training script is [scripts/train.py](./scripts/train.py), which should be launched either through `torchrun` or Slurm (see below) since it only supports distributed training (on GPUs).
-The first argument to the training script is a path to a [training configuration file](./configs/).
-Then it takes any number of optional arguments that can be used to override values from the configuration file using dot notation.
-For example, to change the learning rate you'd pass `--optimizer.learning_rate=0.0001`.
-
-### Launching a training job
-
-In the examples below we'll focus on training the "tiny" model on 8 GPUs and we'll assume that you've cloned this repository and are running all of the commands from the repository root,
-whether that be on your laptop, on LUMI, or in a Beaker interactive session on Cirrascale.
-
-#### Running on Cirrascale in a Beaker interactive session
-
-```bash
-run_name=c4-tiny-test-run
-torchrun --nproc-per-node=8 scripts/train.py configs/c4-tiny.yaml \
-  --run_name=${run_name} \
-  --save_folder=/tmp/${run_name}  # change to somewhere permanent for a real run
-```
-
-#### Running on Cirrascale via [beaker-gantry](https://github.com/allenai/beaker-gantry)
-
-Check the script at [`scripts/beaker/olmo-small-ablation-on-gantry.sh`](scripts/beaker/olmo-small-ablation-on-gantry.sh) for an example of how to run a training job on Cirrascale.
-Using that script, you can launch a training job like this:
-
-```bash
-CONFIG_PATH=configs/choose_a_config.yml \
-LOAD_PATH=/optional/path/to/checkpoint/ \
-  bash scripts/beaker/olmo-small-ablation-on-gantry.sh
-```
-
-If `CONFIG_PATH` is not specified, the default config is `configs/olmo-small-ablation.yaml`. If `LOAD_PATH` is not specified, the training will start from scratch.
-
-#### Running on LUMI via Slurm
-
-First read our [LUMI](docs/LUMI.md) documentation, but submitting a new job essentially just boils down to running this:
-
-```bash
-sbatch scripts/lumi/c4-small-on-lumi.sh
-```
-
-### Restarting a training job from a checkpoint
-
-To restart a training job from a previous checkpoint, add the argument `--load_path=/path/to/checkpoint_directory` and re-launch the training run using the same method.
-
-The checkpoints for a run will be located in the run's `--save_folder`. They're always subdirectories of `save_folder` that look like `step1000` for sharded checkpoints or `step1000-unsharded` for unsharded checkpoints.
-There are also symlinks for the latest checkpoints in the form of `latest` and `latest-unsharded` for sharded and unsharded checkpoints, respectively.
-
-Sharded checkpoints are the default type of checkpoint that's saved during training since these are the fastest, but you can also save unsharded checkpoints by setting `--save_interval_unsharded [INT]`.
-
-If you plan to restart a training run using a *different* world size, you can only restart from an *unsharded* checkpoint.
-However, you can convert a sharded checkpoint into an unsharded checkpoint by launching the script [scripts/unshard.sh](./scripts/unshard.sh) in the same way you launched the training script. Note that this needs to be launched with the exact same world size as when the *sharded* checkpoint was saved.
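
For example, restarting the interactive-session run from above might look like this (a sketch; the step number and save folder are placeholders for your own run):

```bash
run_name=c4-tiny-test-run
torchrun --nproc-per-node=8 scripts/train.py configs/c4-tiny.yaml \
  --run_name=${run_name} \
  --save_folder=/tmp/${run_name} \
  --load_path=/tmp/${run_name}/step1000   # or .../latest for the most recent sharded checkpoint
```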
-
-## Finding official runs and checkpoints
-
-We track all of our runs in Weights & Biases under [the "ai2-llm" entity](https://wandb.ai/ai2-llm).
-The corresponding checkpoints are stored in GCS under `gs://ai2-olmo/`.
-For example, checkpoints for the run [https://wandb.ai/ai2-llm/c4-small/runs/euox4j8q](https://wandb.ai/ai2-llm/c4-small/runs/euox4j8q) are located at [gs://ai2-olmo/ai2-llm/c4-small/euox4j8q/](https://console.cloud.google.com/storage/browser/ai2-olmo/ai2-llm/c4-small/euox4j8q).
-
-You can load a checkpoint like this:
-
-```python
-from olmo import OLMo, Tokenizer
-
-checkpoint = "gs://ai2-olmo/ai2-llm/c4-small/euox4j8q/step73000-unsharded"
-model = OLMo.from_checkpoint(checkpoint, device="cuda")
-tokenizer = Tokenizer.from_checkpoint(checkpoint)
-```
-
-### Highlighted checkpoints
-
- * `gs://ai2-olmo/ai2-llm/c4-small/euox4j8q/step73000-unsharded` - 1B parameters, 150B tokens, this is one of our first decent checkpoints at the 1B scale.
-
-## Generating text
-
-You can use the `generate()` method to produce text using beam search with a variety of options.
-
-For example:
-
-```python
-import torch
-
-# Prepare inputs.
-# Note: we don't want the EOS token added to the end of the input, hence
-# the `add_special_tokens=False`.
-input_ids = tokenizer.encode("I'm a large language model, ", add_special_tokens=False)
-# `model.generate()` expects a batch.
-input_tensor = torch.tensor(input_ids).unsqueeze(0)
-
-# Run beam search.
-outputs = model.generate(input_tensor, max_steps=3, beam_size=3)
-
-# The output token IDs are shape (batch_size, beam_size, max_steps).
-best_generation = outputs.token_ids[0][0].tolist()
-print(tokenizer.decode(best_generation))
-```
-
-## Debugging
-
-### Finding the cause of hangs
-
-Hangs in distributed training can be due to several different causes, including
-bad user code, AMD/Nvidia memory-allocation issues, or issues in hardware setup.
-These issues can be difficult to root-cause and even harder to fix.
-
-One approach we use to find the cause of a hang in distributed training is to first identify which processes/nodes are hanging. The [scripts/pyspy_all_processes.sh](https://github.com/allenai/OLMo/blob/main/scripts/pyspy_all_processes.sh) script retrieves the python state of relevant python processes using `py-spy`. A process/node with different state may be experiencing a hang.
-
-If a hang is suspected to be in GPU code, then you can run `gcore <pid>` on a hanging process to get core dumps. Then you can run `gdb <executable> <core dump>` and check where the code is hanging from a C++ perspective. Code being stuck on a GPU memory allocation (malloc) may be indicative of a hardware/setup issue rather than a problem in training code.
-
-### Comparing two models that should be identical
-
-There are some scenarios when one might want to investigate why two models/setups that should be identical are yielding different results. A naive solution is to run both setups side-by-side and compare results manually (and this might not be possible if you have just 1 GPU).
-
-An alternative for comparing OLMo models is to run the training of both models with the `--module_outputs_save_steps=[<step>]` option to save a checkpoint after the original model has loaded but before training has started. Then [scripts/compare_model_state.py](https://github.com/allenai/OLMo/blob/main/scripts/compare_model_state.py) can be used to see if parameters are different between the 2 models.
\ No newline at end of file
diff --git a/docs/TRAINLOG.md b/docs/TRAINLOG.md
deleted file mode 100644
index 65cd4ab8b..000000000
--- a/docs/TRAINLOG.md
+++ /dev/null
@@ -1,259 +0,0 @@
-Experiment Log
-==============
-
-2023-07-12
-----------
-
-For about a week, we have been chasing an issue where our loss curve looks wavy like this:
-(screenshot: wavy loss curve, 2023-07-13)
-
-Our colleagues from MosaicML suggested that our data might not be properly mixed, but we reviewed the code carefully and
-found no problems. However, after exhausting all other possibilities, we had nothing left to go on, so we decided
-to try and graph our batch composition over time. Turns out, there are significant changes in batch composition after all:
-
-![image](https://github.com/allenai/LLM/assets/920638/3362e78e-4554-451e-8a59-a0114a4c4d56)
-
-In this graph, orange is content from Common Crawl, and green is content from The Stack, i.e., code. As you can see, the
-proportion of code changes significantly over time, and if you overlay the graphs, you can see that more code means lower
-loss. So clearly something is up with our shuffling after all.
-
-When we construct batches, we concatenate all content into one giant array of instances (samples), and then shuffle the
-array. We use `torch.randperm()` to shuffle. Long story short, it turns out that `torch.randperm()` does not shuffle very
-well.
-When you graph the index of the instances that end up in our batches over time, you see a very pronounced pattern:
-
-![image](https://github.com/allenai/LLM/assets/920638/39b01f8d-f1db-4485-b339-c20ee423b98a)
-
-While it would be interesting to find out why this happens, we left that as an exercise for the PyTorch team, and
-re-implemented our shuffling code to use NumPy. Now the curve looks like this:
-
-![image](https://github.com/allenai/LLM/assets/920638/192c5790-ab1f-4a3d-8fb6-a9dbc74391e8)
-
-Nice and random!
-
-![image](https://imgs.xkcd.com/comics/random_number.png)
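
The change itself is small. A minimal sketch of the idea (illustrative only, not the actual OLMo data-loader code; the array size and seed are made up):

```python
import numpy as np
import torch

num_instances = 10_000_000  # illustrative; one entry per training instance

# Old approach: a single global permutation from torch, which showed the
# pronounced pattern described above.
order = torch.randperm(num_instances).numpy()

# New approach: a seeded NumPy generator, which gave us a well-mixed ordering.
rng = np.random.default_rng(seed=6198)
order = rng.permutation(num_instances)

# Either way, order[i] is the index of the instance that lands in position i
# of the global batch stream.
```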
-
-
-2023-04-26
-----------
-
-Large node counts mean large failure rates. Yesterday's run ran out of memory at 3am, so we didn't get as many
-batches done as we were hoping. Today's issues:
- * Something seems to be wrong with the super-fast network drive we're using for checkpointing. Writing checkpoints
-   there consistently fails today, even though it worked fine yesterday. We switched to a slower drive for
-   checkpointing for now to make some progress.
- * Occasionally the Slingshot interconnect between the nodes fails with the message "Cassini Event Queue overflow".
-   The workaround is to set a larger event queue size by setting the environment variable
-   `FI_CXI_DEFAULT_CQ_SIZE` to `131072` (or some other large number).
- * A lot of nodes fail to come up when starting a job. There are at least two separate issues that cause this.
-   Low-level failures in RCCL or device drivers do not get logged properly. Instead, they just print to stdout and
-   stderr. We altered our launch script to nevertheless tell us which node is reporting which error
-   (a078ae4686e190dc1e9eb91ab8f434e90d95d152). Using this, we can exclude nodes from the launch every time one
-   errors out during startup. It's a laborious process. Starting a 64-node job will often take 30 minutes or more
-   due to these failures. To get a better handle on the situation we started a spreadsheet that keeps track of the
-   failures. If we start to see a pattern, maybe we can do something more intelligent than excluding nodes one at
-   a time.
-
-Despite all this, the model is now training properly, and at the time of writing we have trained on 7B tokens.
-We even had our first proper loss spike!
-
-(screenshot: loss curve showing our first loss spike)
-
-
-2023-04-25
-----------
-
-The issues with checkpointing have been resolved, and we have a big training run under way. We're using this
-opportunity to track speed vs. number of nodes.
-
-(screenshot: throughput vs. number of nodes)
-
-The results are pretty good. We lose about 20% efficiency to communication overhead, which is acceptable.
-With 64 nodes we no longer have to do gradient accumulation, so it's possible that's why the 64-node configuration
-is actually faster than the 32-node one.
-
-2023-04-24
-----------
-
-We're finding more and more problems with Torch 2. We have to use Torch 2 because some drivers that make our
-hardware work are only available for Torch 2, but it seems really half-baked in its current state. Compounding the
-problems is the fact that we're attempting to use MosaicML's Composer to run the training, but Torch 2 is not
-officially supported by Composer yet. In an effort to not stack two unstable bits of software on top of each other,
-we decided to write our own trainer instead of relying on Composer.
-
-While we're tracking a number of smaller problems around `torch.compile()` and random numbers, the big issue is
-checkpointing. Writing a model that's using Torch's FSDP to disk is surprisingly difficult.
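
For context on why: each rank only holds a shard of the parameters, so even a basic "gather everything and `torch.save` it" checkpoint involves the kind of dance sketched below. This is an illustrative use of the PyTorch FSDP `state_dict_type` API, not the trainer code we ended up writing; `fsdp_model` and the path are placeholders.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import StateDictType, FullStateDictConfig

def save_unsharded_checkpoint(fsdp_model: FSDP, path: str) -> None:
    # Gather the full (unsharded) state dict onto rank 0, offloading to CPU so
    # a large model doesn't blow up GPU memory on that rank.
    cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    with FSDP.state_dict_type(fsdp_model, StateDictType.FULL_STATE_DICT, cfg):
        state_dict = fsdp_model.state_dict()
    if dist.get_rank() == 0:
        torch.save(state_dict, path)
    dist.barrier()  # keep other ranks from racing ahead while rank 0 writes
```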
-
-
-2023-04-18
-----------
-
-While not strictly necessary from a scientific perspective, we thought it might be a good idea to train the
-medium size model to 300B tokens, to shake down the system, and make sure we're on the right track. There was no
-way we could have the data done in time for this run, so we're just training on C4. However, in an effort to make
-this run useful as a baseline, we wanted to have a somewhat reasonable validation set. We chose the recently
-released Red Pajama data, because it is quite close to what we want to do with our own data. The data team is
-working on this right now.
-
-
-2023-04-17
-----------
-
-Our original model settings and sizes came from MosaicML. We found they are a lot smaller than they say they are,
-so we went back to the [PaLM paper](https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html)
-and put in those settings. Most of our architecture choices follow the PaLM paper, so it makes sense to do it with
-the dimensions of the model as well.
-
-Since the new sizes don't follow exactly the 1B/7B/70B scheme, we now call them "small", "medium", and "large".
-The new sizes are as follows:
-
-| Name             | parameters -h | parameters  | non-embedding parameters |
-|------------------|--------------:|------------:|-------------------------:|
-| extra-tiny-debug | 16 M          | 16169216    | 3291392                  |
-| tiny             | 288 M         | 288706560   | 237195264                |
-| small            | 1B            | 1051439104  | 948416512                |
-| medium           | 7.8 B         | 7791353856  | 7585308672               |
-| large            | 60.8 B        | 60818014208 | 60405923840              |
-
-
-2023-04-13
-----------
-
-We've been experimenting with a [triton](https://github.com/openai/triton) implementation of [FlashAttention](https://github.com/HazyResearch/flash-attention/blob/main/flash_attn/flash_attn_triton.py) that supports using an arbitrary attention bias, which would allow us to use [ALiBi](https://www.semanticscholar.org/paper/f5aba74fbd512190ed5f61127618381f70710572).
-Unfortunately it doesn't look like this is going to be a viable option at the moment.
-This particular implementation only works on an older version of triton that uses a CUDA-specific backend.
-Therefore it won't run on AMD GPUs.
-
-We'll revisit this again when there are updates to [HazyResearch/flash-attention](https://github.com/HazyResearch/flash-attention).
-
-Meanwhile, we ran the 70B model for the first time! 32 nodes, 256 GPUs. [Some problems with the latest PyTorch](https://github.com/pytorch/pytorch/issues/97436)
-prevented us from running this at full speed, and still performance was good. We only ran six batches, since this
-was just about validating the setup. We will do longer runs when we have some of the obvious performance problems
-sorted out.
-
-One interesting learning is that even with just 32 nodes, nodes often don't come up cleanly. We have seen GPUs in
-inconsistent states, some nodes not appearing during the rendezvous, and just undiagnosed hangs. To get a handle on
-these problems, we started working on some tooling to diagnose hanging processes across the cluster, all based on
-[py-spy](https://github.com/benfred/py-spy).
-
-
-2023-04-12
-----------
-
-Today we got the [Slingshot Interconnect](https://www.hpe.com/us/en/compute/hpc/slingshot-interconnect.html)
-working. The LUMI cluster uses this style of interconnect to tie the GPUs together with an aggregate of 800GBit/s.
-For a large distributed training job, the speed of this interconnect is absolutely essential.
-Case in point, the 1.2B model, on only two nodes (16 GPUs), went from 7500 tokens/second/GPU to 9500 tokens/second/GPU.
-That is a huge increase in speed!
-
-In the details, getting this to work was all about using the right libraries and making sure they are available to
-the right process at the right time. This is all about setting the right environment variables, setting the right
-flags, and general low-level Linux stuff. It's not the sexy part of training large language models.
-
-
-2023-04-11
-----------
-
-PyTorch 2.0 came with a new feature: `torch.compile()`. It promises massive speedups if you set up your model
-right. We intend to take advantage, and got it working with NVidia hardware, but it was a bit harder to make it
-work on AMD hardware as well. With the help of AMD engineers we figured it out today, and immediately saw a 15%
-speedup on the 1.2B model! 15% is hard to come by, so this is a big success.
-
-
-2023-04-10
-----------
-
-Today we got the 7B model running on LUMI for the first time. This ran on 8 nodes, 64 GPUs. We're missing a lot of
-tricks to make it fast, and yet we saw 1200 tokens/second/GPU. That's pretty good!
-
-It took a long time to get this working, mainly due to an issue that had nothing to do with the 7B config. ALiBi
-attention uses some constant static tensors for its calculations. We don't want to recompute these for every batch,
-so we keep them in torch "buffers", which are like parameters, except they don't receive gradient updates.
-These buffers are in bf16 format, and contain a lot of `-inf`, so right off the bat they are exploring a lot of
-edge cases. Also, [torch buffers are poorly defined in the context of FSDP](https://github.com/pytorch/pytorch/blob/4d3d3317ebd1c57a28754281d91ed7f35f4ce320/torch/distributed/fsdp/_init_utils.py#L257),
-and some of the operations that FSDP does on them result in `NaN`. The solution is to [store these tensors in a
-different way so that FSDP does not interfere with them](https://github.com/allenai/LLM/pull/90/files#diff-ef8ab7279deeec716e70a1cc9ab2accaaa60f27b301cc0733f1e00a9e39c07d1).
-
-
-2023-04-06
-----------
-
-The LUMI cluster is back from maintenance! This is one day earlier than expected. LUMI software has been updated
-as part of the maintenance, and it broke our launch scripts. On these compute nodes, the topology is such that
-every GPU has a fast connection to 8 CPUs, and a somewhat slower connection to the others. So we use "CPU binding"
-to make sure the right CPUs are paired up with the right GPUs. This part of the system broke, so for now we're
-running without it. We never benchmarked its impact, so it's not clear how important it really is. The bandwidth
-from CPU to GPU isn't the biggest bottleneck anyways, so it's probably not a problem.
-
-2023-04-03
-----------
-
-We added the option to decouple the MLP and Attention computations as in the PaLM architecture.
-That is, within each transformer block we compute `MLP(LN(x)) + Attention(LN(x))` instead of `MLP(LN(x + Attention(LN(x))))` (ignoring some skip connections).
-This allows us to increase throughput because we can fuse the separate feed-forward and attention input projections into a single linear layer.
-We also experimented with [fusing the output projections](https://github.com/allenai/LLM/pull/79) into a single linear layer but that didn't help, possibly due to the overhead of concatenating the feed-forward and attention activations together.
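
Schematically, the two block layouts look like this (a sketch with stand-in `attention` and `mlp` modules, not the actual OLMo classes, and with the fused input projection itself omitted):

```python
import torch.nn as nn

class SequentialBlock(nn.Module):
    """Standard layout: MLP(LN(x + Attention(LN(x)))), with the usual skip connections."""
    def __init__(self, d_model: int, attention: nn.Module, mlp: nn.Module):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.attention, self.mlp = attention, mlp

    def forward(self, x):
        x = x + self.attention(self.ln1(x))
        return x + self.mlp(self.ln2(x))

class ParallelBlock(nn.Module):
    """PaLM-style layout: x + Attention(LN(x)) + MLP(LN(x)).

    Both branches read the same LN(x), so their input projections can be fused
    into a single linear layer, which is where the throughput win comes from.
    """
    def __init__(self, d_model: int, attention: nn.Module, mlp: nn.Module):
        super().__init__()
        self.ln = nn.LayerNorm(d_model)
        self.attention, self.mlp = attention, mlp

    def forward(self, x):
        h = self.ln(x)
        return x + self.attention(h) + self.mlp(h)
```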
-
-2023-04-02
-----------
-
-First training run! We trained a 300M model on about 70B tokens from C4.
-The purpose of this model is to give the other LLM teams something in our format that's not completely random,
-so they can test their evaluation and inference code.
-
-This ran on a single node only on AMD's cluster.
-On AMD hardware we're still missing Flash Attention, and we could not get `torch.compile()` to work in time for the run.
-Both are expected to provide significant speedups.
-This training run used model settings that are optimal for compiled models, despite not being able to compile,
-because we want it to be a representative model for the downstream evaluations.
-
-
-2023-03-28
-----------
-
-We've investigated a number of ways to optimize training throughput in terms of tokens per second and MFU (model flop utilization). This is a list of all of the optimizations that have worked so far, ranked by how much of a speedup they gave on a 1.2b param model:
-
-1. Using FlashAttention via PyTorch's built-in `scaled_dot_product_attention` function. This resulted in a ~12% speedup over the default attention implementation while also reducing GPU memory utilization (see the sketch at the end of this entry).
-   - Unfortunately ALiBi can't be used with FlashAttention at the moment, so the best option if we want to use relative positional encodings is probably RoPE (which can be used with FlashAttention). In general RoPE is slower than ALiBi but when combined with FlashAttention it's faster. Of course ALiBi + FlashAttention would be ideal.
-
-1. Setting embedding/vocab size to a multiple of 128. E.g. the actual vocab size is 50257, but we force the embedding size to be 50304. This resulted in an ~11% speedup.
-1. Using low-precision LayerNorm when **not** using `torch.compile()`. This resulted in a speedup of ~10%, but it actually slows throughput when using a compiled model. This probably has to do with manually casting tensors to different data types, which causes more breaks in the graph.
-1. Compiling the model via `torch.compile()` with the default mode. This resulted in a ~7% speedup without increasing (and in some cases decreasing) GPU memory utilization.
-   - The other compile modes ("reduce-overhead" and "max-autotune") were not as fast and required substantially more GPU memory.
-   - Compiling as a "fullgraph" also improves throughput even further except when using FSDP since FSDP forces breaks in the graph.
-1. Tweaking the FSDP settings to use "PURE" mixed precision, limit all gathers, and use non-reentrant activation checkpointing resulted in a 1-2% speedup.
-
-Using the best compatible combination of the above settings (so everything except #3) gets us close to 60% MFU with the 1.2b model. That's really good!
-
-For more details, see:
-- [Benchmarking the performance of PyTorch's new `compile()` and built-in FlashAttention.](https://wandb.ai/ai2-llm/petew-torch2-benchmarks/reports/PyTorch-2-0-benchmarks--VmlldzozODQyMDY5?accessToken=2fh801xe265n5xx7juphb1xnx8itvls8g7nrqsjdd4ja0xlks7kaozue94z2mez3)
-- [Benchmarking the cost of using RoPE](https://wandb.ai/ai2-llm/rope-benchmarks/reports/Benchmarking-RoPE--VmlldzozODQ1MjMz)
-- [Benchmarking the performance of `compile()` with FSDP](https://wandb.ai/ai2-llm/fsdp-compile-benchmarks)
-- [Benchmarking low precision LayerNorm](https://api.wandb.ai/links/ai2-llm/9favfpnh)
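
A minimal sketch of optimization #1 (illustrative shapes and dtypes; this is not the OLMo attention module): `torch.nn.functional.scaled_dot_product_attention` can dispatch to a FlashAttention kernel when the inputs allow it.

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
batch, n_heads, seq_len, head_dim = 2, 16, 2048, 64
q = torch.randn(batch, n_heads, seq_len, head_dim, device=device, dtype=torch.bfloat16)
k, v = torch.randn_like(q), torch.randn_like(q)

# With a causal flag and no explicit attention-bias tensor, PyTorch is free to
# pick the fused FlashAttention backend. Passing a bias tensor (e.g. for ALiBi)
# forces a slower fallback, which is why the entry above pairs FlashAttention
# with RoPE instead.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```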
-
-
-2023-03-15
-----------
-
-The cluster is down for maintenance, so we're just queueing up some features we want to run. We also used the
-LUMI downtime to build a better logging feature. When running 1000s of nodes in a cluster, it's difficult to get
-logs that make sense. We're sending our logs to the third-party logging provider [logz.io](https://logz.io). It's
-basic, but it gets the job done.
-
-
-2023-03-14
-----------
-
-This is the first day with some experiments that were serious enough to mention here.
-Experiments were all on 1.3B models, with increasing microbatch size, just to shake down the system.
-All runs happened on the `small-g` partition, which is for debugging only.
-I reserved only one node each time.
-Runtimes are limited to 15 minutes, which is too short for performance to stabilize, but good enough to get an idea.
-Findings:
- * WandB integration works.
- * Launching nodes with slurm works.
-   We let slurm launch everything, but there is a different school of thought that says that slurm should just launch one process on each node, and then you use `torch.distributed` to spread out on the node.
-   I'm not sure what that buys us, and it's one extra component in the mix, so I didn't do it that way.
- * Automatic restarts work. One run got killed and automatically restarted.
-   It is great that restarts work, but somewhat worrisome that we're already sampling this behavior after less than 45 minutes of runtime on only one node.