diff --git a/README.md b/README.md
index f0fe81d3e..8e858c307 100644
--- a/README.md
+++ b/README.md
@@ -86,76 +86,6 @@ olmo = AutoModelForCausalLM.from_pretrained("allenai/OLMo-2-1124-7B", torch_dtyp
 The quantized model is sensitive to input types and CUDA handling. To avoid potential issues, we recommend explicitly converting input IDs to CUDA using: `inputs.input_ids.to('cuda')`
-## Reproducibility
-
-### Training
-
-Install required packages:
-```bash
-pip3 install ai2-olmo wandb datasets torchmetrics scikit-learn
-```
-
-### Inspecting training data
-
-Find the data order file URL in the [Models Overview](#models-overview) table. For example, the OLMo-7B model's first epoch data order file is located at [https://olmo-checkpoints.org/ai2-llm/olmo-medium/wvc30anm/train_data/global_indices.npy](https://olmo-checkpoints.org/ai2-llm/olmo-medium/wvc30anm/train_data/global_indices.npy).
-Once you have that, you can use this snippet to inspect the data within a particular batch:
-
-```python
-import numpy as np
-from cached_path import cached_path
-
-from olmo.config import TrainConfig
-from olmo.data import build_memmap_dataset
-
-# Update these paths to what you want:
-data_order_file_path = cached_path("https://olmo-checkpoints.org/ai2-llm/olmo-medium/wvc30anm/train_data/global_indices.npy")
-train_config_path = "configs/official/OLMo-7B.yaml"
-
-cfg = TrainConfig.load(train_config_path)
-dataset = build_memmap_dataset(cfg, cfg.data)
-batch_size = cfg.global_train_batch_size
-# The data order file maps global training order to dataset instance indices.
-global_indices = np.memmap(data_order_file_path, mode="r", dtype=np.uint32)
-
-
-def get_batch_instances(batch_idx: int) -> list[list[int]]:
-    batch_start = batch_idx * batch_size
-    batch_end = (batch_idx + 1) * batch_size
-    batch_indices = global_indices[batch_start:batch_end]
-    batch_instances = []
-    for index in batch_indices:
-        token_ids = dataset[index]["input_ids"].tolist()
-        batch_instances.append(token_ids)
-    return batch_instances
-
-
-# Get all 2048 x 2048 token IDs in the first batch.
-get_batch_instances(0)
-```
-
-
-## Fine-tuning
-
-To fine-tune an OLMo model using our trainer you'll first need to prepare your dataset by tokenizing it and saving the token IDs to a flat numpy memory-mapped array. See [`scripts/prepare_tulu_data.py`](./scripts/prepare_tulu_data.py) for an example with the Tulu V2 dataset, which can be easily modified for other datasets.
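-
-If you just want to see the expected shape of that data, here is a minimal sketch (not the `prepare_tulu_data.py` script itself) of writing pre-tokenized examples to a flat memory-mapped array. The helper name `write_token_ids`, the toy example data, and the `uint16` dtype are illustrative assumptions, not part of the official tooling:
-
-```python
-import numpy as np
-
-
-def write_token_ids(token_id_lists: list[list[int]], output_path: str = "input_ids.npy") -> None:
-    # One flat array holding every token ID, with examples concatenated back to back.
-    total_tokens = sum(len(ids) for ids in token_id_lists)
-    # uint16 assumes the tokenizer vocabulary fits in 65535 IDs; use a wider dtype otherwise.
-    memmap = np.memmap(output_path, mode="w+", dtype=np.uint16, shape=(total_tokens,))
-    offset = 0
-    for ids in token_id_lists:
-        memmap[offset : offset + len(ids)] = ids
-        offset += len(ids)
-    memmap.flush()
-
-
-# Toy usage with two pre-tokenized "examples":
-write_token_ids([[101, 7592, 102], [101, 2088, 102]])
-```
-
-If you need loss masking, a `label_mask.npy` file (one boolean flag per token) can be written the same way.
-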
-Next, prepare your training config. There are many examples in the [`configs/`](https://github.com/allenai/OLMo/blob/main/configs) directory that you can use as a starting point. The most important thing is to make sure the model parameters (the `model` field in the config) match up with the checkpoint you're starting from. To be safe you can always start from the config that comes with the model checkpoint. At a minimum you'll need to make the following changes to the config, or provide the corresponding overrides from the command line:
-
-- Update `load_path` to point to the checkpoint you want to start from.
-- Set `reset_trainer_state` to `true`.
-- Update `data.paths` to point to the `input_ids.npy` file you generated.
-- Optionally update `data.label_mask_paths` to point to the `label_mask.npy` file you generated, if you need special masking for the loss.
-- Update `evaluators` to add/remove in-loop evaluations.
-
-Once you're satisfied with your training config, you can launch the training job via `torchrun`. For example:
-
-```
-torchrun --nproc_per_node=8 scripts/train.py {path_to_train_config} \
-  --data.paths=[{path_to_data}/input_ids.npy] \
-  --data.label_mask_paths=[{path_to_data}/label_mask.npy] \
-  --load_path={path_to_checkpoint} \
-  --reset_trainer_state
-```
-
-Note: passing CLI overrides like `--reset_trainer_state` is only necessary if you didn't update the corresponding fields in your config.
-
 ## Evaluation
 
 Additional tools for evaluating OLMo models are available at the [OLMo Eval](https://github.com/allenai/ai2-olmo-eval) repo.