Commit

Merge branch 'main' into llama3_converter

ischlag authored Jul 2, 2024
2 parents 3e169c5 + e3ec5e3 commit eb68e41
Showing 27 changed files with 1,315 additions and 243 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/fa2_unit_tests.yaml
@@ -39,7 +39,7 @@ jobs:
python -c "import torch; print('torch:', torch.__version__, torch)"
python -c "import torch; print('CUDA available:', torch.cuda.is_available())"
- name: Instal nanotron
- name: Install nanotron
run: |
python -m pip install --upgrade pip
pip install packaging
@@ -55,4 +55,4 @@ jobs:
- name: Run tests
# NOTE: -m fa2 will only run the unit tests that have the mark
# "fa2" (these are FA2-related tests)
run: pytest -m fa2 --color=yes --durations=0 --ignore tests/fp8 --verbose tests/
run: pytest -m fa2 --color=yes --durations=0 --ignore tests/fp8 --ignore tests/nanoset --verbose tests/
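For reference on how the `fa2` mark above selects tests: pytest collects only functions decorated with that marker when `-m fa2` is passed. A minimal illustrative example (the test name and body below are hypothetical, not taken from the repository):

```python
import pytest
import torch


@pytest.mark.fa2  # selected by `pytest -m fa2`, skipped by other marker expressions
def test_fa2_placeholder_shapes():
    # Placeholder check; the real FA2 tests exercise flash-attention kernels.
    assert torch.zeros(2, 4).shape == (2, 4)
```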
15 changes: 15 additions & 0 deletions .github/workflows/trufflehog.yml
@@ -0,0 +1,15 @@
on:
push:

name: Secret Leaks

jobs:
trufflehog:
runs-on: ubuntu-latest
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
fetch-depth: 0
- name: Secret Scanning
uses: trufflesecurity/trufflehog@main
6 changes: 6 additions & 0 deletions Makefile
@@ -14,3 +14,9 @@ test:
--ignore tests/fp8 \
--verbose \
examples/doremi/tests/

pip install -r examples/llama/requirements.txt
pytest \
--color=yes \
--verbose \
examples/llama/tests/
60 changes: 32 additions & 28 deletions docs/nanoset.md
@@ -1,49 +1,50 @@
# Nanosets
Nanotron incorporates [`Nanosets`](../src/nanotron/data/nanoset.py), a kind of datasets based on [numpy memory-mapped arrays](https://numpy.org/doc/stable/reference/generated/numpy.memmap.html). `Nanosets` are capable of serving batches from files containing pre-tokenized datasets. They allow reading tokens from one or multiple datasets and even specifying the weight of each dataset when building batches.
Nanotron incorporates [`Nanosets`](../src/nanotron/data/nanoset.py), a dataset for processing tokenized documents with [`datatrove`](https://github.com/huggingface/datatrove). They allow reading tokens from one or multiple datasets and even specifying the weight of each dataset when building batches.
## Install
To use `Nanosets`, it's necessary to install Nanotron with the `nanosets` flavor.
```
pip install -e '.[nanosets]'
pip install nanotron[nanosets]
```
This will install the following dependencies:
- `transformers`: To tokenize the datasets
- `datasets`: To preprocess the datasets
- `datatrove`: To preprocess the datasets
- `numba`: To compile helper functions in order to speed up the creation of `Nanosets`
- `transformers`: For the tokenizers
## Data pre-processing
To use these datasets, first, we need to preprocess the data. The input format can either be a column of a Hugging Face Dataset or a .json file containing a text sample per line. For example:
To use this dataset, we first need to preprocess the data using `datatrove`'s `DocumentTokenizer` pipeline. We invite you to take a look at `datatrove`, since it contains multiple features that allow, for example, filtering out documents based on specific rules/criteria, extracting text content from raw formats, or scheduling the preprocessing on a Slurm cluster. We have also added a simple script capable of tokenizing datasets.

<pre>
{"src": "www.nvidia.com", "text": "The quick brown fox", "type": "Eng", "id": "0", "title": "First Part"}
{"src": "The Internet", "text": "jumps over the lazy dog", "type": "Eng", "id": "42", "title": "Second Part"}
</pre>

The preprocessing is done using the [`tools/preprocess_data.py`](../tools/preprocess_data.py) script. Below we show an example for processing a corpus with the Llama2 tokenizer.
The preprocessing is done using the [`tools/preprocess_data.py`](../tools/preprocess_data.py) script. The input format can either be a Hugging Face Dataset, a path to a `.jsonl` file, or a path to a folder containing multiple `.jsonl` files. Below we show an example for processing a Hugging Face Dataset from the Hub with the Llama3 tokenizer.

<pre>
torchrun --nproc-per-node 16 tools/preprocess_data.py \
--input HuggingFaceH4/testing_alpaca_small \
--split train \
--column completion \
--output-prefix datasets/testing_alpaca_small \
--tokenizer-name-or-path openai-community/gpt2
python3 tools/preprocess_data.py \
--tokenizer-name-or-path meta-llama/Meta-Llama-3-8B \
--output-folder datasets/emotion \
--n-tasks 16 \
hf \
    --dataset dair-ai/emotion
</pre>

The preprocessing script has to be launched with `torchrun` in order to spawn `--nproc-per-node` workers that will preprocess the dataset concurrently. The `--input` dataset can be either a Hugging Face Dataset from the Hub or a `.json` file. The processed dataset will be stored in *`--output-prefix`_input_ids.npy*. In `--tokenizer-name-or-path`, we will have to specify a tokenizer in the same way as we do when using `AutoTokenizers.from_pretrained(...)`.
First, with `--tokenizer-name-or-path` we specify a tokenizer in the same way as we do when using `AutoTokenizer.from_pretrained(...)`. Then we specify the `--output-folder` where we will store the tokenized documents and the number of workers with `--n-tasks`. Finally, we indicate the type of dataset (whether it's a Hugging Face Dataset ["**hf**"] or in jsonl ["**jsonl**"] format) and the dataset that we want to preprocess. Check the different settings with `python3 tools/preprocess_data.py --help`, `python3 tools/preprocess_data.py hf --help` & `python3 tools/preprocess_data.py jsonl --help`.

The output will be one file named, in this case, `datasets/testing_alpaca_small_input_ids.npy`. We will then have to specify this file in the `dataset_path` field in the config file.
Every worker will store in `--output-folder` 3 different kinds of files:
- `*.ds` Containing the tokenized documents
- `*.ds.index` Containing the bounds of each tokenized document
- `*.ds.metadata` Containing the number of tokens and tokenizer used

> [!IMPORTANT]
Remember to specify the type of dataset to process, e.g. python3 tools/preprocess_data.py --tokenizer-name-or-path gpt2 --n-tasks 16 **jsonl** --dataset raw_datasets/c4-es-json-files
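As a quick sanity check after preprocessing, you can list the files each worker produced. A small illustrative snippet (the folder path is just the example used above):

```python
from pathlib import Path

output_folder = Path("datasets/emotion")  # example --output-folder from the command above
for pattern in ("*.ds", "*.ds.index", "*.ds.metadata"):
    for file in sorted(output_folder.glob(pattern)):
        print(file.name, file.stat().st_size, "bytes")
```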

## Working with Nanosets

To work with `Nanosets`, we just need to configure 1 argument:
1. `dataset_path`: This argument specifies the file or files that will compose the `Nanoset`. There are 3 ways to specify it:
1. `dataset_folder`: This argument specifies the folder or folders that will compose the `Nanoset`. There are 3 ways to specify it:
    1. If we specify a single path, we will create a `Nanoset` from a single dataset folder.
```yaml
data_stages:
- name: General purpose training (Single dataset)
start_training_step: 1
data:
dataset:
dataset_path: datasets/SlimPajama-6B_input_ids.npy
dataset_folder: datasets/SlimPajama-6B
num_loading_workers: 0
seed: 1234
```
@@ -54,9 +55,9 @@ To work with `Nanosets`, we just need to configure 1 argument:
start_training_step: 15
data:
dataset:
dataset_path:
- datasets/SlimPajama-6B_input_ids.npy
- datasets/testing_alpaca_small_input_ids.npy
dataset_folder:
- datasets/SlimPajama-6B
- datasets/testing_alpaca_small
num_loading_workers: 0
seed: 1234
```
@@ -67,9 +68,9 @@ To work with `Nanosets`, we just need to configure 1 argument:
start_training_step: 25
data:
dataset:
dataset_path:
datasets/SlimPajama-6B_input_ids.npy: 0.8
datasets/testing_alpaca_small_input_ids.npy: 0.2
dataset_folder:
datasets/SlimPajama-6B: 0.8
datasets/testing_alpaca_small: 0.2
num_loading_workers: 0
seed: 1234
```
@@ -82,7 +83,10 @@ torchrun --nproc-per-node 8 run_train.py --config configs/config_nanoset.yaml
```

## Under the hood
`Nanosets` are responsible of building samples of `sequence length + 1` tokens from the preprocessed dataset files. The `dataset lengths` of each dataset will be determined by the `(dataset_number_of_tokens - 1) / sequence length`, discarding the last sample if its length < `sequence length`.
`Nanosets` are responsible for building samples of `sequence length + 1` tokens from the preprocessed dataset files. Although most of the extraction logic lies in `DatatroveFolderDataset`, `Nanosets` take care of the following:
1. Creating dataset mixtures from different dataset folder paths
2. Ensuring that in each epoch, we consume each sample only once
3. Ensuring that we never exhaust the `DataLoader`

Based on the `dataset lengths`, the `dataset weights` and the `number of samples per epoch` (defined as the `sum(dataset lengths)`), we build the two indexes we need in order to extract samples from the `Nanoset` ([build_nanoset_index_helper](../src/nanotron/data/nanoset.py)):
- `dataset index`: Contains the index of the dataset from the list of `dataset paths` from which to extract the sample, respecting the established dataset weight.
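To make the index construction concrete, here is a minimal, illustrative sketch of how a `dataset index` and `dataset sample index` could be derived from the dataset lengths and weights. It is not the actual `build_nanoset_index_helper` (which is Numba-compiled); the function name and the random-choice strategy below are assumptions for illustration only:

```python
import numpy as np


def build_weighted_index_sketch(dataset_lengths, dataset_weights, n_samples, seed=1234):
    """Illustrative only: assign each global sample to a dataset according to its
    weight, then walk through that dataset's samples so each one is consumed once
    before wrapping around."""
    rng = np.random.default_rng(seed)
    weights = np.asarray(dataset_weights, dtype=np.float64)
    weights /= weights.sum()

    # Which dataset each global sample is drawn from, respecting the weights.
    dataset_index = rng.choice(len(dataset_lengths), size=n_samples, p=weights)

    # Position of each sample inside its dataset (wraps around per dataset).
    counters = np.zeros(len(dataset_lengths), dtype=np.int64)
    dataset_sample_index = np.empty(n_samples, dtype=np.int64)
    for i, d in enumerate(dataset_index):
        dataset_sample_index[i] = counters[d] % dataset_lengths[d]
        counters[d] += 1
    return dataset_index, dataset_sample_index


# Example: blend a 1000-sample and a 250-sample dataset with weights 0.8 / 0.2.
ds_index, ds_sample_index = build_weighted_index_sketch([1000, 250], [0.8, 0.2], n_samples=1250)
```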
24 changes: 12 additions & 12 deletions examples/config_nanoset.yaml
@@ -7,25 +7,25 @@ checkpoints:
data_stages:
- data:
dataset:
dataset_path: datasets/testing_alpaca_small_input_ids.npy
dataset_folder: datasets/c4-es/tokenized
num_loading_workers: 1
seed: 42
name: General purpose training (Single dataset)
start_training_step: 1
- data:
dataset:
dataset_path:
- datasets/yelp_review_full_input_ids.npy
- datasets/testing_alpaca_small_input_ids.npy
dataset_folder:
- datasets/SlimPajama-6B/tokenized
- datasets/c4-es/tokenized
num_loading_workers: 1
seed: 42
name: Second purpose training (> 1 dataset)
start_training_step: 15
- data:
dataset:
dataset_path:
datasets/testing_alpaca_small_input_ids.npy: 0.8
datasets/yelp_review_full_input_ids.npy: 0.2
dataset_folder:
datasets/SlimPajama-6B/tokenized: 0.8
datasets/c4-es/tokenized: 0.2
num_loading_workers: 1
seed: 42
name: Third purpose training (Blended dataset)
@@ -57,7 +57,7 @@ model:
initializer_range: 0.02
intermediate_size: 64
is_llama_config: true
max_position_embeddings: 256
max_position_embeddings: 1024
num_attention_heads: 4
num_hidden_layers: 2
num_key_value_heads: 4
@@ -67,7 +67,7 @@ model:
rope_scaling: null
tie_word_embeddings: true
use_cache: true
vocab_size: 32000
vocab_size: 50257
optimizer:
accumulate_grad_in_fp32: true
clip_grad: 1.0
@@ -88,11 +88,11 @@ optimizer:
weight_decay: 0.01
zero_stage: 0
parallelism:
dp: 2
dp: 1
expert_parallel_size: 1
pp: 1
pp_engine: 1f1b
tp: 2
tp: 1
tp_linear_async_communication: true
tp_mode: REDUCE_SCATTER
profiler: null
@@ -105,6 +105,6 @@ tokens:
limit_test_batches: 0
limit_val_batches: 0
micro_batch_size: 2
sequence_length: 128
sequence_length: 1024
train_steps: 200
val_check_interval: -1
17 changes: 17 additions & 0 deletions examples/llama/README.md
@@ -0,0 +1,17 @@
## Debugging the tests with vscode

To debug the tests with vscode, add the following json to your `launch.json` file.

```
{
"name": "Test conversion",
"type": "python",
"request": "launch",
"module": "pytest",
"console": "integratedTerminal",
"args": [
"examples/llama/tests"
],
"justMyCode": false
}
```
Empty file added examples/llama/__init__.py
Empty file.
119 changes: 119 additions & 0 deletions examples/llama/convert_hf_to_nanotron.py
@@ -0,0 +1,119 @@
"""
Converts a HF model to nanotron format
Command:
torchrun --nproc_per_node=1 convert_hf_to_nanotron.py --checkpoint_path=hf_weights --save_path=nanotron_weights
"""

import dataclasses
import json
from argparse import ArgumentParser
from pathlib import Path

import nanotron
import torch
from convert_weights import get_config_mapping, get_weight_mapping, load_nanotron_model
from nanotron.config import LlamaConfig as NanotronLlamaConfig
from nanotron.models.llama import LlamaForTraining
from transformers import LlamaConfig as HFLlamaConfig
from transformers import LlamaForCausalLM


def _handle_attention_block(
q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, n_q_heads: int, n_kv_heads: int, d_qk: int
) -> torch.Tensor:
# Huggingface Llama separates the q, k, v weights (as opposed to nanotron).
    # Furthermore, the rotary embeddings in nanotron expect interleaved pairs of even
    # and odd dimensions (GPT-J style), while the huggingface implementation expects
    # the whole 1st half and then the whole 2nd half (GPT-NeoX style; for more information
# see flash_attn.layers.rotary.RotaryEmbedding).
# This function handles the concatenation of the q, k, v weights and proper permutation
# to ensure correct transformation.
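    # For example (illustrative): with d_qk = 4, a single head's rows [0, 1, 2, 3]
    # stored NeoX-style (first half, then second half) are permuted to [0, 2, 1, 3],
    # i.e. each pair (i, i + d_qk // 2) ends up adjacent in the interleaved layout.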

def interleave(w: torch.Tensor):
w_new = []
for head_w in w.split(d_qk):
head_w = head_w.view(2, d_qk // 2, -1).transpose(0, 1).reshape(d_qk, -1)
w_new.append(head_w)
return torch.cat(w_new)

q = interleave(q)
k = interleave(k)
return torch.cat([q, k, v])


def convert_hf_to_nt(model_hf: LlamaForCausalLM, model_nt: LlamaForTraining, config: NanotronLlamaConfig):
"""Converts the weights from the model_hf to model_nt, making modifications
in-place."""

hf_sd = model_hf.state_dict()
nt_to_hf = get_weight_mapping(config, nt_to_hf=True)

for module_name_nt, module_nt in model_nt.named_modules():
for param_name_nt, param_nt in module_nt.named_parameters(recurse=False):
            # In the case of qkv_proj, the nt_to_hf has exactly three keys, corresponding
# to q, k, v.
if "qkv_proj" in module_name_nt:
key_k, key_q, key_v = sorted(nt_to_hf[f"{module_name_nt}.{param_name_nt}"])
q = hf_sd[key_q]
k = hf_sd[key_k]
v = hf_sd[key_v]
param = _handle_attention_block(
q,
k,
v,
config.num_attention_heads,
config.num_key_value_heads,
config.hidden_size // config.num_attention_heads,
)
            # In the case of gate_up_proj, nt_to_hf has exactly two keys, corresponding to gate and up.
elif "gate_up_proj" in module_name_nt:
key_gate, key_up = sorted(nt_to_hf[f"{module_name_nt}.{param_name_nt}"])
gate = hf_sd[key_gate]
up = hf_sd[key_up]
param = torch.cat([gate, up])
# All other cases are simple 1-to-1 correspondence.
else:
hf_key = nt_to_hf[f"{module_name_nt}.{param_name_nt}"]
param = hf_sd[hf_key]

with torch.no_grad():
param_nt.copy_(param)


def get_nanotron_config(config: HFLlamaConfig) -> NanotronLlamaConfig:
"""Converts a huggingface configuration to nanotron configuration."""
attrs = {key: getattr(config, value) for key, value in get_config_mapping(nt_to_hf=True).items()}
return NanotronLlamaConfig(**attrs)


def convert_checkpoint_and_save(checkpoint_path: Path, save_path: Path):
"""Loads the huggingface checkpoint in `checkpoint_path`, creates
a new nanotron instance, copies the weights from the huggingface checkpoint
and saves the transformed nanotron to `save_path`."""

# Load huggingface.
hf_model = LlamaForCausalLM.from_pretrained(checkpoint_path)

# Init nanotron model.
model_config = get_nanotron_config(hf_model.config)
nanotron_model = load_nanotron_model(model_config=model_config)

# Copy weights and save model.
parallel_context = nanotron.parallel.ParallelContext(
data_parallel_size=1, pipeline_parallel_size=1, tensor_parallel_size=1
)
convert_hf_to_nt(hf_model, nanotron_model, model_config)
nanotron.serialize.save_weights(model=nanotron_model, parallel_context=parallel_context, root_folder=save_path)
with open(save_path / "model_config.json", "w+") as f:
json.dump(dataclasses.asdict(model_config), f)
print(f"Model saved to {save_path}")


if __name__ == "__main__":
parser = ArgumentParser(description="Convert HF weights to nanotron format")
parser.add_argument("--checkpoint_path", type=Path, default="llama-7b", help="Path to the checkpoint")
parser.add_argument("--save_path", type=Path, default="llama-7b-hf", help="Path to save the nanotron model")
args = parser.parse_args()

# Convert HF model to nanotron format.
convert_checkpoint_and_save(checkpoint_path=args.checkpoint_path, save_path=args.save_path)