Merge pull request #162 from huggingface/nouamane/custom-dl
Support custom dataloader
Showing 6 changed files with 373 additions and 4 deletions.
@@ -0,0 +1,39 @@
# Use a custom dataloader with Nanotron

This example shows how to use a custom dataloader with Nanotron. We will use a simple dataloader that loads a random tokenized dataset and feeds it to a Nanotron model.

https://github.com/huggingface/nanotron/blob/2e21db0db46a40bedbd03714616dd0ae4ea75914/examples/custom-dataloader/run_train.py#L72-L84
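
For intuition, the rough shape of such a dataloader in plain PyTorch looks like the sketch below. This is not the actual `run_train.py` code; `RandomTokenDataset`, `VOCAB_SIZE`, and `SEQ_LEN` are illustrative placeholders whose values mirror the tiny config in this example:

```python
import torch
from torch.utils.data import DataLoader, Dataset

VOCAB_SIZE = 256  # mirrors model_config.vocab_size in the config below
SEQ_LEN = 256     # mirrors tokens.sequence_length in the config below


class RandomTokenDataset(Dataset):
    """Stand-in for real tokenized data: yields dicts of random `input_ids`."""

    def __init__(self, num_samples: int = 10_000):
        self.num_samples = num_samples

    def __len__(self) -> int:
        return self.num_samples

    def __getitem__(self, idx: int) -> dict:
        return {"input_ids": torch.randint(0, VOCAB_SIZE, (SEQ_LEN,))}


# In the real example, a collator (see `DataCollatorForCLM` below) turns these
# samples into exactly the tensors the model expects on each pipeline rank.
dataloader = DataLoader(RandomTokenDataset(), batch_size=2, num_workers=1)
```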

`DataCollatorForCLM` is a custom data collator that takes a list of `input_ids` and returns a dictionary with the `input_ids` and the `labels` on the ranks that need them. For example, `input_ids` are only needed on the first PP rank, while `labels` are only needed on the last PP rank.
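
As a rough illustration of the idea (not nanotron's actual implementation, which also routes each tensor only to the pipeline ranks that need it), a simplified CLM collator could look like the sketch below, using `-100` as a commonly used ignore index for the final position:

```python
from typing import Dict, List

import torch


def simple_clm_collate(examples: List[Dict[str, torch.Tensor]]) -> Dict[str, torch.Tensor]:
    # Stack samples into a (batch, seq_len) tensor of token ids.
    input_ids = torch.stack([ex["input_ids"] for ex in examples])
    # The label at position t is the token at position t + 1 (next-token prediction).
    labels = torch.roll(input_ids, shifts=-1, dims=1)
    labels[:, -1] = -100  # no next token for the last position; -100 is a common ignore index
    return {"input_ids": input_ids, "labels": labels}
```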

To test it out, set your config to include the following (see [config_custom_dl.yaml](config_custom_dl.yaml) for a full example):

```yaml
- data:
    dataset: null # Custom dataloader will be used
    num_loading_workers: 1
    seed: 42
  name: Stable Training Stage
  start_training_step: 1
```

To try it out, you can run the following command:

```bash
export CUDA_DEVICE_MAX_CONNECTIONS=1 # important for some distributed operations
torchrun --nproc_per_node=2 examples/custom-dataloader/run_train.py --config-file examples/custom-dataloader/config_custom_dl.yaml
```

## Troubleshooting

### `return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)`
```
  File "/fsx/nouamane/projects/nanotron/src/nanotron/parallel/tensor_parallel/nn.py", line 284, in forward
    out = super().forward(masked_input)
  File "/fsx/nouamane/miniconda/envs/2-1-cu121/lib/python3.10/site-packages/torch/nn/modules/sparse.py", line 162, in forward
    return F.embedding(
  File "/fsx/nouamane/miniconda/envs/2-1-cu121/lib/python3.10/site-packages/torch/nn/functional.py", line 2233, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```

If you encounter an error with `torch.embedding`, it's likely you're feeding a token id that is larger than the model's vocabulary size. Check your model's `vocab_size` and your tokenizer's vocabulary.
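
As an illustrative sanity check (the names and values below are placeholders, not part of the example), you can verify that every token id in a batch is strictly smaller than the model's `vocab_size` before it reaches the embedding layer:

```python
import torch

vocab_size = 256  # value from the config below; use your model's actual vocab_size
input_ids = torch.randint(0, vocab_size, (2, 256))  # stand-in for a batch from your dataloader

max_id = int(input_ids.max())
assert max_id < vocab_size, f"token id {max_id} is out of range for vocab_size={vocab_size}"
```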
@@ -0,0 +1,103 @@
checkpoints:
  checkpoint_interval: 10
  checkpoints_path: checkpoints
  checkpoints_path_is_shared_file_system: false
  resume_checkpoint_path: null
  save_initial_state: false
data_stages:
- data:
    dataset: null # Custom dataloader will be used
    num_loading_workers: 1
    seed: 42
  name: Stable Training Stage
  start_training_step: 1
- data:
    dataset:
      dataset_overwrite_cache: false
      dataset_processing_num_proc_per_process: 1
      hf_dataset_config_name: null
      hf_dataset_or_datasets: stas/openwebtext-10k
      hf_dataset_splits: train
      text_column_name: text
    num_loading_workers: 1
    seed: 42
  name: Annealing Phase
  start_training_step: 10
general:
  benchmark_csv_path: null
  consumed_train_samples: null
  ignore_sanity_checks: true
  project: debug
  run: tiny_llama_%date_%jobid
  seed: 42
  step: null
lighteval: null
logging:
  iteration_step_info_interval: 1
  log_level: info
  log_level_replica: info
model:
  ddp_bucket_cap_mb: 25
  dtype: bfloat16
  init_method:
    std: 0.025
  make_vocab_size_divisible_by: 1
  model_config:
    bos_token_id: 1
    eos_token_id: 2
    hidden_act: silu
    hidden_size: 16
    initializer_range: 0.02
    intermediate_size: 64
    is_llama_config: true
    max_position_embeddings: 256
    num_attention_heads: 4
    num_hidden_layers: 2
    num_key_value_heads: 4
    pad_token_id: null
    pretraining_tp: 1
    rms_norm_eps: 1.0e-05
    rope_scaling: null
    tie_word_embeddings: true
    use_cache: true
    vocab_size: 256
optimizer:
  accumulate_grad_in_fp32: true
  clip_grad: 1.0
  learning_rate_scheduler:
    learning_rate: 0.0003
    lr_decay_starting_step: null
    lr_decay_steps: 13
    lr_decay_style: cosine
    lr_warmup_steps: 2
    lr_warmup_style: linear
    min_decay_lr: 1.0e-05
  optimizer_factory:
    adam_beta1: 0.9
    adam_beta2: 0.95
    adam_eps: 1.0e-08
    name: adamW
    torch_adam_is_fused: true
  weight_decay: 0.01
  zero_stage: 0
parallelism:
  dp: 2
  expert_parallel_size: 1
  pp: 1
  pp_engine: 1f1b
  tp: 1
  tp_linear_async_communication: true
  tp_mode: REDUCE_SCATTER
profiler: null
tokenizer:
  tokenizer_max_length: null
  tokenizer_name_or_path: robot-test/dummy-tokenizer-wordlevel
  tokenizer_revision: null
tokens:
  batch_accumulation_per_replica: 1
  limit_test_batches: 0
  limit_val_batches: 0
  micro_batch_size: 2
  sequence_length: 256
  train_steps: 15
  val_check_interval: -1