
Error occurred when I train the model #9

Open

littletomatodonkey opened this issue May 11, 2024 · 5 comments

littletomatodonkey commented May 11, 2024

Hi, thanks for your great work! I want to reproduce the training process, but some errors occurred, as shown below. Could you please help take a look? Thanks!

Training script (I only have 4xA100, so the number of GPUs is changed to 4 in train_cllm.sh)

model_path="/mnt/bn/multimodel/models/official/cllm/cllm--vicuna-7b-sharegpt-gpt4-48k/model"
trajectory_file="data/collected_jacobi_trajectory/cleaned_gsm8k_jacobi_max_new_tokens16_augTrue_labels_True_max_seq_len_512.json"
output_path="./output_baseline"
n_token_seq_size=512

bash scripts/train_cllm.sh ${model_path} ${trajectory_file} ${output_path} ${n_token_seq_size}

The errors are as follows.

Traceback (most recent call last):
  File "/mnt/bn/multimodel/code/Consistency_LLM/cllm/train_cllm_global.py", line 289, in <module>
    train()
  File "/mnt/bn/multimodel/code/Consistency_LLM/cllm/train_cllm_global.py", line 281, in train
    trainer.train()
  File "/mnt/bn/multimodel/envs/miniconda3/envs/cllm/lib/python3.10/site-packages/transformers/trainer.py", line 1537, in train
    return inner_training_loop(
  File "/mnt/bn/multimodel/envs/miniconda3/envs/cllm/lib/python3.10/site-packages/transformers/trainer.py", line 1821, in _inner_training_loop
    for step, inputs in enumerate(epoch_iterator):
  File "/mnt/bn/multimodel/envs/miniconda3/envs/cllm/lib/python3.10/site-packages/accelerate/data_loader.py", line 448, in __iter__
    current_batch = next(dataloader_iter)
  File "/mnt/bn/multimodel/envs/miniconda3/envs/cllm/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 630, in __next__
    data = self._next_data()
  File "/mnt/bn/multimodel/envs/miniconda3/envs/cllm/lib/python3.10/site-packages/torch/utils/data/dataloader.py", line 674, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/mnt/bn/multimodel/envs/miniconda3/envs/cllm/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 54, in fetch
    return self.collate_fn(data)
  File "/mnt/bn/multimodel/envs/miniconda3/envs/cllm/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py", line 265, in default_collate
    return collate(batch, collate_fn_map=default_collate_fn_map)
  File "/mnt/bn/multimodel/envs/miniconda3/envs/cllm/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py", line 127, in collate
    return elem_type({key: collate([d[key] for d in batch], collate_fn_map=collate_fn_map) for key in elem})
  File "/mnt/bn/multimodel/envs/miniconda3/envs/cllm/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py", line 127, in <dictcomp>
    return elem_type({key: collate([d[key] for d in batch], collate_fn_map=collate_fn_map) for key in elem})
  File "/mnt/bn/multimodel/envs/miniconda3/envs/cllm/lib/python3.10/site-packages/torch/utils/data/_utils/collate.py", line 138, in collate
    raise RuntimeError('each element in list of batch should be of equal size')
RuntimeError: each element in list of batch should be of equal size
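
For reference, here is a minimal standalone snippet that reproduces this collate behavior: torch's default_collate raises this error whenever a list-valued field has different lengths across the entries in a batch. This is only an illustration; the field names are placeholders, not the actual dataset schema.

from torch.utils.data import DataLoader

# Hypothetical entries shaped like variable-length token-id lists.
samples = [
    {"input_ids": [1, 2, 3], "labels": [1, 2, 3]},
    {"input_ids": [4, 5], "labels": [4, 5]},  # shorter entry
]

# batch_size=2 runs default_collate across both entries; batch_size=1 never hits the check.
loader = DataLoader(samples, batch_size=2)
try:
    next(iter(loader))
except RuntimeError as e:
    print(e)  # each element in list of batch should be of equal size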
@littletomatodonkey (Author)

I tested on another 8-GPU machine and met the same error.

@littletomatodonkey (Author)

When I train with batch size = 1 on a single A100 (80G), it runs out of memory. Do I need to set other configs? Thanks!

export CUDA_VISIBLE_DEVICES=0
export WANDB_PROJECT=consistency_llm

model_path="/mnt/bn/multimodel/models/official/cllm/GAIR--Abel-7B-001/model"
trajectory_file="data/collected_jacobi_trajectory/my_cleaned_gsm8k_jacobi_max_new_tokens16_augTrue_labels_True_max_seq_len_512.json"

output_path="./output_baseline"
n_token_seq_size=512

torchrun --nnodes=1 --nproc_per_node=1 --rdzv_id=101 --rdzv_endpoint='localhost:5666' \
    --master_port 10000 \
    cllm/train_cllm_global.py \
    --target_model_path ${model_path} \
    --data_path ${trajectory_file} \
    --output_dir ${output_path} \
    --max_new_tokens ${n_token_seq_size} \
    --bf16 True \
    --tf32 True \
    --report_to wandb \
    --do_train \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --gradient_checkpointing True \
    --evaluation_strategy "epoch" \
    --save_strategy "steps" \
    --save_steps 100 \
    --save_total_limit 50 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 10 \
    --model_max_length 2048 \
    --lazy_preprocess True \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer'

Error info

  File "/mnt/bn/multimodel/envs/miniconda3/envs/cllm/lib/python3.10/site-packages/torch/optim/optimizer.py", line 76, in _use_grad
    ret = func(self, *args, **kwargs)
  File "/mnt/bn/multimodel/envs/miniconda3/envs/cllm/lib/python3.10/site-packages/torch/optim/adamw.py", line 173, in step
    self._init_group(
  File "/mnt/bn/multimodel/envs/miniconda3/envs/cllm/lib/python3.10/site-packages/torch/optim/adamw.py", line 125, in _init_group
    state["exp_avg_sq"] = torch.zeros_like(
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB. GPU 0 has a total capacty of 79.35 GiB of which 158.19 MiB is free. Process 2239837 has 79.19 GiB memory in use. Of the allocated memory 78.24 GiB is allocated by PyTorch, and 305.59 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

snyhlxde1 (Collaborator) commented May 13, 2024

Hi, thank you for your interest in our work!

I checked your bash script commands: n_token_seq_size should be set to 16. Note that n_token_seq_size is the sub-sequence length used for Jacobi iteration, while 512 is the max output sequence length used during the Jacobi trajectory collection process; the two arguments are different. Also, the prepared Jacobi dataset you downloaded is formatted to support batch size = 1 training only. For batch_size > 1, you need to either generate your own Jacobi dataset with batch size > 1 or do some data pre-processing on the dataset so it can be trained with batch size > 1.

We have also updated the example training script accordingly.

Notice that the provided Jacobi trajectory file reads:
cleaned_gsm8k_jacobi_max_new_tokens16_augTrue_labels_True_max_seq_len_512
which can be interpreted as:

  1. It has been post-processed to remove repetitive generation content, hence flagged as 'cleaned', and data augmentation is turned on (see the paper's data cleaning section and the Jacobi trajectory generation script).
  2. n_token_seq_size = 16 (max_new_tokens) was used during the Jacobi trajectory collection process.
  3. model_max_length = 512 (max_seq_len) was used during the Jacobi trajectory collection process.

For the OOM issue, please use more than 1 A100 80G GPU :)

@littletomatodonkey (Author)

Thanks for your reply! I tried 4-GPU training with n_token_seq_size=16 and it trains normally.
For larger batch-size training, I'll take a look. Would you consider providing a script to handle multi-sample training per batch? Thanks!

snyhlxde1 (Collaborator) commented May 13, 2024

Dealing with multi-sample training per batch would require some modifications to the Jacobi trajectory preparation script, as well as minor modifications to the data preprocessing in the cllm/train_cllm_global.py script, or post-processing of the current version of the Jacobi dataset so that each data entry can be collated into a batch (this requires removing redundant dimensionality from the collected token ids, etc.). Feel free to look into it, give it a try, and follow up on this thread; I would love to help out.
If there is enough interest, we will update the scripts accordingly to automate this process.
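
For illustration only, a generic padding-based collate function along these lines might look like the sketch below. This is not the repository's implementation: the field names, pad token id, and label ignore index are assumptions, and it simply squeezes a redundant leading dimension (if any) and pads every field to the longest sequence in the batch.

import torch

PAD_TOKEN_ID = 0       # assumption: replace with tokenizer.pad_token_id
IGNORE_INDEX = -100    # standard ignore value for CrossEntropyLoss label masking

def padding_collate_fn(batch):
    # Drop a redundant leading [1, seq_len] dimension if present, then pad.
    def to_1d(x):
        t = torch.as_tensor(x)
        return t.squeeze(0) if t.dim() > 1 else t

    input_ids = [to_1d(ex["input_ids"]) for ex in batch]
    labels = [to_1d(ex["labels"]) for ex in batch]

    input_ids = torch.nn.utils.rnn.pad_sequence(
        input_ids, batch_first=True, padding_value=PAD_TOKEN_ID)
    labels = torch.nn.utils.rnn.pad_sequence(
        labels, batch_first=True, padding_value=IGNORE_INDEX)
    # Note: assumes PAD_TOKEN_ID does not appear as a real token in the inputs.
    attention_mask = input_ids.ne(PAD_TOKEN_ID)
    return {"input_ids": input_ids, "labels": labels, "attention_mask": attention_mask}

A collate function like this could then be passed to the Hugging Face Trainer through its data_collator argument, assuming the dataset's __getitem__ returns fields with these names.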
