Error occurred when I train the model #9
I tested on another 8-GPU device and hit the same error.
When I train with:

```shell
export CUDA_VISIBLE_DEVICES=0
export WANDB_PROJECT=consistency_llm

model_path="/mnt/bn/multimodel/models/official/cllm/GAIR--Abel-7B-001/model"
trajectory_file="data/collected_jacobi_trajectory/my_cleaned_gsm8k_jacobi_max_new_tokens16_augTrue_labels_True_max_seq_len_512.json"
output_path="./output_baseline"
n_token_seq_size=512

torchrun --nnodes=1 --nproc_per_node=1 --rdzv_id=101 --rdzv_endpoint='localhost:5666' \
    --master_port 10000 \
    cllm/train_cllm_global.py \
    --target_model_path ${model_path} \
    --data_path ${trajectory_file} \
    --output_dir ${output_path} \
    --max_new_tokens ${n_token_seq_size} \
    --bf16 True \
    --tf32 True \
    --report_to wandb \
    --do_train \
    --num_train_epochs 1 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 1 \
    --gradient_checkpointing True \
    --evaluation_strategy "epoch" \
    --save_strategy "steps" \
    --save_steps 100 \
    --save_total_limit 50 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 10 \
    --model_max_length 2048 \
    --lazy_preprocess True \
    --fsdp "full_shard auto_wrap" \
    --fsdp_transformer_layer_cls_to_wrap 'LlamaDecoderLayer'
```

Error info:
Hi, thank you for your interest in our work! I checked your bash script commands. We have also updated the example training script accordingly. Note that the provided Jacobi trajectory file reads:
For the OOM issue, please use more than one A100 80G GPU :)
Thanks for your reply. I tried 4-card training with
Dealing with multi-sample training per batch would require some modifications to the Jacobi trajectory preparation script, as well as minor modifications to the data preprocessing in
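The kind of preprocessing change mentioned above can be sketched as follows: to put more than one trajectory sample in a batch, variable-length token sequences must be padded to a common length with a matching attention mask. This is a minimal illustrative sketch, not code from the CLLM repository; the names `PAD_ID` and `collate_trajectories` are hypothetical.

```python
# Hypothetical collate sketch for batching variable-length Jacobi
# trajectories. PAD_ID and the function name are illustrative only.
PAD_ID = 0

def collate_trajectories(trajectories):
    """Right-pad a list of token-id lists to the batch's max length.

    Returns (input_ids, attention_mask), where the mask is 1 for real
    tokens and 0 for padding positions.
    """
    max_len = max(len(t) for t in trajectories)
    input_ids = [t + [PAD_ID] * (max_len - len(t)) for t in trajectories]
    attention_mask = [
        [1] * len(t) + [0] * (max_len - len(t)) for t in trajectories
    ]
    return input_ids, attention_mask
```

In practice the padded lists would be converted to tensors and the loss masked out on padding positions, but the padding-plus-mask step is the core change needed to move beyond `per_device_train_batch_size 1`.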
Hi, thanks for your great work! I want to reproduce the training process, but some errors occurred as follows. Could you please help take a look? Thanks!
Training script (I only have 4xA100, so the node count is changed to 4 in
train_cllm.sh
). The errors are as follows.