Loss device for ORPOTrainer #18
Hello @ganeshkrishnan1, could you try loading the model on the CPU first before passing it to the ORPOTrainer, by removing the `device_map` argument?
Accelerate usually allocates the model and the loss to the appropriate GPUs automatically, so let me know if loading the model on the CPU first resolves it.
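A minimal sketch of that suggestion, assuming the Llama 3 model discussed later in the thread:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model on CPU: no device_map, so nothing is pre-sharded across GPUs.
# When launched with `accelerate launch`, Accelerate places the model itself.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",  # assumption: the model discussed in the thread
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
# ...then pass `model` to ORPOTrainer without any device_map argument.
```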
If I remove the `device_map` then only one GPU is used and I get a device out-of-memory error. If I add `device_map` then I get this error: "ValueError: Calculated loss must be on the original device: cuda:0 but device in use is cuda:3". I also tried replacing the ORPO trainer with the DPO trainer, and it worked without any issues.
Hi @ganeshkrishnan1, @jiwooya1000, I just faced the same situation and raised an issue in trl here; any suggestions to fix this error?
I'm facing the same issue, anyone know how to fix the error?
@blaze7451 @huangxinping there is no known fix for now. We have reverted to DPO and might revisit this later or try to fix it ourselves.
Hello, @nlee-208 and I are currently using alignment-handbook and TRL too, but we have not been able to reproduce the issue so far. Could you specify which accelerate setting you are using, @ganeshkrishnan1 @huangxinping @blaze7451 (e.g., FSDP, DS2, --multi-gpu)?
I am using FSDP.
@jiwooya1000 I am just using ORPOTrainer and following a blog tutorial, "Fine-tune Llama 3 with ORPO". If I set up only one GPU to fine-tune Llama 3, it trains successfully.
I simply used ORPOTrainer and didn't set any specific accelerate config. As in @huangxinping's situation, I tried training on a single GPU and it works for me.
Hey @alvarobartt, do you perhaps have any solution/similar experience using the ORPOTrainer from trl? Seems like there are some issues with the device mapping from either bnb or peft for the trl ORPOTrainer.
Thanks for the ping @nlee-208! Indeed AFAIK …
Also, could you guys @huangxinping @ganeshkrishnan1 @blaze7451 clarify what the issue is? Does removing the `device_map` make a difference?
Could you guys try using the FSDP configuration at https://github.com/huggingface/alignment-handbook/blob/main/recipes/accelerate_configs/fsdp.yaml, and run it via `accelerate launch` as described below?
So to replicate Maxime's script via the alignment-handbook, you can use the following `config.yaml`:

```yaml
# Model arguments
model_name_or_path: meta-llama/Meta-Llama-3-8B
torch_dtype: bfloat16
use_flash_attention_2: true
# LoRA arguments
use_peft: true
load_in_4bit: true
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
- q_proj
- k_proj
- v_proj
- o_proj
- gate_proj
- up_proj
- down_proj
# Data training arguments
dataset_mixer:
mlabonne/orpo-dpo-mix-40k: 0.1
dataset_splits:
- train
preprocessing_num_workers: 12
# ORPOTrainer arguments
beta: 0.1
do_eval: true
evaluation_strategy: steps
eval_steps: 0.2
gradient_accumulation_steps: 4
gradient_checkpointing: true
gradient_checkpointing_kwargs:
use_reentrant: false
hub_model_id: llama-3-orpo-qlora
learning_rate: 8.0e-6
log_level: info
logging_steps: 1
lr_scheduler_type: linear
max_length: 1024
max_prompt_length: 512
num_train_epochs: 1
optim: paged_adamw_8bit
output_dir: results/
per_device_train_batch_size: 2
per_device_eval_batch_size: 2
seed: 42
warmup_ratio: 0.1
warmup_steps: 10
```

Then the following FSDP configuration file (tweaking the `num_processes` to match your GPU count):

```yaml
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
fsdp_backward_prefetch: BACKWARD_PRE
fsdp_cpu_ram_efficient_loading: true
fsdp_forward_prefetch: true
fsdp_offload_params: false
fsdp_sharding_strategy: FULL_SHARD
fsdp_state_dict_type: SHARDED_STATE_DICT
fsdp_sync_module_states: true
fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
```

And then run that as:

```shell
ACCELERATE_LOG_LEVEL=info TRANSFORMERS_VERBOSITY=info accelerate launch --config_file fsdp.yaml scripts/run_orpo.py config.yaml
```

Otherwise, if you prefer to use custom code, you can look at … Hope that helped you in the meantime 👍🏻
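As a rough illustration of the custom-code route, here is a standalone sketch (not the example the comment pointed to; the dataset formatting and the `tokenizer=` argument reflect trl versions current at the time, and model/dataset names are taken from the thread):

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer, setup_chat_format

model_id = "meta-llama/Meta-Llama-3-8B"  # as used elsewhere in the thread
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(model_id)
model, tokenizer = setup_chat_format(model, tokenizer)  # adds a chat template to the base model

# The dataset stores chats as message lists; flatten them into plain-text
# columns so the trainer sees "prompt"/"chosen"/"rejected" strings.
dataset = load_dataset("mlabonne/orpo-dpo-mix-40k", split="train[:1%]")

def format_row(row):
    row["chosen"] = tokenizer.apply_chat_template(row["chosen"], tokenize=False)
    row["rejected"] = tokenizer.apply_chat_template(row["rejected"], tokenize=False)
    return row

dataset = dataset.map(format_row)

args = ORPOConfig(
    output_dir="results/",
    beta=0.1,
    max_length=1024,
    max_prompt_length=512,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=8e-6,
)

trainer = ORPOTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    tokenizer=tokenizer,  # trl releases from this period took `tokenizer=`
)
trainer.train()
```

Run under `accelerate launch` with the FSDP config above so each process handles its own shard.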
Thanks, I got it working, though there is the bug with bfloat16 on Ampere devices. I created a pull request to fix it.
Hello, I'd like to know whether this configuration triggers the following warning when loading the model: "You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with 'model.to('cuda')'."
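For context, that warning is emitted when loading with Flash Attention 2 while the weights are still on CPU. A sketch of the load path that produces it, assuming the model and dtype from the config above:

```python
import torch
from transformers import AutoModelForCausalLM

# Loading with Flash Attention 2 but without a device_map keeps the weights on
# CPU at first, which is exactly what emits that warning. Under `accelerate
# launch` with FSDP the weights are moved to the GPUs afterwards, so the
# warning at load time does not by itself mean training stays on CPU.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
)
```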
I hit upon an error in Hugging Face for which there are strangely zero Google search results:

"ValueError: Calculated loss must be on the original device"

I can see the source of this error in the Hugging Face trainer.py file. The full error is:

"ValueError: Calculated loss must be on the original device: cuda:0 but device in use is cuda:3"
This happens when I use multiple GPUs with accelerate, with this code:
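A minimal sketch of the kind of load being described (the model id is a hypothetical stand-in, not the original snippet):

```python
from transformers import AutoModelForCausalLM

# device_map="auto" shards the model across all visible GPUs, so the final
# layers (and thus the loss) can land on e.g. cuda:3 while the Trainer
# expects the loss on cuda:0 -- the mismatch in the error above.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",  # hypothetical; the original snippet is not shown
    device_map="auto",
)
```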
I can set the device map to a specific GPU to avoid this, but one GPU doesn't have enough memory to support our ORPO training:
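A sketch of that single-GPU pinning workaround (again with a hypothetical model id):

```python
from transformers import AutoModelForCausalLM

# Pinning everything to a single GPU sidesteps the loss-device check, but the
# whole model must then fit on that one card (hence the OOM below).
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",  # hypothetical model id
    device_map={"": 0},
)
```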
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 28.00 MiB. GPU 0 has a total capacity of 14.58 GiB of which 14.50 MiB is free. yadda yadda
This is specific to ORPO, as I have no issues with PEFT fine-tuning and a multi-GPU setup.