Loss device for ORPOTrainer #18

Open
ganeshkrishnan1 opened this issue Apr 12, 2024 · 16 comments

@ganeshkrishnan1

I hit an error in Hugging Face for which there are, strangely, zero Google search results:

"ValueError: Calculated loss must be on the original device". I can see the source of this error in the Hugging Face trainer.py file.

The full error is "ValueError: Calculated loss must be on the original device: cuda:0 but device in use is cuda:3".

This happens when I use multiple GPUs via accelerate with this code:

model_name = "aihello/podcast"
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch_dtype, quantization_config=bnb_config, device_map="auto"
)

I can set the device map to a specific GPU to avoid this, but a single GPU doesn't have enough memory for our ORPO training:

model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch_dtype, quantization_config=bnb_config, device_map={"": 0}, attn_implementation=attn_implementation
)

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 28.00 MiB. GPU 0 has a total capacity of 14.58 GiB of which 14.50 MiB is free. yadda yadda

orpo_config = ORPOConfig(
    output_dir="./output/",
    evaluation_strategy="steps",
    do_eval=True,
    optim="paged_adamw_8bit",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    per_device_eval_batch_size=1,
    log_level="debug",
    logging_steps=20,
    learning_rate=8e-6,
    eval_steps=20,
    num_train_epochs=3,
    # max_steps=9000,
    save_steps=20,
    save_strategy='epoch',
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    beta=0.1, #beta is ORPO's lambda
    max_length=1024,
)

trainer = ORPOTrainer(
        model=model,
        train_dataset=dataset[0],
        eval_dataset=dataset[1],
        peft_config=peft_config,
        args=orpo_config,
        tokenizer=tokenizer,
)

trainer.train()

This is specific to ORPO, as I have no issues with PEFT fine-tuning in a multi-GPU setup.

@jiwooya1000
Contributor

Hello @ganeshkrishnan1,

Could you try loading the model on the CPU first, before passing it to the ORPOTrainer, by removing device_map='auto'?

model_name = "aihello/podcast"
model = AutoModelForCausalLM.from_pretrained(
          model_name, torch_dtype=torch_dtype, quantization_config=bnb_config
)

Accelerate usually allocates the model and the loss to the appropriate GPUs automatically, so let me know whether loading the model on the CPU first resolves it.

@ganeshkrishnan1
Author

If I remove device_map, only one GPU is used and I get an out-of-memory error. If I add device_map, I get this error: "ValueError: Calculated loss must be on the original device: cuda:0 but device in use is cuda:3"

I also tried replacing the ORPOTrainer with the DPOTrainer, and it worked without any issues.

@blaze7451

Hi @ganeshkrishnan1, @jiwooya1000, I just faced the same situation and raised an issue in trl here. Any suggestions to fix this error?

@huangxinping

I'm facing the same issue. Does anyone know how to fix the error?

@ganeshkrishnan1
Author

@blaze7451 @huangxinping There is no known fix for now. We have reverted to DPO and might revisit this later or try to fix it ourselves.

@jiwooya1000
Contributor

Hello, @nlee-208 and I are also currently using alignment-handbook and TRL, but we have not been able to reproduce the issue so far. Could you specify which accelerate setting you are using, @ganeshkrishnan1 @huangxinping @blaze7451 (e.g., FSDP, DS2, --multi-gpu)?

@ganeshkrishnan1
Author

I am using FSDP:

from accelerate import Accelerator, FullyShardedDataParallelPlugin
from torch.distributed.fsdp.fully_sharded_data_parallel import FullOptimStateDictConfig, FullStateDictConfig

# Shard the model with FSDP; gather full, CPU-offloaded state dicts on every rank when saving.
fsdp_plugin = FullyShardedDataParallelPlugin(
    state_dict_config=FullStateDictConfig(offload_to_cpu=True, rank0_only=False),
    optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=True, rank0_only=False),
)
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)
device = accelerator.device

@huangxinping

huangxinping commented Apr 24, 2024

@jiwooya1000 I am just using ORPOTrainer, following the blog tutorial Fine-tune Llama 3 with ORPO.

If I use only one GPU to fine-tune Llama 3, training completes successfully.

@blaze7451

blaze7451 commented Apr 24, 2024

I simply used ORPOTrainer and didn't set any specific accelerate config. Just like @huangxinping's situation, training on a single GPU works for me.

@nlee-208
Contributor

Hey @alvarobartt, do you perhaps have any solution/similar experience using the ORPOTrainer from trl? Seems like there are some issues with the device mapping from either bnb or peft for the trl ORPOTrainer.

@alvarobartt
Contributor

alvarobartt commented Apr 24, 2024

Thanks for the ping @nlee-208! AFAIK, device_map is only intended for inference; beyond that, accelerate handles device placement, and so does the alignment-handbook, so setting device_map for training is simply not correct AFAIK.

See https://github.com/huggingface/alignment-handbook/blob/70769f9e9ba41c7f08ba6c4ff3725441b68b7ca3/src/alignment/model_utils.py#L33C1-L35C85
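
For reference, here is a rough sketch of the handbook's device-map helper referenced there (paraphrased from memory, so the exact code may differ): when a quantized model is loaded under a multi-process launch, its weights are pinned to each process's own local GPU rather than spread with device_map="auto".

# Paraphrased sketch of the alignment-handbook helper linked above (not verbatim):
# each process places the quantized weights on its own local GPU instead of
# relying on device_map="auto".
import torch
from accelerate import Accelerator

def get_kbit_device_map() -> dict | None:
    """Device map for k-bit loading: pin all weights to this process's GPU."""
    if torch.cuda.is_available():
        return {"": Accelerator().local_process_index}
    return None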

@alvarobartt
Contributor

alvarobartt commented Apr 24, 2024

Also, could you guys @huangxinping @ganeshkrishnan1 @blaze7451 clarify what the issue is? Does removing device_map work in a multi-GPU environment, or does it also fail? And how many processes are you using? If you could share the accelerate config for multi-GPU, that would be great to help debug this.

@alvarobartt
Contributor

Could you try using the FSDP configuration at https://github.com/huggingface/alignment-handbook/blob/main/recipes/accelerate_configs/fsdp.yaml, running it as accelerate launch --config_file fsdp.yaml your_script.py with num_processes set to the number of GPUs you want to use, and also removing device_map="auto" from AutoModelForCausalLM.from_pretrained?

@alvarobartt
Contributor

So to replicate Maxime's script via the alignment-handbook you should use the following configuration, say config.yaml:

# Model arguments
model_name_or_path: meta-llama/Meta-Llama-3-8B
torch_dtype: bfloat16
use_flash_attention_2: true

# LoRA arguments
use_peft: true
load_in_4bit: true
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
- q_proj
- k_proj
- v_proj
- o_proj
- gate_proj
- up_proj
- down_proj

# Data training arguments
dataset_mixer:
  mlabonne/orpo-dpo-mix-40k: 0.1
dataset_splits:
- train
preprocessing_num_workers: 12

# ORPOTrainer arguments
beta: 0.1
do_eval: true
evaluation_strategy: steps
eval_steps: 0.2
gradient_accumulation_steps: 4
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
hub_model_id: llama-3-orpo-qlora
learning_rate: 8.0e-6
log_level: info
logging_steps: 1
lr_scheduler_type: linear 
max_length: 1024
max_prompt_length: 512
num_train_epochs: 1
optim: paged_adamw_8bit
output_dir: results/
per_device_train_batch_size: 2
per_device_eval_batch_size: 2
seed: 42
warmup_ratio: 0.1
warmup_steps: 10

Then the following FSDP configuration file (tweaking the num_processes), say fsdp.yaml:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: true
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

And then run that as:

ACCELERATE_LOG_LEVEL=info TRANSFORMERS_VERBOSITY=info accelerate launch --config_file fsdp.yaml scripts/run_orpo.py config.yaml

Otherwise, if you prefer to use custom code, you can look at run_orpo.py for reference on how to properly initialize accelerate to use multiple GPUs for fine-tuning.

Hope that helped you in the meantime 👍🏻

@ganeshkrishnan1
Author

Thanks, I got it working, though there is the bug with bfloat16 on Ampere devices. I created a pull request to fix it.
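
In case it helps others, here is a minimal dtype guard (my own sketch, not the fix from the pull request) that only requests bfloat16 when the GPU reports native support:

import torch

# Fall back to float16 on GPUs without native bfloat16 support
# (bfloat16 needs compute capability >= 8.0, i.e. Ampere or newer).
if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    torch_dtype = torch.bfloat16
else:
    torch_dtype = torch.float16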

@hitszxs

hitszxs commented Jun 26, 2024

(Quotes @alvarobartt's configuration and launch instructions above.)

Hi, I'd like to know whether this configuration produces the following warning when loading the model: "You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with model.to('cuda')."
