Loss device for ORPOTrainer #18

Open
ganeshkrishnan1 opened this issue Apr 12, 2024 · 16 comments

@ganeshkrishnan1

I hit an error in Hugging Face for which there are, strangely, zero Google search results:

"ValueError: Calculated loss must be on the original device". I can see the source of this error in the Hugging Face trainer.py file.

The full error is "ValueError: Calculated loss must be on the original device: cuda:0 but device in use is cuda:3".

This happens when I use multiple GPUs via accelerate with this code:

model_name = "aihello/podcast"
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch_dtype, quantization_config=bnb_config, device_map="auto"
)

I can set the device map to a specific GPU to avoid this, but a single GPU doesn't have enough memory for our ORPO training:

model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch_dtype, quantization_config=bnb_config, device_map={"": 0}, attn_implementation=attn_implementation
)

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 28.00 MiB. GPU 0 has a total capacity of 14.58 GiB of which 14.50 MiB is free. yadda yadda

orpo_config = ORPOConfig(
    output_dir="./output/",
    evaluation_strategy="steps",
    do_eval=True,
    optim="paged_adamw_8bit",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=4,
    per_device_eval_batch_size=1,
    log_level="debug",
    logging_steps=20,
    learning_rate=8e-6,
    eval_steps=20,
    num_train_epochs=3,
    # max_steps=9000,
    save_steps=20,
    save_strategy='epoch',
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    beta=0.1, #beta is ORPO's lambda
    max_length=1024,
)

trainer = ORPOTrainer(
        model=model,
        train_dataset=dataset[0],
        eval_dataset=dataset[1],
        peft_config=peft_config,
        args=orpo_config,
        tokenizer=tokenizer,
)

trainer.train()

This is specific to ORPO, as I have no issues with PEFT fine-tuning in a multi-GPU setup.

@jiwooya1000
Contributor

Hello @ganeshkrishnan1,

Could you try loading the model on the CPU first, before passing it to the ORPOTrainer, by removing device_map='auto'?

model_name = "aihello/podcast"
model = AutoModelForCausalLM.from_pretrained(
          model_name, torch_dtype=torch_dtype, quantization_config=bnb_config
)

Accelerate usually allocates the model and the loss to the appropriate GPUs automatically, so let me know whether loading the model on the CPU first resolves it.

@ganeshkrishnan1
Author

If I remove device_map, only one GPU is used and I get an out-of-memory error. If I add device_map, I get this error: "ValueError: Calculated loss must be on the original device: cuda:0 but device in use is cuda:3"

I also tried replacing the ORPOTrainer with the DPOTrainer, and it worked without any issues.

@blaze7451

Hi @ganeshkrishnan1, @jiwooya1000, I just faced the same situation and raised an issue in trl here. Any suggestions to fix this error?

@huangxinping

I'm facing the same issue. Does anyone know how to fix the error?

@ganeshkrishnan1
Author

@blaze7451 @huangxinping There is no known fix for now. We have reverted to DPO and might revisit this later or try to fix it ourselves.

@jiwooya1000
Contributor

Hello, @nlee-208 and I are also currently using alignment-handbook and TRL, but we have not been able to reproduce the issue so far. Could you specify which accelerate setting you are using, @ganeshkrishnan1 @huangxinping @blaze7451 (e.g., FSDP, DS2, --multi-gpu)?

@ganeshkrishnan1
Author

I am using FSDP:

from accelerate import Accelerator, FullyShardedDataParallelPlugin
from torch.distributed.fsdp.fully_sharded_data_parallel import FullOptimStateDictConfig, FullStateDictConfig

# Shard the model with FSDP; gather full, CPU-offloaded state dicts on every rank when saving.
fsdp_plugin = FullyShardedDataParallelPlugin(
    state_dict_config=FullStateDictConfig(offload_to_cpu=True, rank0_only=False),
    optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=True, rank0_only=False),
)
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)
device = accelerator.device

@huangxinping

huangxinping commented Apr 24, 2024

@jiwooya1000 I am just using ORPOTrainer, following the blog tutorial Fine-tune Llama 3 with ORPO.

If I use only one GPU to fine-tune Llama 3, training completes successfully.

@blaze7451

blaze7451 commented Apr 24, 2024

I simply used ORPOTrainer and didn't set any specific accelerate config. Just like @huangxinping's situation, training on a single GPU works for me.

@nlee-208
Contributor

Hey @alvarobartt, do you perhaps have any solution/similar experience using the ORPOTrainer from trl? Seems like there are some issues with the device mapping from either bnb or peft for the trl ORPOTrainer.

@alvarobartt
Contributor

alvarobartt commented Apr 24, 2024

Thanks for the ping @nlee-208! AFAIK, device_map is only intended for inference; beyond that, accelerate handles device placement, and so does the alignment-handbook, so setting device_map for training is simply not correct AFAIK.

See https://github.com/huggingface/alignment-handbook/blob/70769f9e9ba41c7f08ba6c4ff3725441b68b7ca3/src/alignment/model_utils.py#L33C1-L35C85
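
For reference, here is a rough sketch of the handbook's device-map helper referenced there (paraphrased from memory, so the exact code may differ): when a quantized model is loaded under a multi-process launch, its weights are pinned to each process's own local GPU rather than spread with device_map="auto".

# Paraphrased sketch of the alignment-handbook helper linked above (not verbatim):
# each process places the quantized weights on its own local GPU instead of
# relying on device_map="auto".
import torch
from accelerate import Accelerator

def get_kbit_device_map() -> dict | None:
    """Device map for k-bit loading: pin all weights to this process's GPU."""
    if torch.cuda.is_available():
        return {"": Accelerator().local_process_index}
    return None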

@alvarobartt
Contributor

alvarobartt commented Apr 24, 2024

Also, could you guys @huangxinping @ganeshkrishnan1 @blaze7451 clarify what the issue is? Does removing device_map work in a multi-GPU environment, or does it also fail? And how many processes are you using? If you could share the accelerate config for multi-GPU, that would be great to help debug this.

@alvarobartt
Contributor

Could you try using the FSDP configuration at https://github.com/huggingface/alignment-handbook/blob/main/recipes/accelerate_configs/fsdp.yaml, running it as accelerate launch --config_file fsdp.yaml your_script.py with num_processes set to the number of GPUs you want to use, and also removing device_map="auto" from AutoModelForCausalLM.from_pretrained?

@alvarobartt
Contributor

So to replicate Maxime's script via the alignment-handbook you should use the following configuration, say config.yaml:

# Model arguments
model_name_or_path: meta-llama/Meta-Llama-3-8B
torch_dtype: bfloat16
use_flash_attention_2: true

# LoRA arguments
use_peft: true
load_in_4bit: true
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
- q_proj
- k_proj
- v_proj
- o_proj
- gate_proj
- up_proj
- down_proj

# Data training arguments
dataset_mixer:
  mlabonne/orpo-dpo-mix-40k: 0.1
dataset_splits:
- train
preprocessing_num_workers: 12

# ORPOTrainer arguments
beta: 0.1
do_eval: true
evaluation_strategy: steps
eval_steps: 0.2
gradient_accumulation_steps: 4
gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
hub_model_id: llama-3-orpo-qlora
learning_rate: 8.0e-6
log_level: info
logging_steps: 1
lr_scheduler_type: linear 
max_length: 1024
max_prompt_length: 512
num_train_epochs: 1
optim: paged_adamw_8bit
output_dir: results/
per_device_train_batch_size: 2
per_device_eval_batch_size: 2
seed: 42
warmup_ratio: 0.1
warmup_steps: 10

Then the following FSDP configuration file (tweaking the num_processes), say fsdp.yaml:

compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: true
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false

And then run that as:

ACCELERATE_LOG_LEVEL=info TRANSFORMERS_VERBOSITY=info accelerate launch --config_file fsdp.yaml scripts/run_orpo.py config.yaml

Otherwise, if you prefer to use custom code, you can look at run_orpo.py for reference on how to properly initialize accelerate to use multiple GPUs for fine-tuning.

Hope that helped you in the meantime 👍🏻

@ganeshkrishnan1
Author

Thanks, I got it working, though there is the bug with bfloat16 on Ampere devices. I created a pull request to fix it.
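
In case it helps others, here is a minimal dtype guard (my own sketch, not the fix from the pull request) that only requests bfloat16 when the GPU reports native support:

import torch

# Fall back to float16 on GPUs without native bfloat16 support
# (bfloat16 needs compute capability >= 8.0, i.e. Ampere or newer).
if torch.cuda.is_available() and torch.cuda.is_bf16_supported():
    torch_dtype = torch.bfloat16
else:
    torch_dtype = torch.float16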

@hitszxs

hitszxs commented Jun 26, 2024

(Quotes @alvarobartt's configuration and launch instructions above.)

Hi, I'd like to know whether this configuration produces the following warning when loading the model: "You are attempting to use Flash Attention 2.0 with a model not initialized on GPU. Make sure to move the model to GPU after initializing it on CPU with model.to('cuda')."
