model.gradient_checkpointing_enable() makes loss.requires_grad be False #35826

ZCWei51 opened this issue Jan 22, 2025 · 8 comments · May be fixed by huggingface/peft#2398

ZCWei51 commented Jan 22, 2025

System Info

Python 3.9.19
transformers 4.42.0
torch 2.2.2+cu118
peft 0.12.0

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

When I tried using model.gradient_checkpointing_enable() to reduce memory consumption during training, I encountered the error "RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn". After troubleshooting, I found that the issue seems to be caused by loss.requires_grad being False, which prevents backpropagation. The following code reproduces the problem and prints loss.requires_grad directly:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "4"
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import get_peft_model, LoraConfig, TaskType

def main():
    train_data = {"input": "input test", "output": "output test"}
    model_name = "/workspace/model/CodeLlama-13b-Instruct-hf"
    output_dir = "./test_debug"
    
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    model.config.pad_token_id = model.config.eos_token_id

    input_ids = tokenizer.encode(train_data["input"])
    output_ids = tokenizer.encode(train_data["output"])
    model_inputs_output = input_ids + output_ids + [tokenizer.eos_token_id]
    model_inputs_output = torch.tensor(model_inputs_output, dtype=torch.int64)
    labels = copy.deepcopy(model_inputs_output)
    labels[: len(input_ids)] = -1  # mark prompt tokens; they become -100 (ignored by the loss) below
    example_mask = model_inputs_output.ge(0)
    label_mask = labels.ge(0)
    model_inputs_output[~example_mask] = 0
    labels[~label_mask] = -100
    train_dataset = {
            "input_ids": model_inputs_output.unsqueeze(0).to("cuda"),
            "attention_mask": example_mask.unsqueeze(0).to("cuda"),
            "labels": labels.unsqueeze(0).to("cuda")
        }

    lora_config = LoraConfig(
            r=8,
            lora_alpha=16,
            target_modules=["q_proj", "gate_proj", "v_proj", "o_proj", "up_proj", "k_proj", "down_proj"],  # same as LLaMA-Factory
            lora_dropout=0.05,
            task_type=TaskType.CAUSAL_LM,
        )
    model = get_peft_model(model, lora_config)
    model.gradient_checkpointing_enable()
    model.train()    
    model.print_trainable_parameters()
    model.to("cuda")

    output = model(**train_dataset)
    loss = output["loss"]
    print(f"loss: {loss.requires_grad}")


if __name__ == "__main__":
    main()

The output is:

loss: False

This is confusing, because model.gradient_checkpointing_enable() is meant to reduce memory consumption, yet leaving loss.requires_grad as False breaks the normal training process. Meanwhile, when I use the equivalent code from LLaMA-Factory to achieve the effect of model.gradient_checkpointing_enable(), loss.requires_grad is True. Below is the code:

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "4"
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import get_peft_model, LoraConfig, TaskType
import copy
from types import MethodType
from functools import partial
import inspect
from typing import TYPE_CHECKING, Any, Dict, Optional, Tuple
from transformers import PreTrainedModel

def _gradient_checkpointing_enable(
    self: "PreTrainedModel", gradient_checkpointing_kwargs: Optional[Dict[str, Any]] = None
) -> None:
    r"""
    Activates gradient checkpointing for the current model.

    Modification of the original method to enable gradient checkpointing for block-wise optimizer.
    """
    from torch.utils.checkpoint import checkpoint

    if not self.supports_gradient_checkpointing:
        raise ValueError("{} does not support gradient checkpointing.".format(self.__class__.__name__))

    if gradient_checkpointing_kwargs is None:
        gradient_checkpointing_kwargs = {"use_reentrant": True}

    gradient_checkpointing_func = partial(checkpoint, **gradient_checkpointing_kwargs)

    def custom_gradient_checkpointing_func(func, *args, **kwargs):
        module: "torch.nn.Module" = func.__self__

        if any(param.requires_grad for param in module.parameters()):
            for arg in args:
                if torch.is_tensor(arg) and torch.is_floating_point(arg):
                    arg.requires_grad_(True)

        return gradient_checkpointing_func(func, *args, **kwargs)

    if "value" in inspect.signature(self._set_gradient_checkpointing).parameters:  # old GC format
        self.apply(partial(self._set_gradient_checkpointing, value=True))
        self.enable_input_require_grads()
        print("You are using the old GC format, some features (e.g. BAdam) will be invalid.")
    else:  # have already enabled input require gradients
        self._set_gradient_checkpointing(enable=True, gradient_checkpointing_func=custom_gradient_checkpointing_func)


def main():
    train_data = {"input": "input test", "output": "output test"}
    model_name = "/workspace/model/CodeLlama-13b-Instruct-hf"
    output_dir = "./test_debug"
    
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    # set the pad token of the model's configuration
    model.config.pad_token_id = model.config.eos_token_id
    # return 
    if not getattr(model, "supports_gradient_checkpointing", False):
        print("Current model does not support gradient checkpointing.")
    else:
        # use_reentrant=False might increase VRAM usage (have not been empirically verified yet)
        # According to: https://github.com/huggingface/transformers/issues/28339
        model.gradient_checkpointing_enable = MethodType(_gradient_checkpointing_enable, model)
        model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": True})
        setattr(model.config, "use_cache", False)  # turn off when gradient checkpointing is enabled
        print("Gradient checkpointing enabled.")

    input_ids = tokenizer.encode(train_data["input"])
    output_ids = tokenizer.encode(train_data["output"])
    model_inputs_output = input_ids + output_ids + [tokenizer.eos_token_id]
    model_inputs_output = torch.tensor(model_inputs_output, dtype=torch.int64)
    labels = copy.deepcopy(model_inputs_output)
    labels[: len(input_ids)] = -1  # mark prompt tokens; they become -100 (ignored by the loss) below
    example_mask = model_inputs_output.ge(0)
    label_mask = labels.ge(0)
    model_inputs_output[~example_mask] = 0
    labels[~label_mask] = -100
    train_dataset = {
            "input_ids": model_inputs_output.unsqueeze(0).to("cuda"),
            "attention_mask": example_mask.unsqueeze(0).to("cuda"),
            "labels": labels.unsqueeze(0).to("cuda")
        }

    lora_config = LoraConfig(
            r=8,
            lora_alpha=16,
            target_modules=["q_proj", "gate_proj", "v_proj", "o_proj", "up_proj", "k_proj", "down_proj"],  # same as LLaMA-Factory
            lora_dropout=0.05,
            task_type=TaskType.CAUSAL_LM,
        )
    model = get_peft_model(model, lora_config)
    # model.gradient_checkpointing_enable()
    model.train()    
    model.print_trainable_parameters()
    model.to("cuda")

    output = model(**train_dataset)
    loss = output["loss"]
    print(f"loss: {loss.requires_grad}")


if __name__ == "__main__":
    main()

The output is:

loss: True

Expected behavior

I am not entirely sure whether this is a bug in the implementation of model.gradient_checkpointing_enable(). If it is not, please feel free to close the issue and let me know. Thank you for taking the time to look into this :)

ZCWei51 added the bug label Jan 22, 2025
Rocketknight1 (Member) commented:
cc @muellerzr @SunMarc

The stale bot commented:

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

ZCWei51 (Author) commented Feb 23, 2025

> This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
>
> Please note that issues that do not follow the contributing guidelines are likely to be ignored.

Hi 👋,

Just wanted to follow up on this issue. I understand you might be busy, but I'm curious if there's any update or if there's additional information I can provide to help resolve it? 🙏

Thanks for your time!

SunMarc (Member) commented Feb 24, 2025

Thanks for the report! I'll have a look soon, but could you try to check whether the problem comes from PEFT? You can use a smaller model for debugging if you don't have enough RAM.

ZCWei51 (Author) commented Feb 25, 2025

> Thanks for the report! I'll have a look soon, but could you try to check whether the problem comes from PEFT? You can use a smaller model for debugging if you don't have enough RAM.

Thanks for your reply! 😄
As you suggested, after I commented out the PEFT parts of the code, loss.requires_grad is back to True.

import os
os.environ["CUDA_VISIBLE_DEVICES"] = "4"
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import get_peft_model, LoraConfig, TaskType
import copy
from types import MethodType
from functools import partial
import inspect
from typing import TYPE_CHECKING, Any, Dict, Optional, Tuple
from transformers import PreTrainedModel


def main():
    train_data = {"input": "input test", "output": "output test"}
    model_name = "/workspace/model/CodeLlama-13b-Instruct-hf"
    output_dir = "./test_debug"
    
    model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    tokenizer.pad_token = tokenizer.eos_token
    # set the pad token of the model's configuration
    model.config.pad_token_id = model.config.eos_token_id
    # return 
    # if not getattr(model, "supports_gradient_checkpointing", False):
    #     print("Current model does not support gradient checkpointing.")
    # else:
    #     # use_reentrant=False might increase VRAM usage (have not been empirically verified yet)
    #     # According to: https://github.com/huggingface/transformers/issues/28339
    #     model.gradient_checkpointing_enable = MethodType(_gradient_checkpointing_enable, model)
    #     model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={"use_reentrant": True})
    #     setattr(model.config, "use_cache", False)  # turn off when gradient checkpointing is enabled
    #     print("Gradient checkpointing enabled.")

    input_ids = tokenizer.encode(train_data["input"])
    output_ids = tokenizer.encode(train_data["output"])
    model_inputs_output = input_ids + output_ids + [tokenizer.eos_token_id]
    model_inputs_output = torch.tensor(model_inputs_output, dtype=torch.int64)
    labels = copy.deepcopy(model_inputs_output)
    labels[: len(input_ids)] = -1  # mark prompt tokens; they become -100 (ignored by the loss) below
    example_mask = model_inputs_output.ge(0)
    label_mask = labels.ge(0)
    model_inputs_output[~example_mask] = 0
    labels[~label_mask] = -100
    train_dataset = {
            "input_ids": model_inputs_output.unsqueeze(0).to("cuda"),
            "attention_mask": example_mask.unsqueeze(0).to("cuda"),
            "labels": labels.unsqueeze(0).to("cuda")
        }

    # lora_config = LoraConfig(
    #         r=8,  
    #         lora_alpha=16,  
    #         target_modules=["q_proj", "gate_proj", "v_proj", "o_proj", "up_proj", "k_proj", "down_proj"],  # same as LLaMA-Factory
    #         lora_dropout=0.05,  
    #         task_type= TaskType.CAUSAL_LM  
    #     )
    # model = get_peft_model(model, lora_config)
    model.gradient_checkpointing_enable()
    model.train()    
    # model.print_trainable_parameters()
    model.to("cuda")

    output = model(**train_dataset)
    loss = output["loss"]
    print(f"loss: {loss.requires_grad}")


if __name__ == "__main__":
    main()

Under the current circumstances:

When not using PEFT, both the stock model.gradient_checkpointing_enable() method and the custom _gradient_checkpointing_enable function yield loss.requires_grad == True ✅.

However, when PEFT is used, only the custom _gradient_checkpointing_enable function preserves loss.requires_grad == True ✅, whereas calling the stock model.gradient_checkpointing_enable() makes loss.requires_grad become False ❌.

Is this expected behavior, or a bug caused by a conflict between the two code paths?
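
If it helps narrow things down, calling enable_input_require_grads() after the stock gradient_checkpointing_enable() (the same method the LLaMA-Factory code above uses in its old-GC branch) also seems to keep loss.requires_grad True. A minimal sketch, with gpt2 only as a small stand-in for the real model, and assuming the PEFT wrapper forwards enable_input_require_grads() to the underlying transformers model:

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import get_peft_model, LoraConfig, TaskType

model = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in for the real model
tokenizer = AutoTokenizer.from_pretrained("gpt2")

model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type=TaskType.CAUSAL_LM))
model.gradient_checkpointing_enable()
# Make the embedding outputs require grad so every checkpointed block sees at
# least one input with requires_grad=True (forwarded to the base model by PEFT).
model.enable_input_require_grads()
model.train()

batch = tokenizer("input test output test", return_tensors="pt")
output = model(**batch, labels=batch["input_ids"])
print(output.loss.requires_grad)  # expected: True

In effect this does once, at the embedding layer, what custom_gradient_checkpointing_func does per checkpointed call.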

ZCWei51 closed this as completed Feb 25, 2025

ZCWei51 (Author) commented Feb 25, 2025


Sorry for closing the issue, that was a mis-touch 😭

I compared the source of the stock model.gradient_checkpointing_enable() with the custom _gradient_checkpointing_enable above, and the difference comes down to the call to self._set_gradient_checkpointing.

In model.gradient_checkpointing_enable() the call is
self._set_gradient_checkpointing(enable=True, gradient_checkpointing_func=gradient_checkpointing_func)

while in the code above it is
self._set_gradient_checkpointing(enable=True, gradient_checkpointing_func=custom_gradient_checkpointing_func)

so the gradient_checkpointing_func argument is a different function in the two cases.

That is as far as my debugging has gotten; I'm not sure why this difference leads to the behavior above.
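
My best guess so far: with use_reentrant=True, torch.utils.checkpoint only keeps the output attached to the autograd graph if at least one tensor input requires grad; the module's own trainable parameters are not enough, because the recomputation happens inside an autograd.Function that only sees the explicit inputs. A minimal torch-only sketch (no transformers or peft involved) that seems to show the same symptom:

import torch
from torch.utils.checkpoint import checkpoint

layer = torch.nn.Linear(4, 4)  # the layer's own weights are trainable
x = torch.randn(2, 4)          # but the input does not require grad (like frozen embeddings under LoRA)

out = checkpoint(layer, x, use_reentrant=True)
print(out.requires_grad)       # False: the checkpointed output is cut off from the graph

x.requires_grad_(True)         # this is what custom_gradient_checkpointing_func does to each tensor arg
out = checkpoint(layer, x, use_reentrant=True)
print(out.requires_grad)       # True: gradients can flow back through the block again

If that is right, the PEFT case breaks because the hidden states entering each checkpointed layer come from frozen modules and never require grad, unless something (enable_input_require_grads() or the custom function) forces them to.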

Thanks for your time! 👍

SunMarc (Member) commented Feb 25, 2025

Thanks for all the details, I found the issue you are experiencing. The problem occurs when you enable gradient checkpointing after creating the PEFT model. As a temporary fix, enable it before calling get_peft_model. This PR will fix your issue: huggingface/peft#2398
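
In code, the temporary fix is just a question of ordering, roughly like this (a sketch; gpt2 stands in for your actual model):

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import get_peft_model, LoraConfig, TaskType

model = AutoModelForCausalLM.from_pretrained("gpt2")  # stand-in for the real model
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Temporary fix: enable gradient checkpointing on the base model BEFORE wrapping
# it with PEFT (presumably get_peft_model then sets up its own input-grad handling).
model.gradient_checkpointing_enable()
model = get_peft_model(model, LoraConfig(r=8, lora_alpha=16, task_type=TaskType.CAUSAL_LM))
model.train()

batch = tokenizer("input test output test", return_tensors="pt")
output = model(**batch, labels=batch["input_ids"])
print(output.loss.requires_grad)  # expected: True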

The stale bot commented:

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.
