
Will you release the code? #2

Open · zhanghaoie opened this issue Mar 12, 2024 · 17 comments
@zhanghaoie

No description provided.

@TobiasLee

TobiasLee commented Mar 26, 2024

my implementation based on the DPOTrainer in trl for your reference:

    # Note: assumes the module-level imports used by trl (`import torch`,
    # `import torch.nn.functional as F`) and a `self.dpop_lambda` hyperparameter.
    def dpo_loss(
        self,
        policy_chosen_logps: torch.FloatTensor,
        policy_rejected_logps: torch.FloatTensor,
        reference_chosen_logps: torch.FloatTensor,
        reference_rejected_logps: torch.FloatTensor,
        reference_free: bool = False,
    ):
        pi_logratios = policy_chosen_logps - policy_rejected_logps
        ref_logratios = reference_chosen_logps - reference_rejected_logps

        if reference_free:
            ref_logratios = 0

        logits = pi_logratios - ref_logratios

        # DPOP regularization: positive when the policy's chosen log-prob has
        # dropped below the reference's, zero otherwise.
        positive_reg = reference_chosen_logps - policy_chosen_logps

        losses = -(
            F.logsigmoid(self.beta * logits)
            - self.dpop_lambda * torch.clamp(positive_reg, min=0)  # lambda * max(0, ratio)
        )
        chosen_rewards = (
            self.beta * (policy_chosen_logps - reference_chosen_logps).detach()
        )
        rejected_rewards = (
            self.beta * (policy_rejected_logps - reference_rejected_logps).detach()
        )

        return losses, chosen_rewards, rejected_rewards

Update after Discussion

Thanks for the discussion, @tenggyut @LuJunru @FeiWang96.
A mistake was identified in my previous implementation, and I have now corrected it above.
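
For completeness, a minimal sketch of one way to plug this loss into trl by subclassing DPOTrainer (assumptions: a trl release from around early 2024 whose DPOTrainer defines dpo_loss with exactly this signature and sets self.beta; dpop_lambda is a name introduced here, not a trl argument):

    import torch
    import torch.nn.functional as F
    from trl import DPOTrainer


    class DPOPTrainer(DPOTrainer):
        dpop_lambda = 50.0  # penalty weight discussed in this thread; treat it as tunable

        def dpo_loss(self, policy_chosen_logps, policy_rejected_logps,
                     reference_chosen_logps, reference_rejected_logps,
                     reference_free=False):
            pi_logratios = policy_chosen_logps - policy_rejected_logps
            ref_logratios = reference_chosen_logps - reference_rejected_logps
            if reference_free:
                ref_logratios = 0
            logits = pi_logratios - ref_logratios
            positive_reg = reference_chosen_logps - policy_chosen_logps
            # adds lambda * max(0, reference - policy) to the loss, penalizing drops
            # in the chosen response's log-probability
            losses = -(F.logsigmoid(self.beta * logits)
                       - self.dpop_lambda * torch.clamp(positive_reg, min=0))
            chosen_rewards = self.beta * (policy_chosen_logps - reference_chosen_logps).detach()
            rejected_rewards = self.beta * (policy_rejected_logps - reference_rejected_logps).detach()
            return losses, chosen_rewards, rejected_rewards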

@FeiWang96
> my implementation based on the DPOTrainer in trl for your reference:
>
>     losses = -F.logsigmoid(self.beta * logits) - self.dpop_lambda * torch.clamp(
>         positive_reg, min=0 # lambda * max(0, ratio)
>     )

Is it + instead of - according to the paper?

    losses = -F.logsigmoid(self.beta * logits) + self.dpop_lambda * torch.clamp(positive_reg, min=0)

@TobiasLee

Notice that in the paper the ratio is defined as $\log \frac{p_{\text{win}}}{p_{\text{ref}}}$. Here my positive_reg = reference_chosen_logps - policy_chosen_logps, which is $\log \frac{p_{\text{ref}}}{p_{\text{win}}}$, so the sign is - to make it consistent with the paper.

@LuJunru

LuJunru commented Apr 2, 2024

> Notice that in the paper the ratio is defined as $\log \frac{p_{\text{win}}}{p_{\text{ref}}}$ [...] so the sign is - to make it consistent with the paper.

losses = -(
    F.logsigmoid(self.beta * logits)
    - self.dpop_lambda * torch.relu(reference_chosen_logps - policy_chosen_logps)
)

This is my implementation. By the way, I think there are differences between equation 3 and further equations in the appendix.

@tenggyut

tenggyut commented Apr 3, 2024

> Notice that in the paper the ratio is defined as $\log \frac{p_{\text{win}}}{p_{\text{ref}}}$ [...] so the sign is - to make it consistent with the paper.

I think it should be +. self.dpop_lambda * torch.relu(reference_chosen_logps - policy_chosen_logps) is always non-negative, so how can it be a penalty term if it only ever leaves the original loss unchanged or makes it smaller? Besides, the idea of the paper is to keep the chosen response close to the reference model's behavior; that is, policy_chosen_logps should not fall below reference_chosen_logps, and the model should be penalized when it does.

So I think it's a mistake in the paper. And if lambda is set to 50, the loss will be way too high; I think that may be another mistake.

@LuJunru

LuJunru commented Apr 3, 2024

> I think it should be +. [...] So I think it's a mistake in the paper. And if lambda is set to 50, the loss will be way too high; I think that may be another mistake.

I actually trained with my implementation, and I can share my observations just for reference. Compared with what I reproduced with the original DPO (results), the DPOP loss was dramatically large at the beginning, but dropped to a normal level within several steps. However, the final results of DPOP were around 3% behind those of DPO.

Update: I ran a new experiment with the official loss released by the author below (#2 (comment)). The results of DPOP were better than the previous version, but still 1% behind DPO.

@tenggyut

tenggyut commented Apr 3, 2024

> I actually trained with my implementation, and I can share my observations just for reference. [...] However, the final results of DPOP were around 3% behind those of DPO.

Have you actually tried generating text with your trained DPOP model? Because if - is used, I find it hard to understand...

@LuJunru

LuJunru commented Apr 3, 2024

> Have you actually tried generating text with your trained DPOP model?

I included 3 generative benchmarks: GSM8K, BBH, and IFEval. By the way, the lambda I used was exactly 50.

@tenggyut

tenggyut commented Apr 3, 2024

> F.logsigmoid(self.beta * logits)

My mistake, I didn't mean your implementation; I meant @FeiWang96's. Yours has a - outside the bracket, which is exactly my point.

@LuJunru

LuJunru commented Apr 3, 2024

I put the - inside the brackets, so it is equivalent to +.

@arkapal3
Contributor

arkapal3 commented Apr 24, 2024

Hi there - author of DPOP here. Thanks for taking an interest in our paper, and apologies for taking a while to notice this issue.

First, I see that there is actually a bracketing error in our paper which resulted in some confusion. In eqn (3), the penalty term, lambda * max(0, reference_chosen_logps - policy_chosen_logps), should actually be inside the logsigmoid.

So following the TRL loss implementation:

    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = reference_chosen_logps - reference_rejected_logps  # as in the TRL DPOTrainer
    logits = pi_logratios - ref_logratios
    penalty_term = torch.maximum(torch.zeros_like(policy_chosen_logps), reference_chosen_logps - policy_chosen_logps)
    logits += -self.dpop_lambda * penalty_term  # ("lambda" alone is a reserved word in Python)

    losses = (
        -F.logsigmoid(self.beta * logits) * (1 - self.label_smoothing)
        - F.logsigmoid(-self.beta * logits) * self.label_smoothing
    )

The idea here is that when policy_chosen_logps is less than reference_chosen_logps, this penalty is a positive value, and with the negative sign we are then trying to reduce this penalty.

We will update the manuscript to fix the bracketing.
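
As a quick illustration of the mechanics, here is a small self-contained sanity check of the form above (dummy log-probs and illustrative hyperparameter values only; a sketch, not the training code):

    import torch
    import torch.nn.functional as F

    beta, lam, label_smoothing = 0.1, 50.0, 0.0

    # two toy examples: in the first, the policy's chosen log-prob has dropped
    # below the reference's; in the second, it has not
    policy_chosen_logps = torch.tensor([-12.0, -9.0])
    policy_rejected_logps = torch.tensor([-14.0, -13.0])
    reference_chosen_logps = torch.tensor([-10.0, -10.0])
    reference_rejected_logps = torch.tensor([-13.0, -12.0])

    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = reference_chosen_logps - reference_rejected_logps
    logits = pi_logratios - ref_logratios

    # penalty is positive only for the first example, zero for the second
    penalty_term = torch.maximum(torch.zeros_like(policy_chosen_logps),
                                 reference_chosen_logps - policy_chosen_logps)
    logits = logits - lam * penalty_term

    losses = (
        -F.logsigmoid(beta * logits) * (1 - label_smoothing)
        - F.logsigmoid(-beta * logits) * label_smoothing
    )
    print(penalty_term)  # tensor([2., 0.])
    print(losses)        # the first example is heavily penalized; the second behaves like plain DPO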

@LuJunru

LuJunru commented Apr 25, 2024

> logits += -self.dpop_lambda * penalty_term

Thank you for providing the official implementation! If my understanding is correct, compared with the original eqn (3) it effectively moves the penalty_term inside the logsigmoid(beta * ...) term. I'll run an updated experiment to see whether it works.
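
For reference, writing z = pi_logratios - ref_logratios and r = lambda * max(0, reference_chosen_logps - policy_chosen_logps), the two variants discussed in this thread are (my paraphrase, not the paper's exact notation):

$$\mathcal{L}_{\text{penalty outside}} = -\log\sigma(\beta z) + r \qquad \text{vs.} \qquad \mathcal{L}_{\text{penalty inside, official}} = -\log\sigma\big(\beta\,(z - r)\big)$$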

@arkapal3
Contributor

arkapal3 commented Apr 25, 2024

Yes, good spot; that is another minor error in our manuscript 👍
We will fix that as well. This may have some repercussions for the reported values of lambda, though we have since run tests over a wider range of lambda in [5, 500] and found the results to be largely insensitive to its value (at least within this range).

@RikkiXu

RikkiXu commented Apr 28, 2024

> First I see that there is actually a bracketing error in our paper which resulted in some confusion. In eqn (3), the penalty term should actually be inside the logsigmoid. [...] We will update the manuscript to fix the bracketing.

Hi~ I tried the new code, but unfortunately both the rejected and chosen rewards increased, and the final result on MT-Bench was much lower than DPO. My base model is Mistral-7B, and the preference dataset is HuggingFaceH4/ultrafeedback_binarized (I set lambda=50).

@spttt

spttt commented Jun 21, 2024

> We will update the manuscript to fix the bracketing.

Is there any new update?

@crwhite14
Collaborator

> Is there any new update?

Yes, the paper updated on July 3 has the bracketing fixed: https://arxiv.org/pdf/2402.13228

@MilchstraB

>> First I see that there is actually a bracketing error in our paper which resulted in some confusion. In eqn (3), the penalty term should actually be inside the logsigmoid. [...] We will update the manuscript to fix the bracketing.
>
> Hi~ I tried the new code, but unfortunately both the rejected and chosen rewards increased, and the final result on MT-Bench was much lower than DPO. My base model is Mistral-7B, and the preference dataset is HuggingFaceH4/ultrafeedback_binarized (I set lambda=50).

Same here. My code is

logits -= self.args.dpop_lambda * \
    torch.maximum((ref_chosen_logps - chosen_logps), torch.tensor(0.0).to(device))
losses = (
    -F.logsigmoid(self.beta * logits) * (1 - self.label_smoothing)
    - F.logsigmoid(-self.beta * logits) * self.label_smoothing
)
