Will you release the code? #2
Comments
my implementation based on the DPOTrainer in trl for your reference:

```python
import torch
import torch.nn.functional as F


def dpo_loss(
    self,
    policy_chosen_logps: torch.FloatTensor,
    policy_rejected_logps: torch.FloatTensor,
    reference_chosen_logps: torch.FloatTensor,
    reference_rejected_logps: torch.FloatTensor,
    reference_free: bool = False,
):
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = reference_chosen_logps - reference_rejected_logps
    if reference_free:
        ref_logratios = 0
    logits = pi_logratios - ref_logratios
    # add regularization here
    positive_reg = reference_chosen_logps - policy_chosen_logps
    losses = -(
        F.logsigmoid(self.beta * logits)
        - self.dpop_lambda * torch.clamp(positive_reg, min=0)  # lambda * max(0, ratio)
    )
    chosen_rewards = (
        self.beta * (policy_chosen_logps - reference_chosen_logps).detach()
    )
    rejected_rewards = (
        self.beta * (policy_rejected_logps - reference_rejected_logps).detach()
    )
    return losses, chosen_rewards, rejected_rewards
```

Update after Discussion: Thanks for your discussions @tenggyut @LuJunru @FeiWang96. |
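As an aside, a minimal usage sketch of the snippet above; the `cfg` stand-in for `self` and all tensor values here are made up for illustration (only lambda = 50 comes from this thread):

```python
import types

import torch

# hypothetical config object standing in for `self`; beta and dpop_lambda are assumed values
cfg = types.SimpleNamespace(beta=0.1, dpop_lambda=50.0)

# made-up sequence-level log-probs for a batch of two preference pairs
policy_chosen = torch.tensor([-12.0, -15.0])
policy_rejected = torch.tensor([-20.0, -18.0])
reference_chosen = torch.tensor([-11.0, -16.0])
reference_rejected = torch.tensor([-19.0, -18.5])

losses, chosen_rewards, rejected_rewards = dpo_loss(
    cfg, policy_chosen, policy_rejected, reference_chosen, reference_rejected
)
# first pair: the chosen log-prob sits 1 nat below the reference, so the lambda * 1 = 50 penalty dominates
print(losses)            # approximately tensor([50.6931, 0.6684])
print(chosen_rewards)    # beta * (policy_chosen - reference_chosen)
print(rejected_rewards)  # beta * (policy_rejected - reference_rejected)
```

Because the penalty sits outside the logsigmoid in this version, a 1-nat drop of the chosen log-prob below the reference already adds lambda * 1 = 50 to the loss, which is consistent with the very large early losses reported later in this thread.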
Is it [formula screenshot not preserved]? |
notice that in the paper the ratio is defined as: [formula screenshot not preserved] |
This is my implementation. By the way, I think there are differences between equation 3 and further equations in the appendix. |
I think it should be +, so I think it's a mistake in the paper. And if lambda is set to 50, the loss will be way too high; I think that may be another mistake. |
I actually trained with my implementation, and I can share my observations just for reference. Compared with what I reproduced with the original DPO (results), I found the loss of dpop was dramatically large at the beginning, but then dropped to a normal level after several steps. However, the final results of dpop were around 3% behind the results of dpo. Update: I ran a new experiment with the official loss released by the author below (#2 (comment)). The results of dpop were better than the previous version, but still 1% behind the results of dpo. |
have you actually tried text generation with your trained dpop model? because if using [rest of comment not preserved] |
I included 3 generative benchmarks: GSM8K, BBH and IFEval. By the way, the lambda I used was exactly 50. |
my mistake, I didn't mean your impl. It's @FeiWang96's. Yours has a [rest of comment not preserved] |
I put [rest of comment not preserved] |
Hi there - author of DPOP here. Thanks for taking an interest in our paper, and apologies for taking a while to notice this issue. First, I see that there is actually a bracketing error in our paper which resulted in some confusion. In eqn (3), the penalty term - lambda * max(0, reference_chosen_logps - policy_chosen_logps) - should actually be inside the logsigmoid. So following the TRL loss implementation:
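A sketch of how that corrected placement could look as a standalone function (a reconstruction from this description, not the author's verbatim code; the function name and default hyperparameter values are made up):

```python
import torch
import torch.nn.functional as F


def dpop_loss_sketch(
    policy_chosen_logps: torch.FloatTensor,
    policy_rejected_logps: torch.FloatTensor,
    reference_chosen_logps: torch.FloatTensor,
    reference_rejected_logps: torch.FloatTensor,
    beta: float = 0.1,           # assumed default
    dpop_lambda: float = 50.0,   # value discussed in this thread
) -> torch.FloatTensor:
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = reference_chosen_logps - reference_rejected_logps
    # DPOP penalty: positive only when the policy's log-prob on the chosen
    # answer falls below the reference model's log-prob on that answer
    penalty = torch.clamp(reference_chosen_logps - policy_chosen_logps, min=0)
    # key correction: the penalty is subtracted inside the logsigmoid,
    # so it is scaled by beta together with the log-ratios
    logits = pi_logratios - ref_logratios - dpop_lambda * penalty
    return -F.logsigmoid(beta * logits)
```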
The idea here is that when policy_chosen_logps is less than reference_chosen_logps, this penalty is a positive value, and with the negative sign we are then trying to reduce this penalty. We will update the manuscript to fix the bracketing. |
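Written out, the corrected per-example objective this describes is (a reconstruction from the comment above using standard DPO notation, not a quote from the manuscript):

$$
\mathcal{L}_{\mathrm{DPOP}} = -\log\sigma\!\Bigg(\beta\Bigg(\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)} - \lambda\cdot\max\!\Big(0,\ \log\frac{\pi_{\mathrm{ref}}(y_w\mid x)}{\pi_\theta(y_w\mid x)}\Big)\Bigg)\Bigg)
$$

Here $\log\frac{\pi_{\mathrm{ref}}(y_w\mid x)}{\pi_\theta(y_w\mid x)}$ corresponds exactly to reference_chosen_logps - policy_chosen_logps in the code above.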
Thank you for providing the official implementation! If my understanding is correct, it looks like further adding a [rest of comment not preserved] |
Yes good spot, that is another minor error in our manuscript 👍 |
Hi~ I tried the new code, but unfortunately both rejected and chosen rewards increased, and the final result on mt-bench was much lower than dpo. My base model is mistral-7b, and the preference dataset is HuggingFaceH4/ultrafeedback_binarized (I set lambda=50). |
Is there any new update? |
Yes, the paper updated on July 3 has the bracketing fixed: https://arxiv.org/pdf/2402.13228 |
Same here. My code is:

```python
# DPOP penalty folded into the logits before the sigmoid
logits -= self.args.dpop_lambda * torch.maximum(
    (ref_chosen_logps - chosen_logps), torch.tensor(0.0).to(device)
)
# label-smoothed DPO loss on the penalized logits
losses = (
    -F.logsigmoid(self.beta * logits) * (1 - self.label_smoothing)
    - F.logsigmoid(-self.beta * logits) * self.label_smoothing
)
```
|