ranking algorithm: position unbiased option #4531
Conversation
robert-howley-zocdoc commented on Aug 18, 2021 (edited)
- adds support for the position unbiased adjustments described in the Unbiased LambdaMART paper (a rough sketch of the adjustment follows this list)
- this methodology attempts to correct for position bias in the result set
- the implementation assumes queries are fed into training in the order in which they appeared
- note for fellow practitioners: you'll often see lower NDCG@1 but higher NDCG at deeper cutoffs relative to standard (no bias correction) training; this isn't an unreasonable outcome of reducing position bias
- this PR is loosely inspired by the author's implementation here
- warning: interpreting that repo can be difficult because some parameters were either implicitly or explicitly fixed; happy to talk in more detail about the implications of the fixed sigmoid in that repo and how it translates
- created a branch position_unbiased_notes that has more detailed comments on how the formulas of the paper line up with the code changes
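A rough sketch of the adjustment (paraphrasing the paper and this PR's code, not quoting either exactly): each pair's lambda is divided by estimated position-bias factors, which are themselves re-estimated from accumulated costs after every iteration:

$$
\lambda_{ij} \leftarrow \frac{\lambda_{ij}}{t_i^{+}\, t_j^{-}},
\qquad
t_i^{+} = \left(\frac{C_i^{+}}{C_0^{+}}\right)^{\frac{1}{1+p}},
\qquad
t_j^{-} = \left(\frac{C_j^{-}}{C_0^{-}}\right)^{\frac{1}{1+p}}
$$

Here $t_i^{+}$ and $t_j^{-}$ are the estimated click and non-click propensity ratios at positions $i$ and $j$, $C_i^{+}$ and $C_j^{-}$ are the debiased costs accumulated at those positions, and $p$ is the bias regularization strength (`bias_p_norm` in this PR); the exponent $1/(1+p)$ is the `position_bias_regularizer` that appears in the code reviewed below.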
Commits:
- chore: ignore swig jni file
- run tests on pr to position_debias
- add unbiased config params
- feat: rough migration of acbull's source
- chore: give better name to lambdarank bias regularizer
- remove unused param position_bins_, replaced by truncation_level
btw sorry for the long lapse. getting (internal) approval on the CLA took a bit and then i was on vacation for 2+ weeks. back and will be responsive.
No problem at all, we understand! If you're able to submit a separate PR with some of the recommended small changes, that would be welcome. I also want to set the expectation that I personally am probably not qualified to review whether the submitted code is algorithmically correct, and that we'll need a review from @guolinke or @shiyu1994 or @btrotta for that. Thanks again for working with us on this great addition!
It's been about two months since this PR had any activity. @shiyu1994 @guolinke @btrotta could one of you please provide a review? I'd like to bring this to a resolution before the 4.0 release. @robert-howley-zocdoc could you update this branch to the latest master?

What I tried:

```
git clone \
  [email protected]:Zocdoc/LightGBM.git \
  zocdoc-lgb
cd zocdoc-lgb
git remote add upstream [email protected]:microsoft/LightGBM.git
git pull upstream master
git fetch origin position_unbiased
git checkout position_unbiased
git merge master
git push origin position_unbiased
```

The push failed with git's "Please make sure you have the correct access rights" error.
Sorry for the late response. Will review this soon.
Hi @robert-howley-zocdoc, could you clarify this assumption a bit?

I was trying to determine how the document position enters the calculation. For a query/document group (see the illustration below), does the assumption mean that the record with idx=0 appeared first in the results, idx=1 second, etc.? So it assumes the input dataset was created with presentation order in mind, correct? Otherwise we'd need to include another column indicating position in the calculation.
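For illustration (hypothetical values; the assumption is that position equals row order within the query):

```
query  idx  label   assumed position
q1     0    1       0   (shown first)
q1     1    0       1   (shown second)
q1     2    0       2   (shown third)
```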
Hi @shiyu1994, based on the comment above and on my limited testing of this PR, reading through the paper and code it looks correct to me. I'd still like to run a couple of tests with the same datasets as in the paper to see how close this gets to the results there.

One thing I wanted to raise: after this gets merged, one extension I'd like to work on is allowing the position to be a separate column that is available to the optimization process. The reason I'm suggesting this is that in real-world data you often have missing rows/data and other logging errors that would break the contiguity assumption of this implementation, leading to incorrect position biases being updated. I wanted to get your early opinion on this potential contribution. Position data could be added as a metadata attribute, similar to query boundaries.
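To make that concrete, a hypothetical sketch of how the objective could read such an attribute (`positions()` and `record_positions_` are illustrative assumptions here, not existing LightGBM API):

```cpp
// hypothetical sketch: read per-record presentation positions from metadata,
// analogous to how query boundaries are provided today
void LambdarankNDCG::Init(const Metadata& metadata, data_size_t num_data) {
  RankingObjective::Init(metadata, num_data);
  // positions() is an assumed accessor returning one presentation
  // position per record; record_positions_ is an assumed member
  const data_size_t* positions = metadata.positions();
  if (positions != nullptr) {
    record_positions_.assign(positions, positions + num_data);
  }
}
```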
Correct on all 3. I use …
Hi @robhowley, one thing I'm noticing in the debug logs is that while the position biases monotonically decrease for most positions, the cutoff position (truncation level) has a large bias; I even see it go above 1 for …. Another thing I'm observing is that the ….
```cpp
int debias_high_rank = static_cast<int>(std::min(high, truncation_level_ - 1));
int debias_low_rank = static_cast<int>(std::min(low, truncation_level_ - 1));
```
Hi, I'm trying to investigate the issue I'm observing with the biases at the truncation cutoff position, and I'm wondering if the thresholding here is the culprit.

Within the context of this implementation, the case where `debias_high_rank` gets the value `truncation_level_ - 1` is when `high > truncation_level_ - 1`, i.e. `sorted_idx[high_rank] > truncation_level_ - 1`, i.e. `sorted_idx[j] > truncation_level_ - 1`.

Correct me if I'm wrong, but the only case where that happens is when `j` is the `high_rank`, and `j` starts from `i+1`, which can take a value greater than `truncation_level_`, right?

I was wondering if it's better to check the truncation level for both the high and low ranks and only update the position biases when both are below `truncation_level_`, instead of thresholding their values.

As I'm working on the update to introduce the position column as a metadata value, what I'm thinking is along the lines of:
```cpp
const data_size_t high = sorted_idx[high_rank];
const data_size_t low = sorted_idx[low_rank];

data_size_t high_position, low_position;
if (unbiased_) {
  // record_positions_ is a vector of length data_size, holding the position
  // of each record in the data; start is the offset of the current query
  high_position = record_positions_[start + sorted_idx[high_rank]];
  low_position = record_positions_[start + sorted_idx[low_rank]];
} else {
  high_position = high;
  low_position = low;
}

if (unbiased_) {
  double p_cost = log(1.0f / (1.0f - p_lambda)) * delta_pair_NDCG;
  // more relevant (clicked) gets debiased by less relevant (unclicked),
  // but only if both positions are within the truncation level
  if (high_position <= truncation_level_ && low_position <= truncation_level_) {
    i_costs_buffer_[tid][high_position] += p_cost / j_biases_pow_[low_position];
    j_costs_buffer_[tid][low_position] += p_cost / i_biases_pow_[high_position];  // and vice versa
  }
}

// by default, values of 1.0 are no-ops
double i_bias_pow = 1.0;
double j_bias_pow = 1.0;
// only use the actual bias values if both positions are within the truncation limits
if (unbiased_ && high_position <= truncation_level_ && low_position <= truncation_level_) {
  i_bias_pow = i_biases_pow_[high_position];
  j_bias_pow = j_biases_pow_[low_position];
}
// update, either with the 1.0 no-op values (if at least one of the data points
// fell outside the truncation threshold) or with the actual biases
p_lambda *= -sigmoid_ * delta_pair_NDCG / i_bias_pow / j_bias_pow;
p_hessian *= sigmoid_ * sigmoid_ * delta_pair_NDCG / i_bias_pow / j_bias_pow;
```
Does this make sense? The main suggestion is to not clamp the bias positions to `[0, truncation_level_ - 1]`, but rather to skip the bias updates when one of the document-pair items falls outside the truncation level (making all bias positions valid).

The difference between `high`/`low` and `high_rank`/`low_rank` is still a bit unclear to me. IIRC we sort the records in a query by their score, and to establish the high/low rank we compare their label values.

What I want to ensure is that, since my `record_positions_` will hold the original data order, the extracted position corresponds to the expected record, which I think is what we accomplish with `record_positions_[start + sorted_idx[high_rank]]`.
@thvasilo Thanks for reviewing this.

What confuses me here is that the index into the cost buffer is the truncation of the `high` and `low` variables. That doesn't make sense to me, because `high` and `low` are just document indices. According to equations (28)~(31) in the Unbiased LambdaMART paper, the truncation should be applied to the current rank assigned by the model, not to the document index, since with the truncation level we are truncating the loss in equation (28) to consider only the top-ranked i's.
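Paraphrasing those equations rather than quoting them (the notation here is approximate): with truncation level $k$, the unbiased loss only accumulates pair terms for documents the model currently ranks in the top $k$,

$$
\mathcal{L} \approx \sum_{(i,j)\,:\,\pi(i) \le k} \frac{\Delta \mathrm{NDCG}_{ij}}{t_{\pi(i)}^{+}\, t_{\pi(j)}^{-}} \,\log\!\left(1 + e^{-\sigma (s_i - s_j)}\right)
$$

where $\pi(\cdot)$ is the rank currently assigned by the model and $s_i$, $s_j$ are the model scores. The bias factors are therefore indexed by current ranks, not by document indices.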
I believe `static_cast<int>(std::min(high, truncation_level_ - 1))` and `static_cast<int>(std::min(low, truncation_level_ - 1))` here should be `static_cast<int>(std::min(high_rank, truncation_level_ - 1))` and `static_cast<int>(std::min(low_rank, truncation_level_ - 1))` instead.
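That is, a sketch of the suggested one-line corrections:

```cpp
// truncate on the model-assigned ranks, not on the raw document indices
int debias_high_rank = static_cast<int>(std::min(high_rank, truncation_level_ - 1));
int debias_low_rank = static_cast<int>(std::min(low_rank, truncation_level_ - 1));
```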
@robert-howley-zocdoc Thanks for the great contribution. Please kindly check my comments. The implementation is OK overall, except for some minor mistakes and inconsistencies with the Unbiased LambdaMART paper.
```diff
@@ -199,15 +222,26 @@ class LambdarankNDCG : public RankingObjective {
       // get delta NDCG
       double delta_pair_NDCG = dcg_gap * paired_discount * inverse_max_dcg;
       // regular the delta_pair_NDCG by score distance
-      if (norm_ && best_score != worst_score) {
+      if ((norm_ || unbiased_) && best_score != worst_score) {
```
It is unclear to me why with unbiased LambdaMART we must normalize the delta score here.
```diff
       // update
-      p_lambda *= -sigmoid_ * delta_pair_NDCG;
-      p_hessian *= sigmoid_ * sigmoid_ * delta_pair_NDCG;
+      p_lambda *= -sigmoid_ * delta_pair_NDCG / i_biases_pow_[debias_high_rank] / j_biases_pow_[debias_low_rank];
```
Suppose we are now before starting iteration k + 1; then `i_biases_pow_` and `j_biases_pow_` would have been calculated based on the prediction results up to iteration k - 1. Only the newly updated `i_biases_pow_` and `j_biases_pow_` in `UpdatePositionBiasesAndGradients`, which runs after `GetGradients`, consider the prediction values up to iteration k. This is inconsistent with Algorithm 1 in the paper, though I don't think this should affect the performance much.
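A pseudocode sketch of the ordering being described (the loop body is illustrative, not verbatim source):

```cpp
// per-boosting-iteration ordering in this PR: the lambdas computed in
// GetGradients use bias estimates that lag one iteration behind the scores
for (int iter = 1; iter <= num_iterations; ++iter) {
  // scores come from the model after iter - 1 trees
  GetGradients(scores, gradients, hessians);
  // ^ divides the lambdas by i_biases_pow_ / j_biases_pow_, which were
  //   refreshed in the previous iteration (scores after iter - 2 trees)
  UpdatePositionBiasesAndGradients();
  // ^ only now re-estimates the biases from the costs just accumulated
  //   (scores after iter - 1 trees)
  TrainOneTree(gradients, hessians);  // the model now has iter trees
}
```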
```cpp
  j_biases_pow_[i] = pow(j_costs_[i] / j_costs_[0], position_bias_regularizer);
}

LogDebugPositionBiases();
```
Yes. Please remove this, or wrap the debug code with the `#ifdef DEBUG` macro.
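For example:

```cpp
#ifdef DEBUG
  LogDebugPositionBiases();
#endif
```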
```cpp
double bias_p_norm_;

/*! \brief Position bias regularizer exponent, 1 / (1 + bias_p_norm_) */
double position_bias_regularizer;
```
We'd better add an underscore at the end of the member's name to be consistent with the rest of the code base, i.e. `position_bias_regularizer_`.
Hi @shiyu1994, if @robert-howley-zocdoc doesn't have the time to make the requested changes, I could make the final few changes on top of @robert-howley-zocdoc's great work to push this through. Then I can create a follow-up PR with my extension that enables a custom position column.
@thvasilo Sounds great!
@thvasilo Thanks! That would be great! Do you have permission to push to this ongoing branch? If not, let's figure out how to make that possible for you.
yea, sorry for not being on top of this. i don't work at zocdoc anymore, which makes allocating time to this a little tricky. feel free to take over as needed.
I'm picking this up. But it seems that I don't have permission to push to Zocdoc/LightGBM, so I'll merge this into a branch in our repo and open a new PR.
Thanks @shiyu1994, yeah, I agree with that approach! Thanks @robhowley for getting this work to this point, and for agreeing to let us pick it up and carry it forward. Really, really appreciate it!
Closing this and moving to #5393.
This pull request has been automatically locked since there has not been any recent activity since it was closed.