
ranking algorithm: position unbiased option #4531

Closed
wants to merge 39 commits

Conversation

@robert-howley-zocdoc commented Aug 18, 2021

  • adds support for the position unbiased adjustments described in the Unbiased LambdaMART paper
    • this methodology attempts to correct for position bias in the result set
    • the implementation assumes queries are fed into training in the order in which they appeared
  • note for fellow practitioners ...
    • you'll often see lower NDCG@1 but higher NDCG at deeper cutoffs relative to standard (no bias correction) training
    • this isn't an unreasonable outcome of reducing position bias
  • this PR is loosely inspired by the author's implementation here
    • warning: interpreting that repo can be difficult because some parameters were either implicitly or explicitly fixed
    • happy to talk in more detail about the implications of the fixed sigmoid in that repo and how it translates
  • created a branch position_unbiased_notes that has more detailed comments on how the formulas of the paper line up with the code changes (a rough sketch of the core correction follows this list)
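A rough sketch, in the paper's notation, of the correction as it shows up in the code changes; bias_p_norm plays the role of p, and i_biases_pow_ / j_biases_pow_ hold t_i^+ / t_j^-:

\lambda_{ij} \;\leftarrow\; \frac{\lambda_{ij}}{t_i^{+}\, t_j^{-}},
\qquad
t_i^{+} = \Big(\frac{C_i^{+}}{C_1^{+}}\Big)^{1/(1+p)},
\qquad
t_j^{-} = \Big(\frac{C_j^{-}}{C_1^{-}}\Big)^{1/(1+p)}

where C_i^+ and C_j^- are the pairwise costs accumulated at positions i and j, and the t's are re-estimated from those accumulated costs after each boosting iteration.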

robhowley and others added 30 commits February 1, 2021 21:53
run tests on pr to position_debias
add unbiased config params
feat: rough migration of acbull's source
chore: give better name to lambdarank bias regularizer
remove unused param position_bins_, replaced by truncation_level
@robert-howley-zocdoc (Author) commented Sep 7, 2021

> Thanks very much for the contribution @robert-howley-zocdoc !
>
> Once you sign the CLA, we'd be happy to start providing a review. Ask here if you have any questions or concerns!

By the way, sorry for the long lapse. Getting (internal) approval on the CLA took a bit, and then I was on vacation for 2+ weeks. I'm back now and will be responsive.

@jameslamb (Collaborator)

> back and will be responsive.

no problem at all, we understand! If you're able to submit a separate PR with some of the recommended small changes like the .gitignore one (#4531 (comment)), you might be able to go faster here because you won't have to wait for maintainers to manually approve Continuous Integration runs (GitHub only applies that gate to first-time contributors in a repo).

I also want to set the expectation that I personally am probably not qualified to review that the submitted code is algorithmically correct, and that we'll need a review from @guolinke or @shiyu1994 or @btrotta for that.

Thanks again for working with us on this great addition!

@jameslamb (Collaborator)

It's been about two months since this PR had any activity.

@shiyu1994 @guolinke @btrotta could one of you please provide a review? I'd like to bring this to a resolution before the 4.0 release.


@robert-howley-zocdoc could you update this to the latest master? I tried to do that myself tonight, but I think you've opted out of allowing maintainers to push to branches on your fork.

What I tried:
git clone \
    git@github.com:Zocdoc/LightGBM.git \
    zocdoc-lgb

cd zocdoc-lgb
git remote add upstream git@github.com:microsoft/LightGBM.git
git pull upstream master
git fetch origin position_unbiased
git checkout position_unbiased
git merge master
git push origin position_unbiased

ERROR: Permission to Zocdoc/LightGBM.git denied to jameslamb.
fatal: Could not read from remote repository.

Please make sure you have the correct access rights
and the repository exists.

@shiyu1994 (Collaborator)

Sorry for the late response. Will review this soon.

@thvasilo commented Dec 3, 2021

Hi @robert-howley-zocdoc, could you clarify this assumption a bit?

> implementation assumes queries are fed into training in the order in which they appeared

I was trying to determine where the document position appears in the calculation. Does the assumption mean that, for a query/document group like:

idx  query_id     var1      var2      var3  relevance
0           0  0.624776  0.191463  0.598358          0
1           0  0.258280  0.658307  0.148386          0
2           0  0.893683  0.059482  0.340426          0
3           0  0.879514  0.526022  0.712648          1
4           0  0.188580  0.279471  0.062942          0

the implementation assumes that the record with idx=0 appeared first in the results, idx=1 second, etc.? So it assumes that the input dataset was created with presentation order in mind, correct? Otherwise we'd need to include another column indicating position in the calculation.

@thvasilo commented Dec 20, 2021

Hi @shiyu1994, based on the comment above, my limited testing of this PR, and reading through the paper/code, it looks correct to me. I'd still like to run a couple of tests with the same datasets as in the paper to see how close this gets to the results there.

One question I had: after this gets merged, one extension I'd like to work on is allowing the position to be a separate column that is available to the optimization process. The reason I'm suggesting this is that in real-world data you often have missing rows/data and other logging errors that would break the contiguity assumption of this implementation, leading to incorrect position biases being updated.

I wanted to get your early opinion on this potential contribution. Position data could be added as a metadata attribute, similar to query boundaries.

@robhowley

@thvasilo

> the implementation assumes that the record with idx=0 appeared first in the results, idx=1 second, etc.? So it assumes that the input dataset was created with presentation order in mind, correct? Otherwise we'd need to include another column indicating position in the calculation.

Correct on all three. I use LightGBM via mmlspark, so I added a sortWithinPartitions call to Zocdoc's fork of that repo to take care of it. Your point about real-world data potentially having incomplete sets is completely valid.

@thvasilo commented Dec 22, 2021

Hi @robhowley, one thing I'm noticing in the debug logs is that while the position biases decrease monotonically for most positions, the cutoff position (truncation level) has a large bias; I even see it go above 1 for bias_i, whereas the biases for the preceding positions are very low, as expected. Any idea what could be causing that? What's special about the bias value at the cutoff?

Another thing I'm observing is that the bias_j values for positions 2 and 3 are > 1 (I've seen this for many values of p_norm). Have you observed that or could this be a data issue (because of the real-world data issues I mentioned)?

Comment on lines +232 to +233
int debias_high_rank = static_cast<int>(std::min(high, truncation_level_ - 1));
int debias_low_rank = static_cast<int>(std::min(low, truncation_level_ - 1));
@thvasilo commented Jan 10, 2022

Hi, I am trying to investigate the issue I'm observing with the biases at the truncation cutoff, and I'm wondering if the thresholding here is the culprit.

Within the context of this implementation, the case where debias_high_rank gets the value truncation_level - 1 is when high > truncation_level - 1, i.e. sorted_idx[high_rank] > truncation_level - 1, i.e. sorted_idx[j] > truncation_level - 1.

Correct me if I'm wrong, but the only case where the above happens is when j is the high_rank, and j starts from i+1, which can take a value greater than truncation_level, right?

I was wondering if it would be better to check the truncation level for both the high and low ranks and only update the position biases when both are below truncation_level, instead of thresholding their values.

As I'm planning the update that introduces the position column as a metadata value, what I'm thinking is along the lines of:

const data_size_t high = sorted_idx[high_rank];
const data_size_t low = sorted_idx[low_rank];
data_size_t high_position, low_position;
if (unbiased_) {
  // record_positions_ is a vector of length data_size, with the position of each 
  // record in the data. start is the offset for the current query
  high_position = record_positions_[start + sorted_idx[high_rank]];
  low_position = record_positions_[start + sorted_idx[low_rank]];
} else {
  high_position = high;
  low_position = low;
}

if (unbiased_) {
  double p_cost = log(1.0f / (1.0f - p_lambda)) * delta_pair_NDCG;

  // more relevant (clicked) gets debiased by less relevant (unclicked), only if within truncation levels
  if (high_position <= truncation_level_ && low_position <= truncation_level_) {
    i_costs_buffer_[tid][high_position] += p_cost / j_biases_pow_[low_position];
    j_costs_buffer_[tid][low_position] += p_cost / i_biases_pow_[high_position];  // and vice versa
  }
}
// By default we set values of 1.0 as no-ops
double i_bias_pow = 1.0;
double j_bias_pow = 1.0;
// We only use actual bias values if they are both within the truncation limits
if (unbiased_ && high_position <= truncation_level_ && low_position <= truncation_level_) {
  i_bias_pow = i_biases_pow_[high_position];
  j_bias_pow = j_biases_pow_[low_position];
}
// update, using the 1.0 (no-op) biases if at least one of the data points fell outside the truncation threshold, otherwise the actual biases
p_lambda *= -sigmoid_ * delta_pair_NDCG / i_bias_pow / j_bias_pow;
p_hessian *= sigmoid_ * sigmoid_ * delta_pair_NDCG / i_bias_pow / j_bias_pow;

Does this make sense? The main suggestion is to not "clamp" the bias positions to [0, truncation_level - 1] but rather to skip the bias updates when one of the document-pair items falls outside the truncation level (making all bias positions valid).

The difference between high/low and high_rank/low_rank is still a bit unclear to me. IIRC, we sort the records in a query by their score, and to establish high/low rank we compare their label values.

What I want to ensure is that, since my record_positions_ will have the original data order, the extracted position corresponds to the expected record, which I think is what we accomplish with record_positions_[start + sorted_idx[high_rank]].

@shiyu1994 (Collaborator)

@thvasilo Thanks for reviewing this.
What confuses me here is that the index into the cost buffer is a truncation of the high and low variables. That doesn't make sense to me, because high and low are just document indices. According to equations (28)~(31) in the Unbiased LambdaMART paper, the truncation should be done on the current rank assigned by the model instead of on the document index, since with the truncation level we are truncating the loss in equation (28) to consider only the top-ranked i's.

@shiyu1994 (Collaborator)

I believe static_cast<int>(std::min(high, truncation_level_ - 1)); and static_cast<int>(std::min(low, truncation_level_ - 1)); here should instead be static_cast<int>(std::min(high_rank, truncation_level_ - 1)); and static_cast<int>(std::min(low_rank, truncation_level_ - 1));, as sketched below.
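For concreteness, a minimal sketch of that suggestion in context; the surrounding variable names are taken from this PR's lambdarank objective, and this reflects the suggested fix rather than the code as currently submitted:

// truncate on the rank assigned by the current model, not on the document index
int debias_high_rank = static_cast<int>(std::min(high_rank, truncation_level_ - 1));
int debias_low_rank = static_cast<int>(std::min(low_rank, truncation_level_ - 1));

// the estimated position biases then divide the pairwise gradient
p_lambda *= -sigmoid_ * delta_pair_NDCG / i_biases_pow_[debias_high_rank] / j_biases_pow_[debias_low_rank];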

@shiyu1994 (Collaborator) left a comment

@robert-howley-zocdoc Thanks for the great contribution. Please kindly check my comments. The implementation is OK overall, except for some minor mistakes and inconsistencies with the Unbiased LambdaMART paper.

@@ -199,15 +222,26 @@ class LambdarankNDCG : public RankingObjective {
  // get delta NDCG
  double delta_pair_NDCG = dcg_gap * paired_discount * inverse_max_dcg;
  // regular the delta_pair_NDCG by score distance
- if (norm_ && best_score != worst_score) {
+ if ((norm_ || unbiased_) && best_score != worst_score) {

It is unclear to me why with unbiased Lambdarank we must normalize the delta score here.

// update
p_lambda *= -sigmoid_ * delta_pair_NDCG;
p_hessian *= sigmoid_ * sigmoid_ * delta_pair_NDCG;
p_lambda *= -sigmoid_ * delta_pair_NDCG / i_biases_pow_[debias_high_rank] / j_biases_pow_[debias_low_rank];

Suppose we are now before starting iteration k + 1; then i_biases_pow_ and j_biases_pow_ would be calculated based on the prediction results up to iteration k - 1. Only the newly updated i_biases_pow_ and j_biases_pow_ in UpdatePositionBiasesAndGradients, after GetGradients, consider the prediction values up to iteration k. This is inconsistent with Algorithm 1 in the paper, though I don't think this should affect the performance much.

j_biases_pow_[i] = pow(j_costs_[i] / j_costs_[0], position_bias_regularizer);
}

LogDebugPositionBiases();

Yes. Please remove this, or wrap the debug code with the #ifdef DEBUG macro.
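A minimal sketch of that wrapping, assuming DEBUG is the macro already used for debug-only code in the code base:

#ifdef DEBUG
  LogDebugPositionBiases();
#endif  // DEBUG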

double bias_p_norm_;

/*! \brief Position bias regularizer exponent, 1 / (1 + bias_p_norm_) */
double position_bias_regularizer;

We'd better add a trailing underscore to the member's name to be consistent with the rest of the code base, i.e. position_bias_regularizer_.
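A small sketch of the suggested rename; the initialization line is only an assumption about where the exponent described in the comment above would be computed:

/*! \brief Position bias regularizer exponent, 1 / (1 + bias_p_norm_) */
double position_bias_regularizer_;

// e.g., during objective initialization (assumed location)
position_bias_regularizer_ = 1.0 / (1.0 + bias_p_norm_);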

@thvasilo commented Jun 21, 2022

Hi @shiyu1994, if @robert-howley-zocdoc doesn't have the time to make the requested changes, I could do the final few changes on top of @robert-howley-zocdoc's great work to push this through. Then I can create a follow-up PR with my extensions that enable a custom position column.

@StrikerRUS (Collaborator)

@thvasilo Sounds great!

@shiyu1994 (Collaborator)

@thvasilo Thanks! That would be great! Do you have permission to push to this ongoing branch? If not, let's figure out how to make it possible for you.

@robhowley

Yeah, sorry for not being on top of this. I don't work at Zocdoc anymore, which makes allocating time to this a little tricky. Feel free to take over as needed.

@shiyu1994 (Collaborator)

I'm picking this up, but it seems that I don't have permission to push to Zocdoc/LightGBM, so I'll merge this into a branch in our repo and open a new PR.

@jameslamb (Collaborator)

Thanks @shiyu1994, yeah, I agree with that approach! Thanks @robhowley for getting this work to this point and for agreeing to let us pick it up and carry it forward. Really, really appreciate it!

@shiyu1994 (Collaborator)

Closing this and moving to #5393.

@github-actions (bot)

This pull request has been automatically locked since there has not been any recent activity since it was closed.
To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues
including a reference to this.

@github-actions bot locked as resolved and limited conversation to collaborators Aug 15, 2023