diff --git a/doc/tutorials/learning_to_rank.rst b/doc/tutorials/learning_to_rank.rst index 5afd9314f055..25fd563e9346 100644 --- a/doc/tutorials/learning_to_rank.rst +++ b/doc/tutorials/learning_to_rank.rst @@ -11,9 +11,9 @@ Learning to Rank ******** Overview ******** -Often in the context of information retrieval, learning to rank aims to train a model that arranges a set of query results into an ordered list `[1] <#references>`__. For surprivised learning to rank, the predictors are sample documents encoded as feature matrix, and the labels are relevance degree for each sample. Relevance degree can be multi-level (graded) or binary (relevant or not). The training samples are often grouped by their query index with each query group containing multiple query results. +Often in the context of information retrieval, learning-to-rank aims to train a model that arranges a set of query results into an ordered list `[1] <#references>`__. For supervised learning-to-rank, the predictors are sample documents encoded as a feature matrix, and the labels are relevance degree for each sample. Relevance degree can be multi-level (graded) or binary (relevant or not). The training samples are often grouped by their query index with each query group containing multiple query results. -XGBoost implements learning to rank through a set of objective functions and performane metrics. The default objective is ``rank:ndcg`` based on the ``LambdaMART`` `[2] <#references>`__ algorithm, which in turn is an adaptation of the ``LambdaRank`` `[3] <#references>`__ framework to gradient boosting trees. For a history and a summary of the algorithm, see `[5] <#references>`__. The implementation in XGBoost features deterministic GPU computation, distributed training, position debiasing and two different pair construction strategies. +XGBoost implements learning to rank through a set of objective functions and performance metrics.
The default objective is ``rank:ndcg`` based on the ``LambdaMART`` `[2] <#references>`__ algorithm, which in turn is an adaptation of the ``LambdaRank`` `[3] <#references>`__ framework to gradient boosting trees. For a history and a summary of the algorithm, see `[5] <#references>`__. The implementation in XGBoost features deterministic GPU computation, distributed training, position debiasing and two different pair construction strategies. ************************************ Training with the Pariwise Objective @@ -38,7 +38,7 @@ Training with the Pariwise Objective | 2 | 1 | :math:`x_7` | +-------+-----------+---------------+ -Notice that the samples are sorted based on their query index in an non-decreasing order. Here the first three samples belong to the first query and the next four samples belong to the second. For the sake of simplicity, we will use a pseudo binary learning to rank dataset in the following snippets, with binary labels representing whether the result is relevant or not, and randomly assign the query group index to each sample. For an example that uses a real world dataset, please see :ref:`sphx_glr_python_examples_learning_to_rank.py`. +Notice that the samples are sorted based on their query index in a non-decreasing order. In the above example, the first three samples belong to the first query and the next four samples belong to the second. For the sake of simplicity, we will use a synthetic binary learning-to-rank dataset in the following code snippets, with binary labels representing whether the result is relevant or not, and randomly assign the query group index to each sample. For an example that uses a real world dataset, please see :ref:`sphx_glr_python_examples_learning_to_rank.py`. .. 
code-block:: python @@ -47,7 +47,7 @@ Notice that the samples are sorted based on their query index in an non-decreasi import xgboost as xgb - # Make a pseudo ranking dataset for demonstration + # Make a synthetic ranking dataset for demonstration X, y = make_classification(random_state=rng) rng = np.random.default_rng(1994) n_query_groups = 3 @@ -58,14 +58,14 @@ Notice that the samples are sorted based on their query index in an non-decreasi X = X[sorted_idx, :] y = y[sorted_idx] -The simpliest way to train a ranking model is by using the sklearn estimator interface. Continuing the previous snippet, we can train a simple ranking model without tuning: +The simplest way to train a ranking model is by using the scikit-learn estimator interface. Continuing the previous snippet, we can train a simple ranking model without tuning: .. code-block:: python ranker = xgb.XGBRanker(tree_method="hist", lambdarank_num_pair_per_sample=8, objective="rank:ndcg", lambdarank_pair_method="topk") ranker.fit(X, y, qid=qid) -Please note that, as of writing, there's no learning to rank interface in sklearn. As a result, the :py:class:`xgboost.XGBRanker` does not fully conform the sklearn estimator guideline and can not be directly used with some of its utility functions. For instances, the ``auc_score`` and ``ndcg_score`` in sklearn don't consider group information nor the pairwise loss. Most of the metrics are implemented as part of XGBoost, but to use sklearn utilities like :py:func:`sklearn.model_selection.cross_validation`, we need to make some adjustments in order to pass the `qid` as an additional parameter for :py:meth:`xgboost.XGBRanker.score`. The `X` for :py:class:`xgboost.XGBRanker` may contain a special column called ``qid`` when it's a pandas dataframe or a cuDF dataframe: +Please note that, as of writing, there's no learning-to-rank interface in scikit-learn.
As a result, the :py:class:`xgboost.XGBRanker` class does not fully conform to the scikit-learn estimator guideline and cannot be directly used with some of its utility functions. For instance, the ``auc_score`` and ``ndcg_score`` in scikit-learn don't consider query group information nor the pairwise loss. Most of the metrics are implemented as part of XGBoost, but to use scikit-learn utilities like :py:func:`sklearn.model_selection.cross_validate`, we need to make some adjustments in order to pass the ``qid`` as an additional parameter for :py:meth:`xgboost.XGBRanker.score`. Given a data frame ``X`` (either pandas or cuDF), add the column ``qid`` as follows: .. code-block:: python @@ -74,7 +74,7 @@ Please note that, as of writing, there's no learning to rank interface in sklear ranker.fit(df, y) # No need to pass qid as a separate argument from sklearn.model_selection import StratifiedGroupKFold, cross_val_score - # Works with cv in sklearn, along with HPO utilities like grid search cv. + # Works with cv in scikit-learn, along with HPO utilities like GridSearchCV kfold = StratifiedGroupKFold(shuffle=False) cross_val_score(ranker, df, y, cv=kfold, groups=df.qid) @@ -91,7 +91,7 @@ The above snippets build a model using ``LambdaMART`` with the ``NDCG@8`` metric ************* Position Bias ************* -Real relevance degree for query result is difficult to obtain as it often requires human judegs to examine the content of query results. When such labeled data is absent, we might want to train the model on ground truth data like user clicks. Another upside of using click data directly is that it can relect the up-to-date relevance status `[1] <#references>`__. However, user clicks are often nosiy and biased as users tend to choose results displayed in higher position. To ameliorate this issue, XGBoost implements the ``Unbiased LambdaMART`` `[4] <#references>`__ algorithm to debias the position-dependent click data.
The feature can be enabled by the ``lambdarank_unbiased`` parameter, see :ref:`ltr-param` for related options and :ref:`sphx_glr_python_examples_learning_to_rank.py` for a worked example with simulated user clicks. +Obtaining real relevance degrees for query results is an expensive and strenuous task, as it requires human labelers to label all results one by one. When such a labeling task is infeasible, we might want to train the learning-to-rank model on user click data instead, as it is relatively easy to collect. Another advantage of using click data directly is that it can reflect the most up-to-date user preferences `[1] <#references>`__. However, user clicks are often biased, as users tend to choose results that are displayed in higher positions. User clicks are also noisy, since users sometimes accidentally click on irrelevant documents. To ameliorate this issue, XGBoost implements the ``Unbiased LambdaMART`` `[4] <#references>`__ algorithm to debias the position-dependent click data. The feature can be enabled by the ``lambdarank_unbiased`` parameter; see :ref:`ltr-param` for related options and :ref:`sphx_glr_python_examples_learning_to_rank.py` for a worked example with simulated user clicks. **** Loss @@ -109,9 +109,9 @@ XGBoost implements different ``LambdaMART`` objectives based on different metric * Pairwise -The `LambdaMART` algorithm scales the logistic loss with learning to rank metrics like ``NDCG`` in the hope of including ranking infomation into the loss function. The ``rank:pairwise`` loss is the orginal version of the pairwise loss, also known as the `RankNet loss` `[7] <#references>`__ or the `pairwise logistic loss`. Unlike the ``rank:map`` and the ``rank:ndcg``, no scaling is applied (:math:`|\Delta Z_{ij}| = 1`). +The `LambdaMART` algorithm scales the logistic loss with learning to rank metrics like ``NDCG`` in the hope of including ranking information into the loss function.
The ``rank:pairwise`` loss is the original version of the pairwise loss, also known as the `RankNet loss` `[7] <#references>`__ or the `pairwise logistic loss`. Unlike the ``rank:map`` and the ``rank:ndcg``, no scaling is applied (:math:`|\Delta Z_{ij}| = 1`). -Whether scaling with a LTR metric is actually more effective is still up for debate, `[8] <#references>`__ provides a theoretical foundation for general lambda loss functions and some insights into the framework. +Whether scaling with a LTR metric is actually more effective is still up for debate; `[8] <#references>`__ provides a theoretical foundation for general lambda loss functions and some insights into the framework. ****************** Constructing Pairs @@ -119,34 +119,46 @@ Constructing Pairs There are two implemented strategies for constructing document pairs for :math:`\lambda`-gradient calculation. The first one is the ``mean`` method, another one is the ``topk`` method. The preferred strategy can be specified by the ``lambdarank_pair_method`` parameter. -For the ``mean`` strategy, XGBoost samples ``lambdarank_num_pair_per_sample`` pairs for each document in a query list. For example, given a list of 3 documents and ``lambdarank_num_pair_per_sample`` is set to 2, XGBoost will randomly sample 6 pairs assuming the labels for these documents are different. On the other hand, if the pair method is set to ``topk``, XGBoost constructs about :math:`k \times |query|` number of pairs with :math:`|query|` pairs for each sample at the top :math:`k = lambdarank\_num\_pair` position. The number of pairs counted here is an approximation since we skip pairs that have the sample label. +For the ``mean`` strategy, XGBoost samples ``lambdarank_num_pair_per_sample`` pairs for each document in a query list. For example, given a list of 3 documents and ``lambdarank_num_pair_per_sample`` is set to 2, XGBoost will randomly sample 6 pairs, assuming the labels for these documents are different. 
On the other hand, if the pair method is set to ``topk``, XGBoost constructs about :math:`k \times |query|` number of pairs with :math:`|query|` pairs for each sample at the top :math:`k = lambdarank\_num\_pair` position. The number of pairs counted here is an approximation since we skip pairs that have the same label. ********************* Obtaining Good Result ********************* -Learning to rank is a sophisticated task and a field of heated research. It's not trivial to train a model that generalizes well. There are multiple loss functions available in XGBoost along with a set of hyper-parameters. This section contains some hints for how to choose those parameters as a starting point. One can further optimize the model by tuning these parameters. +Learning to rank is a sophisticated task and an active research area. It's not trivial to train a model that generalizes well. There are multiple loss functions available in XGBoost along with a set of hyperparameters. This section contains some hints for how to choose hyperparameters as a starting point. One can further optimize the model by tuning these hyperparameters. -The first question would be how to choose an objective that matches the task at hand. If your input data is multi-level relevance degree, then either ``rank:ndcg`` or ``rank:pairwise`` should be used. However, when the input is binary we have multiple options based on the target metric. `[6] <#references>`__ provides some guidelines on this topic and users are encouraged to see the analysis done in their work. The choice should be based on the number of `effective pairs`, which refers to the number of pairs that can generate non-zero gradient and contribute to training. `LambdaMART` with ``MRR`` has the least amount of effective pairs as the :math:`\lambda`-gradient is only non-zero when the pair contains a non-relevant document ranked higher than the top relevant document. As a result, it's not implemented in XGBoost. 
Since ``NDCG`` is a multi-level metric, it usually generate more effective pairs than ``MAP``. +The first question would be how to choose an objective that matches the task at hand. If your input data has multi-level relevance degrees, then either ``rank:ndcg`` or ``rank:pairwise`` should be used. However, when the input has binary labels, we have multiple options based on the target metric. `[6] <#references>`__ provides some guidelines on this topic and users are encouraged to see the analysis done in their work. The choice should be based on the number of `effective pairs`, which refers to the number of pairs that can generate a non-zero gradient and contribute to training. `LambdaMART` with ``MRR`` has the fewest effective pairs as the :math:`\lambda`-gradient is only non-zero when the pair contains a non-relevant document ranked higher than the top relevant document. As a result, it's not implemented in XGBoost. Since ``NDCG`` is a multi-level metric, it usually generates more effective pairs than ``MAP``. -However, when there's a sufficient amount of effective pairs, it's shown in `[6] <#references>`__ that matching the target metric with the objective is of significance. When the target metric is ``MAP`` and you are using a large dataset that can provide a sufficient amount of effective pairs, ``rank:map`` can in theory yield higher ``MAP`` value than the ``rank:ndcg``. +However, when there are sufficiently many effective pairs, it's shown in `[6] <#references>`__ that matching the target metric with the objective is of significance. When the target metric is ``MAP`` and you are using a large dataset that can provide a sufficient number of effective pairs, ``rank:map`` can in theory yield a higher ``MAP`` value than ``rank:ndcg``.
-The choice of pair method (``lambdarank_pair_method``) and the number of pairs for each sample (``lambdarank_num_pair_per_sample``) is similar, as the mean-``NDCG`` considers more pairs than ``NDCG@10``, it can generate more effective pairs and provide more granularity. Also, using the ``mean`` strategy can help the model generalize with random sampling. However, one might want to focus the training on the top :math:`k` documents instead of using all pairs in practice, the tradeoff should be made based on the user's goal. +The consideration of effective pairs also applies to the choice of pair method (``lambdarank_pair_method``) and the number of pairs for each sample (``lambdarank_num_pair_per_sample``). For example, the mean-``NDCG`` considers more pairs than ``NDCG@10``, so the former generates more effective pairs and provides more granularity than the latter. Also, using the ``mean`` strategy can help the model generalize with random sampling. However, one might want to focus the training on the top :math:`k` documents instead of using all pairs, to better fit their real-world application. -When using mean value instead of targeting a specific position by calculating the target metric (like ``NDCG``) over the whole query list, user can specify how many pairs they want in each query by setting the ``lambdarank_num_pair_per_sample`` and XGBoost will randomly sample this amount of pairs for each element in the query group (:math:`|pairs| = |query| \times num\_pairsample`). Often time, setting it to 1 can produce reasonable result, with higher value producing more pairs (with the hope that a reasonable amount of them being effective). On the other hand, if you are prioritizing the top :math:`k` documents, the ``lambdarank_num_pair_per_sample`` should be set to slightly higher than :math:`k` (with a few more documents) to obtain a good training result. 
+When using the ``mean`` strategy for generating pairs, where the target metric (like ``NDCG``) is computed over the whole query list, users can specify how many pairs should be generated for each document by setting the ``lambdarank_num_pair_per_sample``. XGBoost will randomly sample ``lambdarank_num_pair_per_sample`` pairs for each element in the query group (:math:`|pairs| = |query| \times num\_pairsample`). Often, setting it to 1 can produce reasonable results. In cases where performance is inadequate due to an insufficient number of effective pairs being generated, set ``lambdarank_num_pair_per_sample`` to a higher value. As more document pairs are generated, more effective pairs will be generated as well. -In summary, to start off the training, if you have a large dataset, consider using the target-matching objective, otherwise ``NDCG`` or the RankNet loss (``rank:pairwise``) might be preferred. With the same target metric, use the ``lambdarank_num_pair_per_sample`` to specify the top :math:`k` documents for training if your dataset is large, and use the mean value version otherwise. Lastly, ``lambdarank_num_pair_per_sample`` can be used to control the amount of pairs for both methods. +On the other hand, if you are prioritizing the top :math:`k` documents, the ``lambdarank_num_pair_per_sample`` should be set slightly higher than :math:`k` (with a few more documents) to obtain a good training result. + +**Summary** If you have a large amount of training data: + +* Use the target-matching objective. +* Choose the ``topk`` strategy for generating document pairs (if it's appropriate for your application). + +On the other hand, if you have a comparatively small amount of training data: + +* Select ``NDCG`` or the RankNet loss (``rank:pairwise``). +* Choose the ``mean`` strategy for generating document pairs, to obtain more effective pairs. + +For any method chosen, you can modify ``lambdarank_num_pair_per_sample`` to control the number of pairs generated.
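The pair-count arithmetic for the ``mean`` strategy described above can be illustrated with a small NumPy sketch. This is not XGBoost's sampler: ``sample_mean_pairs`` is a hypothetical helper written only for this illustration, and it assumes partners are drawn uniformly with replacement from the other documents in the same query group, with same-label pairs skipped.

```python
import numpy as np


def sample_mean_pairs(labels, num_pair_per_sample, seed=0):
    """Hypothetical illustration (not XGBoost's implementation).

    For each document in one query group, draw ``num_pair_per_sample``
    partner documents and keep only pairs whose labels differ.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    n_docs = labels.shape[0]
    if n_docs < 2:
        # A single document cannot form a pair.
        return []
    pairs = []
    for i in range(n_docs):
        # Candidate partners are all other documents in the same query.
        others = np.delete(np.arange(n_docs), i)
        partners = rng.choice(others, size=num_pair_per_sample, replace=True)
        for j in partners:
            # Pairs with identical labels carry no ranking signal; skip them.
            if labels[i] != labels[j]:
                pairs.append((i, int(j)))
    return pairs


# Three documents with distinct labels, two pairs per document: 3 * 2 = 6 pairs.
distinct = sample_mean_pairs([2, 1, 0], num_pair_per_sample=2)
# Binary labels with a tie: some sampled pairs may be skipped.
tied = sample_mean_pairs([1, 1, 0], num_pair_per_sample=2)
print(len(distinct), len(tied))
```

With distinct labels the count reaches the :math:`|query| \times num\_pairsample` bound; with tied labels it can only stay at or below that bound, which is why the number of pairs is described as approximate.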
******************** Distributed Training ******************** -XGBoost implements distributed learning-to-rank with integration of multiple frameworks including dask, spark, and pyspark. The interface is similar to single node. Please refer to document of the respective XGBoost interface for details. Scattering a query group onto multiple workers is theoretically sound but can affect the model accuracy. For most of the use cases, the small discrepancy is not an issue since when distributed training is involved the dataset is usually large. As a result, users don't need to partition the data based on group information. Given the dataset is correctly sorted, XGBoost can aggregate sample gradients accordingly. +XGBoost implements distributed learning-to-rank with integration of multiple frameworks including Dask, Spark, and PySpark. The interface is similar to the single-node counterpart. Please refer to the documentation of the respective XGBoost interface for details. Scattering a query group onto multiple workers is theoretically sound but can affect the model accuracy. For most of the use cases, the small discrepancy is not an issue, as the amount of training data is usually large when distributed training is used. As a result, users don't need to partition the data based on query groups. As long as each data partition is correctly sorted by query IDs, XGBoost can aggregate sample gradients accordingly. ******************* -Reproducbile Result +Reproducible Result ******************* -Like any other tasks, XGBoost should generate reproducbile results given the same hardware and software environments, along with data partitions if distributed interface is used. Even when the underlying environment has changed, the result should still be consistent.
However, when the ``lambdarank_pair_method`` is set to ``mean``, XGBoost uses sampling, and the random number generator used on Windows (MSVC) is different from the one used on other platforms like Linux (GCC, Clang), the output varies significantly between these platforms. +As with any other task, XGBoost should generate reproducible results given the same hardware and software environments (and data partitions, if the distributed interface is used). Even when the underlying environment has changed, the result should still be consistent. However, when the ``lambdarank_pair_method`` is set to ``mean``, XGBoost uses random sampling, and results may differ depending on the platform used. The random number generator used on Windows (Microsoft Visual C++) is different from the ones used on other platforms like Linux (GCC, Clang), so the output varies significantly between these platforms. ********** References