From b968224a181e4067be1114ba83c5a12a5bb3d449 Mon Sep 17 00:00:00 2001
From: Jiaming Yuan
Date: Sat, 23 Nov 2024 00:14:54 +0800
Subject: [PATCH] doc.

---
 doc/tutorials/dask.rst                   | 16 +++++++++-------
 python-package/xgboost/dask/__init__.py |  3 ++-
 2 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/doc/tutorials/dask.rst b/doc/tutorials/dask.rst
index 60c572fad95a..036b1e725d47 100644
--- a/doc/tutorials/dask.rst
+++ b/doc/tutorials/dask.rst
@@ -543,7 +543,7 @@ Learning to Rank
 
 There are two operation modes in the Dask learning to rank for performance reasons. The
 difference is whether a distributed global sort is needed. Please see :ref:`ltr-dist` for
-how rankings work with distributed training in general. Below we will discuss some of the
+how ranking works with distributed training in general. Below we will discuss some of the
 Dask-specific features.
 
 First, if you use the :py:class:`~xgboost.dask.DaskQuantileDMatrix` interface or the
@@ -552,11 +552,12 @@ XGBoost will try to sort and group the samples for each worker based on the quer
 mode tries to skip the global sort and sort only worker-local data, and hence no
 inter-worker data shuffle. Please note that even worker-local sort is costly,
 particularly in terms of memory usage as there's no spilling when
-:py:meth:`~pandas.DataFrame.sort_values` is used. XGBoost first checks whether the QID is
-already sorted before actually performing the sorting operation. One can choose this if
-the query groups are relatively consecutive, meaning most of the samples within a query
-group are close to each other and are likely to be resided to the same worker. Don't use
-this if you have performed a random shuffle on your data.
+:py:meth:`~pandas.DataFrame.sort_values` is used, and we need to concatenate the
+data. XGBoost first checks whether the QID is already sorted before actually performing
+the sorting operation. One can choose this mode if the query groups are relatively
+consecutive, meaning most of the samples within a query group are close to each other and
+are likely to reside on the same worker. Don't use this mode if you have performed a
+random shuffle on your data.
 
 If the input data is random, then there's no way we can guarantee most of data within
 the same group being in the same worker. For large query groups, this might not be an
@@ -565,7 +566,8 @@ samples from their group for all groups, which can lead to disastrous performanc
 case, we can partition the data according to query group, which is the default behavior
 of the :py:class:`~xgboost.dask.DaskXGBRanker` unless the ``allow_group_split`` is set to
 ``True``. This mode performs a sort and a groupby on the entire dataset in addition to an
-encoding operation for the query group IDs, which can lead to slow performance. See
+encoding operation for the query group IDs. Along with partition fragmentation, this
+option can lead to slow performance. See
 :ref:`sphx_glr_python_dask-examples_dask_learning_to_rank.py` for a worked example.
 
 .. _tracker-ip:
diff --git a/python-package/xgboost/dask/__init__.py b/python-package/xgboost/dask/__init__.py
index 4a0ddaa6598f..dd659400d67d 100644
--- a/python-package/xgboost/dask/__init__.py
+++ b/python-package/xgboost/dask/__init__.py
@@ -1907,7 +1907,8 @@ def _argmax(x: Any) -> Any:
 
         Whether a query group can be split among multiple workers. When set to `False`,
         inputs must be Dask dataframes or series. If you have many small query groups,
-        this can significantly increase the fragmentation of the data.
+        this can significantly increase the fragmentation of the data, and the internal
+        DMatrix construction can take longer.
 
         .. warning::
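
The default mode documented above (no group split) can be sketched as follows. This is an illustration only and not part of the patch: the toy data, the column names (``f*``, ``y``, ``qid``), the cluster size, and the availability of the ``allow_group_split`` constructor parameter in the installed XGBoost build are all assumptions made for the example.

.. code-block:: python

    # Sketch only (not part of the patch): exercise the default mode of
    # DaskXGBRanker, where each query group is kept on a single worker
    # (allow_group_split=False).  Data set, sizes, and column names are
    # invented for illustration.
    import numpy as np
    import pandas as pd
    from dask import dataframe as dd
    from distributed import Client, LocalCluster

    from xgboost import dask as dxgb


    def main(client: Client) -> None:
        rng = np.random.default_rng(2024)
        n_samples, n_features, n_groups = 8192, 16, 32

        # Toy ranking data: dense features, a relevance label in [0, 4], and a
        # query ID column used for grouping.
        df = pd.DataFrame(
            rng.normal(size=(n_samples, n_features)),
            columns=[f"f{i}" for i in range(n_features)],
        )
        df["y"] = rng.integers(0, 5, size=n_samples)
        df["qid"] = rng.integers(0, n_groups, size=n_samples)

        # Dask dataframe (not array) input is required when allow_group_split
        # is False, per the docstring change in the patch.
        ddf = dd.from_pandas(df, npartitions=8)

        ranker = dxgb.DaskXGBRanker(allow_group_split=False, n_estimators=8)
        ranker.client = client
        ranker.fit(ddf.drop(columns=["y", "qid"]), ddf["y"], qid=ddf["qid"])

        predt = ranker.predict(ddf.drop(columns=["y", "qid"]))
        print(predt.compute()[:8])  # a few predicted relevance scores


    if __name__ == "__main__":
        with LocalCluster(n_workers=2) as cluster, Client(cluster) as client:
            main(client)

Setting ``allow_group_split=True`` instead would select the worker-local sort mode described earlier in the patch, which skips the global sort and groupby but works best when the data is already roughly ordered by query ID.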