From b69364e9280888914ea716f6e310add457b88d74 Mon Sep 17 00:00:00 2001 From: James Lamb Date: Fri, 11 Dec 2020 19:16:16 +0000 Subject: [PATCH] [docs] Add details on improving training speed (#3628) * [docs] Add details to docs on improving training speed * formatting * fix link * fix formatting * replace 'performance' with 'accuracy' and mention learning_rate * Apply suggestions from code review Co-authored-by: Nikita Titov * regenerate docs from config.h Co-authored-by: Nikita Titov --- docs/Parameters-Tuning.rst | 140 +++++++++++++++++++++++++++++++++++-- docs/Parameters.rst | 20 ++++-- include/LightGBM/config.h | 16 +++-- 3 files changed, 159 insertions(+), 17 deletions(-) diff --git a/docs/Parameters-Tuning.rst b/docs/Parameters-Tuning.rst index 1d16e823220d..8f47d03562bd 100644 --- a/docs/Parameters-Tuning.rst +++ b/docs/Parameters-Tuning.rst @@ -36,15 +36,145 @@ To get good results using a leaf-wise tree, these are some important parameters: For Faster Speed ---------------- -- Use bagging by setting ``bagging_fraction`` and ``bagging_freq`` +Add More Computational Resources +'''''''''''''''''''''''''''''''' -- Use feature sub-sampling by setting ``feature_fraction`` +On systems where it is available, LightGBM uses OpenMP to parallelize many operations. The maximum number of threads used by LightGBM is controlled by the parameter ``num_threads``. By default, this will defer to the default behavior of OpenMP (one thread per real CPU core or the value in environment variable ``OMP_NUM_THREADS``, if it is set). For best performance, set this to the number of **real** CPU cores available. + +You might be able to achieve faster training by moving to a machine with more available CPU cores. + +Using distributed (multi-machine) training might also reduce training time. See the `Distributed Learning Guide <./Parallel-Learning-Guide.rst>`_ for details. + +Use a GPU-enabled version of LightGBM +''''''''''''''''''''''''''''''''''''' + +You might find that training is faster using a GPU-enabled build of LightGBM. See the `GPU Tutorial <./GPU-Tutorial.rst>`__ for details. + +Grow Shallower Trees +'''''''''''''''''''' + +The total training time for LightGBM increases with the total number of tree nodes added. LightGBM comes with several parameters that can be used to control the number of nodes per tree. + +The suggestions below will speed up training, but might hurt training accuracy. + +Decrease ``max_depth`` +********************** + +This parameter is an integer that controls the maximum distance between the root node of each tree and a leaf node. Decrease ``max_depth`` to reduce training time. + +Decrease ``num_leaves`` +*********************** + +LightGBM adds nodes to trees based on the gain from adding that node, regardless of depth. This figure from `the feature documentation <./Features.rst#leaf-wise-best-first-tree-growth>`__ illustrates the process. + +.. image:: ./_static/images/leaf-wise.png + :align: center + +Because of this growth strategy, it isn't straightforward to use ``max_depth`` alone to limit the complexity of trees. The ``num_leaves`` parameter sets the maximum number of nodes per tree. Decrease ``num_leaves`` to reduce training time. + +Increase ``min_gain_to_split`` +****************************** + +When adding a new tree node, LightGBM chooses the split point that has the largest gain. Gain is basically the reduction in training loss that results from adding a split point. 
By default, LightGBM sets ``min_gain_to_split`` to 0.0, which means "there is no improvement that is too small". However, in practice you might find that very small improvements in the training loss don't have a meaningful impact on the generalization error of the model. Increase ``min_gain_to_split`` to reduce training time. + +Increase ``min_data_in_leaf`` and ``min_sum_hessian_in_leaf`` +************************************************************* + +Depending on the size of the training data and the distribution of features, it's possible for LightGBM to add tree nodes that only describe a small number of observations. In the most extreme case, consider the addition of a tree node that only a single observation from the training data falls into. This is very unlikely to generalize well, and probably is a sign of overfitting. + +This can be prevented indirectly with parameters like ``max_depth`` and ``num_leaves``, but LightGBM also offers parameters to help you directly avoid adding these overly-specific tree nodes. + +- ``min_data_in_leaf``: Minimum number of observations that must fall into a tree node for it to be added. +- ``min_sum_hessian_in_leaf``: Minimum sum of the Hessian (second derivative of the objective function evaluated for each observation) for observations in a leaf. For some regression objectives, this is just the minimum number of records that have to fall into each node. For classification objectives, it represents a sum over a distribution of probabilities. See `this Stack Overflow answer `_ for a good description of how to reason about values of this parameter. + +Grow Fewer Trees +'''''''''''''''' + +Decrease ``num_iterations`` +*************************** + +The ``num_iterations`` parameter controls the number of boosting rounds that will be performed. Since LightGBM uses decision trees as the learners, this can also be thought of as "number of trees". + +If you try changing ``num_iterations``, change the ``learning_rate`` as well. ``learning_rate`` will not have any impact on training time, but it will impact the training accuracy. As a general rule, if you reduce ``num_iterations``, you should increase ``learning_rate``. + +Choosing the right value of ``num_iterations`` and ``learning_rate`` is highly dependent on the data and objective, so these parameters are often chosen from a set of possible values through hyperparameter tuning. + +Decrease ``num_iterations`` to reduce training time. + +Use Early Stopping +****************** + +If early stopping is enabled, after each boosting round the model's accuracy is evaluated on a validation set that contains data not available to the training process. That accuracy is then compared to the accuracy as of the previous boosting round. If the model's accuracy fails to improve for some number of consecutive rounds, LightGBM stops the training process. + +That "number of consecutive rounds" is controlled by the parameter ``early_stopping_rounds``. For example, ``early_stopping_rounds=1`` says "the first time accuracy on the validation set does not improve, stop training". + +Set ``early_stopping_rounds`` and provide a validation set to possibly reduce training time. + +Consider Fewer Splits +''''''''''''''''''''' + +The parameters described in previous sections control how many trees are constructed and how many nodes are constructed per tree. Training time can be further reduced by reducing the amount of time needed to add a tree node to the model.
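The settings described above control how many trees are built and how many nodes each tree can contain. As a rough illustration only, the sketch below shows how they might be passed through the ``lightgbm`` Python package; the synthetic data, the 80/20 split, and every parameter value are placeholder assumptions rather than recommendations.

.. code:: python

    import lightgbm as lgb
    import numpy as np

    # Placeholder data: substitute your real features and labels here.
    rng = np.random.default_rng(42)
    X = rng.random((10_000, 20))
    y = rng.random(10_000)

    train_data = lgb.Dataset(X[:8_000], label=y[:8_000])
    # Early stopping needs a validation set the model does not train on.
    valid_data = lgb.Dataset(X[8_000:], label=y[8_000:], reference=train_data)

    params = {
        "objective": "regression",
        # Shallower trees with fewer leaves mean fewer nodes overall.
        "max_depth": 6,
        "num_leaves": 31,
        # Skip splits whose gain is negligible.
        "min_gain_to_split": 0.01,
        # If you lower the number of iterations, raise the learning rate.
        "learning_rate": 0.1,
        # Stop when the validation metric has not improved for 10 consecutive rounds.
        "early_stopping_round": 10,
    }

    booster = lgb.train(
        params,
        train_data,
        num_boost_round=100,  # i.e. num_iterations
        valid_sets=[valid_data],
    )

Passing ``early_stopping_round`` through ``params`` is one option; the Python package also exposes an ``lightgbm.early_stopping()`` callback for the same purpose.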
+ +The suggestions below will speed up training, but might hurt training accuracy. + +Enable Feature Pre-Filtering When Creating Dataset +************************************************** + +By default, when a LightGBM ``Dataset`` object is constructed, some features will be filtered out based on the value of ``min_data_in_leaf``. + +For a simple example, consider a 1000-observation dataset with a feature called ``feature_1``. ``feature_1`` takes on only two values: 25.0 (995 observations) and 50.0 (5 observations). If ``min_data_in_leaf = 10``, there is no split for this feature which will result in a valid split, because at least one of the leaf nodes would only have 5 observations. + +Instead of reconsidering this feature and then ignoring it on every iteration, LightGBM filters this feature out before training, when the ``Dataset`` is constructed. + +If this default behavior has been overridden by setting ``feature_pre_filter=False``, set ``feature_pre_filter=True`` to reduce training time. + +Decrease ``max_bin`` or ``max_bin_by_feature`` When Creating Dataset +******************************************************************** + +LightGBM training `buckets continuous features into discrete bins <./Features.rst#optimization-in-speed-and-memory-usage>`_ to improve training speed and reduce memory requirements for training. This binning is done one time during ``Dataset`` construction. The number of splits considered when adding a node is ``O(#feature * #bin)``, so reducing the number of bins per feature can reduce the number of splits that need to be evaluated. + +``max_bin`` controls the maximum number of bins that features will be bucketed into. It is also possible to set this maximum feature-by-feature, by passing ``max_bin_by_feature``. + +Reduce ``max_bin`` or ``max_bin_by_feature`` to reduce training time. + +Increase ``min_data_in_bin`` When Creating Dataset +************************************************** + +Some bins might contain a small number of observations, which might mean that evaluating that bin's boundaries as possible split points is unlikely to change the final model very much. You can control the granularity of the bins by setting ``min_data_in_bin``. + +Increase ``min_data_in_bin`` to reduce training time. + +Decrease ``feature_fraction`` +***************************** + +By default, LightGBM considers all features in a ``Dataset`` during the training process. This behavior can be changed by setting ``feature_fraction`` to a value ``> 0.0`` and ``< 1.0``. Setting ``feature_fraction`` to ``0.5``, for example, tells LightGBM to randomly select ``50%`` of features at the beginning of constructing each tree. This reduces the total number of splits that have to be evaluated to add each tree node. + +Decrease ``feature_fraction`` to reduce training time. + +Decrease ``max_cat_threshold`` +****************************** + +LightGBM uses a `custom approach for finding optimal splits for categorical features <./Advanced-Topics.html#categorical-feature-support>`_. In this process, LightGBM explores splits that break a categorical feature into two groups. These are sometimes called "k-vs.-rest" splits. Higher ``max_cat_threshold`` values correspond to more split points and larger possible group sizes to search. + +Decrease ``max_cat_threshold`` to reduce training time. + +Use Less Data +''''''''''''' + +Use Bagging +*********** + +By default, LightGBM uses all observations in the training data for each iteration.
It is possible to instead tell LightGBM to randomly sample the training data. This process of training over multiple random samples without replacement is called "bagging". + +Set ``bagging_freq`` to an integer greater than 0 to control how often a new sample is drawn. Set ``bagging_fraction`` to a value ``> 0.0`` and ``< 1.0`` to control the size of the sample. For example, ``{"bagging_freq": 5, "bagging_fraction": 0.75}`` tells LightGBM "re-sample without replacement every 5 iterations, and draw samples of 75% of the training data". + +Decrease ``bagging_fraction`` to reduce training time. -- Use small ``max_bin`` +Save Constructed Datasets with ``save_binary`` +'''''''''''''''''''''''''''''''''''''''''''''' -- Use ``save_binary`` to speed up data loading in future learning +This only applies to the LightGBM CLI. If you pass the parameter ``save_binary``, the training dataset and all validation sets will be saved in a binary format understood by LightGBM. This can speed up training next time, because binning and other work done when constructing a ``Dataset`` does not have to be re-done. -- Use parallel learning, refer to `Parallel Learning Guide <./Parallel-Learning-Guide.rst>`__ For Better Accuracy diff --git a/docs/Parameters.rst b/docs/Parameters.rst index 2189f186d361..22e98217d614 100644 --- a/docs/Parameters.rst +++ b/docs/Parameters.rst @@ -312,7 +312,7 @@ Learning Control Parameters - frequency for bagging - - ``0`` means disable bagging; ``k`` means perform bagging at every ``k`` iteration + - ``0`` means disable bagging; ``k`` means perform bagging at every ``k`` iteration. Every ``k``-th iteration, LightGBM will randomly select ``bagging_fraction * 100 %`` of the data to use for the next ``k`` iterations - **Note**: to enable bagging, ``bagging_fraction`` should be set to value smaller than ``1.0`` as well @@ -322,7 +322,7 @@ Learning Control Parameters - ``feature_fraction`` :raw-html:`🔗︎`, default = ``1.0``, type = double, aliases: ``sub_feature``, ``colsample_bytree``, constraints: ``0.0 < feature_fraction <= 1.0`` - - LightGBM will randomly select part of features on each iteration (tree) if ``feature_fraction`` smaller than ``1.0``. For example, if you set it to ``0.8``, LightGBM will select 80% of features before training each tree + - LightGBM will randomly select a subset of features on each iteration (tree) if ``feature_fraction`` is smaller than ``1.0``. For example, if you set it to ``0.8``, LightGBM will select 80% of features before training each tree - can be used to speed up training @@ -330,7 +330,7 @@ Learning Control Parameters - ``feature_fraction_bynode`` :raw-html:`🔗︎`, default = ``1.0``, type = double, aliases: ``sub_feature_bynode``, ``colsample_bynode``, constraints: ``0.0 < feature_fraction_bynode <= 1.0`` - - LightGBM will randomly select part of features on each tree node if ``feature_fraction_bynode`` smaller than ``1.0``. For example, if you set it to ``0.8``, LightGBM will select 80% of features at each tree node + - LightGBM will randomly select a subset of features on each tree node if ``feature_fraction_bynode`` is smaller than ``1.0``.
For example, if you set it to ``0.8``, LightGBM will select 80% of features at each tree node - can be used to deal with over-fitting @@ -348,6 +348,8 @@ Learning Control Parameters - if set to ``true``, when evaluating node splits LightGBM will check only one randomly-chosen threshold for each feature + - can be used to speed up training + - can be used to deal with over-fitting - ``extra_seed`` :raw-html:`🔗︎`, default = ``6``, type = int @@ -360,9 +362,11 @@ Learning Control Parameters - ``<= 0`` means disable + - can be used to speed up training + - ``first_metric_only`` :raw-html:`🔗︎`, default = ``false``, type = bool - - set this to ``true``, if you want to use only the first metric for early stopping + - LightGBM allows you to provide multiple evaluation metrics. Set this to ``true``, if you want to use only the first metric for early stopping - ``max_delta_step`` :raw-html:`🔗︎`, default = ``0.0``, type = double, aliases: ``max_tree_output``, ``max_leaf_output`` @@ -384,6 +388,8 @@ Learning Control Parameters - the minimal gain to perform split + - can be used to speed up training + - ``drop_rate`` :raw-html:`🔗︎`, default = ``0.1``, type = double, aliases: ``rate_drop``, constraints: ``0.0 <= drop_rate <= 1.0`` - used only in ``dart`` @@ -442,7 +448,9 @@ Learning Control Parameters - used for the categorical features - - limit the max threshold points in categorical features + - limit number of split points considered for categorical features. See `the documentation on how LightGBM finds optimal splits for categorical features <./Features.rst#optimal-split-for-categorical-features>`_ for more details + + - can be used to speed up training - ``cat_l2`` :raw-html:`🔗︎`, default = ``10.0``, type = double, constraints: ``cat_l2 >= 0.0`` @@ -668,7 +676,7 @@ Dataset Parameters - ``feature_pre_filter`` :raw-html:`🔗︎`, default = ``true``, type = bool - - set this to ``true`` to pre-filter the unsplittable features by ``min_data_in_leaf`` + - set this to ``true`` (the default) to tell LightGBM to ignore the features that are unsplittable based on ``min_data_in_leaf`` - as dataset object is initialized only once and cannot be changed after that, you may need to set this to ``false`` when searching parameters with ``min_data_in_leaf``, otherwise features are filtered by ``min_data_in_leaf`` firstly if you don't reconstruct dataset object diff --git a/include/LightGBM/config.h b/include/LightGBM/config.h index 02d4cd4e5918..8c017ad26926 100644 --- a/include/LightGBM/config.h +++ b/include/LightGBM/config.h @@ -304,7 +304,7 @@ struct Config { // alias = subsample_freq // desc = frequency for bagging - // desc = ``0`` means disable bagging; ``k`` means perform bagging at every ``k`` iteration + // desc = ``0`` means disable bagging; ``k`` means perform bagging at every ``k`` iteration. Every ``k``-th iteration, LightGBM will randomly select ``bagging_fraction * 100 %`` of the data to use for the next ``k`` iterations // desc = **Note**: to enable bagging, ``bagging_fraction`` should be set to value smaller than ``1.0`` as well int bagging_freq = 0; @@ -315,7 +315,7 @@ struct Config { // alias = sub_feature, colsample_bytree // check = >0.0 // check = <=1.0 - // desc = LightGBM will randomly select part of features on each iteration (tree) if ``feature_fraction`` smaller than ``1.0``. 
For example, if you set it to ``0.8``, LightGBM will select 80% of features before training each tree + // desc = LightGBM will randomly select a subset of features on each iteration (tree) if ``feature_fraction`` is smaller than ``1.0``. For example, if you set it to ``0.8``, LightGBM will select 80% of features before training each tree // desc = can be used to speed up training // desc = can be used to deal with over-fitting double feature_fraction = 1.0; @@ -323,7 +323,7 @@ struct Config { // alias = sub_feature_bynode, colsample_bynode // check = >0.0 // check = <=1.0 - // desc = LightGBM will randomly select part of features on each tree node if ``feature_fraction_bynode`` smaller than ``1.0``. For example, if you set it to ``0.8``, LightGBM will select 80% of features at each tree node + // desc = LightGBM will randomly select a subset of features on each tree node if ``feature_fraction_bynode`` is smaller than ``1.0``. For example, if you set it to ``0.8``, LightGBM will select 80% of features at each tree node // desc = can be used to deal with over-fitting // desc = **Note**: unlike ``feature_fraction``, this cannot speed up training // desc = **Note**: if both ``feature_fraction`` and ``feature_fraction_bynode`` are smaller than ``1.0``, the final fraction of each node is ``feature_fraction * feature_fraction_bynode`` @@ -334,6 +334,7 @@ struct Config { // desc = use extremely randomized trees // desc = if set to ``true``, when evaluating node splits LightGBM will check only one randomly-chosen threshold for each feature + // desc = can be used to speed up training // desc = can be used to deal with over-fitting bool extra_trees = false; @@ -343,9 +344,10 @@ struct Config { // alias = early_stopping_rounds, early_stopping, n_iter_no_change // desc = will stop training if one metric of one validation data doesn't improve in last ``early_stopping_round`` rounds // desc = ``<= 0`` means disable + // desc = can be used to speed up training int early_stopping_round = 0; - // desc = set this to ``true``, if you want to use only the first metric for early stopping + // desc = LightGBM allows you to provide multiple evaluation metrics. Set this to ``true``, if you want to use only the first metric for early stopping bool first_metric_only = false; // alias = max_tree_output, max_leaf_output @@ -367,6 +369,7 @@ struct Config { // alias = min_split_gain // check = >=0.0 // desc = the minimal gain to perform split + // desc = can be used to speed up training double min_gain_to_split = 0.0; // alias = rate_drop @@ -417,7 +420,8 @@ struct Config { // check = >0 // desc = used for the categorical features - // desc = limit the max threshold points in categorical features + // desc = limit number of split points considered for categorical features. 
See `the documentation on how LightGBM finds optimal splits for categorical features <./Features.rst#optimal-split-for-categorical-features>`_ for more details + // desc = can be used to speed up training int max_cat_threshold = 32; // check = >=0.0 @@ -606,7 +610,7 @@ struct Config { // desc = set this to ``false`` to use ``na`` for representing missing values bool zero_as_missing = false; - // desc = set this to ``true`` to pre-filter the unsplittable features by ``min_data_in_leaf`` + // desc = set this to ``true`` (the default) to tell LightGBM to ignore the features that are unsplittable based on ``min_data_in_leaf`` // desc = as dataset object is initialized only once and cannot be changed after that, you may need to set this to ``false`` when searching parameters with ``min_data_in_leaf``, otherwise features are filtered by ``min_data_in_leaf`` firstly if you don't reconstruct dataset object // desc = **Note**: setting this to ``false`` may slow down the training bool feature_pre_filter = true;
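For the ``Dataset``-level and sampling parameters discussed in this patch, the sketch below shows where they might be supplied through the ``lightgbm`` Python package. It is illustrative only: the synthetic data, the thread count, and all parameter values are placeholder assumptions, not recommendations, and ``save_binary`` is omitted because, as noted above, it applies only to the CLI.

.. code:: python

    import lightgbm as lgb
    import numpy as np

    # Placeholder data: substitute your real features and labels here.
    rng = np.random.default_rng(7)
    X = rng.random((50_000, 40))
    y = rng.integers(0, 2, size=50_000)

    # Binning happens once, when the Dataset is constructed.
    dataset_params = {
        "max_bin": 63,               # fewer bins per feature -> fewer candidate splits
        "min_data_in_bin": 10,       # require at least 10 observations per bin
        "feature_pre_filter": True,  # default: drop features unsplittable under min_data_in_leaf
    }
    train_data = lgb.Dataset(X, label=y, params=dataset_params)

    train_params = {
        "objective": "binary",
        "num_threads": 4,          # assumed number of physical CPU cores available
        "feature_fraction": 0.5,   # randomly select 50% of features at the start of each tree
        "bagging_freq": 5,         # re-sample the training data every 5 iterations...
        "bagging_fraction": 0.75,  # ...keeping 75% of the rows, without replacement
        "max_cat_threshold": 16,   # fewer category groupings; only matters if categorical features are declared
    }

    booster = lgb.train(train_params, train_data, num_boost_round=100)

Because ``max_bin``, ``min_data_in_bin``, and ``feature_pre_filter`` take effect during binning, they are passed when the ``Dataset`` is constructed rather than at training time.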