Added SKLearn-like random forest Python API. (dmlc#4148)
* Added SKLearn-like random forest Python API.
  - added XGBRFClassifier and XGBRFRegressor classes to SKL-like xgboost API
  - also added n_gpus and gpu_id parameters to SKL classes
  - added documentation describing how to use xgboost for random forests, as well as existing caveats
1 parent 6fb4c5e · commit a36c3ed · 4 changed files with 240 additions and 55 deletions.
#########################
Random Forests in XGBoost
#########################

XGBoost is normally used to train gradient-boosted decision trees and other
gradient-boosted models. Random forests use the same model representation and inference
as gradient-boosted decision trees, but a different training algorithm. There are
XGBoost parameters that enable training a forest in a random forest fashion.

****************
With XGBoost API
****************

The following parameters must be set to enable random forest training.

* ``booster`` should be set to ``gbtree``, as we are training forests. Note that as
  this is the default, this parameter needn't be set explicitly.
* ``subsample`` must be set to a value less than 1 to enable random selection of
  training cases (rows).
* One of the ``colsample_by*`` parameters must be set to a value less than 1 to enable
  random selection of columns. Normally, ``colsample_bynode`` would be set to a value
  less than 1 to randomly sample columns at each tree split.
* ``num_parallel_tree`` should be set to the size of the forest being trained.
* ``num_boost_round`` should be set to 1. Note that this is a keyword argument to
  ``train()``, and is not part of the parameter dictionary.
* ``eta`` (alias: ``learning_rate``) must be set to 1 when training random forest
  regression.
* ``random_state`` can be used to seed the random number generator.

Other parameters should be set in a similar way to how they are set for gradient
boosting. For instance, ``objective`` will typically be ``reg:linear`` for regression
and ``binary:logistic`` for classification, ``lambda`` should be set according to the
desired regularization weight, etc.

If both ``num_parallel_tree`` and ``num_boost_round`` are greater than 1, training will
use a combination of the random forest and gradient boosting strategies: it will perform
``num_boost_round`` rounds, boosting a random forest of ``num_parallel_tree`` trees at
each round. If early stopping is not enabled, the final model will consist of
``num_parallel_tree`` * ``num_boost_round`` trees.

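As a sketch of this combined strategy (the parameter values here are illustrative, not
recommendations)::

  # 10 boosting rounds, each adding a forest of 5 trees: with early
  # stopping disabled, the final model contains 5 * 10 = 50 trees.
  params = {'num_parallel_tree': 5, 'subsample': 0.8,
            'colsample_bynode': 0.8, 'objective': 'binary:logistic'}
  bst = train(params, dmatrix, num_boost_round=10)
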
Here is a sample parameter dictionary for training a random forest on a GPU using
xgboost::

  params = {
    'colsample_bynode': 0.8,
    'learning_rate': 1,
    'max_depth': 5,
    'num_parallel_tree': 100,
    'objective': 'binary:logistic',
    'subsample': 0.8,
    'tree_method': 'gpu_hist'
  }

A random forest model can then be trained as follows::

  bst = train(params, dmatrix, num_boost_round=1)

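Putting the pieces together, here is a self-contained sketch. The synthetic data from
scikit-learn's ``make_classification`` is an assumption for illustration, and ``hist``
is substituted for ``gpu_hist`` so the example also runs without a GPU::

  import xgboost as xgb
  from sklearn.datasets import make_classification

  # Synthetic binary classification data, for illustration only.
  X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
  dmatrix = xgb.DMatrix(X, label=y)

  params = {
    'colsample_bynode': 0.8,
    'learning_rate': 1,
    'max_depth': 5,
    'num_parallel_tree': 100,
    'objective': 'binary:logistic',
    'subsample': 0.8,
    'tree_method': 'hist'  # use 'gpu_hist' to train on a GPU
  }

  # A single boosting round trains the entire forest at once.
  bst = xgb.train(params, dmatrix, num_boost_round=1)
  preds = bst.predict(dmatrix)  # probabilities under binary:logistic
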
**************************
With Scikit-Learn-Like API
**************************

``XGBRFClassifier`` and ``XGBRFRegressor`` are SKL-like classes that provide random
forest functionality. They are basically versions of ``XGBClassifier`` and
``XGBRegressor`` that train a random forest instead of a gradient-boosted model, and
have the default values and meanings of some parameters adjusted accordingly. In
particular:

* ``n_estimators`` specifies the size of the forest to be trained; it is mapped to
  ``num_parallel_tree``, rather than to the number of boosting rounds
* ``learning_rate`` is set to 1 by default
* ``colsample_bynode`` and ``subsample`` are set to 0.8 by default
* ``booster`` is always ``gbtree``

Note that these classes have a smaller selection of parameters compared to using
``train()``. In particular, it is impossible to combine random forests with gradient
boosting using this API.

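A minimal usage sketch of these classes (the dataset and hyperparameter values are
illustrative; ``fit`` and ``predict`` follow the usual scikit-learn conventions)::

  import xgboost as xgb
  from sklearn.datasets import make_classification
  from sklearn.model_selection import train_test_split

  # Synthetic data, for illustration only.
  X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
  X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

  # n_estimators is the forest size (mapped to num_parallel_tree),
  # not the number of boosting rounds.
  clf = xgb.XGBRFClassifier(n_estimators=100, max_depth=5, random_state=42)
  clf.fit(X_train, y_train)
  preds = clf.predict(X_test)
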
*******
Caveats
*******

* XGBoost uses a second-order approximation to the objective function. This can lead to
  results that differ from a random forest implementation that uses the exact value of
  the objective function.
* XGBoost does not perform replacement when subsampling training cases. Each training
  case can occur in a subsampled set either 0 or 1 time.