Distributed Tree Ensemble Model Training in Shifu
Starting from Shifu 0.9.0, Shifu provides native, distributed support for tree ensemble models such as Random Forest (RF) and Gradient Boosted Trees (GBT, also known as GBM).
- Distributed model training on Hadoop to accelerate the training process
- Large-scale data volumes and more than 2,000 features are well supported in Shifu
- Native support for categorical features, with no feature transformation needed
- Easy deployment of compressed tree models to production
- Very good model performance, especially with GBT
- Bagging of GBT/RF models is well supported and further improves model performance.
- Data are split by record across multiple workers;
- The master chooses the nodes to split and sends them to the workers for node statistics;
- When node statistics are collected in the master, the tree is grown there and new nodes are generated, which are sent back to the workers for the next round of node statistics. If tree growth stops according to max depth or other criteria like minInfoGain, and the number of trees set in params has been built, the whole model training stops.
- In every iteration each worker receives the current tree updates, routes its records to the new nodes, and computes statistics; binning is important here because all statistics are collected on bins.
- The difference between RF and GBT is that in RF all trees are trained together, while in GBT only one tree is trained at a time and the residuals are then fit by the next tree, as the sketch below illustrates.
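To make the GBT part concrete, here is a minimal, self-contained sketch (not Shifu's implementation) of sequential residual fitting: each new tree is fit to the residuals of the current ensemble and added scaled by a learning rate. The "tree" is simplified to a one-split stump so the example stays short; treeNum and learningRate only mirror the TreeNum and LearningRate params described below.

```java
import java.util.Arrays;

// Minimal illustration (not Shifu code) of GBT residual fitting.
public class GbtResidualSketch {

    // A depth-1 "tree": one threshold on a single feature, one value per side.
    static class Stump {
        double threshold, leftValue, rightValue;
        double predict(double x) { return x <= threshold ? leftValue : rightValue; }
    }

    // Fit a stump to (x, residual) by trying each x value as a split point and
    // keeping the split with the smallest squared error (variance impurity).
    static Stump fitStump(double[] x, double[] residual) {
        Stump best = new Stump();
        double bestError = Double.MAX_VALUE;
        for (double t : x) {
            double leftSum = 0, rightSum = 0;
            int leftCnt = 0, rightCnt = 0;
            for (int i = 0; i < x.length; i++) {
                if (x[i] <= t) { leftSum += residual[i]; leftCnt++; }
                else { rightSum += residual[i]; rightCnt++; }
            }
            double left = leftCnt == 0 ? 0 : leftSum / leftCnt;
            double right = rightCnt == 0 ? 0 : rightSum / rightCnt;
            double error = 0;
            for (int i = 0; i < x.length; i++) {
                double p = x[i] <= t ? left : right;
                error += (residual[i] - p) * (residual[i] - p);
            }
            if (error < bestError) {
                bestError = error;
                best.threshold = t;
                best.leftValue = left;
                best.rightValue = right;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5, 6};
        double[] y = {1.2, 1.9, 3.1, 3.9, 5.2, 5.8};
        int treeNum = 50;           // plays the role of "TreeNum"
        double learningRate = 0.05; // plays the role of "LearningRate"

        double[] prediction = new double[x.length];
        for (int t = 0; t < treeNum; t++) {
            // Residuals of the current ensemble become the targets of the next tree.
            double[] residual = new double[x.length];
            for (int i = 0; i < x.length; i++) {
                residual[i] = y[i] - prediction[i];
            }
            Stump stump = fitStump(x, residual);
            for (int i = 0; i < x.length; i++) {
                prediction[i] += learningRate * stump.predict(x[i]);
            }
        }
        System.out.println("final predictions: " + Arrays.toString(prediction));
    }
}
```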
Parameters related to tree ensemble models are shown below:
"train" : {
"baggingNum" : 5, // number of bagging jobs
"baggingWithReplacement" : true, // if sampling with replacement, in GBT, this is disabled why in RF, this is enabled by default
"baggingSampleRate" : 1, // sample rate, by default is 1
"validSetRate" : 0.2, // 0.2 are used for training, others are for training, if 0 means all data for training
"numTrainEpochs" : 20000, // better to set a very big value to make job not stopped by max iterations, just by trees are finished.
"isContinuous" : false, // if continuous model training on current model, only GBT is supported, RF is not
"workerThreadCount" : 4, // # of threads in worker are used to do node statistics in parallel, better in 4-8
"algorithm" : "GBT", // GBT or RF
"params" : {
"TreeNum":500, // # of trees in GBT or RF
"FeatureSubsetStrategy": "ONETHIRD", // feature sampling in each node split, 'ALL', 'HALF', 'ONETHIRD', 'TWOTHIRDS' ...
"MaxDepth": 7, // max depth per each tree
"Impurity":"variance", // variance for tree impurity improvement. For GBT, only variance, for RF, entropy, gini and variance are all supported.
"LearningRate": 0.05, // learning rate per each tree in GBT
"MinInstancesPerNode": 5, // in each leaf, how many instances at most
"MinInfoGain": 0.0, // min info gain in each leaf
"Loss": "squared" // squared error loss for GBT, others like 'absolute', 'log'
},
},
Please note that 'numTrainEpochs' should be set to a large value such as 20000 so that training stops when all trees are finished rather than at the max-iteration limit.
Tree models like GBT and RF are stored in a binary format in Shifu, which saves model space; please refer to tree models for details. Running them in production is easy and an example can be found in code. Only shifu-*.jar and guagua-core-*.jar are needed as dependencies to run model prediction in production.
Here is an example of running a Shifu GBT model: Java Code and Bash Code.
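As a rough sketch of what such a production scoring call can look like (the class IndependentTreeModel, its loadFromStream factory, the compute(Map) signature, the returned score layout, and the model path and feature names below are all assumptions to be verified against the linked Java Code example):

```java
import java.io.FileInputStream;
import java.util.HashMap;
import java.util.Map;

// Assumed class from shifu-*.jar; verify against the linked Java Code example.
import ml.shifu.shifu.core.dtrain.dt.IndependentTreeModel;

public class GbtScoringSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical path to a binary GBT model produced by Shifu training.
        IndependentTreeModel model = IndependentTreeModel
                .loadFromStream(new FileInputStream("models/model0.gbt"));

        // Feature name -> value map; names must match the training ColumnConfig.
        Map<String, Object> input = new HashMap<String, Object>();
        input.put("feature_a", 1.5d); // hypothetical numeric feature
        input.put("feature_b", "B");  // hypothetical categorical feature

        // The exact layout of the returned scores is an assumption here; for a
        // single GBT model the first element is taken as the raw score.
        double[] scores = model.compute(input);
        System.out.println("raw GBT score: " + scores[0]);
    }
}
```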
PMML model format for tree ensemble models is coming soon.
Compared with our production H2O GBT models, our model with 800 trees achieves better recall: 84.3% versus 82.97%. With bagging of 5 GBT models, Shifu GBT reaches a best recall (catch rate) of 85.3%.
On 80 million records and 630 features (100 of them categorical), a GBT model with 500 trees and max depth 7 is trained in 8-10 hours with Shifu 0.9.0. With new improvements in Shifu 0.10.0 (currently in the develop branch), that run time decreases to 2.7-3.3 hours, about a 3x improvement. Up to 500 workers have been tested to run well and fast on our shared Hadoop cluster.
"params" : {
"TreeNum":500,
"FeatureSubsetStrategy": "ONETHIRD",
"MaxDepth": 7,
"Impurity":"variance",
"LearningRate": 0.04,
"MinInstancesPerNode": 5,
"MinInfoGain": 0.0,
"Loss": "squared"
},
For GBT, it is better to set more trees and a smaller learning rate; max depth per tree should not be too deep, since each tree is a weak learner.
For RF, each tree should be a more predictive model compared with GBT, so max depth should be set larger, around 12-15. A depth over 15 may cause memory issues.
"baggingNum": 20,
"params" : {
"TreeNum":40,
"FeatureSubsetStrategy": "HALF",
"MaxDepth": 14,
"Impurity":"variance",
"MinInstancesPerNode": 5,
"MinInfoGain": 0.0,
"Loss": "squared"
},
For Random Forest, please tune these two parameters in $SHIFU_HOME/bin/shifuconfig for evaluation, otherwise OOM issues may occur:
mapreduce.map.memory.mb=2400
mapreduce.map.java.opts=-Xms2G -Xmx2G -server -XX:MaxPermSize=128M -XX:PermSize=64M -XX:+UseParallelGC -XX:+UseParallelOldGC -XX:ParallelGCThreads=8 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
The raw score of GBT is usually not in [0, 1]. Since for 0-1 classification/regression the GBT raw score is better aligned to [0, 1], the strategies below are supported in the Shifu eval step (by setting evals#gbtScoreConvertStrategy in ModelConfig.json):
- RAW: raw score without any transformation
- OLD_SIGMOID: old sigmoid transform to [0, 1]
- SIGMOID: default sigmoid transform to [0, 1]
- CUTOFF: cutoff score > 1 to 1, score < 0 to 0
- HALF_CUTOFF: cutoff score > 1 to 1, score < 0 to 0
- MAXMIN: transform score to [0, 1] by (score - MIN)/(MAX - MIN)
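To make the mapping concrete, here is a minimal sketch (not Shifu's implementation) of three of the strategies above; the MAXMIN formula is the one given above, while the exact scaling Shifu applies before its sigmoid is an assumption:

```java
// Minimal sketch of three gbtScoreConvertStrategy options (not Shifu's implementation).
public class GbtScoreConvertSketch {

    // SIGMOID: squash any raw score into (0, 1); the plain 1/(1+e^-x) form is
    // used here, the exact scaling Shifu applies may differ.
    static double sigmoid(double rawScore) {
        return 1.0d / (1.0d + Math.exp(-rawScore));
    }

    // CUTOFF: clamp scores above 1 to 1 and below 0 to 0.
    static double cutoff(double rawScore) {
        return Math.max(0.0d, Math.min(1.0d, rawScore));
    }

    // MAXMIN: rescale by (score - MIN) / (MAX - MIN), with MIN and MAX taken
    // over all scores in the evaluation data set.
    static double maxMin(double rawScore, double min, double max) {
        return (rawScore - min) / (max - min);
    }

    public static void main(String[] args) {
        double raw = 1.7;
        System.out.println("SIGMOID: " + sigmoid(raw));
        System.out.println("CUTOFF : " + cutoff(raw));
        System.out.println("MAXMIN : " + maxMin(raw, -2.0d, 3.0d));
    }
}
```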
For a categorical feature, say one with categories 'A', 'B', 'C', 'D', the default way to split it is to sort the categories by positive rate in descending order, for example into 'B', 'D', 'C', 'A', and then group them one by one from 'B' to 'A'. There is a new parameter to control this:
"baggingNum": 20,
"params" : {
"CateSortMode": "SORT",
},
Another mode, 'SHUFFLE', can also be set to shuffle the categories first and then group them according to the new sequence.
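As a small illustration (not Shifu code) of the default 'SORT' behavior described above, with made-up positive rates for the four categories:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustration (not Shifu code) of 'SORT' category handling: sort categories by
// positive rate descending, then form candidate binary splits by grouping them
// one by one from the head of the sorted list.
public class CategorySortSplitSketch {
    public static void main(String[] args) {
        // Hypothetical positive rates for categories A, B, C, D.
        Map<String, Double> positiveRate = new LinkedHashMap<String, Double>();
        positiveRate.put("A", 0.02);
        positiveRate.put("B", 0.31);
        positiveRate.put("C", 0.09);
        positiveRate.put("D", 0.18);

        // Sort categories by positive rate, descending: B, D, C, A.
        List<String> sorted = new ArrayList<String>(positiveRate.keySet());
        sorted.sort(Comparator.comparingDouble((String c) -> positiveRate.get(c)).reversed());
        System.out.println("sorted categories: " + sorted);

        // Candidate splits: {B} vs rest, {B,D} vs rest, {B,D,C} vs rest.
        for (int i = 1; i < sorted.size(); i++) {
            List<String> left = sorted.subList(0, i);
            List<String> right = sorted.subList(i, sorted.size());
            System.out.println("candidate split: " + left + " | " + right);
        }
    }
}
```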