Native Bagging Modeling Framework
Bagging is a simple model ensemble technique for improving model performance. In Shifu, bagging is supported natively in all algorithms.
"train" : {
    "baggingNum" : 5,
    "baggingWithReplacement" : false,
    "baggingSampleRate" : 1,
    ...
}
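The sampling semantics behind these settings can be sketched as follows. This is an illustrative Python helper, not Shifu's actual implementation (Shifu is written in Java); `bagging_sample` is a hypothetical name:

```python
import random

def bagging_sample(rows, sample_rate=1.0, with_replacement=False, seed=0):
    """Draw one bag of training rows, mirroring the intent of
    baggingSampleRate / baggingWithReplacement (illustrative sketch)."""
    rng = random.Random(seed)
    n = int(len(rows) * sample_rate)
    if with_replacement:
        # Bootstrap sampling: a row may appear multiple times in one bag.
        return [rng.choice(rows) for _ in range(n)]
    # Without replacement: a random subset of the requested size.
    return rng.sample(rows, n)

rows = list(range(100))
# baggingNum = 5 -> five bags, each feeding one parallel training job.
bags = [bagging_sample(rows, sample_rate=1.0, with_replacement=True, seed=i)
        for i in range(5)]
```

Each bag differs from the others, which is what makes the averaged ensemble more robust than any single model.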
If baggingNum is set to a value greater than 1, Shifu trains that many bagging jobs in parallel. With sampling enabled, each model is trained on different data. In the evaluation step, all models score the test data, and the final averaged score is used to measure test performance.
The trained models can be found under /models/, e.g. model0.nn-model5.nn. Such model files can be deployed in production, and ModelRunner supports loading multiple models and averaging their outputs internally to produce a final model score.
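The averaging step can be sketched as below; `ensemble_score` is a hypothetical stand-in for the internal averaging that ModelRunner performs in Java, and the scores are made-up numbers:

```python
def ensemble_score(model_scores):
    """Average per-model scores into one final ensemble score."""
    return sum(model_scores) / len(model_scores)

# Example scores from five bagged NN models for one record (made-up):
scores = [0.71, 0.68, 0.74, 0.70, 0.69]
final = ensemble_score(scores)  # ~0.704
```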
By using the export command, models such as LR and NN are exported to the standard PMML format, which can easily be deployed in production.
In Shifu GBT, if baggingNum is set to 5, five GBT models will be trained and their results averaged for better performance. This is a very useful feature for improving the stability of GBT models: in practice, averaging 5 GBT models shows a 3-5 percent improvement over a single GBT model.
For Random Forest, treeNum sets the number of trees trained per job, and should be chosen according to each job's capacity. Setting baggingNum to a higher value is an easy way to grow more trees in total, and this has proved fast in practice.
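A quick back-of-the-envelope for the overall forest size under this setup, assuming (as the text above suggests) that each of the baggingNum parallel jobs grows treeNum trees; `total_trees` is a hypothetical helper name:

```python
def total_trees(bagging_num, tree_num):
    """Total trees across the whole ensemble when each of baggingNum
    parallel jobs grows treeNum trees (assumption from the text above)."""
    return bagging_num * tree_num

# e.g. 5 parallel jobs with 100 trees each -> 500 trees overall
total = total_trees(5, 100)
```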
The bagging described above is all based on data sampling, but Shifu also supports bagging over different algorithm parameters via grid search. In grid search, users can specify multiple parameter combinations, one model is trained per combination, and the evaluation step then runs over all of the resulting models.
"params" : {
    "NumHiddenLayers" : 1,
    "ActivationFunc" : [ "tanh" ],
    "NumHiddenNodes" : [ [30], [45], [60] ],
    "LearningRate" : 0.1,
    "FeatureSubsetStrategy" : 1,
    "DropoutRate" : 0.1,
    "Propagation" : "Q"
},
Three models, with 30, 45, and 60 hidden nodes respectively, will be trained; you can then evaluate all three models without any extra configuration.
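The expansion from list-valued parameters to trained models can be sketched as a Cartesian product. This is an illustrative reading of the config above (here only NumHiddenNodes varies, so three combinations result), not Shifu's actual grid-search code:

```python
from itertools import product

# List-valued hyperparameters that grid search expands; scalar params
# (LearningRate, DropoutRate, ...) stay fixed across all models.
grid = {
    "ActivationFunc": [["tanh"]],           # one choice
    "NumHiddenNodes": [[30], [45], [60]],   # three choices
}
keys = list(grid)
combos = [dict(zip(keys, values)) for values in product(*grid.values())]
# Three parameter combinations -> three models to train and evaluate.
```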