Sampling & Filtering in Shifu Training Step
Filtering in Shifu is very flexible. If you have one input data set but only want to train on part of the data, you can specify filterExpressions:
"dataSet" : {
"dataPath" : "...",
...
"filterExpressions" : "type == 'DEV'",
...
}
The same kind of filter can also be set in the eval part. Filter expressions follow the Apache Commons JEXL syntax: http://commons.apache.org/proper/commons-jexl/reference/syntax.html.
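To make the filter semantics concrete, here is a minimal, hypothetical sketch that evaluates such an expression with the Apache Commons JEXL library directly. This is not Shifu's internal code, and representing a record as a name-to-value context is an assumption for illustration only:

import org.apache.commons.jexl3.JexlBuilder;
import org.apache.commons.jexl3.JexlContext;
import org.apache.commons.jexl3.JexlEngine;
import org.apache.commons.jexl3.JexlExpression;
import org.apache.commons.jexl3.MapContext;

public class FilterExpressionDemo {
    public static void main(String[] args) {
        JexlEngine jexl = new JexlBuilder().create();
        JexlExpression filter = jexl.createExpression("type == 'DEV'");

        // One record, represented as column name -> value.
        JexlContext record = new MapContext();
        record.set("type", "DEV");

        // A record is kept only when the expression evaluates to true.
        boolean keep = Boolean.TRUE.equals(filter.evaluate(record));
        System.out.println(keep); // prints: true
    }
}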
Bagging, cross validation, and grid search are all well supported in Shifu, and most of these features rely on sampling to split the training/validation data set. This page explains how Shifu performs sampling for different algorithms and in different scenarios.
Cross validation is enabled by this parameter:
"train" : {
...
"numKFold" : 5,
...
}
If 'numKFold' is set to a value other than -1, cross validation is enabled in training. With 5 set here, you will see 5 cross validation training jobs. Cross validation ignores the other bagging parameters no matter how they are set. The next question is how data is selected for each fold.
Models produced by cross validation can also be used in evaluation; in cross validation mode, all k-fold models are used.
In each worker, a hash code is computed for every record from all selected features. The hash code formula is:
( hashcode(a) * 31 + hashcode(b) ) * 31 + hashcode(c)
'a', 'b', and 'c' are the features of that record.
Because the hash code is fixed per record, the data can be split into a deterministic validation/training partition even though records are distributed across workers. Each cross validation job has a job index starting from 0, 1, 2, ...; a record goes to the validation set if the formula below holds, otherwise it is training data:
hashcode(record) % k == <job index>
By using the hash code, data is split cleanly into training/validation even when it is not shuffled well.
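Below is a minimal sketch of this splitting rule, assuming a record is represented as a list of feature values. It is not Shifu's implementation; in particular, using Math.floorMod to keep the bucket non-negative for negative hash codes is an assumption:

import java.util.List;

public class FoldAssignment {
    // Combine feature hash codes, mirroring the formula above:
    // ( hashcode(a) * 31 + hashcode(b) ) * 31 + hashcode(c)
    static int recordHash(List<Object> features) {
        int h = 0;
        for (Object f : features) {
            h = h * 31 + (f == null ? 0 : f.hashCode());
        }
        return h;
    }

    // A record goes to validation in the job whose index matches its bucket.
    static boolean isValidation(List<Object> features, int k, int jobIndex) {
        // floorMod keeps the bucket in [0, k) even for negative hash codes.
        return Math.floorMod(recordHash(features), k) == jobIndex;
    }
}

With k = 5, job indexes 0..4 each get a disjoint slice of roughly 20% of the records as validation data, and every record lands in exactly one fold.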
"train" : {
"baggingNum" : 5,
"baggingWithReplacement" : true,
"baggingSampleRate" : 0.9,
"stratifiedSample": false,
"validSetRate" : 0.2,
"algorithm" : "NN",
...
}
Bagging is well supported in Shifu. Here is how Shifu does bagging with sampling:
- 'validSetRate' splits the data into training/validation sets. For both 'NN' and 'GBT', each record in each worker is sampled into the validation set or not. Here we get 20% validation data and the rest is training data. If 'dataSet::validationDataPath' is set in the dataSet part, 'validSetRate' is ignored and all data in 'dataPath' is used for training.
- 'stratifiedSample', if set to true, tries to keep the ratio of positive records in training/validation the same as in the whole data set.
- 'baggingSampleRate' takes effect after the training/validation split: it samples within the training data. Here the training split is 80% of the records, and sampling 90% of it means 0.8 * 0.9 = 72% of all records are used for training (see the sketch after this list).
- 'baggingWithReplacement' is a flag for sampling with or without replacement. Be careful: this flag is only honored in 'NN' and 'LR' bagging training. 'RF' (random forest) ignores it, since in RF all data is sampled with replacement. 'GBT' also ignores it; sampling in 'GBT' is always without replacement, no matter whether the flag is set to true or false.
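The per-record flow of 'validSetRate' followed by 'baggingSampleRate' can be sketched as below. This is a simplified, hypothetical illustration (independent random draws, no stratification, sampling without replacement), not Shifu's actual sampler:

import java.util.Random;

public class BaggingSampler {
    static final double VALID_SET_RATE = 0.2;       // 'validSetRate'
    static final double BAGGING_SAMPLE_RATE = 0.9;  // 'baggingSampleRate'

    enum Decision { VALIDATION, TRAINING, DROPPED }

    // First split off validation, then sample within the training split.
    static Decision sample(Random rnd) {
        if (rnd.nextDouble() < VALID_SET_RATE) {
            return Decision.VALIDATION;   // 20% of all records
        }
        if (rnd.nextDouble() < BAGGING_SAMPLE_RATE) {
            return Decision.TRAINING;     // 0.8 * 0.9 = 72% of all records
        }
        return Decision.DROPPED;          // the remaining 8%
    }
}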
If you set 'FeatureSubsetStrategy' in the 'params' of NN training, column sampling is enabled in neural network training. You can experiment with this parameter by adding more bagging jobs.
In random forest, sampling with replacement is always in effect during each tree's growth. Besides data sampling, there is also a parameter for column sampling:
"train" : {
...
"params" : {
"FeatureSubsetStrategy" : "ONETHIRD", // or number in (0, 1)
}
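One plausible reading of column sampling is sketched below: per tree (or per bagging job), keep a random subset of columns at the configured ratio, where 'ONETHIRD' maps to 1.0 / 3. The mapping of strategy names to ratios here is an assumption for illustration, not Shifu's exact implementation:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class ColumnSampler {
    // Keep a random subset of column indexes at the given ratio.
    static List<Integer> sampleColumns(int numColumns, double ratio, Random rnd) {
        List<Integer> columns = new ArrayList<>();
        for (int i = 0; i < numColumns; i++) {
            columns.add(i);
        }
        Collections.shuffle(columns, rnd);
        int take = Math.max(1, (int) Math.round(numColumns * ratio));
        return columns.subList(0, take);
    }
}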
In GBT, sampling is always without replacement during each tree's growth. Besides data sampling, the same column sampling parameter as in RF is also supported.
For GBT, training data is re-sampled for each new tree. For example, if a tree is built on 72% of the data (0.8 * 0.9), then for the next tree the 80% training split is sampled again at 90% to get a fresh 72% sample of training data. This logic applies to every tree built.
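A hypothetical sketch of this per-tree re-sampling, assuming an independent draw per record so that each record is kept at most once per tree (i.e. without replacement):

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class GbtResampling {
    // For every new tree, re-sample the training split at the bagging
    // sample rate, so each tree sees a fresh ~72% sample of all records.
    static List<List<Integer>> perTreeSamples(List<Integer> trainingRecordIds,
                                              int numTrees,
                                              double baggingSampleRate,
                                              Random rnd) {
        List<List<Integer>> samples = new ArrayList<>();
        for (int t = 0; t < numTrees; t++) {
            List<Integer> picked = new ArrayList<>();
            for (Integer id : trainingRecordIds) {
                if (rnd.nextDouble() < baggingSampleRate) {
                    picked.add(id); // each record appears at most once per tree
                }
            }
            samples.add(picked);
        }
        return samples;
    }
}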
Normally, sampling applies to all records, regardless of class.
By setting 'sampleNegativeOnly' to true (false by default), the bagging sample rate only applies to negative records, and no positive record is sampled out.
"train" : {
"baggingNum" : 5,
...
"sampleNegativeOnly" : true,
...
}
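A minimal sketch of what 'sampleNegativeOnly' implies for a single record; the method name and signature here are made up for illustration:

import java.util.Random;

public class NegativeOnlySampler {
    // Positives always pass; negatives are kept at the bagging sample rate.
    static boolean keepForTraining(boolean isPositive, double baggingSampleRate,
                                   Random rnd) {
        if (isPositive) {
            return true; // no positive record is sampled out
        }
        return rnd.nextDouble() < baggingSampleRate;
    }
}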