Tutorial: Build Your First ML Model


Pipeline in Shifu

[Image: Shifu pipeline diagram]

This picture shows Shifu's whole model-building pipeline. With just some JSON configuration, the entire pipeline can be executed, and you will get your first ML model and its evaluation results.

How to Install Shifu

  • Get the latest Shifu build from here.

    Two builds are provided: shifu--cdh-20.tar.gz, which is for Hadoop version 1, and shifu--hdp-yarn.tar.gz, which is for Hadoop version 2 (the YARN platform) and is well tested from Hadoop 2.2.x to Hadoop 2.7.x.

  • Or build a new package from the [source code](https://github.com/ShifuML/shifu) (run 'mvn clean install' after downloading the source)

  • Unzip the Shifu package, then configure the environment variables

    export SHIFU_HOME=<folder where you unzipped the Shifu package>

    export PATH=${SHIFU_HOME}/bin:$PATH

  • Validate Your Installation

    shifu version

    The Shifu version and build information will be displayed in the console
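
    Putting the steps together, a minimal install session might look like this (the archive name and version below are illustrative; substitute your actual download):

        tar -zxvf shifu-0.12.0-hdp-yarn.tar.gz          # unpack the build you downloaded
        export SHIFU_HOME=$(pwd)/shifu-0.12.0-hdp-yarn  # point SHIFU_HOME at the unpacked folder
        export PATH=${SHIFU_HOME}/bin:$PATH             # put the shifu script on PATH
        shifu version                                   # verify the installation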

How to Run Shifu Pipeline

Shifu parses your Hadoop platform settings and sets all Hadoop configuration for the Shifu runtime. All of this logic lives in the bash script ${SHIFU_HOME}/bin/shifu

  • shifu new <ModelName>

    This command creates a new ModelName folder for training. In the new folder you will find some auto-created files:

    1. ModelConfig.json: input and model pipeline configurations, discussed in more detail later.

        "basic" : {
           "name" : "turtorial",
           "author" : "shifu",
           "description" : "Created at 2016-11-28 20:08:31",
           "version" : "0.2.0",
           "runMode" : "DIST",
           "postTrainOn" : false,
           "customPaths" : { }
        },
        "dataSet" : {
           "source" : "HDFS",
           "dataPath" : "hdfs:/user/shifu/DataSet1",
           "dataDelimiter" : "|",
           "headerPath" : "hdfs:/user/shifu/DataSet1/.pig_header",
           "headerDelimiter" : "|",
           "filterExpressions" : "",
           "weightColumnName" : "",
           "targetColumnName" : "diagnosis",
           "posTags" : [ "M" ],
           "negTags" : [ "B" ],
           "missingOrInvalidValues" : [ "", "*", "#", "?", "null", "~" ],
           "metaColumnNameFile" : "columns/meta.column.names",
           "categoricalColumnNameFile" : "columns/categorical.column.names"
         },
         ...
      • basic::name: the name of your model; it is the same as ModelName
      • basic::runMode: can be 'local' or 'mapred'/'dist'; the default 'local' runs jobs on the local machine, while 'mapred'/'dist' runs jobs on the Hadoop platform
      • dataSet::source: has two types, 'local' or 'hdfs', meaning the data resides in the local or the Hadoop file system.
      • dataSet::dataPath: the data path for model training. With the 'hdfs' source, dataPath should be files or folders in HDFS; HDFS glob expressions are supported here, for example hdfs:/user/shifu/{2016/01,2016/02}/trainingdata. You can take our example data in ${SHIFU_HOME}/example/cancer-judgement/ and push it into your HDFS for testing (a push-command sketch is shown below).
      • dataSet::headerPath: a file containing the data header; if it is null, the first line of your dataPath is parsed as the header.
      • dataSet::dataDelimiter & dataSet::headerDelimiter: the delimiters of the data and the data header
      • dataSet::filterExpressions: user-specified expressions like ' columnA == '2' ' are supported to filter data in stats and training; more complicated ones like " population=='NSF' or population=='eCHQ' " also work. More details can be found at http://commons.apache.org/proper/commons-jexl/reference/syntax.html. A newer feature lets you verify this parameter; details can be found in: https://github.com/ShifuML/shifu/wiki/Filter-Expressions-Testing-for-Train-Dataset-or-Eval-Dataset.
      • dataSet::weightColumnName: set this if your training or stats should be weighted by a column. For example, in our risk training it is a dollar column, meaning our target is to save dollar-wise loss. If not set, each record has unit weight.
      • dataSet::targetColumnName: which column is your target column; please make sure it is configured correctly.
      • dataSet::posTags: elements in this list are treated as positive, like 1 in binary classification.
      • dataSet::negTags: elements in this list are treated as negative, like 0 in binary classification.
      • dataSet::missingOrInvalidValues: values in this list are treated as invalid.
      • dataSet::metaColumnNameFile: the meta column config file, created by default in the columns folder
      • dataSet::categoricalColumnNameFile: the categorical column config file, which lists all categorical features and is filled in during the init step
    2. columns/meta.column.names: an empty file which specifies columns, such as ID or date columns, that should not be used for building models

    3. columns/categorical.column.names: an empty file which specifies categorical columns

    4. columns/forceremove.column.names: an empty file which specifies columns that must be removed from model training

    5. columns/forceselect.column.names: an empty file which specifies columns that must be selected for model training

    6. columns/Eval1score.meta.column.names: an empty file which specifies evaluation meta columns

    For this step, users mainly need to configure the basic and dataSet sections well; all subsequent steps depend on correct data paths and run modes.
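
    For example, to push the bundled example data into HDFS for testing, something like the following works (a minimal sketch; the example subfolder layout and the HDFS target path are assumptions, so adjust them to your copy of Shifu and your cluster):

        # target matches the hdfs:/user/shifu/DataSet1 dataPath above
        hadoop fs -mkdir -p /user/shifu/DataSet1
        hadoop fs -put ${SHIFU_HOME}/example/cancer-judgement/DataStore/DataSet1/* /user/shifu/DataSet1/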

  • cd <ModelName>; shifu init

    All steps from init onward should be run inside the ModelName folder; this design makes sure users can build different models in different folders in parallel.

    The init step creates another important file, ColumnConfig.json, based on ModelConfig.json. ColumnConfig.json is a JSON file that holds all column statistics; most of this information is filled in later by the 'stats' step.

    So far, categorical columns must be specified by users in columns/categorical.column.names. This is very important for doing the right column stats and transforms, so please make sure you configure the categorical columns correctly here. Any variable not listed in columns/categorical.column.names is treated as a numerical variable by default.
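
    For instance, if your data had columns like gender or product_type, columns/categorical.column.names would simply list them one per line (these names are hypothetical; the bundled cancer-judgement data set is all numerical, so its file stays empty):

        gender
        state
        product_type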

  • shifu stats

    The stats step collects column statistics such as mean, standard deviation, KS, and other info using MapReduce/Pig/Spark jobs.

          "stats" : {
             "maxNumBin" : 10,
             "binningMethod" : "EqualPositive",
             "sampleRate" : 0.8,
             "sampleNegOnly" : false,
             "binningAlgorithm" : "SPDTI"
          },
    • stats::maxNumBin: how many bins (buckets) are computed for each numerical column. More bins give better results but cost more computation; 10-50 is a good range. For categorical features, each distinct category typically forms its own bin.
    • stats::binningMethod: the binning method. 'EqualPositive' puts the same number of positive records in each bin; others include 'EqualNegative', 'EqualTotal' and 'EqualInterval'.
    • stats::sampleRate: you can usually sample the data for stats to accelerate this step.
    • stats::sampleNegOnly: whether to sample only negative records; this is useful when negative records greatly outnumber positive ones.
    • stats::binningAlgorithm: by default 'SPDTI', which computes histogram-based statistics.

    After stats finishes, ColumnConfig.json in the ModelName folder is updated with mean, KS, binning, and other stats info used in the next steps.
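
    A heavily trimmed, illustrative sketch of one column entry after stats (field names and values here are approximate and may differ across Shifu versions; check your generated file):

        {
           "columnNum" : 3,
           "columnName" : "radius_mean",
           "columnType" : "N",
           "columnStats" : {
              "mean" : 14.12,
              "stdDev" : 3.52,
              "ks" : 0.52,
              "iv" : 1.8
           },
           "finalSelect" : false
        }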

  • shifu norm

    For logistic regression or neural network models, the training input data should be normalized, for example with z-score, max-min, or WOE normalization. All of these normalization methods are supported in this step.

    For tree ensemble models like Random Forest or Gradient Boosted Trees, the norm step is not needed after Shifu 0.10.x, while in Shifu 0.9.x it is still required; for tree models, norm just generates clean data for further training. Starting from Shifu 0.10.0, running norm generates both the real normalization outputs and the cleaned data outputs used as tree model input.

       "normalize" : {
          "stdDevCutOff" : 4.0,
          "sampleRate" : 1.0,
          "sampleNegOnly" : false,
          "normType" : "ZSCALE"
       },
    • normalize::stdDevCutOff: the standard deviation cutoff for z-score; if the absolute value after z-scoring is still larger than this value, it is capped at this value.
    • normalize::sampleRate: sampling rate applied to the data for the next (training) step.
    • normalize::sampleNegOnly: whether to sample only negative records; this is useful when negative records greatly outnumber positive ones.
    • normalize::normType: can be 'zscale'/'zscore', 'maxmin', 'woe', 'woe_zscale'; case insensitive.

    The 'woe' norm type is important: it leverages the binning information to transform numerical values into discrete values, which often improves model performance significantly.
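
    The standard weight-of-evidence definition conveys the idea (this is the general WOE formula, not Shifu-specific notation): for each bin i,

        WOE_i = ln( (pos_i / pos_total) / (neg_i / neg_total) )

    where pos_i and neg_i count the positive and negative records falling into bin i. Each raw value in bin i is then replaced by WOE_i, so the transformed feature directly encodes how predictive that bin is.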

  • New shifu varsel (Shifu 0.10.0 and later)

    After stats and norm, the varsel step performs feature selection based on statistical information such as KS or IV values.

    "varSelect" : {
      "forceEnable" : true,
      "forceSelectColumnNameFile" : "columns/forceselect.column.names",
      "forceRemoveColumnNameFile" : "columns/forceremove.column.names",
      "filterEnable" : true,
      "filterNum" : 100,
      "filterOutRatio" : 0.05,
      "filterBy" : "FI",
      "missingRateThreshold" : 0.98,
      "params" : null
    }
    • varSelect::forceEnable: whether to enable force-remove and force-select of features
    • varSelect::filterEnable: whether to enable filtering in variable selection
    • varSelect::filterNum: the number of variables to select for model training. filterNum has higher priority than filterOutRatio; in other words, once filterNum is set, filterOutRatio is ignored.
    • varSelect::filterOutRatio: the ratio of variables to filter out when running shifu varsel
    • varSelect::filterBy: the variable selection type, like 'KS', 'IV', 'SE', 'ST', 'FI'

Feature selection by 'KS' or 'IV' is coarse-grained, based only on feature quality. 'SE', 'ST' and 'FI' are feature selection methods based on model training. For more detailed information, please check [Variable Selection in Shifu](https://github.com/ShifuML/shifu/wiki/Variable-Selection-in-Shifu)

  • Old shifu varsel (before Shifu 0.10.0)

    After stats and norm, the varsel step performs feature selection based on statistical information such as KS or IV values.

       "varSelect" : {
          "forceEnable" : true,
          "forceSelectColumnNameFile" : "columns/forceselect.column.names",
          "forceRemoveColumnNameFile" : "columns/forceremove.column.names",
          "filterEnable" : true,
          "filterNum" : 200,
          "filterBy" : "KS",
          "wrapperEnabled" : false,
          "wrapperNum" : 50,
          "wrapperRatio" : 0.05,
          "wrapperBy" : "S",
          "missingRateThreshold" : 0.98,
          "filterBySE" : true,
       },
    • varSelect::forceEnable: whether to enable force-remove and force-select of features
    • varSelect::filterEnable: whether to enable filtering in variable selection
    • varSelect::filterNum: how many features are selected, including force-selected features if forceEnable is true
    • varSelect::filterBy: sort features by KS or IV
    • varSelect::wrapperEnabled: whether wrapper-based feature selection is enabled; wrapper-related feature selection, especially sensitivity analysis, is discussed later.
    • varSelect::missingRateThreshold: if a feature's missing rate is over this threshold, the feature is dropped.

    After this step finishes, check for 'finalSelect' : true entries in ColumnConfig.json to see which features were selected for further training.
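
    For a quick look from the shell, something like this can list the selected features (a rough sketch; it assumes each entry's "columnName" field appears within a few lines before its "finalSelect" flag, which may vary by version):

        grep -B 20 '"finalSelect" : true' ColumnConfig.json | grep '"columnName"'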

  • shifu train

    One of Shifu's strengths is its powerful training support:

    • Distributed Logistic Regression / Neural Network / Tree Ensemble training is supported when runMode is 'dist'
    • Bagging and validation are natively supported with just a configuration change
    • All distributed training is fault tolerant and well tested on a busy shared Hadoop cluster; the straggler problem is handled so that training runs smoothly on the cluster
    • Bagging can use different parameters and different bagged data, enabled simply by setting baggingSampleRate
       "train" : {
          "baggingNum" : 5,
          "baggingWithReplacement" : true,
          "baggingSampleRate" : 1.0,
          "validSetRate" : 0.2,
          "numTrainEpochs" : 200,
          "isContinuous" : false,
          "workerThreadCount" : 4,
          "algorithm" : "NN",
          "params" : {
             "NumHiddenLayers" : 1,
             "ActivationFunc" : [ "tanh" ],
             "NumHiddenNodes" : [ 50 ],
             "RegularizedConstant" : 0.0,
             "LearningRate" : 0.1,
             "Propagation" : "R"
            }
       },
    • train::baggingNum: how many models will be trained. In DIST mode this means how many training jobs, each training one model.
    • train::baggingWithReplacement: whether bagging uses replacement sampling, as in Random Forest.
    • train::baggingSampleRate: how much of the data is used for training and validation; by default 1.
    • train::validSetRate: the fraction of data used for validation; the rest is used for training
    • train::numTrainEpochs: how many iterations are used to train NN/LR models
    • train::isContinuous: if set to true and existing models are found in the models folder, training continues from the existing NN/LR/GBT models. This feature is not supported for Random Forest.
    • train::workerThreadCount: data are distributed across Hadoop tasks; within each task, this many threads train the model in parallel, which accelerates training. By default it is 4; in a shared cluster, 4-8 is a good range, since setting it higher may cause CPU issues on a shared cluster without proper CPU isolation.
    • train::algorithm: 'NN', 'LR', 'GBT' and 'RF' are supported so far in Shifu. Each algorithm requires its own train::params settings (an illustrative GBT example is sketched after this list).
    • train::params::NumHiddenLayers: how many hidden layers in the neural network
    • train::params::ActivationFunc: the activation function in each hidden layer.
    • train::params::NumHiddenNodes: the number of hidden nodes in each layer.
    • train::params::LearningRate: the learning rate for neural network training
    • train::params::Propagation: 'R', 'Q' and 'B' are supported: 'B' is BackPropagation, 'Q' is QuickPropagation, 'R' is ResilientPropagation. By default it is 'Q'.
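
    For example, a GBT params block looks roughly like the following (keys and values here are illustrative, drawn from common Shifu GBT settings; see the Meta Configuration File for the authoritative list):

       "train" : {
          ...
          "algorithm" : "GBT",
          "params" : {
             "TreeNum" : 100,
             "MaxDepth" : 7,
             "LearningRate" : 0.05,
             "Impurity" : "variance",
             "Loss" : "squared"
          }
       },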

    After training finishes, the trained models are in the local models/ folder; they can be used in production or in the evaluation step.
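
    For instance, with the NN settings above and baggingNum of 5, the models folder typically contains one file per bagged model (the .nn naming reflects NN training output; verify the exact names on your run):

        ls models/
        # model0.nn  model1.nn  model2.nn  model3.nn  model4.nn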

  • shifu eval

    The evaluation step evaluates the models you just trained. If multiple models are found in the models folder, all of them are evaluated, and the 'mean' model score is used for the final performance report.

     "evals" : [ {
         "name" : "Eval1",
         "dataSet" : {
           "source" : "HDFS",
           "dataPath" : "hdfs:/user/shifu/EvalSet1",
           "validationDataPath" : null,
           "dataDelimiter" : "|",
           "headerPath" : "hdfs:/user/shifu/EvalSet1/.pig_header",
           "headerDelimiter" : "|",
           "filterExpressions" : "",
           "weightColumnName" : "",
           "targetColumnName" : "diagnosis",
           "posTags" : [ "M" ],
           "negTags" : [ "B" ],
           "missingOrInvalidValues" : [ "", "*", "#", "?", "null", "~" ],
           "metaColumnNameFile" : "columns/meta.column.names",
           "categoricalColumnNameFile" : "columns/categorical.column.names"
        },
      "performanceBucketNum" : 10,
      "performanceScoreSelector" : "mean",
      "scoreMetaColumnNameFile" : "columns/Eval1score.meta.column.names",
    }
    • Evaluation supports multiple evaluation data sets.
    • evals::dataSet: usually the same as the top-level dataSet section, but the eval data path and schema can differ from those of the training data set.
    • evals::performanceBucketNum: the number of buckets (checkpoints) in the final report.
    • evals::performanceScoreSelector: by default the mean score over all bagging models.
    • evals::scoreMetaColumnNameFile: a file that specifies the champion model's score field name in the eval data set; that model's performance is then plotted together in the eval performance chart for comparison.

    Evaluation results such as AUC and the Gain Chart and Precision-Recall chart are printed to the console, and an HTML report is generated in the local evaluation folder, so you get your models and model performance in JSON and HTML formats. Multiple evaluations are supported by specifying multiple eval data sets with different data folders or schemas; such eval data sets run in parallel to speed up evaluation.
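
    To add and run an extra evaluation set, recent Shifu versions offer eval sub-options like the following (a sketch; verify the exact options with your Shifu version):

        shifu eval -new Eval2    # add an Eval2 entry to ModelConfig.json, then edit its dataSet
        shifu eval -run Eval2    # run evaluation for the Eval2 data set only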

  • Shifu Configurations in ModelConfig.json

Where Can I Find All Configurations Supported in Shifu?

  1. For user-specified properties in ModelConfig.json, please check the Meta Configuration File

  2. System-related properties are in ${SHIFU_HOME}/conf/shifuConfig; you can tune/optimize them, but they won't impact the final model result (they only affect distributed performance). Here is the link.
