
Distributed Neural Network Training in Shifu


Neural network models in Shifu are built on the Encog framework. Encog itself is single-machine, so in Shifu the Encog code is extended to support distributed model training.


Distributed Neural Network Model Training

  • Training data are split across workers.
  • Each worker computes gradients for every connection (weight) of the neural network on its data shard and sends them back to the master.
  • The master accumulates all gradients, computes an updated model from the global gradients, and sends the new model to the workers (a simplified sketch of this step appears after this list).
  • Workers continue collecting gradients against the current model.
  • Training stops when the configured maximum number of iterations is reached, or earlier if early stopping is enabled.
  • Bagging with different NN model parameters is also supported.
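To make the iteration above concrete, below is a minimal, self-contained sketch of one synchronous master/worker step. This is not Shifu's actual Guagua-based implementation; the class and method names are illustrative, and a plain linear model with squared error stands in for the network's per-connection forward/backward pass.

  import java.util.List;

  // Illustrative sketch only -- not Shifu/Guagua code.
  public class SyncGradientSketch {

      // Worker side: gradient of the current weights over this worker's data shard.
      static double[] workerGradients(double[] weights, double[][] features, double[] labels) {
          double[] grads = new double[weights.length];
          for (int n = 0; n < features.length; n++) {
              double pred = 0.0;
              for (int i = 0; i < weights.length; i++) {
                  pred += weights[i] * features[n][i];
              }
              double err = pred - labels[n];
              for (int i = 0; i < weights.length; i++) {
                  grads[i] += err * features[n][i];
              }
          }
          return grads;
      }

      // Master side: accumulate all worker gradients and apply one global update.
      static double[] masterUpdate(double[] weights, List<double[]> workerGrads, double learningRate) {
          double[] global = new double[weights.length];
          for (double[] g : workerGrads) {
              for (int i = 0; i < g.length; i++) {
                  global[i] += g[i];
              }
          }
          double[] updated = weights.clone();
          for (int i = 0; i < updated.length; i++) {
              updated[i] -= learningRate * global[i];
          }
          return updated; // new model broadcast back to workers for the next iteration
      }

      // Toy run with two "workers", each holding one record.
      public static void main(String[] args) {
          double[] w = { 0.0, 0.0 };
          List<double[]> grads = List.of(
              workerGradients(w, new double[][] { { 1.0, 2.0 } }, new double[] { 1.0 }),
              workerGradients(w, new double[][] { { 2.0, 1.0 } }, new double[] { 0.0 }));
          System.out.println(java.util.Arrays.toString(masterUpdate(w, grads, 0.1)));
      }
  }

In Shifu itself, the per-connection gradients come from the Encog-based backward pass on each worker's data shard, and the master's update additionally respects the configured propagation type, learning rate, and the stopping conditions described above.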

Configurations in Neural Network Model Training

  "train" : {
    "baggingNum" : 5,
    "baggingWithReplacement" : true,
    "baggingSampleRate" : 1.0,
    "validSetRate" : 0.1,
    "trainOnDisk" : false,
    "numTrainEpochs" : 1000,
    "workerThreadCount": 4,
    "algorithm" : "NN",
    "params" : {
      "NumHiddenLayers" : 2,
      "ActivationFunc" : [ "Sigmoid", "Sigmoid" ],
      "NumHiddenNodes" : [ 45, 45 ],
      "LearningRate" : 0.1,
      "FeatureSubsetStrategy" : 1,
      "DropoutRate": 0.1,
      "Propagation" : "Q"
    }
  },
  • params::NumHiddenLayers: the number of hidden layers.
  • params::NumHiddenNodes: the number of nodes in each hidden layer; the array length should match NumHiddenLayers.
  • params::ActivationFunc: the activation function of each hidden layer; supported values include 'sigmoid', 'tanh' and 'relu'.
  • params::LearningRate: the learning rate; values in the range 0.1-2 are a good choice.
  • params::DropoutRate: the dropout ratio applied in each weight update per epoch.
  • params::Propagation: 'Q' for QuickPropagation, 'B' for BackPropagation, 'R' for ResilientPropagation; no learning rate needs to be set when 'R' is used.
  • params::FeatureSubsetStrategy: the feature-level sampling rate for each bagging job, in the range (0, 1]. A value of 1 samples all features; a smaller value such as 0.5 samples half of the features, so more bagging jobs are needed to cover them (see the example after this list).
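As an illustration only (the values are not recommendations), a bagging setup that combines resilient propagation, so LearningRate is omitted, with feature-level sampling might look like this fragment of ModelConfig.json:

  "train" : {
    "baggingNum" : 10,
    "algorithm" : "NN",
    "params" : {
      "NumHiddenLayers" : 1,
      "ActivationFunc" : [ "tanh" ],
      "NumHiddenNodes" : [ 45 ],
      "FeatureSubsetStrategy" : 0.5,
      "Propagation" : "R"
    }
  },

Because each bagging job only sees half of the features here, baggingNum is raised so the ensemble still covers all of them.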

Different algorithms differ mainly in the 'params' section, where the model parameters are algorithm-specific.

How to Tune Parameters to Accelerate NN Model Training

  • 'guagua.split.maxCombinedSplitSize' in $SHIFU_HOME/conf/shifuconfig controls the input size per worker. If iteration run time and the number of records per worker are large, tune this value down; the default is 256000000. Pay special attention to this setting when input files are compressed.
  • 'guagua.min.workers.ratio' in $SHIFU_HOME/conf/shifuconfig mitigates straggler issues. The default is 0.97; setting it lower means that in each iteration the master only waits for that fraction of workers to finish. This is an important parameter for accelerating training on a shared cluster.
  • 'workerThreadCount' is the thread count in each worker and is used to speed up worker computation; typical values are 4-8. Setting it higher does not always improve performance, since the CPU on a node becomes very busy when many worker threads run on it.
  • 'mapreduce.map.memory.mb' and 'mapreduce.map.java.opts' can be tuned in $SHIFU_HOME/conf/shifuconfig to adjust worker JVM memory, which helps load all training data into memory (an example shifuconfig snippet follows this list).
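For illustration, the corresponding entries in $SHIFU_HOME/conf/shifuconfig might look like the following; the values are examples to be tuned for your cluster, not recommended defaults:

  # smaller combined split size => more workers, fewer records per worker
  guagua.split.maxCombinedSplitSize=128000000
  # master proceeds once 95% of workers finish an iteration (straggler mitigation)
  guagua.min.workers.ratio=0.95
  # worker JVM memory, large enough to hold the worker's data shard in memory
  mapreduce.map.memory.mb=4096
  mapreduce.map.java.opts=-Xmx3600m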

How to Improve Neural Network (Shallow) Performance

  • For shallow NN models, 1-2 hidden layers is a good choice in almost all cases; even with billions of records and 10K features, 1-2 hidden layers is enough.
  • 'tanh' was the better hidden-layer activation function in most of the cases we evaluated.
  • 'R' (resilient propagation) performs better in most cases than back propagation and quick propagation; a params fragment combining these recommendations is shown after this list.
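As an illustration, a typical shallow-NN params fragment following these recommendations might look like the following; the node count of 45 is just an example:

    "params" : {
      "NumHiddenLayers" : 1,
      "NumHiddenNodes" : [ 45 ],
      "ActivationFunc" : [ "tanh" ],
      "Propagation" : "R"
    }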

Binary Format Neural Network Model

In the bmodels folder next to the models folder you will find the binary models. The binary model format is well compressed and can be executed by the Model Engine with better performance than jPMML.

Shifu also supports exporting multiple bagging models into one binary bagging model. If you trained 5 models, running 'shifu export -t bagging' produces one unified model in the /onebaggingmodel folder. With our own Model Engine there is no need to run the transform several times, and users can treat it as a single model.

RELU Activation Support

'relu' is supported as an activation function, alongside 'sigmoid' and 'tanh', since Shifu 0.10. Testing showed that 'relu' is more stable than 'sigmoid' and 'tanh'; the test setup is described below.

RELU Performance

In this test of NN with RELU, the normalization is WOE_ZSCALE and the training parameters are as follows:

  "train" : {
    "baggingNum" : 1,
    "baggingWithReplacement" : false,
    "baggingSampleRate" : 1.0,
    "validSetRate" : 0.1,
    "numTrainEpochs" : 1500,
    "isContinuous" : false,
    "workerThreadCount" : 4,
    "algorithm" : "NN",
    "params" : {
      "Propagation" : "R",
      "LearningRate" : 0.1,
      "NumHiddenLayers" : 1,
      "NumHiddenNodes" : [ 50 ],
      "ActivationFunc" : [ "relu" ]
    },
    "customPaths" : {}
  },

One thing to note is that 'relu' is not supported in PMML, so Shifu PMML export does not support relu either; for model deployment, please check the Shifu Model Engine page.

Loss Type Support

Before Shifu 0.11.0, only squared loss was supported in Shifu NN, while cross-entropy is a popular loss function with better robustness in prediction. Cross-entropy is supported since Shifu 0.11.0, and the corresponding 'Loss' parameter can be set in train#params:

  "train" : {
    "baggingNum" : 1,
    "baggingWithReplacement" : false,
    "baggingSampleRate" : 1.0,
    "validSetRate" : 0.1,
    "numTrainEpochs" : 1500,
    "isContinuous" : false,
    "workerThreadCount" : 4,
    "algorithm" : "NN",
    "params" : {
      "Propagation" : "R",
      "LearningRate" : 0.1,
      "NumHiddenLayers" : 1,
      "NumHiddenNodes" : [ 50 ],
      "Loss" : "squared",  
      "ActivationFunc" : [ "relu" ]
    },
    "customPaths" : {}
  },

Cross-entropy loss is selected by setting the 'Loss' parameter to 'log', as shown below. In testing, cross-entropy loss showed no significant performance change compared with squared loss (it may be better in some cases; more tests are needed).
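For example, switching the configuration above to cross-entropy loss only requires changing the 'Loss' value; the rest of the fragment stays the same:

    "params" : {
      "Propagation" : "R",
      "LearningRate" : 0.1,
      "NumHiddenLayers" : 1,
      "NumHiddenNodes" : [ 50 ],
      "Loss" : "log",
      "ActivationFunc" : [ "relu" ]
    }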
