
Distributed Neural Network Training in Shifu


Shifu supports Neural Network models based on the Encog framework. Encog itself is single-machine, so in Shifu the Encog code is modified to perform distributed model training.

Distributed Neural Network Training


  • Training data are split among workers.
  • Each worker computes gradients for every connection of the neural network on its data split and sends them back to the master.
  • The master accumulates all gradients, updates the model based on the global gradients, and sends the new model back to the workers (see the sketch after this list).
  • Workers continue collecting gradients based on the current model.
  • Model training stops when the configured maximum number of iterations is reached.
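
To make the iteration above concrete, here is a minimal, self-contained Java sketch of the master-side step: summing worker gradients and producing an updated weight vector. This is an illustration only, not Shifu's actual Guagua master/worker code; the class and method names are hypothetical, and a plain gradient-descent update stands in for the Q/B/R propagation methods Shifu actually supports.

  import java.util.ArrayList;
  import java.util.List;

  // Hypothetical illustration of one master iteration, not Shifu's real implementation.
  public class MasterStepSketch {

      // Sum gradients from all workers and apply one plain gradient-descent update.
      static double[] updateWeights(double[] weights, List<double[]> workerGradients, double learningRate) {
          double[] sum = new double[weights.length];
          for (double[] g : workerGradients) {
              for (int i = 0; i < sum.length; i++) {
                  sum[i] += g[i];
              }
          }
          double[] updated = weights.clone();
          for (int i = 0; i < updated.length; i++) {
              // In Shifu the update rule depends on params::Propagation (Q, B or R);
              // simple gradient descent is used here only to keep the sketch short.
              updated[i] -= learningRate * sum[i] / workerGradients.size();
          }
          return updated;
      }

      public static void main(String[] args) {
          double[] weights = { 0.5, -0.3, 0.8 };             // one weight per network connection
          List<double[]> gradients = new ArrayList<>();
          gradients.add(new double[] { 0.10, -0.20, 0.05 }); // gradients reported by worker 1
          gradients.add(new double[] { 0.30, 0.00, -0.15 }); // gradients reported by worker 2
          System.out.println(java.util.Arrays.toString(updateWeights(weights, gradients, 0.1)));
      }
  }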

Configurations in Neural Network Model Training

  "train" : {
    "baggingNum" : 5,
    "baggingWithReplacement" : true,
    "baggingSampleRate" : 1.0,
    "validSetRate" : 0.1,
    "trainOnDisk" : false,
    "numTrainEpochs" : 1000,
    "workerThreadCount": 4,
    "algorithm" : "NN",
    "params" : {
      "NumHiddenLayers" : 2,
      "ActivationFunc" : [ "Sigmoid", "Sigmoid" ],
      "NumHiddenNodes" : [ 45, 45 ],
      "LearningRate" : 0.1,
      "FeatureSubsetStrategy" : 1,
      "Propagation" : "Q"
    }
  },
  • params::NumHiddenLayers: the number of hidden layers.
  • params::NumHiddenNodes: the number of nodes in each hidden layer, one entry per hidden layer.
  • params::ActivationFunc: the activation function of each hidden layer, e.g. 'Sigmoid', 'tanh', ...
  • params::LearningRate: the learning rate; 0.1-2 is a good choice.
  • params::Propagation: 'Q' for QuickPropagation, 'B' for BackPropagation, 'R' for ResilientPropagation.
  • params::FeatureSubsetStrategy: feature-level sampling rate for each bagging job, in the range (0, 1]. 1 means all features are sampled; with a smaller value such as 0.5, more bagging jobs are needed (see the example after this list).
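
For example, a hypothetical variant of the train section above that samples half of the features per bagging job could look like the following; with FeatureSubsetStrategy below 1, baggingNum is typically raised so that features are still well covered across jobs (the values here are illustrative, not recommendations):

  "train" : {
    "baggingNum" : 10,
    "algorithm" : "NN",
    "params" : {
      "NumHiddenLayers" : 2,
      "ActivationFunc" : [ "Sigmoid", "Sigmoid" ],
      "NumHiddenNodes" : [ 45, 45 ],
      "LearningRate" : 0.1,
      "FeatureSubsetStrategy" : 0.5,
      "Propagation" : "Q"
    }
  }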

Different algorithms differ mainly in the params section; each algorithm has its own set of model parameters.

How to Tune Parameters to Accelerate NN Model Training

  • 'guagua.split.maxCombinedSplitSize' in $SHIFU_HOME/conf/shifuconfig configures the input size per worker. If each iteration runs long and each worker processes many records, tune this value down; by default it is 256000000. Input files in compressed formats deserve extra attention when tuning this value.
  • 'guagua.min.workers.ratio' in $SHIFU_HOME/conf/shifuconfig addresses straggler issues. By default it is 0.97; setting it lower means that in each iteration the master only waits for that fraction of workers to finish. This is an important parameter for accelerating model training on a shared cluster.
  • 'workerThreadCount' is the number of threads in each worker, used to speed up worker computation. The default is 4-8; setting it larger does not always improve performance, because the CPUs of that node become very busy when more workers run on it.
  • 'mapreduce.map.memory.mb' and 'mapreduce.map.java.opts' can be tuned in $SHIFU_HOME/conf/shifuconfig to control worker JVM memory, which helps with loading all data into memory. A sample shifuconfig snippet is shown after this list.
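
For reference, these settings live in $SHIFU_HOME/conf/shifuconfig as key-value lines. The values below are only an illustrative starting point under the assumptions above, not recommended defaults:

  # worker input split size in bytes; smaller splits mean more workers, each with less data
  guagua.split.maxCombinedSplitSize=134217728
  # in each iteration the master waits for only this fraction of workers (helps with stragglers)
  guagua.min.workers.ratio=0.95
  # worker container memory and JVM heap, large enough to hold the worker's data split in memory
  mapreduce.map.memory.mb=4096
  mapreduce.map.java.opts=-Xmx3G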

How to Improve Neural Network (Shallow) Performance

  • For shallow NN models, 1-2 hidden layers is a good choice in almost all cases; even with billions of records and 10K features, 1-2 hidden layers is enough.
  • 'tanh' performed better as the hidden-layer activation function in most of the cases we evaluated.
  • 'R' (resilient propagation) is better in most cases than back propagation and quick propagation. A params block reflecting these recommendations is sketched after this list.
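
Putting these recommendations together, a params block might look like the following; the hidden node count is a hypothetical example, while the other values follow the bullets above:

  "params" : {
    "NumHiddenLayers" : 1,
    "ActivationFunc" : [ "tanh" ],
    "NumHiddenNodes" : [ 50 ],
    "LearningRate" : 0.1,
    "FeatureSubsetStrategy" : 1,
    "Propagation" : "R"
  }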
