Distributed Neural Network Training in Shifu
Neural network models in Shifu are built on the encog framework. Encog itself is single-machine, so Shifu adapts the encog code to perform distributed model training:
- Data is split across workers.
- Each worker computes gradients for every connection of the configured neural network and sends them back to the master.
- The master accumulates all gradients, derives an updated model from the global gradients, and sends the new model back to the workers.
- Workers continue collecting gradients based on the current model.
- Training stops when the configured maximum number of iterations is reached.
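Below is a minimal, in-process sketch of this synchronous master-worker loop. All class and method names here are illustrative assumptions, not Shifu's actual classes; in the real implementation the workers run as Hadoop tasks coordinated by the Guagua iterative framework, and the per-record gradients are computed by encog.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; names do not correspond to Shifu/Guagua classes. */
public class DistributedNNSketch {

    /** Worker step: compute gradients for one data split against the current weights. */
    static double[] workerGradients(double[] weights, List<double[]> dataSplit) {
        double[] gradients = new double[weights.length];
        for (double[] record : dataSplit) {
            // A forward + backward pass per record goes here (encog does this in Shifu);
            // the placeholder below only marks where per-connection gradients accumulate.
            for (int i = 0; i < gradients.length; i++) {
                gradients[i] += 0.0;
            }
        }
        return gradients;
    }

    /** Master step: accumulate gradients from all workers and update the global model. */
    static double[] masterUpdate(double[] weights, List<double[]> workerGradients, double learningRate) {
        double[] updated = weights.clone();
        for (double[] gradients : workerGradients) {
            for (int i = 0; i < updated.length; i++) {
                updated[i] -= learningRate * gradients[i]; // plain gradient-descent style update
            }
        }
        return updated;
    }

    /** Driver: repeat worker and master steps until the configured max iterations. */
    static double[] train(double[] weights, List<List<double[]>> workerSplits,
                          int numTrainEpochs, double learningRate) {
        for (int epoch = 0; epoch < numTrainEpochs; epoch++) {
            List<double[]> gradients = new ArrayList<>();
            for (List<double[]> split : workerSplits) {
                gradients.add(workerGradients(weights, split)); // runs on workers in parallel
            }
            weights = masterUpdate(weights, gradients, learningRate); // new model broadcast to workers
        }
        return weights;
    }
}
```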
"train" : {
"baggingNum" : 5,
"baggingWithReplacement" : true,
"baggingSampleRate" : 1.0,
"validSetRate" : 0.1,
"trainOnDisk" : false,
"numTrainEpochs" : 1000,
"workerThreadCount": 4,
"algorithm" : "NN",
"params" : {
"NumHiddenLayers" : 2,
"ActivationFunc" : [ "Sigmoid", "Sigmoid" ],
"NumHiddenNodes" : [ 45, 45 ],
"LearningRate" : 0.1,
"FeatureSubsetStrategy" : 1,
"Propagation" : "Q"
},
},
- params::NumHiddenLayers: number of hidden layers.
- params::ActivationFunc: activation function for each hidden layer; options include 'Sigmoid', 'tanh', ...
- params::NumHiddenNodes: number of nodes in each hidden layer.
- params::LearningRate: learning rate; values between 0.1 and 2 are a good choice.
- params::Propagation: 'Q' for QuickPropagation, 'B' for BackPropagation, 'R' for ResilientPropagation.
- params::FeatureSubsetStrategy: feature-level sampling rate for each bagging job, in the range (0, 1]. 1 means all features are used; with a smaller value such as 0.5, more bagging jobs are needed.
Different algorithms differ mainly in the "params" section: each algorithm has its own set of model parameters there.
- 'guagua.split.maxCombinedSplitSize' in $SHIFU_HOME/conf/shifuconfig controls the input size per worker. If iteration run time or the number of records per worker is large, tune this value smaller; the default is 256000000. For compressed input files this setting needs extra care. (A sample shifuconfig snippet is shown after this list.)
- 'guagua.min.workers.ratio' in $SHIFU_HOME/conf/shifuconfig mitigates straggler issues. The default is 0.97; setting it smaller means that in each iteration the master waits only for that fraction of workers to finish. This is an important parameter for accelerating training on a shared cluster.
- 'workerThreadCount' is the number of threads in each worker, used to speed up worker computation. Typical values are 4-8; a larger value does not always improve performance, because the CPUs on that node become very busy when too many threads run.
- 'mapreduce.map.memory.mb' and 'mapreduce.map.java.opts' can be tuned in $SHIFU_HOME/conf/shifuconfig to increase worker JVM memory, which helps load all training data into memory.
- For shallow NN models, 1-2 hidden layers is a good choice in almost all cases, even with billions of records and 10K features.
- 'tanh' was the better hidden-layer activation function in most cases we evaluated.
- 'R' (resilient propagation) was better in most cases than back propagation and quick propagation.
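For reference, a shifuconfig fragment covering the cluster-level settings above might look like the following. The values are illustrative only, not defaults or recommendations, and should be tuned for your own cluster and data size.

```
# $SHIFU_HOME/conf/shifuconfig (illustrative values)
guagua.split.maxCombinedSplitSize=128000000
guagua.min.workers.ratio=0.95
mapreduce.map.memory.mb=4096
mapreduce.map.java.opts=-Xms3g -Xmx3g
```

Keeping the -Xmx value somewhat below mapreduce.map.memory.mb leaves headroom so the worker container is not killed for exceeding its memory limit.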