Automatically detecting and fixing DNN training problems at run time.
- AutoTrainer: It mainly contains the source code of AUTOTRAINER (the folders `data` and `utils`) and two demo cases. You can find an easy start here. The way to run the demo cases is shown here.
- Motivation: It contains two test cases showing that 1) whether a training problem occurs is highly random, and 2) the time at which a training problem occurs is random. The way to reproduce these cases can be found here.
- Supplement_data: It contains the required experiments and corresponding results (e.g., all model details and the accuracy improvement table). We also provide all necessary experiment data here and the repair result figures here.
- misc.: The `README.md` shows how to use our demos, the repo structure, how to reproduce our experiments, and our experiment results. The `requirements.txt` lists all the dependencies of AUTOTRAINER.
- AutoTrainer/
    - data/
    - demo_case/
        - Gradient_Vanish_Case/
        - Oscillating_Loss_Case/
        - Improper_Activation_case/
    - utils/
    - reproduce.py
    - README.md
- Motivation/
    - DyingReLU/
    - OscillatingLoss/
    - README.md
- Supplement_data/
- README.md
- requirements.txt
Here we prepare two simple cases based on the Circle and Blob datasets from Sklearn. Enter the corresponding folder and run `demo.py` directly to see how AUTOTRAINER solves the problems in these cases.
$ pip install -r requirements.txt
$ cd AutoTrainer/demo_case/Gradient_Vanish_Case
# or use `cd AutoTrainer/demo_case/Oscillating_Loss_Case`
$ python demo.py
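For readers unfamiliar with the two datasets, the sketch below generates comparable data. It assumes that "Circle" and "Blob" correspond to scikit-learn's `make_circles` and `make_blobs` generators; the sample counts, noise level, and number of centers are illustrative values and are not taken from the demo scripts.

```python
from sklearn.datasets import make_blobs, make_circles

# Two concentric rings: a binary task that is not linearly separable.
# All parameter values here are illustrative assumptions.
X_circle, y_circle = make_circles(
    n_samples=500, noise=0.1, factor=0.5, random_state=0
)

# Three Gaussian clusters in 2-D: a simple multi-class task.
X_blob, y_blob = make_blobs(
    n_samples=500, centers=3, n_features=2, random_state=0
)

# Print the feature matrix shape and the distinct labels of each dataset.
print(X_circle.shape, sorted(set(y_circle.tolist())))
print(X_blob.shape, sorted(set(y_blob.tolist())))
```

Both generators return a feature matrix and a label vector, so the data can be fed to a small fully connected network without further preprocessing.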
To evaluate the effectiveness of AUTOTRAINER, we run 701 collected model training scripts. Among these models, AUTOTRAINER detects 422 buggy models and 506 training problems. AUTOTRAINER then tries the candidate solutions and repairs 414 buggy models, for a repair rate of 98.42%.
Additionally, the above figure shows the distribution of accuracy improvement across the 414 repaired buggy models. The average accuracy improvement reaches 36.42%. Specifically, over 133 models improve by 50% or more, and the maximum improvement reaches 90.17%.
To evaluate the efficiency of AUTOTRAINER, we run all 495 model trainings with and without AUTOTRAINER enabled. For normal training, the runtime overhead is about 1% and is closely related to the problem checker frequency. The above figure shows how this frequency affects the runtime overhead. It is worth mentioning that the runtime overhead on smaller datasets is usually larger (e.g., Blob vs. MNIST in the figure).
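The link between checker frequency and overhead can be illustrated with a toy training loop. This is a minimal sketch and not AUTOTRAINER's implementation: `train_with_checker`, `toy_step`, and the NaN-based check are hypothetical stand-ins chosen to keep the example self-contained.

```python
import math


def train_with_checker(num_batches, check_every, train_step, check_problem):
    """Run a toy training loop and call the checker every `check_every` batches.

    Returns (number of checker calls, batch at which a problem was first flagged).
    """
    checks = 0
    flagged_at = None
    for batch in range(1, num_batches + 1):
        loss = train_step(batch)
        if batch % check_every == 0:
            checks += 1  # each call adds to the runtime overhead
            if flagged_at is None and check_problem(loss):
                flagged_at = batch
    return checks, flagged_at


def toy_step(batch):
    # Hypothetical loss curve: the loss becomes NaN from batch 37 onward.
    return float("nan") if batch >= 37 else 1.0 / batch


sparse = train_with_checker(100, 10, toy_step, math.isnan)
dense = train_with_checker(100, 1, toy_step, math.isnan)
print(sparse, dense)
```

Checking every 10 batches costs a tenth of the checker calls but flags the problem a few batches later than checking every batch, which is the trade-off the frequency setting controls.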
For the repaired trainings, AUTOTRAINER takes 1.14 more training time on average. A deeper analysis of the overhead of the individual components shows that retraining accounts for over 99%, while the other two parts (i.e., problem checker and repair) account for less than 1%. To repair a problem, AUTOTRAINER may try several times, so it ends up training several models.
- Download data from Google Drive.
- Find the model and the corresponding configuration you want to reproduce or test. The models have been saved in different directories, which are named after the datasets. Each model, its configuration file, and its experimental results are placed in a separate subdirectory.
- Use `reproduce.py` to test and reproduce our experiments. (Make sure you have installed the environment and read the 'setup' of AUTOTRAINER.)
$ cd AutoTrainer
$ python reproduce.py -mp THE_MODEL_PATH -cp THE_CONFIGURATION_PATH
# the result will be saved in the `tmp` directory and the output message will be shown on the terminal.
- Prepare your model and the training configuration file. The training configuration should be a set of settings saved as a `pkl` file, including batch size, dataset, max training epoch, loss, optimizer, and its parameters. You can refer to this to complete the configuration file.
- Rewrite `get_dataset()` in `reproduce.py` if you need to use your own dataset. You should add the code that loads and preprocesses your data.
- Adjust the configuration parameters. These parameters are saved in `params` in `reproduce.py`, and they are all set to the default values mentioned in our paper. You can adjust them according to the learning task.
- Run `reproduce.py` following the guide.
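As a starting point for the configuration file described above, the following sketch writes and reloads a `pkl` file with Python's standard `pickle` module. The fields mirror the ones listed in the steps, but the key names (`batch_size`, `opt_kwargs`, and so on) and the values are assumptions for illustration; check the referenced example configuration for the names that `reproduce.py` actually expects.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical key names and values; the real keys may differ.
config = {
    "batch_size": 32,
    "dataset": "circle",
    "epoch": 50,                      # max training epoch
    "loss": "binary_crossentropy",
    "optimizer": "SGD",
    "opt_kwargs": {"lr": 0.01},       # optimizer parameters
}


def save_config(cfg, path):
    """Serialize the configuration to a pkl file."""
    with open(path, "wb") as f:
        pickle.dump(cfg, f)


def load_config(path):
    """Read a configuration back from a pkl file."""
    with open(path, "rb") as f:
        return pickle.load(f)


# Round-trip through a temporary directory to confirm the file is readable.
with tempfile.TemporaryDirectory() as tmp_dir:
    cfg_path = Path(tmp_dir) / "config.pkl"
    save_config(config, cfg_path)
    restored = load_config(cfg_path)

print(restored == config)
```

The path of a file saved this way would then be passed to `reproduce.py` through the `-cp` option.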
@inproceedings{DBLP:conf/icse/ZhangZMS21,
author = {Xiaoyu Zhang and
Juan Zhai and
Shiqing Ma and
Chao Shen},
title = {{AUTOTRAINER:} An Automatic {DNN} Training Problem Detection and Repair
System},
booktitle = {43rd {IEEE/ACM} International Conference on Software Engineering,
{ICSE} 2021, Madrid, Spain, 22-30 May 2021},
pages = {359--371},
publisher = {{IEEE}},
year = {2021},
url = {https://doi.org/10.1109/ICSE43902.2021.00043},
doi = {10.1109/ICSE43902.2021.00043},
timestamp = {Sat, 06 Aug 2022 22:05:44 +0200},
biburl = {https://dblp.org/rec/conf/icse/ZhangZMS21.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}