Time Limit: 1 week.
The assignment has 3 primary tasks:
- Train a model on the mixed-domain data.
- Train domain-expert teachers and use the distillation loss to train the student (a sketch of the loss follows this list).
- Suggest and apply techniques to improve the results.
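For reference, a minimal sketch of a standard Hinton-style distillation loss in PyTorch is shown below; the function and argument names are illustrative rather than taken from the notebooks, and `alpha` is the weight tuned as one of the improvement techniques.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=4.0, alpha=0.5):
    """Weighted sum of the soft-target KL term and the hard-label cross-entropy."""
    # Soft targets: KL divergence between temperature-softened teacher and student
    # outputs, rescaled by T^2 as in Hinton et al. (2015).
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    # Hard targets: ordinary cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```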
The following improvement techniques were adopted (a combined sketch follows this list):
- Early stopping of the teacher and student models at the lowest validation loss.
- Tuning of the alpha hyperparameter.
- Knowledge distillation annealing.
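A condensed sketch of how these techniques fit together in a PyTorch training loop, reusing `distillation_loss` from the sketch above. The linear alpha schedule is only one simple, illustrative way to anneal the distillation signal and is not necessarily the exact Annealing-KD procedure of Jafari et al.; all other names and defaults are placeholders.

```python
import copy
import torch

def linear_alpha(epoch, num_epochs, start=1.0, end=0.1):
    """Linearly decay the distillation weight so hard labels gradually dominate (illustrative)."""
    frac = epoch / max(num_epochs - 1, 1)
    return start + frac * (end - start)

def train_student(student, teacher, train_loader, val_loader, optimizer,
                  num_epochs=50, patience=5, temperature=4.0):
    """Distillation training with an annealed alpha and early stopping on validation loss."""
    best_val, best_state, stale = float("inf"), None, 0
    teacher.eval()
    for epoch in range(num_epochs):
        alpha = linear_alpha(epoch, num_epochs)
        student.train()
        for x, y in train_loader:
            with torch.no_grad():
                t_logits = teacher(x)
            loss = distillation_loss(student(x), t_logits, y, temperature, alpha)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # Early stopping: keep the weights that achieve the lowest validation loss.
        student.eval()
        with torch.no_grad():
            val_loss = sum(
                distillation_loss(student(x), teacher(x), y, temperature, alpha).item()
                for x, y in val_loader
            ) / len(val_loader)
        if val_loss < best_val:
            best_val, best_state, stale = val_loss, copy.deepcopy(student.state_dict()), 0
        else:
            stale += 1
            if stale >= patience:
                break
    if best_state is not None:
        student.load_state_dict(best_state)
    return student
```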
A few basic suggested methods were not implemented due to hardware limitations and the tuning time required (a sketch of what they would look like follows this list):
- Adding batch normalization to the fully connected layers before the ReLU activation.
- Replacing the Adam optimizer with AdamW.
- Using a cyclic learning rate scheduler for super-convergence.
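For completeness, a sketch of what those suggestions would look like in PyTorch; the layer sizes, learning rates, and step counts are placeholders, and the one-cycle policy shown is the schedule behind super-convergence.

```python
import torch
import torch.nn as nn

# Fully connected network with batch normalization inserted before each ReLU
# (input/hidden/output sizes are placeholders).
model = nn.Sequential(
    nn.Linear(2, 64),
    nn.BatchNorm1d(64),
    nn.ReLU(),
    nn.Linear(64, 64),
    nn.BatchNorm1d(64),
    nn.ReLU(),
    nn.Linear(64, 2),
)

# AdamW instead of Adam: decoupled weight decay.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)

# One-cycle learning-rate schedule; scheduler.step() would be called after every batch.
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-2, steps_per_epoch=100, epochs=50
)
```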
The evaluation metric is accuracy, with validation accuracy used as the measure of performance. The experiments are run with the 5 given seeds [0, 10, 1234, 99, 2021] to make the results reproducible and to obtain a statistical measure of the model's performance.
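A minimal sketch of how the seeds could be fixed (the helper name is illustrative):

```python
import random
import numpy as np
import torch

SEEDS = [0, 10, 1234, 99, 2021]

def set_seed(seed):
    """Fix the Python, NumPy, and PyTorch random number generators for a repeatable run."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
```

Each experiment is run once per seed so that the spread of validation accuracy across the 5 runs can be reported.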
The assignment works with binary classification data from 3 different domains. There are 4 variations of the data based on how close the centroids of the point cloud clusters of the different domains are to each other. The data pickle files are placed in the data directory.
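A minimal way to load one of the pickle files; the filename below is hypothetical, so substitute the actual files found in the data directory.

```python
import pickle

# Hypothetical filename; use the actual pickle files in the data directory.
with open("data/complete_separation.pkl", "rb") as f:
    dataset = pickle.load(f)
```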
The Python notebooks for the assignment experiments on the 4 datasets are:
- Data with complete separation
- Data with 75% of complete separation
- Data with 50% of complete separation
- Data with 25% of complete separation
In a new conda or virtualenv environment, run:

pip install -r requirements.txt

Alternatively, use the provided environment.yml file to install the dependencies into a conda environment:

conda env create
conda activate knowledge_distil
- Hinton, G., Vinyals, O. and Dean, J., 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
- Jafari, A., Rezagholizadeh, M., Sharma, P. and Ghodsi, A., 2021. Annealing Knowledge Distillation. arXiv preprint arXiv:2104.07163.
- Gugger, S. and Howard, J. AdamW and Super-convergence is now the fastest way to train neural nets. fast.ai blog post.