# 301_project

This project builds upon Yoon Kim's Convolutional Neural Networks for Sentence Classification, which shows that word2vec embeddings combined with an inception-style CNN achieve solid performance on text classification while being fast and efficient. I investigate how other CNN structures perform on text classification, such as LeNet and the simplest CNN with just one convolution layer, why certain structures work, and how different embedding layers affect this task.
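For context, here is a minimal Keras sketch of an inception-style text CNN with parallel convolution branches over the embedded sequence. The vocabulary size, sequence length, filter counts, and binary output head are illustrative assumptions, not the exact configuration used in the notebooks:

```python
# Minimal sketch of an inception-style CNN for text classification in Keras.
# Vocabulary size, sequence length, filter counts, and the binary output head
# are illustrative assumptions, not the exact values used in the notebooks.
from tensorflow.keras import layers, models

VOCAB_SIZE = 20000   # assumed vocabulary size
MAX_LEN = 400        # assumed padded sequence length
EMBED_DIM = 100      # assumed embedding dimension

inputs = layers.Input(shape=(MAX_LEN,))
embed = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)

# Parallel convolution branches with different kernel sizes, merged at the end.
branches = []
for kernel_size in (3, 4, 5):
    conv = layers.Conv1D(filters=128, kernel_size=kernel_size, activation="relu")(embed)
    branches.append(layers.GlobalMaxPooling1D()(conv))

merged = layers.concatenate(branches)
merged = layers.Dropout(0.5)(merged)
outputs = layers.Dense(1, activation="sigmoid")(merged)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```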

lenet_normal_cnn.ipynb contains the results of using LeNet and the simplest CNN structure for text classification, both with a normal embedding layer trained from scratch and with pretrained word2vec embeddings.

The results show that sequential CNN depth does not matter much for this text classification task: the simplest CNN model already achieves slightly higher performance than LeNet (results shown below, after the acknowledgements).

Special thanks to these 2 posts and tutorials:

CNN in keras with pretrained word2vec weights

How to Develop a Multichannel CNN Model for Text Classification

LeNet: (results screenshot)

Simplest CNN: (results screenshot)
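For comparison, a minimal sketch of the "simplest CNN" variant with a single convolution layer; the hyperparameters here are again illustrative assumptions rather than the exact settings in lenet_normal_cnn.ipynb:

```python
# Minimal sketch of the "simplest CNN": one Conv1D layer plus global max pooling.
# Hyperparameters are illustrative assumptions.
from tensorflow.keras import layers, models

VOCAB_SIZE = 20000
EMBED_DIM = 100

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    layers.Conv1D(filters=128, kernel_size=5, activation="relu"),
    layers.GlobalMaxPooling1D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```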

clean_normal_embed.ipynb contains the results of using a normal embedding layer trained from scratch with the inception CNN.

clean_w2v: the same inception CNN, but with pretrained word2vec embeddings.

These two files show that pretrained word2vec achieves slightly better performance than an embedding layer trained from scratch:

w2v: (results screenshot)

Normal embedding: (results screenshot)
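A minimal sketch of how pretrained word2vec vectors can be loaded into a frozen Keras Embedding layer; the word2vec file name, the toy corpus, and the variable names are illustrative assumptions, not the exact code in the notebooks:

```python
# Minimal sketch: build an embedding matrix from pretrained word2vec vectors and
# use it to initialise a frozen Keras Embedding layer. The word2vec file name,
# the toy corpus, and all variable names are illustrative assumptions.
import numpy as np
from gensim.models import KeyedVectors
from tensorflow.keras import layers
from tensorflow.keras.preprocessing.text import Tokenizer

texts = ["an example training sentence", "another example sentence"]  # placeholder corpus
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
word_index = tokenizer.word_index  # token -> integer id

w2v = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
EMBED_DIM = w2v.vector_size  # 300 for the GoogleNews vectors

# Row i of the matrix holds the pretrained vector for the token with id i;
# tokens missing from word2vec keep an all-zero row.
embedding_matrix = np.zeros((len(word_index) + 1, EMBED_DIM))
for word, i in word_index.items():
    if word in w2v:
        embedding_matrix[i] = w2v[word]

embedding_layer = layers.Embedding(
    input_dim=embedding_matrix.shape[0],
    output_dim=EMBED_DIM,
    weights=[embedding_matrix],
    trainable=False,  # freeze the pretrained vectors (trainable=True would fine-tune them)
)
```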

The dirty_tokenization folder contains experiments that use a rather dirty tokenization.


The folder contains the following files:

- normal_cnn_trainable_false.ipynb: pretrained word2vec with the inception CNN
- dirty_embed_trainable_true.ipynb: an embedding layer initialized from scratch with the inception CNN

The results show that a dirty tokenization method yields lower accuracy (0.78 with word2vec) and takes much longer to converge (about 30 epochs).

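The exact preprocessing lives in the notebooks; purely as an illustrative assumption, the contrast between a cleaned and a "dirty" tokenization can look like this:

```python
# Illustrative assumption only: the notebooks define the actual preprocessing.
# This sketch contrasts a cleaned tokenization with a "dirty" one that keeps
# punctuation and casing untouched.
import re

def dirty_tokenize(text):
    # Naive whitespace split: punctuation stays glued to words, casing is kept.
    return text.split()

def clean_tokenize(text):
    # Lowercase, strip non-alphabetic characters, keep the remaining tokens.
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)
    return [tok for tok in text.split() if tok]

print(dirty_tokenize("The movie was GREAT, wasn't it?"))
# ['The', 'movie', 'was', 'GREAT,', "wasn't", 'it?']
print(clean_tokenize("The movie was GREAT, wasn't it?"))
# ['the', 'movie', 'was', 'great', 'wasn', 't', 'it']
```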

tune_normal_embed.ipynb contains a failed attempt at running Hyperband hyperparameter search on the inception CNN for text classification. The failure is likely due to the large size of the embedding layer: even setting the embedding dimension to just 100 still triggers a resource-exhaustion error in Hyperband.

Hyperband does appear to work when the embedding size is set to just 32, but the tradeoff is very poor accuracy, to the point of being meaningless.
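For reference, a minimal sketch of the kind of Keras Tuner Hyperband setup described here; the search space, epoch budget, and data variables are illustrative assumptions, not the settings used in tune_normal_embed.ipynb:

```python
# Minimal sketch of a Keras Tuner Hyperband search over the embedding dimension
# and filter count; the search space, epochs, and data variables are illustrative
# assumptions, not the settings used in tune_normal_embed.ipynb.
import keras_tuner as kt
from tensorflow.keras import layers, models

def build_model(hp):
    model = models.Sequential([
        layers.Embedding(20000, hp.Choice("embed_dim", [32, 64, 100])),
        layers.Conv1D(hp.Int("filters", 32, 128, step=32), kernel_size=5, activation="relu"),
        layers.GlobalMaxPooling1D(),
        layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

tuner = kt.Hyperband(
    build_model,
    objective="val_accuracy",
    max_epochs=10,
    factor=3,
    directory="hyperband_logs",
    project_name="text_cnn",
)
# tuner.search(x_train, y_train, validation_data=(x_val, y_val))  # assumes prepared data
```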


other_tests is a folder containing some other failed tests. Inside it is a folder named overfitting, with two notebooks (the simplest CNN structure and the inception CNN structure) that reach high performance (accuracy of 0.88) but overfit severely.


The cause is likely the missing L2 regularization term, emphasizing the importance of regularization in this task.
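A minimal sketch of adding L2 regularization to the convolution and dense layers in Keras; the regularization coefficient and the other hyperparameters are illustrative assumptions:

```python
# Minimal sketch of adding L2 (weight-decay) regularization to the convolution
# and dense layers; the 1e-3 coefficient and other hyperparameters are
# illustrative assumptions.
from tensorflow.keras import layers, models, regularizers

VOCAB_SIZE = 20000
EMBED_DIM = 100

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, EMBED_DIM),
    layers.Conv1D(128, 5, activation="relu",
                  kernel_regularizer=regularizers.l2(1e-3)),
    layers.GlobalMaxPooling1D(),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid",
                 kernel_regularizer=regularizers.l2(1e-3)),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```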