- Qi Hu Email: [email protected]
- This is a NLP machine learning project
- The function of this model is to classify a single sentence to a certain domain, for example "I' like some Thai food" will be classified to the RESTAURANT domain.
- Dataset is not included in this project. You can prepare your own data. The format of data should be as follow:
- training, testing and validating data are stored separately in 3 files. Each file is composed of multiple lines (Each line is a data sample).
- For each sample of data, there are 3 parts separated by TAB sign ('\t'): domain (label), sentence, word_dictionary_feature (this is a pre-processed feature, representing the high level feature of each word in the sentence)
- Use word embedding for each raw input word
- Use CNN model to extract feature from the sentence (after embedding layer)
- Combine the CNN feature and word_dictionary_feature together and feed into the fully-connected layer
- Use Softmax for the final classification
- I tried different feature, parameters and network structures (different position of drop layer and fully-connected layer)
- Train: train.py
- Test: test.py
- If you want to try different parameters, change them in the train.py/test.py scripts. If you want to tried different models (network structure), change them in corresponding script in the folder 'model'.
- model (folder): Different CNN models. Each file defines one model, and the differences lie in the features and network structure.
- utils (folder): Some tool functions for data pre-processing, data loading, error analysis, raw data analysis
- train.py: Train the basic model with no word_dictionary_feature
- train_vd.py: Train the model with word_dictionary_feature
- train_cmp2vd.py: Train the model with simplified word_dictionary_feature (choose only one most important feature for each word)