Deep Learning Semantic Classification

A project that showcases how to use deep learning for semantic classification on the Yelp dataset.

Setup

  1. Download the original Yelp dataset HERE
    • Copy yelp_academic_dataset_review.json into the /data folder
    • The JSON is only needed if you want to run Step 4; if you download the pickle reviews_optimized.pickle instead, you can skip this step
  2. Make sure that docker and docker-compose are installed
  3. Start the docker container with docker-compose run app, which builds the image and drops you into the Python shell
    • Inside the shell, any of the scripts can be run with exec(open('filename.py').read())
    • The /data folder is mounted as a volume
    • Shut down the container with exit()
    • Recommendation: skip Steps 4 to 8 and proceed right to Step 9; then all you have to do is download the model file (linked in Step 8). Steps 4 to 8 take ca. 30 min of running time, 20 GB of RAM, and another 15 GB of disk space
  4. Run prepare_raw_data.py (sketched below the list)
    • It returns a dataframe containing only the review, stars and date columns, so that the dataset fits into memory
    • Save this dataframe as a pickle file (reviews_optimized.pickle) to preserve datatypes and make file handling easier
    • Alternatively download reviews_optimized.pickle HERE. Make sure to save it in the /data folder
  5. Run data_loading.py to undersample the dataset, due to resource constraints (sketched below the list)
    • The training set will contain balanced classes; the validation and test sets follow the original class distribution
    • Returns a dict containing six data frames: x_train, y_train, x_val, y_val, x_test, y_test
    • Save this dict to disk as 3c_subsampled_data.pickle
    • Alternatively download 3c_subsampled_data.pickle HERE. Make sure to save it in the /data folder
  6. Download the word vector file from fastText HERE
    • Copy the .bin file into /data
    • The file is only needed if you want to run Step 7
  7. Run create_embeddings.py to create the embedding matrix based on the training set (sketched below the list)
    • The matrix will only contain vectors for the words appearing in the subsample
    • Saves embedding_matrix.pickle and tokenizer.pickle to disk; both are needed for training and testing
    • Alternatively download embedding_matrix.pickle HERE and tokenizer.pickle HERE. Make sure to save them in the /data folder
  8. Run train.py to train the model yourself (sketched below the list)
    • Make sure embedding_matrix.pickle and tokenizer.pickle (from Step 7) are available
    • Saves model.hdf5 to disk
    • Alternatively download model.hdf5 HERE. Make sure to save it in the /data folder
  9. Run test.py to examine the model's performance on the test set (sketched below the list)
  10. Run baseline_NaiveBayes.py to get the performance of Naive Bayes as a point of comparison for the model (sketched below the list)
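
The sketches below illustrate the core idea of each processing step. They are minimal examples under stated assumptions, not the repository's actual code. For Step 4, the Yelp review file is newline-delimited JSON, so it can be reduced chunk by chunk without ever holding the full file in memory; the chunk size and dtype choices here are assumptions, not taken from prepare_raw_data.py:

```python
import pandas as pd

# The Yelp review file names these fields "text", "stars" and "date".
cols = ["text", "stars", "date"]

# Read the multi-GB file in chunks so it never has to fit in memory at once.
chunks = pd.read_json(
    "data/yelp_academic_dataset_review.json",
    lines=True,
    chunksize=100_000,
)
reviews = pd.concat(chunk[cols] for chunk in chunks)

# Shrink dtypes: star ratings fit in an 8-bit integer.
reviews["stars"] = reviews["stars"].astype("int8")

# Pickle preserves dtypes, unlike a CSV round-trip.
reviews.to_pickle("data/reviews_optimized.pickle")
```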
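
Step 5's balancing logic can be sketched with an sklearn-based split. Splitting off the validation and test sets first (stratified) keeps their original class distribution; only the training set is then undersampled. The 80/10/10 ratios and random_state are illustrative, not taken from data_loading.py:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

reviews = pd.read_pickle("data/reviews_optimized.pickle")

# Hold out val/test first so they keep the original class distribution.
train, rest = train_test_split(
    reviews, test_size=0.2, stratify=reviews["stars"], random_state=42
)
val, test = train_test_split(
    rest, test_size=0.5, stratify=rest["stars"], random_state=42
)

# Undersample the training set to the size of its rarest class.
n_min = train["stars"].value_counts().min()
train_balanced = (
    train.groupby("stars", group_keys=False)
    .apply(lambda g: g.sample(n_min, random_state=42))
)

data = {
    "x_train": train_balanced["text"], "y_train": train_balanced["stars"],
    "x_val": val["text"],              "y_val": val["stars"],
    "x_test": test["text"],            "y_test": test["stars"],
}
pd.to_pickle(data, "data/3c_subsampled_data.pickle")
```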
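
For Step 7, a sketch of building the embedding matrix, assuming gensim's loader for fastText's .bin format and a Keras Tokenizer; the file name cc.en.300.bin is an example, use whichever .bin you downloaded in Step 6:

```python
import pickle
import numpy as np
import pandas as pd
from gensim.models.fasttext import load_facebook_vectors
from tensorflow.keras.preprocessing.text import Tokenizer

data = pd.read_pickle("data/3c_subsampled_data.pickle")

# Fit the tokenizer on the training texts only.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(data["x_train"])

# fastText vectors, e.g. cc.en.300.bin (300 dimensions).
vectors = load_facebook_vectors("data/cc.en.300.bin")
dim = vectors.vector_size

# One row per word in the subsample's vocabulary (row 0 stays for padding).
matrix = np.zeros((len(tokenizer.word_index) + 1, dim))
for word, idx in tokenizer.word_index.items():
    matrix[idx] = vectors[word]  # fastText embeds out-of-vocab words via subwords

with open("data/embedding_matrix.pickle", "wb") as f:
    pickle.dump(matrix, f)
with open("data/tokenizer.pickle", "wb") as f:
    pickle.dump(tokenizer, f)
```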
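
A training setup consistent with the artifacts that Step 8 loads and saves might look like the following; the architecture, sequence length, and hyperparameters are assumptions, not train.py's actual values:

```python
import pickle
import pandas as pd
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.sequence import pad_sequences

data = pd.read_pickle("data/3c_subsampled_data.pickle")
with open("data/embedding_matrix.pickle", "rb") as f:
    matrix = pickle.load(f)
with open("data/tokenizer.pickle", "rb") as f:
    tokenizer = pickle.load(f)

MAX_LEN = 200  # assumed maximum review length in tokens

def encode(texts):
    # Turn raw texts into fixed-length integer sequences.
    return pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=MAX_LEN)

def to_labels(y):
    # Map labels to 0..n-1 in case they are raw star ratings.
    y = y.astype("int64")
    return y - y.min()

y_train = to_labels(data["y_train"])
y_val = to_labels(data["y_val"])

model = models.Sequential([
    layers.Embedding(input_dim=matrix.shape[0], output_dim=matrix.shape[1],
                     weights=[matrix], trainable=False),  # frozen fastText vectors
    layers.LSTM(128),
    layers.Dense(y_train.nunique(), activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(encode(data["x_train"]), y_train,
          validation_data=(encode(data["x_val"]), y_val),
          epochs=3, batch_size=256)
model.save("data/model.hdf5")
```

Freezing the embedding layer keeps the pre-trained fastText vectors fixed, which is a common choice when the undersampled training set is small relative to the vocabulary.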
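
Step 9's evaluation then only needs the tokenizer and the saved model, assuming the same sequence length and label mapping as the training sketch above:

```python
import pickle
import pandas as pd
from sklearn.metrics import classification_report
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences

data = pd.read_pickle("data/3c_subsampled_data.pickle")
with open("data/tokenizer.pickle", "rb") as f:
    tokenizer = pickle.load(f)
model = load_model("data/model.hdf5")

x_test = pad_sequences(tokenizer.texts_to_sequences(data["x_test"]), maxlen=200)
y_test = data["y_test"].astype("int64")
y_test -= y_test.min()  # same label mapping as in training

predictions = model.predict(x_test).argmax(axis=1)
print(classification_report(y_test, predictions))
```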
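
Finally, for Step 10, a bag-of-words Naive Bayes baseline in the spirit of baseline_NaiveBayes.py; the vectorizer settings are assumptions:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

data = pd.read_pickle("data/3c_subsampled_data.pickle")

# Word counts as features; cap the vocabulary to keep the matrix manageable.
vectorizer = CountVectorizer(max_features=50_000)
x_train = vectorizer.fit_transform(data["x_train"])
x_test = vectorizer.transform(data["x_test"])

nb = MultinomialNB()
nb.fit(x_train, data["y_train"])
print(classification_report(data["y_test"], nb.predict(x_test)))
```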
