A project showcasing how to use deep learning for semantic classification with the Yelp dataset.
1. Download the original Yelp dataset HERE.
2. Copy `yelp_academic_dataset_review.json` into the `/data` folder.
   - The JSON is only necessary when you want to run Step 4. If you download the pickle `reviews_optimized.pickle`, you can skip this step.
3. Make sure that Docker and docker-compose are installed.
   - Start the Docker container with `docker-compose run app`, which will build the image and drop you into a Python shell. All files can then be run with `exec(open('filename.py').read())`, as in the example below. The `/data` folder is mounted as a volume. You can shut down the container with `exit()`.
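For example, to run the data-preparation step from inside the container's Python shell:

```python
# Run a pipeline step by executing its file in the current shell session.
exec(open('prepare_raw_data.py').read())
```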
Recommendation: skip Steps 4 to 8 and proceed right to Step 9. Then all you have to do is download the model file (link below). Steps 4 to 8 take ca. 30 min of running time, 20 GB of RAM, and another 15 GB of disk space.

4. Run `prepare_raw_data.py` (a sketch of this step follows below).
   - It returns a dataframe containing only the review, stars, and date, so that the dataset fits into memory.
   - Save this dataframe as a pickle file (`reviews_optimized.pickle`) to preserve datatypes and make file handling easier.
   - Alternatively, download `reviews_optimized.pickle` HERE. Make sure to save it in the `/data` folder.
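The actual script is in the repo; as a rough idea, a minimal sketch of this step could look like the following. The chunk size and the column names `text`, `stars`, and `date` (the standard Yelp review fields) are assumptions:

```python
# A minimal sketch of the data-preparation step (the real prepare_raw_data.py
# may differ): stream the large review JSON in chunks and keep only the
# columns we need, so the result fits into memory.
import pandas as pd

chunks = pd.read_json('/data/yelp_academic_dataset_review.json',
                      lines=True, chunksize=100_000)  # chunk size is an assumption
df = pd.concat(chunk[['text', 'stars', 'date']] for chunk in chunks)

# Pickle preserves dtypes, unlike a CSV round-trip.
df.to_pickle('/data/reviews_optimized.pickle')
```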
5. Run `data_loading.py` to undersample the dataset (for resource constraints); a sketch follows below.
   - The training set will contain balanced classes; the validation and test sets follow the original class distribution.
   - Returns a dict containing six data frames: x_train, y_train, x_val, y_val, x_test, y_test.
   - Save this dict to disk as `3c_subsampled_data.pickle`.
   - Alternatively, download `3c_subsampled_data.pickle` HERE. Make sure to save it in the `/data` folder.
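A minimal sketch of the undersampling logic, assuming a stratified split and per-class downsampling; the split ratios, random seed, and column names are assumptions, not taken from `data_loading.py`:

```python
# A rough sketch of the undersampling step.
import pickle
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_pickle('/data/reviews_optimized.pickle')

# Hold out validation and test data first, stratified by star rating,
# so both keep the original class distribution.
train_df, rest = train_test_split(df, test_size=0.2,
                                  stratify=df['stars'], random_state=42)
val_df, test_df = train_test_split(rest, test_size=0.5,
                                   stratify=rest['stars'], random_state=42)

# Undersample the training set to the size of its smallest class,
# so the training classes are balanced.
n_min = train_df['stars'].value_counts().min()
train_bal = (train_df.groupby('stars', group_keys=False)
             .apply(lambda g: g.sample(n_min, random_state=42)))

data = {'x_train': train_bal['text'], 'y_train': train_bal['stars'],
        'x_val': val_df['text'], 'y_val': val_df['stars'],
        'x_test': test_df['text'], 'y_test': test_df['stars']}

with open('/data/3c_subsampled_data.pickle', 'wb') as f:
    pickle.dump(data, f)
```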
6. Download the word vector file from fastText HERE.
   - Copy the .bin file into `/data`.
   - The file is only necessary when you want to run Step 7.
7. Run `create_embeddings.py` to create the embedding matrix based on the training set (sketch below).
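A minimal sketch of how such an embedding matrix can be built: fit a Keras tokenizer on the training reviews and look up a fastText vector for each word in its vocabulary. The use of gensim and the file name `cc.en.300.bin` are assumptions:

```python
# A rough sketch of the embedding-matrix step.
import pickle
import numpy as np
from gensim.models.fasttext import load_facebook_vectors
from tensorflow.keras.preprocessing.text import Tokenizer

with open('/data/3c_subsampled_data.pickle', 'rb') as f:
    data = pickle.load(f)

tokenizer = Tokenizer()
tokenizer.fit_on_texts(data['x_train'])

vectors = load_facebook_vectors('/data/cc.en.300.bin')  # assumed file name
embedding_matrix = np.zeros((len(tokenizer.word_index) + 1,
                             vectors.vector_size))
for word, idx in tokenizer.word_index.items():
    # fastText composes vectors from subwords, so this also covers OOV words.
    embedding_matrix[idx] = vectors[word]

with open('/data/embedding_matrix.pickle', 'wb') as f:
    pickle.dump(embedding_matrix, f)
with open('/data/tokenizer.pickle', 'wb') as f:
    pickle.dump(tokenizer, f)
```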
8. Run `train.py` to train the model yourself (a sketch follows below).
   - Make sure `embedding_matrix.pickle` and `tokenizer.pickle` (from Step 7) are available.
   - Saves `model.hdf5` to disk.
   - Alternatively, download `model.hdf5` HERE. Make sure to save it in the `/data` folder.
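A minimal training sketch; the network architecture, sequence length, and hyperparameters below are assumptions, not the repo's actual model:

```python
# A rough training sketch with a frozen pre-trained embedding layer.
import pickle
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.sequence import pad_sequences

with open('/data/3c_subsampled_data.pickle', 'rb') as f:
    data = pickle.load(f)
with open('/data/tokenizer.pickle', 'rb') as f:
    tokenizer = pickle.load(f)
with open('/data/embedding_matrix.pickle', 'rb') as f:
    embedding_matrix = pickle.load(f)

max_len = 200  # assumed maximum review length in tokens

def encode(texts):
    return pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=max_len)

classes = np.unique(data['y_train'])
def to_idx(y):
    return np.searchsorted(classes, np.asarray(y))  # map labels to 0..n-1

model = models.Sequential([
    layers.Embedding(embedding_matrix.shape[0], embedding_matrix.shape[1],
                     weights=[embedding_matrix], trainable=False),
    layers.Bidirectional(layers.LSTM(64)),  # assumed architecture
    layers.Dense(len(classes), activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(encode(data['x_train']), to_idx(data['y_train']),
          validation_data=(encode(data['x_val']), to_idx(data['y_val'])),
          epochs=3, batch_size=256)
model.save('/data/model.hdf5')
```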
9. Run `test.py` to examine the model's performance on the test set (sketch below).
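A minimal evaluation sketch, assuming the same preprocessing and label mapping as in the training sketch above:

```python
# A rough sketch of evaluating the saved model on the held-out test set.
import pickle
import numpy as np
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences

with open('/data/3c_subsampled_data.pickle', 'rb') as f:
    data = pickle.load(f)
with open('/data/tokenizer.pickle', 'rb') as f:
    tokenizer = pickle.load(f)

model = load_model('/data/model.hdf5')
x_test = pad_sequences(tokenizer.texts_to_sequences(data['x_test']),
                       maxlen=200)  # must match the training sequence length
classes = np.unique(data['y_train'])
y_test = np.searchsorted(classes, np.asarray(data['y_test']))

loss, acc = model.evaluate(x_test, y_test, batch_size=256)
print(f'test accuracy: {acc:.3f}')
```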
10. Run `baseline_NaiveBayes.py` to get the performance of Naive Bayes, so you can compare the model against it (sketch below).
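A minimal sketch of such a baseline, assuming a TF-IDF bag-of-words representation with scikit-learn's `MultinomialNB`; the vectorizer settings are assumptions:

```python
# A rough sketch of the Naive Bayes baseline.
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

with open('/data/3c_subsampled_data.pickle', 'rb') as f:
    data = pickle.load(f)

vec = TfidfVectorizer(max_features=50_000)  # assumed vocabulary cap
x_train = vec.fit_transform(data['x_train'])
clf = MultinomialNB().fit(x_train, data['y_train'])

pred = clf.predict(vec.transform(data['x_test']))
print(classification_report(data['y_test'], pred))
```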