A project showcasing how to use deep learning for semantic classification with the Yelp dataset.
1. Download the original Yelp dataset HERE.
2. Copy `yelp_academic_dataset_review.json` into the `/data` folder.
   - The JSON is only necessary when you want to run Step 4. If you download the pickle `reviews_optimized.pickle`, you can skip this step.
3. Make sure that Docker and docker-compose are installed.
   - Start the Docker container with `docker-compose run app`, which will build the image and drop you into a Python shell. All files can then be run with `exec(open('filename.py').read())`, as in the example below. The `/data` folder is mounted as a volume. You can shut down the container with `exit()`.
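For example, to run the data-preparation step from inside the container's Python shell:

```python
# Run a pipeline step by executing its file in the current shell session.
exec(open('prepare_raw_data.py').read())
```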
Recommendation: skip Steps 4 to 8 and proceed right to Step 9. Then all you have to do is download the model file (link below). Steps 4 to 8 take ca. 30 min of running time, 20 GB of RAM, and another 15 GB of disk space.

4. Run `prepare_raw_data.py` (a sketch of this step follows below).
   - It returns a dataframe containing only the review, stars, and date, so that the dataset fits into memory.
   - Save this dataframe as a pickle file (`reviews_optimized.pickle`) to preserve datatypes and make file handling easier.
   - Alternatively, download `reviews_optimized.pickle` HERE. Make sure to save it in the `/data` folder.
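The actual script is in the repo; as a rough idea, a minimal sketch of this step could look like the following. The chunk size and the column names `text`, `stars`, and `date` (the standard Yelp review fields) are assumptions:

```python
# A minimal sketch of the data-preparation step (the real prepare_raw_data.py
# may differ): stream the large review JSON in chunks and keep only the
# columns we need, so the result fits into memory.
import pandas as pd

chunks = pd.read_json('/data/yelp_academic_dataset_review.json',
                      lines=True, chunksize=100_000)  # chunk size is an assumption
df = pd.concat(chunk[['text', 'stars', 'date']] for chunk in chunks)

# Pickle preserves dtypes, unlike a CSV round-trip.
df.to_pickle('/data/reviews_optimized.pickle')
```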
5. Run `data_loading.py` to undersample the dataset (for resource constraints); a sketch follows below.
   - The training set will contain balanced classes; the validation and test sets follow the original class distribution.
   - Returns a dict containing six data frames: x_train, y_train, x_val, y_val, x_test, y_test.
   - Save this dict to disk as `3c_subsampled_data.pickle`.
   - Alternatively, download `3c_subsampled_data.pickle` HERE. Make sure to save it in the `/data` folder.
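A minimal sketch of the undersampling logic, assuming a stratified split and per-class downsampling; the split ratios, random seed, and column names are assumptions, not taken from `data_loading.py`:

```python
# A rough sketch of the undersampling step.
import pickle
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_pickle('/data/reviews_optimized.pickle')

# Hold out validation and test data first, stratified by star rating,
# so both keep the original class distribution.
train_df, rest = train_test_split(df, test_size=0.2,
                                  stratify=df['stars'], random_state=42)
val_df, test_df = train_test_split(rest, test_size=0.5,
                                   stratify=rest['stars'], random_state=42)

# Undersample the training set to the size of its smallest class,
# so the training classes are balanced.
n_min = train_df['stars'].value_counts().min()
train_bal = (train_df.groupby('stars', group_keys=False)
             .apply(lambda g: g.sample(n_min, random_state=42)))

data = {'x_train': train_bal['text'], 'y_train': train_bal['stars'],
        'x_val': val_df['text'], 'y_val': val_df['stars'],
        'x_test': test_df['text'], 'y_test': test_df['stars']}

with open('/data/3c_subsampled_data.pickle', 'wb') as f:
    pickle.dump(data, f)
```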
6. Download the word vector file from fastText HERE.
   - Copy the .bin file into `/data`.
   - The file is only necessary when you want to run Step 7.
7. Run `create_embeddings.py` to create the embedding matrix based on the training set (sketch below).
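A minimal sketch of how such an embedding matrix can be built: fit a Keras tokenizer on the training reviews and look up a fastText vector for each word in its vocabulary. The use of gensim and the file name `cc.en.300.bin` are assumptions:

```python
# A rough sketch of the embedding-matrix step.
import pickle
import numpy as np
from gensim.models.fasttext import load_facebook_vectors
from tensorflow.keras.preprocessing.text import Tokenizer

with open('/data/3c_subsampled_data.pickle', 'rb') as f:
    data = pickle.load(f)

tokenizer = Tokenizer()
tokenizer.fit_on_texts(data['x_train'])

vectors = load_facebook_vectors('/data/cc.en.300.bin')  # assumed file name
embedding_matrix = np.zeros((len(tokenizer.word_index) + 1,
                             vectors.vector_size))
for word, idx in tokenizer.word_index.items():
    # fastText composes vectors from subwords, so this also covers OOV words.
    embedding_matrix[idx] = vectors[word]

with open('/data/embedding_matrix.pickle', 'wb') as f:
    pickle.dump(embedding_matrix, f)
with open('/data/tokenizer.pickle', 'wb') as f:
    pickle.dump(tokenizer, f)
```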
8. Run `train.py` to train the model yourself (a sketch follows below).
   - Make sure `embedding_matrix.pickle` and `tokenizer.pickle` (from Step 7) are available.
   - Saves `model.hdf5` to disk.
   - Alternatively, download `model.hdf5` HERE. Make sure to save it in the `/data` folder.
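A minimal training sketch; the network architecture, sequence length, and hyperparameters below are assumptions, not the repo's actual model:

```python
# A rough training sketch with a frozen pre-trained embedding layer.
import pickle
import numpy as np
from tensorflow.keras import layers, models
from tensorflow.keras.preprocessing.sequence import pad_sequences

with open('/data/3c_subsampled_data.pickle', 'rb') as f:
    data = pickle.load(f)
with open('/data/tokenizer.pickle', 'rb') as f:
    tokenizer = pickle.load(f)
with open('/data/embedding_matrix.pickle', 'rb') as f:
    embedding_matrix = pickle.load(f)

max_len = 200  # assumed maximum review length in tokens

def encode(texts):
    return pad_sequences(tokenizer.texts_to_sequences(texts), maxlen=max_len)

classes = np.unique(data['y_train'])
def to_idx(y):
    return np.searchsorted(classes, np.asarray(y))  # map labels to 0..n-1

model = models.Sequential([
    layers.Embedding(embedding_matrix.shape[0], embedding_matrix.shape[1],
                     weights=[embedding_matrix], trainable=False),
    layers.Bidirectional(layers.LSTM(64)),  # assumed architecture
    layers.Dense(len(classes), activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(encode(data['x_train']), to_idx(data['y_train']),
          validation_data=(encode(data['x_val']), to_idx(data['y_val'])),
          epochs=3, batch_size=256)
model.save('/data/model.hdf5')
```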
9. Run `test.py` to examine the model's performance on the test set (sketch below).
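A minimal evaluation sketch, assuming the same preprocessing and label mapping as in the training sketch above:

```python
# A rough sketch of evaluating the saved model on the held-out test set.
import pickle
import numpy as np
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences

with open('/data/3c_subsampled_data.pickle', 'rb') as f:
    data = pickle.load(f)
with open('/data/tokenizer.pickle', 'rb') as f:
    tokenizer = pickle.load(f)

model = load_model('/data/model.hdf5')
x_test = pad_sequences(tokenizer.texts_to_sequences(data['x_test']),
                       maxlen=200)  # must match the training sequence length
classes = np.unique(data['y_train'])
y_test = np.searchsorted(classes, np.asarray(data['y_test']))

loss, acc = model.evaluate(x_test, y_test, batch_size=256)
print(f'test accuracy: {acc:.3f}')
```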
10. Run `baseline_NaiveBayes.py` to get the performance of Naive Bayes, so you can compare the model against it (sketch below).
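A minimal sketch of such a baseline, assuming a TF-IDF bag-of-words representation with scikit-learn's `MultinomialNB`; the vectorizer settings are assumptions:

```python
# A rough sketch of the Naive Bayes baseline.
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

with open('/data/3c_subsampled_data.pickle', 'rb') as f:
    data = pickle.load(f)

vec = TfidfVectorizer(max_features=50_000)  # assumed vocabulary cap
x_train = vec.fit_transform(data['x_train'])
clf = MultinomialNB().fit(x_train, data['y_train'])

pred = clf.predict(vec.transform(data['x_test']))
print(classification_report(data['y_test'], pred))
```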