This project has five main steps:
- Data Collection
- Data Cleaning
- Word Embedding
- Topic Extraction
- Sentiment Analysis
git clone https://github.com/hehlinge42/nlp_consulting_project.git
cd nlp_consulting_project
pip install -r requirements.txt
- Tools to scrape TripAdvisor's UK website (https://www.tripadvisor.co.uk/) for restaurants and their associated user reviews.
cd scraper
- See dedicated README in the folder.
- Tool to clean and tokenize the reviews scraped from TripAdvisor.
cd cleaner
- See dedicated README in the folder.
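As a rough illustration of what cleaning and tokenizing a review can involve, here is a minimal sketch using only the Python standard library. The function name `clean_and_tokenize` and the tiny `STOP_WORDS` set are hypothetical and chosen for this example; the actual cleaner in this repository may work differently (see its README).

```python
import re
from typing import List

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "an", "and", "was", "is", "it", "to", "of"}


def clean_and_tokenize(review: str) -> List[str]:
    """Lowercase a review, strip non-letter characters and drop stop words."""
    text = review.lower()
    # Replace anything that is not an ASCII letter or whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    # Split on whitespace and filter out the stop words.
    return [token for token in text.split() if token not in STOP_WORDS]


print(clean_and_tokenize("The food was GREAT, and the service was quick!"))
# → ['food', 'great', 'service', 'quick']
```

A real pipeline would typically add steps such as lemmatization or language detection; they are left out here to keep the sketch short.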
- Tool to embed tokenized reviews into numerical vectors.
cd embedder
- See dedicated README in the folder.
- Tool to embed tokenized reviews into numerical vectors and predict the associated ratings using a Hierarchical Attention Network (HAN).
cd attention_embedder
- See dedicated README in the folder.
- Run the following command and set the user-defined parameters via the GUI:
python3 launch_program.py
GUI user-defined settings:
- --Save Wordcloud: option to create a word cloud per restaurant.
- --Save TFIDF: option to create a TF-IDF embedding per restaurant.
- --Embedding Technique: embedding technique to use (lsi, word2vec and fasttext are supported).
Script to merge data from multiple scraping runs, create a balanced dataset of reviews (ratings 1-5), clean the selected reviews and embed words into vectors using the user-defined embedding technique (lsi, word2vec, fasttext and all are supported).
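The balancing step described above can be sketched as follows. This is a minimal example, not the project's implementation: it assumes each review is a dict with a `rating` key and balances by downsampling every rating class to the size of the smallest one. The function name `balance_by_rating` and the `seed` parameter are hypothetical.

```python
import random
from collections import Counter, defaultdict


def balance_by_rating(reviews, seed=0):
    """Downsample so that every rating present has the same number of reviews.

    `reviews` is assumed to be a list of dicts with a "rating" key (1-5).
    """
    by_rating = defaultdict(list)
    for review in reviews:
        by_rating[review["rating"]].append(review)
    if not by_rating:
        return []
    # The smallest rating class sets the sample size for all classes.
    n = min(len(group) for group in by_rating.values())
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    balanced = []
    for rating in sorted(by_rating):
        balanced.extend(rng.sample(by_rating[rating], n))
    return balanced


# Toy data: ratings 1 and 5 are over-represented.
ratings = [1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 5]
reviews = [{"rating": r, "text": f"review {i}"} for i, r in enumerate(ratings)]
balanced = balance_by_rating(reviews)
print(Counter(review["rating"] for review in balanced))
```

With the toy data above, each rating ends up with two reviews, since ratings 2, 3 and 4 only have two each. Downsampling discards data; oversampling the rarer ratings is an alternative when the dataset is small.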
Project by @elalamik, @erraya, @hehlinge42, @louistransfer and @MaximeRedstone