Deployment of Machine Learning Models Trained on Automatically Annotated Hindi and English Datasets
Code authors: Aarav Babu, Aditi Singh, Atul Krishnan
The popularity of e-commerce has grown manyfold since its inception. With the rise in traffic on such websites, there has been a growing dependency on the views and opinions of the products available: companies and customers rely heavily on these reviews for sales, market demand, constructive criticism, and to check the quality of the products being bought. In this project, we look at ways to avoid the manual labor involved in labeling datasets by using automatic annotation frameworks, and we publish a dataset of Hindi reviews to support work on Sentiment Analysis in Indian Languages (SAIL).
We have trained a highly accurate model on automatically annotated data to segregate these reviews into Suggestions and Complaints. The custom ML model was then deployed to classify reviews, both in bulk and one at a time. The deployment runs on Flask. The project is free for research use; if you find it useful or use it in your research, please acknowledge our Git repository.
- Getting Started with deployment
- About the Datasets, Automatic Annotation tools and Models
- Future Scope
- Contact us
1. To see the working model of our deployment, first download the zip file under /Deployment and extract it to a folder of your choice.
2. Open a command prompt in that folder.
3. First execute the command:
   pip install virtualenv
4. Create a new virtual environment using the command:
   virtualenv abc
   Alternate command:
   python -m virtualenv abc
5. To enter the virtual environment, use the command:
   abc\Scripts\activate
   where "abc" is the name of the virtual environment.
6. Once in the virtualenv, execute the command:
   pip install -r requirements.txt
   to download all the dependencies.
7. Then run the following commands (replace app.py with your app's name if it differs; a minimal sketch of what such an app.py might contain follows this list):
   set FLASK_APP=app.py
   set FLASK_ENV=development
   flask run
8. Copy and paste the URL shown in the terminal output into any web browser.
9. Depending on the type of file you downloaded (Mass/Single), enter either a product-reviews URL with its page number (e.g. https://www.amazon.in/Vivo-Midnight-Additional-Exchange-Offers/product-reviews/B09Q5Z5M9D/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=2) or a single review (e.g. "The phone is fast enough for daily use. Software needs much more improvement seems like some settings for UI is missing. Search inside settings won't show any results even if they exists").
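As referenced in step 7, here is a minimal sketch of what a single-review app.py along these lines might look like. This is not our exact deployment code: the model filename (model.pkl), the form field name (review), and the 0/1 label convention are illustrative assumptions.

```python
# app.py -- minimal single-review classifier service (an illustrative sketch,
# not the exact code shipped under /Deployment).
import pickle

from flask import Flask, request

app = Flask(__name__)

# Assumes a scikit-learn pipeline (vectorizer + classifier) pickled as
# model.pkl; the filename and label convention are assumptions.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    review = request.form["review"]              # form field named "review" (assumed)
    prediction = model.predict([review])[0]      # pipeline accepts raw text
    label = "Suggestion" if prediction == 1 else "Complaint"
    return {"review": review, "label": label}    # Flask serializes dicts to JSON
```

Once flask run is up, you could exercise such an endpoint with, for example: curl -X POST -d "review=Battery drains too fast" http://127.0.0.1:5000/predict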
How to make your own .pkl file:
- Go to the /Deployment folder and open the deployingmodel.ipynb file.
- Replace the dataset with your own dataset.
- Drop or select whichever columns you want.
- Use a classifier and estimator of your choice.
- Run the full notebook.
- Download the .pkl file from the outputs and use it in your own deployment. (A condensed sketch of this workflow is shown below.)
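As a rough guide, the notebook's workflow condenses to something like the following; the dataset filename, the column names (review, label), and the TF-IDF + logistic regression pairing are placeholders, not the notebook's exact choices:

```python
# Sketch of the train-and-pickle workflow behind deployingmodel.ipynb
# (file name, column names, and estimator are placeholders).
import pickle

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

df = pd.read_csv("your_dataset.csv")   # swap in your own dataset
X, y = df["review"], df["label"]       # keep/drop columns as you need

# Bundle the vectorizer with the classifier so the .pkl accepts raw text
# at predict time.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(X, y)

with open("model.pkl", "wb") as f:
    pickle.dump(model, f)
```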
The datasets are provided under the /Datasets folder. All the reviews we used were scraped by us with the Python scripts provided under the /Web Scraping folder. We have also created a one-of-a-kind balanced Hindi/Hinglish product-review dataset, likewise web scraped by us (code provided). It consists of 4331 reviews, of which 2284 are suggestions and 2047 are complaints.
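Purely to illustrate the approach (the real scripts are under /Web Scraping), a review scraper typically looks something like the sketch below; the User-Agent header and the data-hook selector are assumptions that must be checked against the live page structure:

```python
# Illustrative review scraper (selectors are assumptions; see /Web Scraping
# for the scripts we actually used).
import requests
from bs4 import BeautifulSoup

URL = ("https://www.amazon.in/Vivo-Midnight-Additional-Exchange-Offers/"
       "product-reviews/B09Q5Z5M9D/?reviewerType=all_reviews&pageNumber=2")
HEADERS = {"User-Agent": "Mozilla/5.0"}  # many sites block the default UA

response = requests.get(URL, headers=HEADERS, timeout=30)
soup = BeautifulSoup(response.text, "html.parser")

# Amazon review bodies have historically carried this data hook;
# verify against the current page before relying on it.
reviews = [span.get_text(strip=True)
           for span in soup.select("span[data-hook='review-body']")]
print(len(reviews), "reviews scraped")
```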
Under the /AutomaticAnnotation folder we provide the code for the three automatic annotation frameworks we used to label the datasets: VADER, TextBlob, and Flair. Our analysis across the three tools shows that Flair was the most accurate framework, while TextBlob proved to be the fastest; VADER delivered good accuracy across datasets without much variance.
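To give a flavour of how these frameworks score text, the sketch below labels one review with all three; mapping the scores to Suggestion/Complaint by thresholding at 0 would be an illustrative choice, not a rule fixed by the libraries:

```python
# Scoring one review with the three annotation frameworks (thresholding is
# illustrative; see /AutomaticAnnotation for the code we actually used).
from flair.data import Sentence
from flair.models import TextClassifier
from textblob import TextBlob
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

review = "The phone is fast enough for daily use."

# VADER: rule-based, returns a compound score in [-1, 1].
vader_score = SentimentIntensityAnalyzer().polarity_scores(review)["compound"]

# TextBlob: pattern-based polarity in [-1, 1]; the fastest of the three.
textblob_score = TextBlob(review).sentiment.polarity

# Flair: pretrained neural sentiment classifier; slowest, but the most
# accurate in our comparison.
sentence = Sentence(review)
TextClassifier.load("en-sentiment").predict(sentence)
flair_label = sentence.labels[0].value  # "POSITIVE" or "NEGATIVE"

print(vader_score, textblob_score, flair_label)
```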
The automatically annotated datasets were then tried on various standard models to see how accurately predictions could be made. The accuracies obtained were close to those achieved on a manually labeled dataset, which is an encouraging result. Bagging combined with XGBoost gave the best accuracy, so that is the model we eventually deployed.
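The deployed combination can be reproduced along the lines of the sketch below; the dataset name, hyperparameters, and TF-IDF features are illustrative rather than our tuned values, and scikit-learn versions before 1.2 spell the estimator= argument as base_estimator=:

```python
# Sketch of the Bagging + XGBoost combination we eventually deployed
# (file name and hyperparameters are illustrative, not our tuned values).
import pandas as pd
from sklearn.ensemble import BaggingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier

df = pd.read_csv("annotated_reviews.csv")  # an automatically annotated dataset (assumed name)
X, y = df["review"], df["label"]           # labels must be numeric 0/1 for XGBoost

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = make_pipeline(
    TfidfVectorizer(),
    BaggingClassifier(estimator=XGBClassifier(), n_estimators=10),
)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```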
Opinion mining and sentiment analysis are among the most active research fields right now. In our study, we have shown the advantages of using automatic annotation frameworks to ease the labeling of datasets, published an authentic Hindi/Hinglish review dataset that will support further studies in SAIL, and set the stage for new work on automatic annotation in Indian languages while helping to build trust for it in English. We hope you like our project and that it helps you work further in this field.
Atul Krishnan: [email protected]
Github: AtulKrishnan
Aditi Singh: [email protected]
Github: aditi-singh2
Aarav Babu: [email protected]
Github: aarav-babu