Predicting Amazon Review Stars

Introduction

In this project, we leveraged different machine learning models to predict the star ratings of Amazon products from customers' review text and the product category. The data come from the Multilingual Amazon Reviews Corpus (https://registry.opendata.aws/amazon-reviews-ml/), which is freely accessible through the AWS Open Data Registry. We conducted our analysis on English, French, and Chinese reviews, and obtained a prediction accuracy of roughly 90% or higher on English and French review stars using a multiclass logistic classifier.

Problem

Customer reviews are a crucial indicator of a product's quality and can affect future demand for that product. Sellers improve their products and services according to reviews; buyers decide which product to purchase based on a product's past reviews. Fake and misleading comments therefore harm both consumers and businesses. Fake comments might lead customers to buy products that do not meet their expectations, and when consumers encounter fake reviews it becomes difficult to distinguish genuine from fabricated feedback. This loss of trust undermines the credibility of review platforms and makes it harder for consumers to make informed decisions. Moreover, fake and misleading comments disadvantage small businesses: firms with limited resources may struggle to compete against larger companies that can afford to manipulate reviews, and businesses that rely heavily on positive online reviews to build their customer base and establish trust are disproportionately affected. Companies such as Amazon that operate online shopping platforms therefore invest substantial resources in identifying and removing fake and misleading comments to protect their platforms' reputations.

By training a model that predicts a customer's review star rating, we can speed up and automate the process of reading and evaluating every review on a platform like Amazon, which in turn can help identify potentially fake or misleading reviews: if a review's text strongly suggests a positive or negative sentiment but the predicted star rating does not align with it, the mismatch may indicate suspicious activity or manipulation. Furthermore, the trained model could be applied to platforms where customers do not explicitly give a star rating for the product they review; the predicted review stars could then be used to categorize and analyze those reviews efficiently.

Data and Method

The Multilingual Amazon Reviews Corpus contains reviews in English, Japanese, German, French, Chinese, and Spanish, collected between November 1, 2015 and November 1, 2019. Each observation contains the title and body of a review, the given star rating, the product ID, and the product category, and each language includes 210,000 observations. To find the best classification model, we trained a multiclass logistic classifier, random forests, and a multinomial naive Bayes classifier, and applied cross-validation to determine the optimal hyperparameters for each model. Grid search over hyperparameters can be extremely time-consuming, so it is essential to use large-scale computing tools to reduce the computation time.
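To make the model comparison concrete, here is a minimal single-machine sketch of the grid search and cross-validation described above, written with scikit-learn; the toy reviews, star labels, and parameter grids are purely illustrative, and our actual searches ran distributed as described in the following sections.

```python
# Minimal sketch of the model comparison: a bag-of-words pipeline and a small
# grid search with cross-validation for each candidate model.
# The toy data and parameter grids are illustrative only.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

texts = ["great product, works as advertised", "absolutely love it",
         "exactly what I needed", "broke after two days",
         "terrible quality, do not buy", "arrived damaged and late"]
stars = [5, 5, 5, 1, 1, 1]  # star labels for the toy reviews

candidates = {
    "logistic": (LogisticRegression(max_iter=1000), {"clf__C": [0.1, 1.0, 10.0]}),
    "naive_bayes": (MultinomialNB(), {"clf__alpha": [0.5, 1.0]}),
    "random_forest": (RandomForestClassifier(), {"clf__n_estimators": [100, 300]}),
}

for name, (clf, grid) in candidates.items():
    pipe = Pipeline([("vec", CountVectorizer()), ("clf", clf)])
    # 3-fold cross-validation over the (hypothetical) hyperparameter grid
    search = GridSearchCV(pipe, grid, cv=3, scoring="accuracy", n_jobs=-1)
    search.fit(texts, stars)
    print(name, search.best_params_, round(search.best_score_, 3))
```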

PySpark

To study the English reviews, we set up our work environment on an AWS EMR cluster and used the PySpark framework to parallelize our computation. To construct our features, we built a pipeline that first applies a RegexTokenizer to the review title and review body and then vectorizes the tokens with a CountVectorizer. Inside the same pipeline, we used StringIndexer to encode the product category and the labels so that they could be fed into the machine learning models. We also used TextBlob to calculate sentiment scores for the review body and review title and included these scores as additional features. PySpark parallelizes this computation automatically. More specifically, we used PySpark's machine learning library, MLlib: when the pipeline chains multiple transformers to construct the features, operations such as tokenizing text and converting categorical variables to numeric indices are performed on each partition of the data in parallel. Similarly, during model training, the algorithm processes partitions of the data across multiple nodes. PySpark also parallelizes model selection and hyperparameter tuning, such as grid search and cross-validation: models for different hyperparameter combinations can be trained concurrently across the cluster, and each cross-validation fold can be evaluated on a different node, which makes tuning considerably more efficient.
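Below is a minimal sketch of such an MLlib pipeline, assuming the corpus has been loaded into a Spark DataFrame with columns review_body, product_category, and stars; the S3 path and parameter grid are illustrative, and the TextBlob sentiment features are omitted for brevity.

```python
# Minimal sketch of the MLlib feature pipeline and cross-validated multinomial
# logistic regression. Column names, path, and grid values are assumptions.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import (RegexTokenizer, CountVectorizer,
                                StringIndexer, VectorAssembler)
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.appName("amazon-review-stars").getOrCreate()
df = spark.read.json("s3://amazon-reviews-ml/json/train/dataset_en_train.json")

# Tokenize and vectorize the review body (the title is handled the same way).
tokenizer = RegexTokenizer(inputCol="review_body", outputCol="body_tokens",
                           pattern="\\W")
vectorizer = CountVectorizer(inputCol="body_tokens", outputCol="body_vec")

# Encode the product category and the star rating as numeric indices.
category_idx = StringIndexer(inputCol="product_category", outputCol="category_idx")
label_idx = StringIndexer(inputCol="stars", outputCol="label")

assembler = VectorAssembler(inputCols=["body_vec", "category_idx"],
                            outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label",
                        family="multinomial")

pipeline = Pipeline(stages=[tokenizer, vectorizer, category_idx,
                            label_idx, assembler, lr])

# Grid search + 3-fold cross-validation; Spark trains and evaluates the
# candidate models in parallel across the cluster.
grid = (ParamGridBuilder()
        .addGrid(lr.regParam, [0.01, 0.1])
        .addGrid(lr.elasticNetParam, [0.0, 0.5])
        .build())
evaluator = MulticlassClassificationEvaluator(labelCol="label",
                                              metricName="accuracy")
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=3, parallelism=4)

model = cv.fit(df)
print("best CV accuracy:", max(model.avgMetrics))
```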

In our investigation of the Chinese review dataset, which, unlike English and French, is not a Latin-script language, our objective was again to achieve a reasonable level of accuracy. Unfortunately, the sentiment tools used earlier, NLTK and TextBlob, do not support Chinese, so we turned to alternative packages: SnowNLP and cnsenti. SnowNLP calculated a sentiment score, while cnsenti counted the occurrences of positive and negative words within the text, yielding three additional features. However, the outcome showed no discernible difference compared with omitting the NLP features entirely: with or without them, accuracy remained stagnant at approximately 50%. This suggests that these models provide little signal for our task. We conclude that deriving sentiment from Chinese text with off-the-shelf tools remains challenging, and further research and development would be needed to address this limitation.
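As a rough illustration, here is a minimal sketch of how these three features could be extracted with SnowNLP and cnsenti; the return keys and example text are assumptions based on the packages' documented interfaces.

```python
# Sketch: a SnowNLP sentiment score plus positive/negative word counts from
# cnsenti for one Chinese review. Keys and example text are illustrative.
from snownlp import SnowNLP
from cnsenti import Sentiment

senti = Sentiment()

def chinese_sentiment_features(text):
    """Return (sentiment_score, n_positive_words, n_negative_words)."""
    score = SnowNLP(text).sentiments       # probability the text is positive
    counts = senti.sentiment_count(text)   # word-level counts, e.g. {'pos': 2, 'neg': 0, ...}
    return score, counts["pos"], counts["neg"]

print(chinese_sentiment_features("这个产品质量很好，我非常满意"))
```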

Dask

In our examination of the French reviews, we employed Dask, a Python parallel computing library that integrates seamlessly with the established Python ecosystem. Leveraging Dask's DataFrame, we were able to handle datasets larger than available memory, while using Dask-ML for scalable machine learning in Python. This combination let us use popular machine learning libraries such as Scikit-Learn in a distributed computing environment managed by Dask.

In this part, we employed several key strategies to predict Amazon product review star ratings. We used the HashingVectorizer from Dask-ML and the TfidfVectorizer from Scikit-Learn to convert the text into a numerical format suitable for machine learning algorithms. The SGDClassifier and LogisticRegression models from Scikit-Learn were then trained with the incremental learning approach offered by Dask-ML's Incremental wrapper. We also conducted sentiment analysis with the TextBlob library to generate sentiment scores (polarity and subjectivity) that served as additional features. To handle the large-scale data, all of these tasks were executed in a distributed fashion on a YarnCluster from dask-yarn, which let us spread the computation across multiple cores and nodes.
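The following is a minimal sketch of this setup, assuming the French reviews are available as a JSON-lines file with review_body and stars fields (paths and names are illustrative); in practice the client was attached to a YarnCluster from dask-yarn on EMR rather than to a local cluster.

```python
# Sketch: out-of-core training of a logistic-loss model on hashed text
# features with Dask-ML's Incremental wrapper. Paths, column names, and the
# local client are illustrative assumptions.
import dask.dataframe as dd
from dask.distributed import Client
from dask_ml.feature_extraction.text import HashingVectorizer
from dask_ml.wrappers import Incremental
from sklearn.linear_model import SGDClassifier
from textblob import TextBlob

client = Client()  # on EMR we would instead connect to a dask_yarn.YarnCluster

ddf = dd.read_json("dataset_fr_train.json", lines=True)

# Sentiment features (polarity shown here); in the full pipeline these were
# combined with the hashed text features.
ddf["polarity"] = ddf["review_body"].apply(
    lambda t: TextBlob(t).sentiment.polarity, meta=("polarity", "f8"))

# Hash the review text into a fixed-width sparse feature matrix, block-wise.
vectorizer = HashingVectorizer(n_features=2**18)
X = vectorizer.fit_transform(ddf["review_body"])
y = ddf["stars"].to_dask_array(lengths=True)

# Incremental wraps an estimator that supports partial_fit and feeds it one
# block of the dask array at a time.
clf = Incremental(SGDClassifier(loss="log_loss"),  # loss="log" on older sklearn
                  scoring="accuracy")
clf.fit(X, y, classes=[1, 2, 3, 4, 5])
print("training accuracy:", clf.score(X, y))
```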

Results

Language   Model           Features                                  Accuracy
English    Naive Bayes     sentiment polarity and subjectivity       0.62
English    Logistic        sentiment polarity                        0.95
English    Random forest   sentiment polarity and subjectivity       0.86
French     Logistic        vectorized text                           0.59
French     Logistic        vectorized text without stop words        0.83
French     Logistic        sentiment polarity and subjectivity       0.89
Chinese    Logistic        sentiment score, sentiment word counts    0.491
Chinese    Random forest   sentiment score, sentiment word counts    0.44
Chinese    Naive Bayes     sentiment score, sentiment word counts    0.5

Conclusion

In this study, we applied PySpark and Dask to conduct large-scale computing on Amazon product review data. By adding NLP features such as sentiment polarity and subjectivity to our machine learning models, we boosted model performance and obtained roughly 90% or higher prediction accuracy with a multiclass logistic classifier on both French and English reviews. Our models could help companies such as Amazon identify fake or misleading comments, automate the comment review process, and categorize reviews that have no associated star ratings.
