Data Preprocessing Steps and Inspiration
Choosing the Algorithm for the Project
Future Possibilities of the Project
The primary goal of this project is to effectively analyze customer reviews to understand the sentiment and quality perception of products based on user-generated content. The analysis aims to identify patterns and trends in the data that provide insights into customer satisfaction and product quality. Additionally, the project seeks to classify each review based on the sentiment expressed for each product, aiding in the qualitative assessment of feedback.
The primary dataset used for this analysis contains detailed information about product reviews, including text reviews, ratings, and other metadata.
The dataset consists of 568,411 rows and 10 columns, including unique identifiers for reviews, products, and users, as well as textual data for reviews and summaries. The columns are:
- Id: Unique identifier for each review.
- ProductId: Unique identifier for the product being reviewed.
- ProfileName: Name of the user profile.
- HelpfulnessNumerator: Number of users who found the review helpful.
- HelpfulnessDenominator: Total number of users who voted on whether the review was helpful.
- Score: Rating given to the product by the reviewer.
- Time: Timestamp when the review was posted.
- UserId: Unique identifier for the user who wrote the review.
- Summary: Summary of the review.
- Text: Full text of the review.
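As a quick sketch of how this schema can be loaded and inspected with pandas, a small in-memory stand-in (a subset of the columns above, with made-up values) is used here; with the actual file you would call `pd.read_csv` on it instead:

```python
import pandas as pd

# Small in-memory stand-in for the real reviews CSV; with the actual
# file this would be: df = pd.read_csv("Reviews.csv")  (filename assumed)
df = pd.DataFrame({
    "Id": [1, 2],
    "ProductId": ["B001", "B002"],
    "HelpfulnessNumerator": [1, 0],
    "HelpfulnessDenominator": [2, 0],
    "Score": [5, 2],
    "Time": [1303862400, 1346976000],  # Unix timestamps, stored as int64
    "Summary": ["Great!", "Not good"],
    "Text": ["Loved this product.", "Would not buy again."],
})

print(df.shape)   # (rows, columns)
print(df.dtypes)  # confirms Time starts out as int64
```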
- Python: Data cleaning and analysis
- Jupyter Notebook: Interactive data analysis and visualization
Libraries
Below are the links for details and commands (if required) to install the necessary Python packages:
- pandas: Go to Pandas Installation or use command:
pip install pandas
- numpy: Go to NumPy Installation or use command:
pip install numpy
- matplotlib: Go to Matplotlib Installation or use command:
pip install matplotlib
- seaborn: Go to Seaborn Installation or use command:
pip install seaborn
- scikit-learn: Go to Scikit-Learn Installation or use command:
pip install scikit-learn
- NLTK: Go to NLTK Installation or use command:
pip install nltk
Exploratory Data Analysis (EDA) involved exploring the reviews data to answer key questions, such as:
- What is the distribution of scores?
- How do review lengths vary?
- What are the common themes in positive and negative reviews?
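The first two questions can be sketched in a few lines of pandas; the toy scores and texts below are illustrative stand-ins for the full dataset:

```python
import pandas as pd

# Toy reviews standing in for the full dataset (values are illustrative).
df = pd.DataFrame({
    "Score": [5, 4, 5, 1, 2, 5],
    "Text": ["Great product", "Pretty good", "Loved it",
             "Terrible", "Not great, quite disappointing", "Excellent quality"],
})

# Distribution of scores.
score_counts = df["Score"].value_counts().sort_index()

# How review lengths (in words) vary.
df["text_length"] = df["Text"].str.split().str.len()

print(score_counts)
print(df["text_length"].describe())
```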
- Handling Missing Values: Any missing values are identified and removed to ensure data quality.
- Removing Duplicates: Duplicate entries are checked and removed so that each data point is unique.
- Consistency Checks: Ensuring that helpfulness numerators do not exceed denominators and standardizing text data for uniformity.
- Converting Data Types: The 'Time' column is converted from 'int64' to 'datetime'.
- Feature Engineering: New features such as helpfulness ratio, text length, and summary length are generated.
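These cleaning and feature-engineering steps can be sketched as follows (the toy rows are illustrative, and `unit="s"` assumes the Time column holds Unix-epoch seconds):

```python
import pandas as pd

# Toy rows: the first two are exact duplicates, and the last one is
# inconsistent (numerator exceeds denominator).
df = pd.DataFrame({
    "Time": [1303862400, 1303862400, 1346976000],
    "HelpfulnessNumerator": [3, 3, 5],
    "HelpfulnessDenominator": [4, 4, 2],
    "Summary": ["Good", "Good", "Bad"],
    "Text": ["Tasty snack", "Tasty snack", "Stale on arrival"],
})

df = df.drop_duplicates()  # remove duplicate reviews
df = df.dropna()           # drop rows with missing values
# Consistency check: numerator must not exceed denominator.
df = df[df["HelpfulnessNumerator"] <= df["HelpfulnessDenominator"]]

# Convert Time from int64 (Unix seconds, assumed) to datetime.
df["Time"] = pd.to_datetime(df["Time"], unit="s")

# Engineered features.
df["helpfulness_ratio"] = df["HelpfulnessNumerator"] / df["HelpfulnessDenominator"]
df["text_length"] = df["Text"].str.len()
df["summary_length"] = df["Summary"].str.len()

print(df.shape)
```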
- Tokenization: Breaking down the text into individual words or tokens.
- Stop Words Removal: Eliminating common words that offer little value for analysis.
- Lemmatization: Converting words into their base form.
- Vectorization: Transforming text data into numerical format using techniques like Count Vectorization and TF-IDF.
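The steps above can be sketched as a minimal pipeline. The tiny stop-word list here is a stand-in for NLTK's `stopwords` corpus, and lemmatization (done with NLTK's `WordNetLemmatizer` in the project) is omitted for brevity; TF-IDF vectorization uses scikit-learn:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["I loved this tasty snack", "This snack was stale and awful"]

# Tiny stand-in for NLTK's stop-word list.
stop_words = {"i", "this", "was", "and"}

def preprocess(text):
    tokens = text.lower().split()  # tokenization (whitespace-based here)
    # stop-word removal; the project additionally lemmatizes each token
    return " ".join(t for t in tokens if t not in stop_words)

cleaned = [preprocess(d) for d in docs]

# Vectorization: TF-IDF turns the cleaned text into a numeric matrix.
vec = TfidfVectorizer()
X = vec.fit_transform(cleaned)
print(X.shape)  # (number of documents, vocabulary size)
```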
The inspiration for the specific preprocessing steps comes from typical challenges encountered in natural language processing and sentiment analysis tasks, particularly noise reduction, dimensionality reduction, and bias removal.
- Logistic Regression - TF-IDF: Uses term frequency-inverse document frequency (TF-IDF) to weigh words by their importance and logistic regression for binary classification, providing a balance of interpretability and performance.
- Naive Bayes - Count Vectorizer: Utilizes the count vectorizer to transform text data into token counts and applies Naive Bayes for probabilistic classification, effective for large datasets and capturing word frequency.
- Logistic Regression - Count Vectorizer: Combines the count vectorizer's token counts with logistic regression to predict sentiment, suitable for linear relationships and high-dimensional data.
- Naive Bayes - TF-IDF: Employs TF-IDF to emphasize important words and Naive Bayes for classification, balancing word importance and probabilistic predictions.
- NLTK SIA Polarity Scores: Utilizes the Sentiment Intensity Analyzer from NLTK to quickly assess sentiment polarity scores, offering a simple and fast sentiment analysis approach.
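A minimal sketch of one of these pairings, TF-IDF features fed into logistic regression, using a scikit-learn `Pipeline` (the toy texts and labels are illustrative, not drawn from the dataset):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy labeled reviews: 1 = positive, 0 = negative (illustrative only).
texts = ["great tasty product", "loved it", "excellent snack",
         "awful stale taste", "terrible product", "not good at all"]
labels = [1, 1, 1, 0, 0, 0]

# Chain vectorization and classification so raw text goes in directly.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])
model.fit(texts, labels)

print(model.predict(["great tasty product", "awful stale taste"]))
```

The same `Pipeline` pattern covers the other pairings: swap `TfidfVectorizer` for `CountVectorizer`, or `LogisticRegression` for `MultinomialNB`.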
- Independence of Features: Assuming that words are independent of each other.
- Linear Relationships: Assuming linear separability of sentiment based on word presence.
- Text Preprocessing Decisions: Assuming preprocessing steps adequately capture important features.
- Quality and Completeness of Data: Assuming the dataset accurately represents the population of interest.
- Sentiment Labeling Accuracy: Assuming sentiment labels are correct.
- Accuracy: Measures the proportion of total predictions that were correct.
- Precision: Measures the accuracy of positive predictions.
- Recall (Sensitivity): Measures the ability to find all relevant cases within a dataset.
- F1 Score: The harmonic mean of precision and recall.
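All four metrics can be computed directly with scikit-learn; the toy labels and predictions below are illustrative:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 1]  # ground-truth sentiment labels
y_pred = [1, 1, 0, 0, 0, 1]  # model predictions

acc = accuracy_score(y_true, y_pred)    # fraction of correct predictions
prec = precision_score(y_true, y_pred)  # TP / (TP + FP)
rec = recall_score(y_true, y_pred)      # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall

print(acc, prec, rec, f1)
```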
Model | Accuracy | Precision (Negative) | Precision (Positive) | Recall (Negative) | Recall (Positive) | F1-Score (Negative) | F1-Score (Positive)
---|---|---|---|---|---|---|---
Logistic Regression - TF-IDF | 91.29% | 0.84 | 0.93 | 0.73 | 0.96 | 0.79 | 0.95
Naive Bayes - Count Vectorizer | 89.42% | 0.77 | 0.93 | 0.73 | 0.94 | 0.75 | 0.93
Logistic Regression - Count Vectorizer | 91.64% | 0.84 | 0.93 | 0.76 | 0.96 | 0.80 | 0.95
Naive Bayes - TF-IDF | 85.38% | 0.90 | 0.85 | 0.37 | 0.99 | 0.52 | 0.91
NLTK SIA Polarity Scores | 81.97% | 0.74 | 0.83 | 0.26 | 0.97 | 0.39 | 0.89
Balanced Performance: Logistic Regression - Count Vectorizer stands out as the best model, with the highest accuracy (91.64%) and balanced precision and recall across both classes.
- Implement targeted improvements based on feedback from reviews.
- Use sentiment analysis results to guide product development and marketing strategies.
- Continuously update and refine the models with new data for improved accuracy.
- Data Quality: Potential inaccuracies due to underreporting or subjective nature of reviews.
- Model Limitations: Models may not capture all nuances of sentiment in reviews.
- External Factors: Other factors not included in the analysis can impact sentiment.
- Advanced Predictive Modeling: Explore more advanced text-classification models for enhanced accuracy.
- Product-Specific Analysis: Conduct detailed analysis for each product category to uncover unique patterns and tailor models to individual product characteristics.
- External Factors Integration: Incorporate additional factors like economic indicators, social events, and regional factors for a comprehensive approach.