Data Preprocessing Steps and Inspiration
Choosing the Algorithm for the Project
Future Possibilities of the Project
The primary goal of this project is to effectively analyze customer reviews to understand the sentiment and quality perception of products based on user-generated content. The analysis aims to identify patterns and trends in the data that provide insights into customer satisfaction and product quality. Additionally, the project seeks to classify each review based on the sentiment expressed for each product, aiding in the qualitative assessment of feedback.
The primary dataset used for this analysis contains detailed information about product reviews, including text reviews, ratings, and other metadata.
The dataset consists of 568,411 rows and 10 columns, including unique identifiers for reviews, products, and users, as well as textual data for reviews and summaries. The columns are:
- Id: Unique identifier for each review.
- ProductId: Unique identifier for the product being reviewed.
- ProfileName: Name of the user profile.
- HelpfulnessNumerator: Number of users who found the review helpful.
- HelpfulnessDenominator: Total number of users who voted on whether the review was helpful.
- Score: Rating given to the product by the reviewer.
- Time: Timestamp when the review was posted.
- UserId: Unique identifier for the user who wrote the review.
- Summary: Summary of the review.
- Text: Full text of the review.
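As a quick sketch of how this schema can be loaded and inspected with pandas, a small in-memory stand-in (a subset of the columns above, with made-up values) is used here; with the actual file you would call `pd.read_csv` on it instead:

```python
import pandas as pd

# Small in-memory stand-in for the real reviews CSV; with the actual
# file this would be: df = pd.read_csv("Reviews.csv")  (filename assumed)
df = pd.DataFrame({
    "Id": [1, 2],
    "ProductId": ["B001", "B002"],
    "HelpfulnessNumerator": [1, 0],
    "HelpfulnessDenominator": [2, 0],
    "Score": [5, 2],
    "Time": [1303862400, 1346976000],  # Unix timestamps, stored as int64
    "Summary": ["Great!", "Not good"],
    "Text": ["Loved this product.", "Would not buy again."],
})

print(df.shape)   # (rows, columns)
print(df.dtypes)  # confirms Time starts out as int64
```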
- Python: Data cleaning and analysis
- Jupyter Notebook: Interactive data analysis and visualization
Libraries
Below are the links for details and commands (if required) to install the necessary Python packages:
- pandas: Go to Pandas Installation or use command:
pip install pandas
- numpy: Go to NumPy Installation or use command:
pip install numpy
- matplotlib: Go to Matplotlib Installation or use command:
pip install matplotlib
- seaborn: Go to Seaborn Installation or use command:
pip install seaborn
- scikit-learn: Go to Scikit-Learn Installation or use command:
pip install scikit-learn
- NLTK: Go to NLTK Installation or use command:
pip install nltk
Exploratory Data Analysis (EDA) involved exploring the reviews data to answer key questions, such as:
- What is the distribution of scores?
- How do review lengths vary?
- What are the common themes in positive and negative reviews?
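The first two questions can be sketched in a few lines of pandas; the toy scores and texts below are illustrative stand-ins for the full dataset:

```python
import pandas as pd

# Toy reviews standing in for the full dataset (values are illustrative).
df = pd.DataFrame({
    "Score": [5, 4, 5, 1, 2, 5],
    "Text": ["Great product", "Pretty good", "Loved it",
             "Terrible", "Not great, quite disappointing", "Excellent quality"],
})

# Distribution of scores.
score_counts = df["Score"].value_counts().sort_index()

# How review lengths (in words) vary.
df["text_length"] = df["Text"].str.split().str.len()

print(score_counts)
print(df["text_length"].describe())
```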
- Handling Missing Values: Any missing values are identified and removed to ensure data quality.
- Removing Duplicates: Duplicate entries are checked and removed so that each data point is unique.
- Consistency Checks: Ensuring that helpfulness numerators do not exceed denominators and standardizing text data for uniformity.
- Converting Data Types: The 'Time' column is converted from 'int64' to 'datetime'.
- Feature Engineering: New features such as helpfulness ratio, text length, and summary length are generated.
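These cleaning and feature-engineering steps can be sketched as follows (the toy rows are illustrative, and `unit="s"` assumes the Time column holds Unix-epoch seconds):

```python
import pandas as pd

# Toy rows: the first two are exact duplicates, and the last one is
# inconsistent (numerator exceeds denominator).
df = pd.DataFrame({
    "Time": [1303862400, 1303862400, 1346976000],
    "HelpfulnessNumerator": [3, 3, 5],
    "HelpfulnessDenominator": [4, 4, 2],
    "Summary": ["Good", "Good", "Bad"],
    "Text": ["Tasty snack", "Tasty snack", "Stale on arrival"],
})

df = df.drop_duplicates()  # remove duplicate reviews
df = df.dropna()           # drop rows with missing values
# Consistency check: numerator must not exceed denominator.
df = df[df["HelpfulnessNumerator"] <= df["HelpfulnessDenominator"]]

# Convert Time from int64 (Unix seconds, assumed) to datetime.
df["Time"] = pd.to_datetime(df["Time"], unit="s")

# Engineered features.
df["helpfulness_ratio"] = df["HelpfulnessNumerator"] / df["HelpfulnessDenominator"]
df["text_length"] = df["Text"].str.len()
df["summary_length"] = df["Summary"].str.len()

print(df.shape)
```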
- Tokenization: Breaking down the text into individual words or tokens.
- Stop Words Removal: Eliminating common words that offer little value for analysis.
- Lemmatization: Converting words into their base form.
- Vectorization: Transforming text data into numerical format using techniques like Count Vectorization and TF-IDF.
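The steps above can be sketched as a minimal pipeline. The tiny stop-word list here is a stand-in for NLTK's `stopwords` corpus, and lemmatization (done with NLTK's `WordNetLemmatizer` in the project) is omitted for brevity; TF-IDF vectorization uses scikit-learn:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["I loved this tasty snack", "This snack was stale and awful"]

# Tiny stand-in for NLTK's stop-word list.
stop_words = {"i", "this", "was", "and"}

def preprocess(text):
    tokens = text.lower().split()  # tokenization (whitespace-based here)
    # stop-word removal; the project additionally lemmatizes each token
    return " ".join(t for t in tokens if t not in stop_words)

cleaned = [preprocess(d) for d in docs]

# Vectorization: TF-IDF turns the cleaned text into a numeric matrix.
vec = TfidfVectorizer()
X = vec.fit_transform(cleaned)
print(X.shape)  # (number of documents, vocabulary size)
```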
The inspiration for the specific preprocessing steps comes from typical challenges encountered in natural language processing and sentiment analysis tasks, particularly noise reduction, dimensionality reduction, and bias removal.
- Logistic Regression - TF-IDF: Uses term frequency-inverse document frequency (TF-IDF) to weigh words by their importance and logistic regression for binary classification, providing a balance of interpretability and performance.
- Naive Bayes - Count Vectorizer: Utilizes the count vectorizer to transform text data into token counts and applies Naive Bayes for probabilistic classification, effective for large datasets and capturing word frequency.
- Logistic Regression - Count Vectorizer: Combines the count vectorizer's token counts with logistic regression to predict sentiment, suitable for linear relationships and high-dimensional data.
- Naive Bayes - TF-IDF: Employs TF-IDF to emphasize important words and Naive Bayes for classification, balancing word importance and probabilistic predictions.
- NLTK SIA Polarity Scores: Utilizes the Sentiment Intensity Analyzer from NLTK to quickly assess sentiment polarity scores, offering a simple and fast sentiment analysis approach.
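A minimal sketch of one of these pairings, TF-IDF features fed into logistic regression, using a scikit-learn `Pipeline` (the toy texts and labels are illustrative, not drawn from the dataset):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy labeled reviews: 1 = positive, 0 = negative (illustrative only).
texts = ["great tasty product", "loved it", "excellent snack",
         "awful stale taste", "terrible product", "not good at all"]
labels = [1, 1, 1, 0, 0, 0]

# Chain vectorization and classification so raw text goes in directly.
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])
model.fit(texts, labels)

print(model.predict(["great tasty product", "awful stale taste"]))
```

The same `Pipeline` pattern covers the other pairings: swap `TfidfVectorizer` for `CountVectorizer`, or `LogisticRegression` for `MultinomialNB`.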
- Independence of Features: Assuming that words are independent of each other.
- Linear Relationships: Assuming linear separability of sentiment based on word presence.
- Text Preprocessing Decisions: Assuming preprocessing steps adequately capture important features.
- Quality and Completeness of Data: Assuming the dataset accurately represents the population of interest.
- Sentiment Labeling Accuracy: Assuming sentiment labels are correct.
- Accuracy: Measures the proportion of total predictions that were correct.
- Precision: Measures the accuracy of positive predictions.
- Recall (Sensitivity): Measures the ability to find all relevant cases within a dataset.
- F1 Score: The harmonic mean of precision and recall.
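All four metrics can be computed directly with scikit-learn; the toy labels and predictions below are illustrative:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 1, 1, 0, 0, 1]  # ground-truth sentiment labels
y_pred = [1, 1, 0, 0, 0, 1]  # model predictions

acc = accuracy_score(y_true, y_pred)    # fraction of correct predictions
prec = precision_score(y_true, y_pred)  # TP / (TP + FP)
rec = recall_score(y_true, y_pred)      # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)           # harmonic mean of precision and recall

print(acc, prec, rec, f1)
```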
Model | Accuracy | Precision (Negative) | Precision (Positive) | Recall (Negative) | Recall (Positive) | F1-Score (Negative) | F1-Score (Positive)
---|---|---|---|---|---|---|---
Logistic Regression - TF-IDF | 91.29% | 0.84 | 0.93 | 0.73 | 0.96 | 0.79 | 0.95
Naive Bayes - Count Vectorizer | 89.42% | 0.77 | 0.93 | 0.73 | 0.94 | 0.75 | 0.93
Logistic Regression - Count Vectorizer | 91.64% | 0.84 | 0.93 | 0.76 | 0.96 | 0.80 | 0.95
Naive Bayes - TF-IDF | 85.38% | 0.90 | 0.85 | 0.37 | 0.99 | 0.52 | 0.91
NLTK SIA Polarity Scores | 81.97% | 0.74 | 0.83 | 0.26 | 0.97 | 0.39 | 0.89
Balanced Performance: Logistic Regression - Count Vectorizer stands out as the best model, with the highest accuracy (91.64%) and balanced precision and recall across both classes.
- Implement targeted improvements based on feedback from reviews.
- Use sentiment analysis results to guide product development and marketing strategies.
- Continuously update and refine the models with new data for improved accuracy.
- Data Quality: Potential inaccuracies due to underreporting or subjective nature of reviews.
- Model Limitations: Models may not capture all nuances of sentiment in reviews.
- External Factors: Other factors not included in the analysis can impact sentiment.
- Advanced Predictive Modeling: Explore more advanced text-classification models for enhanced accuracy.
- Product-Specific Analysis: Conduct detailed analysis for each product category to uncover unique patterns and tailor models to individual product characteristics.
- External Factors Integration: Incorporate additional factors like economic indicators, social events, and regional factors for a comprehensive approach.