Welcome to the Twitter Sentiment Analysis Project! This project uses machine learning techniques to analyze the sentiment of tweets—whether they convey positive or negative emotions. Our primary focus is to build a robust predictive model capable of classifying tweets based on their sentiment, empowering businesses, researchers, or developers to gauge the public's opinion on various topics.
This project uses the Sentiment140 dataset, a collection of 1.6 million tweets sourced from Kaggle. The dataset is extensive, containing both positive and negative tweets, which makes it a great resource for training machine learning models.
The Sentiment140 dataset contains 1.6 million tweets with labels:
- 0 = Negative Sentiment
- 4 = Positive Sentiment (later mapped to 1 for simplicity)
Each tweet is accompanied by several metadata fields (user ID, date, etc.), but for the purpose of sentiment analysis, we focus primarily on the tweet text and sentiment labels.
Key Data Columns:
- target: Sentiment label (0 = negative, 1 = positive)
- text: Tweet content
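The project's exact loading code isn't reproduced here, but a minimal sketch of reading the dataset and remapping the labels might look like the following (the file name and column order assume the standard Kaggle Sentiment140 CSV; adjust them to your local copy):

```python
import pandas as pd

# Column layout of the Sentiment140 CSV (assumed standard Kaggle export, no header row)
COLUMNS = ["target", "ids", "date", "flag", "user", "text"]

# The file name below is a placeholder for the Kaggle download
df = pd.read_csv(
    "training.1600000.processed.noemoticon.csv",
    encoding="latin-1",
    names=COLUMNS,
)

# Keep only the two columns used for sentiment analysis
df = df[["target", "text"]]

# Remap the original positive label 4 to 1; 0 stays negative
df["target"] = df["target"].replace(4, 1)

print(df["target"].value_counts())
```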
The goal of this project is to:
- Analyze the sentiment of tweets: Distinguish between positive and negative sentiments.
- Build predictive models: Use machine learning algorithms to classify tweets based on their content.
- Gain insights: Understand the strengths and limitations of different machine learning models applied to text classification.
Key features of the project:
- Preprocessing Tweets: A comprehensive text preprocessing pipeline, including removing special characters, links, and usernames, tokenizing words, and applying stemming techniques.
- Model Building: Two models were implemented and evaluated:
  - Logistic Regression: A reliable and interpretable baseline model.
  - Naive Bayes: A simple yet powerful algorithm for text classification.
- Model Comparison: Accuracy, precision, recall, and F1-scores were calculated for both models to determine the best-performing one.
- Visual Insights: Confusion matrix visualizations with both count and percentage values for clearer understanding.
- ROC Curve Analysis: ROC and AUC scores were plotted to assess the trade-off between sensitivity and specificity.
We started with the Sentiment140 dataset consisting of 1.6 million tweets. The target labels are:
- 0 for negative sentiment
- 1 for positive sentiment
The tweets were then preprocessed in three steps:
- Cleaning the text: Removed unwanted noise such as URLs, mentions, and special characters.
- Tokenization and Stemming: Converted the cleaned tweets into tokens and applied stemming using the PorterStemmer.
- Vectorization: Used TF-IDF (Term Frequency-Inverse Document Frequency) to convert the preprocessed tweets into a numerical format suitable for model training (see the sketch after this list).
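The project's preprocessing code isn't shown verbatim here; the sketch below illustrates the cleaning, stemming, and TF-IDF steps described above. The regex patterns and the vocabulary cap are assumptions, and `df` is the DataFrame from the loading sketch earlier:

```python
import re

from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()

def preprocess(tweet: str) -> str:
    """Strip links, @mentions, and special characters, then stem each token."""
    tweet = re.sub(r"https?://\S+|www\.\S+", " ", tweet)  # remove URLs
    tweet = re.sub(r"@\w+", " ", tweet)                   # remove usernames
    tweet = re.sub(r"[^a-zA-Z]", " ", tweet)              # keep letters only
    tokens = tweet.lower().split()                        # simple whitespace tokenization
    return " ".join(stemmer.stem(token) for token in tokens)

df["clean_text"] = df["text"].apply(preprocess)

# TF-IDF turns the cleaned tweets into a sparse numeric feature matrix
vectorizer = TfidfVectorizer(max_features=50_000)  # vocabulary cap is an assumption
X = vectorizer.fit_transform(df["clean_text"])
y = df["target"].values
```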
Two machine learning models were implemented:
- Logistic Regression: A linear model providing a strong baseline with accuracy scores of 79.65% on the training data and 78.02% on the test data.
- Naive Bayes: Achieved 79.95% accuracy on the training data and 76.30% on the test data.
These models were trained using 80% of the data and tested on the remaining 20% to evaluate their performance.
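A minimal training sketch under these assumptions (scikit-learn estimators, `X` and `y` from the TF-IDF sketch above, MultinomialNB as the Naive Bayes variant, and an arbitrary random seed):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# 80/20 train-test split; the random seed is an arbitrary choice
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Logistic Regression baseline
log_reg = LogisticRegression(max_iter=1000)
log_reg.fit(X_train, y_train)
print("LogReg train accuracy:", log_reg.score(X_train, y_train))
print("LogReg test accuracy: ", log_reg.score(X_test, y_test))

# Naive Bayes (MultinomialNB is a common choice for TF-IDF features)
nb = MultinomialNB()
nb.fit(X_train, y_train)
print("NB train accuracy:", nb.score(X_train, y_train))
print("NB test accuracy: ", nb.score(X_test, y_test))
```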
Model performance was evaluated with the following methods (a minimal evaluation sketch follows the list):
- Confusion Matrix: Plotted confusion matrices to visualize the models' predictions. Each cell shows the counts and corresponding percentages of correct and incorrect predictions for positive and negative tweets.
- Classification Report: Calculated key metrics like precision, recall, and F1-score for both positive and negative sentiment classes.
- ROC Curve: Plotted the ROC curve and computed the AUC score to evaluate the trade-off between the true positive and false positive rates.
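The sketch below reuses `log_reg`, `X_test`, and `y_test` from the training sketch above; the same calls apply to the Naive Bayes model:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import auc, classification_report, confusion_matrix, roc_curve

y_pred = log_reg.predict(X_test)

# Confusion matrix as raw counts and as row-wise percentages
cm = confusion_matrix(y_test, y_pred)
cm_pct = cm / cm.sum(axis=1, keepdims=True) * 100
print("Counts:\n", cm)
print("Percentages:\n", np.round(cm_pct, 2))

# Precision, recall, and F1-score for both classes
print(classification_report(y_test, y_pred, target_names=["negative", "positive"]))

# ROC curve and AUC from the predicted probability of the positive class
y_scores = log_reg.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_scores)
plt.plot(fpr, tpr, label=f"Logistic Regression (AUC = {auc(fpr, tpr):.3f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="Chance")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()
plt.show()
```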
The trained models were saved using Python's pickle module for future use, allowing predictions on unseen data without retraining the models.
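A minimal sketch of saving and reloading the fitted artifacts with pickle (the file names are placeholders; the vectorizer is saved alongside the model so new tweets can be transformed the same way, and `preprocess` comes from the preprocessing sketch above):

```python
import pickle

# Save the fitted vectorizer and model for later use
with open("tfidf_vectorizer.pkl", "wb") as f:
    pickle.dump(vectorizer, f)
with open("logistic_regression.pkl", "wb") as f:
    pickle.dump(log_reg, f)

# Later: reload and score an unseen tweet without retraining
with open("tfidf_vectorizer.pkl", "rb") as f:
    vectorizer = pickle.load(f)
with open("logistic_regression.pkl", "rb") as f:
    model = pickle.load(f)

features = vectorizer.transform([preprocess("I love this so much!")])
print("positive" if model.predict(features)[0] == 1 else "negative")
```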
Results summary:

| Metric | Logistic Regression | Naive Bayes |
|---|---|---|
| Training Accuracy | 79.65% | 79.95% |
| Test Accuracy | 78.02% | 76.30% |
| Precision (positive class) | 0.77 | 0.77 |
| Recall (positive class) | 0.80 | 0.75 |
- Logistic Regression provided the better overall test performance, making it a strong choice for this problem. It achieved balanced precision and recall for both positive and negative tweets.
- Naive Bayes was slightly less accurate but remains a lightweight and interpretable alternative, with notably strong recall for negative tweets.
- Text preprocessing plays a crucial role in improving model accuracy. Removing noise and stemming helped improve the predictions significantly.
- TF-IDF Vectorization effectively transformed text data into meaningful numeric representations, capturing the importance of words in the dataset.
Possible future improvements:
- Explore other machine learning algorithms such as SVM or deep learning models like RNNs for improved accuracy.
- Implement advanced NLP techniques such as word embeddings (Word2Vec, GloVe) to capture semantic meaning in tweets.
- Create a user-friendly web app using Streamlit or Flask to deploy the sentiment analysis model for real-time predictions.
This project demonstrates how machine learning can be used to analyze the sentiment of tweets and how different models behave in text classification tasks. By implementing and evaluating Logistic Regression and Naive Bayes models, we were able to gain valuable insights into the effectiveness of each approach.
Future enhancements could involve using more advanced techniques such as deep learning or transformer-based models (e.g., BERT) to further boost classification accuracy.