This script performs sentiment analysis on a dataset of comments using Natural Language Processing (NLP) techniques and machine learning algorithms.
- Python 3.x
- Libraries: pandas, numpy, seaborn, matplotlib, nltk, scikit-learn
- Data Preparation: The script loads a dataset of comments from a CSV file. Ensure that the CSV file is located in the specified path.
- Sentiment Scoring: The script calculates a sentiment score for each comment using the VADER sentiment analyzer and categorizes each comment as positive, neutral, or negative.
- Text Preprocessing: The comments are preprocessed to remove URLs, HTML tags, noise text, punctuation, numbers, and stopwords. The text is also tokenized, stemmed, and converted to lowercase.
- TF-IDF Representation: The preprocessed comments are transformed into TF-IDF (Term Frequency-Inverse Document Frequency) vectors, a numerical form suitable for machine learning algorithms.
- Model Training: The script splits the dataset into training and testing sets. It then uses GridSearchCV to tune the hyperparameters of a Random Forest classifier.
- Model Evaluation: The trained classifier is evaluated on the testing set. The script prints the classification report, including precision, recall, F1-score, and accuracy.
- Testing the Model: Users can input their own comments to test the trained model. The script preprocesses the input comment, makes a prediction using the trained classifier, and displays the predicted sentiment label.
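
The sentiment labelling and text cleaning steps above can be sketched roughly as follows. This is a dependency-free illustration, not the script's actual code: the names `preprocess`, `label_from_compound`, and `STOPWORDS` are hypothetical, the tiny stopword set stands in for NLTK's list, the stemming step is omitted, and the ±0.05 cut-offs are the thresholds conventionally suggested in the VADER README rather than values confirmed for this script.

```python
import re
import string
from typing import List

# Tiny illustrative stopword set (assumption); the script relies on NLTK's list.
STOPWORDS = {"a", "an", "and", "the", "is", "it", "this", "to", "of", "in", "on", "was"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
HTML_TAG_RE = re.compile(r"<[^>]+>")
DIGIT_RE = re.compile(r"\d+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(comment: str) -> List[str]:
    """Lowercase, strip URLs/HTML tags/numbers/punctuation, drop stopwords."""
    text = comment.lower()
    text = URL_RE.sub(" ", text)        # remove URLs before punctuation is stripped
    text = HTML_TAG_RE.sub(" ", text)   # remove HTML tags such as <b> or </p>
    text = DIGIT_RE.sub(" ", text)      # remove numbers
    text = text.translate(PUNCT_TABLE)  # remove ASCII punctuation
    tokens = text.split()               # simple whitespace tokenization
    # Stemming (e.g. with an NLTK stemmer) would follow here; omitted in this sketch.
    return [token for token in tokens if token not in STOPWORDS]


def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score in [-1, 1] to a sentiment label.

    The default threshold is an assumption based on the VADER README.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


tokens = preprocess("Check <b>this</b> out: https://example.com, 10/10 GREAT!!!")
print(tokens)
print(label_from_compound(0.6), label_from_compound(0.0), label_from_compound(-0.3))
```

In the real script the compound score would come from the VADER analyzer rather than being passed in by hand.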
- Ensure that the dataset file `comments_1st.csv` is located in the specified path.
- Run the script in a Python environment.
- Follow the prompts to input comments for testing the model.
- Review the output to see the predicted sentiment labels for the input comments.
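
For a sense of what a run produces, here is a minimal sketch of the train, evaluate, and predict flow using scikit-learn. It makes several assumptions: the comments and labels are invented toy data rather than rows from `comments_1st.csv`, the hyperparameter grid is illustrative, and TF-IDF is wrapped in a `Pipeline` together with the classifier (a design choice of this sketch, so that the vectorizer is fitted only on training folds), which may differ from how the script itself is organized.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

# Invented toy data (assumption); the real script reads comments from a CSV file.
comments = [
    "love this video", "great explanation thanks", "amazing work really helpful",
    "best tutorial ever", "wonderful and clear", "fantastic content love it",
    "video about python", "this is part two", "uploaded on monday",
    "the talk covers pandas", "slides are in the link", "recorded last year",
    "terrible audio quality", "worst tutorial ever", "boring and confusing",
    "hate the background music", "awful waste of time", "bad explanation very poor",
]
labels = ["positive"] * 6 + ["neutral"] * 6 + ["negative"] * 6

# Hold out six comments for evaluation, keeping the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    comments, labels, test_size=6, stratify=labels, random_state=42
)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", RandomForestClassifier(random_state=42)),
])

# Illustrative grid (assumption); the script's actual grid is not documented here.
param_grid = {
    "clf__n_estimators": [25, 50],
    "clf__max_depth": [None, 5],
}

search = GridSearchCV(pipeline, param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

print("Best parameters:", search.best_params_)
print(classification_report(y_test, search.predict(X_test), zero_division=0))

# Predict the label of a new comment, as in the interactive testing step.
predicted = search.predict(["really love this tutorial"])[0]
print("Predicted sentiment:", predicted)
```

With so little data the reported scores are not meaningful; the point is only the shape of the workflow and of the printed output.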
- NLTK Documentation: https://www.nltk.org/
- Scikit-learn Documentation: https://scikit-learn.org/stable/documentation.html
- VADER Sentiment Analysis: https://github.com/cjhutto/vaderSentiment