Analyzing Whether Client Feedback Is Positive or Negative
We are using review data from a restaurant: all kinds of feedback their customers provided, and we are going to analyze whether each piece of feedback is positive or negative. The dataset has only two columns: one is the feedback text and the other is the type of feedback. We apply some Natural Language Processing (NLP) techniques here to understand the language of human feedback. This assumes you already have some Python basics and a working Python environment such as Jupyter. You need the libraries below installed in your Python environment: numpy, pandas, re, nltk, scikit-learn.
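All of these except re (which ships with Python itself and needs no install) can be installed with a simple pip command from a terminal:

pip install numpy pandas nltk scikit-learn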
import pandas as pd
# quoting = 3 (csv.QUOTE_NONE) tells pandas to ignore double quotes inside the reviews
dataset = pd.read_csv('Restaurant_Reviews.tsv', delimiter = '\t', quoting = 3)
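Before cleaning anything, it is worth a quick look at what was loaded. A minimal sanity check, assuming the standard version of this dataset with 1000 rows and the columns Review and Liked:

print(dataset.shape)     # expected (1000, 2) for the standard file
print(dataset.columns)   # expected: 'Review' and 'Liked'
print(dataset.head(3))   # a peek at the first few labelled reviews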
1. First, remove unwanted characters: commas, punctuation marks, symbols and anything other than the letters a-z.
2. The second step is to remove stopwords. Stopwords are words like 'a', 'the', 'is', 'was', 'yet', 'so' and so on. These kinds of words are not useful for predicting customer sentiment, so we remove them.
3. We then divide each sentence into words, leaving final words like 'good', 'bad', 'not', 'really', 'taste' and so on. These words will help us understand the client's thoughts.
So we do all these steps and put the cleaned reviews in a Python list. Here we use the famous 'nltk' library, which is widely used for this kind of processing. Its 'stopwords' module already contains the stopwords for several languages.
So let's see the code:
import re
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
corpus = []
ps = PorterStemmer()                              # create the stemmer once, not once per review
all_stopwords = set(stopwords.words('english'))   # build the stopword set once for speed
for i in range(0, 1000):
    review = re.sub('[^a-zA-Z]', ' ', dataset['Review'][i])   # keep only the letters a-z
    review = review.lower()
    review = review.split()
    review = [ps.stem(word) for word in review if word not in all_stopwords]
    review = ' '.join(review)
    corpus.append(review)
Now the corpus list holds all the cleaned reviews, containing only the important words we need.
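As a quick check, we can compare a raw review with its cleaned version. In the widely used copy of this dataset the first review is 'Wow... Loved this place.', which should reduce to something like 'wow love place' after stemming (your copy may differ):

print(dataset['Review'][0])   # raw text, e.g. "Wow... Loved this place."
print(corpus[0])              # cleaned and stemmed, e.g. "wow love place"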
Before going into the code, let's see a small example of how this has to be done. Suppose we have two reviews: 'good taste' and 'bad taste'. (A customer may not write it exactly like this; they might have said 'This is Good in Taste', but we removed the stopwords and converted the text to lower case.) We make those words the features, and each review becomes a row that records which words it contains. Please see below.

          good   bad   taste
review1    1      0      1
review2    0      1      1
Across all the rows we might see a very large number of distinct words, so we can limit that number in the code. Let's see the code:
from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features = 1500)   # keep only the 1500 most frequent words
X = cv.fit_transform(corpus).toarray()      # bag-of-words matrix: one row per review
y = dataset.iloc[:, 1].values               # the labels (1 = positive, 0 = negative)
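If you are curious which 1500 words survived the cut, the fitted vectorizer exposes its vocabulary. A small sketch (on scikit-learn versions before 1.0 the method is get_feature_names() instead of get_feature_names_out()):

print(X.shape)                           # (1000, 1500): one row per review, one column per word
print(cv.get_feature_names_out()[:10])   # the first few words in the vocabulary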
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0)
# Fitting Naive Bayes to the Training set
from sklearn.naive_bayes import GaussianNB
classifier = GaussianNB()
classifier.fit(X_train, y_train)
# Predicting the Test set results
y_pred = classifier.predict(X_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix, accuracy_score
cm = confusion_matrix(y_test, y_pred)
print(accuracy_score(y_test, y_pred))
Out: 0.8597005208833334
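The confusion matrix we built above is worth printing as well, since it shows how the errors split between the two classes (the exact counts depend on your run):

print(cm)   # rows = actual label (0, 1), columns = predicted label (0, 1)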
Let's see the model in action with an example. Consider a customer who said, "The food quality is very very bad had order some soup it was so terrible could eat more than a spoonful. They need to change the chef at the earliest." We apply the same logic as above to this text and see what the outcome will be.
text = "The food quality is very very bad had order some soup it was so terrible could eat more than a spoonful. They need to change the chef at the earliest."
corpus2 = []
review2 = re.sub("[^a-zA-Z]", ' ', text)
review2 = review2.lower()
review2 = review2.split()
ps2 = PorterStemmer()
review2 = [ps2.stem(word) for word in review2 if word not in set(stopwords.words('english'))]
review2 = " ".join(review2)
corpus2.append(review2)
# Use the SAME vectorizer the model was trained with, so the 1500 feature
# columns line up with what the classifier expects (fitting a fresh
# CountVectorizer here would produce a different column ordering).
my = cv.transform(corpus2).toarray()
result = classifier.predict(my)
if result[0] == 1:
    answer = "Positive"
else:
    answer = "Negative"
print(answer)
Out: Negative
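Instead of repeating all of the cleanup for every new review, the whole flow can be wrapped in a small helper. This is just a sketch; the name predict_sentiment is ours, and it reuses the cv vectorizer and the classifier fitted above:

def predict_sentiment(text):
    # apply the same cleaning used for training: letters only, lower case,
    # stopword removal and stemming
    words = re.sub('[^a-zA-Z]', ' ', text).lower().split()
    ps = PorterStemmer()
    all_stopwords = set(stopwords.words('english'))
    cleaned = ' '.join(ps.stem(w) for w in words if w not in all_stopwords)
    features = cv.transform([cleaned]).toarray()   # same 1500 columns as training
    return "Positive" if classifier.predict(features)[0] == 1 else "Negative"

print(predict_sentiment("The taste was really good"))   # hopefully: Positive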
1. Let's say our restaurant mainly gets reviews in three categories: starters, main course and desserts. While giving feedback on starters, customers will keep mentioning the word 'starter', so the frequency of this word (its term frequency) will be high in that category, and the same goes for the other categories. This unnecessarily high frequency suppresses the value of other important words that actually matter in the machine learning training process.
2. The other problem is with words like 'food', which will be present in all three categories: starters, main course and desserts. If a common word has a high frequency in all the categories (this is where inverse document frequency comes in), the model is unable to learn which category a piece of feedback belongs to in the first place.
To solve these problems we use a concept called TF-IDF (Term Frequency - Inverse Document Frequency). The purpose of TF-IDF is to highlight words that are frequent within one category but not across categories. The basic idea is that we divide the frequency count by the total number of words in each category; this normalises all the counts and reduces the problem of overly high word frequencies.
In this process, we multiply the frequency of a word in a particular category by a logarithmic factor: the total number of categories divided by the number of categories in which that word occurs. The formula below suppresses the effect of words that occur commonly across all categories, which is very useful in text analytics.
w(i,j) = f(i,j) * log(N / f(i))
Here, w(i,j) = the weight of word i in category j
f(i,j) = frequency of word i in category j
N = number of categories
f(i) = number of categories in which word i occurs
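A quick worked example with made-up numbers: say N = 3 categories, the word 'starter' appears 10 times but only in 1 category, while 'food' also appears 10 times but occurs in all 3 categories:

import math

# 'starter' is frequent in one category and rare across categories -> high weight
w_starter = 10 * math.log(3 / 1)   # 10 * 1.0986... ~ 10.99

# 'food' is just as frequent but occurs in every category -> weight collapses to 0
w_food = 10 * math.log(3 / 3)      # 10 * 0 = 0.0

print(w_starter, w_food)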
In Python, we can use the 'feature_extraction' module of sklearn to achieve this; please refer to the code below. (As a note on terminology: the whole collection of documents is generally referred to as the corpus, and the set of unique words drawn from it is the vocabulary.)
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)
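The vectorizer above is only fitted; to actually use the TF-IDF features we transform the corpus and retrain, just as we did with CountVectorizer. A minimal sketch reusing the pieces from above (whether the accuracy actually improves depends on the data):

X_tfidf = vectorizer.transform(corpus).toarray()   # TF-IDF weighted feature matrix
Xt_train, Xt_test, yt_train, yt_test = train_test_split(
    X_tfidf, y, test_size = 0.20, random_state = 0)
clf_tfidf = GaussianNB()
clf_tfidf.fit(Xt_train, yt_train)
print(clf_tfidf.score(Xt_test, yt_test))   # compare with the earlier 85.9%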
Congratulations, you made it this far. We have read, cleaned and processed the data, trained a Naive Bayes model and got the results. From the above, we see the model gives a good accuracy score of 85.9%. This can be increased with a TF-IDF conversion, as well as with bagging or boosting techniques, which train multiple models on shuffled, different subsets of the data.
Thank you.