---
title: "Sentiment Analysis for Amazon Review and Drug Review"
author: "fa22-prj-rongl2-xdai12-zixingd2"
date: "`r Sys.Date()`"
output:
  pdf_document:
    fig_caption: true
    number_sections: true
bibliography: references.bib
csl: harvard1.csl
---
\tableofcontents
\newpage
# Introduction
In recent years, machine learning techniques have become increasingly popular for solving text and sentiment-related problems. They have boosted performance on several tasks and significantly reduced the need for human effort. For this project, we focused on text classification, specifically sentiment analysis, on two datasets, *Amazon Review* and *Drug Review*. Although the *Amazon Review* dataset is popular and has been used in many research papers and projects, most of that work was carried out in Python. We therefore decided to explore implementing four classic Natural Language Processing (NLP) methods using R packages. We first replicated the R code from existing literature on the *Amazon Review* dataset, and then adapted it to a newer but less widely used dataset from the UCI Machine Learning Repository. Our goal is to compare four classifiers, built on \verb|BoW|, \verb|Word2Vec|, \verb|GloVe|, and \verb|fastText|, across the two datasets.
# Data
## Dataset Overview
For the *Amazon Review* dataset, we use the dataset constructed and made available by @zhang2015character. It contains about 1,800,000 training samples and 200,000 testing samples with three attributes: the classification label (1 for negative reviews, 2 for positive reviews), the review title, and the review text body. Due to limited computing resources, we took the first 100,000 samples and split them into a training set (80%) and a testing set (20%).
The *Drug Review* dataset was downloaded from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Drug+Review+Dataset+%28Drugs.com%29). It has 215,063 samples with 6 attributes: drugName, condition, review, rating (1 to 10), date, and usefulCount. As with the *Amazon Review* dataset, we split it into training (80%) and testing (20%) sets. To replicate the code, we mapped the ratings to two labels: ratings from 1 to 4 become 1 (negative) and ratings from 7 to 10 become 2 (positive). We merged the drugName, condition, and review columns into a single text body, removed every symbol that is not a letter or a digit, and kept only the label and text columns, so the drug dataset has the same format as *Amazon Review*. A hedged sketch of this conversion is shown below.
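The following is a minimal sketch of how the raw UCI file could be converted into this two-column format. The raw file name (\verb|drugsComTrain_raw.tsv|) and its column names are assumptions based on the UCI repository listing rather than files from our pipeline, so the chunk is not evaluated.
```{r, eval=FALSE}
# Illustrative sketch only (not evaluated): the raw file name and column names
# below are assumptions based on the UCI repository listing.
raw_drug <- read.delim("drugsComTrain_raw.tsv", stringsAsFactors = FALSE)
# Keep only clearly negative (1-4) and clearly positive (7-10) ratings
raw_drug <- raw_drug[raw_drug$rating <= 4 | raw_drug$rating >= 7, ]
raw_drug$Sentiment <- ifelse(raw_drug$rating <= 4, 1, 2)
# Merge drug name, condition, and review into one text body and
# keep only letters, digits, and spaces
raw_drug$SentimentText <- paste(raw_drug$drugName, raw_drug$condition, raw_drug$review)
raw_drug$SentimentText <- gsub("[^[:alnum:] ]", " ", raw_drug$SentimentText)
drug_clean <- raw_drug[, c("Sentiment", "SentimentText")]
```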
## Dataset Preprocessing
We take the code for the Amazon reviews as an example.
### Amazon review
We first set the seed and read the data. Note that we split the original data into training and testing sets with an 8:2 ratio.
```{r}
set.seed(1)
N_amazon <- 100000
N_train_amazon <- 0.8*N_amazon
reviews_amazon <- readLines("amazon_review_polarity_csv/train.csv", n = N_amazon)
reviews_amazon <- data.frame(reviews_amazon)
```
Then we separate the sentiment label from the review text.
```{r, message=FALSE}
library(tidyr)
reviews_amazon <- separate(data = reviews_amazon, col = reviews_amazon,
into = c("Sentiment", "SentimentText"), sep = 4)
```
Here is the data before preprocessing:
![Head rows of the data before preprocessing](before.png){width=60%}
Since unnecessary punctuation may cause problems for our sentiment analysis, we remove it with the following code:
```{r}
# Retaining only alphanumeric values in the sentiment column
reviews_amazon$Sentiment <- gsub("[^[:alnum:] ]","",reviews_amazon$Sentiment)
# Retaining only alphanumeric values in the sentiment text
reviews_amazon$SentimentText <- gsub("[^[:alnum:] ]"," ",reviews_amazon$SentimentText)
# Replacing multiple spaces in the text with single space
reviews_amazon$SentimentText <- gsub("(?<=[\\s])\\s*|^\\s+|\\s+$", "",
reviews_amazon$SentimentText, perl=TRUE)
# Writing the output to a file that can be consumed in other models
write.table(reviews_amazon,file = "Sentiment Analysis Dataset.csv",row.names = F,
col.names = T,sep=',')
```
This will give us the following output:
![Head rows of the data after preprocessing](after.png){width=60%}
Since the fastText algorithm expects the dataset to be in the format \verb|__label__<x> <text>|, where \verb|<x>| is the class label and \verb|<text>| is the review text, we need to transform our data into this format.
```{r}
reviews_amazon <- readLines('amazon_review_polarity_csv/train.csv', n = N_amazon)
# Basic EDA to confirm that the data is read correctly
print(class(reviews_amazon))
print(length(reviews_amazon))
# Replacing the positive sentiment value 2 with __label__2
reviews_amazon <- gsub("\\\"2\\\",","__label__2 ",reviews_amazon)
# Replacing the negative sentiment value 1 with __label__1
reviews_amazon <- gsub("\\\"1\\\",","__label__1 ",reviews_amazon)
# Removing the unnecessary \" characters
reviews_amazon <- gsub("\\\""," ",reviews_amazon)
# Replacing multiple spaces in the text with single space
reviews_amazon <- gsub("(?<=[\\s])\\s*|^\\s+|\\s+$", "", reviews_amazon, perl=TRUE)
# Writing the revamped file to the directory so we could use it with
# fastText sentiment analyzer project
fileConn <- file("Sentiment Analysis Dataset_ft.txt")
writeLines(reviews_amazon, fileConn)
close(fileConn)
```
This will give us the following dataset:
![Head rows of the preprocessed data for fastText](FastText.png){width=60%}
### Drug Review
The preprocessing steps for *Drug Review* are similar to those for *Amazon Review*.
```{r, echo=FALSE}
N_Drug <- 146942
reviews_text_Drug <- readLines("Drug Train.csv", n = N_Drug)
reviews_text_Drug <- data.frame(reviews_text_Drug)
```
```{r, echo=FALSE}
reviews_text_Drug <- separate(data = reviews_text_Drug, col = reviews_text_Drug,
into = c("Sentiment", "SentimentText"), sep = 4)
reviews_text_Drug <- reviews_text_Drug[-1,]
N_Drug <- N_Drug - 1
```
```{r, echo=FALSE}
# Retaining only alphanumeric values in the sentiment column
reviews_text_Drug$Sentiment <- gsub("[^[:alnum:] ]","",reviews_text_Drug$Sentiment)
# Retaining only alphanumeric values in the sentiment text
reviews_text_Drug$SentimentText <- gsub("[^[:alnum:] ]"," ",reviews_text_Drug$SentimentText)
# Replacing multiple spaces in the text with single space
reviews_text_Drug$SentimentText <- gsub("(?<=[\\s])\\s*|^\\s+|\\s+$", "",
reviews_text_Drug$SentimentText, perl=TRUE)
```
However, during training we found that the *Drug Review* data is imbalanced, so we balance it by downsampling the majority class until the positive and negative reviews are equal in number.
```{r}
# Checking the summary of our label for Drug Review
(Sentimentable = table(reviews_text_Drug$Sentiment))
# Balance our Drug Review
minlabel <- names(which(Sentimentable == min(Sentimentable)))
maxlabel <- names(which(Sentimentable == max(Sentimentable)))
n_maxlabel <- min(Sentimentable)
minlabelid <- c(1:N_Drug)[reviews_text_Drug$Sentiment==minlabel]
maxlabelid <- sample(c(1:N_Drug)[reviews_text_Drug$Sentiment==maxlabel],n_maxlabel)
balanceid <- sample(c(minlabelid,maxlabelid))
reviews_text_Drug <- reviews_text_Drug[balanceid,]
N_Drug <- nrow(reviews_text_Drug)
N_train_Drug <- round(0.8*N_Drug)
```
```{r, echo=FALSE}
# Writing the output to a file that can be consumed in other projects
write.table(reviews_text_Drug,file = "Sentiment Analysis Dataset_Drug.csv",
row.names = F, col.names = T,sep=',')
```
```{r, echo=FALSE}
reviews_text_Drug <- readLines("Drug Train.csv", n = 146942)
reviews_text_Drug <- reviews_text_Drug[-1]
reviews_text_Drug <- reviews_text_Drug[balanceid]
# Replacing the positive sentiment value 2 with __label__2
reviews_text_Drug<-gsub("\\\"2\\\",","__label__2 ",reviews_text_Drug)
# Replacing the negative sentiment value 1 with __label__1
reviews_text_Drug<-gsub("\\\"1\\\",","__label__1 ",reviews_text_Drug)
# Removing the unnecessary \" characters
reviews_text_Drug<-gsub("\\\""," ",reviews_text_Drug)
# Replacing multiple spaces in the text with single space
reviews_text_Drug<-gsub("(?<=[\\s])\\s*|^\\s+|\\s+$", "", reviews_text_Drug, perl=TRUE)
# Writing the revamped file to the directory so we could use it with
# fastText sentiment analyzer project
fileConn<-file("Sentiment Analysis Dataset_ft_Drug.txt")
writeLines(reviews_text_Drug, fileConn)
close(fileConn)
```
# BoW approach with Naive Bayes
The Bag of Words (BoW) method is widely used in NLP and computer vision. It represents a text by the occurrence counts of each word, disregarding grammar and word order. To apply BoW to our two datasets, *Amazon Review* and *Drug Review*, we first use the \verb|VCorpus| and \verb|DocumentTermMatrix| functions in the \verb|tm| package to convert the text into a Document Term Matrix (DTM). By setting the built-in control parameters of \verb|DocumentTermMatrix|, we do not have to remove stop words in a separate cleaning step. To make the model more precise, we then used the \verb|removeSparseTerms| function to drop terms that are absent from more than 99% of the documents.
After the BoW conversion, we can use the DTM to create word clouds for both positive and negative sentiment. To make the word clouds easier to interpret, we use a simple two-sample t-test to find the most discriminative words. Finally, we followed @rCode and used a Naive Bayes sentiment classifier to perform predictions. Using the \verb|naiveBayes| function in the \verb|e1071| package, we obtained test accuracies of approximately **81.19%** for *Amazon Review* and **74.77%** for *Drug Review*.
## Amazon review
```{r, message=FALSE}
library(SnowballC)
library(tm)
# Reading the transformed file as a dataframe
text_amazon <- read.table(file='Sentiment Analysis Dataset.csv', sep=',', header = TRUE)
# Transforming the text into volatile corpus
amazon_corp <- VCorpus(VectorSource(text_amazon$SentimentText))
```
```{r}
# Creating document term matrix (DTM)
dtm_amazon <- DocumentTermMatrix(amazon_corp, control = list( tolower = TRUE,
removeNumbers = TRUE, stopwords = TRUE, removePunctuation = TRUE, stemming = TRUE))
# Basic EDA on dtm
inspect(dtm_amazon)
```
We see that the DTM is 100% sparse. Since the DTM tends to get very big, we removed sparse terms, that is, terms occurring only in very few documents, and tried to reduce the size of the matrix without losing significant relations inherent to the matrix.
```{r}
# Removing sparse terms
dtm_amazon = removeSparseTerms(dtm_amazon, 0.99)
inspect(dtm_amazon)
```
Using the DTM, we can create word clouds to better understand our sentiment text. However, we found that some of the selected words were difficult to interpret, so we first filtered the vocabulary with a simple screening method borrowed from discriminant analysis: the two-sample t-test. Assume we have one-dimensional observations from two groups
$$X_{1}, X_{2}, \dots, X_{m}, \quad Y_{1}, Y_{2}, \dots, Y_{n}.$$
To test whether the X population and the Y population have the same mean, we compute the two-sample t-statistic
$$t = \frac{\bar{X} - \bar{Y}}{\sqrt{\frac{s^2_X}{m} + \frac{s^2_Y}{n}}},$$
where $s_X^2$ and $s_Y^2$ denote the sample variances of the X and Y groups.
Again, we used the training data from the first split. Since \verb|dtm_amazon| is a large sparse matrix, we used functions from the R package \verb|slam| to efficiently compute the mean and variance of each column.
```{r}
# Word Cloud preparing
v.size = dim(dtm_amazon)[2]
ytrain = as.numeric(text_amazon$Sentiment)
```
```{r, message=FALSE}
# Using the two-sample t-test to find the most representative words for the word clouds
library(slam)
summ = matrix(0, nrow=v.size, ncol=4)
summ[,1] = colapply_simple_triplet_matrix(
as.simple_triplet_matrix(dtm_amazon[ytrain==2, ]), mean)
summ[,2] = colapply_simple_triplet_matrix(
as.simple_triplet_matrix(dtm_amazon[ytrain==2, ]), var)
summ[,3] = colapply_simple_triplet_matrix(
as.simple_triplet_matrix(dtm_amazon[ytrain==1, ]), mean)
summ[,4] = colapply_simple_triplet_matrix(
as.simple_triplet_matrix(dtm_amazon[ytrain==1, ]), var)
n1 = sum((ytrain)-1);
n = length(ytrain)
n0 = n - n1
myp = (summ[,1] - summ[,3])/
sqrt(summ[,2]/n1 + summ[,4]/n0)
```
We ordered the words by the magnitude of their t-statistics and then divided them into two lists: positive words and negative words.
```{r}
words = colnames(dtm_amazon)
id = order(abs(myp), decreasing=TRUE)
pos.list = words[id[myp[id]>0]]
posvalue = myp[id][myp[id]>0][1:50]
neg.list = words[id[myp[id]<0]]
negvalue = myp[id][myp[id]<0][1:50]
```
We use the \verb|wordcloud| package to plot word clouds for the most representative words, both positive and negative.
```{r, message=FALSE, fig.show='hide', warning=FALSE}
# Word Cloud for positive words
library(wordcloud)
wordcloud(words = pos.list[1:50], freq = posvalue, scale=c(6,.2), min.freq = 5,
random.order=FALSE, rot.per=0.35, colors = brewer.pal(8, "Dark2"))
# Word Cloud for negative words
wordcloud(words = neg.list[1:50], freq = abs(negvalue), scale=c(4.3,.2), min.freq = 5,
random.order=FALSE, rot.per=0.35, colors = brewer.pal(8, "Dark2"))
```
```{r, message=FALSE}
library(png)
par(mfrow=c(1, 2), mar=c(1, 0, 3, 0))
plot.new()
plot.window(xlim=c(0, 1), ylim=c(0, 1), asp=1)
rasterImage(readPNG("amazonpos"), 0, 0, 1, 1)
title('Positive words', line = -0.5)
plot.new()
plot.window(xlim=c(0, 1), ylim=c(0, 1), asp=1)
rasterImage(readPNG("amazonneg"), 0, 0, 1, 1)
title('Negative words', line = -0.5)
title("Word Clouds from Amazon reviews", line = -22, outer = TRUE)
```
Then we use the DTM to train a machine learning classifier. We divide the DTM into training (80%) and testing (20%) sets.
```{r}
# Splitting the train and test DTM
dtm_amazon_train <- dtm_amazon[1:N_train_amazon, ]
dtm_amazon_test <- dtm_amazon[(N_train_amazon+1):N_amazon, ]
dtm_amazon_train_labels <- as.factor(as.character(text_amazon[1:N_train_amazon, ]$Sentiment))
dtm_amazon_test_labels <- as.factor(as.character(text_amazon[(N_train_amazon+1):N_amazon, ]$Sentiment))
```
Here we use Naive Bayes to create a classifier. Since Naive Bayes is generally trained on data with nominal features, the DTM needs to be converted to nominal values before it is fed to the model.
```{r}
# Convert the cell values with a non-zero value to Y, and in case of a zero we convert it to N
cellconvert<- function(x) { x <- ifelse(x > 0, "Y", "N") }
# Applying the function to rows in training and test datasets
dtm_amazon_train <- apply(dtm_amazon_train, MARGIN = 2,cellconvert)
dtm_amazon_test <- apply(dtm_amazon_test, MARGIN = 2,cellconvert)
```
Then, we proceed to build a text sentiment analysis classifier using the Naive Bayes algorithm from the \verb|e1071| package.
```{r, message=FALSE}
# Training the naive bayes classifier on the training dtm
library(e1071)
nb_amazon_senti_classifier <- naiveBayes(dtm_amazon_train,dtm_amazon_train_labels)
```
Finally, we used the trained Naive Bayes model to predict sentiment on the test DTM. The test accuracy for *Amazon Review* is
```{r, message=FALSE}
# Making predictions on the test data dtm
nb_amazon_predicts <- predict(nb_amazon_senti_classifier, dtm_amazon_test, type="class")
# Computing accuracy of the model
library(rminer)
print(mmetric(nb_amazon_predicts, dtm_amazon_test_labels, c("ACC")))
```
## Drug Review
The workflow for *Drug Review* is similar to the above.
```{r, message=FALSE, echo=FALSE, results='hide'}
library(SnowballC)
library(tm)
# Reading the transformed file as a dataframe
text_Drug <- read.table(file='Sentiment Analysis Dataset_Drug.csv', sep=',', header = TRUE)
# Transforming the text into volatile corpus
train_corp_Drug <- VCorpus(VectorSource(text_Drug$SentimentText))
```
The raw DTM for *Drug Review*:
```{r, echo=FALSE}
# Creating document term matrix
dtm_train_Drug <- DocumentTermMatrix(train_corp_Drug, control =
list(tolower = TRUE, removeNumbers = TRUE, stopwords = TRUE,
removePunctuation = TRUE, stemming = TRUE))
# Basic EDA on dtm
inspect(dtm_train_Drug)
```
The DTM after removing sparse terms:
```{r, echo=FALSE}
# Removing sparse terms
dtm_train_Drug <- removeSparseTerms(dtm_train_Drug, 0.99)
inspect(dtm_train_Drug)
```
```{r, echo=FALSE}
# Word Cloud preparing
v.size = dim(dtm_train_Drug)[2]
ytrain = as.numeric(text_Drug$Sentiment)
```
```{r, message=FALSE, echo=FALSE}
# Using the two-sample t-test to find the most representative words for the word clouds
library(slam)
summ = matrix(0, nrow=v.size, ncol=4)
summ[,1] = colapply_simple_triplet_matrix(
as.simple_triplet_matrix(dtm_train_Drug[ytrain==2, ]), mean)
summ[,2] = colapply_simple_triplet_matrix(
as.simple_triplet_matrix(dtm_train_Drug[ytrain==2, ]), var)
summ[,3] = colapply_simple_triplet_matrix(
as.simple_triplet_matrix(dtm_train_Drug[ytrain==1, ]), mean)
summ[,4] = colapply_simple_triplet_matrix(
as.simple_triplet_matrix(dtm_train_Drug[ytrain==1, ]), var)
n1 = sum((ytrain)-1);
n = length(ytrain)
n0 = n - n1
myp = (summ[,1] - summ[,3])/
sqrt(summ[,2]/n1 + summ[,4]/n0)
```
```{r, echo=FALSE}
words = colnames(dtm_train_Drug)
id = order(abs(myp), decreasing=TRUE)
pos.list = words[id[myp[id]>0]]
posvalue = myp[id][myp[id]>0][1:50]
neg.list = words[id[myp[id]<0]]
negvalue = myp[id][myp[id]<0][1:50]
```
```{r, message=FALSE, echo=FALSE, fig.show='hide', warning=FALSE}
# Word Cloud for positive words
library(wordcloud)
wordcloud(words = pos.list[1:50], freq = posvalue, scale=c(3,.5), min.freq = 5,
random.order=FALSE, rot.per=0.35, colors = brewer.pal(8, "Dark2"))
# Word Cloud for negative words
wordcloud(words = neg.list[1:50], freq = abs(negvalue), scale=c(3,.5), min.freq = 5,
random.order=FALSE, rot.per=0.35, colors = brewer.pal(8, "Dark2"))
```
We also use the \verb|wordcloud| package to plot the Word Cloud for *Drug Review*.
```{r, message=FALSE, echo=FALSE}
library(png)
par(mfrow=c(1, 2), mar=c(1, 0, 3, 0))
plot.new()
plot.window(xlim=c(0, 1), ylim=c(0, 1), asp=1)
rasterImage(readPNG("drugpos"), 0, 0, 1, 1)
title('Positive words', line = -0.5)
plot.new()
plot.window(xlim=c(0, 1), ylim=c(0, 1), asp=1)
rasterImage(readPNG("drugneg"), 0, 0, 1, 1)
title('Negative words', line = -0.5)
title("Word Clouds from Drug reviews", line = -22, outer = TRUE)
```
```{r, echo=FALSE}
# Splitting the train and test DTM
dtm_train_train_Drug <- dtm_train_Drug[1:N_train_Drug, ]
dtm_train_test_Drug <- dtm_train_Drug[(N_train_Drug+1):N_Drug, ]
dtm_train_train_Drug_labels <- as.factor(as.character(text_Drug[1:N_train_Drug, ]$Sentiment))
dtm_train_test_Drug_labels <- as.factor(as.character(text_Drug[(N_train_Drug+1):N_Drug, ]$Sentiment))
```
```{r, echo=FALSE}
# Convert the cell values with a non-zero value to Y, and in case of a zero we convert it to N
cellconvert <- function(x) { x <- ifelse(x > 0, "Y", "N") }
```
```{r, echo=FALSE}
# Applying the function to rows in training and test datasets
dtm_train_train_Drug <- apply(dtm_train_train_Drug, MARGIN = 2,cellconvert)
dtm_train_test_Drug <- apply(dtm_train_test_Drug, MARGIN = 2,cellconvert)
```
```{r, message=FALSE, echo=FALSE}
# Training the naive bayes classifier on the training dtm
library(e1071)
nb_senti_classifier_Drug <- naiveBayes(dtm_train_train_Drug, dtm_train_train_Drug_labels)
```
```{r, echo=FALSE}
# Making predictions on the test data dtm
nb_predicts_Drug <- predict(nb_senti_classifier_Drug, dtm_train_test_Drug, type="class")
```
The accuracy for testing *Drug Review* is
```{r, message=FALSE, echo=FALSE}
# Computing accuracy of the model
library(rminer)
print(mmetric(nb_predicts_Drug, dtm_train_test_Drug_labels, c("ACC")))
```
# Pretrained word2vec word embedding with Random Forest algorithm
\verb|Word2vec| was developed by @mikolov2013efficient to make neural-network-based training of word embeddings more efficient, and it has since become a standard method for producing pretrained word embeddings. The \verb|softmaxreg| library in R ships a pretrained \verb|word2vec| embedding that we can use to build our sentiment analysis engine for the review data. This pretrained embedding was trained with the word2vec model on the [Reuter_50_50 dataset from the UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets/Reuter_50_50).
After obtaining the word embedding, we computed an embedding for each review by taking the mean of the word vectors of all the words in the review (see the sketch below). A machine learning classifier is then applied to these review embeddings. Here we used the Random Forest algorithm for classification, achieving an accuracy of **62.56%** on *Amazon Review* and **70.99%** on *Drug Review*.
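As a quick illustration of what this mean-vector encoding amounts to, the sketch below averages the embedding rows for the words of one sentence by hand. It assumes, as in the \verb|softmaxreg| data, that the first column of \verb|word2vec| holds the word and the remaining columns hold its embedding; the \verb|wordEmbed(..., meanVec = TRUE)| call used later performs the equivalent lookup and average.
```{r, eval=FALSE}
# Illustrative sketch: average the embedding rows of the words in a sentence.
# Assumes the first column of `word2vec` is the word and the rest are dimensions.
library(softmaxreg)
data(word2vec)

mean_doc_vector <- function(sentence, embedding) {
  tokens <- tolower(unlist(strsplit(sentence, "\\s+")))
  rows <- embedding[embedding[[1]] %in% tokens, -1, drop = FALSE]
  if (nrow(rows) == 0) return(rep(0, ncol(embedding) - 1))  # no known words
  colMeans(rows)
}

mean_doc_vector("this product works great", word2vec)
```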
## Amazon review
First we load the pretrained \verb|word2vec| embeddings.
```{r, message=FALSE}
library(softmaxreg)
# Importing the word2vec pretrained vector into memory
data(word2vec)
```
To encode an entire review, we take the mean of the word vectors of all the words that make up the review. The \verb|softmaxreg| library offers the \verb|wordEmbed| function, to which we can pass a sentence and have it compute the mean word vector.
```{r}
# Function to get word vector for each review
docVectors <- function(x) { wordEmbed(x, word2vec, meanVec = TRUE) }
text_amazon <- read.csv(file='Sentiment Analysis Dataset.csv', header = TRUE)
# Applying the docVector function on each of the reviews
# Storing the matrix of word vectors as temp
temp_amazon <- t(sapply(text_amazon$SentimentText, docVectors))
```
This matrix can now be used to build classification models. We first split it into training (80%) and testing (20%) sets and then apply the Random Forest algorithm to build our classifier.
```{r, message=FALSE}
# Splitting the dataset into train and test
temp_amazon_train <- temp_amazon[1:N_train_amazon,]
temp_amazon_test <- temp_amazon[(N_train_amazon+1):N_amazon,]
labels_amazon_train <- as.factor(as.character(text_amazon[1:N_train_amazon,]$Sentiment))
labels_amazon_test <- as.factor(as.character(text_amazon[(N_train_amazon+1):N_amazon,]$Sentiment))
library(randomForest)
# Training a model using random forest classifier with training dataset
# Observe that we are using 20 trees to create the model
rf_amazon_senti_classifier <- randomForest(temp_amazon_train, labels_amazon_train, ntree=20)
print(rf_amazon_senti_classifier)
```
The accuracy for testing *Amazon Review* is
```{r, message=FALSE}
# Making predictions on the dataset
rf_amazon_predicts <- predict(rf_amazon_senti_classifier, temp_amazon_test)
library(rminer)
print(mmetric(rf_amazon_predicts, labels_amazon_test, c("ACC")))
```
## Drug Review
The workflow for *Drug Review* is similar to the above.
```{r, message=FALSE, echo=FALSE}
library(softmaxreg)
# Importing the word2vec pretrained vector into memory
data(word2vec)
```
```{r, echo=FALSE}
# Function to get word vector for each review
docVectors = function(x) { wordEmbed(x, word2vec, meanVec = TRUE) }
text_Drug <- read.csv(file='Sentiment Analysis Dataset_Drug.csv', header = TRUE)
# Applying the docVector function on each of the reviews
# Storing the matrix of word vectors as temp
temp_Drug <- t(sapply(text_Drug$SentimentText, docVectors))
```
```{r, message=FALSE, echo=FALSE}
# Splitting the dataset into train and test
temp_train_Drug <- temp_Drug[1:N_train_Drug,]
temp_test_Drug <- temp_Drug[(N_train_Drug+1):N_Drug,]
labels_train_Drug <- as.factor(as.character(text_Drug[1:N_train_Drug,]$Sentiment))
labels_test_Drug <- as.factor(as.character(text_Drug[(N_train_Drug+1):N_Drug,]$Sentiment))
library(randomForest)
# Training a model using random forest classifier with training dataset
# Observe that we are using 20 trees to create the model
rf_senti_classifier_Drug <- randomForest(temp_train_Drug, labels_train_Drug,ntree=20)
print(rf_senti_classifier_Drug)
```
The accuracy for testing *Drug Review* is
```{r, message=FALSE, echo=FALSE}
# Making predictions on the dataset
rf_predicts_Drug <- predict(rf_senti_classifier_Drug, temp_test_Drug)
library(rminer)
print(mmetric(rf_predicts_Drug, labels_test_Drug, c("ACC")))
```
# GloVe word embedding with Random Forest algorithm
@pennington2014glove developed an extension of the \verb|word2vec| approach called Global Vectors for Word Representation (\verb|GloVe|) for efficiently learning word vectors. It combines the global statistics of matrix factorization techniques with the local context-based learning used in \verb|word2vec|. Unlike \verb|word2vec|, which learns from one local window at a time, \verb|GloVe| constructs an explicit word co-occurrence matrix using statistics from the whole text corpus. As a result, the learned word embeddings are generally better. The toy example below illustrates such a co-occurrence matrix.
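As a toy illustration (our own minimal example, not part of the report's pipeline), the chunk below builds the term co-occurrence matrix that \verb|GloVe| is trained on for a two-sentence corpus, using the same \verb|text2vec| calls as in the next subsection.
```{r, eval=FALSE}
# Toy example: the term co-occurrence matrix (TCM) that GloVe factorizes
library(text2vec)
toy <- c("good product works well", "bad product broke fast")
it_toy <- itoken(space_tokenizer(toy), progressbar = FALSE)
vocab_toy <- create_vocabulary(it_toy)
tcm_toy <- create_tcm(it_toy, vocab_vectorizer(vocab_toy), skip_grams_window = 2L)
# Each entry is a (distance-weighted) co-occurrence count for a word pair
print(as.matrix(tcm_toy))
```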
The \verb|text2vec| package in R has a \verb|GloVe| implementation that we can train to obtain word embeddings from our own corpus. As in the previous part, we used the \verb|softmaxreg| library to obtain the mean word vector for each review. We again used the Random Forest algorithm for classification, achieving an accuracy of **72.72%** on *Amazon Review* and **74.96%** on *Drug Review*.
## Amazon review
The following code demonstrates how \verb|GloVe| word embeddings can be created and used for sentiment analysis.
```{r, message=FALSE}
library(text2vec)
# Reading the dataset
text_amazon <- read.csv(file='Sentiment Analysis Dataset.csv', header = TRUE)
# Subsetting only the review text so as to create Glove word embedding
wiki_amazon <- as.character(text_amazon$SentimentText)
# Create iterator over tokens
tokens_amazon <- space_tokenizer(wiki_amazon)
# Create vocabulary. Terms will be unigrams (simple words).
it_amazon <- itoken(tokens_amazon, progressbar = FALSE)
vocab_amazon <- create_vocabulary(it_amazon)
# Consider a term in the vocabulary if and only if the term has appeared at least
# three times in the dataset
vocab_amazon <- prune_vocabulary(vocab_amazon, term_count_min = 3L)
# Use the filtered vocabulary
vectorizer_amazon <- vocab_vectorizer(vocab_amazon)
# Use a window of 5 context words and create a term co-occurrence matrix
tcm_amazon <- create_tcm(it_amazon, vectorizer_amazon, skip_grams_window = 5L)
# Create the GloVe embedding for each word in the vocab;
# the dimension of the word embedding is set to 50.
# x_max is the maximum number of co-occurrences used in the weighting function
glove <- GlobalVectors$new(rank = 50, x_max = 100)
wv_main_amazon <- glove$fit_transform(tcm_amazon, n_iter = 10, convergence_tol = 0.01)
```
The following uses the \verb|GloVe| model to obtain the combined word vector.
```{r, message=FALSE}
# Glove model learns two sets of word vectors - main and context.
# Both matrices may be added to get the combined word vector
wv_context <- glove$components
word_vectors_amazon <- wv_main_amazon + t(wv_context)
# Converting the word_vector to a dataframe for visualization
word_vectors_amazon <- data.frame(word_vectors_amazon)
# The word for each embedding is set as row name by default
# Using the tibble library rownames_to_column function, the rownames is copied
# as first column of the dataframe
# We also name the first column of the dataframe as words
library(tibble)
word_vectors_amazon <- rownames_to_column(word_vectors_amazon, var = "words")
```
We used the \verb|softmaxreg| library to obtain the mean word vector for each review.
```{r, message=FALSE}
library(softmaxreg)
docVectors_amazon = function(x) { wordEmbed(x, word_vectors_amazon, meanVec = TRUE) }
# Applying the function docVectors function on the entire reviews dataset
# This will result in word embedding representation of the entire reviews dataset
temp_amazon <- t(sapply(text_amazon$SentimentText, docVectors_amazon))
```
We split the dataset into 80% train and 20% test portions and used the Random Forest algorithm to build a model to train.
```{r, message=FALSE}
# Splitting the dataset into train and test portions
temp_amazon_train <- temp_amazon[1:N_train_amazon,]
temp_amazon_test <- temp_amazon[(N_train_amazon+1):N_amazon,]
labels_amazon_train <- as.factor(as.character(text_amazon[1:N_train_amazon,]$Sentiment))
labels_amazon_test <- as.factor(as.character(text_amazon[(N_train_amazon+1):N_amazon,]$Sentiment))
# Using randomforest to build a model on train data
library(randomForest)
rf_amazon_senti_classifier <- randomForest(temp_amazon_train, labels_amazon_train,ntree=20)
```
Finally, the accuracy for testing *Amazon Review* is
```{r, message=FALSE}
# Predicting labels using the randomforest model created
rf_amazon_predicts <- predict(rf_amazon_senti_classifier, temp_amazon_test)
# Estimating the accuracy from the predictions
library(rminer)
print(mmetric(rf_amazon_predicts, labels_amazon_test, c("ACC")))
```
## Drug Review
The workflow for *Drug Review* is similar.
```{r, message=FALSE, echo=FALSE}
library(text2vec)
# Reading the dataset
text_Drug <- read.csv(file='Sentiment Analysis Dataset_Drug.csv', header = TRUE)
# Subsetting only the review text so as to create Glove word embedding
wiki_Drug <- as.character(text_Drug$SentimentText)
# Create iterator over tokens
tokens_Drug <- space_tokenizer(wiki_Drug)
# Create vocabulary. Terms will be unigrams (simple words).
it_Drug <- itoken(tokens_Drug, progressbar = FALSE)
vocab_Drug <- create_vocabulary(it_Drug)
# Consider a term in the vocabulary if and only if the term has appeared at least
# three times in the dataset
vocab_Drug <- prune_vocabulary(vocab_Drug, term_count_min = 3L)
# Use the filtered vocabulary
vectorizer_Drug <- vocab_vectorizer(vocab_Drug)
# Use a window of 5 context words and create a term co-occurrence matrix
tcm_Drug <- create_tcm(it_Drug, vectorizer_Drug, skip_grams_window = 5L)
# Create the GloVe embedding for each word in the vocab;
# the dimension of the word embedding is set to 50.
# x_max is the maximum number of co-occurrences used in the weighting function
glove <- GlobalVectors$new(rank = 50, x_max = 100)
wv_main_Drug <- glove$fit_transform(tcm_Drug, n_iter = 10, convergence_tol = 0.01)
```
```{r, message=FALSE, echo=FALSE}
# Glove model learns two sets of word vectors - main and context
# Both matrices may be added to get the combined word vector
wv_context <- glove$components
word_vectors_Drug <- wv_main_Drug + t(wv_context)
# Converting the word_vector to a dataframe for visualization
word_vectors_Drug <- data.frame(word_vectors_Drug)
# The word for each embedding is set as row name by default
# Using the tibble library rownames_to_column function, the rownames is copied
# as first column of the dataframe
# We also name the first column of the dataframe as words
library(tibble)
word_vectors_Drug <- rownames_to_column(word_vectors_Drug, var = "words")
```
```{r, message=FALSE, echo=FALSE}
library(softmaxreg)
docVectors_Drug = function(x) { wordEmbed(x, word_vectors_Drug, meanVec = TRUE) }
# Applying the function docVectors function on the entire reviews dataset
# This will result in word embedding representation of the entire reviews dataset
temp_Drug <- t(sapply(text_Drug$SentimentText, docVectors_Drug))
```
```{r, message=FALSE, echo=FALSE}
# Splitting the dataset into train and test portions
temp_train_Drug <- temp_Drug[1:N_train_Drug,]
temp_test_Drug <- temp_Drug[(N_train_Drug+1):N_Drug,]
labels_train_Drug <- as.factor(as.character(text_Drug[1:N_train_Drug,]$Sentiment))
labels_test_Drug <- as.factor(as.character(text_Drug[(N_train_Drug+1):N_Drug,]$Sentiment))
# Using randomforest to build a model on train data
library(randomForest)
rf_senti_classifier_Drug <- randomForest(temp_train_Drug, labels_train_Drug,ntree=20)
```
The accuracy for testing *Drug Review* is
```{r, message=FALSE, echo=FALSE}
# Predicting labels using the randomforest model created
rf_predicts_Drug<-predict(rf_senti_classifier_Drug, temp_test_Drug)
# Estimating the accuracy from the predictions
library(rminer)
print(mmetric(rf_predicts_Drug, labels_test_Drug, c("ACC")))
```
# FastText word embedding
\verb|FastText| is also an extension of \verb|word2vec|; we use the \verb|fastTextR| package for this part of the analysis. Created and open-sourced by Facebook in 2016 [@fastText], \verb|FastText| is a powerful tool for classifying text and learning word vector representations by breaking words into character n-grams. Because of this, \verb|FastText| can construct a vector for a word from its character n-grams even if the word does not appear in the training corpus, although training can be time-consuming. The short sketch below illustrates the character n-gram decomposition.
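The following sketch (our own illustration, not code from \verb|fastTextR|) shows the character n-gram decomposition fastText relies on: each word is wrapped in angle brackets and split into all substrings of length \verb|min_n| to \verb|max_n| (fastText's defaults are 3 to 6), so related forms such as "help" and "helped" share many subword vectors.
```{r, eval=FALSE}
# Illustration of fastText-style character n-grams (not fastTextR code)
char_ngrams <- function(word, min_n = 3, max_n = 6) {
  w <- paste0("<", word, ">")           # word boundaries marked with < and >
  out <- character(0)
  for (n in min_n:max_n) {
    if (nchar(w) >= n) {
      for (i in 1:(nchar(w) - n + 1)) out <- c(out, substr(w, i, i + n - 1))
    }
  }
  unique(out)
}
char_ngrams("helped", min_n = 3, max_n = 4)
# "<he" "hel" "elp" "lpe" "ped" "ed>" "<hel" "help" "elpe" "lped" "ped>"
```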
Before training the model, we convert each label in the dataset from \verb|"1"| (or \verb|"2"|) into \verb|__label__1| (or \verb|__label__2|) to match the format expected by the \verb|FastText| algorithm, and we replace runs of multiple spaces in the text with a single space. We then used the \verb|ft_train| function to train the model and \verb|ft_control| to set the hyper-parameters for our two datasets. Our best accuracy with the fastText model is **86.49%** on the *Amazon Review* dataset and **78.69%** on the *Drug Review* dataset.
## Amazon review
We used the \verb|fastTextR| library for this problem to build a sentiment analysis engine on *Amazon Review*.
```{r, message=FALSE}
library(fastTextR)
# Input reviews file
text_amazon <- readLines("Sentiment Analysis Dataset_ft.txt")
```
We divided the reviews into 80% training and 20% testing datasets.
```{r}
# Dividing the reviews into training and test
temp_amazon_train <- text_amazon[1:N_train_amazon]
temp_amazon_test <- text_amazon[(N_train_amazon+1):N_amazon]
```
We then created a \verb|.txt| file for the train and test dataset.
```{r}
# Creating txt file for train and test dataset
fileConn <- file("train.ft.txt")
writeLines(temp_amazon_train, fileConn)
close(fileConn)
fileConn <- file("test.ft.txt")
writeLines(temp_amazon_test, fileConn)
close(fileConn)
# Creating a test file with no labels
temp_amazon_test_nolabel <- gsub("__label__1", "", temp_amazon_test, perl=TRUE)
temp_amazon_test_nolabel <- gsub("__label__2", "", temp_amazon_test_nolabel, perl=TRUE)
```
We also wrote the unlabeled test dataset to a file so that we can use it for prediction.
```{r}
fileConn <- file("test_nolabel.ft.txt")
writeLines(temp_amazon_test_nolabel, fileConn)
close(fileConn)
# Training a supervised classification model with training dataset file
model_amazon <- ft_train("train.ft.txt", method = "supervised",
control = ft_control(nthreads = 3L, seed = 1))
# Obtain all the words from a previously trained model
words_amazon <- ft_words(model_amazon)
# Obtain word vectors from a previously trained model.
word_vec_amazon <- ft_word_vectors(model_amazon, words_amazon)
```
The estimate of the accuracy for testing *Amazon Review* is
```{r, message=FALSE}
# Predicting the labels for the reviews in the no labels test dataset
# Getting the predictions into a dataframe so as to compute performance measurement
ft_preds_amazon <- ft_predict(model_amazon, newdata = temp_amazon_test_nolabel)
# Reading the test file to extract the actual labels
reviewstestfile_amazon <- readLines("test.ft.txt")
# Extracting just the labels from each line
library(stringi)
actlabels_amazon <- stri_extract_first(reviewstestfile_amazon, regex="\\w+")
# Converting the actual labels and predicted labels into factors
actlabels_amazon <- as.factor(as.character(actlabels_amazon))
ft_preds_amazon <- as.factor(as.character(ft_preds_amazon$label))
# Getting the estimate of the accuracy
library(rminer)
print(mmetric(actlabels_amazon, ft_preds_amazon, c("ACC")))
```
## Drug Review
```{r, message=FALSE, echo=FALSE}
library(fastTextR)
# Input reviews file
text_Drug <- readLines("Sentiment Analysis Dataset_ft_Drug.txt")
```
```{r, echo=FALSE}
# Dividing the reviews into training and test
temp_train_Drug <- text_Drug[1:N_train_Drug]
temp_test_Drug <- text_Drug[(N_train_Drug+1):N_Drug]
```
```{r, echo=FALSE}
# Creating txt file for train and test dataset
fileConn <- file("train_Drug.ft.txt")
writeLines(temp_train_Drug, fileConn)
close(fileConn)
fileConn <- file("test_Drug.ft.txt")
writeLines(temp_test_Drug, fileConn)
close(fileConn)
# Creating a test file with no labels
temp_test_Drug_nolabel <- gsub("__label__1", "", temp_test_Drug, perl=TRUE)
temp_test_Drug_nolabel <- gsub("__label__2", "", temp_test_Drug_nolabel, perl=TRUE)
```
```{r, echo=FALSE}
fileConn <- file("test_Drug_nolabel.ft.txt")
writeLines(temp_test_Drug_nolabel, fileConn)
close(fileConn)
# Training a supervised classification model with training dataset file
model_Drug <- ft_train("train_Drug.ft.txt", method = "supervised", control = ft_control(nthreads = 3L, seed = 1))
# Obtain all the words from a previously trained model
words_Drug <- ft_words(model_Drug)
```
```{r, echo=FALSE}
# Obtain word vectors from a previously trained model.
word_vec_Drug <- ft_word_vectors(model_Drug, words_Drug)
```
The estimate of the accuracy for testing *Drug Review* is
```{r, message=FALSE, echo=FALSE}
# Predicting the labels for the reviews in the no labels test dataset
# Getting the predictions into a dataframe so as to compute performance measurement
ft_preds_Drug <- ft_predict(model_Drug, newdata = temp_test_Drug_nolabel)
# Reading the test file to extract the actual labels
reviewstestfile_Drug <- readLines("test_Drug.ft.txt")
# Extracting just the labels from each line
library(stringi)
actlabels_Drug <- stri_extract_first(reviewstestfile_Drug, regex="\\w+")
# Converting the actual labels and predicted labels into factors
actlabels_Drug <- as.factor(as.character(actlabels_Drug))
ft_preds_Drug <- as.factor(as.character(ft_preds_Drug$label))
# Getting the estimate of the accuracy
library(rminer)
print(mmetric(actlabels_Drug, ft_preds_Drug, c("ACC")))
```
# Conclusion & Discussion
In conclusion, comparing all of our models after fine-tuning, the \verb|FastText| model performs best, with **86.49%** accuracy on the *Amazon Review* dataset and **78.69%** on the *Drug Review* dataset. Since \verb|FastText| represents each word as the sum of its bag of character n-grams, it is much more efficient at handling large corpora and can compute embeddings for words unseen in the training set [@whyFT]. These features also let \verb|FastText| cope with typos and different word forms without treating them as unrelated words. For example, "helped" and "help" are the same word in different tenses, yet models other than \verb|FastText| may treat them as two different words and assign the wrong labels. Using fastText can therefore significantly boost performance.
Overall, the pretrained \verb|word2vec| model yields relatively low accuracy on both datasets (**62.56%** for *Amazon Review* and **70.99%** for *Drug Review*). One possible reason is that the pretrained embedding from the \verb|softmaxreg| package only covers words seen in its own training corpus: when a review contains words the embedding cannot convert, they are mapped to zero vectors, which loses information and dilutes the contribution of the other words. In the future, we may want to create and train our own word embeddings using CBOW and Continuous Skip-Gram to see whether improvements can be made.
For future work, we could try BERT to better represent the text and experiment with additional classification algorithms, such as XGBoost and AdaBoost, on the sentence embeddings. Moreover, since we only split our data 80-20 into training and test sets, we may want to split the held-out portion further into a 10% validation set and a 10% test set so that we can select better hyperparameters and achieve higher accuracy [@validSet].
\newpage
# References