---
title: "Stat 360 Topic Modeling"
author: "Meredith Brown and Caleb Kornfein"
date: "`r Sys.Date()`"
output: html_document
---
## Loading Packages
```{r packages, message=FALSE}
library(tidyverse)
library(tm)
library(stringr)
library(stringi)
library(wordcloud)
library(topicmodels)
library(ldatuning)
library(usethis)
library(LDAvis)
library(servr)
library(lda)
library(gistr)
library(reshape2)
library(plot.matrix)
```
## Loading Data
```{r loading, message=FALSE}
debates <- read_csv("debate_transcripts_v3_2020-02-26.csv")
```
## Cleaning Data
```{r data}
# Set global seed
set.seed(1)
# Here we drop all the columns except speaker and speech
keeps <- c("speaker", "speech")
debates <- debates[, keeps, drop = FALSE]
# All Democratic candidates who appeared in any of the debates (compiled from an online search)
demCandidates <- c("Michael Bennet", "Joe Biden", "Cory Booker", "Pete Buttigieg", "Julian Castro", "Bill de Blasio", "John Delaney", "Tulsi Gabbard", "Kirsten Gillibrand", "Kamala Harris", "Jay Inslee", "Amy Klobuchar", "Beto O'Rourke", "Tim Ryan", "Bernie Sanders", "Eric Swalwell", "Elizabeth Warren", "Marianne Williamson", "Andrew Yang", "Michael Bloomberg")
# Filtering out rows not spoken by one of the Democratic candidates. Note: Beto O'Rourke's name contains a special character, which required care to match.
debates <- subset(debates, speaker %in% demCandidates)
# Double checking we got everyone
length(demCandidates)
length(unique(debates$speaker))
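# Optional sanity check (added sketch): setdiff() lists any candidate from our
# list whose name never appears in the speaker column (e.g., due to encoding issues)
setdiff(demCandidates, unique(debates$speaker))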
# Collapsing the speech text by speaker, so that each candidate's total responses across all of the debates form one document
debates <- aggregate(speech ~ speaker, data = debates, FUN = paste, collapse = " ")
## Next we remove punctuation from the speeches
debates$speech <- removePunctuation(debates$speech)
```
## Text Preprocessing with the tm Package
```{r tmMagic}
# defining a "corpus" from the data frame
debates_source <- VectorSource(debates$speech)
debates_corpus <- VCorpus(debates_source)
# stripping whitespace
debates_corpus <- tm_map(debates_corpus, stripWhitespace)
# converting to all lower case
debates_corpus <- tm_map(debates_corpus, content_transformer(tolower))
# removing common stopwords aka words such as 'a', 'the', 'also'
debates_corpus <- tm_map(debates_corpus, removeWords, stopwords("english"))
# removing numbers
debates_corpus <- tm_map(debates_corpus, removeNumbers)
# removing punctuation
debates_corpus <- tm_map(debates_corpus, removePunctuation)
# removing other uninformative words that came up
Stopwords <- c("can", "actually", "weve", "also", "whats", "youre", "think", "tell", "yes", "want", "well", "bit", "get", "thats", "theyre", "thing", "ive", "able", "like", "lets", "lot", "let", "one", "will", "got", "dont", "said", "put", "done", "going", "around", "part", "two", "good", "act", "say", "sure", "things", "make", "theres", "way", "say", "take", "talk", "something", "many", "new", "people", "see", "just", "need", "now", "away", "even", "much", "really", "cant", "day", "number", "look", "matter", "clear", "understand", "know")
debates_corpus <- tm_map(debates_corpus, removeWords, Stopwords)
```
## Creating the Document-Term Matrix
```{r matrixMagic}
# Create a document-term matrix with one row per candidate/document (rows 1-20) and one column per word. Entry (i, j) gives the number of times candidate i spoke word j.
dtm <- DocumentTermMatrix(debates_corpus)
rownames(dtm) <- debates$speaker
inspect(dtm)
# list of words spoken by candidates at least 100 total times across the debates
findFreqTerms(dtm, 100)
```
## Visualizing data as wordclouds
```{r wordcloud}
# wordcloud of all the candidates together
freq <- colSums(as.matrix(dtm))
wordcloud(names(freq), freq, min.freq=175, colors=brewer.pal(6, "Dark2"), random.color = TRUE)
```
## Finding Optimal Number of Topics
The `FindTopicsNumber` function comes from the `ldatuning` package. It "calculates different metrics to estimate the most preferable number of topics for LDA model" [https://cran.r-project.org/web/packages/ldatuning/ldatuning.pdf]. Here, it evaluates candidate numbers of topics using four scoring algorithms: Arun 2010, CaoJuan 2009, Griffiths 2004, and Deveaud 2014.
```{r optimalTopics, cache=TRUE}
N <- 2500; ALPHA <- 0.01; BURNIN <- 200
topicStats <- FindTopicsNumber(dtm
, topics = seq(3, 35, by = 2)
, metrics = c("Arun2010", "CaoJuan2009", "Griffiths2004", "Deveaud2014")
, method = "Gibbs"
                          , control = list(alpha = ALPHA # alpha hyperparameter for topics over documents
                                           , seed = 1 # setting seed to 1
                                           , delta = .01 # delta hyperparameter for words over topics
                                           , iter = N # number of Gibbs iterations
                                           , burnin = BURNIN) # adding burn-in period
                          , mc.cores = 4L) # number of cores used for parallel computation
```
This function takes the output from the previous function, a number of topics and the corresponding value from the given metric, and plots the change in values over the number of topics for each metric:
```{r plotting-model-criterion}
FindTopicsNumber_plot(topicStats)
```
## LDA Model
The `LDA` function is from the `topicmodels` package. It "estimate[s] a LDA model using for example...Gibbs Sampling" [https://cran.r-project.org/web/packages/topicmodels/topicmodels.pdf]. It returns an object of class LDA, on which we can perform further analysis. The value of K was chosen after analyzing the plot above of the 4 scoring metrics for the optimal number of topics.
```{r model, cache=TRUE}
K <- 13; S <- 10000; ALPHA <- 0.01; BURNIN <- 200 # ALPHA and BURNIN repeated here so this chunk runs on its own
lda <- LDA(dtm #inputting our document-term-matrix
, method = "Gibbs" #we will be using Gibbs sampling
, control = list(alpha = ALPHA #inputting alpha hyperparameter
, seed = 1 #setting seed to 1
, delta = .01 #setting delta hyperparameter
, iter = S #number of trials equal to S
, burnin = BURNIN) #adding burn in period
, k = K) #fixing the number of topics
```
These two functions return a matrix or list containing the most likely topics for each document and the most likely terms for each topic, respectively:
```{r visualizing}
topics(lda, k = 20)
terms(lda, k = 20)
```
```{r heatmap}
plot(topics(lda, k = K), fmt.cell='%.2g', col = topo.colors, ylab = "Rank", cex.axis = .60, las = 2,
     main = "Chart of Rank of Topics per Candidate", xlab = " ", breaks = 1:13)
```
After fitting the LDA model above, the `posterior` function evaluates the posterior probabilities of the terms within each topic and of the topics within each document.
```{r posterior-lda}
beta <- posterior(lda)$terms # beta is the terms per topic
theta <- posterior(lda)$topics # theta is the topics per document
```
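As a quick sanity check (our own addition, not part of the original analysis), we can confirm the shapes of these matrices: `beta` is topics-by-terms, `theta` is documents-by-topics, and each row of `theta` sums to 1.
```{r posterior-dimensions}
dim(beta)       # K topics x vocabulary size
dim(theta)      # 20 candidates (documents) x K topics
rowSums(theta)  # each candidate's topic proportions sum to 1
```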
## Visualized topics and terms
Next, we can assign pseudo-names for each topic by concatenating the 4 terms with the highest probability from each topic.
```{r pseudo-topic-names}
pseudo_name <- terms(lda, 4)
topicNames <- apply(pseudo_name, 2, paste, collapse="-")
topicNames
```
```{r topic-wordclouds, warning=FALSE, message=FALSE}
for(i in 1:K){
  top40terms <- sort(beta[i, ], decreasing = TRUE)[1:40] # 40 highest-probability terms for topic i
  words <- names(top40terms)
  probabilities <- top40terms
  wordcloud(words, probabilities, random.order = FALSE, scale = c(4, .5), rot.per = .5,
            colors = brewer.pal(6, "Dark2"), random.color = FALSE)
}
```
## Re-ranking Topics
As referenced in [https://tm4ss.github.io/docs/Tutorial_6_Topic_Models.html], TF-IDF, otherwise known as Term Frequency - Inverse Document Frequency, calculates "the frequency of a term adjusted for how rarely it is used" [https://www.tidytextmining.com/tfidf.html]. In this way, terms that appear across many topics are down-weighted, drawing attention to rarer terms that are more distinctive of, and thus more significant in defining, a particular topic.
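Before re-ranking, a minimal sketch of TF-IDF itself may help. The `tm` package provides `weightTfIdf()`, which can re-weight our existing document-term matrix; this chunk (our own illustration, not part of the original pipeline, with chunk and variable names of our choosing) shows the most distinctive terms for the first candidate in the matrix.
```{r tfidf-sketch}
# Illustrative only: TF-IDF weighting of the document-term matrix built earlier
dtm_tfidf <- weightTfIdf(dtm)
first_candidate <- rownames(dtm_tfidf)[1]
tfidf_scores <- as.matrix(dtm_tfidf)[first_candidate, ]
head(sort(tfidf_scores, decreasing = TRUE), 10) # terms most distinctive to this candidate
```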
```{r re-rank}
# re-rank top topic terms for topic names
topicNames <- apply(lda::top.topic.words(beta, 4, by.score = T), 2, paste, collapse = " ")
# Most probable topics across the corpus, sorted in descending order
topicProportions <- colSums(theta) / nrow(dtm) # mean topic probabilities over all documents (candidates)
names(topicProportions) <- topicNames # assign the topic names created before
sort(topicProportions, decreasing = TRUE) # show summed proportions (descending)
```
```{r visualize-topic-proportion}
data2 <- melt(t(theta))
mycolors <- colorRampPalette(brewer.pal(8, "Set2"))(K)
glimpse(data2)
ggplot(data2, mapping=aes(x=Var2, y=value, fill= as.factor(Var1))) +
geom_bar(stat="identity") +
labs(y = "Proportion", x = "Candidate", fill = "Topic",
title = "Per Candidate Proportion of Topics") +
scale_fill_manual(values = mycolors) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
## Visualize fitted model with LDAvis
Function from [https://gist.github.com/trinker/477d7ae65ff6ca73cace] to connect the output from the LDA function in the `topicmodels` package and the visualization tools in the `LDAvis` package:
```{r create-json}
topicmodels2LDAvis <- function(x, ...){
post <- posterior(x)
if (ncol(post[["topics"]]) < 3) stop("The model must contain > 2 topics")
mat <- x@wordassignments
createJSON(
phi = post[["terms"]],
theta = post[["topics"]],
vocab = colnames(post[["terms"]]),
doc.length = slam::row_sums(mat, na.rm = TRUE),
term.frequency = slam::col_sums(mat, na.rm = TRUE)
)
}
json <- topicmodels2LDAvis(lda)
RJSONIO::fromJSON(json)$topic.order
```
```{r visualize-json, warning=FALSE}
serVis(json, open.browser = interactive(), out.dir = "LDAvis_final_files",
as.gist = TRUE, description = "Topics from the 2019-2020 Democratic Debates")
```