---
title: 'Tutorial 6: Topic Models'
author: "Andreas Niekler, Gregor Wiedemann"
date: "`r format(Sys.time(), '%Y-%m-%d')`"
output:
  pdf_document:
    toc: yes
  html_document:
    number_sections: yes
    theme: united
    toc: yes
    toc_float: yes
    highlight: tango
csl: springer.csl
bibliography: references.bib
---
```{r klippy, echo=FALSE, include=TRUE}
klippy::klippy()
options(width = 98)
```
This exercise demonstrates the use of topic models on a text corpus for the extraction of latent semantic contexts in the documents. In this exercise we will:
1. Read in and preprocess text data,
2. Calculate a topic model using the R package *topicmodels* and analyze its results in more detail,
3. Visualize the results from the calculated model, and
4. Select documents based on their topic composition.
The process starts as usual with reading the corpus data. Change to your working directory, create a new R script, load the quanteda package and define a few already known default variables.
```{r setup, results='hide', message=FALSE, warning=FALSE}
# setwd("Your work directory")
options(stringsAsFactors = FALSE)
library(quanteda)
require(topicmodels)
```
The 231 SOTU addresses are rather long documents. Document length clearly affects the results of topic modeling. For very short texts (e.g. Twitter posts) or very long texts (e.g. books), it can make sense to concatenate or split single documents to obtain longer or shorter textual units for modeling.
For the SOTU speeches, for instance, the model could be inferred on paragraphs instead of entire speeches (see optional exercise 3 below). By manual inspection of the results you can check whether this procedure yields better interpretable topics. In `sotu_paragraphs.csv`, we provide a paragraph-separated version of the speeches.
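If you want to start from the paragraph version right away, it can be read in the same way (a minimal sketch, assuming the file shares the column layout of `sotu.csv`):
```{r eval=FALSE}
# read the paragraph-separated version instead of the full speeches
textdata <- read.csv("data/sotu_paragraphs.csv", sep = ";", encoding = "UTF-8")
```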
For text preprocessing, we remove stopwords, since they tend to occur as "noise" in the estimated topics of the LDA model.
```{r initalisierung, results='hide', message=FALSE, warning=FALSE, cache = T}
textdata <- read.csv("data/sotu.csv", sep = ";", encoding = "UTF-8")
sotu_corpus <- corpus(textdata$text, docnames = textdata$doc_id)
# Build a dictionary of lemmas
lemma_data <- read.csv("resources/baseform_en.tsv", encoding = "UTF-8")
# extended stopword list
stopwords_extended <- readLines("resources/stopwords_en.txt", encoding = "UTF-8")
# Create a DTM (may take a while)
corpus_tokens <- sotu_corpus %>%
tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
tokens_tolower() %>%
tokens_replace(lemma_data$inflected_form, lemma_data$lemma, valuetype = "fixed") %>%
tokens_remove(pattern = stopwords_extended, padding = T)
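# note: since quanteda v3, textstat_collocations() lives in the quanteda.textstats package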
sotu_collocations <- textstat_collocations(corpus_tokens, min_count = 25)
sotu_collocations <- sotu_collocations[1:250, ]
corpus_tokens <- tokens_compound(corpus_tokens, sotu_collocations)
```
# Model calculation
After the preprocessing, we have two objects: the tokens object `corpus_tokens`, on which we calculate an LDA topic model @blei_latent_2003. To this end, punctuation, numbers, symbols and stopwords were removed, and tokens were lowercased and lemmatized. The corpus object `sotu_corpus` serves to view the original texts and thus facilitates a qualitative control of the topic model results.
We now calculate a topic model on `corpus_tokens`. For this purpose, a document-term matrix (DTM) of the corpus is created. In this case, we only want to consider terms that occur with a certain minimum frequency in the corpus. This is primarily used to speed up the model calculation.
``` {r DTM, cache=TRUE}
# Create DTM, but remove terms which occur in less than 1% of all documents
DTM <- corpus_tokens %>%
tokens_remove("") %>%
dfm() %>%
dfm_trim(min_docfreq = 0.01, max_docfreq = Inf, docfreq_type = "prop")
# have a look at the number of documents and terms in the matrix
dim(DTM)
```
For topic modeling, not only language-specific stop words may be considered uninformative, but also domain-specific terms. We remove the 10 most frequent terms to improve the modeling.
``` {r DTM2, cache=TRUE}
top10_terms <- c( "unite_state", "past_year", "year_ago", "year_end", "government", "state", "country", "year", "make", "seek")
DTM <- DTM[, !(colnames(DTM) %in% top10_terms)]
# due to vocabulary pruning, we have empty rows in our DTM
# LDA does not like this. So we remove those docs from the
# DTM and the metadata
sel_idx <- rowSums(DTM) > 0
DTM <- DTM[sel_idx, ]
textdata <- textdata[sel_idx, ]
```
As an unsupervised machine learning method, topic models are suitable for the exploration of data. The calculation of topic models aims to determine the proportionate composition of a fixed number of topics in the documents of a collection. It is useful to experiment with different parameters in order to find the most suitable parameters for your own analysis needs.
For parameterized models such as Latent Dirichlet Allocation (LDA), the number of topics `K` is the most important parameter to define in advance. How an optimal `K` should be selected depends on various factors. If `K` is too small, the collection is divided into a few very general semantic contexts. If `K` is too large, the collection is divided into too many topics of which some may overlap and others are hardly interpretable.
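As a rough heuristic, you can compare the model fit for several candidate values of `K` before settling on one (a minimal sketch; held-out perplexity or topic coherence measures would be more principled):
```{r eval=FALSE}
# compare model fit (log-likelihood) for several candidate K values;
# higher is better, but large K tends to overfit
for (k in c(5, 10, 20, 30)) {
  m <- LDA(DTM, k, method = "Gibbs", control = list(iter = 200, seed = 1))
  cat("K =", k, "logLik =", logLik(m), "\n")
}
```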
For our first analysis we choose a thematic "resolution" of `K = 20` topics. In contrast to a resolution of 100 or more, this number of topics can be evaluated qualitatively quite easily. We also set the seed for the random number generator to ensure reproducible results across repeated model inferences.
``` {r LDA, cache=TRUE, results='hide', message=FALSE, warning=FALSE}
# load package topicmodels
require(topicmodels)
# number of topics
K <- 20
# compute the LDA model, inference via n iterations of Gibbs sampling
topicModel <- LDA(DTM, K, method="Gibbs", control=list(iter = 500, seed = 1, verbose = 25))
```
Depending on the vocabulary size, the collection size and the number of topics `K`, the inference of topic models can take a very long time. This calculation may take several minutes. If it takes too long, reduce the vocabulary in the DTM by increasing the minimum frequency in the previous step.
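For instance, a more aggressive pruning could look like this (a sketch; tune the threshold to your data):
```{r eval=FALSE}
# keep only terms that occur in at least 5% of all documents
DTM_small <- corpus_tokens %>%
  tokens_remove("") %>%
  dfm() %>%
  dfm_trim(min_docfreq = 0.05, docfreq_type = "prop")
dim(DTM_small)
```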
The topic model inference results in two (approximate) posterior probability distributions: a distribution `theta` over K topics within each document and a distribution `beta` over V terms within each topic, where V represents the length of the vocabulary of the collection (V = `r ncol(DTM)`). Let's take a closer look at these results:
```{r}
# have a look at some of the results (posterior distributions)
tmResult <- posterior(topicModel)
# format of the resulting object
attributes(tmResult)
ncol(DTM) # lengthOfVocab
# topics are probability distributions over the entire vocabulary
beta <- tmResult$terms # get beta from results
dim(beta) # K distributions over ncol(DTM) terms
rowSums(beta) # rows in beta sum to 1
nrow(DTM) # size of collection
# for every document we have a probability distribution of its contained topics
theta <- tmResult$topics
dim(theta) # nDocs(DTM) distributions over K topics
rowSums(theta)[1:10] # rows in theta sum to 1
```
Let's take a look at the 10 most likely terms within the term probabilities `beta` of the inferred topics (only the first 8 topics are shown below).
``` {r results="hide"}
terms(topicModel, 10)
```
``` {r echo=FALSE}
exampleTermData <- terms(topicModel, 10)
exampleTermData[, 1:8]
```
For the next steps, we want to give the topics more descriptive names than just numbers. Therefore, we simply concatenate the five most likely terms of each topic to a string that represents a pseudo-name for each topic.
``` {r}
top5termsPerTopic <- terms(topicModel, 5)
topicNames <- apply(top5termsPerTopic, 2, paste, collapse=" ")
```
# Visualization
Although word clouds may not be optimal for scientific purposes, they can provide a quick visual overview of a set of terms. Let's look at some topics as word clouds.
In the following code, you can change the variable **topicToViz** with values between 1 and 20 to display other topics.
``` {r, fig.width=8, fig.height=6, fig.align='center', message=FALSE, warning=F}
require(wordcloud2)
# visualize topics as word cloud
topicToViz <- 11 # change for your own topic of interest
topicToViz <- grep('mexico', topicNames)[1] # Or select a topic by a term contained in its name
# select the 40 most probable terms from the topic by sorting the term-topic-probability vector in decreasing order
top40terms <- sort(tmResult$terms[topicToViz,], decreasing=TRUE)[1:40]
words <- names(top40terms)
# extract the probabilities of each of the 40 terms
probabilities <- top40terms
# visualize the terms as wordcloud
wordcloud2(data.frame(words, probabilities), shuffle = FALSE, size = 0.8)
```
Let us now look more closely at the distribution of topics within individual documents. To this end, we visualize the distribution in 3 sample documents.
Let us first take a look at the contents of three sample documents:
``` {r eval=FALSE}
exampleIds <- c(2, 100, 200)
cat(sotu_corpus[exampleIds[1]])
cat(sotu_corpus[exampleIds[2]])
cat(sotu_corpus[exampleIds[3]])
```
``` {r echo=FALSE}
exampleIds <- c(2, 100, 200)
cat(paste0(exampleIds[1], ": ", substr(as.character(sotu_corpus)[exampleIds[1]], 1, 400), '...'))
cat(paste0(exampleIds[2], ": ", substr(as.character(sotu_corpus)[exampleIds[2]], 1, 400), '...'))
cat(paste0(exampleIds[3], ": ", substr(as.character(sotu_corpus)[exampleIds[3]], 1, 400), '...'))
```
After looking into the documents, we visualize the topic distributions within the documents.
``` {r results="hide", warning=FALSE, message=FALSE, fig.width=10, fig.height=6, fig.align='center'}
# load libraries for visualization
library("reshape2")
library("ggplot2")
N <- length(exampleIds)
# get topic proportions from example documents
topicProportionExamples <- theta[exampleIds,]
colnames(topicProportionExamples) <- topicNames
vizDataFrame <- melt(cbind(data.frame(topicProportionExamples), document = factor(1:N)), variable.name = "topic", id.vars = "document")
ggplot(data = vizDataFrame, aes(topic, value, fill = document)) + ylab("proportion") +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
coord_flip() +
facet_wrap(~ document, ncol = N)
```
# Topic distributions
The figure above shows how topics within a document are distributed according to the model. In the current model all three documents show at least a small percentage of each topic. However, two to three topics dominate each document.
The topic distribution within a document can be controlled with the *alpha* parameter of the model. Higher alpha priors result in a more even distribution of topics within a document, while lower alpha priors make the inference process concentrate the probability mass on a few topics per document.
In the previous model calculation, the alpha prior was automatically estimated to fit the data (maximizing the overall probability of the model). However, this automatic estimate does not necessarily correspond to the results that one would like to have as an analyst. Depending on our analysis interest, we might prefer a more peaked or a more even distribution of topics in the model.
Now let us change the alpha prior to a lower value to see how this affects the topic distributions in the model.
```{r}
# see alpha from previous model
attr(topicModel, "alpha")
```
``` {r LDA2, cache=TRUE, message=FALSE, results='hide'}
topicModel2 <- LDA(DTM, K, method="Gibbs", control=list(iter = 500, verbose = 25, seed = 1, alpha = 0.2))
tmResult <- posterior(topicModel2)
theta <- tmResult$topics
beta <- tmResult$terms
topicNames <- apply(terms(topicModel2, 5), 2, paste, collapse = " ") # reset topicnames
```
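A quick way to verify the effect (a minimal sketch): the mean maximum topic share per document should be noticeably higher under the lower alpha, indicating more peaked distributions.
```{r eval=FALSE}
# compare how concentrated the document-topic distributions are in both models
mean(apply(posterior(topicModel)$topics, 1, max))   # alpha estimated from the data
mean(apply(posterior(topicModel2)$topics, 1, max))  # alpha = 0.2
```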
Now visualize the topic distributions in the three documents again. What are the differences in the distribution structure?
``` {r results="hide", echo=T, warning=FALSE, message=FALSE, fig.width=10, fig.height=6, fig.align='center'}
# get topic proportions from example documents
topicProportionExamples <- theta[exampleIds,]
colnames(topicProportionExamples) <- topicNames
vizDataFrame <- melt(cbind(data.frame(topicProportionExamples), document = factor(1:N)), variable.name = "topic", id.vars = "document")
ggplot(data = vizDataFrame, aes(topic, value, fill = document)) + ylab("proportion") +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
coord_flip() +
facet_wrap(~ document, ncol = N)
```
# Topic ranking
First, we try to get a more meaningful order of top terms per topic by re-ranking them with a specific score @Chang.2012. The idea of re-ranking terms is similar to the idea of TF-IDF: the more topics in which a term ranks highly by probability, the less useful it is for characterizing any single one of them. Hence, the scoring favors less general, more specific terms to describe a topic.
```{r, results='hide'}
# re-rank top topic terms for topic names
topicNames <- apply(lda::top.topic.words(beta, 5, by.score = T), 2, paste, collapse = " ")
```
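For intuition, the re-ranking can be sketched by hand (an approximation of the term-score idea, not necessarily the exact formula used by `lda::top.topic.words`):
```{r eval=FALSE}
# term score: weight each topic-term probability by how specific
# the term is to that topic (cf. the term score of Blei & Lafferty)
logbeta <- log(beta + 1e-5)
score <- beta * (logbeta - rep(colMeans(logbeta), each = K))
# top 5 terms per topic according to this score
apply(score, 1, function(x) names(sort(x, decreasing = TRUE))[1:5])
```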
What are the defining topics within a collection? There are different approaches to find out, each of which can be used to bring the topics into a certain order.
**Approach 1:** We sort topics according to their probability within the entire collection:
``` {r, results='hide'}
# What are the most probable topics in the entire collection?
topicProportions <- colSums(theta) / nrow(DTM)  # mean probabilities over all documents
names(topicProportions) <- topicNames # assign the topic names we created before
sort(topicProportions, decreasing = TRUE) # show mean proportions in decreasing order
```
```{r, echo=FALSE}
soP <- sort(topicProportions, decreasing = TRUE)
paste(round(soP, 5), ":", names(soP))
```
We recognize some topics that are much more likely to occur in the corpus than others. These describe rather general thematic coherences. Other topics correspond more to specific contents.
**Approach 2:** We count how often a topic appears as the primary topic within a document. This method is also called Rank-1.
``` {r rank1, results='hide'}
countsOfPrimaryTopics <- rep(0, K)
names(countsOfPrimaryTopics) <- topicNames
for (i in 1:nrow(DTM)) {
topicsPerDoc <- theta[i, ] # select topic distribution for document i
# get first element position from ordered list
primaryTopic <- order(topicsPerDoc, decreasing = TRUE)[1]
countsOfPrimaryTopics[primaryTopic] <- countsOfPrimaryTopics[primaryTopic] + 1
}
sort(countsOfPrimaryTopics, decreasing = TRUE)
```
```{r, echo=FALSE}
so <- sort(countsOfPrimaryTopics, decreasing = TRUE)
paste(so, ":", names(so))
```
We see that sorting topics by the Rank-1 method places topics with rather specific thematic coherences at the upper ranks of the list.
This sorting of topics can be used for further analysis steps such as the semantic interpretation of topics found in the collection, the analysis of time series of the most important topics or the filtering of the original collection based on specific sub-topics.
# Filtering documents
The fact that a topic model provides topic probabilities for each document (or paragraph, if modeling paragraphs) makes it possible to use it for thematic filtering of a collection. As a filter, we select only those documents which exceed a certain threshold of their probability value for certain topics (for example, each document which contains topic `X` to more than `Y` percent).
In the following, we will select documents based on their topic content and display the resulting document quantity over time.
``` {r}
topicToFilter <- 6 # you can set this manually ...
# ... or have it selected by a term in the topic name
topicToFilter <- grep('mexico ', topicNames)[1]
topicThreshold <- 0.1 # minimum share of content must be attributed to the selected topic
selectedDocumentIndexes <- (theta[, topicToFilter] >= topicThreshold)
filteredCorpus <- sotu_corpus %>% corpus_subset(subset = selectedDocumentIndexes)
# show length of filtered corpus
filteredCorpus
```
Our filtered corpus contains `r ndoc(filteredCorpus)` documents in which topic `r topicToFilter` accounts for at least 10 % of the content.
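To display the resulting document quantity over time, as announced above, a minimal sketch (assuming `textdata$date` holds the speech dates, as used in the next section):
```{r eval=FALSE}
# count the filtered documents per decade
filtered_decades <- paste0(substr(textdata$date[selectedDocumentIndexes], 0, 3), "0")
barplot(table(filtered_decades), las = 2, ylab = "number of documents")
```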
# Topic proportions over time
In a last step, we provide a distant view on the topics in the data over time. For this, we aggregate mean topic proportions per decade of all SOTU speeches. These aggregated topic proportions can then be visualized, e.g. as a bar plot.
```{r fig.width=9, fig.height=6, fig.align='center', warning=F, message=F}
# append decade information for aggregation
textdata$decade <- paste0(substr(textdata$date, 0, 3), "0")
# get mean topic proportions per decade
topic_proportion_per_decade <- aggregate(theta, by = list(decade = textdata$decade), mean)
# set topic names to aggregated columns
colnames(topic_proportion_per_decade)[2:(K+1)] <- topicNames
# reshape data frame
vizDataFrame <- melt(topic_proportion_per_decade, id.vars = "decade")
# plot topic proportions per decade as bar plot
require(pals)
ggplot(vizDataFrame, aes(x=decade, y=value, fill=variable)) +
geom_bar(stat = "identity") + ylab("proportion") +
scale_fill_manual(values = paste0(alphabet(20), "FF"), name = "topic") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
The visualization shows that topics around the relation between the federal government and the states as well as inner conflicts clearly dominate the first decades. Security issues and the economy are the most important topics of recent SOTU addresses.
For more details about topic modeling and some best practice advice, see also @Maier.2018.
# Optional exercises
1. Create a list of all documents that contain a share of *war*-related topics of at least 50 % (possible approach: sub-select topics which contain the term *war* among the top 15 terms).
```{r ex1, echo=FALSE, warning=F, message=F, cache=T, error=T}
regex_pattern <- "^war$|-war$"
top15words <- lda::top.topic.words(beta, 15, by.score = T)
relevant_terms <- apply(top15words, 2, FUN = function(x) grepl(regex_pattern, x))
relevant_topics <- which(colSums(relevant_terms) > 0)
cat("Selected topics:")
print(topicNames[relevant_topics])
# subselect
topicThreshold <- 0.5
summed_share <- rowSums(theta[, relevant_topics])
selectedDocumentIndexes <- summed_share >= topicThreshold
filteredCorpus <- sotu_corpus %>% corpus_subset(subset = selectedDocumentIndexes)
# show beginnings of filtered speeches
cat("Beginnings of filtered speeches:")
filteredCorpus
# show metadata of filtered corpus
metadata <- textdata[selectedDocumentIndexes, -which(names(textdata) %in% c("text", "speech_type"))]
metadata$topic_share <- summed_share[selectedDocumentIndexes]
cat("Metadata of filtered speeches:")
print(metadata)
```
2. Repeat the exercises with a different K value (e.g., K = 5, 30, 50). What do you think of the results?
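A minimal sketch for re-running the model with a different `K` (here `K = 30`; we use a new object name so the results above are not overwritten):
```{r eval=FALSE}
# recompute the topic model with K = 30 and inspect the top terms
topicModelK30 <- LDA(DTM, 30, method = "Gibbs", control = list(iter = 500, seed = 1, verbose = 25))
terms(topicModelK30, 10)
```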
3. Split the original text source into paragraphs (e.g. use `corpus_reshape(x, to = "paragraphs")`) and compute a topic model on paragraphs instead of full speeches. How does the smaller context unit affect the result?
```{r ex3, fig.width=9, fig.height=6, fig.align='center', echo=FALSE, warning=F, message=F, cache=T, error=T}
sotu_corpus <- corpus(
textdata$text,
docnames = textdata$doc_id,
docvars = data.frame(decade = paste0(substr(textdata$date, 0, 3), "0"))
)
corpus_paragraphs <- corpus_reshape(sotu_corpus, to = "paragraphs")
cat("Length of the corpus split into paragraphs:")
length(corpus_paragraphs)
# Create a DTM (may take a while)
corpus_tokens <- corpus_paragraphs %>%
tokens(remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE) %>%
tokens_tolower() %>%
tokens_replace(lemma_data$inflected_form, lemma_data$lemma, valuetype = "fixed") %>%
tokens_remove(pattern = stopwords_extended, padding = T) %>%
tokens_compound(sotu_collocations)
# Create DTM, but remove terms which occur in less than 1% of all documents
# due to paragraphs, we need a much lower min-docfreq threshold
DTM <- corpus_tokens %>%
tokens_remove("") %>%
dfm() %>%
dfm_trim(min_docfreq = 0.0001, max_docfreq = Inf, docfreq_type = "prop")
top10_terms <- c( "unite_state", "past_year", "year_ago", "year_end", "government", "state", "country", "year", "make", "seek")
DTM <- DTM[, !(colnames(DTM) %in% top10_terms)]
sel_idx <- rowSums(DTM) > 0
DTM <- DTM[sel_idx, ]
corpus_paragraphs <- corpus_paragraphs[sel_idx]
K <- 20
topicModel3 <- LDA(DTM, K, method="Gibbs", control=list(iter = 500, seed = 1, alpha = 0.2))
tmResult <- posterior(topicModel3)
theta <- tmResult$topics
beta <- tmResult$terms
topicNames <- apply(lda::top.topic.words(beta, 5, by.score = T), 2, paste, collapse = " ")
# stacked bar chart
# -----------------
# append decade information for aggregation
paragraph_decade <- docvars(corpus_paragraphs)$decade
# get mean topic proportions per decade
topic_proportion_per_decade <- aggregate(theta, by = list(decade = paragraph_decade), mean)
# set topic names to aggregated columns
colnames(topic_proportion_per_decade)[2:(K+1)] <- topicNames
# reshape data frame
vizDataFrame <- melt(topic_proportion_per_decade, id.vars = "decade")
# plot topic proportions per decade as bar plot
require(pals)
ggplot(vizDataFrame, aes(x=decade, y=value, fill=variable)) +
geom_bar(stat = "identity") + ylab("proportion") +
scale_fill_manual(values = paste0(alphabet(K), "FF"), name = "topic") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
4. Use the LDAvis package by @sievert_ldavis:_2014 to visualize the latest model you computed.
```{r ex4, eval=F, echo=T, warning=F, message=F, cache=T, error=T}
# LDAvis browser
library(LDAvis)
library("tsne")
svd_tsne <- function(x) tsne(svd(x)$u)
json <- createJSON(
phi = tmResult$terms,
theta = tmResult$topics,
doc.length = rowSums(DTM),
vocab = colnames(DTM),
term.frequency = colSums(DTM),
mds.method = svd_tsne,
plot.opts = list(xlab="", ylab="")
)
serVis(json)
```
![LDAvis example](resources/ldavis.PNG)
# References