-
Notifications
You must be signed in to change notification settings - Fork 0
/
text_mining.rmd
789 lines (639 loc) · 38.6 KB
/
text_mining.rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
---
title: "_Palabras que se las lleva el viento_"
subtitle: "Exploring the last 10 inauguration speeches of Peruvian presidents"
output: html_document
date: \today
---
## Introduction
As of this writing, the current Peruvian government counts over 66 deaths in less than 4 months of government (Rebaza and Guy, 2023). Dina Boluarte is Peru’s first female president and the sixth person to occupy the presidential chair in four years (Olmo, 2022). She ascended to office on December 7, 2022, after the previous president, elected by popular vote, dissolved the congress after only 16 months in charge and many attempts of being removed from office by the congress.
The analysis of the political motives of the perpetual peruvian political crisis exceeds the purpose of this exercise. However, it will suffice to mention that, before Boluarte, perhaps the one who aroused the most popular rage was Manuel Merino, who governed alone from November 10 to 15, 2020 —less than a week!— and ended up resigning because of massive protests throughout the country (Redacción BBC Mundo, 2020). The last government that lasted 5 years, the constitutional term of office, was that of Ollanta Humala (2011 - 2016). Since then, mandates have lasted months or even just days.
This exercise aims to characterize the topics and sentiments addressed in the last 10 Message to the Nation (or inaugural speech) in Peru. The hypothesis is that the inaugural speeches, analyzed through text mining techniques, will help to characterize the context in which each president ascended to power, and that there will be clear differences in the topics and length of those who ascended by popular vote, and those who took office by emergency succession. Besides, negative-related sentiments will be predominant in general.
## Data
The speeches were manually downloaded in .pdf format from official sources: the archive of the [Congress of the Republic of Peru](https://www.congreso.gob.pe/participacion/museo/congreso/mensajes-presidenciales/) and the [Single Digital Platform of the Peruvian State](https://www.gob.pe/mensajepresidencial]). Only the inaugural speech of Alan Garcia (2006-2011) was not found, as the one of Alejandro Toledo (2001-2006) was wrongly uploaded in its place. Therefore, to complete 10 presidents, the speech of the first period of government of the dictator Alberto Fujimori (1900-1995) was also extracted.
These pdf were then converted to markdown, superficially checked for special characters, and uploaded to GitHub so that they could be retrieved from there directly. It could help make this exercise reproducible without the need to attach them. The GitHub repository is accessible [here](https://github.com/carolinacornejocastellano/presidential_text_mining).
Thus, 10 presidential speeches were selected. These were proclaimed on the following dates:
- Dina Boluarte (December 7, 2022).
- Pedro Castillo (December 28, 2021)
- Francisco Sagasti (November 17, 2020)
- Martin Vizcarra (March 23, 2018).
- Pedro Pablo Kuczynski (July 28, 2016).
- Ollanta Humala (July 28, 2011)
- Alejandro Toledo (July 28, 2001)
- Valentín Paniagua (November 11, 2000)
- Alberto Fujimori (second term) (July 28, 1995)
- Alberto Fujimori (first term) (July 28, 1990)
Dina Boluarte, Francisco Sagasti, Martín Vizcarra and Valentín Paniagua were not elected by popular vote, but assigned for different reasons and in different contexts, with different popular approval. There is also a debate about Alberto Fujimori's second term: whether he was legitimately elected or manipulated the results in obedience to his dictatorial spirit. On the other hand, Manuel Merino was not considered on purpose: because of the way he took office and his brief period (5 days), many do not consider him a president, only a coup plotter or usurper.
## Dataframe creation
```{r libraries}
libraries <- c(
"tidyverse",
"tidytext",
"textdata",
"wordcloud",
"RColorBrewer",
"reshape2",
"igraph",
"ggraph",
"widyr",
"tm",
"quanteda",
"quanteda.textplots",
"topicmodels",
"syuzhet",
"parallel",
"textstem",
"SentimentAnalysis"
)
# install (in case they are not already installed) and load packages
for (lib in libraries) {
if (!requireNamespace(lib, quietly = TRUE)) {
install.packages(lib)
}
suppressPackageStartupMessages(library(lib, character.only = TRUE))
}
# remove vector and iterator
rm(lib, libraries)
```
The following steps are to automatically create the download link for each given the username and repository name:
```{r download speeches}
presidents <-
c("Dina Boluarte",
"Pedro Castillo",
"Francisco Sagasti",
"Martin Vizcarra",
"Pedro Pablo Kuczynski",
"Ollanta Humala",
"Alejandro Toledo",
"Valentin Paniagua",
"Alberto Fujimori second term",
"Alberto Fujimori first term")
links <- paste0(
"https://raw.githubusercontent.com/castellco/presidential_text_mining/main/",
tolower(gsub(" ", "_", presidents)), ".md")
```
Then, a first dataframe without the speeches was created:
```{r create df}
df <- tibble(
president = presidents,
date = c("2022-12-07",
"2021-07-28",
"2020-11-17",
"2018-03-23",
"2016-07-28",
"2011-07-28",
"2001-07-28",
"2000-11-11",
"1995-07-28",
"1990-07-28"),
url = links
)
```
After that, the speeches were downloaded from the [GitHub](https://github.com/castellco/presidential_text_mining) repository. In this way, a dataframe with 3 columns was obtained: president, date and the entire speech in a single line for each president:
```{r read speeches}
df$speech <- df$url %>%
map_chr(~ read_lines(.x,
locale = locale(encoding = "UTF-8")) %>%
paste(collapse = "\n"))
# delete URLs' column from df, as they are not longer needed
df$url <- NULL
df
```
## Data preparation
The next step was to tokenize the speeches and remove stopwords. Since the language of the speeches is Spanish, it was preferred to conduct the analysis in Spanish. It was considered that translating them could add a degree of bias, since many online translation services perform very literal translations and/or make the real meaning of what was said get lost.
However, as it will be seen later on, analyzing the Spanish texts was a challenge in itself since the packages for stopwords, sentiment analysis, lemmatization and stemming, etc., are mostly built for English corpora, or their default settings are for English. In some cases, it was necessary to add additional parameters or to use other packages to do the same as what it was taught in class.
The stopwords in Spanish were stored in a vector. Then, the speeches column was split into tokens with the `unnest_tokens()` function from the `tidytext` package. Subsequently, the tokens that matched the stopwords stored in the vector were deleted.
```{r store stopwords in Spanish as a vector}
stopwords_es <- stopwords::stopwords("es",
source = "stopwords-iso"
)
```
```{r delete stopwords from the original df}
df_no_stopwords <- df %>%
unnest_tokens(word, speech) %>%
anti_join(data.frame(word = stopwords_es),
by = "word") %>%
group_by(president, date) %>%
summarize(clean_speech = paste(word,
collapse = " "))
```
```{r}
df_tokenized <- df_no_stopwords %>%
unnest_tokens(word, clean_speech) %>%
filter(!word %in% stopwords_es)
```
Below, the most used words in a given speech are shown. It can be seen that Pedro Pablo Kuczynski (PPK) leads the ranking, mentioning the commonplace _país_ (country) 28 times in his speech, and _salud_ (health) 18 times. Pedro Castillo also used those words repeatedly. This is interesting because both come from totally different political —and personal— spectrums: PPK, of Polish origin and upper income class, has been in Peruvian politics for decades: he is an economist, businessman and former Minister of Economy, and a neoliberal who studied all his degrees in US and UK universities. On the other hand, Pedro Castillo, from the conservative left, was a rural school teacher in Cajamarca, one of the poorest regions of Peru, with no experience as a politician. Regarding a word that can give us an idea of the priorities of a president at the beginning of his government, as already told, in the case of Pedro Castillo, the word _salud_ (health) was repeated 24 times. This makes sense, since he took office at a time when the Covid-19 pandemic was still killing thousands of people in Peru. It is clear that public health policies were an important theme in his first speech.
Another case worth mentioning is the reference to the word _pueblo_ (town, nation or working class) in the inauguration of the second term of dictator Alberto Fujimori. He also mentioned _desarrollo_ (development) a lot and, towards the fourth tab of results, it is seen that he mentioned _derechos_ (rights) and _humanos_ (human) 12 times. It is peculiar because he is now facing sentences, among other reasons, for systematic violations of human rights. For example, his "family planning" policies, which consisted of forced sterilization of women in the poorest and most remote areas of Peru, often Quechua-speaking, were among his most notorious cases. In his fight against terrorism, he also ordered innocent and poor people to be killed massively for being "suspects".
```{r general word ranking}
df_tokenized %>%
count(word,
sort = TRUE
)
```
Below is a graph representing the most used words by each president. Specifically, those words used more than 7 times are shown. Some presidents —Valentín Paniagua and Dina Boluarte— are not shown for not repeating any word more than 7 times. This is to be expected, because they took office by emergency: it is understandable that they did not have a speech as extensive and prepared as those elected by popular vote. Even Francisco Sagasti and Martin Vizcarra, who are included in the graph, are the two who show the fewest words above 7 repetitions among all the others: they also took office on an emergency basis, with only a few hours' pre-notification.
```{r plot the most common words in presidential speeches}
df_tokenized %>%
count(president,
word,
sort = TRUE
) %>%
filter(n > 7) %>% # words that each president used more than 7 times
ggplot(aes(x = reorder(word, n),
y = n,
fill = president)) +
geom_col() +
facet_wrap(~president,
scales = "free") +
coord_flip() +
scale_fill_manual(values = brewer.pal(8, "BrBG")) +
theme_minimal() +
ggtitle("Most Common Words in Presidential Inauguration Speeches",
subtitle = "From 1990 to 2022") +
labs(
x = "Word",
y = "Frequency",
fill = "President"
) +
theme(
plot.title = element_text(
hjust = 0.5,
size = 20,
face = "bold"
),
plot.subtitle = element_text(
hjust = 0.5,
size = 14
)
)
```
## N-grams
The most common bigrams were be analyzed. Below are the top 20 bigrams:
```{r most common bigrams in all speeches}
df_no_stopwords %>%
unnest_tokens(bigram,
clean_speech,
token = "ngrams",
n = 2) %>%
filter(!is.na(bigram)) %>%
count(bigram, sort = TRUE) %>%
head(20) # onyl the first 20
```
Similar to what has already been mentioned, the reference to human rights (_derechos humanos_) is constant in the discourse of Alberto Fujimori's second term. Without reading his speech, but knowing in broad terms the reasons for which he is accused, one might think that in his speech he was already defending himself against accusations of human rights violations.
Probably, PPK mentioned the bigrams _año 2021_ (year 2021) and _año Bicentenario_ (Bicentennial year) many times because, upon taking office in 2016, he was excited to be the president in office in 2021, the year of the the Bicentennial of Peru's Independence. However, he resigned from office much earlier for disputes with the congress.
On the other hand, the _asamblea constityente_ (Constituent Assembly) was Pedro Castillo's main campaign promise to reform the constitution.
Another important feature is the mention of social issues by the two presidents who took office after running for the left: Ollanta Humala —although he later made a neoliberal right-wing government— and Pedro Castillo. The former spoke repeatedly about _inclusión social_ (social inclusion), and the latter about _protección social_ (social protection).
Below are the top 2 bigrams for each president. However, since there are many bigrams that are repeated an equal number of times, there are many ties in the top 2, so more bigrams are actually shown.
```{r}
bigrams <- df_no_stopwords %>%
unnest_tokens(bigram,
clean_speech,
token = "ngrams",
n = 2)
# bigrams by president
bigram_counts <- bigrams %>%
count(president,
bigram,
sort = TRUE)
# top bigrams for each president
top_bigrams <- bigram_counts %>%
group_by(president) %>%
top_n(2, n) %>% # only the top 2, although there are many ties
arrange(president)
top_bigrams
```
The following graph shows the most common bigrams. However, the plot does not show well in the inline preview. Please, click in "Show in New window" button for it to be better displayed:
```{r bigrams plot}
ggplot(top_bigrams,
aes(x = reorder(bigram, n),
y = n,
fill = president)) +
geom_col() +
facet_wrap(~ president,
scales = "free") +
coord_flip() +
scale_fill_manual(values = brewer.pal(10, "BrBG")) +
theme_minimal() +
ggtitle("Most Common Bigrams in Presidential Speeches",
subtitle = "From 1990 to 2022") +
labs(x = "Bigram",
y = "Frequency",
fill = "President") +
theme(plot.title = element_text(hjust = 0.5,
size = 20,
face = "bold"),
plot.subtitle = element_text(hjust = 0.5,
size = 14))
```
Next, the most common trigrams were identified:
```{r}
df_no_stopwords %>%
unnest_tokens(trigram,
clean_speech,
token = "ngrams",
n = 3) %>%
filter(!is.na(trigram)) %>%
count(trigram, sort = TRUE) %>%
head(20) # only the first 20
```
In the case of PPK, references to the year 2021 as the year of the Bicentennial of the Independence of Peru are repeated. However, the theme of _país moderno significa_, alluding to an idea of "modernization", is also very present. It could be thought that his allusion to modernization was in response to one of the main criticisms against him during his campaign: that he is an "old school" politician who has been decades without making a significant contribution to the country from his positions in the executive and legislative offices, and that he is too old to assume the presidency. In fact, one of his campaign slogans was _Me hago viejo esperando_ ("I grow old waiting"), responding to criticism about his advanced age.
By arranging the trigrams by president in alphabetical order, we also get an idea of the general theme of their speeches. Fujimori, who we have already mentioned is in prison for human rights violations, including the forced sterilization of thousands of indigenous, poor and Quechua-speaking women as a method to contain poverty, in his first term spoke about "family planning" and "poverty". In his second term, he spoke about _respeto derechos humanos_ "respect for human rights". Another very important topic in his speech is related to productivity, the role of companies in economic development, the economic crisis of that time and terrorism (when he mentions _grupos alzados armas_, "the armed groups").
On the contrary, Alejandro Toledo does not show any very relevant trigram, except his mention to the armed and police forces. On the other hand, Dina Boluarte does not have a predominant trigram: all of them are repeated only once, and that is the reason the table below is cluttered. It is to be anticipated, because she took office with only hours of notice, and, therefore, it is expected that she has not had a very structured speech. Similar happens with Martin Vizcarra and Valentin Paniagua.
```{r}
trigrams <- df_no_stopwords %>%
unnest_tokens(trigram,
clean_speech,
token = "ngrams",
n = 3)
# trigrams by president
trigram_counts <- trigrams %>%
count(president,
trigram,
sort = TRUE)
# top trigrams for each president
top_trigrams <- trigram_counts %>%
group_by(president) %>%
top_n(2, n) %>%
arrange(president)
top_trigrams
```
As can be seen in the graph above, their bar graphs are almost unreadable. Thus, the trigram analysis gives us an idea of who took office on an emergency basis. If not shown correctly, please, open the graph in a new window.
```{r}
ggplot(top_trigrams,
aes(x = reorder(trigram, n),
y = n,
fill = president)) +
geom_col() +
facet_wrap(~ president,
scales = "free") +
coord_flip() +
scale_fill_manual(values = brewer.pal(10, "BrBG")) +
theme_minimal() +
ggtitle("Most Common Trigrams in Presidential Speeches",
subtitle = "From 1990 to 2022") +
labs(x = "Trigram",
y = "Frequency",
fill = "President") +
theme(plot.title = element_text(hjust = 0.5,
size = 20,
face = "bold"),
plot.subtitle = element_text(hjust = 0.5,
size = 14))
```
## Sentiment analysis
A sentiment analysis of the speeches was also performed. For this purpose, other packages than those seen in class were used. The reason behind this was that sentiment analysis assigns a label —of level of polarization or sentiment— to words that are used based on lexicons. However, the packages discussed in class are more intended for use with English corpora. Thus, in this section, after searching and testing with other packages, the packages `SentimentAnalysis` and `syuzhet` were employed. Yet, they may not be as good as those based on English lexicons, which are much more developed.
The next chunk employs the `SentimentAnalysis` pacakge and may take up to 30 seconds to run:
```{r sentiment analysis with `SentimentAnalysis` package}
sentiments_spanish <- analyzeSentiment(df_tokenized$word,
language = "spanish",
stemming = TRUE
)
head(sentiments_spanish)
```
The following is to view sentiment direction (i.e. positive, neutral and negative), as a way to summarize a bit the prior results:
```{r}
df_sentiments <- data.frame(df_tokenized$word,
sentiment = convertToDirection(sentiments_spanish$SentimentGI))
head(df_sentiments, 20)
```
With the following, the sentiments of the presidential messages in general are further summarized:
```{r}
table(df_sentiments$sentiment)
```
As can be seen, the vast majority of words were classified as neutral. This includes nouns such as "country", or numbers such as 2021.
```{r}
df_sentiments %>%
group_by(sentiment) %>%
summarise(number = n()) %>%
ggplot(aes(x = sentiment,
y=number)) +
geom_bar(aes(fill=sentiment),
stat = "identity") +
scale_fill_manual(values = brewer.pal(4, "BrBG")) +
theme_minimal() +
ggtitle("Most Common Sentiments in Presidential Speeches",
subtitle = "From 1990 to 2022, according to the {SentimentAnalysis} package") +
labs(x = "Sentiment",
y = "Frequency",
fill = "President") +
theme(plot.title = element_text(hjust = 0.5,
size = 20,
face = "bold"),
plot.subtitle = element_text(hjust = 0.5,
size = 14))
```
These results were contrasted with the `syuzhet` package. This package associates texts to 8 emotions and 2 feelings. The more the value exceeds 0, the more pronounced the emotion or feeling is.
However, it was noticed that using this package required much computational power and took a long time. For that reason, some memory was freed with the `gc()` function and the process was parallelized. The parallel computing approach consists of distributing tasks that do not need to be sequential, for them to be executed in a parallel way. Therefore, more than one core can be used —in this case, the laptop on which this exercise was done has 8— to execute the task. This process could potentially be parallelized on 8 threads, but in the interest of making this code reproducible since not every computer has this number of cores, only 4 were harnessed.
It is also recommended closing other programs that are not being used.
```{r identifying cores}
gc()
cl <- makeCluster(4)
clusterExport(cl = cl, c("get_sentiment",
"get_sent_values",
"get_nrc_sentiment",
"get_nrc_values",
"parLapply"))
```
```{r assign sentiment or emotion to words with parallel computing}
sentiment_nrc <- get_nrc_sentiment(df_tokenized$word,
cl = cl,
language = "spanish")
stopCluster(cl)
```
With this approach, this computation took 1 minute and 40 secs, and almost 95% of CPU.
```{r}
head(sentiment_nrc)
```
A summary of the sentiments and emotions of this package can be seen. Judging by the mean, words with a positive tone are predominant:
```{r}
summary(sentiment_nrc)
```
The order of each emotion or feeling is in alphabetical order: anger, anticipation, disgust, fear, joy, sadness, surprise, trust, negative, positive. This analysis will be done in a general sense, rather than for each president.
Hereunder, an example of 20 words associated with each emotion or feeling can be read. The list is longer, but for space reasons only 20 are shown. To see the complete list, uncomment the last line of each chunk.
```{r}
anger <- which(sentiment_nrc$anger > 0)
head(df_tokenized$word[anger], 20)
#df_tokenized$word[anger]
```
```{r}
anticipation <- which(sentiment_nrc$anticipation > 0)
head(df_tokenized$word[anticipation], 20)
#df_tokenized$word[anticipation]
```
```{r}
disgust <- which(sentiment_nrc$disgust > 0)
head(df_tokenized$word[disgust], 20)
#df_tokenized$word[disgust]
```
```{r}
fear <- which(sentiment_nrc$fear > 0)
head(df_tokenized$word[fear], 20)
#df_tokenized$word[fear]
```
```{r}
joy <- which(sentiment_nrc$joy > 0)
head(df_tokenized$word[joy], 20)
# df_tokenized$word[joy]
```
```{r}
sadness <- which(sentiment_nrc$sadness > 0)
head(df_tokenized$word[sadness], 20)
# df_tokenized$word[sadness]
```
```{r}
surprise <- which(sentiment_nrc$surprise > 0)
head(df_tokenized$word[surprise], 20)
# df_tokenized$word[surprise]
```
```{r}
trust <- which(sentiment_nrc$trust > 0)
head(df_tokenized$word[trust], 20)
# df_tokenized$word[trust]
```
```{r}
negative <- which(sentiment_nrc$negative > 0)
head(df_tokenized$word[negative], 20)
# df_tokenized$word[negative]
```
```{r}
positive <- which(sentiment_nrc$positive > 0)
head(df_tokenized$word[positive], 20)
#df_tokenized$word[positive]
```
As can be seen next, in an aggregated manner, the "positive" and "trust" words are the most used by Peruvian presidents in recent years, followed by the negative words —these are almost half of the positive ones—, and the words denoting "fear" and "anticipation":
```{r}
sentiment_nrc %>%
summarise_all(sum) %>%
gather(key = sentiment,
value = number) %>%
arrange(desc(number))
```
```{r}
sentiment_nrc %>%
summarise_all(sum) %>%
gather(key = sentiment,
value = number) %>%
ggplot(aes(x = sentiment,
y = number,
fill = sentiment)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = brewer.pal(10, "BrBG")) +
theme_minimal() +
ggtitle("Most Common Sentiments in Presidential Speeches",
subtitle = "From 1990 to 2022, according to the {syuzhet} package") +
labs(x = "Sentiment",
y = "Frequency",
fill = "Sentiment") +
theme(plot.title = element_text(hjust = 0.5,
size = 20,
face = "bold"),
plot.subtitle = element_text(hjust = 0.5,
size = 14))
```
## Topic modelling
Next, topic modelling was conducted. According to Silge and Robinson (2017), Latent Dirichlet allocation (LDA) is one of the most common algorithms for it. It is guided by two principles:
- each document is a mixture of topics, and
- every topic is a mixture of words (Silge and Robinson, 2017).
The authors comment that LDA is a mathematical method for estimating both at the same time. In this way, associations between the discourses can be found.
The first step towards LDA was to turn the dataframe into a DocumentTermMatrix (DTM).
```{r create the DTM}
df_dtm <- df_tokenized %>%
count(president, word) %>%
cast_dtm(president, word, n)
class(df_dtm) # to verify it is of class "DocumentTermMatrix"
df_dtm
```
Later, a fixed number of topics to be found was established. For no particular reason, it was chosen to find 5.
```{r}
df_lda_5 <- LDA(df_dtm,
k = 5,
control = list(seed = 666))
df_lda_5
```
At this point, even though no output could be read, the model was already declared.
```{r probabilities of words per each of the 5 topics}
df_lda_topics_5 <- tidy(df_lda_5,
matrix = "beta")
df_lda_topics_5
```
Note that numbers appeared first in the "term" column. This could be confusing. However, starting from page 4 of the results the terms were shown. Therefore, at this point, numbers were deleted from the corpus, always taking into consideration that there were some numbers, such as 2021, that were important in the presidential speeches.
```{r create a df with no numbers}
sum(str_detect(df_tokenized$word, "^\\d+(,\\d{3})*(\\.\\d+)?$")) #to identify numbers before deleting them
df_tokenized_no_numbers <- df_tokenized %>%
filter(!str_detect(word, "^\\d+(,\\d{3})*(\\.\\d+)?$"))
sum(str_detect(df_tokenized_no_numbers$word, "^\\d+(,\\d{3})*(\\.\\d+)?$")) # to check that we have effectively deleted them
```
A new DTM was created, overwriting the first one:
```{r create the DTM without numbers as words}
df_dtm <- df_tokenized_no_numbers %>%
count(president, word) %>%
cast_dtm(president, word, n)
class(df_dtm)
df_dtm
```
### 5 topics LDA
Then, the same model was declared and probabilities of each word to be part of any topic were calculated:
```{r fit the model with 5 topics}
df_lda_5 <- LDA(df_dtm,
k = 5,
control = list(seed = 666))
df_lda_5
```
```{r probabilities of words per each of the 5 topics, excluding numbers as words}
df_lda_topics_5 <- tidy(df_lda_5,
matrix = "beta")
df_lda_topics_5
```
This gave the probability for each word to belong to each topic. Below are the first 20 words for each of the 5 topics.
```{r top 20 words per each of the 5 topics}
top_terms_topic_5 <- df_lda_topics_5 %>%
group_by(topic) %>%
top_n(20, beta) %>%
ungroup() %>%
arrange(topic, -beta)
top_terms_topic_5
```
In topic 1, the most likely word is _salud_, followed by _país_ (country) and _nacional_ (national). Other topics are Peru, _pueblo_ and _gobierno_ (government). Topic 2 is characterized by topics such as _pobreza_ (poverty), _desarrollo_ (development), _future_ (future) and _educación_ (education). Topic 3 talks also about _salud_, _año_ (year), _bicentenario_ (Bicentennial, probably regarding the 200 aniversary of Peru independence in 2021) and _jóvenes_ (youth). Topic 4 coincides with some of the previous topics in terms of _social_, Peru, but also mentions _democracia_ (democracy) and _patria_, _crecimiento_ (growth), _sistema_ (system), etc. Finally, topic 5 is related to _gobierno_, _desarrollo_ (development), _derechos_ (rights), _corrupción_ (corruption), _empresas_ (companies), etc. However, in almost all topics the first general words are repeated: _país_, Peru y _nacional_.
If the following graphic is not clearly visible, I would appreciate it if you could enlarge it by opening it in a separate window:
```{r plot top terms by topic(5)}
colors_5 <- c("#543005", "#bf812d", "#80cdc1", "#01665e", "#003c30")
top_terms_topic_5 %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(beta, term,
fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic,
scales = "free") +
scale_y_reordered() +
scale_fill_manual(values = colors_5) +
theme_minimal() +
ggtitle("Top Terms by Topic",
subtitle = "From 1990 to 2022") +
labs(
x = "Beta Coefficient",
y = "Term",
fill = "Topic"
) +
theme(
plot.title = element_text(
hjust = 0.5,
size = 20,
face = "bold"
),
plot.subtitle = element_text(
hjust = 0.5,
size = 14
),
legend.position = "bottom"
)
```
Next, the probability to each presidential speech of belonging to each topic was assigned:
```{r assign each pf the 5 topics to the speeches}
document_topic_prob_5 <- tidy(df_lda_5,
matrix = "gamma")
document_topic_prob_5
```
For example, it seemed that Pedro Castillo was the most likely to belong to this the first topic. The speech of Alberto Fujimori in his first speech, and the ones of Alejandro Toledo and Martín Vizcarra were the most likely to belong to topic 2. Dina Boluarte, Francisco Sagasti, PPK and Valentín Paniagua belong to topic 3. In the case of Sagasti and PPK, it also made sense that they were together, since they assumed their positions as part of the same party. On other hand, it seemed that only Ollanta Humala speech was part of topic 4. Similar case is that on the speech of Fujimori in his second term, which is alone in the fifth topic.
In the table below, the probability of fitting in one of the topics, together with the original speech, is shown.
```{r merge the 5 topics with original data}
document_topic_merged_5 <- df %>%
left_join(document_topic_prob_5,
by = c("president" = "document")
)
document_topic_merged_5
```
To conclude this section, if we could assign only one topic to each inaugural speech, what would it be?:
```{r identify each president top topic from 5 with LDA model}
president_topics_5 <- document_topic_merged_5 %>%
group_by(president) %>%
top_n(1, gamma) %>%
select(president, topic)
president_topics_5
```
### 2 topics LDA
However, there were very similar topics. Maybe 5 was a large number. We then reduced the number of topics to 2, with the hypothesis that this would help make the themes of each topic more distinct. The exercise was very similar. First, the model was declared:
```{r fit the model with only 2 topics}
df_lda_2 <- LDA(df_dtm,
k = 2,
control = list(seed = 666))
df_lda_2
```
Then, a probability to each word was assigned:
```{r probabilities of words per each of the 2 topics}
df_lda_topics_2 <- tidy(df_lda_2,
matrix = "beta")
df_lda_topics_2
```
Later, the 20 most probable words for each topic were found. This could help understand what is each topic about.
```{r top 20 words per each of the 2 topics}
top_terms_topic_2 <- df_lda_topics_2 %>%
group_by(topic) %>%
top_n(20, beta) %>%
ungroup() %>%
arrange(topic, -beta)
top_terms_topic_2
```
The following plot reveals which themes are covered in each topic. In a broad sense, although there are some words that are repeated —such as _país_, Perú, _peruanos_, _nacional_, etc.—, it seems that the first topic covers more politics areas and the second one covers just a bit more socio-economic ideas. For example, the first is about _gobierno_, _política_, _patria_, _república_ , and others. The second addresses _salud_ (even if the first topic also covers it), _empresa_ and _pobreza_, _derechos_ and _futuro_ issues. However, the boundaries are unclear.
```{r}
colors2 <- c( "#bf812d", "#01665e")
top_terms_topic_2 %>%
mutate(term = reorder_within(term, beta, topic)) %>%
ggplot(aes(beta, term,
fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic,
scales = "free") +
scale_y_reordered() +
scale_fill_manual(values = colors2) +
theme_minimal() +
ggtitle("Top Terms by Topic",
subtitle = "From 1990 to 2022") +
labs(
x = "Beta Coefficient",
y = "Term",
fill = "Topic"
) +
theme(
plot.title = element_text(
hjust = 0.5,
size = 20,
face = "bold"
),
plot.subtitle = element_text(
hjust = 0.5,
size = 14
),
legend.position = "bottom"
)
```
Below it can be seen which speech (and therefore, which president) could be classified in each cluster. The speeches of Dina Boluarte, Ollanta Humala, Pedro Castillo and Valentín Paniagua were more likely to be classified in the first cluster. The three presidents in the sample who took office with a leftist rhetoric were included in this cluster: Boluarte was the vice-president of Castillo, the rural left-wing rural leader, and Humala entered with a leftist discourse, although his mandate was more right-wing. These are accompanied in the same topic with Valentín Paniagua, the transitional president who was only in office for a few months after Fujimori's resignation and which meant the return of democracy.
On the other hand, the two speeches of dictator Alberto Fujimori (quite obviously), the one by Alejandro Toledo, and the ones by Sagasti, Vizcarra and PPK were more likely to enter the 2nd topic. They were all from the right-wing, neoliberal side of the political spectrum. It is not surprising that the two speeches of Alberto Fujimori entered together in this topic, nor that Sagasti, Vizcarra and PPK were also stick: they three were from the same party. In fact, PPK was elected, but resigned and his vice-president, Vizcarra, took office. Later, he was removed from the presidency and 5 days later Sagasti occupied the position. Regarding their speeches, PPK's focus on economic, business and productivity issues coincides with Fujimori's emphasis on his repeated phrase "the productive revolution of the little ones".
```{r assign each of the 2 topics to the speeches}
document_topic_prob_2 <- tidy(df_lda_2,
matrix = "gamma")
document_topic_prob_2
```
```{r merge the 2 topics with original data}
document_topic_merged_2 <- df %>%
left_join(document_topic_prob_2,
by = c("president" = "document")
)
document_topic_merged_2
```
So, to conclude, the following table summarized more easily which topic each one would belong to, as mentioned above.
```{r identify each president top topic from 2 with LDA model}
president_topics_2 <- document_topic_merged_2%>%
group_by(president) %>%
top_n(1, gamma) %>%
select(president, topic)
president_topics_2
```
## Closing remarks
This analysis is far from exhaustive. However, there are some relevant insights, both in the characterization of the discourses and in the practical way of doing so.
Starting with the latter, because the texts of interest are in Spanish, translating them would have introduced errors. However, the errors that were not introduced by translation could have been introduced by the same alternative packages that were used. The fact that there are more alternatives for analyzing English texts is an undeniable truth, and the ones we used are not specific to Spanish corpora: they are packages like `SentimentAnalysis` and `syuzhet` that have additional options for setting the language to Spanish or some other language. I consider that these might be suboptimal with respect to their accuracy with English terms. In fact, at one point I tried to perform lemmatization and stemming, but I gave up on including it in this report because the words were truncated in a strange way.
Regarding the characterization of the speeches, these are full of platitudes and lack references to more concrete issues. They are more focused on the romantic than on the programmatic. Dina Boluarte's speech is particularly scarce, as it does not even have trigrams with more than 2 appearances, and only two bigrams with more than 2 mentions, which also do not have much meaning. Another aspect, even shocking, that this analysis helped to uncover are the mentions of family planning and human rights issues by Fujimori, in his two terms in office. He, who committed atrocities in these two fields, mentioned them more frequently than others in his inaugural speeches.
Also, in terms of the sentiments of the speeches, one of the packages assigned the vast majority of words as neutral in tone. These results did not seem correct. However, an additional package was able to identify more nuances: more than 1500 tokens were identified as "positive", being this the mode. This was followed by words classified as trust, negative, and fear. The least used were those of disgust and surprise.
Regarding the hypotheses, these techniques more or less helped to understand the contexts of each assumption of office, as in the case of the topics addressed by Fujimori -human rights, family planning-, those of Castillo -health in the midst of the Covid-19 pandemic, education, etc.-, those of PPK -who repeated several times the idea of the Bicentenary of the Independence of Peru in 2021, although in the end he did not arrive in office until that date- or the length of his texts, those of PPK -who repeated several times the idea of the Bicentenary of the Independence of Peru in 2021, although in the end he did not arrive in office until that date- or the length of their texts, as in the case of Boluarte and Paniagua, in whom it is difficult to identify bigrams or trigrams with a greater number of repetitions. Differences can be seen between those who were elected in elections and those who were promoted to the position by succession or designation.
## Bibliography
Mundo, Redacción BBC News. 2020. “La ola de protestas en Perú que dejó 2 muertos y 100 heridos y culminó con la renuncia del presidente.” BBC News Mundo, November 15, 2020. https://www.bbc.com/mundo/noticias-america-latina-54948270.
Olmo, Guillermo. 2022. “6 presidentes en 4 años: por qué Perú es tan difícil de gobernar.” BBC News Mundo, December 8, 2022. https://www.bbc.com/mundo/noticias-america-latina-63898035.
Rebaza, Claudia, and Jack Guy. 2023. “’They Say We’re Not Peruvian’: Protester Deaths Highlight Peru’s Deep Historical Divisions,” March 10, 2023. https://www.cnn.com/2023/03/10/americas/peru-protester-deaths-historical-divisions-intl/index.html.
Silge, Julia, and David Robinson. 2017. Text Mining with R: A Tidy Approach. First edition. Beijing ; Boston: O’Reilly.