This script was developed for the analysis of Portuguese. I hope it helps colleagues in the LC area and popularizes the use of R. It is part of our research project developed with CNPq. Please drop me a line if you have any doubts or need any help.
This repository contains the scripts for my article:
- Rodrigo Esteves de Lima Lopes, Beyond the binary: Trans women’s video activism on YouTube, Digital Scholarship in the Humanities, Volume 37, Issue 1, April 2022, Pages 67–80, https://doi.org/10.1093/llc/fqab057
This script is specifically about data processing.
For data processing we are going to need some packages, each with a different function:
- abjutils: diacritic removal in Brazilian Portuguese
- stm: structural topic modelling
- tm, tidytext, tidyverse, magrittr: data manipulation and cleaning
- ggridges: graph plotting
- formattable: table formatting
library(abjutils)
library(tidytext)
library(tidyverse)
library(magrittr)
library(stm)
library(tm)
library(ggridges)
library(formattable)
If you are processing more than one channel, as I did, it will be necessary first to merge them all into a single data frame. I am assuming your corpus has 140 videos per channel, as mine does. So the first step is to identify each video's channel before the data frames are merged; it will avoid a lot of confusion.
# we use a similar code for each channel (from 1 to 4)
channel1 <- rep("channel1", 140)
df.channel1 <- cbind(df.channel1, channel1)
colnames(df.channel1)[colnames(df.channel1) == "channel1"] <- "Channel"
# merge the four identified channels into a single data frame
df.final.all <- rbind(df.channel1, df.channel2, df.channel3, df.channel4)
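A quick sanity check, assuming each channel really has 140 videos, is to tabulate the new Channel column:

# each channel should appear 140 times
table(df.final.all$Channel)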
The next step was to create a data frame with the words. Although it is not necessary for the actual topic modelling, it is nice for us to have a look at our corpus' lexis. Note that the block below filters the stop-word list sw_pt_tm, which is built in the next step, so create that list before running this code.
df.final.all %>%
  unnest_tokens(word, caption) %>%
  # filtering stopwords
  filter(!word %in% sw_pt_tm) %>%
  # counting and sorting
  count(word) %>%
  arrange(desc(n)) %>%
  formattable()
Before counting words or modelling topics, we need to prepare a stop-word list to apply to our corpus. Please notice that some words, like proper names or names of cities and locations that could identify the YouTubers, may pop up during the analysis. In that case, the best option is to add those words to the list and rerun the script.
In this paper, I used the Portuguese stop-word list from the tm package. Note that rm_accent() (from abjutils) removes the diacritic symbols from it.
sw_pt_tm <- tm::stopwords("pt")
sw_pt_tm <- rm_accent(sw_pt_tm)
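If identifying words do surface in the topics, a minimal sketch of how to extend the list before rerunning follows; the words below are placeholders, not from the original script:

# hypothetical identifying words; replace with whatever appears in your corpus
extra_words <- c("nomedocanal", "nomedacidade")
sw_pt_tm <- unique(c(sw_pt_tm, rm_accent(extra_words)))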
Now let us process the corpus itself:
# clean and tokenise the captions, applying our custom stop-word list
proc <- stm::textProcessor(df.final.all$caption, metadata = df.final.all,
                           language = "portuguese",
                           customstopwords = sw_pt_tm)
# build the stm objects, dropping words that appear in fewer than 10 documents
out <- stm::prepDocuments(proc$documents, proc$vocab, proc$meta,
                          lower.thresh = 10)
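prepDocuments() may drop documents that end up empty after cleaning; it is worth checking, because it affects how the model output lines up with the metadata later on (a quick check, not in the original script):

# indices of any captions removed during preparation (empty if none were dropped)
out$docs.removed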
# diagnostic search over candidate numbers of topics (K from 3 to 15)
storage <- stm::searchK(out$documents, out$vocab, K = 3:15,
                        data = out$meta)
# fit the final model with K = 4 topics
fit <- stm::stm(
  documents = out$documents, vocab = out$vocab, data = out$meta, K = 4,
  max.em.its = 75, init.type = "Spectral", verbose = FALSE
)
This will give us the four topics, as in the final paper. Depending on the size of your corpus, the optimal value of K might change, and with it the number of topics. My advice is to keep adjusting K and refitting until the optimal number of topics is subjectively reached.
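The searchK() diagnostics computed above can help with that decision; stm provides a plot method for them (assuming the object is still called storage):

# held-out likelihood, residuals, semantic coherence and lower bound per candidate K
plot(storage)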
In order to check the results and make some adjustments, the following block of code might help:
## More common words in each topic
stm::labelTopics(fit)
#plot(fit, "summary")
## Document-topic probabilities (theta matrix)
plot(fit$theta, type = "p", col = "blue")
head(fit$theta)
view(fit$theta)
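Another check that can help here, not in the original script, is stm::findThoughts(), which prints the documents most strongly associated with a topic; the sketch below assumes the caption column is still present in out$meta:

# the two captions most strongly associated with topic 1
stm::findThoughts(fit, texts = out$meta$caption, topics = 1, n = 2)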
After checking everything, it is time to name the topics, four in our case:
Topic.Names <- c("Relationships",
"Gender",
"Beauty",
"Transition")
Now we are going to extract, for each video, the probability of the topic it fits best:
prob <- apply(fit$theta, 1, max)
Now we are going to associate that probability with a topic name:
Videos.Topic <- Topic.Names[apply(fit$theta, 1, which.max)]
Now it is time to join all these pieces of information into a single data frame. Note that prob and Videos.Topic have one entry per document kept by prepDocuments(), so we build the data frame from out$meta rather than df.final.all to keep the rows aligned in case any documents were dropped:
df_topics <- out$meta %>%
  mutate(best_prob = prob,
         topic = Videos.Topic)
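As a quick way to eyeball the result, we can cross-tabulate channels and topics and look at how confident the model is about each assignment; a sketch using the packages already loaded, not necessarily the figures from the paper:

# videos per topic in each channel
df_topics %>%
  count(Channel, topic) %>%
  formattable()

# distribution of the best-topic probability for each topic
df_topics %>%
  ggplot(aes(x = best_prob, y = topic)) +
  geom_density_ridges()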