Topic modeling guide #262

dselivanov · 2018-05-26T09:53:21Z

It would be useful to create a comprehensive, practical guide to topic modeling. We now have all the components in place:

- POS tags and lemmatization, thanks to the udpipe package
- coherence measures, thanks to Manuel's work
- fast LDA, thanks to WarpLDA in text2vec
- fast non-negative matrix factorization, thanks to the rsparse package
- multi-word phrase extraction, with several approaches: text2vec::Collocations, udpipe::as_phrasemachine
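To make the POS-filtering and phrase-extraction components concrete, here is a minimal, language-agnostic sketch in plain Python. It is not the text2vec::Collocations or udpipe API. It assumes the annotation step has already produced (lemma, upos) pairs, and the function names, thresholds, and toy data are all hypothetical. Adjacent pairs are scored with pointwise mutual information (PMI), which is only one of several possible collocation scores.

```python
import math
from collections import Counter

# Hypothetical toy input: each document is a list of (lemma, upos) pairs,
# standing in for the output of a POS tagger / lemmatizer.
annotated = [
    [("united", "PROPN"), ("nations", "PROPN"), ("hold", "VERB"),
     ("general", "ADJ"), ("debate", "NOUN")],
    [("the", "DET"), ("united", "PROPN"), ("nations", "PROPN"),
     ("meet", "VERB"), ("today", "NOUN")],
    [("general", "ADJ"), ("assembly", "NOUN"), ("of", "ADP"),
     ("the", "DET"), ("united", "PROPN"), ("nations", "PROPN")],
]

def keep_pos(doc, allowed=("NOUN", "PROPN", "ADJ")):
    """Keep only lemmas whose POS tag is in `allowed`."""
    return [lemma for lemma, upos in doc if upos in allowed]

def find_collocations(docs, min_count=2, min_pmi=1.0):
    """Return {(w1, w2): pmi} for adjacent pairs passing both thresholds."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in docs:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bi
        p_indep = (unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)
        pmi = math.log2(p_pair / p_indep)
        if pmi >= min_pmi:
            scores[(w1, w2)] = pmi
    return scores

def merge_collocations(tokens, collocations, sep="_"):
    """Greedily join adjacent tokens that form a known collocation."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in collocations:
            merged.append(tokens[i] + sep + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

docs = [keep_pos(doc) for doc in annotated]
collocations = find_collocations(docs, min_count=3, min_pmi=1.0)
phrased = [merge_collocations(tokens, collocations) for tokens in docs]
```

Note that filtering by POS before counting pairs makes words adjacent that were not adjacent in the raw text, so the order of the two steps affects which phrases are found.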

Steps

- Find an interesting, non-trivial corpus with a large number of documents.
- Demonstrate how to create a tokenizer that only uses particular POS tags.
- Create a collocation model on top of that.
- Create a document-term matrix using tokens with multi-word expressions.
- Fit several topic models (text2vec::LDA, rsparse::WRMF) with different hyperparameters.
- Cross-validate / compare them using different coherence metrics:
  - demonstrate the use of an external corpus for tcm calculation
  - check how the coherence metrics are correlated (is perplexity correlated with them?)
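As a rough illustration of the comparison step, the sketch below scores a topic's top words against a separate reference corpus, which is the idea behind using an external corpus for the co-occurrence counts. It is plain Python, not text2vec's coherence function. The score is a UMass-style log ratio over document co-occurrence, chosen here only as an example, and the function names and toy data are hypothetical. Pairs whose higher-ranked word never occurs in the reference corpus are skipped, which is a simplification.

```python
import math
from itertools import combinations

def doc_frequencies(docs, words):
    """Count documents containing each word and each unordered word pair."""
    words = set(words)
    single, pair = {}, {}
    for tokens in docs:
        present = sorted(words.intersection(tokens))
        for w in present:
            single[w] = single.get(w, 0) + 1
        for a, b in combinations(present, 2):
            pair[(a, b)] = pair.get((a, b), 0) + 1
    return single, pair

def umass_style_coherence(top_words, reference_docs):
    """Sum of log((D(w_m, w_l) + 1) / D(w_l)) over ranked pairs with l < m.

    top_words must be ordered from most to least probable in the topic.
    Higher (closer to zero) means the words co-occur more often.
    """
    single, pair = doc_frequencies(reference_docs, top_words)
    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            w_m, w_l = top_words[m], top_words[l]
            if single.get(w_l, 0) == 0:
                continue  # simplification: skip words unseen in the reference
            key = tuple(sorted((w_m, w_l)))
            score += math.log((pair.get(key, 0) + 1) / single[w_l])
    return score

# Hypothetical reference corpus (already tokenized) and two candidate topics.
reference = [
    ["peace", "security", "council"],
    ["peace", "security"],
    ["trade", "economy"],
    ["peace", "trade"],
]
coherent_topic = ["peace", "security"]
mixed_topic = ["peace", "economy"]

score_coherent = umass_style_coherence(coherent_topic, reference)
score_mixed = umass_style_coherence(mixed_topic, reference)
```

Computing such scores for every topic of every fitted model, alongside perplexity, would give the table needed to check how the metrics correlate.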

The udpipe package already has good vignettes on topic modeling and phrase extraction. They can be used as inspiration.

@manuelbickel @jwijffels anything we can add to the plan above?

sjankin · 2018-05-26T10:36:11Z

Re a non-trivial corpus with a large number of documents: how about the UN General Debate corpus? It's publicly available from Harvard Dataverse: "UNGDC 1970-2017.zip". Direct link here. It covers country statements in the UN General Debate (presidents, prime ministers, etc.), delivered once per year at the opening of each UN session from 1970 to 2017. In total, 7,897 speeches.

manuelbickel · 2018-05-26T10:59:27Z

We might add some aspects regarding downstream analysis (and maybe visualization depending on the target audience or format of publication).

Regarding downstream analysis we might do (feel free to change/adapt/add):

- Use LDAvis for PCA on topics to understand the main differences between the topics in the corpus.
- Apply cluster analysis to the doc-topic matrix (or topic-term matrix) to highlight commonalities, either with probabilities or with classified probabilities (in the extreme case, 0 and 1).
- Apply network analysis to the doc-topic matrix (or topic-term matrix) to identify relations/co-occurrence between documents or topics (setting a threshold seems reasonable, e.g., only use the three strongest topics per document and set the others to zero).
- Do trend analysis over time on the prevalence of topics (e.g., by calculating the mean topic probability per year and using AICC loess smoothing). This procedure is not perfect, since standard LDA ignores the temporal order of documents. More sophisticated algorithms such as dynamic topic models (see, e.g., Blei) take it into account but are more restrictive. I am not sure what the best practice for trend analysis is at the moment.
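Two of these downstream steps are simple enough to sketch directly: thresholding the doc-topic matrix to the strongest topics per document, and averaging topic probabilities per year as input for a trend plot. The sketch is plain Python with hypothetical function names and toy numbers. It leaves out the loess smoothing and says nothing about how the doc-topic matrix was obtained.

```python
from collections import defaultdict

def keep_strongest(doc_topics, k=3):
    """Zero out all but the k largest topic probabilities in one row.

    Ties are broken in favour of the lower topic index.
    """
    ranked = sorted(range(len(doc_topics)),
                    key=lambda t: doc_topics[t], reverse=True)
    keep = set(ranked[:k])
    return [p if t in keep else 0.0 for t, p in enumerate(doc_topics)]

def yearly_prevalence(years, doc_topic_matrix):
    """Mean topic probability per year: {year: [mean per topic]}."""
    sums = {}
    counts = defaultdict(int)
    for year, row in zip(years, doc_topic_matrix):
        if year not in sums:
            sums[year] = [0.0] * len(row)
        for t, p in enumerate(row):
            sums[year][t] += p
        counts[year] += 1
    return {year: [s / counts[year] for s in sums[year]]
            for year in sorted(sums)}

# Hypothetical doc-topic matrix: 3 documents, 4 topics, rows sum to 1.
years = [1970, 1970, 1971]
doc_topic = [
    [0.5, 0.25, 0.125, 0.125],
    [0.25, 0.25, 0.25, 0.25],
    [0.0, 0.5, 0.5, 0.0],
]

thresholded = [keep_strongest(row, k=2) for row in doc_topic]
prevalence = yearly_prevalence(years, doc_topic)
```

The thresholded rows no longer sum to one, so whether to renormalize them before building a network depends on the analysis.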

jwijffels · 2018-05-26T11:10:50Z

Nice points. I have some time from May 13 onwards to work on this.
I would be interested in having a corpus with the same text in several languages, to show that the flow works for all languages with limited manual intervention. In Belgium we have some open data (20000 records, if I recall) covering all questions/answers in parliament over the last years, but that is only in Dutch and French. It would be nice to have a corpus that also includes English plus some more languages (maybe europarl?).
FYI. I've also added more docs on multi-word phrase extraction at https://bnosac.github.io/udpipe/docs/doc7.html

dselivanov added the help wanted label May 26, 2018
