Topic modeling guide #262
Comments
Regarding a non-trivial corpus with a large number of documents, how about the UN General Debate corpus? It is publicly available from Harvard Dataverse as "UNGDC 1970-2017.zip". Direct link here. It covers country statements in the UN General Debate (by presidents, prime ministers, etc.), delivered once per year at the opening of each UN session from 1970 to 2017, for a total of 7,897 speeches.
We might add some aspects regarding downstream analysis (and maybe visualization depending on the target audience or format of publication). Regarding downstream analysis we might do (feel free to change/adapt/add):
Nice points. I have some time from May 13 onwards to work on this.
It will be useful to create a comprehensive practical guide for topic modeling. Now we have all components in place:
- udpipe package
- coherence measures (thanks to Manuel's work)
- rsparse package
- text2vec::Collocations, udpipe::as_phrasemachine

Steps
- text2vec::LDA, rsparse::WRMF with different hyperparameters
- tcm calculation

There are already good vignettes in the udpipe package on topic modeling and phrase extraction. They can be used as inspiration.
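As a rough illustration of how the model-fitting, tcm, and coherence pieces might fit together, here is a minimal sketch. It assumes the built-in `movie_review` data as a stand-in for a larger corpus, a hypothetical grid of topic counts (`n_topics_grid`), and a document-level tcm built from the same corpus. The prior values and iteration count are arbitrary choices for the example, and `rsparse::WRMF` is not covered here.

```r
library(Matrix)
library(text2vec)

# Small built-in corpus, used here only as a stand-in for a larger one.
data("movie_review")
n_docs <- 500
tokens <- word_tokenizer(tolower(movie_review$review[seq_len(n_docs)]))
it <- itoken(tokens, progressbar = FALSE)

# Vocabulary and document-term matrix.
v <- create_vocabulary(it)
v <- prune_vocabulary(v, term_count_min = 5, doc_proportion_max = 0.2)
dtm <- create_dtm(it, vocab_vectorizer(v))

# Document-level term co-occurrence matrix from the same corpus
# (an intrinsic reference; an external corpus may be preferable).
tcm <- crossprod(sign(dtm))

# Hypothetical grid of topic counts to compare (an assumption of this sketch).
n_topics_grid <- c(5, 10, 20)

set.seed(1)
results <- lapply(n_topics_grid, function(k) {
  # Prior values are arbitrary illustrations, not recommendations.
  lda <- LDA$new(n_topics = k, doc_topic_prior = 0.1, topic_word_prior = 0.01)
  doc_topic <- lda$fit_transform(dtm, n_iter = 50, progressbar = FALSE)
  top_words <- lda$get_top_words(n = 10, lambda = 1)
  # Coherence of each topic's top words, averaged over topics.
  coh <- coherence(top_words, tcm, metrics = "mean_npmi", n_doc_tcm = n_docs)
  data.frame(n_topics = k, mean_npmi = mean(coh, na.rm = TRUE))
})
results <- do.call(rbind, results)
print(results)
```

The resulting table gives one averaged coherence value per topic count, which could serve as one input (among others) when choosing hyperparameters.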
@manuelbickel @jwijffels anything we can add to the plan above?