Checking on LDA implementation reference #129
Hello @xandaschofield, thanks for the comments. By no means am I an expert on the topic, but you stated the following:
What about this implementation? It's an R package that does both variational inference and Gibbs sampling. I'm asking for your judgment: how close is it to Blei's original implementation?
I think this R implementation should work (in practice, folks I know who have used it have also qualitatively had it work okay for small corpora). MALLET is also a solid implementation of the Gibbs sampling part in Java (https://github.com/maria-antoniak/little-mallet-wrapper gives a wrapper for calling this in Python), but I'm unsure if that's more convenient or if there's a more stable/maintained option for a Python wrapper for it.
Thanks a bunch @xandaschofield for both your time and answer. If you have time, could you please answer these two questions:
Thanks for being so responsive - it's a great question! Depending on the degree to which you want to get into the math, I'm happy to set up a chat to go into this more.

Short version: gensim's implementation makes some design decisions to run faster and with less memory on large (million-document) text collections, but for smaller (ten-thousand-document) text collections those design decisions can result in meaningfully worse models than the classic algorithms.

Longer version: No algorithm for LDA directly solves for the "best" model; instead, there are lots of different strategies for converging to a model that's good. The versions used by the R library are two of the original algorithms. In general, LDA algorithms that use variational inference are likely to converge faster but can get stuck in a less-than-optimal part of the space of possible topic models, while algorithms that use Gibbs sampling are harder to track for convergence and can take a little longer, but seem to be better at exploring the full probability space and sometimes produce slightly better models.

Both of these classic algorithms can be slow and memory-intensive if you have a lot of text (I'd say the threshold is hundreds of thousands or millions of several-hundred-word documents), so more recent work on LDA inference algorithms often focuses on approximations that reduce how much memory is used at a time and allow inference to run faster. The one gensim uses is a nice approximation that considers batches of documents at once (default of 2000 at a time) instead of the full corpus, which can help it run faster! But that performance doesn't come for free: even more than the old VI algorithm, it can get stuck in a suboptimal part of the topic model probability space if it isn't getting lots of new documents as it goes.

In theory, "in the limit", all of these inference algorithms correctly converge to an LDA model. In practice, for smaller corpora (tens of thousands of documents or less), the online algorithm can converge to pretty bad topics. I spend a chunk of my time talking to people about projects where they start using topic models (https://dl.acm.org/doi/10.1145/3701201), and I can say that in practice, for folks with these smaller corpora, there's a substantial qualitative difference in the models they get when they switch from gensim to MALLET or R's topicmodels library.
Understood. I'll give the article you mentioned a read next. I liked the "comparing apples to apples" article too - very well written. Thank you so much, Professor @xandaschofield.
Don't let me assign you homework :). Good luck!
Hello! I wanted to check on something I noticed in the core documentation, specifically the model of LDA used by OCTIS. Right now, you have this listing in the documentation:
LDA (Blei et al. 2003) | https://radimrehurek.com/gensim/
As it turns out, gensim's ldamodel does not implement a classic LDA inference algorithm, but instead uses Online LDA. This approach both uses variational inference (which, anecdotally according to many topic modeling experts, gets worse results than Gibbs sampling) and approximates it in a streaming context (which makes it fast for million-document corpora but yields poor topic quality on smaller corpora). In short, this implementation isn't the classic LDA of Blei et al. (2003).
There aren't a lot of popular true implementations of LDA inference using VI (Blei et al., 2003) or Gibbs sampling (Griffiths and Steyvers, 2004). But given that this library is being used on small corpora in evaluations, it might be good to update the documentation here to reflect the right paper citation. Another option: you already have tomotopy, which to my knowledge is also an approximation (it uses a distributed algorithm from 2009), but my impression is that its distributed Gibbs sampling stays closer to the classic algorithms in topic outcomes on smaller corpora.
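As a back-of-the-envelope illustration of why online variational Bayes struggles on small corpora: it makes one parameter update per minibatch, with Robbins-Monro step sizes rho_t = (tau0 + t)^(-kappa) (Hoffman et al., 2010). The 2000-document minibatch below matches the gensim default mentioned above; the tau0 = 1 and kappa = 0.5 values are one common setting from that paper, and the helper names are mine:

```python
def n_online_updates(n_docs, chunksize=2000):
    """Updates per pass of online VB: one per minibatch (ceiling division)."""
    return -(-n_docs // chunksize)

def step_sizes(n_updates, tau0=1.0, kappa=0.5):
    """Robbins-Monro schedule rho_t = (tau0 + t)**(-kappa)."""
    return [(tau0 + t) ** (-kappa) for t in range(n_updates)]

# A 10k-document corpus gets only 5 noisy updates per pass with the
# default 2000-document minibatch; a 1M-document corpus gets 500,
# which is the regime the approximation was designed for.
small = n_online_updates(10_000)      # 5 updates
large = n_online_updates(1_000_000)   # 500 updates
```

So on a ten-thousand-document corpus, a single pass gives the variational parameters only a handful of large, noisy steps, which is consistent with the poor topics people report there.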
Thanks, and sorry if this is getting in the weeds - happy to provide more information if useful!
Xanda Schofield