
Checking on LDA implementation reference #129

Open
xandaschofield opened this issue Jan 29, 2025 · 6 comments
@xandaschofield

Hello! I wanted to check about something I noticed in the core documentation, specifically referring to the model of LDA used by OCTIS. Right now, you have this listing in the documentation:

LDA (Blei et al. 2003) | https://radimrehurek.com/gensim/

As it turns out, gensim's ldamodel does not implement a classic LDA inference algorithm, but instead uses Online LDA. This approach both uses variational inference (which, anecdotally according to many topic modeling experts, gets worse results than Gibbs sampling) and approximates it in a streaming context (which makes it fast for million-document corpora but provides poor topic quality on smaller ones). In short, this implementation isn't the one described in the cited paper (Blei et al., 2003).

There aren't a lot of popular true implementations of LDA inference using VI (Blei et al., 2003) or Gibbs sampling (Griffiths and Steyvers, 2004). But given that this library is being used on small corpora in evaluations, it might be good to update the documentation here to reflect the right paper citation? Another option: you already have tomotopy, which to my knowledge is also an approximation (it uses a distributed algorithm from 2009), but my impression is that its topic outcomes are closer to those of the classic algorithms on smaller corpora, based on how it does distributed Gibbs sampling.

Thanks, and sorry if this is getting in the weeds - happy to provide more information if useful!

Xanda Schofield

@mr1lazycoder

Hello @xandaschofield, thanks for the comments. By no means am I an expert on the topic, but you stated the following:

There aren't a lot of popular true implementations of LDA inference using VI (Blei et al., 2003) or Gibbs sampling (Griffiths and Steyvers, 2004).

What about this implementation? It's an R package that does both variational inference and Gibbs sampling. I'm asking for your judgment: how close is it to Blei's original implementation?

@xandaschofield
Author

I think this R implementation should work (in practice, folks I know who have used it have also qualitatively had it work okay for small corpora). MALLET is also a solid implementation of the Gibbs sampling part in Java (https://github.com/maria-antoniak/little-mallet-wrapper gives a wrapper for calling this in Python), but I'm unsure if that's more convenient or if there's a more stable/maintained option for a Python wrapper for it.

@mr1lazycoder

Thanks a bunch @xandaschofield for both your time and your answer. If you have time, could you please answer these two questions:

  • What's "wrong" with these different sampling approaches? Why not implement Blei's approach directly?
  • How small does a corpus have to be before these nuances start to matter?

I've come across something similar to OCTIS called topmost, which relies on gensim's implementation too, so I guess the R package should suffice. Thanks again for your help.

@xandaschofield
Author

Thanks for being so responsive - it's a great question! Depending on the degree to which you want to get into the math, I'm happy to set up a chat to go into this more.

Short version: gensim's implementation makes some design decisions to run faster and with less memory on large (million-document) text collections, but for smaller (ten-thousand-document) text collections those design decisions can result in meaningfully worse models than the classic algorithms.

Longer version: No algorithms for LDA directly solve for the "best" model, but there are lots of different strategies for how to converge to a model that's good. The versions used by the R library are two of the original algorithms: in general, LDA algorithms that use variational inference are likely to converge faster but can get stuck in a less-than-optimal space of possible topic models, while algorithms that use Gibbs Sampling are harder to track for convergence and can take a little longer but seem to be better at exploring the full probability space and sometimes produce slightly better models.
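To make the Gibbs-sampling side concrete: the collapsed Gibbs sampler of Griffiths and Steyvers (2004) is compact enough to sketch in pure Python. This is a toy-scale illustration, not an efficient implementation, and the function and variable names are mine:

```python
import random

def lda_gibbs(docs, num_topics, vocab, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA (Griffiths & Steyvers, 2004).

    docs: list of documents, each a list of word ids in range(len(vocab)).
    Returns the topic-word count table, from which topic distributions follow.
    """
    rng = random.Random(seed)
    V = len(vocab)
    n_dk = [[0] * num_topics for _ in docs]      # doc-topic counts
    n_kw = [[0] * V for _ in range(num_topics)]  # topic-word counts
    n_k = [0] * num_topics                       # topic totals

    # Random initial topic assignment for every token
    z = []
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kw[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    # Gibbs sweeps: resample each token's topic from its full conditional
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                # Remove the current assignment from the counts
                n_dk[d][k] -= 1; n_kw[k][w] -= 1; n_k[k] -= 1
                # p(k) ∝ (n_dk + alpha) * (n_kw + beta) / (n_k + V*beta)
                weights = [
                    (n_dk[d][t] + alpha) * (n_kw[t][w] + beta) / (n_k[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1; n_kw[k][w] += 1; n_k[k] += 1
    return n_kw
```

Every token's topic is resampled conditioned on all the others, which is exactly why the method explores the space well but is slow on large corpora: each sweep touches every token in the collection.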

Both of these classic algorithms can be slow and memory-intensive if you have a lot of text (I'd say this threshold would be hundreds of thousands or millions of several-hundred-word documents), so more recent work on LDA inference algorithms often focuses on how to make some approximations to reduce how much memory is getting used at a time and to allow inference to run faster. The one gensim uses is a nice approximation that considers batches of documents at once (default of 2000 at a time) instead of the full corpus, which can help it run faster! But that performance doesn't come for free - even more than the old VI algorithm, it can get stuck in a suboptimal part of the topic model probability space if it isn't getting lots of new documents as it goes.

In theory, "in the limit", all these inference algorithms correctly converge to an LDA model. In practice, for smaller corpora (tens of thousands of documents or less) the online algorithm can converge to pretty bad topics. I spend a chunk of my time talking to people about projects where they're starting to use topic models (https://dl.acm.org/doi/10.1145/3701201). I can say that in practice, for folks with these smaller corpora, there's a substantial qualitative difference in the models they get when they switch from gensim to MALLET or R's topicmodels library.

@mr1lazycoder

Understood. I'll give the article you mentioned a read next. I liked the "comparing apples to apples" article too; very well written. Thank you so much, Professor @xandaschofield.

@xandaschofield
Author

Don't let me assign you homework :). Good luck!
