BERTopic is a topic modeling technique that leverages 🤗 transformers and c-TF-IDF to create dense clusters, allowing for easily interpretable topics whilst keeping important words in the topic descriptions.
BERTopic supports guided, (semi-)supervised, and dynamic topic modeling. It even supports visualizations similar to LDAvis!
Installation, with sentence-transformers, can be done using PyPI:
```bash
pip install bertopic
```
You may want to install more depending on the transformers and language backends that you will be using. The possible installations are:
```bash
pip install bertopic[flair]
pip install bertopic[gensim]
pip install bertopic[spacy]
pip install bertopic[use]
```
We start by extracting topics from the well-known 20 newsgroups dataset containing English documents:
```python
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups

docs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)
```
After generating topics and their probabilities, we can inspect the most frequent topics:
```python
>>> topic_model.get_topic_info()

Topic   Count   Name
-1      4630    -1_can_your_will_any
0       693     0_windows_drive_dos_file
1       466     1_jesus_bible_christian_faith
2       441     2_space_launch_orbit_lunar
3       381     3_key_encryption_keys_encrypted
```
-1 refers to all outliers and should typically be ignored. Next, let's take a look at the most frequent topic that was generated, topic 0:
```python
>>> topic_model.get_topic(0)

[('windows', 0.006152228076250982),
 ('drive', 0.004982897610645755),
 ('dos', 0.004845038866360651),
 ('file', 0.004140142872194834),
 ('disk', 0.004131678774810884),
 ('mac', 0.003624848635985097),
 ('memory', 0.0034840976976789903),
 ('software', 0.0034415334250699077),
 ('email', 0.0034239554442333257),
 ('pc', 0.003047105930670237)]
```
**NOTE**: Use `BERTopic(language="multilingual")` to select a model that supports 50+ languages.
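For example, a minimal sketch of that option, reusing the `docs` list from the quick start above (any list of documents would do):

```python
from bertopic import BERTopic

# Select an embedding model that covers 50+ languages instead of the
# English default; everything else stays the same as in the quick start.
topic_model = BERTopic(language="multilingual")
topics, probs = topic_model.fit_transform(docs)
```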
For quick access to common functions, here is an overview of BERTopic's main methods:
Method | Code |
---|---|
Fit the model | `.fit(docs)` |
Fit the model and predict documents | `.fit_transform(docs)` |
Predict new documents | `.transform([new_doc])` |
Access single topic | `.get_topic(topic=12)` |
Access all topics | `.get_topics()` |
Get topic frequencies | `.get_topic_freq()` |
Get all topic information | `.get_topic_info()` |
Get representative docs per topic | `.get_representative_docs()` |
Get topics per class | `.topics_per_class(docs, topics, classes)` |
Dynamic Topic Modeling | `.topics_over_time(docs, topics, timestamps)` |
Update topic representation | `.update_topics(docs, topics, n_gram_range=(1, 3))` |
Reduce number of topics | `.reduce_topics(docs, topics, nr_topics=30)` |
Find topics | `.find_topics("vehicle")` |
Save model | `.save("my_model")` |
Load model | `BERTopic.load("my_model")` |
Get parameters | `.get_params()` |
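As a rough sketch of how a few of these methods combine after fitting (it assumes the `topic_model`, `docs`, and `topics` objects from the quick start above, uses the same signatures as the table, and picks illustrative parameter values):

```python
# Search for topics that are semantically close to a search term
similar_topics, similarities = topic_model.find_topics("vehicle")
topic_model.get_topic(similar_topics[0])

# Refine topic representations with a wider n-gram range
topic_model.update_topics(docs, topics, n_gram_range=(1, 3))

# Merge similar topics until roughly 30 remain
topic_model.reduce_topics(docs, topics, nr_topics=30)

# Persist the fitted model and load it back later
topic_model.save("my_model")
loaded_model = BERTopic.load("my_model")
```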
For an overview of BERTopic's visualization methods:
Method | Code |
---|---|
Visualize Topics | `.visualize_topics()` |
Visualize Topic Hierarchy | `.visualize_hierarchy()` |
Visualize Topic Terms | `.visualize_barchart()` |
Visualize Topic Similarity | `.visualize_heatmap()` |
Visualize Term Score Decline | `.visualize_term_rank()` |
Visualize Topic Probability Distribution | `.visualize_distribution(probs[0])` |
Visualize Topics over Time | `.visualize_topics_over_time(topics_over_time)` |
Visualize Topics per Class | `.visualize_topics_per_class(topics_per_class)` |
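These methods return interactive figures; a short sketch, assuming they follow the usual Plotly `Figure` API for display and export:

```python
# Intertopic distance map, roughly comparable to LDAvis
fig = topic_model.visualize_topics()
fig.show()                       # display in a notebook or browser
fig.write_html("topics.html")    # or save as a standalone HTML file

# Bar charts of the top terms per topic
topic_model.visualize_barchart()
```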
To cite BERTopic in your work, please use the following BibTeX reference:
```bibtex
@misc{grootendorst2020bertopic,
  author    = {Maarten Grootendorst},
  title     = {BERTopic: Leveraging BERT and c-TF-IDF to create easily interpretable topics.},
  year      = 2020,
  publisher = {Zenodo},
  version   = {v0.9.4},
  doi       = {10.5281/zenodo.4381785},
  url       = {https://doi.org/10.5281/zenodo.4381785}
}
```