BERTopic Easy

This library aims to reduce the development time needed to cluster documents into topics, at least for a prototype.

Caution

This library is in early development. It is not ready for production use.

The library has been tested on 2,500 sentences. A smell test on 10,000 sentences seems to pass, but of course the topic quality at that scale is unknown, so be cautious and evaluate the results carefully.

The approach here is to use the clustering algorithm from BERTopic (HDBSCAN by default) along with OpenAI's o3-mini model to name the clusters and classify outliers.

Motivations

Topic modeling is a time-consuming development task. I did not find any tools that helped me quickly produce quality topics for my prototype. The BERTopic library is a great tool, but its many options make it hard to use.
OpenAI's o3-mini names clusters well and reduces outliers better than BERTopic's default method.

Example usage

import os

from dotenv import load_dotenv
from rich import print

from bertopic_easy import bertopic_easy

load_dotenv()
openai_api_key = os.environ["OPENAI_API_KEY"]

texts = [
    "16/8 fasting",
    "16:8 fasting",
    "24-hour fasting",
    "24-hour one meal a day (OMAD) eating pattern",
    "2:1 ketogenic diet, low-glycemic-index diet",
    "30-day nutrition plan",
    "36-hour fast",
    "4-day fast",
    "40 hour fast, low carb meals",
    "4:3 fasting",
    "5-day fasting-mimicking diet (FMD) program",
    "7 day fast",
    "84-hour fast",
    "90/10 diet",
    "Adjusting macro and micro nutrient intake",
    "Adjusting target macros",
    "Macro and micro nutrient intake",
    "AllerPro formula",
    "Alternate Day Fasting (ADF), One Meal A Day (OMAD)",
    "American cheese",
    "Atkin's diet",
    "Atkins diet",
    "Avoid seed oils",
    "Avoiding seed oils",
    "Limiting seed oils",
    "Limited seed oils and processed foods",
    "Avoiding seed oils and processed foods",
]

clusters = bertopic_easy(
    texts=texts,
    openai_api_key=openai_api_key,
    reasoning_effort="low",  # low, medium, high ... slow, slower, slowest
    subject="personal diet intervention outcomes",
)
print(clusters)

Example output

What's happening under the hood? The three steps...

This is an opinionated hybrid approach to topic modeling that combines embeddings and LLM completions. The embeddings are used for clustering; the LLM completions are used for naming clusters and classifying outliers.

graph TD;
    A[Start] -->|sentences| B{1.Run BERTopic};
    B -->|clusters| C[2.Name clusters];
    C -->|target classifications| D;
    B -->|outliers| D[3.Classify and merge outliers];

Step 1 - Cluster sentences

The BERTopic library clusters the sentences using embeddings from OpenAI's text-embedding-3-large model.
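
As a rough illustration of what this step hands to the next ones: BERTopic assigns each document an integer topic id and uses -1 for outliers. The sketch below is not this library's code. `split_clusters_and_outliers` is a hypothetical helper, and the topic ids are hard-coded rather than computed from embeddings.

```python
from collections import defaultdict


def split_clusters_and_outliers(texts, topic_ids):
    """Group texts by topic id; BERTopic marks outliers with the id -1."""
    if len(texts) != len(topic_ids):
        raise ValueError("texts and topic_ids must have the same length")
    clusters = defaultdict(list)
    outliers = []
    for text, topic_id in zip(texts, topic_ids):
        if topic_id == -1:
            outliers.append(text)
        else:
            clusters[topic_id].append(text)
    return dict(clusters), outliers


# Hard-coded ids standing in for the per-document topics a fitted model returns.
texts = ["16/8 fasting", "7 day fast", "Atkins diet", "American cheese"]
topic_ids = [0, 0, 1, -1]

clusters, outliers = split_clusters_and_outliers(texts, topic_ids)
print(clusters)
print(outliers)
```

Keeping the outliers in a separate list makes it easy to pass them to a later classification step without touching the clusters themselves.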

Step 2 - Name clusters

The o3-mini model generates a name for each cluster produced in Step 1.
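
To make the naming step more concrete, here is a minimal sketch of how a prompt for one cluster might be assembled. The function `build_naming_prompt`, its `max_examples` parameter, and the prompt wording are all assumptions made for illustration; the prompt this library actually sends may look quite different. No API call is made here.

```python
def build_naming_prompt(subject, cluster_texts, max_examples=20):
    """Build a plain-text prompt asking an LLM for one short cluster name."""
    # Cap the number of examples so very large clusters keep the prompt small.
    examples = cluster_texts[:max_examples]
    bullet_list = "\n".join(f"- {text}" for text in examples)
    return (
        f"The following phrases are about {subject}.\n"
        "They were grouped into one cluster by an embedding-based method.\n"
        "Reply with a short, specific name for the cluster.\n\n"
        f"{bullet_list}"
    )


prompt = build_naming_prompt(
    subject="personal diet intervention outcomes",
    cluster_texts=["Avoid seed oils", "Avoiding seed oils", "Limiting seed oils"],
)
print(prompt)
```

Passing the `subject` into the prompt mirrors the `subject` argument in the usage example and gives the model context for short, ambiguous phrases.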

Step 3 - Re-group outliers (not implemented yet)

Outlier sentences, those that did not fit into any of the BERTopic clusters from Step 1, are classified by o3-mini against the cluster names from Step 2.
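
The merge described above could be sketched as follows. This is only an illustration under stated assumptions: `merge_outliers` is a hypothetical helper, and `keyword_classifier` is a toy stand-in for the LLM call so that the example runs offline.

```python
def merge_outliers(named_clusters, outliers, classify):
    """Assign each outlier to a named cluster, or keep it as an outlier.

    `classify(text, names)` stands in for an LLM call. It should return one
    of `names`, or None when no name fits.
    """
    # Copy the lists so the caller's clusters are not mutated.
    merged = {name: list(texts) for name, texts in named_clusters.items()}
    names = list(merged)
    remaining = []
    for text in outliers:
        choice = classify(text, names)
        if choice in merged:
            merged[choice].append(text)
        else:
            remaining.append(text)
    return merged, remaining


def keyword_classifier(text, names):
    """Toy stand-in for the LLM: pick the first name sharing a word with text."""
    words = set(text.lower().split())
    for name in names:
        if words & set(name.lower().split()):
            return name
    return None


named_clusters = {
    "Fasting protocols": ["16/8 fasting", "7 day fast"],
    "Seed oils": ["Avoid seed oils"],
}
outliers = ["4:3 fasting", "American cheese"]

merged, remaining = merge_outliers(named_clusters, outliers, keyword_classifier)
print(merged)
print(remaining)
```

Letting the classifier return None keeps sentences that match no cluster name as outliers instead of forcing them into a poor fit.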

Install

Pre-requisites

python = ">=3.11,<3.13"

pip install bertopic-easy
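
If you are unsure whether your interpreter falls inside the supported range, a quick check like the one below may help. `python_version_supported` is a hypothetical helper written for this README, not part of the package; it only encodes the `>=3.11,<3.13` constraint stated above.

```python
import sys


def python_version_supported(version_info=None):
    """Return True if the (major, minor) version satisfies >=3.11,<3.13."""
    if version_info is None:
        version_info = sys.version_info
    major_minor = tuple(version_info[:2])
    return (3, 11) <= major_minor < (3, 13)


# Check the interpreter that is running this snippet.
print(python_version_supported())
```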

Some BERTopic FAQs

Why does it take so long to import BERTopic?

Pointers for contributing developers

Run a smoke test

git clone git@github.com:borisdev/bertopic-easy.git
cd bertopic-easy
pip install -e .  # editable install; alternatively run: poetry install
# set the OPENAI_API_KEY in the code or as an environment variable
poetry run pytest tests/test_main.py::test_bertopic_easy
# remember it takes a while to import the bertopic library

Make a tiny PR so I can see how I can help you get started.
