Historical analysis of Reddit discourse on LGBT issues in LDS-related subreddits.
A personal project where I tested out some new tools like Dagster and DuckDB.
- Dagster: data orchestration. A more modern alternative to Airflow; Dagster does most of the heavy lifting in this project
- dbt: data build tool
- DuckDB: an embedded analytical database. Lightweight enough to run locally alongside my pipelines, nimble enough for online analytical processing (OLAP)
- OpenAI API: text embeddings and summaries of the clusters
- Poetry: dependency manager. If I ever come back to this project 3 years from now, when all the dependencies are out of date and pip has updated how it resolves conflicts, Poetry will
- scikit-learn: K-means clustering
- Streamlit: creating and hosting (for free!) an interactive data visualization
- Arctic Shift API: source of historic reddit data. No money went towards buying Reddit data in this project.
This project uses Dagster to create a pipeline that:
- Pulls all available historic data from LDS-related subreddits from a data dump API
- Narrows the posts and comments down to the ones related to LGBT+ issues
- Puts those through OpenAI's API to get text embeddings
- Does K-means clustering on those embeddings
- Uses OpenAI chat API to summarize each cluster
- Creates an interactive Streamlit app to see cluster frequencies over time
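As a rough illustration of the narrowing step in the list above, here is a minimal keyword-filter sketch. The keyword list, the `text` field name, and the function names are all assumptions made for this example; the project's actual filter may work differently.

```python
import re

# Hypothetical keyword list; the project's real filter terms are not shown here.
KEYWORDS = ["lgbt", "gay", "lesbian", "bisexual", "transgender", "queer", "same-sex"]

# One case-insensitive pattern. The leading \b avoids matching inside other
# words; leaving the right side open lets "lgbt" also match "lgbtq".
PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, KEYWORDS)) + r")", re.IGNORECASE)


def is_relevant(text: str) -> bool:
    """Return True when the text mentions at least one keyword."""
    return PATTERN.search(text) is not None


def filter_relevant(rows: list[dict]) -> list[dict]:
    """Keep only rows whose (assumed) 'text' field is relevant."""
    return [row for row in rows if is_relevant(row.get("text", ""))]
```

For example, `filter_relevant([{"text": "Thoughts on LGBTQ members?"}, {"text": "Favorite hymn?"}])` keeps only the first row.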
You can interact with the Streamlit website here. Clicking on a cluster will show the summary and some sample Reddit comments from the cluster.
- Many of the clusters seemed to respond to current events like The Boy Scouts of America rescinding the long-standing ban on openly homosexual youth in the program, publicized suicides and studies on gay suicide, and California's Prop 8.
- Spikes in calls for no personal attacks immediately follow spikes in sarcastic comments.
- Predictably, r/exmormon had the most discussion on leaving the faith.
- The "homosexuality is a sin" frequency did not decline over the last decade, with spikes as recently as summer 2021.
- Many of the comments were quite intimate, sharing personal experiences and kindly advice.
- Polygamy comparisons held steady over time.
- OpenAI's embeddings tended to cluster similar but opposing statements very closely. E.g. "I believe in gay marriage" was closer to "I don't believe in gay marriage" than it was to "I support human rights". This made it relatively unhelpful for analysis on support and opposition for LGBT issues.
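To make the last finding concrete: closeness between embeddings is commonly measured with cosine similarity. The sketch below uses tiny hand-made 2-D vectors, NOT real OpenAI embeddings, where the first axis stands in for "topic" and the second for "stance". It only illustrates how two opposing statements on the same topic can score closer than two agreeing statements on different topics.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-made placeholder vectors (assumption: axis 0 ~ topic, axis 1 ~ stance).
support_marriage = [1.0, 0.2]   # "I believe in gay marriage"
oppose_marriage = [1.0, -0.2]   # "I don't believe in gay marriage"
human_rights = [0.2, 1.0]       # "I support human rights"

same_topic = cosine_similarity(support_marriage, oppose_marriage)
same_stance = cosine_similarity(support_marriage, human_rights)
```

With these toy vectors the same-topic pair scores higher than the same-stance pair, which mirrors the behavior described above when topic dominates the embedding.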
This project tested out useful patterns for many Dagster concepts, including:
Software-defined assets - An asset is a software object that models a data asset. The prototypical example is a table in a database or a file in cloud storage.
This project contains three asset groups:
- Reddit: Retrieves reddit posts and comments from LDS subreddits, and filters it down to media relevant to LGBTQ+ issues.
- OpenAI: Creates OpenAI embeddings for reddit posts and comments, performs K-means clustering, and generates cluster summaries using the ChatGPT API.
- Visualization: Exports a database to be used in the final streamlit visualization.
Resources - A resource is an object that models a connection to a (typically) external service. Resources can be shared between assets, and different implementations of resources can be used depending on the environment. In this project, I built multiple OpenAI API resources, all of which have the same interface but different implementations:
- `OpenAIClientResource` interacts with the OpenAI API and gets the full text embeddings or text summaries, which will be used in the full pipeline.
- `OpenAISubsampleClientResource` talks to the real API but subsamples the data, which is much faster and cheaper than the normal implementation and is great for demoing purposes.
The way Dagster models resources helps separate the business logic in code from environments, e.g. you can easily switch resources without changing your pipeline code.
I/O managers - An I/O manager is a special kind of resource that handles storing and loading assets. This project includes:
- FilesystemIOManager: stores outputs as files in your local file system. It minimizes setup difficulty and is useful for local development.
- DuckDBIOManager: Dagster provided DuckDB I/O manager that can store and load data as Pandas or PySpark DataFrames. Useful for local development since DuckDB runs locally and requires minimal setup. In this example, a DuckDB I/O manager that can handle Pandas and PySpark DataFrames is built using the function.
Testing - All Dagster entities are unit-testable. This project includes lightweight invocations in unit tests, including:
- Testing assets by directly invoking the -decorated functions. Read more about testing assets on the Testing page.
- Testing I/O managers by mocking the I/O and constructing and with the mocks. Check out Testing an IO manager to learn more.
One of my favorite things about Dagster is how easy it is to have different resource configurations for each environment. This project uses two:
- A production deployment, which uses a DuckDB I/O manager, and uses all 20-ish years of Reddit posts and comments.
- A local deployment, which stores assets in the local filesystem and reduces costs by using a smaller time range and subsampling how much data is sent to the OpenAI API.
Having a lighter environment for testing and a fuller environment for the final result helped keep this project under budget (which was about $15 in OpenAI credits).
By default, it will load for the local deployment. You can toggle deployments by setting the `DAGSTER_DEPLOYMENT` env var to `production` or `local`.
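One common way to wire up this kind of toggle is a dictionary keyed by deployment name. The sketch below uses plain strings as stand-ins for the real resource objects, and the dictionary layout and function names are assumptions rather than the project's actual code.

```python
import os

# Stand-in settings per deployment (strings instead of real resource objects).
RESOURCES_BY_DEPLOYMENT = {
    "local": {"io_manager": "filesystem", "openai_client": "subsample"},
    "production": {"io_manager": "duckdb", "openai_client": "full"},
}


def get_deployment_name() -> str:
    """Read DAGSTER_DEPLOYMENT, falling back to the local deployment."""
    name = os.getenv("DAGSTER_DEPLOYMENT", "local")
    if name not in RESOURCES_BY_DEPLOYMENT:
        raise ValueError(f"Unknown deployment {name!r}; use 'local' or 'production'.")
    return name


def get_resources() -> dict:
    """Return the resource settings for the active deployment."""
    return RESOURCES_BY_DEPLOYMENT[get_deployment_name()]
```

Failing loudly on an unknown name avoids silently running the expensive production configuration because of a typo.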
This project uses OpenAI's API to create text embeddings of the reddit posts and comments, then scikit-learn for K-means clustering, then OpenAI's text generation to produce summaries and titles for each of the clusters. Many of the clusters were vaguely titled, redundant, or irrelevant (like meta-commentary complaining about Reddit), so I omitted those from the final visualization.
The elbow method for finding the optimal K for K-means clustering didn't give a clear answer, at least partially because Reddit threads (and human conversations in general) don't inherently form tight clusters. I ended up choosing 32 clusters for the final result.
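For reference, the elbow method boils down to fitting K-means for several values of K and looking for the point where inertia (within-cluster sum of squares) stops dropping quickly. Here is a minimal sketch with scikit-learn; the data is a synthetic stand-in with three obvious blobs, not the project's embeddings, and `inertia_by_k` is a name made up for this example.

```python
import numpy as np
from sklearn.cluster import KMeans


def inertia_by_k(embeddings, k_values, seed=0):
    """Fit K-means once per K and collect the resulting inertia."""
    inertias = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        model.fit(embeddings)
        inertias[k] = float(model.inertia_)
    return inertias


# Synthetic stand-in for real embeddings: three well-separated blobs in 8 dimensions.
rng = np.random.default_rng(0)
centers = rng.normal(scale=10.0, size=(3, 8))
embeddings = np.vstack([center + rng.normal(size=(50, 8)) for center in centers])

inertias = inertia_by_k(embeddings, k_values=range(1, 7))
for k, value in inertias.items():
    print(k, round(value, 1))
```

On clean data like this the curve bends sharply at the true number of blobs; on real conversation embeddings the bend tends to be much softer, which matches the experience described above.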
OpenAI models were chosen with minimizing cost as the top priority, picking the most accurate option for the task within that constraint.
Once Poetry is installed (instructions here), install this project's requirements with
poetry install
New packages can be added with `poetry add`, e.g.
poetry add requests pendulum
In order to run the OpenAI jobs, this project requires an API key with OpenAI.
⚠️ Warning: OpenAI does charge money to use its API.
Once you have credentials, you can store them in a `.env` file.
# .env
OPENAI_API_KEY=yourapikey
⚠️ Warning: Don't upload .env files to git; they contain secrets.
Using environment variables to provide secrets ensures sensitive info won't be visible in your code or the launchpad in the UI. This project follows Dagster's best practices for handling secrets through configuration and resources.
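As a small sketch of the idea, code can read the key from the environment at call time instead of hard-coding it. The helper name `get_openai_api_key` is made up for this example, and it assumes the `.env` values have already been loaded into the process environment.

```python
import os


def get_openai_api_key() -> str:
    """Return the OpenAI key from the environment, or fail with a clear message."""
    key = os.getenv("OPENAI_API_KEY")
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; add it to your .env file or shell environment."
        )
    return key
```

Raising early with a readable message is friendlier than letting an API call fail later with an authentication error.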
Once the requirements are installed, you can start the Dagster UI web server:
dagster dev
Open http://localhost:3000 with your browser to see the project.
Dagster assets are stored in `mormon_queer_analysis/assets.py`. The assets are automatically loaded into the Dagster code location as you define them.
Tests are in the `mormon_queer_analysis_tests` directory and you can run tests using `pytest`:
pytest mormon_queer_analysis_tests