mormon_queer_analysis

Historical analysis of reddit discourse regarding LGBT issues on the LDS-related subreddits.

A personal project where I tested out some new tools like Dagster and DuckDB.

The stack

Dagster: data orchestration. A more modern Airflow alternative, Dagster's doing most of the heavy lifting in this project
dbt: data build tool
DuckDB: an embedded analytical database. Lightweight enough to run locally alongside my pipelines, nimble enough for online analytical processing (OLAP)
OpenAI API: text embeddings and summaries of the clusters
Poetry: dependency manager. If I ever come back to this project 3 years from now, when all the dependencies are out of date and pip has updated how it resolves conflicts, Poetry will
scikit-learn: K-means clustering
Streamlit: creating and hosting (for free!) an interactive data visualization
Arctic Shift API: source of historic reddit data. No money went towards buying Reddit data in this project.

Overview

This project uses Dagster to create a pipeline that:

Pulls all available historic data from LDS-related subreddits from a data dump API
Narrows the posts and comments down to the ones related to LGBT+ issues
Puts those through OpenAI's API to get text embeddings
Does K-means clustering on those embeddings
Uses OpenAI chat API to summarize each cluster
Creates an interactive streamlit app to see cluster frequencies over time
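
The narrowing step in the list above could be as simple as a whole-word keyword match. This is a minimal sketch under that assumption, not the project's actual filter: the keyword list and the names `is_relevant` and `filter_relevant` are hypothetical.

```python
import re

# Hypothetical keyword list; the project's real filter terms are not shown in this README.
KEYWORDS = ["lgbt", "lgbtq", "gay", "lesbian", "bisexual", "transgender", "queer", "same-sex"]

# \b word boundaries keep a keyword from matching inside a longer word.
PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def is_relevant(text: str) -> bool:
    """Return True if the text mentions any keyword as a whole word."""
    return PATTERN.search(text) is not None


def filter_relevant(texts: list[str]) -> list[str]:
    """Keep only the texts that mention at least one keyword."""
    return [t for t in texts if is_relevant(t)]
```

A whole-word match is cheap and easy to audit, but it misses plurals and paraphrases, so a real filter would likely need a longer term list.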

Results

You can interact with the Streamlit website here. Clicking on a cluster will show the summary and some sample reddit comments from the cluster.

Interesting findings:

Many of the clusters seemed to respond to current events like the Boy Scouts of America rescinding the long-standing ban on openly homosexual youth in the program, publicized suicides and studies on gay suicide, and California's Prop 8.
Spikes in calls for no personal attacks immediately follow spikes in sarcastic comments.
Predictably, r/exmormon had the most discussion on leaving the faith.
The "homosexuality is a sin" frequency did not decline over the last decade, with spikes as recently as summer 2021.
Many of the comments were quite intimate, sharing personal experiences and kindly advice.
Polygamy comparisons held steady over time.
OpenAI's embeddings tended to cluster similar but opposing statements very closely. E.g. "I believe in gay marriage" was closer to "I don't believe in gay marriage" than it was to "I support human rights". This made it relatively unhelpful for analysis on support and opposition for LGBT issues.
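
Closeness between embeddings is commonly measured with cosine similarity. The sketch below uses tiny made-up 3-dimensional vectors (real embeddings have far more dimensions, and these numbers are not taken from the project) purely to illustrate how a topic-dominated embedding can place a statement next to its negation.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors invented for illustration:
# axis 0 ~ "about gay marriage", axis 1 ~ "about human rights",
# axis 2 ~ stance (positive = supportive, negative = opposed).
support_marriage = [0.9, 0.1, 0.2]
oppose_marriage = [0.9, 0.1, -0.2]
support_rights = [0.1, 0.9, 0.2]

sim_negation = cosine_similarity(support_marriage, oppose_marriage)
sim_related = cosine_similarity(support_marriage, support_rights)
```

In this toy setup the two marriage statements share almost all of their magnitude on the topic axis, so flipping the small stance component barely changes the angle between them.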

Project structure and highlights

This project tested out useful patterns for many Dagster concepts, including:

Organizing assets in groups

Software-defined assets - An asset is a software object that models a data asset. The prototypical example is a table in a database or a file in cloud storage.

This project contains three asset groups:

Reddit: Retrieves reddit posts and comments from LDS subreddits, and filters it down to media relevant to LGBTQ+ issues.
OpenAI: Creates OpenAI embeddings for reddit posts and comments, performs K-means clustering, and generates cluster summaries using the ChatGPT API.
Visualization: Exports a database to be used in the final streamlit visualization.

Varying external services or I/O without changing your DAG

Resources - A resource is an object that models a connection to a (typically) external service. Resources can be shared between assets, and different implementations of resources can be used depending on the environment. This project has two OpenAI client resources, which share the same interface but have different implementations:

OpenAIClientResource interacts with the OpenAI API and gets the full text embeddings or text summaries, which will be used in the full pipeline.
OpenAISubsampleClientResource talks to the real API but subsamples the data, which is much faster and cheaper than the normal implementation and is great for demoing purposes.

The way Dagster models resources helps separate the business logic in code from environments, e.g. you can easily switch resources without changing your pipeline code.
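
The idea of swapping implementations behind one interface can be sketched in plain Python. The class and function names here are hypothetical, the "embedding" is a dummy value instead of an API call, and in a Dagster project such classes would typically subclass `ConfigurableResource`, which is left out so the sketch has no dependencies.

```python
import random


class FakeEmbeddingClient:
    """Stand-in for a client that would call the OpenAI embeddings API."""

    def embed(self, texts: list[str]) -> dict[str, list[float]]:
        # A real client would make a network call here; this returns a dummy vector.
        return {t: [float(len(t))] for t in texts}


class SubsampleEmbeddingClient(FakeEmbeddingClient):
    """Same interface, but only embeds a random fraction of the input."""

    def __init__(self, fraction: float, seed: int = 0):
        self.fraction = fraction
        self.seed = seed

    def embed(self, texts: list[str]) -> dict[str, list[float]]:
        # Keep at least one text, but never more than are available.
        k = min(len(texts), max(1, int(len(texts) * self.fraction)))
        sample = random.Random(self.seed).sample(texts, k)
        return super().embed(sample)


def embed_comments(texts: list[str], client: FakeEmbeddingClient) -> dict[str, list[float]]:
    """Pipeline code depends only on the shared embed() interface."""
    return client.embed(texts)
```

Calling `embed_comments(texts, SubsampleEmbeddingClient(fraction=0.2))` instead of passing `FakeEmbeddingClient()` changes how much data is processed without touching `embed_comments` itself.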

I/O managers - An I/O manager is a special kind of resource that handles storing and loading assets. This project includes:

FilesystemIOManager: stores outputs as files in your local file system. It minimizes setup difficulty and is useful for local development.
DuckDBIOManager: Dagster-provided DuckDB I/O manager that can store and load data as Pandas or PySpark DataFrames. Useful for local development since DuckDB runs locally and requires minimal setup. In this example, a DuckDB I/O manager that can handle Pandas and PySpark DataFrames is built using the function.

Testing

Testing - All Dagster entities are unit-testable. This project includes lightweight invocations in unit tests, including:

Testing assets by directly invoking the -decorated functions. Read more about testing assets on the Testing page.
Testing I/O managers by mocking the I/O and constructing and with the mocks. Check out Testing an IO manager to learn more.

Environments

One of my favorite things about Dagster is how easy it is to have different resource configurations for each environment. This project uses two:

A production deployment, which uses a DuckDB I/O manager, and uses all 20-ish years of Reddit posts and comments.
A local deployment, which stores assets in the local filesystem and reduces costs by using a smaller time range and subsampling how much data is sent to the OpenAI API.

Having a lighter environment for testing and a fuller environment for the final result helped keep this project under budget (which was about $15 in OpenAI credits).

By default, it will load for the local deployment. You can toggle deployments by setting the DAGSTER_DEPLOYMENT env var to production or local.
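
A minimal sketch of this kind of toggle, assuming a plain dictionary keyed by deployment name. Only the `DAGSTER_DEPLOYMENT` variable and its two values come from this README; the settings keys and the `get_deployment_settings` helper are hypothetical.

```python
import os

# Hypothetical per-deployment settings; placeholders, not the project's real config.
SETTINGS_BY_DEPLOYMENT = {
    "local": {"io_manager": "filesystem", "subsample_openai": True},
    "production": {"io_manager": "duckdb", "subsample_openai": False},
}


def get_deployment_settings(environ=None) -> dict:
    """Pick settings from DAGSTER_DEPLOYMENT, falling back to 'local'."""
    environ = os.environ if environ is None else environ
    name = environ.get("DAGSTER_DEPLOYMENT", "local")
    if name not in SETTINGS_BY_DEPLOYMENT:
        raise ValueError(f"Unknown DAGSTER_DEPLOYMENT: {name!r}")
    return SETTINGS_BY_DEPLOYMENT[name]
```

Failing loudly on an unknown value avoids silently running the expensive configuration because of a typo.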

ML models used

This project uses OpenAI's API to create text embeddings of the reddit posts and comments, then scikit-learn for K-means clustering, then OpenAI's text generation to produce summaries and titles for each of the clusters. Many of the clusters were vaguely titled, redundant, or irrelevant (like meta-commentary complaining about Reddit), so I omitted those from the final visualization.

The elbow method for finding the optimal K for K-means clustering wasn't very clear, at least partially because reddit threads (and human conversations in general) don't inherently have tight clusters. I ended up choosing 32 clusters for the final result.
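
For reference, the elbow method amounts to fitting K-means for several values of K and looking for the point where inertia (the within-cluster sum of squared distances) stops dropping sharply. The sketch below runs on synthetic 2-D blobs, not on the project's embeddings, and the `inertia_by_k` helper is a hypothetical name.

```python
import numpy as np
from sklearn.cluster import KMeans


def inertia_by_k(embeddings: np.ndarray, ks: list[int], seed: int = 0) -> dict[int, float]:
    """Fit K-means for each k and record its inertia."""
    results = {}
    for k in ks:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        model.fit(embeddings)
        results[k] = float(model.inertia_)
    return results


# Synthetic stand-in for embeddings: three well-separated 2-D blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

inertias = inertia_by_k(points, ks=[1, 2, 3, 4, 5])
```

On data like this the drop in inertia flattens abruptly after K=3. With conversational text the curve tends to bend gradually instead, which is consistent with the unclear elbow described above.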

OpenAI models were chosen with minimizing cost as the top priority, and for the best accuracy available within that constraint.

Development

Installing and adding Python dependencies

Once Poetry is installed (instructions here), install this project's requirements with

poetry install

New packages can be added with poetry add, e.g.

poetry add requests pendulum

OpenAI credentials

In order to run the OpenAI jobs, this project requires an API key with OpenAI.

⚠️ Warning: OpenAI does charge money to use its API.

Once you have credentials, you can store them in a .env file.

# .env
OPENAI_API_KEY=yourapikey

⚠️ Warning: Don't upload .env files to git, they contain secrets.

Using environment variables to provide secrets ensures sensitive info won't be visible in your code or the launchpad in the UI. This project follows Dagster's best practices for handling secrets through configuration and resources.
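
As a small illustration of reading the secret at runtime rather than hard-coding it, the sketch below looks up `OPENAI_API_KEY` in the process environment. It assumes the variables from `.env` have already been loaded into the environment, and the `get_openai_api_key` helper is a hypothetical name, not part of the project.

```python
import os


def get_openai_api_key(environ=None) -> str:
    """Return the API key from the environment, or fail with a clear message."""
    environ = os.environ if environ is None else environ
    key = environ.get("OPENAI_API_KEY", "")
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; add it to your .env file or shell environment."
        )
    return key
```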

Running Dagster locally

Once the requirements are installed, you can start the Dagster UI web server:

dagster dev

Open http://localhost:3000 with your browser to see the project.

Dagster assets are stored in mormon_queer_analysis/assets.py. The assets are automatically loaded into the Dagster code location as you define them.

Unit testing

Tests are in the mormon_queer_analysis_tests directory and you can run tests using pytest:

pytest mormon_queer_analysis_tests

Joe-Koch/mormon-queer-analysis
