Code and data for our paper: Should we tweet this? Generative response modeling for predicting reception of public health messaging on Twitter
Follow this guide to:
- Interact with our trained response generation models
- Import our COVID-19 and Vaccines public health tweet datasets and run the model evaluation from the paper
Our models are also available on the HuggingFace model hub:
- COVID-19: TheRensselaerIDEA/gpt2-large-covid-tweet-response
- Vaccines: TheRensselaerIDEA/gpt2-large-vaccine-tweet-response
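For quick experimentation, the published checkpoints can also be loaded directly from the hub with the `transformers` library. The snippet below is a minimal sketch and not part of this repository; the prompt text and decoding settings are placeholders, and the exact prompt format used in the paper's pipeline is the one implemented by `response_prediction/predict_distributions.py` (described below).

```python
# Minimal sketch: load a released checkpoint from the HuggingFace hub and sample one response.
# The prompt text and decoding settings here are illustrative placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "TheRensselaerIDEA/gpt2-large-covid-tweet-response"  # or the vaccine model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Wear a mask to protect yourself and others."  # placeholder public health message
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_p=0.95,
    max_new_tokens=60,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```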
We recommend creating an environment with Python 3.7 or greater.
If PyTorch is not already installed in your environment, install the appropriate configuration of PyTorch for your environment (OS, CUDA version) before proceeding; see https://pytorch.org/get-started/locally/.
To install the Python dependencies, run:
pip install -r requirements.txt
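As a quick sanity check after installation (assuming PyTorch and `transformers` end up among the installed packages), you can verify that the key libraries import and whether a GPU is visible:

```python
# Quick post-install sanity check; assumes torch and transformers were installed.
import torch
import transformers

print("PyTorch:", torch.__version__)
print("Transformers:", transformers.__version__)
print("CUDA available:", torch.cuda.is_available())
```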
Depending on the model you wish to use (COVID-19 or Vaccines), specify the appropriate config:
# COVID-19
python response_prediction/predict_distributions.py -c covid19_config.json
# ...OR...
# Vaccines
python response_prediction/predict_distributions.py -c vaccines_config.json
Open analysis/predict_responses.Rmd in a knitr-enabled R environment. We recommend using RStudio.
Knit or run the notebook to generate responses. Specifically, the notebook includes the list parameters `prompt_authors`, `prompt_messages`, and `response_sample_size`. It generates a sample of N=`response_sample_size` responses for each message in `prompt_messages` for each author account in `prompt_authors`. Generated responses are assigned sentiment scores, and each sample is output with response text and sentiment statistics.
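For readers more comfortable in Python, the sketch below approximates the same sampling loop outside of R; it is not the notebook itself. The prompt format (how the author handle and message are combined) and the VADER sentiment scorer are assumptions made purely for illustration.

```python
# Illustrative Python approximation of the predict_responses.Rmd workflow:
# sample N responses per (author, message) pair and score their sentiment.
# The prompt format and the VADER scorer are assumptions, not the paper's exact setup.
from transformers import AutoModelForCausalLM, AutoTokenizer
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

prompt_authors = ["@CDCgov"]                            # analogous to prompt_authors
prompt_messages = ["Vaccines are safe and effective."]  # analogous to prompt_messages
response_sample_size = 5                                # analogous to response_sample_size

model_name = "TheRensselaerIDEA/gpt2-large-vaccine-tweet-response"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
sentiment = SentimentIntensityAnalyzer()

for author in prompt_authors:
    for message in prompt_messages:
        prompt = f"{author}: {message}"  # placeholder prompt format
        inputs = tokenizer(prompt, return_tensors="pt")
        outputs = model.generate(
            **inputs,
            do_sample=True,
            top_p=0.95,
            num_return_sequences=response_sample_size,
            max_new_tokens=60,
            pad_token_id=tokenizer.eos_token_id,
        )
        for seq in outputs:
            # Decode only the generated continuation, then score its sentiment.
            text = tokenizer.decode(seq[inputs["input_ids"].shape[1]:], skip_special_tokens=True)
            print(sentiment.polarity_scores(text)["compound"], text)
```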
Elasticsearch version 7.x is required to import the datasets and run the evaluation. Elasticsearch 8.x may also work, but we have not tested it.
Importing the datasets requires downloading tweets by ID from the Twitter API. A Twitter developer account is required for this. If you don't have one already, you can apply at https://developer.twitter.com/en/apply-for-access.
We re-use the tweet collection pipeline from our previous paper. The code required to import the datasets is included here, and the import instructions from the accompanying repository are compatible. Specifically, follow these sections when importing each dataset:
- Section 2.2.1 and Section 2.2.2 to import the tweets.
- Section 2.3 to pre-process the tweets (compute embeddings and sentiment scores).
Importantly, use the following modifications to the original instructions:
- In sections 2.2.1 and 2.3, use:
  - `"elasticsearch_index_name": "covid19-pubhealth-responses"` for the COVID-19 dataset
  - `"elasticsearch_index_name": "vaccine-pubhealth-responses"` for the Vaccines dataset
- In section 2.2.2, use:
  - `--datasetglob=./../tweet_ids/covid19/*/*.txt` for the COVID-19 dataset
  - `--datasetglob=./../tweet_ids/vaccine/*/*.txt` for the Vaccines dataset
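Once an import finishes with the index names above, a quick way to confirm the index was populated is to count its documents with the official Elasticsearch Python client. The host URL below assumes a default local installation.

```python
# Count documents in each dataset index; assumes a local Elasticsearch node on the default port.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
for index in ("covid19-pubhealth-responses", "vaccine-pubhealth-responses"):
    if es.indices.exists(index=index):
        print(index, es.count(index=index)["count"], "documents")
    else:
        print(index, "not found")
```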
Open analysis/model_eval.Rmd in a knitr-enabled R environment. We recommend using RStudio.
Knit or run the notebook to sample responses and evaluate the baselines and model as described in the paper. Specifically:
- For the COVID-19 dataset, set:
elasticsearch_index <- "covid19-pubhealth-responses"
rangestart <- "2020-03-01 00:00:00"
rangeend <- "2020-10-01 00:00:00"
- For the Vaccines dataset, set:
elasticsearch_index <- "vaccine-pubhealth-responses"
rangestart <- "2021-10-01 00:00:00"
rangeend <- "2022-02-01 00:00:00"
If you use our data, models, or code in your work, please cite:
@article{sanders2022should,
title={Should we tweet this? Generative response modeling for predicting reception of public health messaging on Twitter},
author={Sanders, Abraham and Ray-Majumder, Debjani and Erickson, John S and Bennett, Kristin P},
journal={arXiv preprint arXiv:2204.04353},
year={2022}
}