Pem is a set of pre-trained machine-learning models that predict (im-)politeness scores in texts. The models are trained on annotated microblogs and packaged into a simple Python class for easy use. Currently, two language-microblog pairs are supported:
- English Twitter
- Mandarin Chinese Weibo
The description of the data and training process is part of our paper published at CSCW 2020. Preprint: https://arxiv.org/pdf/2008.02449.pdf
Looking for source code of analyses, and/or PoliteLex itself? See src/
.
The annotated corpora is available upon request.
- Install requirements:
pip install -r requirements.txt
- If you wish to use LIWC (highly recommended for increased accuracy),
- Put your English LIWC
.dic
file to the same directory asLIWC2015_Dictionary.dic
. - Convert LIWC dictionary to long form
liwc15.csv
by runningpython prepare_liwc.py
. - Rename
liwc15.csv
toenglish_liwc15.csv
and repeat this for other languages, if needed.
- Put your English LIWC
- Prepare also the EmoLex lexicon by
python prepare_emolex.py
. Bothenglish_emolex.csv
andchinese_emolex.csv
should be generated.
Notice that, due to license concerns, we are unable to provide LIWC in this repository.
-
Put microblogs in a CSV file as a column
text
. I have included two toy examples,tweets.csv
for English Twitter andweibo.csv
for Mandarin Weibo. If Chinese, please pre-segment/pre-tokenize posts by whitespace. -
In your code:
from pem import Pem # `Pem` is short for "Politeness Estimator for Microblogs". pem = Pem( liwc_path ='english_liwc15.csv', # or '' if LIWC is unavailable emolex_path ='english_emolex.csv', estimator_path ='english_twitter_politeness_estimator.joblib', # or 'english_twitter_politeness_estimator_noLiwc.joblib' if LIWC is unavailable feature_defn_path ='english_twitter_additional_features.pickle') pem.load('tweets.csv') pem.tokenize() pem.vectorize() print(pem.predict())
or, for Mandarin Weibo posts:
from pem import Pem # `Pem` is short for "Politeness Estimator for Microblogs". pem = Pem( liwc_path ='chinese_liwc15.csv', # or '' if LIWC is unavailable emolex_path ='chinese_emolex.csv', estimator_path ='chinese_weibo_politeness_estimator.joblib', # or 'chinese_weibo_politeness_estimator_noLiwc.joblib' if LIWC is unavailable feature_defn_path ='chinese_weibo_additional_features.pickle') pem.load('weibo.csv') pem.tokenize() pem.vectorize() print(pem.predict())
We are actively working on understanding sources of bias in classifiers and currently, estimates between -0.5 and 0.5 are treated as neutral. Would love your feedback on how to make this classifer better. Reach out at myli at alumni dot upenn dot edu or sharathg at cis dot upenn dot edu.
@article{li2020cscw,
title={Studying Politeness across Cultures Using English Twitter and Mandarin Weibo},
author={Li, Mingyang and Hickman, Louis and Tay, Louis and Ungar, Lyle and Guntuku, Sharath Chandra},
journal={Proceedings of the ACM on Human-Computer Interaction},
number={CSCW},
year={2020}
}
APA
Li, M., Hickman, L., Tay, L., Ungar, L., & Guntuku, S. C. (2020). Studying Politeness across Cultures Using English Twitter and Mandarin Weibo. Proceedings of the ACM on Human-Computer Interaction (CSCW)