This repository contains the dataset and code for our paper.
We compare AI-generated and human-written inspiring Reddit content across India and the UK. Although the differences may not be visible to the human eye, linguistic methods reveal significant syntactic and lexical cross-cultural differences between generated and real inspiring posts.
The final data is available at all_data.
All annotations are available in `all_annotations`.
Topic modeling features can be accessed interactively in topic_analysis.
All generation code is available at LLM_generation.
Random Forest, Naive Bayes, and SVM models are available at baselines.
XLM-RoBERTa is available at roberta.
Llama-2-7b model with LoRA fine-tuning is available at llama.
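
As a rough illustration of what the classical baselines (Random Forest, Naive Bayes, SVM) involve, here is a minimal sketch using scikit-learn. It is not the code in baselines: the TF-IDF features, the hyperparameters, the label names `generated` and `real`, the helper names `build_baselines` and `fit_baselines`, and the toy posts are all assumptions made for illustration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Placeholder posts and labels, invented for illustration only.
# They are NOT taken from the dataset in this repository.
texts = [
    "Never give up on your dreams, every step counts.",
    "My neighbour fixed my bike for free today and it made my week.",
    "Believe in yourself and the world will follow.",
    "A stranger paid for my chai when I forgot my wallet.",
]
labels = ["generated", "real", "generated", "real"]


def build_baselines():
    """Return the three classical classifiers (hypothetical settings)."""
    return {
        "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
        "naive_bayes": MultinomialNB(),
        "svm": LinearSVC(),
    }


def fit_baselines(train_texts, train_labels):
    """Fit one TF-IDF + classifier pipeline per baseline model."""
    fitted = {}
    for name, classifier in build_baselines().items():
        pipeline = make_pipeline(TfidfVectorizer(), classifier)
        pipeline.fit(train_texts, train_labels)
        fitted[name] = pipeline
    return fitted


models = fit_baselines(texts, labels)

# Predict the label of one unseen post with each fitted baseline.
predictions = {
    name: model.predict(["Someone helped me carry my groceries home."])[0]
    for name, model in models.items()
}
```

In practice the placeholder lists would be replaced by posts and labels loaded from all_data, with a held-out split for evaluation.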