This repository contains code and generated data for the paper "The Parrot Dilemma: Human-Labeled vs. LLM-augmented Data in Classification Tasks" by Anders Giovanni Møller, Jacob Aarup Dalsgaard, Arianna Pera, and Luca Maria Aiello. Accepted at EACL 2024.
The project contains code and functionality to perform the following experiments:
- Zero-shot classification using LLMs.
- Data augmentation using LLMs.
- Datasize experiment using progressively larger sample sizes in training.
- Traditional LM training (additional experiment, not included in the paper).
- Few-shot learning with contrastive pre-training using the SetFit framework (additional experiment, not included in the paper).
We use the OpenAI and Hugging Face APIs to interact with LLMs, specifically GPT-4 and Llama-2 70B Chat.
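For reference, a minimal sketch of the kind of zero-shot classification call this involves, written with the current openai Python client (the prompt wording, label set, and temperature are illustrative assumptions, not the prompts used in the paper):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",
    temperature=0,
    messages=[
        # Illustrative zero-shot prompt; the paper's actual prompts differ per dataset.
        {"role": "system", "content": "Classify the text into exactly one of: positive, negative, neutral."},
        {"role": "user", "content": "I love this movie!"},
    ],
)
print(response.choices[0].message.content)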
- The latest version of Poetry installed.
- An OpenAI API key. Make sure to put it in a .env file (a sketch is shown after this list).
- A Weights & Biases account for performance reporting.
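A minimal .env sketch (the exact variable names are an assumption; check how the code reads them):

# .env (illustrative; variable names are assumptions)
OPENAI_API_KEY=sk-...
HUGGINGFACEHUB_API_TOKEN=hf_...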
You can install the environment using:
# Create the environment
$ poetry shell
# Update dependencies
$ poetry update
# Install project
$ poetry install
- Log in to your W&B account:
$ wandb login
- Enable tracking of experiments:
$ wandb enabled
- Disable tracking of experiments (for debugging):
$ wandb disabled
The configuration file src/worker_vs_gpt/conf/config_prompt_augmentation.yaml contains the variables to change:
model: gpt-4 # can be gpt-3.5-turbo or gpt-4
dataset: ten-dim # can be hate-speech, sentiment, ten-dim
sampling: balanced # can be proportional or balanced
Next, execute the script: python -m src.worker_vs_gpt.prompt_augmentation
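If the configuration is loaded with Hydra (the conf/ layout suggests this, but it is an assumption about the repository), the same values can also be overridden on the command line instead of editing the YAML:

# Assumes Hydra-style command-line overrides
$ python -m src.worker_vs_gpt.prompt_augmentation model=gpt-4 dataset=ten-dim sampling=balanced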
The configuration file src/worker_vs_gpt/conf/config_prompt_classification.yaml contains the variables to change:
model: llama-2-70b # can be gpt-4 or llama-2-70b
dataset: sentiment # can be hate-speech, sentiment, ten-dim, ...
wandb_project: W&B_project_name
wandb_entity: W&B_account_name
Next, execute the script: python -m src.worker_vs_gpt.prompt_classification
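For reference, a minimal sketch of the kind of Hugging Face API call involved for the llama-2-70b option, using huggingface_hub's InferenceClient (the model id, prompt format, and generation parameters are assumptions; this is not the repository's prompting code):

from huggingface_hub import InferenceClient

# Assumes the hosted chat model id; adjust to whatever endpoint you use.
client = InferenceClient(model="meta-llama/Llama-2-70b-chat-hf", token="hf_...")

# Llama-2 chat-style prompt; the paper's actual prompts differ per dataset.
prompt = "[INST] Classify the sentiment of this text as positive, negative, or neutral: 'I love this movie!' [/INST]"
print(client.text_generation(prompt, max_new_tokens=20))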
The configuration file src/worker_vs_gpt/conf/config_datasize.yaml contains the variables to change:
ckpt: intfloat/e5-base # The model you want to use from the Hugging Face model hub
dataset: ten-dim # can be 'hate-speech', 'sentiment', 'ten-dim', ...
use_augmented_data: True # Whether or not to use augmented data
sampling: balanced # can be balanced or proportional
augmentation_model: llama-2-70b # can be gpt-4 or llama-2-70b
wandb_project: W&B_project_name
wandb_entity: W&B_account
batch_size: 32 # batch size
lr: 2e-5 # learning rate
num_epochs: 10 # number of epochs
weight_decay: 0 # weight decay
Next, execute the script: python -m src.worker_vs_gpt.datasize_experiment
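For reference, a minimal sketch of how these hyperparameters map onto a Hugging Face Trainer fine-tuning run (the training stack, number of labels, and the tiny dataset are illustrative assumptions; the repository's own training code may differ):

from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

ckpt = "intfloat/e5-base"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSequenceClassification.from_pretrained(ckpt, num_labels=2)

# Tiny illustrative dataset; in the experiments this is the crowdsourced
# and/or augmented training data.
raw = Dataset.from_dict({"text": ["great post", "awful post"], "label": [1, 0]})
ds = raw.map(lambda batch: tokenizer(batch["text"], truncation=True), batched=True)

args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=32,  # batch_size
    learning_rate=2e-5,              # lr
    num_train_epochs=10,             # num_epochs
    weight_decay=0.0,                # weight_decay
    report_to="wandb",               # metrics go to the configured W&B project
)

Trainer(model=model, args=args, train_dataset=ds, tokenizer=tokenizer).train()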
The configuration file src/worker_vs_gpt/conf/config_trainer.yaml contains the variables to change:
ckpt: intfloat/e5-base # The model you want to use from the Hugging Face model hub
dataset: ten-dim # can be 'hate-speech', 'sentiment', 'ten-dim'
use_augmented_data: True # Whether or not to use augmented data
sampling: proportional # can be proportional or balanced
augmentation_model: llama-2-70b # can be gpt-4 or llama-2-70b
experiment_type: both # can be crowdsourced (only crowdsourced), aug (only augmented data), both (crowdsourced and augmented data concatenated)
wandb_project: W&B_project_name
wandb_entity: W&B_account
batch_size: 32 # batch size
lr: 2e-5 # learning rate
num_epochs: 10 # number of epochs
weight_decay: 0 # weight decay
Next, execute the script: python -m src.worker_vs_gpt.__main__
Few-shot with contrastive pre-training using SetFit (additional experiment, not included in the paper)
The configuration file src/worker_vs_gpt/conf/setfit.yaml contains the variables to change:
ckpt: intfloat/e5-base # The model you want to use from the Hugging Face model hub
text_selection: h_text # this is for the social-dim dataset; don't change
experiment_type: aug # can be 'crowdsourced', 'aug', 'both'
sampling: balanced # can be proportional or balanced
augmentation_model: gpt-3.5-turbo # can be gpt-3.5-turbo or gpt-4
dataset: hate-speech # can be 'hate-speech', 'sentiment', 'ten-dim'
batch_size: 8 # Batch size
lr_body: 1e-5 # Learning rate for the contrastive pre-training of the model body.
lr_head: 1e-5 # Learning rate for the classification head
num_iterations: 20 # Parameter to construct pairs in pre-training. The SetFit paper uses 20.
num_epochs_body: 1 # How many epochs of contrastive pre-training to do. 1 is used in the paper.
num_epochs_head: 20 # How many epochs to train the head for. The SetFit tutorial uses 50; it is unclear how many were used in the paper.
weight_decay: 0 # weight decay
wandb_project: W&B_project_name
wandb_entity: W&B_account
Next, execute the script: python -m src.worker_vs_gpt.setfit_classfication
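For reference, a minimal sketch of how these values map onto the SetFit API (based on the quickstart-style SetFitTrainer of the time; newer setfit versions expose a different Trainer, and the tiny dataset is illustrative, so treat this as an assumption rather than the repository's script):

from datasets import Dataset
from sentence_transformers.losses import CosineSimilarityLoss
from setfit import SetFitModel, SetFitTrainer

model = SetFitModel.from_pretrained("intfloat/e5-base")  # ckpt

# Tiny illustrative few-shot dataset with the default "text"/"label" columns.
train_ds = Dataset.from_dict({"text": ["great post", "awful post"], "label": [1, 0]})

trainer = SetFitTrainer(
    model=model,
    train_dataset=train_ds,
    loss_class=CosineSimilarityLoss,
    batch_size=8,       # batch_size
    num_iterations=20,  # num_iterations (contrastive pair construction)
    num_epochs=1,       # num_epochs_body (contrastive pre-training of the body)
)
trainer.train()  # head training details (lr_head, num_epochs_head) omitted here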
Distributed under the terms of the MIT license, Worker vs. GPT is free and open source software.
This project was generated from @cjolowicz's Hypermodern Python Cookiecutter template.