Authors: Noah Lee*, Na Min An*, James Thorne
- We jointly evaluate generative LLMs on NLI performance and on their alignment with human disagreement.
- We propose two probability distribution estimation techniques that allow LLMs to represent disagreement, and empirically evaluate the estimated distributions against the human disagreement distribution (one way to build such an estimate is sketched after this list).
- LLMs do not excel as expected on NLI tasks and fail to align with human disagreement levels.
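As a rough illustration of what estimating a label distribution from an LLM can look like, the sketch below samples a model repeatedly on a single NLI item and normalises the label counts. This is only a minimal sketch, not the repository's implementation; `query_model`, the prompt text, and the sample count are hypothetical placeholders.

```python
# Illustrative sketch (not the repository's exact implementation): one way to
# turn repeated LLM samples into a label distribution for a single NLI item.
import random
from collections import Counter

LABELS = ["entailment", "neutral", "contradiction"]

def query_model(premise: str, hypothesis: str) -> str:
    """Hypothetical stand-in for an actual LLM call; replace with a real request."""
    return random.choice(LABELS)  # placeholder behaviour only

def estimate_label_distribution(premise: str, hypothesis: str, n_samples: int = 30):
    """Sample the model n times and normalise the label counts into a distribution."""
    counts = Counter(query_model(premise, hypothesis) for _ in range(n_samples))
    return {label: counts[label] / n_samples for label in LABELS}

if __name__ == "__main__":
    dist = estimate_label_distribution(
        "A man is playing a guitar.", "A person is making music.")
    print(dist)  # e.g. {'entailment': 0.6, 'neutral': 0.3, 'contradiction': 0.1}
```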
```bash
conda create -n humllm
conda activate humllm
pip install -r requirements.txt
```
The datasets used in this research are as follows:
- All the script examples can be found in `./scripts/`.
Sample 100 random samples and the 100 hardest samples:
```bash
bash ./scripts/sample.sh
```
LLM outputs can be generated with:
```bash
bash ./scripts/generate.sh
```
or directly with:
```bash
# Note: num_iter x num_samples = total sample size
python generate.py --data_dir <input data directory> \
    --data_type <input data type> \
    --model <model name> \
    --file_name <output file name> \
    --out_dir <output directory> \
    --max_length <maximum token length> \
    --gen_type <generation type> \
    --num_iter <iteration number> \
    --num_samples <sample number> \
    --prompt_variations <use prompt variations> \
    --few_shot <few shot number>
```
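For reference, a concrete invocation might look like the following. Every value here is a hypothetical placeholder (the accepted options for `--data_type`, `--model`, and `--gen_type` are defined by the script's argument parser), so adjust them to your setup.

```bash
# Hypothetical example values only; adjust to your data, model, and output paths.
python generate.py --data_dir ./data \
    --data_type snli \
    --model gpt-3.5-turbo \
    --file_name snli_run1 \
    --out_dir ./outputs \
    --max_length 256 \
    --gen_type sampling \
    --num_iter 10 \
    --num_samples 10 \
    --prompt_variations False \
    --few_shot 0
```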
The generated distributions can be evaluated with:
```bash
bash ./scripts/evaluate.sh
```
or directly with:
```bash
python evaluate.py --data_dir <input data directory> \
    --data_type <input data type> \
    --gen_type <generation type>
```
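To make the comparison concrete, the sketch below computes the Jensen-Shannon divergence between a human label distribution and a model's estimated distribution. This is an illustrative assumption, not necessarily the metric that `evaluate.py` reports, and the example distributions are made up.

```python
# Illustrative sketch: comparing a model's label distribution against the
# human annotation distribution with Jensen-Shannon divergence (base 2).
import numpy as np

def kl_divergence(p, q, eps: float = 1e-12) -> float:
    """KL(p || q) with a small epsilon to avoid log(0)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log2(p / q)))

def js_divergence(p, q) -> float:
    """Symmetric Jensen-Shannon divergence between two label distributions."""
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    m = 0.5 * (p + q)
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Made-up example: annotators split 60/30/10, the model is more confident.
human = np.array([0.6, 0.3, 0.1])    # entailment / neutral / contradiction
model = np.array([0.9, 0.05, 0.05])
print(f"JSD = {js_divergence(human, model):.3f}")
```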
Please consider citing our work if you find it helpful for your research.
```bibtex
@misc{lee2023large,
    title={Can Large Language Models Capture Dissenting Human Voices?},
    author={Noah Lee and Na Min An and James Thorne},
    year={2023},
    eprint={2305.13788},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```
- Noah Lee: [email protected]
- Na Min An: [email protected]