Folktexts is a suite of Q&A datasets with natural outcome uncertainty, aimed at evaluating LLMs' calibration on unrealizable tasks.
The folktexts datasets are derived from US Census data products, specifically the 2018 American Community Survey (ACS) Public Use Microdata Sample (PUMS). Individual features are mapped to natural text using the respective codebook. Each task consists of predicting a different individual characteristic (e.g., income, employment) from a set of demographic features (e.g., age, race, education, occupation).
Importantly, every task has natural outcome uncertainty. That is, in general, the features describing each row do not uniquely determine the task's label. To perform well on these tasks, a model must output calibrated, nuanced scores between 0 and 1 instead of simply outputting discrete labels 0 or 1.
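As a minimal sketch of why nuanced scores matter under outcome uncertainty, the example below compares the Brier score (mean squared error between score and label) of a calibrated score against a hard label. The group of 10 individuals and its 70% positive rate are invented for illustration and are not taken from the datasets.

```python
def brier_score(scores, labels):
    """Mean squared error between predicted scores and binary labels."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(labels)


# Hypothetical group: 10 individuals with identical features,
# 7 of whom have a positive label (values invented for illustration).
labels = [1] * 7 + [0] * 3

calibrated = brier_score([0.7] * 10, labels)  # score equals the empirical rate
hard_label = brier_score([1.0] * 10, labels)  # confident discrete prediction

print(f"calibrated score 0.7: {calibrated:.2f}")
print(f"hard label 1:         {hard_label:.2f}")
```

Because the features do not determine the label, no discrete prediction can reach the loss of the calibrated score on this group.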
Namely, we make available the following tasks in natural language Q&A format:
- ACSIncome: Predict whether a working adult earns above $50,000 yearly.
- ACSEmployment: Predict whether an adult is an employed civilian.
- ACSPublicCoverage: Predict individual public health insurance coverage.
- ACSMobility: Predict whether an individual changed address within the last year.
- ACSTravelTime: Predict whether an employed adult has a work commute time longer than 20 minutes.
These tasks follow the same naming and feature/target columns as the folktables tabular datasets proposed by Ding et al. (2021), which have seen widespread use in the algorithmic fairness and distribution shift communities.
The datasets are made available in standard multiple-choice Q&A format (columns question, choices, answer, answer_key, and choice_question_prompt), as well as in numeric Q&A format (columns numeric_question, numeric_question_prompt, and label).
Numeric prompting (also known as verbalized prompting) is known to improve the calibration of zero-shot LLM risk scores [Tian et al., EMNLP 2023; Cruz et al., NeurIPS 2024].
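With numeric prompting, the model's free-text answer has to be converted into a risk score. The following is a minimal sketch of one way to do that; the function name parse_risk_score and its parsing rules are assumptions made for illustration, and the folktexts package may handle model outputs differently.

```python
import re


def parse_risk_score(answer_text):
    """Extract a probability in [0, 1] from a free-text answer, or None.

    Hypothetical helper: takes the first number in the text, treats a
    trailing percent sign as a percentage, and clamps to [0, 1].
    """
    match = re.search(r"(\d*\.\d+|\d+)\s*(%?)", answer_text)
    if match is None:
        return None
    value = float(match.group(1))
    if match.group(2) == "%":
        value /= 100.0
    return min(max(value, 0.0), 1.0)


print(parse_risk_score("0.35"))
print(parse_risk_score("About 80%"))
print(parse_risk_score("I cannot say."))
```

Answers without any number return None so that they can be counted separately instead of being silently scored.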
The accompanying folktexts Python package eases customization, evaluation, and benchmarking with these datasets.
- Language(s) (NLP): English
- License: Code is licensed under the MIT license; Data license is governed by the U.S. Census Bureau terms of service.
- Repository: https://github.com/socialfoundations/folktexts
- Paper: https://arxiv.org/pdf/2407.14614
- Data source: 2018 American Community Survey Public Use Microdata Sample
The datasets were originally used to evaluate LLMs' ability to produce calibrated and accurate risk scores in the Cruz et al. (2024) paper.
Other potential uses include evaluating the fairness of LLMs' decisions, as individual rows feature protected demographic attributes such as sex and race.
Description of dataset columns:
- id: A unique row identifier.
- description: A textual description of an individual's features, following a bulleted-list format.
- instruction: The instruction used for zero-shot LLM prompting (should be prepended to the row description).
- question: A question relating to the task's target column.
- choices: A list of two answer options relating to the above question.
- answer: The correct answer from the above list of answer options.
- answer_key: The correct answer key; i.e., A for the first choice, or B for the second choice.
- choice_question_prompt: The full multiple-choice Q&A text string used for LLM prompting.
- numeric_question: A version of the question that prompts for a numeric output instead of a discrete choice output.
- label: The task's label. This is the correct output to the above numeric question.
- numeric_question_prompt: The full numeric Q&A text string used for LLM prompting.
- <tabular-columns>: All other columns correspond to the tabular features in this task. Each of these features also appears in text form in the description column above.
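To illustrate how the text columns relate to each other, the sketch below assembles a prompt by placing the instruction before the row description and appending the numeric question. The row values and the build_numeric_prompt helper are invented for illustration; the datasets already ship ready-made prompts in the numeric_question_prompt and choice_question_prompt columns, whose exact layout may differ from this sketch.

```python
# Hypothetical row: column names follow the list above, values are invented.
row = {
    "instruction": "The following data describes a survey respondent.",
    "description": "- The age is: 41 years old.\n- The occupation is: Registered nurse.",
    "numeric_question": "What is the probability that this person earns above $50,000 yearly?",
    "label": 1,
}


def build_numeric_prompt(row):
    """Join instruction, description, and question, separated by blank lines."""
    parts = [row["instruction"], row["description"], row["numeric_question"]]
    return "\n\n".join(parts)


prompt = build_numeric_prompt(row)
print(prompt)
```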
The dataset was randomly split into training, test, and validation sets, following an 80%/10%/10% split.
Only the test split should be used to evaluate zero-shot LLM performance.
The training split can be used for fine-tuning, or for fitting traditional supervised ML models on the tabular columns to obtain baseline metrics.
The validation split should be used for hyperparameter tuning, feature engineering, or any other model improvement loop.
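For reference, a random 80%/10%/10% partition can be sketched as below. This is only an illustration of the split proportions: the seed, the function name split_indices, and the use of Python's random module are assumptions, not the procedure used to create the published splits, which are already fixed and should be used as provided.

```python
import random


def split_indices(n_rows, seed=42):
    """Randomly partition row indices into 80% train, 10% test, 10% validation."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = n_rows * 8 // 10
    n_test = n_rows // 10
    train = indices[:n_train]
    test = indices[n_train:n_train + n_test]
    validation = indices[n_train + n_test:]  # remainder goes to validation
    return train, test, validation


train, test, validation = split_indices(1000)
print(len(train), len(test), len(validation))
```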
The datasets are based on publicly available data from the American Community Survey (ACS) Public Use Microdata Sample (PUMS), namely the 2018 ACS 1-year PUMS files.
The categorical values were mapped to meaningful natural language representations using the folktexts package, which in turn uses the official ACS PUMS codebook.
The data download and processing was aided by the folktables Python package, which in turn uses the official US Census web API.
U.S. Census Bureau.
If you find this useful in your research, please consider citing the following paper:
@inproceedings{
cruz2024evaluating,
title={Evaluating language models as risk scores},
author={Andr{\'e} F Cruz and Moritz Hardt and Celestine Mendler-D{\"u}nner},
booktitle={The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
year={2024},
url={https://openreview.net/forum?id=qrZxL3Bto9}
}
More information is available in the folktexts package repository and the accompanying paper.