# Dataset Card for folktexts

Folktexts is a suite of Q&A datasets with natural outcome uncertainty, aimed at evaluating LLMs' calibration on unrealizable tasks.

The folktexts datasets are derived from US Census data products. Specifically, the datasets made available here are derived from the 2018 Public Use Microdata Sample (PUMS), with individual features mapped to natural text using the respective codebook. Each task consists of predicting an individual characteristic (e.g., income, employment) from a set of demographic features (e.g., age, race, education, occupation).

Importantly, every task has natural outcome uncertainty: in general, the features describing each row do not uniquely determine the task's label. To perform well on these tasks, a model must output nuanced, calibrated scores between 0 and 1, rather than simply the discrete labels 0 or 1.
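
To make this concrete, the following is a minimal sketch of one common calibration metric, the expected calibration error (ECE); the function and example values below are illustrative and are not part of the dataset or the folktexts package.

```python
import numpy as np

def expected_calibration_error(scores, labels, n_bins=10):
    """Binned ECE: per-bin gap between the empirical positive rate and the
    mean predicted score, weighted by the fraction of samples in each bin."""
    scores, labels = np.asarray(scores, float), np.asarray(labels, float)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = (scores >= lo) & (scores < hi) if hi < 1.0 else (scores >= lo)
        if in_bin.any():
            ece += in_bin.mean() * abs(labels[in_bin].mean() - scores[in_bin].mean())
    return ece

# Ten individuals that a well-calibrated model scores at 0.7; due to
# outcome uncertainty, only ~70% of them are actually positive.
labels = np.array([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])
soft_scores = np.full(10, 0.7)
hard_scores = np.ones(10)  # a model that only outputs discrete labels
print(expected_calibration_error(soft_scores, labels))  # 0.0 (calibrated)
print(expected_calibration_error(hard_scores, labels))  # 0.3 (miscalibrated)
```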

Namely, we make available the following tasks in natural language Q&A format:

  • ACSIncome: Predict whether a working adult earns above $50,000 yearly.
  • ACSEmployment: Predict whether an adult is an employed civilian.
  • ACSPublicCoverage: Predict individual public health insurance coverage.
  • ACSMobility: Predict whether an individual changed address within the last year.
  • ACSTravelTime: Predict whether an employed adult has a work commute time longer than 20 minutes.

These tasks follow the same naming and feature/target columns as the folktables tabular datasets proposed by Ding et al. (2021), which have seen widespread use in the algorithmic fairness and distribution shift communities. Folktexts makes natural-language Q&A versions of these tasks available.
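
For reference, the tabular counterparts can be fetched with the folktables package; a minimal sketch, assuming folktables is installed and the Census API is reachable (the choice of state is arbitrary):

```python
from folktables import ACSDataSource, ACSIncome

# Fetch 2018 1-year ACS PUMS person records for a single state.
data_source = ACSDataSource(survey_year="2018", horizon="1-Year", survey="person")
acs_data = data_source.get_data(states=["CA"], download=True)

# Same feature/target columns as the folktexts ACSIncome Q&A task.
features, labels, group = ACSIncome.df_to_numpy(acs_data)
```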

The datasets are made available in a standard multiple-choice Q&A format (columns `question`, `choices`, `answer`, `answer_key`, and `choice_question_prompt`), as well as in a numeric Q&A format (columns `numeric_question`, `numeric_question_prompt`, and `label`). Numeric prompting (also known as verbalized prompting) is known to improve the calibration of zero-shot LLM risk scores [Tian et al., EMNLP 2023; Cruz et al., NeurIPS 2024].
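
Each row exposes both formats side by side. A minimal loading sketch, assuming the datasets are hosted on the Hugging Face Hub (the hub id and config name below are assumptions; check the folktexts repository for the exact location):

```python
from datasets import load_dataset

# Hub id and config name are assumptions, not confirmed identifiers.
ds = load_dataset("acruz/folktexts", name="ACSIncome", split="test")

row = ds[0]
print(row["choice_question_prompt"])    # full multiple-choice prompt
print(row["numeric_question_prompt"])   # full numeric (verbalized) prompt
print(row["answer_key"], row["label"])  # answer key and binary label
```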

The accompanying folktexts Python package eases customization, evaluation, and benchmarking with these datasets.
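
As an illustration of what such an evaluation involves, the sketch below turns LLM outputs under each prompting format into a risk score; it is a toy example, not the folktexts implementation.

```python
import math
import re

def score_from_choice(key_logprobs):
    """Multiple-choice format: renormalize the probabilities the LLM assigns
    to the answer keys. Assumes choice 'A' maps to the positive outcome."""
    p_a, p_b = math.exp(key_logprobs["A"]), math.exp(key_logprobs["B"])
    return p_a / (p_a + p_b)

def score_from_numeric(llm_answer):
    """Numeric (verbalized) format: parse a probability from free-text output."""
    match = re.search(r"\d+(?:\.\d+)?", llm_answer)
    if match is None:
        return None
    value = float(match.group())
    return value / 100.0 if value > 1.0 else value

print(score_from_choice({"A": -0.4, "B": -1.2}))      # ~0.69
print(score_from_numeric("The probability is 75%."))  # 0.75
```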

Table of contents:

  • Dataset Details
  • Uses
  • Dataset Structure
  • Dataset Creation
  • Citation
  • More Information
  • Dataset Card Authors

## Dataset Details

### Dataset Description

  • Language(s) (NLP): English
  • License: Code is licensed under the MIT license; Data license is governed by the U.S. Census Bureau terms of service.

### Dataset Sources

  • Repository: https://github.com/socialfoundations/folktexts
  • Paper: Evaluating language models as risk scores (https://openreview.net/forum?id=qrZxL3Bto9)

## Uses

The datasets were originally used in Cruz et al. (2024) to evaluate LLMs' ability to produce calibrated and accurate risk scores.

Other potential uses include evaluating the fairness of LLMs' decisions, as individual rows feature protected demographic attributes such as sex and race.

## Dataset Structure

Description of dataset columns:

  • `id`: A unique row identifier.
  • `description`: A textual description of an individual's features, following a bulleted-list format.
  • `instruction`: The instruction used for zero-shot LLM prompting (should be prepended to the row's description).
  • `question`: A question relating to the task's target column.
  • `choices`: A list of two answer options for the above question.
  • `answer`: The correct answer from the above list of answer options.
  • `answer_key`: The correct answer key; i.e., A for the first choice, or B for the second.
  • `choice_question_prompt`: The full multiple-choice Q&A text string used for LLM prompting.
  • `numeric_question`: A version of the question that prompts for a numeric output instead of a discrete choice.
  • `label`: The task's label, i.e., the correct output to the above numeric question.
  • `numeric_question_prompt`: The full numeric Q&A text string used for LLM prompting.
  • `<tabular-columns>`: All other columns correspond to the task's tabular features. Each of these features also appears in text form in the `description` column above.

The dataset was randomly split into training, test, and validation sets, following an 80%/10%/10% split. Only the test split should be used to evaluate zero-shot LLM performance. The training split can be used for fine-tuning, or for fitting traditional supervised ML models on the tabular columns as metric baselines. The validation split should be used for hyperparameter tuning, feature engineering, or any other model-improvement loop.
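
For instance, a traditional supervised baseline on the tabular columns might look like the following sketch; the hub id is an assumption (as in the loading sketch above), and the feature columns listed are an illustrative subset of the ACSIncome tabular columns.

```python
from datasets import load_dataset
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

# Hub id and config name are assumptions, not confirmed identifiers.
train = load_dataset("acruz/folktexts", name="ACSIncome", split="train").to_pandas()
test = load_dataset("acruz/folktexts", name="ACSIncome", split="test").to_pandas()

# Illustrative subset of the ACSIncome tabular columns
# (age, educational attainment, usual weekly work hours).
feature_cols = ["AGEP", "SCHL", "WKHP"]

model = GradientBoostingClassifier()
model.fit(train[feature_cols], train["label"])

test_scores = model.predict_proba(test[feature_cols])[:, 1]
print("test AUC:", roc_auc_score(test["label"], test_scores))
```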

## Dataset Creation

### Source Data

The datasets are based on publicly available data from the American Community Survey (ACS) Public Use Microdata Sample (PUMS), namely the 2018 ACS 1-year PUMS files.

#### Data Collection and Processing

Categorical values were mapped to meaningful natural-language representations using the folktexts package, which in turn relies on the official ACS PUMS codebook. Data download and processing were aided by the folktables Python package, which uses the official US Census web API.
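
To illustrate the kind of mapping involved, here is a toy sketch; the codebook fragments below are abridged and illustrative, while the real mappings come from the official ACS PUMS codebook.

```python
# Toy, abridged codebook fragments: ACS PUMS stores categorical
# features as integer codes, which folktexts renders as natural text.
SEX_MAP = {1: "Male", 2: "Female"}
COW_MAP = {  # class of worker (abridged)
    1: "Employee of a private for-profit company",
    3: "Local government employee",
}

row = {"AGEP": 42, "SEX": 2, "COW": 3}
description = "\n".join([
    f"- Age: {row['AGEP']} years old",
    f"- Sex: {SEX_MAP[row['SEX']]}",
    f"- Class of worker: {COW_MAP[row['COW']]}",
])
print(description)
```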

#### Who are the source data producers?

U.S. Census Bureau.

## Citation

If you find this useful in your research, please consider citing the following paper:

```bibtex
@inproceedings{cruz2024evaluating,
  title={Evaluating language models as risk scores},
  author={Andr{\'e} F Cruz and Moritz Hardt and Celestine Mendler-D{\"u}nner},
  booktitle={The Thirty-eighth Conference on Neural Information Processing Systems Datasets and Benchmarks Track},
  year={2024},
  url={https://openreview.net/forum?id=qrZxL3Bto9}
}
```

## More Information

More information is available in the folktexts package repository and the accompanying paper.

## Dataset Card Authors

André F. Cruz