Data and code for our paper "MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs".
For more details, please refer to the project page: https://mathhay.github.io/.
[Webpage] [Paper] [Huggingface Dataset] [Leaderboard] [Twitter]
- [2024.11.14] Our code is now available.
- [2024.10.07] Our paper is available at https://arxiv.org/abs/2410.04698.
Overview of the framework for the automatic construction of the MATHHAY Benchmark.
| Benchmark | Multi-Doc | Multi-Step | Avoids Contamination | Irrelevant Docs | Realistic Docs | Automatic Construction | Mathematical Reasoning |
|---|---|---|---|---|---|---|---|
| ZeroSCROLLS | ✓ | ✓ | ✗ | ✓ | ✓ | ✗ | ✗ |
| L-Eval (Math) | ✓ | ✓ | ✓ | ✗ | ✗ | ✓ | ✓ |
| LongBench | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ |
| BAMBOO | ✓ | ✗ | ✓ | ✓ | ✓ | ✗ | ✗ |
| InfiniteBench (Math) | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ | ✓ |
| Loong | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ |
| NIAH | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✗ |
| RULER | ✓ | ✓ | ✗ | ✓ | ✓ | ✓ | ✗ |
| FlenQA | ✗ | ✓ | ✗ | ✓ | ✓ | ✗ | ✗ |
| SummHay | ✓ | ✗ | ✓ | ✓ | ✓ | ✗ | ✗ |
| BABILong | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ |
| NeedleBench | ✓ | ✓ | ✗ | ✓ | ✓ | ✓ | ✗ |
| MathHay (Ours) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Accuracy scores on MathHay V1:
Performance of selected models on MATHHAY (32K to 128K tokens). The best-performing model is highlighted in bold.
Examples of the single-step, single-document tasks.
Run the following commands to install dependencies:
pip install openai
pip install pydantic
pip install tavily-python
pip install spacy
pip install pandas
pip install langchain
pip install langchain-core
pip install nltk
pip install tiktoken
pip install google
pip install boto3
python -m spacy download en_core_web_sm
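Optionally, since MathHay haystacks span 32K to 128K tokens, you may want to check the token count of an input before running evaluation. The snippet below is a minimal sketch (not part of the repository's pipeline) using the tiktoken dependency installed above; the path `haystack.txt` is a placeholder.

```python
import tiktoken

# Count tokens with the cl100k_base encoding (used by GPT-4-class models).
enc = tiktoken.get_encoding("cl100k_base")

with open("haystack.txt") as f:  # placeholder path, not a file shipped with this repo
    haystack = f.read()

n_tokens = len(enc.encode(haystack))
print(f"Haystack length: {n_tokens} tokens (MathHay inputs range from 32K to 128K tokens).")
```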
Set up environment variables:
export TAVILY_API_KEY=""
export OPENAI_API_KEY=""
export PYTHONPATH="."
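Both the generation and evaluation scripts rely on these keys. A quick optional check (not part of the repository) that they are visible to Python:

```python
import os

# Fail early if a required API key is missing from the environment.
for key in ("TAVILY_API_KEY", "OPENAI_API_KEY"):
    if not os.environ.get(key):
        raise RuntimeError(f"{key} is not set; export it before running the scripts.")
print("Required API keys are set.")
```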
To generate MathHay data, use:
sh scripts/bench_generation.sh March-2024-to-September-2024 2 2 2
where the input arguments are the time period, the number of topics, the number of subtopics, and the number of queries.
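For intuition, the sketch below illustrates the general idea behind automatic construction: retrieve recent, realistic documents with Tavily and ask an LLM to write a numerically answerable question over them. This is an illustrative assumption, not the repository's actual pipeline (which is driven by `scripts/bench_generation.sh`); the topic string and prompt wording are hypothetical.

```python
import os
from tavily import TavilyClient
from openai import OpenAI

tavily = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])
llm = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical topic; the benchmark scripts choose topics and subtopics automatically.
topic = "NBA player statistics, March 2024 to September 2024"
docs = tavily.search(topic, search_depth="advanced", max_results=3)["results"]
context = "\n\n".join(d["content"] for d in docs)

prompt = (
    "Using only the documents below, write one question that requires a "
    "numerical, multi-step calculation to answer, then give the answer.\n\n"
    + context
)
resp = llm.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)
```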
Run the evaluation command:
sh scripts/evaluation.sh March-2024-to-September-2024 sssd gpt-4o 32000 middle full
where the input arguments are the time period, the task type, the model to be evaluated, the input length, the placement, and the dataset choice.
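Scoring mathematical answers typically reduces to comparing the model's final number against the gold answer. The helper below is a hedged sketch of such a numeric matcher with a relative tolerance, not the repository's scoring code.

```python
import re

def extract_number(text: str):
    """Pull the last number from a model response (commas and $ stripped)."""
    matches = re.findall(r"-?\d[\d,]*\.?\d*", text.replace("$", ""))
    return float(matches[-1].replace(",", "")) if matches else None

def is_correct(prediction: str, gold: float, rel_tol: float = 1e-2) -> bool:
    """Relative-tolerance match between the predicted and gold numeric answers."""
    pred = extract_number(prediction)
    if pred is None:
        return False
    return abs(pred - gold) <= rel_tol * max(1.0, abs(gold))

print(is_correct("The total revenue is $1,234.5 million.", 1234.5))  # True
```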
Users need to make their own assessment regarding any obligations or responsibilities under the corresponding licenses or terms and conditions pertaining to the original datasets and data. This repository is being released for research purposes only.
If you use our data or method, please cite our paper:
@article{wang2024mathhay,
title={MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs},
author={Wang, Lei and Dong, Shan and Xu, Yuhui and Dong, Hanze and Wang, Yalu and Saha, Amrita and Lim, Ee-Peng and Xiong, Caiming and Sahoo, Doyen},
journal={arXiv preprint arXiv:2410.04698},
year={2024}
}