Data and code for our paper "MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs".
For more details, please refer to the project page: https://mathhay.github.io/.
[Webpage] [Paper] [Huggingface Dataset] [Leaderboard] [Twitter]
- [2024.11.14] Our code is now available.
- [2024.10.07] Our paper is available at https://arxiv.org/abs/2410.04698.
Overview of the framework for the automatic construction of the MATHHAY Benchmark.
| Benchmark | Multi-Doc | Multi-Step | Avoids Contamination | Irrelevant Docs | Realistic Docs | Automatic Construction | Mathematical Reasoning |
|---|---|---|---|---|---|---|---|
| ZeroSCROLLS | ✓ | ✓ | ✗ | ✓ | ✓ | ✗ | ✗ |
| L-Eval (Math) | ✓ | ✓ | ✓ | ✗ | ✗ | ✓ | ✓ |
| LongBench | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ |
| BAMBOO | ✓ | ✗ | ✓ | ✓ | ✓ | ✗ | ✗ |
| InfiniteBench (Math) | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ | ✓ |
| Loong | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ |
| NIAH | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✗ |
| RULER | ✓ | ✓ | ✗ | ✓ | ✓ | ✓ | ✗ |
| FlenQA | ✗ | ✓ | ✗ | ✓ | ✓ | ✗ | ✗ |
| SummHay | ✓ | ✗ | ✓ | ✓ | ✓ | ✗ | ✗ |
| BABILong | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ |
| NeedleBench | ✓ | ✓ | ✗ | ✓ | ✓ | ✓ | ✗ |
| MathHay (Ours) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
Accuracy scores on MathHay V1:
Performance of selected models on MATHHAY (32K to 128K tokens). The best-performing model is highlighted in bold.
Examples of the single-step, single-document tasks.
Run the following commands to install dependencies:
pip install openai
pip install pydantic
pip install tavily-python
pip install spacy
pip install pandas
pip install langchain
pip install langchain-core
pip install nltk
pip install tiktoken
pip install google
pip install boto3
python -m spacy download en_core_web_sm
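Optionally, since MathHay haystacks span 32K to 128K tokens, you may want to check the token count of an input before running evaluation. The snippet below is a minimal sketch (not part of the repository's pipeline) using the tiktoken dependency installed above; the path `haystack.txt` is a placeholder.

```python
import tiktoken

# Count tokens with the cl100k_base encoding (used by GPT-4-class models).
enc = tiktoken.get_encoding("cl100k_base")

with open("haystack.txt") as f:  # placeholder path, not a file shipped with this repo
    haystack = f.read()

n_tokens = len(enc.encode(haystack))
print(f"Haystack length: {n_tokens} tokens (MathHay inputs range from 32K to 128K tokens).")
```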
Set up environment variables:
export TAVILY_API_KEY=""
export OPENAI_API_KEY=""
export PYTHONPATH="."
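Both the generation and evaluation scripts rely on these keys. A quick optional check (not part of the repository) that they are visible to Python:

```python
import os

# Fail early if a required API key is missing from the environment.
for key in ("TAVILY_API_KEY", "OPENAI_API_KEY"):
    if not os.environ.get(key):
        raise RuntimeError(f"{key} is not set; export it before running the scripts.")
print("Required API keys are set.")
```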
To generate MathHay data, use:
sh scripts/bench_generation.sh March-2024-to-September-2024 2 2 2
where the input arguments are the time period, the number of topics, the number of subtopics, and the number of queries.
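For intuition, the sketch below illustrates the general idea behind automatic construction: retrieve recent, realistic documents with Tavily and ask an LLM to write a numerically answerable question over them. This is an illustrative assumption, not the repository's actual pipeline (which is driven by `scripts/bench_generation.sh`); the topic string and prompt wording are hypothetical.

```python
import os
from tavily import TavilyClient
from openai import OpenAI

tavily = TavilyClient(api_key=os.environ["TAVILY_API_KEY"])
llm = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical topic; the benchmark scripts choose topics and subtopics automatically.
topic = "NBA player statistics, March 2024 to September 2024"
docs = tavily.search(topic, search_depth="advanced", max_results=3)["results"]
context = "\n\n".join(d["content"] for d in docs)

prompt = (
    "Using only the documents below, write one question that requires a "
    "numerical, multi-step calculation to answer, then give the answer.\n\n"
    + context
)
resp = llm.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
print(resp.choices[0].message.content)
```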
Run the evaluation command:
sh scripts/evaluation.sh March-2024-to-September-2024 sssd gpt-4o 32000 middle full
where the input arguments are the time period, the task type, the model to be evaluated, the input length, the placement, and the dataset choice.
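Scoring mathematical answers typically reduces to comparing the model's final number against the gold answer. The helper below is a hedged sketch of such a numeric matcher with a relative tolerance, not the repository's scoring code.

```python
import re

def extract_number(text: str):
    """Pull the last number from a model response (commas and $ stripped)."""
    matches = re.findall(r"-?\d[\d,]*\.?\d*", text.replace("$", ""))
    return float(matches[-1].replace(",", "")) if matches else None

def is_correct(prediction: str, gold: float, rel_tol: float = 1e-2) -> bool:
    """Relative-tolerance match between the predicted and gold numeric answers."""
    pred = extract_number(prediction)
    if pred is None:
        return False
    return abs(pred - gold) <= rel_tol * max(1.0, abs(gold))

print(is_correct("The total revenue is $1,234.5 million.", 1234.5))  # True
```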
Users need to make their own assessment regarding any obligations or responsibilities under the corresponding licenses or terms and conditions pertaining to the original datasets and data. This repository is being released for research purposes only.
If you use our data or method, please cite our paper:
@article{wang2024mathhay,
title={MathHay: An Automated Benchmark for Long-Context Mathematical Reasoning in LLMs},
author={Wang, Lei and Dong, Shan and Xu, Yuhui and Dong, Hanze and Wang, Yalu and Saha, Amrita and Lim, Ee-Peng and Xiong, Caiming and Sahoo, Doyen},
journal={arXiv preprint arXiv:2410.04698},
year={2024}
}