Calibrating LLMs with Preference Optimization on Thought Trees for Generating Rationale in Science Question Scoring
This repository houses the implementation of the paper titled "Calibrating LLMs with Preference Optimization on Thought Trees for Generating Rationale in Science Question Scoring," which has been accepted for presentation at EMNLP 2024 Findings.
Generating rationales that justify scoring decisions has been a promising way to facilitate explainability in automated scoring systems. However, existing methods do not match the accuracy of classifier-based methods. Plus, the generated rationales often contain hallucinated information. To address these issues, we propose a novel framework capable of generating more faithful rationales and, more importantly, matching performance with classifier-based black-box scoring systems. We first mimic the human assessment process by querying Large Language Models (LLMs) to generate a thought tree. We then summarise intermediate assessment decisions from each thought tree path for creating synthetic rationale data and rationale preference data. Finally, we utilise the generated synthetic data to calibrate LLMs through a two-step training process: supervised fine-tuning and preference optimization. Extensive experimental results demonstrate that our framework achieves a 38% assessment performance improvement in the QWK score compared to prior work while producing higher-quality rationales, as recognised by human evaluators and LLMs. Our work sheds light on the effectiveness of performing preference optimization using synthetic preference data obtained from thought tree paths.
We are excited to make the datasets and models from all stages of our pipeline publicly available. Explore our collections and models via the following links:
- Stage 1: MCT Data
- Stage 2: Synthetic Rationale Data
- Stage 3: Rationale to Score Model
- Stage 3: Llama-3-8B SFT Model
- Stage 3: Llama-3-8B DPO Model
- Stage 3: Mixtral-8x7B-Instruct-v0.1 SFT Model
- Stage 3: Mixtral-8x7B-Instruct-v0.1 DPO Model
To set up the environment, run `conda env create -f environment.yml`.
Edit configs/tot_query.yaml (a minimal client sketch follows this list):
- If you use the Azure OpenAI API service: add your API info here.
- If you use the OpenAI API service: add your API key here.
- If you use the Mistral API service: add your API key here.
- If you use a local vLLM API server: change your configuration here.
- If you use custom models: you will likely need to update the model list here.
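For orientation, here is a minimal sketch (not the repository's actual code) of how such settings are typically consumed through an OpenAI-compatible client. The field names `api_key`, `base_url`, `azure_endpoint`, `api_version`, and `model` are illustrative assumptions; the real keys are whatever `configs/tot_query.yaml` defines.

```python
import yaml
from openai import OpenAI, AzureOpenAI

# Field names below are hypothetical; use the keys actually defined in configs/tot_query.yaml.
with open("configs/tot_query.yaml") as f:
    cfg = yaml.safe_load(f)

if cfg.get("azure_endpoint"):
    # Azure OpenAI needs an endpoint and an API version in addition to the key.
    client = AzureOpenAI(
        api_key=cfg["api_key"],
        azure_endpoint=cfg["azure_endpoint"],
        api_version=cfg["api_version"],
    )
else:
    # The official OpenAI API, or any OpenAI-compatible endpoint such as a local
    # vLLM server, selected by pointing base_url at the server.
    client = OpenAI(api_key=cfg["api_key"], base_url=cfg.get("base_url"))

response = client.chat.completions.create(
    model=cfg["model"],
    messages=[{"role": "user", "content": "Assess this student answer..."}],
)
print(response.choices[0].message.content)
```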
Then run `python query.py` to generate the thought tree data (Stage 1).
Edit configs/generation.yaml, then run `python generate.py` to create the synthetic rationale data (Stage 2).
We utilize OpenAI’s batch API to generate synthetic data efficiently.
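As a rough illustration of that workflow, a batch job submitted with the OpenAI Python SDK looks like the sketch below. The file name `requests.jsonl` and the request contents are hypothetical, and the repository's `generate.py` handles this step for you.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Each line of the JSONL file is one chat-completion request, e.g.
# {"custom_id": "item-1", "method": "POST", "url": "/v1/chat/completions",
#  "body": {"model": "gpt-4o", "messages": [{"role": "user", "content": "..."}]}}
batch_file = client.files.create(file=open("requests.jsonl", "rb"), purpose="batch")

# Submit the batch; results are written to an output file once the job
# finishes within the completion window.
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
print(batch.id, batch.status)

# Later, poll the job and download the results file.
batch = client.batches.retrieve(batch.id)
if batch.status == "completed":
    results = client.files.content(batch.output_file_id)
    print(results.text)
```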
We used LLaMA-Factory (thanks!) to train our models. Please refer to our example training scripts/configs: [train sft model] [train dpo model].
If you find our method useful, please cite our paper as follows:
@misc{li2024calibratingllmspreferenceoptimization,
      title={Calibrating LLMs with Preference Optimization on Thought Trees for Generating Rationale in Science Question Scoring},
      author={Jiazheng Li and Hainiu Xu and Zhaoyue Sun and Yuxiang Zhou and David West and Cesare Aloisi and Yulan He},
      year={2024},
      eprint={2406.19949},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2406.19949},
}