Overview and official implementation of the TV (Trajectory Volatility) score for OOD detection in mathematical reasoning.
Details are given in our paper.
Abbreviations: In-Distribution -> ID; Out-of-Distribution -> OOD
- Input Space: Low distinction between different domains
- Output Space: compressed high-density search space -> pattern collapse
- Constraints on trajectory endpoints in mathematical reasoning allow for a greater likelihood of variation in trajectory volatility under different samples.
A trajectory-based algorithm to detect OOD samples in mathematical reasoning scenarios.
Algorithm Pipeline:
We denote
- Step 1: Mahalanobis Distance Mapping
- Step 2: Average of Absolute Value Difference
TV score w/o Differential Smoothing when
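The two steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the code in Computation/ID_OOD_score.py. It assumes each sample provides one embedding per layer, that one Gaussian per layer is fitted on the ID embeddings, and that the score is the mean absolute difference between Mahalanobis distances of adjacent layers. The function names (`fit_layer_gaussians`, `mahalanobis_trajectory`, `tv_score`) and the `eps` regularizer are hypothetical, and Differential Smoothing is not covered.

```python
import numpy as np


def fit_layer_gaussians(id_hidden, eps=1e-6):
    """Fit one Gaussian per layer to ID embeddings (assumed setup).

    id_hidden: array of shape (num_samples, num_layers, dim).
    Returns per-layer means (num_layers, dim) and
    precision matrices (num_layers, dim, dim).
    """
    id_hidden = np.asarray(id_hidden, dtype=np.float64)
    _, num_layers, dim = id_hidden.shape
    means = id_hidden.mean(axis=0)
    precisions = np.empty((num_layers, dim, dim))
    for layer in range(num_layers):
        cov = np.cov(id_hidden[:, layer, :], rowvar=False)
        # Small diagonal term keeps the covariance invertible (assumption).
        cov = np.atleast_2d(cov) + eps * np.eye(dim)
        precisions[layer] = np.linalg.inv(cov)
    return means, precisions


def mahalanobis_trajectory(hidden, means, precisions):
    """Step 1: map each layer embedding to its Mahalanobis distance.

    hidden: array of shape (num_layers, dim) for a single sample.
    Returns an array of shape (num_layers,).
    """
    diff = np.asarray(hidden, dtype=np.float64) - means
    squared = np.einsum("ld,lde,le->l", diff, precisions, diff)
    return np.sqrt(np.clip(squared, 0.0, None))


def tv_score(distances):
    """Step 2: average absolute difference between adjacent layers."""
    distances = np.asarray(distances, dtype=np.float64)
    return float(np.mean(np.abs(np.diff(distances))))


# Toy usage with random data standing in for real hidden states.
rng = np.random.default_rng(0)
id_hidden = rng.normal(size=(200, 6, 4))   # 200 ID samples, 6 layers, dim 4
means, precisions = fit_layer_gaussians(id_hidden)
sample = rng.normal(size=(6, 4))           # one sample's layer trajectory
print(tv_score(mahalanobis_trajectory(sample, means, precisions)))
```

For a 7B model the hidden dimension is large, so a real implementation would likely need a more careful covariance estimate than the plain inverse used here.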
First, fine-tune the base model on the ID dataset (MultiArith).
cd your/project/root/folder/path/
bash Scripts/finetune.sh
Details of Scripts/finetune.sh are as below:
#!/bin/bash
export PROJECT_PATH="your/project/root/folder/path/"
export MODEL_PATH="your/model/repository/root/folder/path/"
export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" # gpu id
model_name="llama2-7b" # SFT model name
python FineTune/ID_finetune.py --model_name $model_name
After fine-tuning, checkpoints will be stored in os.environ['PROJECT_PATH']/Checkpoints/$model_name
Next, run inference on all ID/OOD datasets using the fine-tuned checkpoint.
cd your/project/root/folder/path/
bash Scripts/inference.sh
Details of Scripts/inference.sh are as below:
#!/bin/bash
export PROJECT_PATH="your/project/root/folder/path/"
export MODEL_PATH="your/model/repository/root/folder/path/"
export CUDA_VISIBLE_DEVICES="0,1" # gpu id
model_name="llama2-7b" # SFT model name
max_output_token_num="16"
ckpt_step="10000" # checkpoint step as you selected
dataset_list=(MultiArith GSM8K SVAMP AddSub SingleEq SingleOp)
category="X"
for i in "${dataset_list[@]}"; do
    python Inference/ID_OOD_inference.py --model_name "$model_name" \
        --dataset "$i" \
        --category "$category" \
        --max_output_token_num "$max_output_token_num" \
        --ckpt_step "$ckpt_step"
done
dataset="MATH"
category_list=(algebra geometry counting_and_probability number_theory precalculus)
for i in "${category_list[@]}"; do
    python Inference/ID_OOD_inference.py --model_name "$model_name" \
        --dataset "$dataset" \
        --category "$i" \
        --max_output_token_num "$max_output_token_num" \
        --ckpt_step "$ckpt_step"
done
After inference, all inference results will be stored in os.environ['PROJECT_PATH']/Data/Inference_Data/$model_name. Each sample corresponds to one dictionary:
{
"id": i,
"hidden_state": hidden_states,
"output_scores": output_scores,
"output_seq": output_seq
}
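As a rough illustration of this record layout, the sketch below builds one such dictionary and checks that the expected keys are present. Only the four key names come from the structure above. The placeholder values, their shapes, and the helper `check_record` are assumptions; the actual tensors and the on-disk format are determined by Inference/ID_OOD_inference.py.

```python
# Key names taken from the record structure shown above.
REQUIRED_KEYS = ("id", "hidden_state", "output_scores", "output_seq")


def check_record(record):
    """Return the expected keys that are missing from one inference record."""
    return [key for key in REQUIRED_KEYS if key not in record]


# Hypothetical record with placeholder values (real shapes are not specified here).
example = {
    "id": 0,
    "hidden_state": [[0.1, 0.2], [0.3, 0.4]],   # assumed: one vector per layer
    "output_scores": [[-1.2, -0.3]],            # assumed: per-token scores
    "output_seq": [42, 7],                      # assumed: generated token ids
}

missing = check_record(example)
print("missing keys:", missing)
```

A check like this can catch truncated or partially written records before the score computation step.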
Finally, compute TV scores for each of the ID/OOD datasets.
cd your/project/root/folder/path/
bash Scripts/computation.sh
Details of Scripts/computation.sh are as below:
#!/bin/bash
export PROJECT_PATH="your/project/root/folder/path/"
model_name="llama2-7b"
max_output_token_num="16"
max_order="5" # Differential Smoothing Order
dataset_list=(MultiArith GSM8K SVAMP AddSub SingleEq SingleOp)
category="X"
for i in "${dataset_list[@]}"; do
    python Computation/ID_OOD_score.py --model_name "$model_name" \
        --dataset "$i" \
        --category "$category" \
        --max_output_token_num "$max_output_token_num" \
        --max_order "$max_order"
done
dataset="MATH"
category_list=(algebra geometry counting_and_probability number_theory precalculus)
for i in "${category_list[@]}"; do
    python Computation/ID_OOD_score.py --model_name "$model_name" \
        --dataset "$dataset" \
        --category "$i" \
        --max_output_token_num "$max_output_token_num" \
        --max_order "$max_order"
done
After computation, all scores will be stored in os.environ['PROJECT_PATH']/Data/Score_Data/$model_name.
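Once per-sample scores are available, one common way to judge how well they separate ID from OOD data is AUROC. The sketch below is a generic, dependency-free version and is not part of this repository. It assumes the scores have already been loaded into plain Python lists and that a higher score indicates OOD. If the opposite convention applies, swap the two arguments. The function name `auroc` is hypothetical.

```python
def auroc(id_scores, ood_scores):
    """Probability that a random OOD score exceeds a random ID score.

    Ties count as 0.5. Assumes higher score means "more OOD".
    """
    if not id_scores or not ood_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for ood in ood_scores:
        for ind in id_scores:
            if ood > ind:
                wins += 1.0
            elif ood == ind:
                wins += 0.5
    return wins / (len(id_scores) * len(ood_scores))


# Toy usage with made-up numbers, not real results.
id_scores = [0.1, 0.2, 0.3]
ood_scores = [0.25, 0.5, 0.7]
print(auroc(id_scores, ood_scores))
```

The pairwise loop is quadratic in the number of samples, which is fine for a quick check; for large score files a rank-based implementation would scale better.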
If you find the methods and conclusions in our paper helpful, please consider citing our work. Thanks!
@article{wang2024trajectory,
title={Trajectory Volatility for Out-of-Distribution Detection in Mathematical Reasoning},
author={Wang, Yiming and Zhang, Pei and Yang, Baosong and Wong, Derek F and Zhang, Zhuosheng and Wang, Rui},
journal={arXiv preprint arXiv:2405.14039},
year={2024}
}