Skip to content

Latest commit

 

History

History
226 lines (153 loc) · 7.4 KB

README.md

File metadata and controls

226 lines (153 loc) · 7.4 KB

Trajectory Volatility for Out-of-Distribution Detection in Mathematical Reasoning

The overview and official implementation of TV score used in OOD Detection in Mathematical Reasoning.

Details are shown in our paper.

Abbreviations: In-Distribution -> ID; Out-of-Distribuion -> OOD

Overview

Why Trajectory as the measure?

a. Disadvantages of input/output embedding space:

  • Input Space: Low distinction between different domains

  • Output Space: compressed high-density search space -> pattern collapse

b. Advantages of input->output embedding shift trajectory:

  • Constraints on trajectory endpoints in mathematical reasoning allow for a greater likelihood of variation in trajectory volatility under different samples.

What is TV score?

A trajectory-based algorithm to detect OOD samples in mathematical reasoning scenarios.

Algorithm Pipeline:

We denote $\boldsymbol{y_l}$ as the embedding of $l$-th layer, $\mathcal{G}_l = \mathcal{N}(\boldsymbol{\mu}_l, \boldsymbol{\Sigma}_l)$ as ID Gaussian distribution of $l$-th layer

  • Step 1: Mahalanobis Distance Mapping
  • Step 2: Average of Absolute Value Difference

TV score w/o Differential Smoothing when $k = 0$, w/ Differential Smoothing when $k>0$.

Usage Instruction (Batch Computation)

Step 1: ID Fine-tuning

First, fine-tuning the base model with ID dataset (MultiArith).

cd your/project/root/folder/path/
bash Scripts/finetune.sh

Details of Scripts/finetune.sh are as below:

#!/bin/bash

export PROJECT_PATH="your/project/root/folder/path/"
export MODEL_PATH="your/model/repository/root/folder/path/"
export CUDA_VISIBLE_DEVICES="0,1,2,3,4,5,6,7"   # gpu id

model_name="llama2-7b"  # SFT model name

python FineTune/ID_finetune.py --model_name $model_name

After fine-tuning, checkpoints will be stored in os.environ['PROJECT_PATH']/Checkpoints/$model_name

Step 2: ID/OOD Inference

Next, conduct inference for all ID/OOD datasets using the checkpoint just fine-tuned.

cd your/project/root/folder/path/
bash Scripts/inference.sh

Details of Scripts/inference.sh are as below:

#!/bin/bash

export PROJECT_PATH="your/project/root/folder/path/"
export MODEL_PATH="your/model/repository/root/folder/path/"
export CUDA_VISIBLE_DEVICES="0,1"  # gpu id

model_name="llama2-7b"  # SFT model name
max_output_token_num="16"
ckpt_step="10000"   # checkpoint step as you selected

dataset_list=(MultiArith GSM8K SVAMP AddSub SingleEq SingleOp)
category="X"
for i in ${dataset_list[*]}; do
    python Inference/ID_OOD_inference.py --model_name $model_name \
                                         --dataset "$i" \
                                         --category $category \
                                         --max_output_token_num $max_output_token_num \
                                         --ckpt_step $ckpt_step
done

dataset="MATH"
category_list=(algebra geometry counting_and_probability number_theory precalculus)
for i in ${category_list[*]}; do
    python Inference/ID_OOD_inference.py --model_name $model_name \
                                         --dataset $dataset \
                                         --category "$i" \
                                         --max_output_token_num $max_output_token_num \
                                         --ckpt_step $ckpt_step
done

After inference, all inference results will be stored in os.environ['PROJECT_PATH']/Data/Inference_Data/$model_name. Each sample corresponds to one dictionary.

{  
  "id": i,
  "hidden_state": hidden_states,
  "output_scores": output_scores,
  "output_seq": output_seq
}

Step 3: ID/OOD TV Score Computation

Finally, computer TV scores for each dataset in all ID/OOD datasets.

cd your/project/root/folder/path/
bash Scripts/computation.sh

Details of Scripts/computation.sh are as below:

#!/bin/bash

export PROJECT_PATH="your/project/root/folder/path/"

model_name="llama2-7b"
max_output_token_num="16"
max_order="5"   # Differential Smoothing Order

dataset_list=(MultiArith GSM8K SVAMP AddSub SingleEq SingleOp)
category="X"
for i in ${dataset_list[*]}; do
    python Computation/ID_OOD_score.py --model_name $model_name \
                                       --dataset "$i" \
                                       --category $category \
                                       --max_output_token_num $max_output_token_num \
                                       --max_order $max_order
done

dataset="MATH"
category_list=(algebra geometry counting_and_probability number_theory precalculus)
for i in ${category_list[*]}; do
    python Computation/ID_OOD_score.py --model_name $model_name \
                                       --dataset $dataset \
                                       --category "$i" \
                                       --max_output_token_num $max_output_token_num \
                                       --max_order $max_order
done

After computation, all scores will be stored in os.environ['PROJECT_PATH']/Data/Score_Data/$model_name.

Citation

If methods and conclusions in our paper are inspiring, you can support our work by citation. Thanks!

@article{wang2024trajectory,
  title={Trajectory Volatility for Out-of-Distribution Detection in Mathematical Reasoning},
  author={Wang, Yiming and Zhang, Pei and Yang, Baosong and Wong, Derek F and Zhang, Zhuosheng and Wang, Rui},
  journal={arXiv preprint arXiv:2405.14039},
  year={2024}
}