Estimating a model's confidence in its outputs is critical for conversational AI systems built on large language models (LLMs), particularly for reducing hallucinations and preventing over-reliance. In this project, we explore various methods to quantify and leverage model uncertainty, focusing on dialogue state tracking (DST) in task-oriented dialogue systems (TODS). Our approach aims to provide well-calibrated confidence scores, improving the overall performance of the dialogue system.
Our experiments use the MultiWOZ 2.2 corpus, a multi-domain task-oriented dialogue dataset. MultiWOZ consists of written human-human dialogues, with 8K/1K/1K samples for training/validation/testing. We focus on five domains: Restaurant, Hotel, Train, Attraction, and Taxi. Each domain includes turn-level annotations and descriptions of its slot labels.
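For intuition, the dialogue state tracked in DST is a set of domain-slot/value pairs accumulated over the conversation. The snippet below is a small, hypothetical example of such a state; the slot names follow the MultiWOZ "domain-slot" convention, but the values are illustrative rather than taken from the dataset.

```python
# Hypothetical MultiWOZ-style dialogue state after a few user turns.
# Slot names follow the "domain-slot" naming convention used in the annotations.
dialogue_state = {
    "restaurant-food": "italian",
    "restaurant-pricerange": "cheap",
    "restaurant-area": "centre",
}
```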
We evaluate four methods for estimating confidence scores:
- Softmax
- Raw token scores
- Verbalized confidences
- Combination of the above methods
Additionally, we enhance these methods with a self-probing mechanism. We use the Area Under the Curve (AUC) metric to assess calibration, with higher AUC indicating better calibration. Our experiments demonstrate that fine-tuning open-weight LLMs yields better-calibrated confidence scores and improves Joint Goal Accuracy (JGA) by 8.5% compared to closed-source models in zero-shot scenarios.
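As a rough sketch of how calibration can be scored with AUC (not necessarily the exact evaluation code in this repository): label each predicted slot value as correct or incorrect and measure how well the confidence score separates the two groups.

```python
# Minimal sketch: ROC-AUC as a calibration measure for confidence scores.
# The confidence values and correctness labels below are hypothetical.
from sklearn.metrics import roc_auc_score

confidences = [0.92, 0.35, 0.80, 0.10, 0.67]  # per-prediction confidence scores
correct = [1, 0, 1, 0, 1]                     # 1 = predicted slot value matched the ground truth

auc = roc_auc_score(correct, confidences)
print(f"Calibration AUC: {auc:.3f}")  # higher AUC = confidence better separates right from wrong
```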
Our results show that incorporating confidence scores into the fine-tuning process significantly enhances DST performance. The combined confidence score method generates well-calibrated scores that are moderately correlated with ground truth labels, justifying its superior performance.
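The exact combination is produced by the scripts in this repository; as a hedged illustration of the general idea, the individual signals (softmax probability, raw token score, verbalized confidence) can be merged into a single score, for example by a weighted average:

```python
# Illustrative only: a simple weighted average of the three confidence signals.
# The uniform weights are an assumption, not the repository's exact formula.
def combine_confidences(softmax_conf: float,
                        token_conf: float,
                        verbalized_conf: float,
                        weights=(1.0, 1.0, 1.0)) -> float:
    """Return a single confidence score in [0, 1] from the three estimates."""
    score = (weights[0] * softmax_conf
             + weights[1] * token_conf
             + weights[2] * verbalized_conf) / sum(weights)
    return min(max(score, 0.0), 1.0)

print(combine_confidences(0.85, 0.78, 0.90))  # -> approximately 0.84
```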
To install the necessary dependencies for this project, run:
```bash
pip install -r requirements.txt
```
To use the code in this repository, follow these steps:
- Clone the repository:
```bash
git clone https://github.com/yourusername/Confidence_Score_DST.git
cd Confidence_Score_DST
```
- Get `multiwoz-context-db.vec`, which is a FAISS database:
```bash
python create_faiss_db.py --output_faiss_db multiwoz-context-db.vec
```
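For reference, here is a minimal sketch of what building such a FAISS index over dialogue contexts can look like; the embedding model and example contexts below are assumptions, and `create_faiss_db.py` is the authoritative implementation.

```python
# Sketch: embed dialogue contexts and store them in a FAISS index so that
# similar contexts can be retrieved as few-shot examples. Illustrative only.
import faiss
from sentence_transformers import SentenceTransformer

contexts = [
    "user: I need a cheap hotel in the north part of town.",
    "user: Can you book a taxi from the station to the museum?",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")        # assumed embedding model
embeddings = encoder.encode(contexts, convert_to_numpy=True).astype("float32")

index = faiss.IndexFlatL2(embeddings.shape[1])           # exact L2 nearest-neighbour search
index.add(embeddings)
faiss.write_index(index, "multiwoz-context-db.vec")
```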
- Methods: To run the different methods, use the following commands:
- No Self-probing:
```bash
python run.py
```
- Turn-level Self-probing:
```bash
python run_selfprob_turn.py
```
- Slot-level Self-probing:
```bash
python run_selfprob_slot.py
```
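To make the self-probing variants concrete, here is a hedged sketch of the idea behind slot-level self-probing: after a slot value has been predicted, the model is asked a follow-up question about its own confidence in that value. The prompt wording and helper function below are illustrative; see `run_selfprob_slot.py` for what the script actually does.

```python
# Illustrative slot-level self-probing prompt builder. The wording is an
# assumption, not the exact prompt used by this repository's scripts.
def build_self_probe_prompt(dialogue_context: str, slot: str, predicted_value: str) -> str:
    return (
        f"Dialogue context:\n{dialogue_context}\n\n"
        f"The slot '{slot}' was predicted to have the value '{predicted_value}'.\n"
        "How confident are you that this value is correct? "
        "Respond with a number between 0 and 1."
    )

print(build_self_probe_prompt(
    "user: I'm looking for a cheap restaurant in the centre.",
    "restaurant-pricerange",
    "cheap",
))
```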
The scripts accept various arguments to customize the execution. Here are the details:
- `--database` (str): Path to the database file. Default is `"multiwoz_database"`.
- `--faiss` (str): Path to the FAISS index file. Default is `"multiwoz-context-db.vec"`.
- `--ontology` (str): Path to the ontology file. Default is `"ontology.json"`.
- `--context_size` (int): Size of the dialogue context to consider. Default is `2`.
- `--num_examples` (int): Number of examples to use for few-shot learning. Default is `3`.
- `--dials_total` (int): Total number of dialogues to process. Default is `100`.
- `--prompt` (str): Type of prompt to use. Options are `"vanilla"`, `"topk"`, and `"multistep"`. Default is `"vanilla"`.
- `--few_shot` (int): Whether to use few-shot learning. `1` to enable, `0` to disable. Default is `0`.
- `--temperature` (float): Temperature for scaling the logits. Default is `0.7`.
- `--model_name` (str): Name of the model to use. Default is `"meta-llama/Meta-Llama-3-8B-Instruct"`.
- `--split` (str): Dataset split to use. Options are `"train"`, `"validation"`, and `"test"`. Default is `"validation"`.
- `--result` (str): Path to save the results. Default is `"results"`.
- `--plot_result` (str): Path to save plot results. Default is `"plot_results_gpt4"`.
- `--verbalized` (int): Whether to use verbalized confidences. `1` to enable, `0` to disable. Default is `0`.
- `--start_idx` (int): Starting index for processing dialogues. Default is `0`.
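For example, a run that uses several of these options together might look like this (the chosen values are illustrative):

```bash
python run.py \
  --model_name "meta-llama/Meta-Llama-3-8B-Instruct" \
  --prompt topk \
  --few_shot 1 \
  --num_examples 3 \
  --context_size 2 \
  --split test \
  --verbalized 1 \
  --result results
```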