- Metrics: Faithfulness between predicted response and ground-truth knowledge (Tab. 1) -- Critic, Q², BERT F1, F1.
- Datasets: Wizard-of-Wikipedia (WoW), the DSTC9 and DSTC11 extensions of MultiWoZ 2.1, FaithDial -- a de-hallucinated subset of WoW.
- Metrics: Factual consistency of summaries: BERT-Precision and FactKB. MemoTrap and NQ-Swap: Exact Match.
- Datasets: Summarisation: CNN-DM, XSUM. Knowledge Conflicts: MemoTrap, NQ-Swap.
When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories
- Metrics: Exact Match/Accuracy.
- Datasets: QA datasets with long-tail entities: PopQA, EntityQuestions; NQ.
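Exact Match recurs in several entries of this list. The snippet below is a minimal sketch of one common variant, assuming SQuAD-style answer normalisation (lowercasing, removing punctuation and English articles). The helper names `normalize` and `exact_match` are illustrative, and individual papers may apply different matching rules.

```python
import re
import string


def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """True if the normalised prediction equals any normalised gold answer."""
    return normalize(prediction) in {normalize(g) for g in gold_answers}
```

Accuracy over a dataset is then the mean of `exact_match` across examples.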
- Metrics: Generation: Perplexity, Unigram Overlap (F1), BLEU-4, ROUGE-L. Knowledge F1: overlap between the generation and the knowledge on which the human grounded their response during dataset collection. Rare F1: the same F1, computed only over words that are infrequent in the dataset.
- Datasets: WoW, CMU Document Grounded Conversations (CMU_DoG). Knowledge source: KILT Wikipedia dump.
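As a rough illustration of the overlap metrics in this entry, unigram F1 can be sketched as below. Whitespace tokenisation and lowercasing are simplifying assumptions, and `unigram_f1` is an illustrative name rather than the paper's code. Knowledge F1 would correspond to calling it with the grounding knowledge as the reference.

```python
from collections import Counter


def unigram_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of unigram precision and recall (whitespace tokens)."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

A Rare F1 variant would first filter both token lists down to words below some dataset frequency threshold. The threshold itself is not specified here.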
Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback
- Metrics: Expected Calibration Error (ECE) with temperature scaling (ECE-t); accuracy@coverage and coverage@accuracy.
- Datasets: Question Answering datasets assessing factual knowledge: TriviaQA, SciQ, TruthfulQA.
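For readers unfamiliar with Expected Calibration Error, the sketch below shows the usual binned estimate: the weighted average, over confidence bins, of the gap between mean confidence and accuracy. Equal-width bins and `n_bins=10` are assumptions, not settings taken from the paper, and the temperature-scaled variant (ECE-t) is not shown.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sum over bins of (bin weight) * |mean confidence - accuracy|."""
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / n) * abs(avg_conf - accuracy)
    return ece
```

For example, two answers both given with confidence 0.9, of which only one is correct, yield an ECE of about 0.4.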
- Metrics: Percentage of Wrong Answers (Hallucinations) and cases where "the model knows it's wrong" (Snowballed Hallucinations).
- Datasets: Primality Testing, Senator Search, Graph Connectivity.
- Metrics: Faithfulness evaluation for Knowledge-Grounded response generation on FaithDial -- FaithCritic, CoLA (Fluency), Dialog Engagement, Length-penalised TF-IDF Diversity.
- Datasets: Faithful Knowledge-Grounded Dialog: FaithDial, a more faithful subset of WoW.
- Metrics: AUROC, AUARC, Uncertainty and Confidence metrics (NumSet, Deg, EigV).
- Datasets: CoQA (Open-book Conversational QA dataset), TriviaQA and Natural Questions (Closed-book QA).
- Metrics: Metrics measure either the degree of hallucination of generated responses with respect to some given knowledge or their overlap with gold faithful responses: Critic, Q² (F1, NLI), BERTScore, F1, BLEU, ROUGE.
- Datasets: FaithDial, WoW.
- Metrics: FeQA, a faithfulness metric; Critic, a hallucination critic; BLEU.
- Datasets: OpenDialKG, a dataset that provides open-ended dialogue responses grounded on paths from a KG.
- Metrics: Accuracy: QA, Dialogue, Summarisation.
- Datasets: HaluEval, a collection of generated and human-annotated hallucinated samples for evaluating the performance of LLMs in recognising hallucinations.
- Metrics: Precision, recall, and F1 score on detection tasks, measured over generated sentence pairs.
- Datasets: 12 selected topics from Wikipedia.
- Metrics: Coverage: a binary metric that determines whether all the correct gold answer values are included in the generated value. Hallucination: a binary indicator that assesses the presence of generated values that do not exist in the question values and gold grounding values. User Simulator: user simulator as an "oracle" language model with access to attribution information about the target answer.
- Datasets: FuzzyQA, a dataset based on HybridDialogue and MuSiQue where complex questions were simplified using ChatGPT.
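The two binary metrics in this entry can be sketched with set operations. This assumes the values have already been extracted from the text and normalised into comparable strings, a step not shown here. The function and parameter names are illustrative.

```python
def coverage(generated_values, gold_answer_values) -> bool:
    """True if every gold answer value appears among the generated values."""
    return set(gold_answer_values) <= set(generated_values)


def hallucination(generated_values, question_values, gold_grounding_values) -> bool:
    """True if any generated value is absent from both the question values
    and the gold grounding values."""
    allowed = set(question_values) | set(gold_grounding_values)
    return any(value not in allowed for value in generated_values)
```

Under these definitions a response can be fully covering and still hallucinated, if it contains all gold values plus an unsupported one.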
Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback
- Metrics: KF1, BLEU, ROUGE, chrF, METEOR, BERTScore, BARTScore, BLEURT, Avg length.
- Datasets: News Chat: DSTC7 Track 2 was repurposed as an evaluation corpus for news conversation. Customer Service: uses DSTC11 Track 5 as a showcase in a conversational customer service scenario, expanding upon DSTC9 Track 1 by incorporating subjective information.
- Metrics: Sentence-level Hallucination Detection (AUC-PR), and Passage-level Hallucination Detection (Pearson and Spearman's correlation coefficients).
- Datasets: Generated Wikipedia articles from WikiBio, with annotated hallucinations.
- Metrics: Per-topic and average accuracy.
- Datasets: The True-False Dataset contains true and false statements covering several topics -- Cities, Inventions, Chemical Elements, Animals, Companies, and Scientific Facts.
- Metrics: Exact Match.
- Datasets: FEVER, Adversarial HotpotQA.
- Metrics: HaloCheck and SelfCheckGPT scores; consistency, factuality.
- Datasets: Generated and reviewed questions in the NBA domain.
A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation
- Metrics: Precision and Recall when detecting Sentence-level and Concept-level Hallucinations.
- Datasets: ChatGPT-generated paragraphs spanning 150 topics from diverse domains.
- Metrics: Directional Levy/Holt precision and recall with entity insertions and replacements.
- Datasets: Levy/Holt dataset, containing premise-hypothesis pairs with a task formatted as Given [premise P], is it true that [hypothesis H]?, where the model is evaluated with random premises.
- Metrics: Rate at which an MT system produces hallucinations under perturbation (Language Pair fraction, rate).
- Datasets: Flores-101, WMT, TICO.
- Metrics: N/A
- Datasets: N/A
- Metrics: Hallucinatory instruction classification: AUC, ACC, F1, PEA.
- Datasets: Concept-7, which focuses on classifying potential hallucinatory instructions.
- Metrics: Attributable to Identified Sources (AIS) scores before and after editing.
- Datasets: Generated statements by creating task inputs from three datasets and prompting different models to produce long-form outputs which may contain hallucinations -- Factoid statements, Reasoning chains, and Knowledge-intensive dialogues.
Q²: Evaluating Factual Consistency in Knowledge-Grounded Dialogues via Question Generation and Question Answering
- Metrics: Q² is a metric itself, and it is compared with F1 token-level overlap, Precision and Recall, Q² w/o NLI, E2E NLI, Overlap, BERTScore, and BLEU.
- Datasets: WoW, which contains dialogues in which a bot needs to respond to user inputs in a knowledgeable way; Topical-Chat, a human-human knowledge-grounded conversation dataset; Dialogue NLI, a dataset based on the Persona-Chat dialogue task consisting of premise-hypothesis pairs.
- Metrics: EM on All, "Has answer", and "IDK"
- Datasets: MNLI, SQuAD 2.0, ACE-whQA.
- Metrics: Wikidata and Wiki-Category List: test precision, average number of positive and negative (hallucination) entities for list-based questions; MultiSpanQA: F1, Precision, Recall; Longform generation of biographies: FactScore.
- Datasets: Wikidata, Wiki-Category List, MultiSpanQA, Longform Generation of Biographies.
- Metrics: mFACT, a novel multilingual faithful metric developed from four English faithfulness metrics: DAE, QAFactEval, ENFS%, and EntFA.
- Datasets: XL-Sum, a multilingual summarisation dataset.
- Metrics: XEnt: Hallucination (Accuracy, F1), Factuality (Accuracy, F1), ROUGE, % of novel n-gram, Faithfulness (%ENFS, FEQA, DAE), EntFA (% Factual Ent., % Factual Hal.)
- Datasets: A novel dataset, XEnt, for analysing entity hallucination and factuality in abstractive summarisation, consisting of 800 summaries generated by BART and annotated. MEnt, a set of factuality and hallucination annotations for XSum.
- Comments: Tab. 2 outlines several types of hallucinations (e.g., factual, non-factual, intrinsic).
- Metrics: Fluency (MAUVE), Correctness (EM recall for ASQA, recall-5 for QAMPARI, claim recall for ELI5), Citation quality (citation recall, citation precision).
- Datasets: QA datasets such that 1) they contain factual questions in which references are important, 2) questions require long-text answers covering multiple aspects, and 3) answering the questions requires synthesising multiple sources: ASQA, QAMPARI, ELI5.
- Metrics: Acc, G-Mean, BSS, AUC, Not Hallucination (P, R, F1), Hallucination (P, R, F1).
- Datasets: HaDes (HAllucination DEtection dataSet), a novel token-level reference-free annotated hallucination detection dataset obtained by perturbing a large number of text segments extracted from the English Wikipedia and verified with crowd-sourced annotations.
- Comments: Fig. 3 outlines several hallucination types (domain-specific knowledge, commonsense knowledge, incoherence or improper collocation, unrelated to central topic, conflict with preceding context, conflict with succeeding context, ...).
- Datasets: Wiki-FACTOR and News-FACTOR: two novel factuality evaluation benchmarks for LLMs, based on Wikipedia and News articles. Each example consists of a prefix, a factual completion and three similar but non-factual alternatives. An LLM is evaluated by measuring the percentage of examples it assigns the highest probability to the factual completion.
- Comments: The paper introduces a framework for automatically generating such datasets from a given corpus, detailed in Section 3.
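The FACTOR-style accuracy described in this entry can be sketched as follows, assuming a completion-level score (for instance a log-probability) has already been computed for each candidate. How the scores are obtained and normalised is not covered here. The `factor_accuracy` name and the strict tie-breaking rule are assumptions.

```python
def factor_accuracy(examples) -> float:
    """Fraction of examples where the factual completion outscores every
    non-factual alternative.

    `examples` is a sequence of (factual_score, alternative_scores) pairs.
    Ties count as errors (an assumption).
    """
    if not examples:
        raise ValueError("need at least one example")
    hits = sum(
        1
        for factual_score, alternative_scores in examples
        if all(factual_score > alt for alt in alternative_scores)
    )
    return hits / len(examples)
```

With three alternatives per example, a model that scores candidates at random would be expected to reach about 25% under this measure.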
- Mitigating LLM Hallucinations: a multifaceted approach
- Survey of Hallucination in Natural Language Generation
- A Survey of Hallucination in Large Foundation Models
- LLM Powered Autonomous Agents
Survey of Hallucination in Natural Language Generation classifies metrics into statistical (ROUGE, BLEU, PARENT, Knowledge F1, ...) and model-based metrics. The latter are further divided into the following classes:
- Information-Extraction (IE)-based: retrieve an answer from a knowledge source and compare it with the generated answer -- there might be problems due to the error propagation from the IE model.
- QA-based: measure the overlap/consistency between generation and source reference, based on the intuition that similar answers will be generated from the same question if the generation is factually consistent with the source reference. Used to evaluate hallucinations in summarisation, dialogue, and data2text generation. Composed of a question generation model and a question answering model.
- Natural Language Inference (NLI)-based: based on the idea that, in a faithful and hallucination-free generation, the source knowledge reference should entail the entirety of the generated information.
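To make the QA-based class concrete, here is a minimal sketch of the pipeline: generate questions from the output, answer each one against both the output and the source, and average the answer agreement. The three callables (`generate_questions`, `answer`, `similarity`) are hypothetical placeholders for a question generation model, a question answering model, and an answer comparison function. They do not correspond to any specific library.

```python
from typing import Callable, List


def qa_based_consistency(
    generation: str,
    source: str,
    generate_questions: Callable[[str], List[str]],
    answer: Callable[[str, str], str],
    similarity: Callable[[str, str], float],
) -> float:
    """Average agreement between answers grounded on the generation and
    answers grounded on the source, over questions derived from the generation."""
    questions = generate_questions(generation)
    if not questions:
        return 0.0  # nothing to verify; treated as zero here (an assumption)
    scores = []
    for question in questions:
        answer_from_generation = answer(question, generation)
        answer_from_source = answer(question, source)
        scores.append(similarity(answer_from_generation, answer_from_source))
    return sum(scores) / len(scores)
```

In practice `similarity` is often a token-level F1 or an NLI check between the two answers, and the quality of the score depends on both underlying models.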
A Survey of Hallucination in “Large” Foundation Models surveys papers, tagging each for detection, mitigation, tasks, datasets, and evaluation metrics. For hallucinations in text, it groups papers into LLMs, Multilingual LLMs, and Domain-specific LLMs.
Neural Path Hunter defines an extrinsic hallucination as an utterance that introduces a new span of text that does not correspond to a valid triple in a KG, and an intrinsic hallucination as an utterance that misuses either the subject or the object of a KG triple such that there is no direct path between the two entities. Survey of Hallucination in Natural Language Generation defines an extrinsic hallucination as a case where the generated output cannot be verified from the source content, and an intrinsic hallucination as a case where the generated output contradicts the source content.