Extrinsic Hallucinations in LLMs
Hallucination in large language models usually refers to the model generating unfaithful, fabricated, inconsistent, or nonsensical content. As a term, hallucination has been somewhat generalized to cases when the model makes mistakes. Here, I would like to narrow down the problem of hallucination to cases where the model output is fabricated and not grounded by either the provided context or world knowledge. There are two types of hallucination: - In-context hallucination: The model output should be consistent with the source content in context. - Extrinsic hallucination: The model output should be grounded by the pre-training dataset. However, given the size of the pre-training dataset, it is too expensive to retrieve and identify conflicts per generation. If we consider the pre-training data corpus as a proxy for world knowledge, we essentially try to ensure the model output is factual and verifiable by external world knowledge. Equally importantly, when the model does not know about a fact, it should say so. This post focuses on extrinsic hallucination. To avoid hallucination, LLMs need to be (1) factual and (2) acknowledge not knowing the answer when applicable. What Causes Hallucinations? Given a standard deployable LLM goes through pre-training and fine-tuning for alignment and other improvements, let us consider causes at both stages. Pre-training Data Issues The volume of the pre-training data corpus is enormous, as it is supposed to represent world knowledge in all available written forms. Data crawled from the public Internet is the most common choice and thus out-of-date, missing, or incorrect information is expected. As the model may incorrectly memorize this information by simply maximizing the log-likelihood, we would expect the model to make mistakes. Fine-tuning New Knowledge Fine-tuning a pre-trained LLM via supervised fine-tuning and RLHF is a common technique for improving certain capabilities of the model like instruction following. Introducing new knowledge at the fine-tuning stage is hard to avoid. Fine-tuning usually consumes much less compute, making it debatable whether the model can reliably learn new knowledge via small-scale fine-tuning. Gekhman et al. 2024 studied the research question of whether fine-tuning LLMs on new knowledge encourages hallucinations. They found that (1) LLMs learn fine-tuning examples with new knowledge slower than other examples with knowledge consistent with the pre-existing knowledge of the model; (2) Once the examples with new knowledge are eventually learned, they increase the model’s tendency to hallucinate. Given a closed-book QA dataset (i.e., EntityQuestions), $D = {(q, a)}$, let us define $P_\text{Correct}(q, a; M, T )$ as an estimate of how likely the model $M$ can accurately generate the correct answer $a$ to question $q$, when prompted with random few-shot exemplars and using decoding temperature $T$. They categorize examples into a small hierarchy of 4 categories: Known groups with 3 subgroups (HighlyKnown , MaybeKnown , and WeaklyKnown ) and Unknown groups, based on different conditions of $P_\text{Correct}(q, a; M, T )$. Some interesting observations of the experiments, where dev set accuracy is considered a proxy for hallucinations. Unknown examples are fitted substantially slower thanKnown .- The best…

