What actually causes hallucinations in LLMs?

Learn why Large Language Models (LLMs) hallucinate and how calibration, abstention, incentives, tools, and evaluation improve reliability.
13 mins read
Sep 15, 2025

Last week, I asked a well-known LLM a simple, checkable question. The generated reply was crisp and confident — but spectacularly wrong.

I was surprised that a state-of-the-art model could miss so clearly while sounding so sure. That moment stayed with me. If a system cannot reliably tell fact from fiction, what does that mean for how we write, teach, code, and support customers with AI in the loop?

Then I read OpenAI’s new study, “Why Language Models Hallucinate,” and it clicked. I learned why these confident mistakes happen and how to think about them. In this newsletter, we'll explore what exactly hallucinations even are, what causes them, and how to evaluate them more honestly.

What does the problem look like in real use?#

Before we discuss causes or fixes, let's observe the behavior as the OpenAI team showcased it in small, reproducible trials.

For the trials below, we used base-model behavior in a no-web, no-retrieval setting. Apps like ChatGPT/GPT-5 often have tool access (search, citations, structured lookups) and safety policies that change outcomes. Our goal was to isolate the answer vs. abstain decision without tool assistance.

What happens when you tell a model “answer only if you know”?#

The idea is simple: pick a concrete fact and give the model an explicit pass option. Then see whether it uses that option or prefers a guess.

Start with a plain trial. Pick a fact with a single correct value (like a birthday) and specify a fixed format, then tell the model to answer only if it knows. For example: “If you know, reply DD-MM; otherwise say ‘I don’t know.’” It’s a way to see if the model follows directions and produces output in a strict format, with no room for ambiguity.

According to researchers who ran this test, three independent runs produced three cleanly formatted dates: 12-03, 25-10, and 04-07. None is correct, and none is “I don’t know.” The safer option is in the prompt, yet the model still prefers a guess.
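Classifying replies from this trial is mechanical enough to automate. The sketch below is a minimal harness (the `classify` helper is a name introduced here for illustration, and the sample replies mirror the three runs described above):

```python
import re

def classify(reply: str) -> str:
    """Label a reply as a formatted date, an abstention, or something else."""
    text = reply.strip()
    if re.fullmatch(r"\d{2}-\d{2}", text):
        return "date"
    if "don't know" in text.lower() or "do not know" in text.lower():
        return "abstain"
    return "other"

# Three runs from the trial above: all cleanly formatted dates, no abstentions.
runs = ["12-03", "25-10", "04-07"]
print([classify(r) for r in runs])  # ['date', 'date', 'date']
```

A harness like this makes the pattern visible at scale: run the prompt many times and count how often "abstain" ever appears.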

Tighten the trial, and the behavior persists. Rephrase the instruction (“Only answer if you are certain; otherwise say ‘I don’t know.’”), specify the format twice, or request a confidence score after the date. You still tend to get a date rather than an abstention. Swap in another single-answer fact (a keynote year or a grant number) and the pattern holds: fluent form, unstable truth, and a reluctance to say “I don’t know.”

These tests explicitly include a pass option (“Only answer if you’re certain; otherwise say ‘I don’t know.’”) because we’re measuring calibration—whether the model withholds when uncertain. If you omit the pass instruction (as in many casual tests), you’re mostly measuring fluency on easy items, not calibration under uncertainty. Good app behavior without this instruction doesn’t contradict the claim here; it’s a different setting.

Can it fabricate citations that sound convincing?#

Use a plain setup instead of heavy jargon. Ask: “What is the year and publisher of The Art of Tiny Habits by Usama Ahmed?”

(Spoiler alert: that book does not exist; the author is made up.)

Yet the reply may arrive fully detailed: “Penguin Random House, 2018.”

Nudge for more detail by asking, “What is the ISBN?”, and you may get a tidy thirteen-digit number and a city of publication.

None of it maps to a real book.

What is happening is pattern completion, not source retrieval. The model has seen many citations. It knows self-help titles often mention “habits,” major publishers recur, and ISBNs are usually thirteen digits grouped in a certain way. Ask about a specific book and it assembles a citation-shaped output from nearby patterns: a familiar publisher, a plausible year, and numbers that pass an eyeball test. Style is accurate; substance is improvised.

Retry and the tone stays confident while details drift. One answer says 2018 with Penguin Random House; another switches to 2019 with HarperCollins; a third offers a different ISBN. Ask for a summary, and you may get a fluent paragraph about “stacking tiny behaviors.” Press for a link, and you might see a publisher-looking URL that resolves nowhere. The tells are simple once you look at them. Formatting is consistent while core facts vary, and “anchors” like ISBNs or catalog links yield more specificity, not more caution.
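That tell, consistent formatting with varying facts, is easy to check programmatically. A minimal sketch, assuming you have collected several retries as structured fields (the sample values below mirror the drifting replies described above; the book itself does not exist):

```python
# Citation-shaped answers from three retries of the same prompt.
answers = [
    {"publisher": "Penguin Random House", "year": 2018},
    {"publisher": "HarperCollins", "year": 2019},
    {"publisher": "Penguin Random House", "year": 2017},
]

def drifts(samples, field):
    """True if the same prompt yields different values for a core fact."""
    return len({s[field] for s in samples}) > 1

# Fields that vary across retries are likely improvised, not retrieved.
unstable = [f for f in ("publisher", "year") if drifts(answers, f)]
print(unstable)  # ['publisher', 'year']
```

Retrieval from a real source would return the same publisher and year every time; pattern completion rotates through plausible alternatives.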

In a tool-augmented chat app, a retrieval or catalog lookup would usually catch this; our test purposely disables those tools to observe the underlying “pattern completion” reflex.

Will it improve product details and still miss easy, mechanical tasks?#

These two categories of failure have names. Intrinsic hallucinations are slips on reasoning or algorithmic tasks like counting, copying, or following rigid rules. Extrinsic hallucinations are fabrications about the outside world: dates, citations, and product specs that look plausible but contradict reality. Clarifying the distinction helps when considering fixes since the causes and remedies differ.

Ask a consumer question with a single, checkable answer: “What is the warranty on the Nimbus X3 earbuds?” The reply may come back crisp: “Two years, parts and labor.” It sounds plausible, but the real product offers one year, and that phrase does not appear in the policy. You are seeing a composite. The model averages similar product pages, assembling a warranty-shaped answer from familiar pieces. Retry and the details drift while the tone stays authoritative: “18 months,” sometimes “limited lifetime,” sometimes a region-specific clause. The style stays neat, but the substance changes.

Now flip to a task with zero wiggle room: “Count the letter N in ‘BANANAS’. If unsure, say ‘not sure’.” The strict answer is two. You may still see “three,” delivered with confidence. The model is better at continuing patterns than at running tiny algorithms. Even with the pass option visible, it often commits. Across runs, you may get different numbers, but rarely the abstention you requested. Confidence stays high; correctness does not.
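This is exactly the kind of task where handing the job to a tiny algorithm, rather than asking for more prose, removes the failure mode entirely:

```python
def count_letter(text: str, letter: str) -> int:
    """Run the counting algorithm directly instead of predicting likely text."""
    return text.upper().count(letter.upper())

print(count_letter("BANANAS", "N"))  # 2
```

The model predicts what an answer usually looks like; the function executes the rule. That gap is why tool use helps with intrinsic slips.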

Together, these examples show two faces of the same problem. The warranty is an external fabrication that contradicts the world by averaging across similar products. The letter count is an internal slip that contradicts the prompt by failing a mechanical check. The causes differ, but the feel is the same: fluent form, unstable truth, and a bias toward answering rather than abstaining. Keep that in mind as we turn to causes.

What patterns should we notice here?#

Across the examples, the surface is polished while the facts vary. The answers are fluent and specific, yet the details shift. Even with a clear pass option, the model rarely takes it. On retries, you do not get silence or caution; you get a different polished answer. Style is stable; truth is not. These mistakes spread easily because they are believable at a glance and delivered with confidence.

Viewed behaviorally, this is pattern completion presented as knowledge. The model fills in shapes it recognizes (dates, citations, warranties, counts) with plausible pieces drawn from nearby examples. Small randomness in generation yields a rotating set of alternatives. The effect is confidence without evidence. Keep that in mind as we move forward. The next question is not “How do we fix it?” but “Where does this bias toward answering rather than abstaining come from?”

Where does this bias toward answering come from?#

A hallucination is a confident wrong answer from an AI. The model is not trying to deceive us; it does not realize it is off. We built systems that prefer any answer over “I am not sure,” and we grade them as a bullseye or a bust. In that world, guessing sometimes pays off, so models learn to guess.

Imagine a student graded only on right or wrong. “I don’t know” earns zero; an occasional guess lands. Over time, that student learns to bluff. Large language models live under the same rules. When they sound certain and still miss, that is a hallucination: a guess encouraged by the game.

Apps reduce observed hallucinations by changing the game: they add retrieval, citations, and UI affordances that make passing feel normal. The paper’s claim targets the training and scoring incentives themselves. That’s why the experiments are run with tools held constant (i.e., off): the bias is easiest to see there.

Under the hood, the first push comes from pretraining. The objective is to produce the most plausible text continuation, not verify facts. In the data these models ingest, “I don’t know” is rare and unrewarded. Well-formed answers are everywhere. So the reflex the model learns is to complete the shape (dates that look like dates, citations that look like citations) rather than to pause and check.

Then, post-training and evaluation strengthen that habit. Instruction tuning nudges models toward sounding helpful and decisive. Benchmarks and leaderboards score answers as right or wrong. In many setups, an abstention earns the same as a miss: zero. If silence is scored like failure, the rational move is to answer whenever there is any chance of being right. Over time, that pressure builds a habit: polished completions come easily; withholding does not.

Add a little randomness at generation time, and you get the flavor we saw in the examples. The style stays steady while facts drift on retries.

In short, these systems are good at saying something that looks right and comparatively weak at refusing to say something that might be wrong. Keep that frame in mind as we move to evaluation, so we can tell the difference between models that only sound calibrated and those that know when to stay quiet.

How can we tell whether a model knows when not to answer?#

Most demos grade only the final answer, which hides what we care about: does the model withhold when unsure? Judge the decision to answer separately from the content. Treat each candidate (including “I don’t know”) as a simple yes or no against your validity bar. Run the same prompt a few times or across similar items, and watch three signals in prose rather than a scoreboard: how often it answers at all, how accurate those answered cases are, and how often it abstains when it should. In practice, you will see a familiar pattern: high fluency, middling accuracy, and low willingness to abstain.

Keep it concrete. Pick several single-truth questions (order IDs, release years, ISBNs). Tell the model, “Answer only if you are confident; otherwise, say ‘I don’t know.’” Log results as answered-correct, answered-wrong, and abstained. The headline is not raw accuracy but precision on the answered subset and the abstention rate on hard items. Models that only sound calibrated answer a lot and miss too often. Calibrated models answer less; when they do, they are usually right.
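The bookkeeping above takes only a few lines. A minimal sketch, assuming a hypothetical log of outcomes on single-truth questions (the counts here are illustrative, not from the study):

```python
from collections import Counter

# Hypothetical outcomes logged from one model over eight single-truth items.
log = ["answered-correct", "answered-wrong", "abstained",
       "answered-correct", "abstained", "answered-correct",
       "answered-wrong", "abstained"]

counts = Counter(log)
answered = counts["answered-correct"] + counts["answered-wrong"]

coverage = answered / len(log)                     # how often it answers at all
precision = counts["answered-correct"] / answered  # accuracy on answered cases
abstention = counts["abstained"] / len(log)        # how often it passes

print(coverage, precision, abstention)
```

A model that only sounds calibrated shows high coverage with middling precision; a calibrated one trades coverage for precision, with abstentions landing on the hard items.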


Not all facts strain the system equally. Rare or idiosyncratic details invite pattern completion, so the model fills the right shape with plausible pieces. Other errors are mechanical. Counting letters, copying exact strings, or applying fixed rules can vary because the system predicts likely text rather than executing tiny algorithms. Tool use can help, but the behavioral bias remains. When faced with a choice between saying nothing and saying something that looks right, the model tends to talk. Your evaluation should surface that habit and set up scoring that rewards honest abstention.

How should we score answers so honesty wins, and what changes in practice?#

Think of evaluation as a calibration check, not just an accuracy check.

  • Coverage: How often the model chooses to answer.

  • Precision (on answered cases): When it does answer, how often it is right.

  • Abstention: How often it sensibly says, “I don’t know.”

Make abstention a core, graded outcome by adding confidence targets directly to the instructions: “Answer only if you are more than t confident. A mistake costs t/(1−t) points; ‘I don’t know’ gets 0.” Try multiple thresholds, for example, t equals 0.5, 0.75, and 0.9. A behaviorally calibrated model will answer when above the target and refrain when below it. You will see coverage drop and precision rise as you ask it to be more cautious. This separates models that only sound careful from models that actually are.

Think of a quiz show where contestants can answer, pass, or answer with a confidence claim (“I am very sure” versus “I am a bit unsure”). The grading rule is simple and fair:

  • Being right and confident is rewarded.

  • Being wrong and confident counts heavily against you.

  • Being wrong but hesitant still counts against you, but not as much.

  • Passing (“I don’t know”) is acceptable. No reward, no penalty.

The point is to make bluffing costly and honest uncertainty safe. If a model claims it is very sure and it is wrong, it should “feel” that mistake more than if it had admitted doubt or passed. Over time, this nudges the system toward a healthy habit: speak up when you truly know; otherwise, stay quiet.
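The rubric can be written out as a payoff table. The numeric payoffs below are illustrative assumptions, not values from the study; only their ordering matters:

```python
# Illustrative payoffs (assumptions): bluffing is penalized hardest,
# hedged mistakes less, and passing is always safe.
PAYOFFS = {
    ("right", "confident"): 2,
    ("right", "hesitant"): 1,
    ("wrong", "confident"): -2,
    ("wrong", "hesitant"): -1,
    ("pass", None): 0,
}

def grade(result, claim=None):
    """Grade one quiz-show turn under the rubric above."""
    return PAYOFFS[(result, claim)]

# A confident bluff loses points; an honest pass loses nothing.
print(grade("wrong", "confident"), grade("pass"))  # -2 0
```

Any payoffs with this ordering work: what matters is that a confident miss costs more than a hedged one, and that passing never costs anything.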

To judge fairly, look at two plain-English signals as you try a few different strictness levels (“be cautious,” “be normal,” “be bold”):

  • How often it chooses to answer. This is coverage.

  • How often it is right when it does answer. This is precision.

A well-calibrated model will answer less often when you ask it to be cautious, and when it does speak, it is almost always right. An overconfident model will keep talking even when asked to be careful, and you will see more polished mistakes. No equations needed. You can see the difference in a single page of examples.

What changes in practice once you grade this way?#

First, give every task a real pass option in the prompt and in your UI, and treat “I don’t know” as a legitimate outcome. This will lower the temperature and discourage performative certainty.

Second, ask for a small receipt with each answer, such as a link, a citation, or a one-line reason. You do not need a thesis. You need enough to spot-check. This shifts the model from “say it nicely” toward “show your work.”

Third, review your logs like a coach, not a scorekeeper. For each task type, skim a small sample and note how often the model spoke, how often those answers were correct, and how often it sensibly passed. Patterns appear quickly. Rare, niche facts tend to trigger confident composites. Mechanical prompts (counting or copying) benefit from handing the job to a tool (search, calculator, or code) rather than asking for more prose.

Finally, set the tone per task. For medical disclaimers or financial instructions, ask the model to be cautious by default and recognize passes. For low-stakes brainstorming, let it be bold. Show users when the model chose to pass. Transparency builds trust and makes “I don’t know” feel like professionalism, not failure.

Wrapping up#

Some labs are also tackling the problem at the system level. Gemini, for example, enforces “refusal policies” that steer the model toward saying no when it lacks support, while Anthropic experiments with “Constitutional AI,” using written principles to constrain outputs. These guardrails don’t replace calibration work, but they create a friendlier environment for abstention to be seen as success instead of failure.

Here is the practical summary from the study:

  • Fix the incentives in mainstream benchmarks: Update existing evaluations to include confidence targets and clear penalties for wrong answers. That makes abstention rational when the model is below the threshold, instead of treating “I don’t know” like failure by default.

  • Make abstention legitimate in training and UX: Bake pass options into prompts, scorers, and UI. If the grading rubric never rewards passing, models will keep guessing.

  • Evaluate behavioral calibration, not just raw accuracy: Run the same items at different strictness levels, from cautious to bold, and track coverage, precision, and abstention. Reward models that change behavior appropriately.

  • Target fixes by error type: For intrinsic slips (counting, copying, exact rules), lean on tools or structured reasoning. For extrinsic fabrications (dates, publishers, warranties), lean on retrieval and stronger incentives to abstain when evidence is thin.

Together, these changes turn confidence without evidence into something measurable and improvable: a system that talks when it should, and stays quiet when silence is the smartest move.


Written By:
Fahim ul Haq