...

/

Hallucinations and Jailbreaks

Hallucinations and Jailbreaks

Learn how hallucinations and jailbreaks occur in LLMs and how interpretability reveals their internal causes.

Imagine you’re building a conversational AI that confidently makes up a fact about your company’s product during a demo, or worse, someone finds a way to trick it into revealing sensitive info despite safeguards. These scenarios illustrate hallucinations and jailbreaks, two notorious issues with large language models (LLMs). They frequently appear in real-world use of AI, so it’s no surprise they also pop up in technical interviews for GenAI roles. Interviewers love to ask about these topics because they test your understanding of how AI models behave and how to make them safer.

Hallucinations (when an AI fabricates information) and jailbreaks (when a user bypasses an AI’s safety filters with clever prompts) are hot topics in the AI community. Companies deploying LLMs (from customer service bots to coding assistants) need engineers who grasp why these issues happen and how to address them. In an interview, if you can explain why language models hallucinate and how interpretability research helps tackle these problems, you’ll demonstrate a forward-thinking approach to AI safety.

Press + to interact

When an interviewer asks, “How can interpretability help improve LLM safety and alignment?”, they are probing for several things. Here’s what they typically want to hear: a few things in your response. This lesson will equip you with a deep understanding of these concepts so you can confidently discuss them in an interview.

What hallucinations and refusals are in LLMs?

Before diving into interpretability and safety, we must clearly understand the phenomena we’re trying to address: hallucinations and refusals in language models.

In the context of LLMs, it’s widely known that a hallucination is when the model produces information that isn’t grounded in reality or the provided data. The response might sound confident and detailed, but it’s essentially made up. For example, if you ask a vanilla GPT model a question it doesn’t truly know, it might still give you a detailed answer that looks correct but is factually false. It could cite non-existent research papers, invent biographical details about a person, or misstate how a tool works—all without the model “realizing” it’s wrong. Why does this happen? Because the model’s training objective is usually to predict likely text. It has seen millions of examples of Q&A or explanatory text, and when faced with an unfamiliar query, it will still try to produce a “reasonable” answer by drawing on whatever fragments seem relevant. It has no direct interface to reality or a database (unless explicitly connected to one); it’s generating words that statistically follow from the input it got. So, if the training data doesn’t contain the answer, the model improvises. In short, the model doesn’t know when it doesn’t know—unless we teach it that behavior (we’ll get to that soon). Hallucinations are a big problem because they can mislead users or spread misinformation while sounding authoritative.

A refusal is when a model declines to fulfill a request. You often see this in aligned or instruction-tuned models (like ChatGPT, Claude, etc.) when a user asks for something against the usage guidelines. For instance, if you ask ChatGPT, “How do I make weapons at home?”, it will likely respond with a refusal: “I’m sorry, but I cannot assist with that request.” This behavior is intentionally trained as a safety measure. Refusals also occur in milder forms, like the model saying it doesn’t have enough information or is unsure, especially if the system has been tuned to avoid guessing. In the context of hallucinations, a refusal can be seen as the model playing it safe: instead of making something up, it chooses not to answer or to ask for clarification. Some advanced models have been specifically tuned to prefer silence over fabrication in certain situations. For example, Anthropic’s Claude has often responded with, “I’m not sure about that” or a gentle refusal if it’s uncertain about an answer, rather than confidently guessing. This design choice aims to reduce hallucinations by injecting a bit of humility into the model.

Hallucinations and refusals are two sides of the same coin: how an LLM handles uncertainty or prohibition. A base language model (pretrained only on text) doesn’t naturally refuse to answer – it was trained to always say something. It only learned to continue text, not to judge whether it should continue. Refusals come into ...