Hallucinations and Jailbreaks
Explore hallucinations and jailbreaks in large language models, understanding why models fabricate information or refuse answers. Learn how internal competing circuits influence model behavior and the role interpretability plays in identifying and mitigating safety risks in AI development.
Imagine you’re building a conversational AI that confidently makes up a fact about your company’s product during a demo, or worse, someone finds a way to trick it into revealing sensitive info despite safeguards. These scenarios illustrate hallucinations and jailbreaks, two notorious issues with large language models (LLMs). They frequently appear in real-world use of AI, so it’s no surprise they also pop up in technical interviews for GenAI roles. Interviewers love to ask about these topics because they test your understanding of how AI models behave and how to make them safer.
Hallucinations (when an AI fabricates information) and jailbreaks (when a user bypasses an AI’s safety filters with clever prompts) are hot topics in the AI community. Companies deploying LLMs (from customer service bots to coding assistants) need engineers who grasp why these issues happen and how to address them. In an interview, if you can explain why language models hallucinate and how interpretability research helps tackle these problems, you’ll demonstrate a forward-thinking approach to AI safety.
When an interviewer asks, “How can interpretability help improve LLM safety and alignment?”, they are probing for several things in your response. This lesson will equip you with a deep understanding of these concepts so you can confidently discuss them in an interview.
What are hallucinations and refusals in large language models?
Before diving into interpretability and safety, we must clearly understand the phenomena we’re trying to address: hallucinations and refusals in language models.
In the context of LLMs, a hallucination is when the model produces information that isn’t grounded in reality or in the provided data. The response might sound confident and detailed, but it’s essentially made up. For example, if you ask a vanilla GPT model a question it doesn’t truly know the answer to, it might still give you a detailed answer that looks correct but is factually false. It could cite non-existent research papers, invent biographical details about a person, or misstate how a tool works, all without the model “realizing” it’s wrong. Why does this happen? Because the model’s training objective is to predict the next token that best fits the preceding text: it is rewarded for producing plausible-sounding continuations, not for checking whether those continuations are true.
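To make that concrete, here is a minimal sketch that inspects a model’s next-token probability distribution. It assumes the Hugging Face transformers library and the public gpt2 checkpoint purely for illustration; the prompt and names are not from this lesson. The takeaway is that the objective only scores how plausible each continuation is, with no separate signal for whether it is factually correct.

```python
# Minimal sketch: peek at a causal language model's next-token distribution.
# Assumes the Hugging Face `transformers` library and the public `gpt2`
# checkpoint; any causal LM would illustrate the same point.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The capital of Australia is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)

# Softmax over the logits at the final position gives the model's
# probability distribution over the *next* token only.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top_probs, top_ids = next_token_probs.topk(5)

for prob, token_id in zip(top_probs, top_ids):
    print(f"{tokenizer.decode(token_id.item())!r}: {prob.item():.3f}")

# The training objective rewards whichever continuation is most plausible
# given the prompt; nothing in it distinguishes a true completion from a
# merely fluent one, which is exactly the gap hallucinations exploit.
```

Running a sketch like this for a prompt the model is unsure about shows probability spread across several fluent candidates, and the model will happily sample one of them even when it happens to be wrong.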