Search⌘ K
AI Features

Prompt Injection, Leaking, and Jailbreaking

Understand the mechanics and risks of prompt injection, leaking, and jailbreaking attacks in large language models. Learn how adversarial input can alter or expose system prompts and bypass safety features, which compromises AI applications. This lesson helps build foundational knowledge to prepare for effective defensive strategies.

In the previous lesson, techniques like PAL and ReAct gave LLMs the ability to call external tools, execute code, and interact with live systems. That same capability, however, dramatically expands the attack surface available to malicious users. Every tool call, every database query, and every API request becomes a potential vector for exploitation if an attacker can manipulate the model’s behavior through carefully crafted text.

The root cause is architectural. LLMs process all text, whether it comes from a trusted system prompt written by the developer or from untrusted user input, as a single stream of tokens. The model’s attention mechanism weighs these tokens based on relevance and position, not based on trust level. There is no built-in firewall separating “instructions I should follow” from “data I should process.” This fundamental design characteristic makes LLMs inherently vulnerable to a family of attacks known collectively as adversarial prompting.

Three distinct attack categories exploit this vulnerability. Prompt injection tricks the model into following attacker-supplied instructions instead of the developer’s system prompt. Prompt leaking extracts the hidden system prompt itself, exposing confidential business logic. Jailbreaking bypasses the model’s safety alignment to generate content it was trained to refuse. Each targets a different layer of the system, but all three stem from the same root problem.

Consider a concrete scenario to ground these ideas. Imagine an LLM-powered customer-service agent deployed by an e-commerce company. This agent has a system prompt containing return-policy rules, access to an internal order-lookup API, and instructions never to reveal discount codes. An attacker who can manipulate this agent through crafted input could extract those discount codes, override the return policy, or trigger unauthorized refunds through the API. These are not hypothetical risks; they are actively exploited against production systems.

This lesson walks through the mechanics and severity of each attack type so that the defensive strategies covered in the next lesson are ...