What Is Context Engineering?
Discover how to curate and manage the full information environment that AI models use during inference, focusing on optimizing context window quality to ensure reliable, coherent responses in multi-turn and agentic systems. Understand key concepts like context rot, attention budgets, and strategies for maintaining high signal-to-noise ratios with techniques including compaction, structured note-taking, and multi-agent architectures.
A developer builds a customer support agent. The system prompt is clean, specific, and well-structured. In the first few turns, the agent handles user requests with impressive precision. Then, about a dozen messages in, something shifts. It starts ignoring constraints it followed perfectly earlier. It repeats information the user already confirmed. It loses coherence entirely.
The prompt did not change. The model did not change. What changed was the surrounding information environment the model was operating within, and nobody was managing it.
This is the problem that context engineering exists to solve. As AI systems grow more capable and take on longer, more complex tasks, the quality of their outputs depends less on finding the perfect phrasing for a single instruction and more on controlling the full information environment the model operates within. Understanding this discipline is essential for anyone building reliable applications with large language models.
Defining context engineering
To define context engineering precisely, we first need to understand what context means in the technical sense. When an LLM generates a response, it has no persistent memory across sessions and no awareness of anything outside the current interaction. The only information available to it at any moment is the complete set of tokens passed to it during that inference call. That set of tokens is its context. It includes the system prompt, the conversation history, any data retrieved from external sources, tool outputs, and examples provided by the engineer.
Context engineering is the discipline of deliberately curating, structuring, and managing that set of tokens to maximize the likelihood of a desired model output. Andrej Karpathy, a leading voice in applied AI, described it as “the delicate art and science of filling the context window with just the right information for the next step.” That framing captures something important: context engineering is both a technical practice and a craft that requires judgment.
Where prompt engineering focuses on how to write instructions, context engineering asks a broader question: given everything the model could potentially see, what is the optimal subset of information to place in front of it, in what form, and at what moment?
Context engineering is the practice of curating and managing the full set of tokens available to a model during inference, going beyond prompt writing to orchestrate all information the model receives.
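To make the definition concrete, here is a minimal sketch of how those components come together in a single inference call, written as an OpenAI-style chat message list. The function and its inputs are illustrative assumptions, not a specific library's API:

```python
def build_context(system_prompt, examples, history, retrieved_docs, tool_outputs, user_message):
    """Assemble every token the model will see for one inference call."""
    messages = [{"role": "system", "content": system_prompt}]
    for ex in examples:  # few-shot examples, shown as prior user/assistant turns
        messages.append({"role": "user", "content": ex["input"]})
        messages.append({"role": "assistant", "content": ex["output"]})
    messages.extend(history)  # prior conversation turns
    background = "\n\n".join(retrieved_docs + tool_outputs)
    messages.append({
        "role": "user",
        "content": f"Background information:\n{background}\n\nQuestion: {user_message}",
    })
    return messages  # the model's entire world for this call
```

Everything the model “knows” for this call is whatever this function returns; context engineering is deciding what goes into each of those arguments.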
Context engineering vs. prompt engineering
The relationship between context engineering and prompt engineering is one of the most important distinctions to understand clearly, because the two terms are often used interchangeably when they should not be.
Prompt engineering: The practice of designing and refining the instructions given to a language model. It focuses on wording, structure, tone, and the placement of examples within a prompt, and is concerned primarily with a single component of context: the instruction itself.
Context engineering: The broader discipline that encompasses prompt engineering and extends well beyond it. According to Anthropic's engineering team, it covers the full set of strategies for curating and maintaining the optimal set of tokens during inference, including all the information that lands in the context beyond the prompts themselves.
The distinction becomes concrete with a practical example. Suppose a model is tasked with answering a user's billing question inside a support application:
| | Prompt Engineering | Context Engineering |
| --- | --- | --- |
| Focus | Writing the instruction clearly | Deciding what data to retrieve and include |
| Scope | The system prompt | System prompt + retrieved records + conversation history + tool outputs |
| Key question | How should I phrase this? | What should the model see, and when? |
| Output | A well-worded instruction | A complete, curated information environment |
A well-engineered prompt placed inside a poorly managed context will still produce unreliable results. Equally, the most carefully retrieved data becomes noise if the instructions around it are ambiguous. The two practices are complementary: prompt engineering handles the "how to instruct" problem, while context engineering handles the "what information to provide, and how to manage it over time" problem. Together, they form a complete approach to refining an AI system's context.
Note: Context engineering does not replace prompt engineering. It is the larger framework within which prompt engineering operates.
The context window as a finite resource
Understanding context engineering requires confronting a critical constraint: the context window does not just have a token limit. Its quality degrades as it fills up. This phenomenon is known as context rot.
Context rot is the degradation in LLM output quality that happens as input context grows longer. More tokens in, worse output out, even when the model's context window is not close to full. This matters because it directly challenges a common assumption: that larger context windows solve the problem of information management. The decline is continuous rather than a sudden cliff. A model with a 200K token window can exhibit significant degradation at 50K tokens. Context window capacity is the wrong metric. Signal-to-noise ratio is what determines output quality.
The architectural reason for this lies in how transformer-based LLMs process tokens. The transformer's attention mechanism allows every token to attend to every other token across the entire context. As context length increases, the model's ability to capture these pairwise relationships gets stretched thin, creating a natural tension between context size and attention focus. Anthropic's engineering team refers to this as the model's attention budget, which every new token depletes by some amount.
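The arithmetic behind that tension is easy to see: the number of pairwise relationships attention must track grows quadratically with context length, while the model's effective capacity to use them does not. A quick illustrative sketch:

```python
# Pairwise token relationships grow quadratically with context length,
# while the model's effective attention capacity does not.
for n_tokens in (1_000, 10_000, 100_000):
    pairs = n_tokens * (n_tokens - 1) // 2
    print(f"{n_tokens:>7} tokens -> {pairs:>13,} pairwise relationships")
```

Going from 1K to 100K tokens multiplies the relationships to track by roughly 10,000x.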
Compounding this is the “lost in the middle” problem, documented in research from Stanford and UC Berkeley. LLMs perform significantly worse when relevant information is placed in the middle of the input context rather than at the beginning or end. Research measured accuracy drops of over 30% on multi-document question answering when the answer document moved from position 1 to position 10 in a 20-document context. The practical implication is serious: adding more information to the context does not guarantee the model will use it. Where that information sits within the context matters as much as whether it is there at all.
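One common mitigation is to reorder retrieved documents so the highest-signal items sit at the edges of the context, where attention is strongest. A hedged sketch, assuming documents arrive sorted from most to least relevant:

```python
def order_for_attention(docs_by_relevance):
    """Place the most relevant documents at the start and end of the context,
    pushing the weakest toward the middle, where attention is poorest.
    Assumes `docs_by_relevance` is sorted from most to least relevant."""
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]  # strongest at both edges, weakest in the middle

ordered = order_for_attention(["doc_A", "doc_B", "doc_C", "doc_D", "doc_E"])
# -> ["doc_A", "doc_C", "doc_E", "doc_D", "doc_B"]
```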
These realities together explain why context engineering is not a nice-to-have refinement. It is a structural necessity for any system that relies on LLMs to perform reliably over more than a handful of turns.
The anatomy of a well-engineered context
Given that the context window is a finite resource with diminishing returns, effective context engineering means finding the smallest possible set of high-signal tokens that maximizes the likelihood of the desired outcome. This guiding principle applies across every component of context.
System prompts form the instructional backbone of any AI application. Anthropic's engineering guidance describes the goal as finding the right altitude: specific enough to guide behavior effectively, yet flexible enough to give the model strong heuristics rather than brittle, hardcoded logic. A system prompt that micromanages every edge case becomes fragile. One that is too vague provides insufficient signal. The sweet spot is a minimal, well-organized prompt that covers the most important behaviors and leaves room for the model to reason.
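As an illustration of altitude, compare a micromanaged instruction with one that hands the model a heuristic. Both prompts below are invented for illustration:

```python
# Too low an altitude: brittle, hardcoded logic that breaks on unlisted cases.
too_specific = """If the user mentions 'refund', reply with template R-1.
If the user mentions 'invoice', reply with template I-4."""

# The right altitude: a strong heuristic the model can apply to novel cases.
right_altitude = """You are a billing support agent. Resolve the user's issue
using the account records provided in context. Cite the specific record you
relied on, and escalate to a human if the records are ambiguous."""
```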
Retrieved data is the content fed into the context from external sources, typically through a process called retrieval-augmented generation (RAG). Rather than assuming the model knows relevant facts from training, a RAG system retrieves specific documents, records, or data points and includes them in the context at inference time. This grounds the model in accurate, up-to-date information and reduces hallucination. The context engineer's job here is not to retrieve everything potentially relevant, but to retrieve only what the current task actually requires.
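A minimal RAG sketch follows. The `store.search` interface and `call_model` wrapper are assumed placeholders, not a specific library's API:

```python
def answer_with_rag(store, call_model, question, k=3):
    """Retrieve only what the task requires, then ground the model in it.
    `store` and `call_model` are assumed interfaces, not a real library."""
    hits = store.search(question, top_k=k)  # assumed: returns scored text chunks
    snippets = [hit.text for hit in hits]
    prompt = (
        "Answer using ONLY the sources below. If they are insufficient, say so.\n\n"
        + "\n---\n".join(snippets)
        + f"\n\nQuestion: {question}"
    )
    return call_model(prompt)
```

Note the small `k`: the point is not to retrieve everything that might help, but the few chunks the current question actually needs.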
Conversation history is the record of prior turns in a multi-turn interaction. Including it allows the model to maintain coherence and avoid repeating itself. Excluding or trimming it aggressively, however, risks losing important earlier context. Managing this tradeoff is one of the ongoing challenges of context engineering in long-running applications.
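A common way to manage that tradeoff is to keep the most recent turns verbatim within a fixed token budget while always preserving the system prompt. A sketch, with `count_tokens` standing in for a hypothetical tokenizer helper:

```python
def trim_history(messages, budget, count_tokens):
    """Keep the system prompt plus as many recent turns as fit in `budget`
    tokens. `count_tokens` is a hypothetical tokenizer helper."""
    system, turns = messages[0], messages[1:]
    kept, used = [], count_tokens(system["content"])
    for turn in reversed(turns):  # walk backward from the newest turn
        cost = count_tokens(turn["content"])
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    return [system] + kept[::-1]  # restore chronological order
```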
Tool outputs are the results returned when an agent calls an external function, API, or service. These outputs land directly in the context and must be managed carefully. A tool that returns a large, verbose payload when only two fields are needed is polluting the context with low-value tokens.
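The fix is to filter tool results before they enter the context. A sketch, assuming a verbose JSON payload of which only two fields matter to the task:

```python
import json

def filter_tool_output(raw_json, keep=("status", "amount_due")):
    """Strip a verbose tool payload down to the fields the task actually
    needs, so low-value tokens never enter the context."""
    payload = json.loads(raw_json)
    return {field: payload.get(field) for field in keep}

# A many-field API response collapses to the two fields the agent will use.
lean = filter_tool_output('{"status": "overdue", "amount_due": 42.5, "internal_id": "x9"}')
# -> {"status": "overdue", "amount_due": 42.5}
```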
Examples (few-shot prompting) remain one of the most reliable ways to guide model behavior. Rather than listing every possible edge case, a small set of well-chosen, diverse examples communicates expected behavior far more efficiently.
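A brief sketch of the pattern, with invented examples for a support-ticket classifier:

```python
# Two diverse examples communicate the expected format and behavior
# more cheaply than a long list of rules. (Examples are invented.)
few_shot_prompt = """Classify each support ticket as BILLING, TECHNICAL, or OTHER.

Ticket: "I was charged twice this month."
Label: BILLING

Ticket: "The app crashes when I upload a photo."
Label: TECHNICAL

Ticket: "My dashboard shows last year's invoices only."
Label:"""
```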
Agentic context engineering
Single-turn interactions, where a user sends one message and the model responds once, place relatively modest demands on context management. The real challenge emerges in agentic systems, where a model operates autonomously across multiple steps, calls external tools, and works toward a goal that may take dozens of inference cycles to complete.
In these settings, agentic context engineering becomes the defining engineering problem. With each step an agent takes, it generates new information: tool outputs, intermediate results, observations, and decision logs. All of this data could theoretically be relevant to the next step. If it is all kept in the context, the window fills with noise. If it is discarded too aggressively, the agent loses the thread of its own task.
Anthropic's engineering team describes a pattern called “just in time” context, where agents maintain lightweight references such as file paths, stored queries, and web links rather than loading full data objects into the context. When a specific piece of information is needed, the agent retrieves it on demand rather than keeping it loaded throughout the task. This mirrors how a skilled human researcher works: they do not memorize every document they might need. They maintain an index and look things up when the moment requires it.
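A hedged sketch of the pattern: the context carries only lightweight references, and a helper dereferences them when a step actually needs the data. The paths, query, and `run_query` helper are illustrative assumptions:

```python
from pathlib import Path

# The context carries only lightweight references, not the data behind them.
REFERENCES = {
    "billing_policy": Path("docs/policies/billing.md"),  # illustrative path
    "customer_record": "SELECT plan, balance FROM customers WHERE id = :id",  # stored query
}

def load_reference(name, run_query):
    """Dereference on demand, so the full document occupies the context only
    for the step that needs it. `run_query` is an assumed database helper."""
    ref = REFERENCES[name]
    return ref.read_text() if isinstance(ref, Path) else run_query(ref)
```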
For long-horizon tasks that span tens of minutes or hours, additional techniques become necessary. Three worth understanding are:
Compaction: Periodically summarizing or compressing accumulated conversation history and tool outputs to reduce token count while preserving the most important information (a code sketch follows this list).
Structured note-taking: Having the agent write key findings and decisions to an external memory store, which it can consult later without keeping the full detail in the active context window.
Multi-agent architectures: Distributing work across specialized sub-agents, each with its own clean context window focused on a narrow subtask, rather than having one agent accumulate context across an entire complex project.
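A minimal compaction sketch, assuming `count_tokens` and `summarize` helpers backed by a tokenizer and the model itself; the threshold and the number of turns kept verbatim are illustrative:

```python
def compact(messages, count_tokens, summarize, threshold=50_000, keep_recent=10):
    """When accumulated history crosses the threshold, replace the older turns
    with a model-written summary while keeping the newest turns verbatim.
    Assumes the history has accumulated well past `keep_recent` turns."""
    total = sum(count_tokens(m["content"]) for m in messages)
    if total <= threshold:
        return messages
    system, old, recent = messages[0], messages[1:-keep_recent], messages[-keep_recent:]
    summary = summarize(old)  # assumed: an LLM call that condenses the old turns
    note = {"role": "user", "content": f"Summary of earlier conversation:\n{summary}"}
    return [system, note] + recent
```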
Together, these strategies shift the role of the engineer. In agentic systems, the primary challenge is less about writing the right words in a prompt and more about designing the information flow that surrounds the model at every step of its reasoning process.
Why context engineering matters beyond agents
While agentic context engineering is where the discipline is most visible, the principles apply to any AI application that goes beyond a single turn. A chatbot managing a long customer conversation, a document analysis tool processing a large file, or a coding assistant working through a complex codebase all face the same underlying challenge: the model only knows what it is shown, and what it is shown must be chosen carefully.
Context engineering is also closely tied to cost and reliability in production systems. Coding agents routinely push past 100K tokens in a session. Every file read, search result, and tool output stays in the window for the rest of the session. Real coding tasks take 15 to 60 minutes, during which context continuously degrades. Bloated contexts do not just degrade output quality. They increase the computational cost of every inference call. A lean, well-managed context is both more accurate and more economical.
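The economics are easy to sketch. Assuming a purely illustrative rate of $3 per million input tokens (not any provider's actual pricing), compare a session that carries its full history on every call against one that compacts aggressively:

```python
PRICE_PER_TOKEN = 3 / 1_000_000  # illustrative rate, not a real price list

def session_cost(context_tokens_per_call, calls):
    """Input-token cost of a session that sends this much context every call."""
    return context_tokens_per_call * calls * PRICE_PER_TOKEN

bloated = session_cost(100_000, calls=40)   # full history each call -> $12.00
compacted = session_cost(25_000, calls=40)  # compacted context      -> $3.00
```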
The practical reframing that context engineering demands is a shift in the central question. The question moves from “how do I fit more tokens in?” to “how do I keep irrelevant tokens out?” That shift, from accumulation to curation, is what separates a fragile AI prototype from a reliable production system.
Conclusion
Context engineering represents a maturation in how the field thinks about working with LLMs. As models become more capable and tasks more complex, managing the information environment around a model is as important as crafting the instructions within it. The core principles of curating for relevance, managing finite attention budgets, and designing information flow across multi-step interactions apply whether you are building a simple chatbot or a sophisticated autonomous agent. A solid grasp of context engineering gives practitioners the foundation to build AI systems that remain accurate, coherent, and cost-effective even as the complexity of their tasks scales up.