
Prompt Engineering Best Practices

Explore essential prompt engineering best practices to design clear and specific instructions, use structured templates and examples, apply delimiters, enable step-by-step reasoning, and conduct iterative testing. This lesson helps you create reliable, consistent AI outputs suitable for production applications.

Consider two engineers working on the same task: extracting structured data from customer support tickets and returning it as JSON. Both use the same model. The first engineer types a quick instruction and runs it. The model returns something roughly useful, but inconsistently formatted, occasionally missing fields, and sometimes slipping into explanatory prose instead of clean JSON. The second engineer spends fifteen extra minutes applying a handful of deliberate techniques. Their model produces clean, correctly structured JSON on every run.

The difference between these two outcomes has nothing to do with the model. It has everything to do with the quality of the prompt. The second engineer is applying prompt engineering best practices, a set of proven, documented techniques that make the difference between an AI feature that demos well and one that holds up in production.

This lesson walks through those practices systematically, explaining what each one is, why it works, and how to apply it.

Start with clarity and specificity

The single most important prompt engineering best practice is also the most straightforward: be clear and specific about what you need.

Language models are probabilistic systems. They do not have intent, and they do not infer what we meant to say. They respond to what we actually wrote. Ambiguity in a prompt does not get resolved by the model using common sense. It gets resolved by the model making a guess, and that guess may be a perfectly reasonable one that happens to be wrong for your specific use case.

Consider this prompt:

Prompt: Summarize this article.

The model has no idea how long the summary should be, who the audience is, what format it should take, or which aspects of the article matter most. It will produce something reasonable and generic. Compare that to:

Prompt: Summarize the following article in three bullet points for a non-technical executive. Keep each bullet to no more than 25 words and focus on business impact.

The second prompt eliminates guesswork entirely. Specificity across four dimensions drives this improvement:

  1. The task (summarize)

  2. The audience (non-technical executive)

  3. The format (three bullet points)

  4. The constraint (25 words each, business focus)

Whenever a prompt is not performing as expected, the first question to ask is: Have I been specific enough about all four of these dimensions?
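The four dimensions can be made explicit in code. The helper below is a hypothetical sketch, not any library's API; it simply forces every call site to supply all four dimensions rather than leaving any to chance:

```python
def build_prompt(task: str, audience: str, fmt: str, constraint: str) -> str:
    """Assemble a prompt that pins down all four specificity dimensions."""
    return (
        f"{task} for {audience}. "
        f"Format: {fmt}. "
        f"Constraints: {constraint}."
    )

prompt = build_prompt(
    task="Summarize the following article",
    audience="a non-technical executive",
    fmt="three bullet points",
    constraint="no more than 25 words per bullet, business focus",
)
```

Because each dimension is a required argument, an under-specified prompt becomes a visible error at the call site instead of a silent quality problem in the output.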

Use a prompt engineering template

One of the most practical prompt engineering methods for building consistent, reusable prompts is to work from a structured template. Rather than writing every prompt from scratch, a template defines the standard components that a high-performing prompt should include and gives each component a designated place.

A reliable prompt engineering template consists of five components:

  • Role: Assigns the model a persona or professional context that frames its tone and approach. Example: “You are a senior technical writer with expertise in developer documentation.”

  • Task: States the core action the model must perform. This should be a single, clear instruction. Example: “Your task is to rewrite the following API reference section to be clearer and more concise.”

  • Context: Provides the grounding information the model needs to perform the task accurately. This is where retrieved data, user-supplied details, or background information goes. Example: “The audience is mid-level software engineers who are familiar with REST APIs but new to this specific platform.”

  • Examples: Demonstrates the desired output through one or more input-output pairs. Even a single well-chosen example dramatically narrows the space of acceptable responses.

  • Output format: Specifies the structure, length, tone, and format of the expected response. Example: “Respond in plain prose. Maximum 150 words. Avoid jargon.”

Not every prompt needs all five components. A simple single-turn task may need only a task and an output format. The value of thinking in templates is that it forces deliberate decisions about each component rather than leaving them to chance. In production systems, these templates become dynamic structures where the task and role remain fixed while the context slot is populated programmatically with real-time data on every call.
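One way to make the five-component template concrete is a small rendering helper. This is a minimal sketch; the XML tag names and the example values are illustrative, not a specific library's API:

```python
TEMPLATE = """\
<role>{role}</role>

<task>{task}</task>

<context>{context}</context>

<examples>{examples}</examples>

<output_format>{output_format}</output_format>"""


def render_prompt(*, role, task, context, examples, output_format):
    """Fill the five standard slots. In production, the role and task stay
    fixed while `context` is populated programmatically on every call."""
    return TEMPLATE.format(
        role=role,
        task=task,
        context=context,
        examples=examples,
        output_format=output_format,
    )


prompt = render_prompt(
    role="You are a senior technical writer with expertise in developer documentation.",
    task="Rewrite the following API reference section to be clearer and more concise.",
    context="The audience is mid-level engineers familiar with REST APIs.",
    examples="Input: 'GET /v1/users returns users.' Output: 'GET /v1/users lists all users in the account.'",
    output_format="Respond in plain prose. Maximum 150 words. Avoid jargon.",
)
```

Keeping the slots keyword-only makes it impossible to fill a component into the wrong place, which is exactly the kind of deliberate decision-making the template is meant to enforce.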

Provide examples

If specificity is the most important single principle, examples are the most powerful single technique. Providing the model with one or more demonstrations of the desired input-output behavior is consistently more effective than writing longer, more detailed instructions.

This technique is called few-shot prompting when multiple examples are provided, and one-shot prompting when a single example is used. The underlying reason it works is that examples communicate intent at a level of precision that instructions alone rarely achieve. An example does not just tell the model what to do. It shows the model the exact format, tone, level of detail, and style of reasoning that a correct response requires.

Anthropic's engineering guidance describes examples as the “pictures worth a thousand words” of prompt engineering. Rather than enumerating every edge case in a list of rules, a small set of well-chosen, diverse examples conveys the same information far more efficiently and reliably.

When selecting examples for a few-shot prompt, three principles apply:

  • Diversity over quantity: Two or three examples that cover meaningfully different cases outperform ten examples that are all variations of the same scenario.

  • Match the required format: If the desired output is a JSON object, the examples should show JSON objects, not prose descriptions of what a JSON object should contain.

  • Keep examples canonical: Examples should represent the expected, correct behavior in typical cases, not an exhaustive catalog of edge cases. Edge cases are better handled through explicit constraints in the instruction.
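These principles can be sketched as a small prompt builder for the ticket-extraction task from the introduction. The demonstration tickets below are invented for illustration; the point is the structure: two diverse, canonical examples in exactly the output format the task requires:

```python
import json


def few_shot_prompt(instruction, examples, new_input):
    """Build a few-shot prompt from (input, output) demonstration pairs."""
    parts = [instruction, ""]
    for ticket, extracted in examples:
        parts.append(f"Ticket: {ticket}")
        # Show real JSON, not a prose description of JSON.
        parts.append(f"JSON: {json.dumps(extracted)}")
        parts.append("")
    parts.append(f"Ticket: {new_input}")
    parts.append("JSON:")
    return "\n".join(parts)


# Two meaningfully different cases (billing vs. outage) beat ten near-duplicates.
examples = [
    ("Refund order #1234, I was charged twice.", {"issue": "billing", "order_id": "1234"}),
    ("Site down since 9am, every page shows error 503.", {"issue": "outage", "order_id": None}),
]
prompt = few_shot_prompt(
    "Extract the issue type and order ID from each support ticket as JSON.",
    examples,
    "Package #5678 never arrived.",
)
```

Ending the prompt with a dangling `JSON:` cue nudges the model to continue the established pattern rather than open with explanatory prose.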

Use delimiters and structure

As prompts grow more complex, containing a system instruction, background context, user input, and examples all in the same text block, the model can struggle to distinguish where one component ends and another begins. This ambiguity leads to the model misinterpreting which part of the prompt it should follow as an instruction versus which part is data it should process.

The solution is to use delimiters to create clear structural boundaries within the prompt. Three common approaches are as follows:

  • XML tags: Wrapping components in descriptive tags such as <instructions>, <context>, <example>, and <user_input> creates unambiguous separation that models handle reliably.

  • Triple backticks: Using ``` to fence off blocks of data or user-supplied content signals to the model that the enclosed text should be treated as input to process, not as an instruction to follow.

  • Markdown headers: Using ## headers to organize sections of a long system prompt makes the structure explicit and keeps related instructions grouped together.

Here is the same prompt, first without delimiters and then with them:

Without delimiters:

You are a helpful assistant. Summarize the text below in one sentence.

The quarterly results exceeded expectations with revenue up 18%.

With delimiters:

You are a helpful assistant. Summarize the text in <document> tags in one sentence.

<document>
The quarterly results exceeded expectations with revenue up 18%.
</document>
The structured version removes any ambiguity about what the instruction is and what the data is. This matters especially in production systems where user-supplied content is injected into the prompt dynamically, since a user could otherwise craft an input that gets interpreted as an instruction, which is the basis of prompt injection attacks.

Ask the model to think step by step

For tasks that require multi-step reasoning, mathematical calculation, logical analysis, or any problem where arriving at the right answer depends on correctly sequencing intermediate steps, instructing the model to reason explicitly before producing a final answer significantly improves accuracy.

This technique is called chain-of-thought (CoT) prompting. It improves performance on a range of arithmetic, commonsense, and symbolic reasoning tasks, with empirical gains that can be striking. When a model is forced to show its work, it allocates more tokens, and therefore more computation, to each sub-problem, reducing shortcut guesses and surfacing hidden errors.

In practice, chain-of-thought prompting can be applied in two ways:

  • Zero-shot CoT: Appending a phrase such as “Think step by step before giving your final answer” to the prompt, with no examples of reasoning provided. This is the fastest approach and works well for most reasoning tasks on capable models.

  • Few-shot CoT: Including examples where the input, reasoning chain, and final answer are all shown. This is more effective for specialized domains where the structure of correct reasoning is non-obvious.

It is worth noting that CoT prompting yields meaningful performance gains primarily with larger models. Smaller models may produce illogical chains of thought, which can actually lead to worse accuracy than standard prompting. For straightforward tasks that do not require multi-step reasoning, adding chain-of-thought instructions is unnecessary and increases token usage without a corresponding improvement in output quality.
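Since CoT is worthwhile only for reasoning-heavy tasks, a natural pattern is to make the suffix conditional. This is a minimal sketch; the suffix wording follows the zero-shot example above, and the flag name is illustrative:

```python
COT_SUFFIX = "Think step by step before giving your final answer."


def apply_cot(prompt: str, needs_reasoning: bool) -> str:
    """Append a zero-shot chain-of-thought instruction only for tasks
    that involve multi-step reasoning; leave simple prompts untouched
    to avoid unnecessary token usage."""
    if not needs_reasoning:
        return prompt
    return f"{prompt}\n\n{COT_SUFFIX}"


simple = apply_cot("Translate 'bonjour' to English.", needs_reasoning=False)
hard = apply_cot(
    "A train leaves at 3:40 pm and the trip takes 2 h 35 min. When does it arrive?",
    needs_reasoning=True,
)
```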

Iterative prompting and version control

A prompt is rarely correct on the first attempt. One of the most important mindset shifts in advanced prompt engineering is treating prompt development as an iterative engineering process, not a one-time writing task.

Iterative prompting is the disciplined practice of testing a prompt against a representative set of inputs, analyzing where it fails, forming a hypothesis about why, making a targeted change, and re-testing. This cycle mirrors how software is debugged. Each iteration should change one variable at a time, just as a controlled experiment does, so that the effect of each change can be measured clearly.

This iterative approach connects directly to the best practices for managing AI prompts and evaluation data in production systems. Specifically:

  • Version-controlling prompts: Prompts used in production should be stored in a version control system, such as a Git repository, alongside the test cases used to evaluate them. This makes it possible to track what changed between versions, reproduce earlier behavior, and roll back a change that degraded performance.

  • Maintaining an evaluation dataset: A curated set of representative inputs with expected outputs is essential for measuring whether a prompt change is an improvement or a regression. Without a test dataset, prompt iteration becomes guesswork.

  • Documenting the reasoning behind changes: Noting why a particular wording was chosen, or what failure mode a specific constraint was added to address, makes the prompt maintainable by other engineers and easier to revisit later.

This discipline is what separates a fragile prototype prompt from a production-grade one. The prompt itself is only part of the asset. The test cases, version history, and evaluation results that surround it are equally important.
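The evaluation loop at the heart of this discipline can be sketched in a few lines. Everything here is illustrative: `fake_model` stands in for a real API call so the example is runnable, and in practice the dataset and prompt versions would live in version control together:

```python
def evaluate(render_prompt, run_model, dataset):
    """Score one prompt version against an evaluation dataset, so each
    change can be measured as an improvement or a regression."""
    hits = sum(
        run_model(render_prompt(case["input"])) == case["expected"]
        for case in dataset
    )
    return hits / len(dataset)


# Stand-in for a real model call, so the loop is runnable here:
# it echoes the last line of the prompt in upper case.
def fake_model(prompt):
    return prompt.splitlines()[-1].upper()


dataset = [
    {"input": "refund please", "expected": "REFUND PLEASE"},
    {"input": "login broken", "expected": "LOGIN BROKEN"},
]
score = evaluate(lambda text: f"Uppercase this:\n{text}", fake_model, dataset)
```

Because the scoring is automated, changing one variable in the prompt and re-running the loop gives a number to compare against the previous version instead of an impression.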

Calibrate output format precisely

A prompt that specifies the task clearly but leaves the output format undefined will produce responses that vary in structure, length, and style from one call to the next. For user-facing applications, this variability is inconvenient. For programmatic use cases where the output is parsed by downstream code, it is a defect.

Output format specification should be treated as a required component of any production prompt. The key dimensions to specify are:

  • Structure: Specify prose, list, table, JSON, or XML. Example: “Return a JSON object with keys: name, date, amount.”

  • Length: Specify a word count, sentence count, or token guidance. Example: “No more than 100 words.”

  • Tone: Specify formal, conversational, technical, or accessible. Example: “Write in plain, accessible language for a general audience.”

  • Constraints: Specify what to exclude or avoid. Example: “Do not include introductory or closing sentences.”

For applications that consume the model's output programmatically, specifying JSON output with a defined schema eliminates parsing ambiguity entirely. Including a concrete example of the expected output format in the prompt, as a few-shot demonstration, is the most reliable way to enforce structural consistency.
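On the consuming side, a validation step turns structural drift into a loud failure instead of a silent bug. This sketch assumes the three keys from the structure example above; the fence-stripping handles a common failure mode where models wrap JSON in code fences despite instructions:

```python
import json

REQUIRED_KEYS = {"name", "date", "amount"}


def parse_model_output(raw: str) -> dict:
    """Parse a model response that should be a JSON object with a fixed
    set of keys, raising on any structural deviation."""
    text = raw.strip()
    # Models sometimes wrap JSON in ```json fences despite instructions.
    if text.startswith("```"):
        text = text.strip("`").removeprefix("json").strip()
    data = json.loads(text)  # raises json.JSONDecodeError on non-JSON prose
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


record = parse_model_output(
    '```json\n{"name": "Acme", "date": "2024-05-01", "amount": 120.5}\n```'
)
```

Failures caught here, rather than deeper in the pipeline, also make useful additions to the evaluation dataset for the next prompt iteration.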

Use prompt engineering tools effectively

Writing and refining prompts manually is feasible for small-scale use, but as the number of prompts, models, and evaluation cases grows, dedicated prompt engineering tools become essential for maintaining quality and consistency.

Several categories of tools support the prompt engineering workflow:

  • Prompt playgrounds such as Anthropic's Console, OpenAI Playground, and Google AI Studio provide interactive environments for testing prompts against models, comparing outputs side by side, and adjusting parameters like temperature in real time. These are the primary environments for rapid prototyping.

  • Prompt management platforms such as PromptLayer and LangSmith allow teams to store, version, and organize prompts centrally, track performance metrics across versions, and collaborate on prompt development across engineering teams.

  • Evaluation frameworks such as Anthropic's built-in evaluation tooling and OpenAI's open-source evaluations project provide the infrastructure to run structured tests at scale, measuring prompt performance against accuracy, relevance, format compliance, and safety criteria before a prompt reaches production.

Using these tools in combination supports a repeatable, professional development workflow: prototype in a playground, refine iteratively against an evaluation dataset, manage versions in a prompt library, and measure production performance through continuous monitoring.

Conclusion

Prompt engineering best practices form the technical foundation for getting consistent, high-quality outputs from large language models across a wide range of use cases. Applying structured templates, providing concrete examples, using delimiters, enabling step-by-step reasoning, and iterating methodically against evaluation data are not advanced optimizations reserved for complex systems. They are the baseline disciplines that separate reliable AI applications from unpredictable ones. As models continue to evolve, these principles remain stable because they address the fundamental nature of how language models interpret and respond to input. Investing in these practices early pays dividends at every stage of development.