
Hands-On: Prompt Engineering

Explore how to design effective prompts that control large language model behavior without retraining. Understand and apply zero-shot, few-shot, and instruction refinement methods through hands-on Python examples to create consistent, accurate, and production-ready LLM outputs.

A developer building a customer-support classifier sends the same product complaint to an LLM twice, once with a casual prompt and once with a tightly structured one. The casual version returns a three-paragraph essay about customer feelings. The structured version returns a single word: “Negative.” Same model, same input text, completely different outputs. The difference is entirely in how the prompt is worded, and that difference determines whether the system is production-ready or unreliable.

Prompt engineering is the primary lever for controlling LLM behavior without retraining the model. Instead of adjusting billions of parameters, you adjust the instructions the model receives. This lesson implements three core techniques through runnable Python code against the OpenAI API. Zero-shot prompting sends a task with no examples. Few-shot prompting prepends a small set of labeled examples before the query. Instruction refinement iteratively tightens the prompt’s wording to constrain output format and content. Each technique directly impacts response accuracy, token usage, and latency, making prompt design a critical factor in production cost efficiency.

Every concept in this lesson is explored by writing, executing, and comparing prompts side by side. You will see weak prompts fail and optimized prompts succeed against the same inputs.

Setting up the environment

The setup is minimal. You need Python 3.10 or later, the openai Python package, and an API key stored as an environment variable named OPENAI_API_KEY. The code below defines a small helper function called call_llm that sends a prompt to the ChatCompletion endpoint, the OpenAI API method that accepts a sequence of messages (system, user, assistant) and returns a model-generated response, powering both conversational and single-turn interactions. This helper abstracts away boilerplate so you can focus entirely on prompt design throughout the lesson.

Note: Temperature is set to 0 in all examples. This makes the model’s output nearly deterministic, which is essential when comparing prompt variations. In production, you might raise temperature for creative tasks, but during experimentation, reproducibility matters more.

The following code block imports the library, defines the helper, and runs a quick sanity check.

Python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def call_llm(prompt, model="gpt-4.1", temperature=0):
    """Send a single user prompt and return the model's text response."""
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=[
            {"role": "user", "content": prompt}
        ],
    )
    return response.choices[0].message.content

result = call_llm("What is prompt engineering?")
print(result)

With the helper in place, every experiment from here forward requires only a prompt string and a single function call.

Zero-shot prompting in practice

Zero-shot prompting means giving the model a task with no examples at all. The model relies entirely on patterns learned during pretraining to interpret what you want. For well-known tasks like sentiment classification, this often works, but the quality depends heavily on how the prompt is structured.

Weak vs. optimized zero-shot prompts

Consider classifying the sentiment of product reviews. A vague prompt like “What do you think about this review?” gives the model no constraints. It might respond with an opinion, a summary, or a multi-sentence analysis. The output format changes unpredictably across different reviews.

Now compare that with a structured zero-shot prompt that assigns a role (“You are a sentiment classifier”), specifies the task explicitly, and constrains the output format (“Respond with exactly one word: Positive, Negative, or Neutral”). This version eliminates ambiguity. The model knows what role to play, what task to perform, and what shape the answer should take.

Zero-shot prompting works well for tasks the model encountered extensively during pretraining, such as sentiment analysis, translation, and simple summarization. It breaks down when the task is domain-specific, ambiguous, or requires an output schema the model has rarely seen.

Practical tip: Always start with zero-shot. It costs the fewest tokens and often works for common tasks. Only escalate to more complex techniques when zero-shot output is inconsistent.

The code below runs both the weak and optimized prompts against two different product reviews so you can observe the consistency gap.

Python
import openai

client = openai.OpenAI()  # assumes OPENAI_API_KEY is set in environment

# --- Prompt Templates ---
# Weak: no role, no explicit task, no format constraint
WEAK_PROMPT = "What do you think about this review? Review: {review}"

# Optimized: role assignment + explicit task + format constraint
OPTIMIZED_PROMPT = (
    "You are a sentiment classifier. "  # role assignment
    "Classify the sentiment of the following product review. "  # explicit task
    "Respond with exactly one word: Positive, Negative, or Neutral. "  # format constraint
    "Review: {review}"
)

# --- Sample Reviews ---
reviews = [
    "This product is absolutely amazing! Best purchase I've ever made.",  # clearly positive
    "Terrible quality. Broke after one day. Complete waste of money.",  # clearly negative
]

def query_llm(prompt: str) -> str:
    """Send a single prompt to the LLM and return the response text."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output for comparison
    )
    return response.choices[0].message.content.strip()

# --- Run Comparison ---
for review in reviews:
    print(f"Review: {review}")
    # Inject review into weak prompt and query
    weak_output = query_llm(WEAK_PROMPT.format(review=review))
    print(f"Weak prompt output: {weak_output}")
    # Inject review into optimized prompt and query
    optimized_output = query_llm(OPTIMIZED_PROMPT.format(review=review))
    print(f"Optimized prompt output: {optimized_output}")

Running this code reveals that the weak prompt produces varying response lengths and formats, while the optimized prompt consistently returns a single classification word. That consistency is what makes a prompt production-ready.
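One way to operationalize that consistency is to validate every response against the allowed label set before any downstream code trusts it. The sketch below reuses the query_llm helper and OPTIMIZED_PROMPT template from the code above; classify_with_validation and ALLOWED_LABELS are hypothetical names introduced here for illustration.

Python
ALLOWED_LABELS = {"Positive", "Negative", "Neutral"}

def classify_with_validation(review: str) -> str:
    """Run the optimized prompt and reject any output outside the allowed label set."""
    output = query_llm(OPTIMIZED_PROMPT.format(review=review))
    if output not in ALLOWED_LABELS:
        # An out-of-set answer means the prompt still needs tightening (or the call should be retried).
        raise ValueError(f"Unexpected model output: {output!r}")
    return output

print(classify_with_validation("Arrived late, but the build quality is excellent."))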

Few-shot prompting with examples

When zero-shot prompting produces inconsistent or inaccurate results, especially for nuanced or domain-specific tasks, few-shot prompting provides the model with a small number of input-output examples before the actual query. These examples act as a behavioral template, steering the model toward a specific output schema and reducing ambiguity.

Selecting and formatting examples

Choosing the right examples matters more than choosing many examples. Three best practices guide effective example selection, and the sketch after this list shows one way to enforce them in code.

  • Cover edge cases: Include at least one example that represents a tricky or borderline input, such as a review with mixed sentiment, so the model learns how to handle ambiguity.

  • Maintain balanced label distribution: If you are classifying into three categories, include at least one example per category to prevent the model from favoring a single label.

  • Use consistent delimiters: Format every example identically (for instance, “Review: … Sentiment: …”) so the model can detect the pattern reliably.
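To see how these practices translate into code before writing full prompts by hand, here is a minimal sketch of a prompt builder that formats every example with the same "Review: … / Sentiment: …" delimiters and warns when a sentiment class has no examples. The build_few_shot_prompt function is a hypothetical helper introduced for illustration only.

Python
from collections import Counter

EXPECTED_LABELS = {"Positive", "Negative", "Neutral"}

def build_few_shot_prompt(examples, new_review):
    """Assemble a few-shot sentiment prompt with identical formatting for every example.

    examples: list of (review_text, label) pairs.
    """
    label_counts = Counter(label for _, label in examples)
    missing = EXPECTED_LABELS - set(label_counts)
    if missing:
        print(f"Warning: no examples for labels {sorted(missing)}")  # guard against biased example sets

    lines = ["Classify the sentiment of each review as Positive, Negative, or Neutral.", ""]
    for review, label in examples:
        lines.append(f"Review: {review}")  # identical delimiter for every example
        lines.append(f"Sentiment: {label}")
        lines.append("")
    lines.append(f"Review: {new_review}")  # the actual query, formatted the same way
    lines.append("Sentiment:")
    return "\n".join(lines)

examples = [
    ("This is the best purchase I have ever made!", "Positive"),
    ("The item broke after one day. Completely useless.", "Negative"),
    ("It does what it says, nothing more, nothing less.", "Neutral"),
]
print(build_few_shot_prompt(examples, "The product works as described and arrived on time."))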

Common pitfalls

A few mistakes can undermine few-shot prompting entirely. Too many examples waste tokens (the basic units of text that LLMs process, where one token roughly corresponds to four characters in English; both input and output tokens count toward API costs and context window limits) without improving accuracy. Biased example sets, such as providing only positive examples for a three-class task, skew the model’s predictions toward the overrepresented class. Inconsistent formatting between examples confuses the model about where one example ends and the next begins.

Attention: If all your few-shot examples share the same label, the model may default to that label regardless of the actual input. Always verify label balance before deploying a few-shot prompt.

The following code constructs a balanced few-shot prompt and a biased one for the same sentiment task, then compares their outputs.

Python
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY from environment

def call_llm(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # fast + cheap model
        messages=[
            {"role": "user", "content": prompt}
        ],
        temperature=0,  # deterministic for classification
    )
    return response.choices[0].message.content.strip()

new_review = "The product works as described and arrived on time."

# --- Balanced few-shot prompt: one example per sentiment class ---
balanced_prompt = (
    "Classify the sentiment of each review as Positive, Negative, or Neutral.\n"
    "Review: This is the best purchase I have ever made!\n"
    "Sentiment: Positive\n"
    "Review: The item broke after one day. Completely useless.\n"
    "Sentiment: Negative\n"
    "Review: It does what it says, nothing more, nothing less.\n"
    "Sentiment: Neutral\n"  # three distinct labels shown
    f"Review: {new_review}\n"
    "Sentiment:"
)
balanced_result = call_llm(balanced_prompt)
print("Balanced few-shot output:", balanced_result)

# --- Biased few-shot prompt: every example is Neutral ---
biased_prompt = (
    "Classify the sentiment of each review.\n"
    "Review: The product arrived as described and works as expected.\n"
    "Sentiment: Neutral\n"
    "Review: It does the job, nothing more and nothing less.\n"
    "Sentiment: Neutral\n"
    "Review: Average quality, neither good nor bad.\n"
    "Sentiment: Neutral\n"
    "Review: I have used it a few times; it seems fine so far.\n"
    "Sentiment: Neutral\n"
    f"Review: {new_review}\n"
    "Sentiment:"
)
biased_result = call_llm(biased_prompt)
print("Biased few-shot output:", biased_result)

The balanced prompt should classify the new review accurately, while the biased prompt is likely to return “Neutral” regardless of the review’s actual sentiment. This demonstrates that example selection is not a minor detail; it is a design decision that directly affects classification accuracy.

Instruction refinement workflow

Instruction refinement is the iterative process of rewriting a prompt’s instructions to improve output quality without adding examples. Think of it like editing a draft: each revision tightens the language, removes ambiguity, and adds constraints until the output meets your requirements.

Walking through a realistic scenario

Consider generating a JSON summary of a news article. A vague instruction like “Summarize this article” produces free-form text. The model has no reason to output JSON, include specific fields, or omit opinions. The result is unpredictable and unusable by downstream code that expects structured data.

Progressive constraint addition

Refinement proceeds in stages. Version 1 uses the vague instruction and gets free-form prose. Version 2 adds an output format specification: “Return your response as a JSON object with three fields: title, summary, keywords.” This alone forces the model into the correct structure. Version 3 adds field-level constraints and a negative instruction, a prompt directive that tells the model what not to do and reduces hallucination and unwanted content by explicitly closing off undesirable output paths. For instance: “The summary field must be under 30 words. The keywords field must be a list of exactly 5 strings. Do not include opinions or speculation.”

Each version narrows the space of acceptable outputs. By Version 3, the model produces clean, parseable JSON that a downstream service can consume directly.

Practical tip: Track token usage across refinement iterations. Adding constraints increases input tokens slightly but often reduces output tokens by eliminating verbose, off-target responses. The net effect is frequently a cost reduction.
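With the OpenAI Python client, token counts for each call are available on the usage object returned alongside the completion, so tracking them across refinement iterations takes only a few extra lines. The sketch below is a minimal example under that assumption; call_llm_with_usage is a hypothetical helper name introduced here.

Python
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY from environment

def call_llm_with_usage(prompt: str, model: str = "gpt-4o-mini") -> tuple[str, int, int]:
    """Return the response text plus prompt and completion token counts."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    usage = response.usage  # token accounting reported by the API
    return response.choices[0].message.content.strip(), usage.prompt_tokens, usage.completion_tokens

text, in_tokens, out_tokens = call_llm_with_usage("Summarize prompt engineering in one sentence.")
print(f"Input tokens: {in_tokens}, output tokens: {out_tokens}")
print(text)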

Instruction refinement complements both zero-shot and few-shot prompting. You can refine the instructions in a zero-shot prompt, or refine the instructions that precede your few-shot examples. It is not a separate category so much as a discipline applied on top of any prompt.

The code below runs all three versions against the same article excerpt so you can observe progressive improvement.

Python
from openai import OpenAI

client = OpenAI()  # uses OPENAI_API_KEY from environment

def call_llm(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # fast + cheap model
        messages=[
            {"role": "user", "content": prompt}
        ],
        temperature=0,  # deterministic output for comparison
    )
    return response.choices[0].message.content.strip()

# Short news article excerpt used as input for all prompt versions
article = (
    "Scientists at MIT have developed a new battery technology that could "
    "charge electric vehicles in under five minutes. The breakthrough uses "
    "a novel anode material that dramatically increases ion transfer speed, "
    "potentially revolutionizing the EV industry and reducing range anxiety."
)

# V1: Vague prompt — minimal instruction, unpredictable output format
prompt_v1 = f"Summarize this article: {article}"

# V2: Format specified — requests structured JSON output
prompt_v2 = (
    f"Summarize this article: {article}\n"
    "Return your response as a JSON object with fields: title, summary, keywords."
)

# V3: Fully refined — adds length, count, and tone constraints for reliable output
# Slightly longer input, but produces a more concise and structured response
prompt_v3 = (
    f"Summarize this article: {article}\n"
    "Return your response as a JSON object with fields: title, summary, keywords. "
    "The summary must be under 30 words. "
    "keywords must be a list of exactly 5 strings. "
    "Do not include opinions or speculation."
)

# Call the LLM for each prompt version and display labeled results
output_v1 = call_llm(prompt_v1)
print("V1 (vague):", output_v1)
output_v2 = call_llm(prompt_v2)
print("V2 (format specified):", output_v2)
output_v3 = call_llm(prompt_v3)  # V3 input is slightly longer but output is more concise and structured
print("V3 (fully refined):", output_v3)

The following flowchart visualizes the refinement loop as a repeatable process you can apply to any prompt.

Prompt engineering refinement loop for iterative LLM output optimization

This loop is not a one-time exercise. Production prompts evolve as edge cases surface, and the refinement cycle restarts whenever a new failure mode appears.
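The loop can also be sketched directly in code: generate, validate the output against your requirements, and tighten the instructions when validation fails. The example below is a simplified illustration that reuses the call_llm helper and article text from the refinement code above; appending constraints automatically is a stand-in for the manual rewriting you would normally do after inspecting each failure.

Python
import json

# Simplified refinement loop: validate the output, tighten the prompt when a check fails.
# Assumes `call_llm` and `article` are defined as in the refinement example above.
base_prompt = (
    f"Summarize this article: {article}\n"
    "Return your response as a JSON object with fields: title, summary, keywords."
)
extra_constraints = [
    "The summary must be under 30 words.",
    "keywords must be a list of exactly 5 strings.",
    "Do not include opinions or speculation.",
]

prompt = base_prompt
for attempt in range(len(extra_constraints) + 1):
    output = call_llm(prompt)
    try:
        parsed = json.loads(output)
        if isinstance(parsed, dict) and {"title", "summary", "keywords"} <= parsed.keys():
            print(f"Valid JSON after {attempt} refinement(s):", parsed)
            break
    except json.JSONDecodeError:
        pass  # not valid JSON yet; refine and retry
    if attempt < len(extra_constraints):
        prompt = prompt + " " + extra_constraints[attempt]  # add the next constraint
else:
    print("Output still failed validation after all refinements:", output)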

Conclusion

Prompt engineering is the most direct way to control LLM behavior without changing model weights. This lesson implemented zero-shot, few-shot, and instruction refinement techniques and demonstrated improvements in output consistency, accuracy, and structure. You also learned that prompt design is not static but iterative, evolving through testing and refinement. Together, these techniques form a practical foundation for building reliable, cost-efficient, and production-ready LLM applications.