
Generating Synthetic Data for Evaluation and Edge-Case Testing

Explore how to generate structured synthetic data to intentionally expose diverse behaviors and edge cases in LLM systems. Learn to define evaluation dimensions, manually create scenario tuples, and leverage large language models to convert these into realistic user inputs. This lesson helps you design meaningful synthetic traces that guide targeted testing and uncover failure points early in development.

Capturing traces from real users is ideal, but in the early stages, it is often insufficient and sometimes not viable. Many systems do not yet have enough usage, and even when they do, user behavior tends to cluster around a narrow set of common paths. As a result, important edge cases and failure modes may never appear naturally. Synthetic data enables you to intentionally guide the system through a broader range of behaviors, allowing for the collection of more diverse traces for evaluation and analysis.

The goal of synthetic data is not to create fake users, but to generate realistic inputs that exercise different execution paths within the system. When done well, synthetic inputs help you uncover failures earlier, before they affect real users. When done poorly, they produce generic traces that offer little insight. The difference comes down to structure.

Why does unstructured synthetic data fail?

A common mistake is prompting a large language model (LLM) to “generate test queries” or “give me example inputs.” This almost always yields repetitive, overly safe outputs that reflect the model’s defaults rather than real variation in user behavior. The traces look reasonable, but rarely expose new failures.

Another failure mode is generating synthetic data without a clear purpose. If you do not have hypotheses about where the system might fail, the generated data drifts toward the center of the distribution. You end up testing what already works instead of probing what might break.

Effective synthetic data starts with structure. A structured approach begins by defining dimensions, which are the axes along which user behavior varies in ways that affect system behavior. Each dimension captures one source of variation that could plausibly introduce failure.

What are examples of useful dimensions?

For example:

  • In a recipe application, dimensions might include dietary restriction, cuisine type, and query complexity.

  • In a customer support assistant, dimensions might include issue type, customer mood, and prior context.

Failure hypotheses should inform evaluation dimensions. If your hypotheses are unclear, use the product directly, or ask a small group to interact with it, until a clearer picture emerges. Early hypotheses do not need to be correct; they need to be concrete enough to guide initial exploration.

How many dimensions should you start with?

Avoid defining too many dimensions at once. Three to five dimensions are usually sufficient to start. Before involving an LLM, manually write around twenty concrete tuples by selecting one value from each dimension. For example:

  • Billing issue, frustrated, follow-up

  • Technical issue, neutral, new inquiry

This step is intentionally manual. It forces you to reason about the problem space and often reveals missing or poorly defined dimensions. At this stage, the tuples are purely structural: they describe what scenario you want to test, not how a user would phrase it.
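Before automating, it can help to encode the dimensions and hand-written tuples as plain data so they are easy to sanity-check. A minimal Python sketch; the dimension names and values here are illustrative, drawn from the customer support example, not prescribed by any particular tool:

```python
from itertools import product

# Illustrative dimensions for a customer support assistant.
# Adapt names and values to your own failure hypotheses.
DIMENSIONS = {
    "issue_type": ["billing issue", "technical issue", "account access"],
    "customer_mood": ["frustrated", "neutral", "confused"],
    "prior_context": ["new inquiry", "follow-up", "repeated escalation"],
}

# Hand-written tuples: one value per dimension. Writing these manually
# forces you to reason about the scenario space before automating.
manual_tuples = [
    ("billing issue", "frustrated", "follow-up"),
    ("technical issue", "neutral", "new inquiry"),
    ("account access", "confused", "repeated escalation"),
]

# Sanity-check that every tuple uses defined values, in dimension order.
for t in manual_tuples:
    for value, allowed in zip(t, DIMENSIONS.values()):
        assert value in allowed, f"{value!r} is not a defined dimension value"

# The full space is the Cartesian product; manual tuples are a small,
# deliberate sample of it.
full_space = list(product(*DIMENSIONS.values()))
print(len(full_space))  # 3 * 3 * 3 = 27 combinations
```

Keeping tuples as data rather than prose makes the later steps (generation, coverage auditing) mechanical.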

How do you turn tuples into realistic user inputs?

Once you are comfortable with the structure, the LLM’s job becomes narrow and well-defined: convert each tuple into a realistic user message without changing its meaning. You are not asking the model to invent scenarios, only to express a given scenario naturally.

What does a simple prompting pattern look like?

A simple prompt pattern looks like this:

You are generating realistic user messages for a customer support chatbot.
Each input is a structured tuple describing a user scenario in the form:
(Issue type, customer mood, prior context)
Convert the tuple below into a single realistic user message.
Do not introduce new information.
Do not soften or exaggerate the tone beyond what is specified.

Given the tuple (Billing issue, frustrated, follow-up), the model might produce:

“I’m following up on a billing issue I reported last week. I’m still seeing the incorrect charge on my account and this is becoming really frustrating.”

This output can be passed directly through your system to generate a full trace.
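Programmatically, the conversion step reduces to filling a template with one tuple. A sketch under stated assumptions: `build_prompt` is a hypothetical helper, and the template wording is modeled on the prompt pattern shown earlier in this lesson:

```python
# Template modeled on the prompt pattern in this lesson (hypothetical).
PROMPT_TEMPLATE = """You are generating realistic user messages for a customer support chatbot.
Each input is a structured tuple describing a user scenario in the form:
(Issue type, customer mood, prior context)
Convert the tuple below into a single realistic user message.
Do not introduce new information.
Do not soften or exaggerate the tone beyond what is specified.

Tuple: ({issue_type}, {mood}, {context})"""


def build_prompt(issue_type: str, mood: str, context: str) -> str:
    """Fill the template with one structured tuple."""
    return PROMPT_TEMPLATE.format(issue_type=issue_type, mood=mood, context=context)


prompt = build_prompt("Billing issue", "frustrated", "follow-up")
print(prompt)
```

The resulting string is what you would send to the model; the tuple itself stays attached as metadata so each generated trace remains traceable to the scenario it tests.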

You only need to write a handful of messages yourself to validate that your dimensions make sense. After that, the LLM generates the rest. If you find yourself manually rewriting many synthetic messages, it usually means the dimensions are underspecified or the prompt is too vague. Refining the structure is almost always more effective than manual editing.

How do you scale tuple-to-input generation reliably?

Once this works for a single tuple, you can scale the process by providing the LLM with a list of tuples, generating one message per tuple, validating a small sample manually, and then running all generated messages through your actual system. Because the structure is explicit, it is easy to audit coverage and reason about what you are testing.
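The scaling loop itself is small once the structure is explicit. A sketch, assuming a hypothetical `call_llm` wrapper around whatever client you actually use (stubbed here with a placeholder so the flow is runnable):

```python
import random


def call_llm(prompt: str) -> str:
    """Stand-in for your real LLM client call (hypothetical).
    Replace the body with an API request; here it returns a placeholder."""
    return f"<generated message for: {prompt.splitlines()[-1]}>"


def build_prompt(issue: str, mood: str, context: str) -> str:
    """Minimal illustrative prompt; use your full template in practice."""
    return (
        "Convert this scenario into one realistic user message.\n"
        f"Tuple: ({issue}, {mood}, {context})"
    )


def generate_dataset(tuples):
    """One generated message per tuple, keeping the tuple alongside the
    output so coverage stays auditable."""
    return [(t, call_llm(build_prompt(*t))) for t in tuples]


tuples = [
    ("billing issue", "frustrated", "follow-up"),
    ("technical issue", "neutral", "new inquiry"),
]
dataset = generate_dataset(tuples)

# Manually validate a small random sample before running everything
# through the real system.
for t, message in random.sample(dataset, k=min(2, len(dataset))):
    print(t, "->", message)
```

Because each output is stored next to its source tuple, spot-checking a sample and auditing the whole batch are both straightforward.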

By separating structure from phrasing, you avoid repetitive language while preserving intentional coverage of behaviors that matter to the system.
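Because every generated message carries its tuple, auditing coverage reduces to a set comparison against the Cartesian product of the dimensions. A small illustrative check (the dimension values are hypothetical):

```python
from collections import Counter
from itertools import product

# Hypothetical, deliberately small dimensions for illustration.
dimensions = {
    "issue_type": ["billing issue", "technical issue"],
    "customer_mood": ["frustrated", "neutral"],
    "prior_context": ["new inquiry", "follow-up"],
}

# Tuples you actually generated messages for.
generated_tuples = [
    ("billing issue", "frustrated", "follow-up"),
    ("technical issue", "neutral", "new inquiry"),
    ("billing issue", "neutral", "new inquiry"),
]

# Which combinations exist in the full space but were never generated?
full_space = set(product(*dimensions.values()))
covered = set(generated_tuples)
missing = sorted(full_space - covered)
print(f"covered {len(covered)} of {len(full_space)} combinations")

# Per-dimension value counts reveal skew toward particular values.
for i, name in enumerate(dimensions):
    print(name, Counter(t[i] for t in generated_tuples))
```

You rarely need exhaustive coverage of the full product; the point is that gaps and skew are visible and deliberate rather than accidental.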

Example

The interactive tool below walks through this process step by step.

You start by seeing why unstructured generation fails, then explore predefined dimensions for a customer support assistant. From there, you build tuples by selecting values from each dimension and watch as structured scenarios are converted into realistic user messages. Each tuple can produce multiple phrasings, demonstrating how structure preserves coverage while the LLM handles natural variation.

This is the same workflow you would use with your own system. Define dimensions based on failure hypotheses, create tuples manually until the structure feels right, then let the model generate the rest.

What’s next?

Now that you know how to generate diverse traces intentionally, the next step is to use those traces to learn from failures systematically. In the following lessons, we shift from data generation to analysis. You will take the traces you have collected, review them in a disciplined way, and turn observations into a clear failure taxonomy.

This is where traces stop being raw artifacts and start guiding concrete decisions about what to fix, what to test, and which evaluations to build next.