
The 4D Framework (Discover, Distill, Deploy, Deliver)

Explore the 4D framework behind building and operating reliable LLM applications. Understand how to define use cases, curate golden datasets, deploy scalable infrastructure, and monitor systems continuously. Gain insights on quality gates essential for moving through each phase and ensuring production readiness in large language model operations.

Consider this scenario. We built a chatbot that works most of the time.

We noticed a bug where the system misinterprets a specific question, so we tweaked the system prompt. Now it answers that question perfectly, but its accuracy on other questions drops. We tweak the prompt again, change the model temperature, and switch from OpenAI to Anthropic.

Six weeks later, we still have a demo. It works in the sense that it responds, but we can’t confidently ship it. We don’t know what standard of quality is expected or when the system is good enough, and every change creates new uncertainty.

The infinite loop of the eternal prototype syndrome

This endless cycle is called the eternal prototype syndrome.

In traditional software, we have clear stages: development, testing, staging, and production. In GenAI, the lines are blurry. Prompt engineering resembles coding, but it occurs in the midst of operations. Evaluation feels like testing, but it never really ends.

To ship reliable systems, we need a map.

We need a framework that clearly indicates when we have completed one phase and are allowed to proceed to the next. This methodology transforms the vague goal of building an AI agent into a series of engineered phases, each producing specific artifacts and gated by rigorous exit criteria.

In this lesson, we will explore the 4D framework in detail. We will see exactly what happens in each stage and, most importantly, the quality gates you must pass to move to the next.

Stage 1: Discover

Discover is the product definition phase.

Before designing prompts or selecting a model, we identify the real problem, confirm that an LLM is the right tool for the job, and verify that the required data is available and can be used safely. Many LLM projects fail because teams start with a preferred solution instead of a clearly defined user problem.

Many RAG projects fail for an even simpler reason: the data is messy, inaccessible, or restricted.

The discover phase

A practical way to think about discover is that we are mapping the full contract of the system. What questions should it answer, what questions must it refuse, what data is it allowed to see, and what does success look like in numbers?

How to execute:

  1. Use case qualification: We start by qualifying the use case. Instead of asking, “Can an LLM do this?”, we ask, “Should an LLM do this?” High-impact, low-risk use cases, such as internal search or support assistants, are usually good candidates. High-risk domains like medical advice, legal judgments, or automated HR decisions require stronger governance and often a narrower scope.

  2. Data inventory and access: We need a map of the data inventory. Here’s a simple flow that can help:

    1. Identify: Where does the data live? (SharePoint, SQL, PDFs).

    2. Ingest: How do we get it? Do we need a crawler for a wiki, or an ETL (Extract, Transform, Load) pipeline for a database?

    3. Access control: Does the data contain PII (Personally Identifiable Information)? If so, we need a redaction strategy before sending the data to an embedding model.

  3. Service level objective (SLO): This is also where we define the system’s targets. An SLO is a measurable reliability target for a system. For LLM apps, common SLOs include latency (p95 response time), cost (average cost per query), and quality (offline eval score). Without numerical targets, we can’t tell whether a change actually improves the system.
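
To make these targets concrete, here is a minimal sketch of how the SLO targets from this phase might be written down as a configuration object. The field names and example values are illustrative assumptions, not a standard schema:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLOTargets:
    """Reliability targets agreed on during discover (values are examples)."""
    p95_latency_ms: float = 2000.0        # 95% of requests must finish within this budget
    max_avg_cost_usd: float = 0.01        # average cost per query
    max_error_rate: float = 0.001         # share of requests ending in a 5xx error
    min_offline_eval_score: float = 0.85  # quality threshold on the golden dataset
```

Writing the targets down as code keeps them versioned alongside the application and makes the later quality gates easier to automate.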

The output of the discover phase is a product specification with acceptance criteria: what the system must do, what it must not do, and the measurable thresholds that determine success.

The quality gate: Data availability and governance

This gate answers one question: Are we allowed to build? We cannot proceed to the next stage until we have passed the data availability check. Here are some basic questions we must have clear answers for at this quality gate:

  • Do we have programmatic access to the trusted source of truth?

  • Do we have the legal right to send this data to a third-party model provider (e.g., OpenAI or Anthropic), or does data residency require us to host a local model (e.g., Llama 4) in our own VPC?

If we fail this gate, we stop and resolve the blocker. We cannot use prompts as a replacement for missing data.

We must pause engineering and switch to data engineering. If we lack access, we build the missing ETL pipelines and permission flows. If the data is messy (e.g., scanned PDFs), our project is now an OCR (Optical Character Recognition) project, not an LLM project, until that data is clean.

Stage 2: Distill

Distill is the experimentation and proof phase. Here, we turn the messy space of LLM possibilities into a specific, working configuration with measured quality. We iterate on prompts, choose models, and tune retrieval.

The distill phase

This is where we scientifically prove that our system is working. In traditional software development, we write unit tests to verify logic. In LLMOps, we use evaluation datasets to prove quality.

How to execute:

  1. Prompt engineering: We begin with prompt engineering and interface design. We decide what the system is allowed to output, how it should cite sources, and how strict it must be about grounding.

  2. RAG tuning: Then we tune retrieval if we are building a RAG system. Chunk sizing, splitting strategy, metadata filters, and embedding model choices are all part of the retrieval configuration.

  3. The artifact (golden dataset): This is the most important asset we will build. A golden dataset is a curated set of representative inputs and ideal outputs verified by a human expert. For example:

    1. Input question: What is the refund policy for digital goods?

    2. Golden output: Refunds for digital goods are allowed within 14 days if the product has not been downloaded.
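
In practice, a golden dataset is just a set of structured records that can be versioned and replayed. Here is a minimal sketch in Python; the field names are an assumed schema, not a required one:

```python
# A golden dataset entry pairs a representative input with a human-verified answer.
golden_dataset = [
    {
        "question": "What is the refund policy for digital goods?",
        "golden_answer": (
            "Refunds for digital goods are allowed within 14 days "
            "if the product has not been downloaded."
        ),
        "source_doc": "refund_policy.md",  # hypothetical source the reviewer checked
    },
    # ... more human-verified question/answer pairs
]
```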

When RAG and prompt engineering fail to yield results, fine-tuning can be a viable option. Advanced systems often combine both methods, using RAG to retrieve facts and a fine-tuned model to interpret those facts and format the response.

The quality gate: Offline evaluation score

Before a system can proceed to deployment, it must pass an offline evaluation on the golden dataset, meeting a minimum threshold on metrics such as answer relevance and faithfulness.
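
As a rough sketch of what this gate can look like in code: assume `answer(question)` calls our pipeline and `judge(response, golden_answer)` returns a score between 0 and 1 (both are hypothetical callables, e.g., an LLM-as-a-judge or a faithfulness metric):

```python
def passes_offline_eval(golden_dataset, answer, judge, threshold: float = 0.85) -> bool:
    """Run every golden example through the system and gate on the average score."""
    scores = []
    for example in golden_dataset:
        response = answer(example["question"])                     # call the pipeline
        scores.append(judge(response, example["golden_answer"]))   # score vs. golden answer
    return sum(scores) / len(scores) >= threshold
```

Only when this check passes the agreed threshold do we move on to deploy.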

We refer to this as an offline evaluation because it occurs in the development environment rather than in production. It is our safety net. If our system fails the golden dataset test here, it will definitely fail in the real world. If this quality gate fails, we analyze why.

Common failures for RAG-based systems at this stage include:

  • Retrieval error: Did the system fail to find the right document? Adjust chunk sizes or try a better embedding model.

  • Generation error: Did it find the document but fail to answer? Iterate on the system prompt or switch to a smarter (larger) model.

This will be the toughest quality gate to pass, as it sets the baseline for the user experience that the application will provide.

Stage 3: Deploy

Deploy is where we wrap the probabilistic model in deterministic infrastructure to ensure reliability, scalability, and security. A production LLM system requires more than just a model endpoint.

The deploy phase

How to execute:

  1. Containerization: Package the code and its dependencies into a standard unit, typically using Docker. Python environments are fragile. An LLM app can depend on specific versions of PyTorch, CUDA drivers, and tokenizer libraries. If we don’t containerize, code that works on our local machine will likely crash on the cloud server due to library mismatches. Containers make environments reproducible.

  2. Model routing: Implement logic to route simple queries to cheaper, faster models (e.g., GPT-5-nano) and complex reasoning tasks to powerful models (e.g., GPT-5). This can save on cost and potentially improve response latency for simpler tasks.

  3. Caching: Implement semantic caching. If a user asks: How do I reset my password? and another asks: Password reset steps, the system should recognize that they mean the same thing and return the cached answer instantly, costing $0.
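
Here is a minimal sketch of a semantic cache, assuming a hypothetical `embed(text)` function that returns an embedding vector; the 0.92 similarity threshold is illustrative:

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


class SemanticCache:
    """Cache keyed by meaning rather than exact text (minimal sketch)."""

    def __init__(self, embed, threshold: float = 0.92):
        self.embed = embed          # hypothetical call to an embedding model
        self.threshold = threshold  # similarity above which two queries count as the same
        self.entries: list[tuple[list[float], str]] = []

    def get(self, query: str) -> str | None:
        query_vec = self.embed(query)
        for cached_vec, cached_answer in self.entries:
            if cosine(query_vec, cached_vec) >= self.threshold:
                return cached_answer  # cache hit: no LLM call, no cost
        return None                   # cache miss: call the model, then store with put()

    def put(self, query: str, answer: str) -> None:
        self.entries.append((self.embed(query), answer))
```

With something like this in place, “How do I reset my password?” and “Password reset steps” map to nearby vectors and hit the same cached answer.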

The quality gate: SLO check

This gate answers: Can the system run under load while staying within our budgets? We cannot release to production until the system passes the staging traffic check. We validate the system against its SLOs. For example, we might define thresholds like these:

  • Latency budget: 95 percent of requests must complete in < 2000ms.

  • Cost budget: The average cost per query must be < $0.01.

  • Error rate: < 0.1% of requests result in a 5xx error.
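
A sketch of the gate itself, reusing the SLOTargets object sketched in the discover phase and the measurements collected during a staging load test (all names are illustrative):

```python
import statistics


def passes_slo_gate(latencies_ms: list[float], costs_usd: list[float],
                    error_count: int, targets: SLOTargets) -> bool:
    """Check staging-traffic measurements against the SLO budgets."""
    p95_ms = statistics.quantiles(latencies_ms, n=100)[94]  # 95th percentile latency
    avg_cost = sum(costs_usd) / len(costs_usd)               # average cost per query
    error_rate = error_count / len(latencies_ms)             # share of 5xx responses
    return (p95_ms <= targets.p95_latency_ms
            and avg_cost <= targets.max_avg_cost_usd
            and error_rate <= targets.max_error_rate)
```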

If this quality gate fails, we have an efficiency blocker. Depending on the issue, we can implement a suitable fix. For example:

  • If the cost exceeds the budget: Aggressively implement caching or switch to a smaller, quantized model.

  • If the average latency is too high: Parallelize your retrieval calls or switch to a faster inference provider (e.g., Groq or a dedicated vLLM server).

The actual numbers that you want to achieve on this quality gate will depend on the problem your application is solving. Expecting very low latency for a slow task, such as video generation, will require significant time, energy, and effort, with perhaps limited return.

Stage 4: Deliver

Deliver is the operating phase. Once our system is live and real users are using it, we enter an ongoing process of monitoring, gathering feedback, and ensuring it continues to provide value.

The deliver phase

LLM systems are inherently non-deterministic. Data changes, user behavior changes, and edge cases may have gone undetected by our golden dataset. Deliver is how we detect these issues early and convert them into improvements.

How to execute:

  1. Online monitoring: Implement LLM-as-a-judge to sample live traffic and score it for quality in near-real time.

  2. Feedback loops: Capture feedback. Even a simple thumbs up/down becomes high-value operational data when it is attached to the full trace: user input, retrieved context, model response, and prompt version. This information can also be sent to a Human-in-the-Loop (HITL) workflow, where the feedback can be better classified and organized.

  3. Drift detection: Alert if the topics users are asking about suddenly change (e.g., a sudden spike in questions about a competitor), which might require updating the golden dataset and returning to the distill phase.
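
As an example of the third point, here is a minimal drift check that compares the topic mix of recent queries against a baseline window. It assumes queries have already been labeled with topics (for instance by a lightweight classifier), and the 0.2 alert threshold is an arbitrary illustration:

```python
from collections import Counter


def topic_drift_alert(baseline_topics: list[str], recent_topics: list[str],
                      alert_threshold: float = 0.2) -> bool:
    """Alert when the recent topic distribution shifts away from the baseline.

    Drift is measured as the total variation distance between the two
    topic distributions (0 = identical, 1 = completely different).
    """
    base, recent = Counter(baseline_topics), Counter(recent_topics)
    base_total, recent_total = sum(base.values()), sum(recent.values())
    all_topics = set(base) | set(recent)
    drift = 0.5 * sum(
        abs(base[t] / base_total - recent[t] / recent_total) for t in all_topics
    )
    return drift >= alert_threshold
```

If the alert fires, say on a sudden spike of competitor questions, those queries become candidates for new golden dataset entries, and we loop back to distill.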

Operational observability requires tracking both system metrics, such as latency and error rates, and LLM-specific metrics, including token usage and estimated cost. This data can then be sent back into the distill phase, where it can help create new examples in the golden dataset.

This creates a closed-loop system where operational data directly drives model improvement.

The quality gate: Rollback trigger

In the deliver stage, the gate works in reverse.

Instead of passing to move forward, we implement a rollback if this gate fails. If key metrics drop below a safety threshold, such as a hallucination rate exceeding 5% or critical error rates (e.g., refusal to answer, toxic output) exceeding 1%, an automated rollback can be triggered, reverting the system to its previous stable version.
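
A minimal sketch of such a trigger, using the example thresholds above (real limits depend on the application’s risk profile):

```python
def should_rollback(hallucination_rate: float, critical_error_rate: float,
                    max_hallucination: float = 0.05,  # 5% hallucination ceiling
                    max_critical: float = 0.01        # 1% ceiling for refusals/toxic output
                    ) -> bool:
    """Return True when live quality metrics breach the safety thresholds."""
    return hallucination_rate > max_hallucination or critical_error_rate > max_critical
```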

Failing this quality gate often means that a safety incident has occurred.

The first course of action should be to roll back immediately and revert to the previous stable version of the prompt/model configuration. Then we should take the failing examples, add them to the evaluation suite, and return to the distill stage to engineer a fix.

The cycle never ends

The 4D framework is not linear. It is a loop with gates between stages.

  1. We discover to define the system.

  2. We distill to prove quality.

  3. We deploy to prove operational viability.

  4. We deliver to operate, observe, and improve.

We can sum up our discussion with this table:

| Phase Transition | Required Artifacts | Exit Criteria |
| --- | --- | --- |
| Discover → Distill | Product spec + SLO targets + data access plan | Data access and governance approved |
| Distill → Deploy | Golden dataset and evaluation report | Offline eval score > threshold (e.g., 85% accuracy) |
| Deploy → Deliver | Load test report and staging endpoint | Latency p95 < 2000 ms; cost < budget under load |
| Deliver → Discover | Usage dashboard and change log | Feedback incorporated into the next roadmap iteration |

Now we have our map.

The next step is to start building with the system. Most production LLM applications start with retrieval-augmented generation because it delivers usable results quickly. In the next lesson, we move from concepts to implementation and define a production-grade RAG system from end to end.