
MLOps to LLMOps: What Changes and What Stays

Explore the distinctions and continuities between MLOps and LLMOps to understand how large language model systems require new assets, workflows, and risks. Learn about semantic drift, prompt versioning, retrieval-augmented generation, and specialized monitoring vital for operating LLMs in production environments.

Imagine we are operating a mature MLOps pipeline for a credit scoring system.

The system predicts whether a loan applicant is likely to default based on structured features such as income, credit history, and debt-to-income ratio. The pipeline is in a stable state, with versioned features, models deployed through CI/CD, and monitoring tools that detect drift.

We use tools like Evidently, a popular open-source monitoring tool for detecting data drift and model performance degradation, to detect statistical changes in input data and model performance.

If metrics like average applicant income or prediction accuracy drop significantly, we can drill down to investigate. Now we apply the same operational logic to an LLM-powered chatbot. The system receives thousands of free-form text queries each day.

One user asks: "What is the PTO policy?" Another asks: "Can I take next Friday off?"

Statistically, these sentences look completely different (different lengths, different words), but semantically they are identical. Conversely, "The bank is running low on cash" and "The river bank is running low on water" share the statistically similar phrase "bank is running low on," yet they refer to completely different semantic domains (finance vs. geography) and mean entirely different things.

Traditional MLOps tools often fail to detect semantic drift because they rely on metrics that measure how data looks (statistics) rather than what it means (semantics). This becomes a problem when monitoring advanced models like LLMs.
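The PTO and river-bank examples above can be made concrete with a toy lexical-overlap metric. Real monitoring would compare embedding vectors from a model such as a sentence-transformer; this sketch only computes word-level Jaccard overlap to show how badly a surface-level statistic tracks meaning.

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard overlap: a purely statistical similarity measure."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

# Semantically identical pair: almost no shared words.
same_meaning = jaccard("what is the pto policy",
                       "can i take next friday off")

# Semantically unrelated pair: most words are shared.
diff_meaning = jaccard("the bank is running low on cash",
                       "the river bank is running low on water")

print(f"same meaning, lexical overlap:      {same_meaning:.2f}")
print(f"different meaning, lexical overlap: {diff_meaning:.2f}")
```

The statistically "similar" pair is the one whose meanings diverge, which is exactly the failure mode that embedding-based semantic monitoring is meant to catch.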

It would be incorrect to assume that LLMOps is just MLOps with bigger models. It is MLOps adapted to a system where semantics, stochasticity, and unstructured data define the runtime behavior. In this lesson, we map the traditional MLOps stack to LLM-powered systems.

We identify which practices remain, which components must be replaced, and which entirely new workflows and risks emerge.

The foundation: What stays

Before analyzing the new components, we must first examine the foundational principles. LLMOps is still a subset of software engineering. The bottom layers of the stack remain largely unchanged.

Here are some core software engineering fundamentals that remain largely the same:

  1. Containerization (Docker and Kubernetes): We can think of LLM applications as just heavy microservices. LLM applications are still services that must be packaged, deployed, scaled, restarted, and health-checked. Docker and orchestration platforms like Kubernetes remain the standard for managing runtime environments.

  2. CI/CD (Git and Actions): The Continuous Integration and Continuous Deployment (CI/CD) pipeline remains the central nervous system of our operation. We still use Git for version control and CI runners to run tests. While the content of the tests changes from unit tests to evaluation sets, the mechanism stays the same.

  3. Infrastructure as Code (Terraform/CloudFormation): Just like in traditional MLOps, we still need to provision infrastructure. Provisioning a GPU instance, a vector database, or a web service still benefits from declarative definitions using Terraform or CloudFormation. LLM infrastructure is more expensive and stateful, which makes reproducibility and controlled rollout even more important.

These foundations give us reliability, repeatability, and scale. LLMOps builds on them rather than replacing them.

The shift: The component swap

The first major change is an expansion of what we manage.

LLM systems introduce a new set of digital assets that are as important as our source code. These assets directly control system behavior. If they are not versioned, evaluated, and governed, system behavior becomes impossible to reproduce. Let’s explore these components.

The data layer

In traditional MLOps, the data layer is centered around a feature store.

Features are structured, numeric, and explicitly defined: age, transaction count, and credit utilization. Queries rely on exact matches and joins. In LLMOps, the primary inputs are unstructured text chunks. Instead of retrieving rows by equality, we retrieve documents by semantic similarity.

This necessitates a swap from feature stores to vector databases such as pgvector, Pinecone, or Milvus.

Vector databases employ techniques such as Approximate Nearest Neighbor (ANN) search to identify documents that are conceptually similar to the user’s query. ANN structures allow us to search millions of embeddings in milliseconds while accepting a small approximation error.
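To make the retrieval-by-similarity idea concrete, here is a minimal sketch of similarity search over embeddings. A real vector database (pgvector, Pinecone, Milvus) would use ANN indexes rather than the exact linear scan below, and the three-dimensional "embeddings" are made-up toy values, not real model output.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy "document embeddings" keyed by document id.
index = {
    "pto_policy":   [0.9, 0.1, 0.0],
    "expense_faq":  [0.1, 0.9, 0.1],
    "security_doc": [0.0, 0.2, 0.9],
}

def search(query_vec, k=2):
    """Exact nearest-neighbor scan: O(n) in the number of documents.

    ANN indexes trade a small amount of accuracy for sublinear search
    over millions of vectors.
    """
    scored = sorted(index.items(),
                    key=lambda kv: cosine(query_vec, kv[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

# A query vector "about" PTO ranks the PTO document first.
print(search([0.8, 0.2, 0.1]))
```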

The tuning layer

In classical MLOps, optimization happens through hyperparameter tuning. Engineers search over learning rates, tree depths, or regularization coefficients to improve predictive accuracy.

In LLMOps, we rarely retrain the base model because pre-training costs are often prohibitively expensive. Instead, we tune the context. System behavior is shaped through:

  • System prompts that define instructions and constraints.

  • Retrieval strategies that control what knowledge the model sees.

  • Chunking, ranking, and filtering decisions in the RAG pipeline.

Fine-tuning base models can still be valuable for domain adaptation or style consistency. While much more cost- and time-efficient than retraining a model from scratch, it is slower, less flexible, and harder to iterate on than prompt and retrieval tuning. For most production systems, prompts and RAG configuration are the primary levers of control.
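The idea of tuning the context instead of the model can be sketched in a few lines. Everything below (the system prompt wording, the `max_chunks` knob) is an illustrative assumption, but it shows where the levers live: in text and retrieval settings, not in model weights.

```python
# System behavior is shaped by editing this instruction text, not by retraining.
SYSTEM_PROMPT = (
    "You are an HR assistant. Answer only from the provided context. "
    "If the context does not contain the answer, say so."
)

def build_prompt(question: str, retrieved_chunks: list[str],
                 max_chunks: int = 3) -> str:
    """Assemble the final prompt from instructions, context, and query.

    max_chunks is one of the retrieval knobs we tune in LLMOps, in place
    of classical hyperparameters like learning rate or tree depth.
    """
    context = "\n---\n".join(retrieved_chunks[:max_chunks])
    return f"{SYSTEM_PROMPT}\n\nContext:\n{context}\n\nQuestion: {question}"

prompt = build_prompt(
    "What is the PTO policy?",
    ["Employees receive 15 days of PTO per year.", "PTO accrues monthly."],
)
print(prompt)
```

Changing any of these inputs changes production behavior, which is why each of them must be versioned like code.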

The artifact layer

In traditional MLOps, the output is a binary file containing model weights. These files are heavy, opaque, and strictly versioned.

In LLMOps, behavior emerges from multiple artifacts:

  • Prompts that define instructions and tone.

  • Embeddings that encode knowledge.

  • Knowledge bases that evolve independently of code.

A prompt is effectively a function definition written in English. If we change "Summarize this text" to "Summarize this text concisely," we have altered the system’s behavior. This makes a prompt registry essential: in effect, a Git for prompts.

It is a centralized system that versions, tracks, and manages these text definitions, allowing us to test and roll back changes just as we would with compiled code.

The evaluation problem

The biggest friction point between MLOps and LLMOps is evaluation. In traditional ML, ground truth is usually absolute. A transaction can either be fraudulent or not. Metrics like accuracy and F1-score are easy to understand and reason about.

In LLMOps, ground truth is subjective.

  • Output: The policy allows 15 days of PTO.

  • Reference: Employees receive 15 days off per year.

A string comparison would flag these two sentences as different, but a human reviewer would judge them equivalent because they convey the same meaning.

As a result, evaluation needs to move beyond deterministic metrics and account for semantic meaning. Approaches such as LLM-as-a-judge use model-based scoring to evaluate dimensions like faithfulness, relevance, and policy alignment.

Although these evaluators are probabilistic, their outputs tend to better reflect human quality judgments. We will explore these evaluation methods in more detail in a later lesson.
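The LLM-as-a-judge pattern can be sketched as follows. The judge call is stubbed out (`fake_judge_call` uses a trivial keyword heuristic) because a real implementation would send the rubric to an actual model API; the shape of the rubric and the YES/NO parsing are the point of the sketch.

```python
# Hypothetical judge rubric: a real system would tune this wording carefully.
JUDGE_TEMPLATE = """You are an impartial evaluator.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Does the candidate convey the same meaning as the reference?
Reply with only YES or NO."""

def fake_judge_call(prompt: str) -> str:
    """Stand-in for a model API call (illustrative heuristic only)."""
    return "YES" if "15 days" in prompt else "NO"

def semantically_equivalent(question: str, reference: str,
                            candidate: str) -> bool:
    prompt = JUDGE_TEMPLATE.format(question=question, reference=reference,
                                   candidate=candidate)
    return fake_judge_call(prompt).strip() == "YES"

# The PTO pair from above: a string comparison says "different",
# a semantic judge says "equivalent".
verdict = semantically_equivalent(
    "What is the PTO policy?",
    "Employees receive 15 days off per year.",
    "The policy allows 15 days of PTO.",
)
print(verdict)
```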

New workflows to operationalize

The next change is in how we work with these systems. The unique architecture of LLM applications requires entirely new operational workflows that don’t exist in classical MLOps.

  • Retrieval-augmented generation (RAG): In MLOps, a prediction is usually an atomic operation. In LLMOps, it is a distributed system transaction. RAG is the most common workflow in modern LLM applications. It’s a multi-stage pipeline: first, retrieve relevant documents from a knowledge base; second, augment the user’s prompt with this information; and third, generate an answer based on that context. This is a small, distributed system that requires its own specialized monitoring for retrieval quality, chunking strategies, and data freshness.

  • Guardrail integration: In MLOps, a model might produce incorrect predictions, but LLMs can produce unsafe, toxic, or confidential outputs. To operate LLMs responsibly, we need to enforce policies that govern their use. This involves a workflow where inputs and outputs are passed through a guardrail. A guardrail is a set of programmable checks. For example, a guardrail might scan for personally identifiable information (PII) and redact it, check for toxic language, or ensure the model’s response doesn’t stray into forbidden topics.

  • Feedback loops: In MLOps, feedback is often implicit, such as a user clicking an ad. In LLMOps, feedback is explicit. Users might click a thumbs-down button or write: This answer is wrong. We need a dedicated database to store this feedback, allowing us to continually improve our prompts over time.
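Two of these workflows can be tied together in a single sketch: a toy RAG pipeline (retrieve, augment, generate) whose output passes through a PII guardrail. The "knowledge base", the stubbed `generate()` call, and the email-redaction regex are all illustrative assumptions, not a real system.

```python
import re

# Toy knowledge base; real systems would retrieve from a vector database.
KNOWLEDGE_BASE = {
    "pto_policy": "Employees receive 15 days of PTO per year.",
    "it_support": "Email help@example.com for IT support.",
}

def retrieve(query: str) -> list[str]:
    """Step 1 (retrieve): keyword overlap stands in for vector search."""
    words = set(query.lower().split())
    return [doc for doc in KNOWLEDGE_BASE.values()
            if words & set(doc.lower().split())]

def generate(prompt: str) -> str:
    """Stand-in for a real model API call."""
    return f"Based on policy: {prompt.splitlines()[-1]}"

EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")

def redact_pii(text: str) -> str:
    """Guardrail: redact email addresses before output reaches the user."""
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)

def answer(query: str) -> str:
    context = "\n".join(retrieve(query))       # 1. retrieve
    prompt = f"Question: {query}\n{context}"   # 2. augment
    return redact_pii(generate(prompt))        # 3. generate + guardrail

result = answer("How do I contact IT support?")
print(result)  # → "Based on policy: Email [REDACTED_EMAIL] for IT support."
```

Each stage is a separately monitorable component: retrieval quality, prompt assembly, and guardrail hit rates all get their own metrics in production.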

New risks to control

Finally, the stochastic nature of these models introduces risk factors that deterministic software does not face. Traditional application security measures are often unable to detect these new attack vectors.

Prompt injection attacks use natural language to manipulate system behavior. Traditional firewalls cannot detect them because the input is linguistically valid. Defending against them requires techniques such as strict prompt templating, input sanitization, and output validation.

Cost volatility becomes a production concern. LLMs are billed by tokens. A single long input or runaway loop can multiply costs unexpectedly. Token budgets, request limits, and monitoring can act as safeguards against these risks.
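A per-request token budget can be sketched as a small guard in front of the model call. The token estimate and limits below are made-up; a real system would use the provider's tokenizer (e.g. tiktoken for OpenAI models) and actual billing rates.

```python
class TokenBudgetExceeded(Exception):
    """Raised when a request would exceed a configured token budget."""

class BudgetGuard:
    def __init__(self, max_tokens_per_request: int, max_daily_tokens: int):
        self.max_tokens_per_request = max_tokens_per_request
        self.max_daily_tokens = max_daily_tokens
        self.used_today = 0

    def estimate_tokens(self, text: str) -> int:
        # Crude heuristic: roughly 4 characters per token for English text.
        return max(1, len(text) // 4)

    def check(self, prompt: str) -> int:
        """Admit the request or raise before any paid API call is made."""
        tokens = self.estimate_tokens(prompt)
        if tokens > self.max_tokens_per_request:
            raise TokenBudgetExceeded("single request over budget")
        if self.used_today + tokens > self.max_daily_tokens:
            raise TokenBudgetExceeded("daily budget exhausted")
        self.used_today += tokens
        return tokens

guard = BudgetGuard(max_tokens_per_request=1000, max_daily_tokens=5000)
print(guard.check("What is the PTO policy?"))  # small request passes
```

Rejecting a request before the API call is made is what turns a cost overrun from a billing surprise into a handled error.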

Conclusion

We now have a clear mental model. LLMOps is not a replacement for MLOps; it is a superset. We start with the solid foundation of the MLOps engineering discipline. On top of that, we add three new layers of specialization:

  1. New artifacts: Managing prompts, embeddings, and model configs.

  2. New workflows: Operationalizing RAG, LLM-as-a-judge, and guardrails.

  3. New risks: Defending against prompt injection and governing runaway costs.

With this mental model in place, we can now map these concepts onto a structured, end-to-end life cycle. How do we take a business idea and systematically move it through the stages of discovery, development, deployment, and delivery using this new, integrated LLMOps toolkit?

That is the subject of our next lesson, where we will dive deep into the 4D LLMOps life cycle.