Large language models are powerful, but building reliable apps on top of them can feel like coding with duct tape: ingenious and flexible, yet prone to breaking under pressure because of unpredictable behavior and fragile integrations. Prompts turn into hidden business logic, workflows collapse when the model shifts behavior, and scaling a project means maintaining a tangle of brittle strings.
If you’ve ever built something with a large language model, you’ve probably felt the pain of prompt engineering. You spend hours nudging phrasing, “summarize in bullet points” vs. “make a list,” only to have the model suddenly ignore you after an update. One tiny wording change, or even switching to a different model, and the whole thing collapses like a Jenga tower.
This approach:
Can work, but is fragile and time-consuming.
Breaks easily: a minor wording change or a new model version can disrupt a carefully tuned prompt.
Leaves us with brittle strings that are hard to maintain.
Frameworks such as LangChain and LlamaIndex try to make things easier by helping us organize multiple model calls or connect to external data. That’s useful, but they don’t remove the core problem. This is because we’re still hand-crafting prompts as if we’re writing assembly code, which is low-level, finicky, and resistant to change.
DSPy’s big idea is simple: stop treating prompts like shortcuts and start treating them like code.
Instead of writing long prompt strings, we:
Write Python code that declares what each part of the pipeline should do.
Let DSPy translate those declarations into effective prompts behind the scenes.
Think of it as moving from assembly language to a high-level language. You could handwrite every instruction, but why sign up for that extra work when a compiler can do it for you? In DSPy, the “compiler” is the optimizer that translates your Python signatures and modules into effective prompts for whichever model you are using.
This shift has two major benefits for developers.
Maintainability: Your logic lives in Python, not in textual guesswork. If you swap models, change the output format, or add evaluation data, you simply recompile, and DSPy adapts the prompts automatically.
Self-improvement: DSPy does not freeze your first attempt. It experiments: built-in algorithms generate variations, test them against your examples and metrics, and keep only the winners. No more endless trial-and-error tweaking by hand.
The next step is to raise the abstraction level. Instead of micromanaging fragile strings, we should be able to describe our intent in clean Python code and let the system handle the messy prompt details instead.
The result: You write clean, modular Python, and DSPy turns that into a prompt pipeline that evolves as your application or your models change.
DSPy (Declarative Self-improving Python) is a Python framework for building applications with large language models (LLMs). Instead of hand-crafting fragile prompts, DSPy lets developers program LLM behavior directly in Python. In other words, DSPy turns prompt engineering into declarative, modular Python code.
The result is a more maintainable and self-optimizing workflow for AI development. This blog explains what DSPy is, why it matters, how it compares to alternatives such as LangChain and LlamaIndex, and how to use DSPy with practical code examples.
If you want to explore this further, check out our full DSPy guide.
When the creators say “programming, not prompting, LLMs,” they mean we express tasks in structured code rather than raw text.
In DSPy, we define behavior as modules. Each module has a signature that describes its inputs and outputs, similar to a function signature. For example:
A signature like question -> answer: float indicates the module takes a question and should output a number.
We can also define signatures as small Python classes with InputField and OutputField.
This is like specifying the format we expect, but in code, rather than prose.
DSPy expands modules into prompts and parses the model’s outputs back into structured data types.
If a signature says the output is a float, DSPy parses text into a Python float.
If the output is a list or a custom class, DSPy enforces that structure, often by instructing the model to produce JSON and then parsing it.
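For instance, assuming dspy is imported and a model is already configured, a minimal sketch of the arrow-notation form looks like this (the variable names are illustrative):

# Inline signature: the module takes a question and must return a float.
qa = dspy.Predict("question -> answer: float")
pred = qa(question="What is 2 + 2?")
print(type(pred.answer))  # <class 'float'>, parsed from the model's raw text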
A few main components make DSPy work. Let’s look first at the signatures and modules we discussed above.
A signature is a blueprint for a task’s inputs and outputs. You can define one with arrow notation as we did above, or as a small class:
import dspy

class ClassifySentiment(dspy.Signature):
    """Determine sentiment of a sentence."""
    sentence: str = dspy.InputField()
    sentiment: str = dspy.OutputField(desc="positive, negative, or neutral")
The docstring provides task guidance and desc clarifies the expected form of the output.
A module implements the signature with an LLM strategy.
For a simple transformation, such as sentiment analysis or summarization, use dspy.Predict. 
For tasks that benefit from intermediate reasoning, use dspy.ChainOfThought. 
When tool use is required (such as math or web search), use dspy.ReAct.
Here is an example of using dspy.Predict:
classify_sentiment = dspy.Predict(ClassifySentiment)
result = classify_sentiment(sentence="I love this book; it was thrilling!")
print(result.sentiment)  # "positive"
DSPy constructs the prompt and parses the result into result.sentiment so we do not write any prompt or parsing code.
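If you are curious about the exact prompt DSPy generated, recent versions include an inspection helper that prints the latest LM calls (shown here on the assumption that your installed version provides it):

# Print the most recent LM call, including the constructed prompt and raw completion.
dspy.inspect_history(n=1)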
We can connect modules to create pipelines that solve multi-step tasks. For retrieval-augmented question answering, we can use one module to fetch relevant text, then pass that context and the question into a generator module. Because modules share explicit signatures, composition is robust to change.
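As a sketch of what such a pipeline might look like, assuming retrieve_passages is a placeholder for whatever retriever you actually use (a vector store client, a search API, or a DSPy retrieval module):

def retrieve_passages(query: str) -> list[str]:
    # Placeholder retriever: swap in your own search backend here.
    return ["<relevant passage 1>", "<relevant passage 2>"]

class SimpleRAG(dspy.Module):
    def __init__(self):
        super().__init__()
        # Generator module: answers a question given retrieved context.
        self.generate = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question: str):
        context = retrieve_passages(question)
        return self.generate(context=context, question=question)

rag = SimpleRAG()
print(rag(question="What does DSPy stand for?").answer)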
Additionally, DSPy comes with built-in optimizers. Optimizers automatically improve prompts or model weights for your task. When we compile a DSPy program, we choose an optimizer and provide training examples (or allow bootstrapping) along with a metric. Optimizers may adjust the features mentioned below.
Instructions: The textual guidance a module uses.
Demonstrations: Example input-output pairs for in-context learning.
Model weights: For supported local models via fine-tuning.
Think of optimizers as defining how a program learns.
Here are some of the most used optimizers available in DSPy.
BootstrapFewShot or BootstrapFewShotWithRandomSearch: We use these when we have about 10 to 100 examples and want solid few-shot prompts fast. They generate and select demonstrations per module, and pick the best program based on our metric. This is a great day-one lift for RAG, extraction, or classification.
MIPROv2: This is used when we have more data or need stronger instruction tuning across a pipeline. It proposes instructions and demos and uses Bayesian optimization to search for combinations that improve our metric. It is a good default for multi-module pipelines (RAG, agents).
GEPA (reflective prompt evolution): This is used when we want the model to reflect on failures and propose fixes. It is often helpful for math, long reasoning, and enterprise information extraction (IE).
Ensemble or BetterTogether: This is used when we want robustness or to trade more inference compute for quality. For example, safety-critical answers or high-variance tasks are ideal for this.
BootstrapFinetune: We use it when we want to distill a prompt-optimized pipeline into a small or local model for cheaper, faster inference. It is typical for classification or lightweight RAG when we control the model weights.
Behind the scenes, optimizers combine LLM-driven search with standard optimization techniques. The compiled result is an updated module that uses the best prompt and examples found.
DSPy can optimize prompts for any provider that you configure. Weight optimization (fine-tuning via BootstrapFinetune), however, targets models where you control or have access to the weights or training endpoints, typically local or smaller open-source models (e.g., Llama-3.2-1B run locally). If you only have a closed API without training access, stick to prompt-level optimizers (e.g., BootstrapFewShot, MIPROv2).
Remember: DSPy does not know what “better” means for your task until you define it. Use an exact match for single-answer tasks, or a semantic metric such as Semantic F1 when meaning overlap matters. You can also define custom metrics as these guide how the optimizer evaluates and improves performance.
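As an illustration, an exact-match metric for the ClassifySentiment module from earlier could look like this; the (example, pred, trace) argument order follows the convention DSPy optimizers expect:

def exact_match(example, pred, trace=None):
    # Compare the gold label with the predicted field; return a bool (or a numeric score).
    return example.sentiment.lower() == pred.sentiment.lower()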
Compiling runs the optimizer’s improvement loop. DSPy generates candidate prompts, tests them on our data, and keeps what improves our metric. If we add data or change output format later, we just need to recompile to adapt.
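Here is a minimal sketch of that loop, reusing classify_sentiment and the exact_match metric above; the handful of labeled sentences is purely illustrative:

# A tiny labeled set; with_inputs marks which fields are inputs.
trainset = [
    dspy.Example(sentence="I love this book; it was thrilling!", sentiment="positive").with_inputs("sentence"),
    dspy.Example(sentence="The plot dragged and the ending fell flat.", sentiment="negative").with_inputs("sentence"),
    dspy.Example(sentence="The book arrived on Tuesday.", sentiment="neutral").with_inputs("sentence"),
]

optimizer = dspy.BootstrapFewShot(metric=exact_match)
compiled_classifier = optimizer.compile(classify_sentiment, trainset=trainset)

# The compiled module is a drop-in replacement with optimized instructions and demos.
print(compiled_classifier(sentence="What a waste of an afternoon.").sentiment)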
Let’s look at a few tasks that are common in AI engineering.
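The examples below assume a language model has been configured once, roughly as follows; the provider, model name, and API key handling are placeholders to adapt to your setup:

import dspy

# Configure a default LM for all modules (provider and model are illustrative).
lm = dspy.LM("openai/gpt-4o-mini", api_key="YOUR_API_KEY")
dspy.configure(lm=lm)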
The first example is chain-of-thought reasoning: we prompt the model to reason step by step and use it to solve a math word problem.
import dspy

# Assume an LLM is configured, for example: dspy.configure(lm=my_model)
math_solver = dspy.ChainOfThought("question -> answer: float")
result = math_solver(question="Two dice are tossed. What is the probability that the sum equals two?")
print(result.answer)  # 0.0277778 (that is, 1/36)
The ChainOfThought module instructs the model to reason step-by-step and returns a structured object such as:
Prediction(
    reasoning="When two dice are tossed... exactly one outcome sums to two...",
    answer=0.0277778
)
That’s it! We can also extract structured fields without manual JSON prompting or parsing:
class ExtractInfo(dspy.Signature):
    """Extract structured information from text."""
    text: str = dspy.InputField()
    title: str = dspy.OutputField()
    headings: list[str] = dspy.OutputField()
    entities: list[dict[str, str]] = dspy.OutputField(desc="a list of entities and their metadata")

extract_module = dspy.Predict(ExtractInfo)
response = extract_module(text="Apple Inc. announced its latest iPhone 14 today. The CEO, Tim Cook, highlighted its new features in a press release.")

print(response.title)
print(response.headings)
print(response.entities)
This yields a Python object with the requested fields, without manual JSON prompting or parsing.
We can also use DSPy to create agentic workflows:
def search_wikipedia(query: str) -> list[str]:
    ...

def evaluate_math(expr: str) -> float:
    ...

react_agent = dspy.ReAct("question -> answer: float", tools=[evaluate_math, search_wikipedia])
prediction = react_agent(question="What is 9,362,158 divided by the year of birth of David Gregory of Kinnairdy Castle?")
print(prediction.answer)  # A numeric result
The ReAct loop lets the model decide when to search or calculate, calls our tools, and returns an answer.
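The tools themselves are ordinary Python functions. One hedged way to fill in the stubs above, purely for illustration (eval is unsafe on untrusted input, and the search function is a stand-in for a real backend):

def evaluate_math(expr: str) -> float:
    # Illustration only: evaluate a plain arithmetic expression.
    # Never call eval on untrusted input in production code.
    return float(eval(expr, {"__builtins__": {}}, {}))

def search_wikipedia(query: str) -> list[str]:
    # Stub: replace with a real search call (an API client or a local index).
    return [f"<top passages for: {query}>"]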
DSPy is especially effective when building multi-step systems (retrieval-augmented generation (RAG), agents, tool use), when structured outputs are required, or when iteration is expected, such as swapping models, adjusting metrics, or refining steps, all without constantly managing fragile prompts. It may be unnecessary when:
We’re doing a one-off call or a trivial transformation that we’ll never maintain. Plain API calls are fine.
We can’t define a metric (even a rough one). The “self-improving” part needs a dev set and a scoring function.
We have hard latency or cost constraints and no room for an optimization run (compiling explores prompt variants, that is, more LM calls). DSPy lets us control the budget and the number of trials, but there’s still overhead.
We must keep prompts frozen for compliance or exact string matching. DSPy wants freedom to rewrite prompts unless we constrain it.
We need fine-tuning, but we’re limited to closed APIs that don’t provide access to training endpoints. See the scope note below.
Here is a quick map of how DSPy compares to popular frameworks and the mindset behind it. These contrasts are high level; in practice, teams often mix and match tools.
LangChain vs. DSPy: LangChain is a flexible toolbox for chaining LLM calls, integrating data sources, and building agents. Prompt design and optimization remain largely manual. DSPy emphasizes declarative prompt design and automated optimization. The two can be complementary: use LangChain to wire external services and DSPy modules to implement complex, self-tuning prompt logic.
LlamaIndex vs. DSPy: LlamaIndex focuses on retrieval-augmented generation and connects LLMs to documents. DSPy can incorporate retrieval as well, including retrievers such as ColBERTv2, and then optimize how question answering is structured for a chosen metric and model.
Hard-coded prompts become a bottleneck as systems grow. DSPy treats prompts like code: abstract them, version them, and compile them. The conceptual shift mirrors database development, where we declare what we want and a planner figures out the “how” behind it. As DSPy programs are Python modules with clear signatures, they are easier to read, review, and extend than a tangle of prompt strings. A new team member can understand a ClassifySentiment signature at a glance.
In case studies on math (GSM8K) and multi-hop question answering (HotpotQA), a few lines of DSPy, after several minutes of compilation, outperformed standard few-shot prompting. It did this by more than 25% on GPT-3.5 and about 65% on Llama-2-13B, outperforming expert-written prompt pipelines by 5% to 46% (GPT-3.5) and 16% to 40% (Llama-2-13B).
DSPy is especially helpful when we:
Build multi-step tasks or require structured outputs, and want to avoid writing custom prompts for each step.
Have, or can create, evaluation data or metrics so that prompts can improve automatically.
Value maintainability and iteration speed, including the ability to swap models or strategies and recompile.
Want to reduce prompt fragility when moving between models.
If the use case is extremely simple (a one-off API query with no structure or optimization), a direct call may suffice. However, as soon as we find ourselves writing complex logic around prompts or maintaining a system long term, DSPy is worth considering.
DSPy raises the abstraction of working with LLMs so that we can write code, not essays, to direct the model. With signatures, modules, optimizers, and compilation, it turns prompt engineering into a standard software task. Beginners can accomplish real tasks with a few lines of code, and experts can let algorithms tune prompts and, where supported, model weights. In short, DSPy’s declarative design and self-improving capabilities make solutions more robust and easier to optimize, bridging the flexibility of prompting and the rigor of programming.