I have something shameful to confess—I have been holding hate in my heart. It pains me to say this, because the folks behind it are brilliant, kind, and thoughtful people, and deserve nothing but love and appreciation. Nevertheless, I must share my authentic truth: I hate DSPy and GEPA to the core of my being. I tried to like them. I really did. I went through the 45-minute Colab notebook. I even made an effort to add a flavor of GEPA to lm-deluge, my open-source LLM SDK, before giving up in a fit of rage (sorry, Claude). My conclusion? Trying to treat LLM workflows as modular programs is (often) a mistake. It’s backwards, rigid, and the wrong fit for the most interesting tasks.
In this post (polemic? rant?) I try to unpack the "why" behind the hate, and consider whether it’s possible to salvage the good parts of GEPA.
What Is DSPy?
If you’re new around here, you might have no idea what I’m talking about. That’s OK. DSPy is a toolkit created in 2023 by Omar Khattab (@lateinteraction on Twitter), who is also known for inventing ColBERT and popularizing the general concept of "late interaction" in information retrieval. (See, like I said, smart guy.) The idea is to make AI programs more reliable by (a) breaking them down into modules, and (b) decoupling the function of those modules from their specific implementation.
For instance, suppose you want to use an LLM to summarize a long story into a single paragraph. Probably the first thing you’d do is make a prompt like "Summarize this long story into a single paragraph" and send that to an AI model like Claude. You might then spend a bunch of time improving that prompt: reminding Claude that a paragraph should have no less than 4 but no more than 8 sentences, noting the aspects of a story that should generally be captured in a summary, and telling it for the love of God to not use emojis.
Then, the next week, your budget gets slashed by 90% and you can no longer afford Claude. Or a better model comes out, and you decide you’d rather use that one. Suddenly, all that work you did was basically useless. Other models don’t fail in the same ways that Claude did. Maybe Gemini works better with few-shot examples than detailed instructions. Maybe Grok doesn’t have the emoji problem. Maybe you want to fine-tune now instead of prompting. You have to start all over.
DSPy says: This seems bad. How about you define what you want your AI "program" to do, and we’ll take care of the optimization for you? You write PyTorch-like code composed of different "modules", and DSPy can test out a bunch of different few-shot examples, do prompt optimization, and even fine-tune on a small dataset of examples, so that you don’t have to worry about how to get your AI model to do what you want. For example (from the DSPy docs):
import dspy

def search_wikipedia(query: str) -> list[str]:
    results = dspy.ColBERTv2(url="http://20.102.90.50:2017/wiki17_abstracts")(query, k=3)
    return [x["text"] for x in results]

rag = dspy.ChainOfThought("context, question -> response")

question = "What's the name of the castle that David Gregory inherited?"
rag(context=search_wikipedia(question), question=question)
This little program does what you want (RAG) with any model out of the box, but then you can apply a DSPy "optimizer" to it, which uses one of a few techniques to make it do the task better. Sounds great! To really get mileage out of it, you can compose these little modules into a full workflow, e.g. search → filter → summarize → extract, and DSPy can help optimize all the pieces of the pipeline so it works great end-to-end. This sounds great, what’s not to love? I’ll get to that.
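To make the "compose modules into a pipeline" idea concrete, here is roughly what a multi-module DSPy program looks like, continuing from the snippet above. The signature strings and field names are my own illustration, not lifted from the docs:

class RAGPipeline(dspy.Module):
    def __init__(self):
        super().__init__()
        # each sub-module gets its own signature (and, after optimization, its own prompt)
        self.generate_query = dspy.ChainOfThought("question -> search_query")
        self.respond = dspy.ChainOfThought("context, question -> response")

    def forward(self, question: str):
        query = self.generate_query(question=question).search_query
        context = search_wikipedia(query)
        return self.respond(context=context, question=question)

An optimizer then gets to tune the prompt behind generate_query and the prompt behind respond as separate pieces.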
What is GEPA?
GEPA is short for "Genetic Pareto", and is one of the optimizers you can use to improve your "LLM program" (this is what a pipeline composed of little modules is called in DSPy-land). The required ingredients are simple: the modular LLM program (consisting of components and their associated prompts), plus a training dataset of (input, expected-output) pairs. The optimizer works by using LLMs to reflectively improve the policy (i.e. the prompts), based on analysis of trajectories on the training dataset.
More specifically, it implements a genetic prompt-evolution algorithm, which maintains the titular Pareto frontier of policies. To track the frontier, GEPA stores a "grid" of scores, the performance of each policy on each sample. To be kept around, a policy must be the best policy on at least 1 input, and can’t be strictly dominated by any other policy. This makes sure that the pool of candidates stays diverse, versus just taking the top-scoring N candidates across the whole dataset, which might all be very similar.
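In code, the "best on at least one input" filter is only a few lines. This is a minimal sketch of my reading of the paper, not the official implementation:

def pareto_survivors(scores: dict[str, list[float]]) -> set[str]:
    # scores[c][i] = score of candidate c on training example i.
    # A candidate survives if it achieves the best score on at least one example
    # (the paper additionally drops candidates that are strictly dominated).
    n_examples = len(next(iter(scores.values())))
    survivors: set[str] = set()
    for i in range(n_examples):
        best = max(s[i] for s in scores.values())
        survivors.update(c for c, s in scores.items() if s[i] == best)
    return survivors

# "B" survives because it wins on the second example, even though "A" has the higher average.
print(pareto_survivors({"A": [1.0, 0.0, 1.0], "B": [0.0, 1.0, 0.5]}))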
That Pareto piece determines how candidates are kept or discarded. The other piece is how candidates are "evolved" (the policy-improvement step). Instead of the sparse signal of reinforcement learning (you do a whole rollout and basically get a thumbs-up or a thumbs-down, but have no idea why), GEPA’s reflective approach incorporates specific, information-dense feedback to improve the system rapidly. It generates a new candidate in one of two ways. Either:
- (a) Mutation: A previous candidate and its trajectory are fed to an LLM, and it tries to improve a single prompt (module); OR
- (b) Crossover: Two previous candidates are sampled, and an LLM tries to take the best pieces from each to make a new candidate.
Proposed candidates are evaluated on a minibatch, and thrown away if they don’t demonstrate an improvement on that minibatch. Otherwise, they’re kept and evaluated on the whole evaluation dataset. This keeps going until the budget is exhausted.
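Putting those pieces together, the outer loop looks roughly like this. It’s a sketch of my understanding, not the official library: evaluate is assumed to return per-example scores, and mutate/crossover stand in for the LLM reflection calls.

import random

def gepa_loop(seed_prompts, trainset, evaluate, mutate, crossover, budget=2000):
    pool = [seed_prompts]
    full_scores = [evaluate(seed_prompts, trainset)]      # per-example scores on the full set
    spent = len(trainset)
    while spent < budget:
        do_crossover = len(pool) >= 2 and random.random() < 0.2
        if do_crossover:                                   # (b) merge the best pieces of two parents
            i, j = random.sample(range(len(pool)), 2)
            parent, child = pool[i], crossover(pool[i], pool[j])
        else:                                              # (a) reflectively mutate one module of one parent
            i = random.randrange(len(pool))
            parent, child = pool[i], mutate(pool[i])
        mb = random.sample(trainset, k=min(8, len(trainset)))
        spent += 2 * len(mb)
        if sum(evaluate(child, mb)) <= sum(evaluate(parent, mb)):
            continue                                       # no minibatch improvement: discard
        pool.append(child)                                 # improvement: keep and score on everything
        full_scores.append(evaluate(child, trainset))
        spent += len(trainset)
    return pool, full_scores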
The headline result of the GEPA paper is vastly superior final performance compared to RL (GRPO) on a handful of sample tasks, plus dramatically improved sample efficiency. When I saw those results, I knew I had to take it for a spin.
Problem Setup
My initial goal was to test out GEPA on a task I’ve been working on in my free time: multi-turn agentic search. Since the launch of Claude Code, the zeitgeist has shifted away from single-turn RAG with vector databases, and towards iterated agentic search with tool calls (read my post about that here). Because latency often matters for search, there are huge benefits to getting Opus- or Sonnet-level agentic search performance out of a smaller, specialized model. That’s what Cognition did with SWE-grep, and others are following suit: Morph’s WarpGrep and Relace’s "Fast Agentic Search" (unfortunate acronym) for code, and SID-1 for general search.
In my setting, a model is given a query, and is tasked with returning up to 5 documents from a search index it considers most relevant to answer the query. It is equipped with just two tools, search and fetch, which call out to a Tantivy (BM25) search index. Training data comes from generating synthetic queries for documents from a corpus, and then filtering out examples that are too easy (where simply searching for the query verbatim gives you the result). Outputs are scored +1 if the ground-truth document is included in the list and 0 otherwise, with a -0.1 penalty for each tool call after the third (to reward efficiency).
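As a concrete reference, the reward for a single rollout boils down to something like this (the function name and signature are illustrative; the real logic lives in my Verifiers environment):

def score_rollout(returned_docs: list[str], gold_doc: str, num_tool_calls: int) -> float:
    reward = 1.0 if gold_doc in returned_docs[:5] else 0.0   # did the agent surface the ground-truth doc?
    reward -= 0.1 * max(0, num_tool_calls - 3)               # efficiency penalty after the third tool call
    return reward

print(score_rollout(["doc_4", "doc_9"], "doc_9", num_tool_calls=5))  # 1.0 - 0.2 = 0.8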
I built this environment with Verifiers, and plan to do some RL runs on it over the holidays, but I thought GEPA would be a nice, lightweight baseline. I was surprised that no one had already built a Verifiers-GEPA integration. I was about to find out why.
Mea Culpa – Engineer’s Hubris
The goal was to use GEPA to optimize 3 "modules" of my search agent: the system prompt, the search tool description, and the fetch tool description. The result was a disaster. With the help of Claude and Codex, I threw together some of the most spaghetti, slop code I’ve ever seen, got it to run, saw the eval numbers climb, and even got the final "improved" prompts. But none of it felt good or natural. It felt like I was combining things that didn’t really go together—like mushrooms and cotton candy.
I want to be careful to separate out the parts that were my fault from the parts that I think are deeper problems with the DSPy/GEPA approach. My fault:
- (1) I decided it would be more fun to implement GEPA from scratch on top of my pre-existing LLM API SDK (try saying that three times fast), lm-deluge. The official GEPA implementation already has a provider-agnostic adapter (LiteLLM), so this was pure hubris.
- (2) I also wanted it to work with Verifiers, so that I could keep using the same environment I’d already built with RL in mind. This required a bunch of weird hacks, because typically tool docstrings are treated as fixed, and for prompt-optimization, they’re one of the few things you can change besides the system prompt.
It turns out that trying to make 3 different things that don’t go together play nice is just not a good idea. Who could have possibly predicted this!?
But Also, An Agent is Not a Modular Program
On the other hand, some of the frustration I felt while doing this was, I believe, due to trying to shoehorn an "agentic" workflow into what DSPy and the GEPA harness expect: a modular, linear LLM program, with independent parts that can be improved in isolation. In some ways, it seemed like a natural fit: I have 3 "prompts" (if you count a tool docstring as a prompt), which are "modular", and at each iteration, we can pick one to improve. And it kind of worked, a bit. But on closer inspection, this all falls apart.
A classic modular program is "search, then summarize, then extract." There are discrete steps, a different prompt for each step, and clear inputs and outputs for each. There is locality. For GEPA as presented, this is important, because when you select a module to improve/reflect on, you can look at the inputs and outputs of that module to determine what is and isn’t working. This gives you concrete bits of feedback, which lets you improve that module’s prompt, e.g. "the extract module produced the wrong date format" or "the summaries are too short" or "the search results have semantic instead of lexical matches." The GEPA paper is clear that the compound nature of the AI system being optimized is a key feature:
GEPA is different from all previous work for using textual feedback from the environment, searching the best prompt candidates through Pareto-aware optimization, and using evolution strategies to optimize prompts for each submodule in a compound AI optimization.
But open-ended agentic tasks have a different shape. Instead of each step being a different, specialized module that carries out a specific action like "summarize," each step is exactly the same—the model has all the actions to choose from, and has to decide what to do. Instead of "retrieve, summarize, extract," it’s "search or fetch, search or fetch, search or fetch, search or fetch."
This means that ALL the prompts are relevant at ALL the steps. Which "module" or step does the system prompt apply to? Just the first one? No, it applies to all of them. It directs how to choose an action at each step. What about the search tool description? Is it only relevant to the steps where the agent chose search? No—in the same manner as the system prompt, tool descriptions are globally relevant. At each step, the agent considers all the tool descriptions, and decides which one is appropriate. If I change the search tool description to say "Don’t use this tool," the model will choose the fetch tool more. All tools affect all other tools.
If I had been more humble, and tried to optimize my task with GEPA the "right" way (using DSPy and the official library), I would have just run into this wall earlier, the second I tried to turn my search agent into a DSPy program. Because an agent is not a workflow.
Conclusion: Can GEPA be Salvaged?
One should not conclude from the foregoing that DSPy or GEPA are bad or useless. To the contrary, the most valuable AI system I’ve ever produced (if money is a measure of value) is much more like a boring, deterministic DSPy chain-of-prompts than it is a dynamic agent. If DSPy and GEPA promise that I can spend less time thinking about chaining boring prompts together, I’m all for it.
On the other hand, it’s hard to feel that this is the future of AI. The "Claude Code moment" made clear the power of a smart model in a good agent harness, and Opus 4.5 (combined with modular Skills) is now driving that home even more strongly. Some tasks can be simplified to a sequence of deterministic steps, but many important ones can’t. (And even for simple tasks, what happens when a step goes wrong?) Modular LLM programs therefore can’t be the paradigm for artificial intelligence—just one among many.
I would observe, however, that the two most important parts of GEPA, the Pareto frontier and reflective prompt optimization, do not require a modular LLM program at all. They just require a dataset, a way to evaluate candidate prompts on that dataset, and a tractable way to mutate prompts into better prompts. Indeed, many interesting applications of GEPA are single-step tasks, for which it’s really overkill to define a "compound AI system" consisting of exactly one (1) prompt.
I can imagine applying GEPA’s evolutionary search and reflective prompt optimization to an agentic task like mine, while ditching the idea of "modules." Accept that all the prompts are interconnected, don’t assume there’s any local structure, and use the whole trajectory to make changes to textual parameters that steer the agent towards success. It’s not as simple as updating your Summary Prompt to tell your Summary LLM to write shorter summaries, but I believe it’s feasible now with feedback from smarter AI models.
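Concretely, the reflection step might look something like this: no modules, just one LLM call that sees the whole trajectory and is allowed to rewrite every textual knob at once. This is a speculative sketch (llm is a placeholder for any chat-completion call), not something I’ve validated:

import json

def mutate_all_prompts(llm, prompts: dict[str, str], trajectory: str, score: float) -> dict[str, str]:
    # prompts holds every textual parameter: system prompt, search tool description, fetch tool description.
    instructions = (
        "You are improving a search agent. Below are its current system prompt and tool "
        "descriptions, one full trajectory, and the score it received. Rewrite any or all "
        "of the texts so the agent is more likely to succeed. Return JSON with the same keys."
    )
    response = llm(
        f"{instructions}\n\nCURRENT TEXTS:\n{json.dumps(prompts, indent=2)}\n\n"
        f"TRAJECTORY (score={score}):\n{trajectory}"
    )
    return json.loads(response)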
Maybe after I recover my strength over the holidays, I’ll give GEPA another try.