So there I was at 2am staring at my OpenAI dashboard wondering how the hell my bill went from $80 to $400 in a single day. The answer? One of my agents decided to call the same tool 47 times in a loop. In production. While real users were waiting.
The Problem Nobody Talks About
I’ve been running custom AI agents in production for about six months now. Here’s what I learned the hard way: agents that work perfectly on your local machine will absolutely betray you in production. Sometimes they hallucinate tools that don’t exist. Sometimes they answer questions without calling any tools at all, just making stuff up with complete confidence. Sometimes they get stuck in loops burning through tokens like there’s no tomorrow. The worst part? You don’t find out until a user complains. Or until you check your billing dashboard and feel your stomach drop. I tried writing unit tests, but how do you even test something that’s nondeterministic by design? Mock the LLM? Cool, now you’re testing your mocks, not your agent.
What I Actually Wanted
I wanted something dead simple. Write down what the agent is supposed to do. Run it. Fail the build if it does something stupid. That’s it. No PhD required. So I built it.
Meet EvalView
The idea is embarrassingly simple. You write a YAML file describing what should happen:
name: order lookup
input:
  query: "What's the status of order 12345?"
expected:
  tools:
    - get_order_status
thresholds:
  max_cost: 0.10
That’s a real test. If the agent answers without calling get_order_status, the test fails. If it suddenly costs more than 10 cents, the test fails. Red error, CI breaks, deploy blocked. The tool call check alone catches probably 90% of the dumb stuff. Agent confidently answered a question about an order without actually looking up the order? Caught. Agent called some random tool instead of the right one? Caught. Agent decided to call the same tool fifteen times? You get the idea.
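If you want to picture what a check like that boils down to, here’s a rough Python sketch. This is not EvalView’s actual code; the TraceResult shape and check_trace helper are made up for illustration. The point is how little machinery the core idea needs:

    from dataclasses import dataclass, field

    @dataclass
    class TraceResult:
        # What actually happened during one agent run (hypothetical shape).
        tools_called: list[str] = field(default_factory=list)
        cost_usd: float = 0.0

    def check_trace(trace: TraceResult, expected_tools: list[str], max_cost: float) -> list[str]:
        """Return a list of failure messages; an empty list means the test passes."""
        failures = []
        for tool in expected_tools:
            if tool not in trace.tools_called:
                failures.append(f"expected tool '{tool}' was never called")
        if trace.cost_usd > max_cost:
            failures.append(f"run cost ${trace.cost_usd:.2f}, over the ${max_cost:.2f} limit")
        return failures

    # Example: agent answered without looking up the order -> test fails.
    trace = TraceResult(tools_called=[], cost_usd=0.02)
    print(check_trace(trace, expected_tools=["get_order_status"], max_cost=0.10))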
Running It
pip install evalview
evalview quickstart
The quickstart spins up a tiny demo agent and runs some tests against it so you can see how it works. Takes maybe fifteen seconds. For your own agent you just point it at your test files:
evalview run
Throw it in CI and now you have actual guardrails.
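For reference, here’s roughly what that looks like as a GitHub Actions workflow. This is a generic sketch, not something EvalView ships; adjust the triggers and Python version to whatever your repo already uses. The only EvalView-specific parts are the install and run commands from above:

    name: agent-tests
    on: [push, pull_request]
    jobs:
      evals:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - uses: actions/setup-python@v5
            with:
              python-version: "3.11"
          - run: pip install evalview
          - run: evalview run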
What Changed For Me
Before EvalView I was averaging maybe two or three angry user reports per deploy. Something would break in some weird edge case and I’d spend my evening debugging production. After adding these tests? Ten deploys in a row with zero incidents. I actually deploy on Fridays now. I know, I know, but I do. The $400 surprise bills stopped too. Turns out catching infinite loops before production is good for your wallet.
The Boring Technical Stuff
It works with LangGraph, CrewAI, OpenAI, Anthropic, and basically anything you can hit with an HTTP request. There’s also an LLM-as-judge feature for checking output quality, since exact string matching is useless for AI responses.
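To make “anything you can hit with an HTTP request” concrete, here’s a toy agent endpoint in FastAPI. The request and response shapes are hypothetical, not a contract EvalView requires; it’s just the kind of thing you’d point tests at, an endpoint that answers a query and reports which tools it called and what the run cost:

    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class AgentRequest(BaseModel):
        query: str

    @app.post("/agent")
    def run_agent(req: AgentRequest):
        # A real agent would call the LLM and tools here. The response shape
        # below is made up -- the point is that the endpoint reports which
        # tools ran and roughly what the run cost, so checks have something to assert on.
        return {
            "output": f"Order lookup for: {req.query}",
            "tools_called": ["get_order_status"],
            "cost_usd": 0.03,
        }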
What I’m Working On Next
I’m thinking about adding test generation from production logs, so you can turn real failures into regression tests automatically. And maybe a comparison mode to test different agent versions or configurations side by side and see which one performs better. If you’ve got ideas or want to contribute, I’m very open to that. The codebase is not that big and there’s plenty of low-hanging fruit.
Go Look At It
Here’s the repo: https://github.com/hidai25/eval-view. If you’ve ever had an agent embarrass you in production, or if you’ve ever opened a cloud bill and felt physical pain, maybe give it a shot. And if it saves you even one late-night debugging session, throw it a star. I’m genuinely curious what other people are doing for this stuff. Do you have some elaborate eval setup? Let me know in the comments, because I’m still figuring this out as I go.