EvalView: Pytest-style Testing for AI Agents
The open-source testing framework for LangGraph, CrewAI, OpenAI Assistants, and Anthropic Claude agents. Write tests in YAML, catch regressions in CI, and ship with confidence.
EvalView is pytest for AI agents: write readable test cases, run them in CI/CD, and block deploys when behavior, cost, or latency regresses.
What is EvalView?
EvalView is a testing framework for AI agents.
It lets you:
- Write tests in YAML that describe inputs, expected tools, and acceptance thresholds
- Turn real conversations into regression suites (record → generate tests → re-run on every change)
- Gate deployments in CI on behavior, tool calls, cost, and latency
- Plug into LangGraph, CrewAI, OpenAI Assistants, Anthropic Claude, HTTP agents, and more
Think: "pytest / Playwright mindset, but for multi-step agents and tool-calling workflows."
Try it in 2 minutes (no DB required)
You don't need a database, Docker, or any extra infra to start.
# Install
pip install evalview
# Set your OpenAI API key (for LLM-as-judge evaluation)
export OPENAI_API_KEY='your-key-here'
# Run the quickstart - creates a demo agent, a test case, and runs everything
evalview quickstart
You'll see a full run with:
- A demo agent spinning up
- A test case created for you
- A config file wired up
- A scored test: tools used, output quality, cost, latency

Example quickstart output:
─── EvalView Quickstart ───
Step 1/4: Creating demo agent...
✅ Demo agent created
Step 2/4: Creating test case...
✅ Test case created
Step 3/4: Creating config...
✅ Config created
Step 4/4: Starting demo agent and running test...
✅ Demo agent running
Running test...

Test Case: Quickstart Test
Score: 95.0/100
Status: ✅ PASSED
Tool Accuracy: 100%
  Expected tools: calculator
  Used tools: calculator
Output Quality: 90/100
Performance:
  Cost: $0.0010
  Latency: 27ms

Quickstart complete!
Do I need a database?
No.
By default, EvalView runs in a basic, no-DB mode:
- No external database
- Tests run in memory
- Results are printed in a rich terminal UI
You can still use it locally and in CI (exit codes + JSON reports).
That's enough to:
- Write and debug tests for your agents
- Add a "fail the build if this test breaks" check to CI/CD
If you later want history, dashboards, or analytics, you can plug in a database and turn on the advanced features:
- Store all runs over time
- Compare behavior across branches / releases
- Track cost / latency trends
- Generate HTML reports for your team
Database config is optional: EvalView only uses it if you enable it in config.
Why EvalView?
- Fully Open Source - Apache 2.0 licensed, runs entirely on your infra, no SaaS lock-in
- Framework-agnostic - Works with LangGraph, CrewAI, OpenAI, Anthropic, or any HTTP API
- Production-ready - Parallel execution, CI/CD integration, configurable thresholds
- Extensible - Custom adapters, evaluators, and reporters for your stack
Behavior Coverage (not line coverage)
Line coverage doesn't work for LLMs. Instead, EvalView focuses on behavior coverage:
| Dimension | What it measures |
|---|---|
| Tasks covered | Which real-world scenarios have tests? |
| Tools exercised | Are all your agent's tools being tested? |
| Paths hit | Are multi-step workflows tested end-to-end? |
| Eval dimensions | Are you checking correctness, safety, cost, latency? |
The loop: weird prod session → turn it into a regression test → it shows up in your coverage.
# Compact summary with deltas vs last run + regression detection
evalview run --summary
─── EvalView Summary ───
Suite: analytics_agent
Tests: 7 passed, 2 failed
Failures:
❌ cohort: large result set (cost +240%)
❌ doc QA: long context (missing tool: chunking)
Deltas vs last run:
  Tokens: +188% ↑
  Latency: +95ms ↑
  Cost: +$0.12 ↑
⚠️ Regressions detected
# Behavior coverage report
evalview run --coverage
─── Behavior Coverage ───
Suite: analytics_agent
Tasks: 9/9 scenarios (100%)
Tools: 6/8 exercised (75%)
missing: chunking, summarize
Paths: 3/3 multi-step workflows (100%)
Dimensions: correctness ✅, output ✅, cost ✅, latency ✅, safety ✅
Overall: 92% behavior coverage
What it does (in practice)
- Write test cases in YAML - Define inputs, required tools, and scoring thresholds
- Automated evaluation - Tool accuracy, output quality (LLM-as-judge), hallucination checks, cost, latency
- Run in CI/CD - JSON/HTML reports + proper exit codes for blocking deploys
# tests/test-cases/stock-analysis.yaml
name: "Stock Analysis Test"
input:
  query: "Analyze Apple stock performance"
expected:
  tools:
    - fetch_stock_data
    - analyze_metrics
  output:
    contains:
      - "revenue"
      - "earnings"
thresholds:
  min_score: 80
  max_cost: 0.50
  max_latency: 5000
$ evalview run
✅ Stock Analysis Test - PASSED (score: 92.5)
Cost: $0.0234 | Latency: 3.4s
Generate 1000 Tests from 1
Problem: Writing tests manually is slow. You need volume to catch regressions.
Solution: Auto-generate test variations.
Option 1: Expand from existing tests
# Take 1 test, generate 100 variations
evalview expand tests/stock-test.yaml --count 100
# Focus on specific scenarios
evalview expand tests/stock-test.yaml --count 50 \
--focus "different tickers, edge cases, error scenarios"
Generates variations like:
- Different inputs (AAPL → MSFT, GOOGL, TSLA...)
- Edge cases (invalid tickers, empty input, malformed requests)
- Boundary conditions (very long queries, special characters)
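For illustration, one generated variation might look like the sketch below, which reuses the seed test's schema with a different input and expectation. The file name and values are hypothetical, not actual evalview expand output:

# tests/test-cases/stock-test-var-017.yaml (hypothetical generated variation)
name: "Stock Analysis Test - invalid ticker"
input:
  query: "Analyze stock performance for ticker ZZZZ"
expected:
  tools:
    - fetch_stock_data
  output:
    contains:
      - "not found"
thresholds:
  min_score: 80
  max_cost: 0.50
  max_latency: 5000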
Option 2: Record from live interactions
# Use your agent normally, auto-generate tests
evalview record --interactive
EvalView captures:
- Query → Tools called → Output
- Auto-generates test YAML
- Adds reasonable thresholds
Result: Go from 5 manual tests → 500 comprehensive tests in minutes.
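As a sketch, a recorded test might end up looking like the YAML below, with thresholds padded from the observed run. The query, tool name, and numbers are hypothetical examples rather than real recorder output:

# tests/test-cases/recorded-refund-status.yaml (hypothetical recorded test)
name: "Recorded: refund status lookup"
input:
  query: "What's the status of my refund?"
expected:
  tools:
    - lookup_refund        # tool observed during the recorded session (hypothetical)
  output:
    contains:
      - "refund"
thresholds:
  min_score: 80
  max_cost: 0.05        # observed cost plus headroom
  max_latency: 8000     # observed latency plus headroom (ms)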
Connect to your agent
Already have an agent running? Use evalview connect to auto-detect it:
# Start your agent (LangGraph, CrewAI, whatever)
langgraph dev
# Auto-detect and connect
evalview connect # Scans ports, detects framework, configures everything
# Run tests
evalview run
Supports 7+ frameworks with automatic detection: LangGraph • CrewAI • OpenAI Assistants • Anthropic Claude • AutoGen • Dify • Custom APIs
EvalView Cloud (Coming Soon)
We're building a hosted version:
- Dashboard - Visual test history, trends, and pass/fail rates
- Teams - Share results and collaborate on fixes
- Alerts - Slack/Discord notifications on failures
- Regression detection - Automatic alerts when performance degrades
- Parallel runs - Run hundreds of tests in seconds
Join the waitlist - be first to get access
Features
- Test Expansion - Generate 100+ test variations from a single seed test
- Test Recording - Auto-generate tests from live agent interactions
- YAML-based test cases - Write readable, maintainable test definitions
- Parallel execution - Run tests concurrently (8x faster by default)
- Multiple evaluation metrics - Tool accuracy, sequence correctness, output quality, cost, and latency
- LLM-as-judge - Automated output quality assessment
- Cost tracking - Automatic cost calculation based on token usage
- Universal adapters - Works with any HTTP or streaming API
- Rich console output - Beautiful, informative test results
- JSON & HTML reports - Interactive HTML reports with Plotly charts
- Retry logic - Automatic retries with exponential backoff for flaky tests
- Watch mode - Re-run tests automatically on file changes
- Configurable weights - Customize scoring weights globally or per-test
Installation
# Basic installation
pip install evalview
# With HTML reports (Plotly charts)
pip install evalview[reports]
# With watch mode
pip install evalview[watch]
# All optional features
pip install evalview[all]
CLI Reference
evalview quickstart
The fastest way to try EvalView. Creates a demo agent, test case, and runs everything.
evalview run
Run test cases.
evalview run [OPTIONS]
Options:
--pattern TEXT Test case file pattern (default: *.yaml)
-t, --test TEXT Run specific test(s) by name
--verbose Enable verbose logging
--sequential Run tests one at a time (default: parallel)
--max-workers N Max parallel executions (default: 8)
--max-retries N Retry flaky tests N times (default: 0)
--watch Re-run tests on file changes
--html-report PATH Generate interactive HTML report
--summary Compact output with deltas vs last run + regression detection
--coverage Show behavior coverage: tasks, tools, paths, eval dimensions
--judge-model TEXT Model for LLM-as-judge (e.g., gpt-5, sonnet, llama-70b)
--judge-provider TEXT Provider for LLM-as-judge (openai, anthropic, huggingface, gemini, grok)
Model shortcuts - Use simple names; they auto-resolve:
| Shortcut | Full Model |
|---|---|
| gpt-5 | gpt-5 |
| sonnet | claude-sonnet-4-5-20250929 |
| opus | claude-opus-4-5-20251101 |
| llama-70b | meta-llama/Llama-3.1-70B-Instruct |
| gemini | gemini-3.0 |
# Examples
evalview run --judge-model gpt-5 --judge-provider openai
evalview run --judge-model sonnet --judge-provider anthropic
evalview run --judge-model llama-70b --judge-provider huggingface # Free!
evalview expand
Generate test variations from a seed test case.
evalview expand TEST_FILE --count 100 --focus "edge cases"
evalview record
Record agent interactions and auto-generate test cases.
evalview record --interactive
evalview report
Generate report from results.
evalview report .evalview/results/20241118_004830.json --detailed --html report.html
Evaluation Metrics
| Metric | Weight | Description |
|---|---|---|
| Tool Accuracy | 30% | Checks if expected tools were called |
| Output Quality | 50% | LLM-as-judge evaluation |
| Sequence Correctness | 20% | Validates exact tool call order |
| Cost Threshold | Pass/Fail | Must stay under max_cost |
| Latency Threshold | Pass/Fail | Must complete under max_latency |
Weights are configurable globally or per-test.
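Per-test overrides can sit alongside the thresholds in a test case. The sketch below is an assumption about the shape of that override: the weights key and its field names are illustrative, not the confirmed EvalView schema, so check the configuration reference for your version:

# Hypothetical per-test weight override; key names are illustrative only,
# not confirmed against EvalView's schema.
name: "Stock Analysis Test"
input:
  query: "Analyze Apple stock performance"
expected:
  tools:
    - fetch_stock_data
thresholds:
  min_score: 80
weights:
  tool_accuracy: 0.30
  output_quality: 0.50
  sequence_correctness: 0.20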
CI/CD Integration
EvalView is CLI-first. You can run it locally or add it to your CI pipeline.
GitHub Actions
name: EvalView Agent Tests
on: [push, pull_request]
jobs:
  evalview:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      - run: pip install evalview
      - run: evalview run --pattern "tests/test-cases/*.yaml"
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
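If you also want the interactive HTML report attached to each workflow run, one option is to replace the final step with the sketch below. It combines the --html-report flag documented above with the standard actions/upload-artifact action (the report file name is arbitrary):

      - run: evalview run --pattern "tests/test-cases/*.yaml" --html-report evalview-report.html
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
      # Upload the report even when tests fail, so regressions can be inspected
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: evalview-report
          path: evalview-report.html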
Architecture
evalview/
├── adapters/    # Agent communication (HTTP, OpenAI, Anthropic, etc.)
├── evaluators/  # Evaluation logic (tools, output, cost, latency)
├── reporters/   # Output formatting (console, JSON, HTML)
├── core/        # Types, config, parallel execution
└── cli.py       # Click CLI
Guides
| Guide | Description |
|---|---|
| Testing LangGraph Agents in CI | Set up automated testing for LangGraph agents with GitHub Actions |
| Detecting LLM Hallucinations | Catch hallucinations and made-up facts before they reach users |
Further Reading
| Topic | Description |
|---|---|
| Getting Started | 5-minute quickstart guide |
| Framework Support | Supported frameworks and compatibility |
| Cost Tracking | Token usage and cost calculation |
| Debugging Guide | Troubleshooting common issues |
| Adapters | Building custom adapters |
Examples
- LangGraph Integration - Test LangGraph agents
- CrewAI Integration - Test CrewAI agents
- Anthropic Claude - Test Claude API and Claude Agent SDK
- Dify Workflows - Test Dify AI workflows
Using Node.js / Next.js? See @evalview/node for drop-in middleware.
Roadmap
Coming Soon:
- Multi-run flakiness detection
- Multi-turn conversation testing
- Grounded hallucination checking
- Error compounding metrics
- Memory/context influence tracking
Want these? Vote in GitHub Discussions
Contributing
Contributions are welcome! Please open an issue or submit a pull request.
See CONTRIBUTING.md for guidelines.
License
EvalView is open source software licensed under the Apache License 2.0.
Support
- Issues: https://github.com/hidai25/eval-view/issues
- Discussions: https://github.com/hidai25/eval-view/discussions
EvalView just stopped your agent from:
- hallucinating tools that don't exist
- tool-calling itself into bankruptcy
Smash the ⭐ if it saved your sanity (and your wallet) today
Ship AI agents with confidence.