Building an internal agent: Evals to validate workflows

Whenever a new pull request is submitted to our agent’s GitHub repository, we run several CI/CD checks on it: an opinionated linter, typechecking, and a suite of unit tests. All of these work well, but none of them test entire workflows end-to-end. For that end-to-end testing, we introduced an eval pipeline. This is part of the Building an internal agent series.

Why evals matter

The harnesses that run agents have a lot of interesting nuance, but they’re generally pr...
