In our previous blog post, we outlined the problem with evaluating against pre-defined error taxonomies: these evaluations become stale (i.e. go off-policy) when your agent or domain shifts. This means you might miss the real issues hiding in your agent’s traces!

For this reason, we believe that the best way to understand your agents’ failures is to aggregate bottom-up from your agents’ actual (on-policy) traces. This is the process of error analysis, which Shreya Shankar and Hamel Husain describe as one of the most critical techniques in evaluation.
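
As a rough illustration of what aggregating bottom-up can look like, here is a minimal Python sketch. Everything in it is an assumption made for this example: the trace records, the free-form notes, the labels, and the `failure_mode_counts` helper are hypothetical, and the labels are assumed to have been assigned by grouping similar notes after reading the traces rather than picked from a fixed taxonomy.

```python
from collections import Counter

# Hypothetical annotated traces. Each note was written while reading the
# trace; each label was assigned afterwards by grouping similar notes.
annotated_traces = [
    {"trace_id": "t1", "note": "called search tool with empty query", "label": "bad tool arguments"},
    {"trace_id": "t2", "note": "ignored the user's date constraint", "label": "dropped constraint"},
    {"trace_id": "t3", "note": "passed a string where an int was expected", "label": "bad tool arguments"},
    {"trace_id": "t4", "note": "no issue observed", "label": None},
]


def failure_mode_counts(traces):
    """Count how often each emergent label appears, skipping traces with no failure."""
    return Counter(t["label"] for t in traces if t["label"] is not None)


counts = failure_mode_counts(annotated_traces)
for label, n in counts.most_common():
    print(f"{label}: {n}")
```

Because the labels come from the notes themselves, the set of categories can grow or change as new traces are reviewed, instead of being fixed in advance.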

Error analysis involves several steps, each of which is individually important, but can become labori…
