Are we in a GPT-4-style leap that evals can't see?
martinalderson.com·5w·

I feel like we’ve just had another GPT-4 moment (after everyone was sure that scaling laws were collapsing and progress had stalled). It’s more subtle than GPT-4, but I think it has huge implications for many industries and potentially the economy as a whole.

Chat is a terrible eval

I’ve come to the conclusion that we’ve (mostly) maxed out ad hoc chat as a way to evaluate models. I think everyone got very used to this being the de facto way to test LLMs: GPT-4 was so much better than 3.5 at answering (any) question that it was obviously a big step forward.

Lately that definitely hasn’t been happening, and I think most people have reached a point (especially with the hype) where each model release feels like a bit of a disappointment.

Over the last year or so I’ve fo…
