Scour
📊 AI Benchmarks
Specific
benchmark, leaderboard, evaluation, MMLU, evals
Mapping AI benchmarks onto a common capability scale
🏆 LLM Benchmarking
aiiq.org · 1d · Hacker News
https://vercel.com/changelog/live-model-performance-metrics-accessible-via-ai-gateway
📊 AI Performance Profiling
vercel.com · 5d
Arena AI Model ELO History: A Live Tracker!
🏆 LLM Benchmarking
dev.to · 40m · DEV
made-to-order training data for classifiers and evals
🎯 AI Training
abliteration.ai · 10h · Hacker News
The Autorater Problem: Trusting LLM Judges Without Treating Them Like Ground Truth
🏆 LLM Benchmarking
hackernoon.com · 1d
Frontier AI models don't just delete document content — they rewrite it, and the errors are nearly impossible to catch
🏆 LLM Benchmarking
venturebeat.com · 15h
I built a benchmark for AI “memory” in coding agents. looking for others to beat it.
🤖 AI Codegen
github.com · 5d · r/artificial
Building an Evaluation Harness for Production AI Agents: A 12-Metric Framework From 100+ Deployments
🧠 Context Engineering
towardsdatascience.com · 23h
Model Performance Management Done Right: Build Responsibly Using Explainable AI
🛡️ AI Safety
mlops.community · 1d
Old PC vs New AI: Can a 2015 Desktop Actually Run Gemma 4? (2B vs 4B Benchmark)
📊 AI Performance Profiling
dev.to · 5h · DEV
Researchers say AI just broke every benchmark for autonomous cyber capability
🤖 Artificial Intelligence
cyberscoop.com · 13h · Hacker News
Claude Mythos and the 16-Hour Problem: When AI Agents Outgrow Their Own Benchmarks
🎯 AI Reliability
revolutioninai.com · 1d · r/ClaudeAI
Show HN: CADBench – every AI CAD tool I tested fails on basic mechanical parts
🤖 AI Coding Tools
evals-for-ai-cads.vercel.app · 4d · Hacker News
Microsoft’s multi-agent AI system tops Anthropic’s Mythos on cybersecurity benchmark
💪 AI Power Users
geekwire.com · 11h
Distilling a strategic-reasoning framework into 7B weights
🏆 LLM Benchmarking
lerugray.github.io · 1d · Hacker News
What Inference-Platform Benchmark Posts Leave Out
🏠 Local LLM Deployment
dev.to · 22h · DEV
Scale Labs debuts new Refactoring Leaderboard for AI
✨ Code Quality
testingcatalog.com · 6d
Through the looking glass of benchmark hacking
👨 indie hacker
poolside.ai · 2d · Hacker News
How Nvidia Made Its ASR Models 3x Faster Than the Competition
🎙️ AI Voice
hackernoon.com · 1d
A benchmark is a sensor
🏆 LLM Benchmarking
lesswrong.com · 5d