📊 AI Benchmarks
Keywords: benchmark, leaderboard, evaluation, MMLU, evals
Scoured 200,027 posts in 30.5 ms
Mapping AI benchmarks onto a common capability scale
🏆 LLM Benchmarking · aiiq.org · 1d · Hacker News
APIEval-20: An open-source benchmarking framework for AI agents that test APIs
🏆 LLM Benchmarking · resources.kusho.ai · 4d
The Sequence Opinion #860: Every Company’s Last eXam: Some Reflection About Practical AI Evals
🏆 LLM Benchmarking · thesequence.substack.com · 38m · Substack

EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents
💬 Conversational AI · arxiv.org · 7h

AI is ready to take over Python programming, but not much else
🛡️ AI Safety · infoworld.com · 1d

I built a benchmark for AI “memory” in coding agents. Looking for others to beat it.
🤖 AI Codegen · github.com · 5d · r/artificial

Model Performance Management Done Right: Build Responsibly Using Explainable AI
🛡️ AI Safety · mlops.community · 1d

Beyond the Vibe Check: Scaling Cymbal Air Agent Reliability with LangGraph and Vertex AI Evals
🎯 AI Reliability · medium.com · 1d

not much happened today
🤖 AI News · news.smol.ai · 2d

Improving Reproducibility in Evaluation through Multi-Level Annotator Modeling
🤖 Large Language Models · arxiv.org · 7h

Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling
🖼️ Image Generation · arxiv.org · 7h

Evaluate your LLM for Technical Compliance with COMPL-AI
🏆 LLM Benchmarking · mlops.community · 1d

Towards Apples to Apples for AI Evaluations: From Real-World Use Cases to Evaluation Scenarios
🛡️ AI Safety · arxiv.org · 3d

Towards Reliable LLM Evaluation: Correcting the Winner's Curse in Adaptive Benchmarking
🏆 LLM Benchmarking · arxiv.org · 6d

MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI
🏆 LLM Benchmarking · arxiv.org · 2d

An Executable Benchmarking Suite for Tool-Using Agents
📊 AI Performance Profiling · arxiv.org · 1d

SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents
🏆 LLM Benchmarking · arxiv.org · 6d

Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation
🤖 Game AI · arxiv.org · 2d

Query-efficient model evaluation using cached responses
📊 Model Evals · arxiv.org · 3d

LLMSYS-HPOBench: Hyperparameter Optimization Benchmark Suite for Real-World LLM Systems
⚡ LLM Optimization · arxiv.org · 2d