📏 Model Evaluation
Specific
evals, benchmarking, MMLU, model performance, evaluation metrics
How Far Are We? Systematic Evaluation of LLMs vs. Human Experts in Mathematical Contest in Modeling
📊 ML Research · arxiv.org · 3d
A Deep Dive into LLM Evaluation Metrics: From Perplexity to Production
🧠 LLMs · medium.com · 23h
A Hands-On Guide to Testing Agents with RAGAs and G-Eval
✅ TLA+ · machinelearningmastery.com · 2d
Run evals for Conversational Analytics agents using Prism
✅ TLA+ · cloud.google.com · 1h
A Fast and Loose Clustering of LLM Benchmarks
📊 ML Research · lesswrong.com · 15h
Better Harness: A Recipe for Harness Hill-Climbing with Evals
💻 AI Coding · blog.langchain.com · 1d
Déjà Code: How LLMs Quietly Cheat on Repos They've Already Seen
✅ TLA+ · blogs.latentforce.ai · 1h · Hacker News
benchmarking inference of popular models on consumer hardware
✅ TLA+ · inferena.tech · 5d · Hacker News
Benchmarking LLMs with Marimo Pair
✅ TLA+ · ericmjl.github.io · 19h · Hacker News
Show HN: Proposal for a real long-term AI memory benchmark
✅ TLA+ · penfieldlabs.substack.com · 1d · Substack
How We Evaluate Search Quality at Scale With LLM Judging and IR Metrics
📊 ML Research · medium.com · 22h
Overshoot-ai/vlm-benchmarks: 2500+ VLM benchmarks, auto-updated daily from arXiv
✅ TLA+ · github.com · 1d · Hacker News, r/LocalLLaMA
Why Measuring AI Features Is Nothing Like Measuring Regular Software
🔒 AI Safety · medium.com · 4d
Benchmarks are the new stars
🔒 AI Safety · mercurialsolo.github.io · 22h
I benchmarked my own product, published everything, and 0.2.0 is basically the list of things I had to fix.
✅ TLA+ · blog.routerly.ai · 2d · r/SideProject
Zed Editor launches Agent Metrics, offering public AI agent usage data
🔄 Agentic Workflows · alternativeto.net · 2h
Meta Muse Spark: What the Benchmarks Actually Mean
✅ TLA+ · medium.com · 23h
Give an LLM an API and It'll Thrive. Give It a Touchscreen and It Struggles
✅ TLA+ · blog.allada.com · 4d · Hacker News
AXI: Agent EXperience Interface
🔄 Agentic Workflows · axi.md · 1d · Hacker News
Performance analysis and prediction of single-phase immersion cooling data center
🌐 Distributed Systems · sciencedirect.com · 1h