📊 Model Evaluation
LLM eval, benchmarks, evals, model assessment, MMLU
A Deep Dive into LLM Evaluation Metrics: From Perplexity to Production
🧠 LLMs · medium.com · 23h
How Far Are We? Systematic Evaluation of LLMs vs. Human Experts in Mathematical Contest in Modeling
🧠 LLMs · arxiv.org · 3d
A Hands-On Guide to Testing Agents with RAGAs and G-Eval
🧠 LLMs · machinelearningmastery.com · 2d
Déjà Code: How LLMs Quietly Cheat on Repos They've Already Seen
🧠 LLMs · blogs.latentforce.ai · 1h · Hacker News
Benchmarking LLMs with Marimo Pair
⚡ Inference · ericmjl.github.io · 18h · Hacker News
Better Harness: A Recipe for Harness Hill-Climbing with Evals
🎛️ Fine-tuning · blog.langchain.com · 1d
Run evals for Conversational Analytics agents using Prism
⚙️ Agent Frameworks · cloud.google.com · 44m
Give an LLM an API and It'll Thrive. Give It a Touchscreen and It Struggles
⚡ Inference · blog.allada.com · 4d · Hacker News
A Fast and Loose Clustering of LLM Benchmarks
⚡ Inference · lesswrong.com · 15h
Show HN: Proposal for a real long-term AI memory benchmark
⚡ Inference · penfieldlabs.substack.com · 1d · Substack
How We Evaluate Search Quality at Scale With LLM Judging and IR Metrics
🧠 LLMs · medium.com · 22h
I benchmarked my own product, published everything, and 0.2.0 is basically the list of things I had to fix.
🎛️ Fine-tuning · blog.routerly.ai · 2d · r/SideProject
Show HN: Pre-training, fine-tuning, and evals platform
🎛️ Fine-tuning · oumi.ai · 6d · Hacker News
Benchmarks are the new stars
🔬 AI Research · mercurialsolo.github.io · 22h
Overshoot-ai/vlm-benchmarks: 2500+ VLM benchmarks, auto-updated daily from arXiv
🔬 AI Research · github.com · 1d · Hacker News, r/LocalLLaMA
Testing Open Source and Commercial LLMs – Can Anyone Beat Claude Opus?
⚡ Inference · akitaonrails.com · 4d · Hacker News
MLOps in 2026: What Is It and Why Should You Care?
🧠 LLMs · flexiana.com · 1d
NL2SQLBench: A Modular Benchmarking Framework for LLM-Enabled NL2SQL Solutions
🧠 LLMs · vldb.org · 1d
The case for Model-as-a-Service over self-managed inference
🧠 LLMs · news.ycombinator.com · 3d · Hacker News
The STAR Unit Test: Stop Telling Stories, Start Proving Value
✍️ Prompt Engineering · connectsblue.com · 2d · DEV