📊 LLM Evaluation
Specific
benchmarks, evals, LLM scoring, evaluation metrics
Scoured 150294 posts in 11.9 ms
A Deep Dive into LLM Evaluation Metrics: From Perplexity to Production
🧠 LLMs · medium.com · 20h
The LLM Evaluation Playbook Every AI Engineer Needs
🏢 LLM Adoption · medium.com · 4d
The LLM Effect on IR Benchmarks: A Meta-Analysis of Effectiveness, Baselines, and Contamination
🎯 LLM Finetuning · arxiv.org · 2d
Benchmarking LLMs with Marimo Pair
🎯 LLM Finetuning · ericmjl.github.io · 15h · Hacker News
Better Harness: A Recipe for Harness Hill-Climbing with Evals
🛡️ AI Safety · blog.langchain.com · 1d
How We Evaluate Search Quality at Scale With LLM Judging and IR Metrics
🔍 Information Retrieval · medium.com · 19h
A 0.30-Dollar Model Beat GPT-5.4 and Sonnet at Teaching Kids to Code - Why Fair Benchmarks Are Deeply Unfair
🎯 LLM Finetuning · yaoke.pro · 5d · r/PromptEngineering
I benchmarked my own product, published everything, and 0.2.0 is basically the list of things I had to fix.
🎯 LLM Finetuning · blog.routerly.ai · 2d · r/SideProject
A Fast and Loose Clustering of LLM Benchmarks
💻 Local AI · lesswrong.com · 12h
Overshoot-ai/vlm-benchmarks: 2500+ VLM benchmarks, auto-updated daily from arXiv
🚀 LLM Deployment · github.com · 1d · Hacker News, r/LocalLLaMA
Show HN: Proposal for a real long-term AI memory benchmark
💻 Local AI · penfieldlabs.substack.com · 20h · Substack
Show HN: Pre-training, fine-tuning, and evals platform
🚀 LLM Deployment · oumi.ai · 6d · Hacker News
Benchmarks are the new stars
🎯 LLM Finetuning · mercurialsolo.github.io · 19h
Your RAG App Has a 35% Performance Gap You’ve Never Measured
🔍 Information Retrieval · medium.com · 4d
Meta Muse Spark: What the Benchmarks Actually Mean
🎯 LLM Finetuning · medium.com · 20h
Give an LLM an API and It'll Thrive. Give It a Touchscreen and It Struggles
🔬 Small LMs · blog.allada.com · 4d · Hacker News
Does Pass Rate Tell the Whole Story? Evaluating Design Constraint Compliance in LLM-based Issue Resolution
🏢 LLM Adoption · arxiv.org · 2d
Why Measuring AI Features Is Nothing Like Measuring Regular Software
🔓 Open Source AI · medium.com · 4d
Semantic Layer vs. Text-to-SQL: 2026 Benchmark Update (11 minute read)
🧠 LLMs · docs.getdbt.com · 3d
Beyond LLM-as-a-Judge: Deterministic Metrics for Multilingual Generative Text Evaluation
🧠 LLMs · arxiv.org · 2d