📊 Model Evals
Specific
LLM evaluation, benchmarks, model evaluation, evals
Effective Practices for Mocking LLM Responses During the Software Development Lifecycle
🧪 Software Testing · mlops.community · 1d
Jankmarking: Janky Benchmarking
📊 AI Performance Profiling · williamangel.net · 5d · Hacker News
made-to-order training data for classifiers and evals
🎯 AI Training · abliteration.ai · 12h · Hacker News
My colleague's AI agent kept breaking in production. Here's what we found when we looked closer.
⚙️ AI Automation · getnetra.ai · 48m · DEV
Mapping AI benchmarks onto a common capability scale
📊 AI Benchmarks · aiiq.org · 1d · Hacker News
jdanielbcosta/ufc-predictor: UFC Fight Predictor - A machine learning system for predicting UFC fight outcomes with 68.45% accuracy on unseen 2023–2026 data, outperforming published academic benchmarks (best: 66.71%, Yan et al. ACM ICIIP 2024).
🚀 Model Releases · github.com · 15h · r/learnmachinelearning
Eval Set Sizing: The Statistical Power Math Behind LLM A/B Tests
🤖 LLM · dev.to · 6d · DEV
What you measure depends on where you draw the boundary
🎮 WebGPU · blog.arkstack.dev · 2h · Hacker News
The AI Engineer Illusion: Why Calling LLM APIs Is Not Enough
🤖 AI Engineering · dev.to · 2d · DEV
Cube: Wrapping Benchmarks Once, Unlocking Agentic AI for Everyone
📊 AI Benchmarks · thealliance.ai · 1h · Hacker News
AI cyber capability is speeding past earlier projections
📊 AI Benchmarks · helpnetsecurity.com · 4h
Verbalised evaluation awareness in language models has little effect on their behaviour
🏆 LLM Benchmarking · lesswrong.com · 2d
Your AI Agent Passes Your Evals.
🧠 Context Engineering · pub.towardsai.net · 5d
OpenAI GPT 5.5: Vision Benchmarks & Roboflow Workflows
🧠 OpenAI · blog.roboflow.com · 20h
programmablemanufacturing/programmable-manufacturing-lab: Community repository for physics-informed AI and programmable manufacturing: demos, benchmarks, notes, and roadmap.
⚙️ AI Automation · github.com · 12m · r/learnmachinelearning
Claude Mythos and the 16-Hour Problem: When AI Agents Outgrow Their Own Benchmarks
🎯 AI Reliability · revolutioninai.com · 1d · r/ClaudeAI
Model Showdown: Benchmarking Local vs Cloud LLMs on a Real Coding Task
🏠 Local LLM Deployment · dev.to · 6d · DEV
https://www.together.ai/blog/redpajama-7b
🤖 AI Codegen · together.ai · 1d
Open Source Robot Policies, Datasets, and Benchmarks
🦾 Embodied AI · festivus.hapticlabs.ai · 1d · Hacker News
Why does AI memory fail at connecting facts? I ran the benchmarks to find out
📊 AI Performance Profiling · yourmemoryai.xyz · 4d · Hacker News, r/SideProject