Scour
🏆 Model Benchmarks
Specific
model evaluation, benchmark, MMLU, leaderboard, evals
Scoured 144858 posts in 56.2 ms
Show HN: Pre-training, fine-tuning, and evals platform
🔓 Open Source AI · oumi.ai · 5d · Hacker News
Show HN: Benchmark multiple LLMs to compare quality, speed, and cost
🧠 LLMs · loopthink.ai · 4h · Hacker News
Are LLM-Based Retrievers Worth Their Cost? An Empirical Study of Efficiency, Robustness, and Reasoning Overhead
🧠 LLMs · arxiv.org · 1d
Sanity check experiments
💬 Prompt Engineering · sebiwette.de · 13h
Better Harness: A Recipe for Harness Hill-Climbing with Evals
✨ Vibe Coding · blog.langchain.com · 3h
Scraping and analyzing submissions to Terminal Bench 2.0
✨ Vibe Coding · primeradiant.com · 2d
I benchmarked my own product, published everything, and 0.2.0 is basically the list of things I had to fix.
🔓 Open Source AI · blog.routerly.ai · 22h · r/SideProject
Thoughts on causal isolation of AI evaluation benchmarks
⚠️ AI Safety · lesswrong.com · 6d
reviseio/errata-bench: A proofreading benchmark for LLMs
🧠 LLMs · github.com · 1d · Hacker News
I benchmarked GPT-4o, Claude 3.5, and Gemini 1.5 for security
💬 Prompt Engineering · aibench.trypromptguard.com · 21h · DEV
HappyHorse-1.0 hits #1 on Artificial Analysis video leaderboard
🤖 AI News · artificialanalysis.ai · 3h · Hacker News
The Model That Passed Every Benchmark
🔓 Open Source AI · medium.com · 2d
Gemma 4 and what makes an open model succeed
🔓 Open Source AI · interconnects.ai · 5d
Our AI Hallucinated in Production: How We Fixed It With Evals — Yicheng Guo at AI Engineer Melbourne 2026
⚠️ AI Safety · webdirections.org · 22h
Ansible CIS Benchmark: A Fling or a Serious Date?
💬 Prompt Engineering · medium.com · 5h
Gemma 4 E4B vs. Gemma Family: Enterprise Benchmark Across 8 Task Suites
✨ Vibe Coding · aiexplorer-blog.vercel.app · 1d · Hacker News
The Insert Benchmark vs MariaDB 10.2 to 13.0 on a 24-core server
📊 Data Analysis · smalldatum.blogspot.com · 20h · smalldatum.blogspot.com
Inference Arena – new benchmark of local inference and training
🧠 LLMs · kvark.github.io · 3d · Hacker News
Show HN: ErrataBench - A Proofreading Benchmark for LLMs
🧠 LLMs · revise.io · 1d · Hacker News
These gaming phones got busted for cheating, and here’s what the brand says in defence
🟢 OpenAI · androidauthority.com · 9h