📊 LLM Evals
Specific
model evaluation, benchmarks, evals
Scoured 6402 posts in 21.3 ms
[WIP] Benchmarking Local LLMs Against Coding Agent Harnesses
⚙️ Performance Profiling · neuralnoise.com · 3d · Hacker News
Granite 4.1: IBM's 8B Model Is Competing With Models Four Times Its Size
💬 Prompt Engineering · firethering.com · 14h · Hacker News
Why RAG Systems Fail in Production
🔍 RAG · digitalocean.com · 1d
Benchmarking a Bug Scanner
🐛 Fuzzing · blog.detail.dev · 5h · Hacker News
Cyborg evals
💬 Prompt Engineering · lesswrong.com · 7h · Hacker News
Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks
📊 Performance Monitoring · odysseys-website.pages.dev · 1d · Hacker News
MathDuels: Evaluating LLMs as Problem Posers and Solvers
λ Functional Programming · arxiv.org · 6d · Hacker News
The Data Layer Tax for Robot Learning
📊 Machine Learning · rerun.io · 12h · Hacker News
Our evaluation of OpenAI’s GPT-5.5 cyber capabilities
🔐 Cybersecurity · simonwillison.net · 2h
ExaBench: An Open Database Performance Leaderboard
⚙️ Performance Profiling · exasol.com · 1d · Hacker News
Memory Machines: Can LLMs create lasting flashcards from readers' highlights?
💬 Prompt Engineering · memory-machines.com · 21h · Hacker News
garrytan/gbrain-evals
🔧 Code Generation · github.com · 6d · Hacker News
atomic_queue benchmarks SMT vs no-SMT performance
⚙️ Performance Profiling · max0x7ba.github.io · 1d · r/cpp, r/linux
The Human Creativity Benchmark – Evaluating Generative AI in Creative Work
🎨 Design Systems · contralabs.com · 6h · Hacker News
Benchmarking Opus 4.7: ~80% higher cost in practice
🔧 Code Generation · wozcode.com · 1d · Hacker News
local-first MCP code intelligence (and the runs we lose)
⚙️ Systems Programming · sverklo.com · 3d · Hacker News
The Coding Assistant Breakdown: More Tokens Please
💬 Prompt Engineering · newsletter.semianalysis.com · 6d · Hacker News
PMZFX/intel-arc-pro-b70-benchmarks: Benchmark results and performance data for the Intel Arc Pro B70 GPU (Xe2/Battlemage) - LLM inference, video generation, dual-GPU scaling.
⚙️ Performance Profiling · github.com · 6d · Hacker News
Claude Opus 4.6 vs. Opus 4.7 Effort Levels and Prompt Steering Benchmarks
💬 Prompt Engineering · ai.georgeliu.com · 4d · Hacker News
DeepSeek V4 with Strix: a quick test
⚙️ Performance Profiling · theaq.blog · 5d · Hacker News