📊 LLM Evals
Specific
model evaluation, benchmarks, evals
Scoured 6585 posts in 11.3 ms
Pipevals: Evaluation pipelines for every LLM application
💬 Prompt Engineering · pipevals.com · 1d · Lobsters, Hacker News
The Anatomy of an LLM Benchmark
🤨 AI Criticism · cameronrwolfe.substack.com · 3d · Substack
How We Turned a Vibe-Coded Jira Bot Into a Reliable Agent in Two Weeks
💬 Prompt Engineering · arthur.ai · 1d · Hacker News
Show HN: Aludel – LLM eval workbench for Phoenix apps
🦙 Local LLM · github.com · 3d · Hacker News
Beyond the Hype: Is AI taking the fun out of software development?
🤨 AI Criticism · blog.scottlogic.com · 1d
SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks
🤖 Game AI · arxiv.org · 6d · Hacker News
Measuring AI Ability to Complete Long Software Tasks
💬 Prompt Engineering · muratbuffalo.blogspot.com · 4d · Hacker News, Blogger
Show HN: PyNear – exact and approximate KNN, faster than Faiss
🗂️ Vector Databases · news.ycombinator.com · 4d · Hacker News
We spent 2 hours working in the future
🤖 Agent-Based Simulations · metr.org · 4d · Hacker News
Browser-based binary classifier evaluation, no back end
🌐 WebAssembly · evalbench-75a.pages.dev · 3d · Hacker News
Memoriant/dgx-spark-kv-cache-benchmark: KV cache quantization benchmarks on NVIDIA DGX Spark GB10 — three novel findings including the dequantization cliff and unified memory paradox
⚙️ Performance Profiling · github.com · 2d · Hacker News
Anthropic Says Use More Agents to Fix Agent Code. Here's What's Missing.
🤖 Agent-Based Simulations · mergeshield.dev · 3d · Hacker News
Textbooks, Not the Internet, Trained This Powerful AI
🧠 AI · hackernoon.com · 3d
Forensic beats Mem0 with 90.1% on LOCOMO
⚙️ Systems Programming · forensicmemory.com · 5d · Hacker News
MacBook Neo, the benchmarks
🍎 Apple · birchtree.me · 6d · Hacker News
OpenID AuthZen Authorization API 1.0 released
⌚ Quantified Self · openid.github.io · 6d · Hacker News
Chaos Engineering Is the Missing Layer in Every AI Reliability Stack
💬 Prompt Engineering · hackernoon.com · 3d
salespeak-ai/buyer-eval-skill: B2B software vendor evaluation skill for Claude Code — domain-expert questions, vendor AI agent conversations, evidence-based scoring
💬 Prompt Engineering · github.com · 6d · Hacker News
I built a universal CLAUDE.md that cuts Claude output tokens by 63% - validated with benchmarks, fully open source
🔧 Code Generation · github.com · 3d · Hacker News, r/ClaudeAI, r/LocalLLaMA
choutos/agent-reliability-engineering: Agent Reliability Engineering: applying SRE principles to AI agent systems. Evals, imp@k metrics, self-improvement, config versioning, transfer experiments.
🤖 Agent-Based Simulations · github.com · 6d · Hacker News