🧪 LLM Testing
LLM eval, model evaluation, evals, harness, benchmarks
Scoured 67 posts in 12.8 ms
Show HN: Marlin-2B: a tiny VLM to extract structured information from videos
🧠 LLMs · huggingface.co · 2d · Hacker News
Discover the Red Hat OpenShift AI model catalog
🧠 LLMs · redhat.com · 3d
ACE: Self-Evolving LLM Coding Framework via Adversarial Unit Test Generation and Preference Optimization
🤖 AI · arxiv.org · 2d
tokenspeed: feel LLM tokens-per-second
🧠 LLMs · mikeveerman.github.io · 58m
#1 on the leading AI memory benchmark using a smaller, cheaper model
🧠 LLMs · exabase.io · 5d · Hacker News
Self-Improving Reward Models
🤖 AI Agent · canvas.inc · 1d · Hacker News
EvalHub: Because "looks good to me" isn't a benchmark
🔄 DevOps · developers.redhat.com · 2d
Supersymmetric Digital Assets & AI Emergence
💾 AI Hardware · qbc.network · 3d · Hacker News
Benchmarking five live translation systems with an open-source eval harness (including OpenAI's GPT-Realtime-Translate)
🧠 LLMs · github.com · 1d · DEV
Introducing RAMPART and Clarity: Open source tools to bring safety into Agent development workflow
🕵️ AI Agents · malware.news · 12h
Better Experiments with LLM Evals: A funnel, not a fork
🧠 LLMs · engineering.atspotify.com · 2d
Enterprises can now train custom AI models from production workflows
🤖 AI · venturebeat.com · 6d
Benchmarking LLMs for malware triage and static unpacking with Malcat
🧠 LLMs · malcat.fr · 2d · r/Malware
Show HN: Pokémon SVG Generation LLM Benchmark
🐹 Go · svg-bench.fenx.work · 6d · Hacker News
Show Us Your (Agent) Skills Ep. 03
🤖 AI Agent · luma.com · 1d
Why Demand for AI Data is Here to Stay
🤖 AI · slator.com · 6d
Kubernetes Was the Easy Part
☸️ K8S · cloudnativenow.com · 2d
HWE Bench: A new unbounded Benchmark for LLMs (GPT 5.5 is on top)
🧠 LLMs · hwebench.com · 5d · Hacker News
AI researchers flag bias risks in LLM judging
🧠 LLMs · kite.kagi.com · 5d
TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents
🕵️ AI Agents · arxiv.org · 2d