📊 LLM Evals
Specific
LLM evaluation, agent benchmarks, evals, LMSYS
Benchmarking LLM Tool-Use in the Wild
agent design · arxiv.org · 5d · Hacker News
LLM Guardrails and Safety in Production AI Systems
Agent Memory · pub.towardsai.net · 7h
Quantization, LoRA, and the 8% Problem: Benchmarking Local LLMs for Production AI
Agent Memory · walsenburgtech.com · 2d · Hacker News
Model API Performance
agent design · news.ycombinator.com · 1h · Hacker News
Quality and evaluation framework for successful AI apps and agents in Microsoft Marketplace
agent design · techcommunity.microsoft.com · 13h
Center for Responsible, Decentralized Intelligence at Berkeley
Agent Memory · rdi.berkeley.edu · 2d · Hacker News
Benchmarking LLMs with Marimo Pair
agent design · ericmjl.github.io · 4d · Hacker News
Building a Robust Documentation Agent with DigitalOcean Gradient AI Platform
Agent Memory · digitalocean.com · 17h · Hacker News
AI Aced the Test. Then It Hallucinated Two APIs in a Row.
agent design · medium.com · 1d
Frontier AI Benchmarking Datasets
Agent Memory · ukri.org · 1h
GPT-5.4 vs GLM-5: Is Open Source Finally Matching Proprietary AI?
agent design · freecodecamp.org · 18h
LLMs fall short in differential diagnosis if in initial low-data clinical consultations
agent design · labmate-online.com · 1h
A Deep Dive into LLM Evaluation Metrics: From Perplexity to Production
Agent Memory · medium.com · 4d
Machine Learning Tasks and Evaluation: How to Choose the Right Metrics and Avoid Common Pitfalls
Agent Memory · zeromathai.com · 2d · DEV
How to build effective reward functions with AWS Lambda for Amazon Nova model customization
Agent Memory · aws.amazon.com · 19h
Introducing KellyBench
agent design · gr.inc · 2d · Hacker News
Running GLM-5.1 (744B) Locally on a Mac Studio: Benchmark Results
Agent Memory · blog.zolty.systems · 19h
Run evals for Conversational Analytics agents using Prism
agent design · cloud.google.com · 3d
Agent skills look great in benchmarks but fall apart under realistic conditions, researchers find
Agent Memory · the-decoder.com · 2d
Evaluating agents for scientific discovery
agent design · allenai.org · 17h