📊 AI Evals
LLM evaluation, agent evaluation, benchmarks, model measurement
Rank Intervals for Leaderboards: A Hierarchical Framework for Model Evaluation
📄 LLM Research
Content type: Academic
arxiv.org · 2 days ago
Evals First, Models Second: Building Cheaper, Smarter AI Agents With Microsoft Foundry
🤖 AI Agents
Content type: Blog
medium.com · 6 days ago
Less-relevant results
Surprise upset: GPT-5.5 beats Claude Fable 5 on brutal new Agents’ Last Exam benchmark
🟠 Claude
venturebeat.com · 18 hours ago
Deterministic Checks vs Model-as-Judge: A Tiered Approach to Agent Evaluation
✨ Generative AI
Content type: Code
github.com · 5 days ago · DEV
Comprehensive evaluation of LLM capabilities for interpretation and analysis of genome-scale metabolic models in metabolic engineering
✨ Generative AI
Content type: Academic
biorxiv.org · 2 days ago
Evaluating using Mock Tool Calls to Quarantine Untrusted Prompt Inputs
✨ Generative AI
lesswrong.com · 5 days ago
[AINews] FrontierCode: Benchmarking for Code Quality over Slop
🧠 Google DeepMind
Content type: News
latent.space · 2 days ago
The State of LLM Evaluation (2026): Why Evals Became the New Unit Tests
🧠 Prompt Engineering
Content type: Blog
medium.com · 3 days ago
Evaluate AI agents systematically with Agent-EvalKit
🤖 AI Agents
Content type: Blog
aws.amazon.com · 1 hour ago
Law Professors Prefer AI over Peer Answers
🧠 Prompt Engineering
Content type: Academic
law.stanford.edu · 4 days ago · Hacker News
The biggest local LLM on your machine is useless if it can't call a single tool, no matter how many parameters it has
✨ Generative AI
xda-developers.com · 1 day ago
A Controlled Study of Decoding-Time Truthfulness Methods on Instruction-Tuned LLMs
🧠 Prompt Engineering
Content type: Academic
arxiv.org · 13 hours ago
not much happened today | AINews
🟠 Claude
news.smol.ai · 6 days ago
Apple WWDC On-Device AI Deep Dive - Google Docs
📄 LLM Research
gist.is · 19 hours ago · Hacker News
Context windows in AI: why every token is a budget decision
🧠 Prompt Engineering
Content type: Blog
redis.io · 23 hours ago
Show HN: AgentCarousel – behavioral tests for AI agents, with signed evidence
🧠 Prompt Engineering
Content type: Code
github.com · 1 day ago · Hacker News
Launch HN: General Instinct (YC P26) – Frontier models on edge devices
🧠 Google DeepMind
Content type: Discussion
news.ycombinator.com · 6 days ago · Hacker News
VISTA: A Versatile Interactive User Simulation Toolkit for Agent Evaluation
🏗️ Agent Infrastructure
Content type: Academic
arxiv.org · 1 day ago
How to Train Your Goblin
🧠 Prompt Engineering
goblins.mchen.workers.dev · 4 days ago · Hacker News
How Ecolab rebuilt retail intelligence on Databricks and Anthropic Claude
🟠 Claude
Content type: Blog
databricks.com · 2 hours ago