Scour
📊 AI Evals
Specific
LLM evaluation, agent evaluation, benchmarks, model measurement
Scoured 88 posts in 12.6 ms
TinyJudge: Unverifiable Constraint Alignment via Lightweight Specialist Ensembles
🧠 Prompt Engineering · Content type: Academic · arxiv.org · 2 days ago
How Microsoft Is Building for a World of Metered Intelligence
🧠 Prompt Engineering · every.to · 6 days ago
NightFeats @ MMU-RAGent NeurIPS 2025: A Context-Optimized Multi-Agent RAG System for the Text-to-Text Track
👩‍💻 AI Practitioners · Content type: Academic · arxiv.org · 16 hours ago
LLM-Based Visualization Evaluation: How Well Do Literacy-Stratified Personas Approximate Human Judgments?
🧠 Prompt Engineering · Content type: Academic · arxiv.org · 1 day ago
When Does Delegation Beat Majority? A Delegation-Based Aggregator for Multi-Sample LLM Inference
🧠 Prompt Engineering · Content type: Academic · arxiv.org · 2 days ago
Measuring Semantic Progress in Multi-turn Dialogue via Information Gain
🧠 Prompt Engineering · Content type: Academic · arxiv.org · 16 hours ago
Measuring Epistemic Resilience of LLMs Under Misleading Medical Context
🧠 Prompt Engineering · Content type: Academic · arxiv.org · 16 hours ago
Revisiting GSM-Symbolic: Do 2026 Frontier Models Still Fail at Confounded Grade School Math?
🟠 Claude · lesswrong.com · 5 days ago
Cutting LLM Evaluation Costs with SySRs: A Bandit Algorithm that Provably Exploits Model Similarity
🧠 Prompt Engineering · Content type: Academic · arxiv.org · 2 days ago
Agreement in Representation Space for Open-Ended Self-Consistency
🧠 Prompt Engineering · Content type: Academic · arxiv.org · 16 hours ago
Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks
💻 AI Coding Tools · Content type: Academic · arxiv.org · 16 hours ago
SurgiQ: A Large-Scale Multi-Domain Benchmark for Evaluating Surgical Understanding in Large Language Models
🧠 Prompt Engineering · Content type: Academic · arxiv.org · 2 days ago
Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests
💻 AI Coding Tools · Content type: Academic · arxiv.org · 3 days ago
Multilingual Refusal Alignment for Safer Large Language Models
🧠 Prompt Engineering · Content type: Academic · arxiv.org · 2 days ago
Back on Track: Aligning Rewards and States for Reasoning in Diffusion Large Language Models
✨ Generative AI · Content type: Academic · arxiv.org · 2 days ago
[AINews] not much happened today
🟠 Claude · Content type: News · latent.space · 5 days ago
Evidence Markets
🧠 Prompt Engineering · Content type: Academic · arxiv.org · 3 days ago
TRIAGE: Dialectical Reasoning for Explainable Risk Prediction on Irregularly Sampled Medical Time Series with LLMs
🧠 Prompt Engineering · Content type: Academic · arxiv.org · 2 days ago
The Sim-to-Real Gap of Foundation Model Agents: A Unified MDP Perspective
🔧 Tool Use · Content type: Academic · arxiv.org · 3 days ago
MechLens: Late Crystallization of Factual Knowledge Explains Intervention Effectiveness in Language Models
🧠 Prompt Engineering · Content type: Academic · arxiv.org · 2 days ago