📊 Model Evals
Specific
LLM evaluation, benchmarks, model evaluation, evals
Scoured 91 posts in 7.0 ms
Beat the Oracle
📚 RAG · Content type: Code · github.com · 4 days ago · DEV
When Languages Disagree: Self-Evolving Multilingual LLM Judges
🧠 LLMs · Content type: Academic · arxiv.org · 1 day ago
RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning
🧠 LLMs · Content type: Academic · arxiv.org · 20 hours ago
Phoenix
⚡ AI Apps · arize.com · 6 days ago
TinyJudge: Unverifiable Constraint Alignment via Lightweight Specialist Ensembles
🔧 MLOps · Content type: Academic · arxiv.org · 1 day ago
Revisiting GSM-Symbolic: Do 2026 Frontier Models Still Fail at Confounded Grade School Math?
🧠 LLMs · lesswrong.com · 5 days ago
Flaws in the LLM Automation Narrative
🧠 LLMs · Content type: Academic · arxiv.org · 20 hours ago
Elmes*: Automated Construction of Fine-Grained Evaluation Rubrics for Large Language Models in Long-Tail Educational Scenarios
🧠 LLMs · Content type: Academic · arxiv.org · 2 days ago
AI agent performance metrics: what to track and why
🤖 AI Agents · Content type: Blog · blog.n8n.io · 5 days ago
Cutting LLM Evaluation Costs with SySRs: A Bandit Algorithm that Provably Exploits Model Similarity
🧠 LLMs · Content type: Academic · arxiv.org · 1 day ago
Density Ridge Selective Prediction for LLM and VLM Hallucination Detection under Calibration Label Scarcity
🧠 LLMs · Content type: Academic · arxiv.org · 20 hours ago
The Fine-Tuning Trap: Evaluating Negative Transfer and the Role of PEFT in Sub-1B Mathematical Reasoning
🧠 LLMs · Content type: Academic · arxiv.org · 2 days ago
Predicting every game of the entire World Cup: All the teams and all the winners
🌍 Geopolitics · Content type: Video, News · espn.com · 5 days ago
Multilingual Refusal Alignment for Safer Large Language Models
🧠 LLMs · Content type: Academic · arxiv.org · 1 day ago
Law professors prefer AI over peer answers
🤖 AI · marginalrevolution.com · 6 days ago · Hacker News
Back on Track: Aligning Rewards and States for Reasoning in Diffusion Large Language Models
🧠 LLMs · Content type: Academic · arxiv.org · 1 day ago
Is the U.S. Men’s National Team Finally Ready for a Breakthrough?
🌍 Geopolitics · Content type: News, Blog · neilpaine.substack.com · 5 days ago · Substack
Why Limit the Residual Stream to Layers and Not Tokens? Persistent Memory for Continuous Latent Reasoning
🧠 LLMs · Content type: Academic · arxiv.org · 1 day ago
Who will win the 2026 FIFA World Cup? Why each of the top contenders (and the USMNT?) could win it all
🌍 Geopolitics · cbssports.com · 6 days ago
MechLens: Late Crystallization of Factual Knowledge Explains Intervention Effectiveness in Language Models
🧠 LLMs · Content type: Academic · arxiv.org · 1 day ago