LLM Evaluation
Specific
Benchmarks, Model Testing, Performance Metrics, HELM
Scoured 62 posts in 8.3 ms
LLM Research Papers: The 2026 List (January to May)
🎭 Mixture of Experts
Content type: News
magazine.sebastianraschka.com · 5 days ago · Hacker News
🧾 Weekly Wrap Sheet (06/05/2026): Prospectuses & Platforms
🤖 LLM Agents
Content type: News, Blog
saanyaojha.substack.com · 3 days ago · Substack
Beyond English benchmarks: clinical llm evaluation in Brazilian Portuguese
🔧 MLIR
Content type: Academic
arxiv.org · 2 days ago
Revisiting GSM-Symbolic: Do 2026 Frontier Models Still Fail at Confounded Grade School Math?
🔄 Transformers
lesswrong.com · 5 days ago
Why Shrinking an AI Model Often Makes It More Useful
🎭 Mixture of Experts
siliconopera.com · 4 days ago
LLM-Based Visualization Evaluation: How Well Do Literacy-Stratified Personas Approximate Human Judgments?
🤖 LLM Agents
Content type: Academic
arxiv.org · 1 day ago
Reality: The Final Eval — Lukas Petersson and Axel Backlund of Andon Labs
🤖 LLM Agents
latent.space · 6 days ago · Hacker News
Standing at the Foot of the Singularity
🔲 TPU Architecture
Content type: Blog
medium.com · 3 days ago
When Does Delegation Beat Majority? A Delegation-Based Aggregator for Multi-Sample LLM Inference
⚡ Inference Optimization
Content type: Academic
arxiv.org · 2 days ago
Is the U.S. Men’s National Team Finally Ready for a Breakthrough?
🔍 RAG
Content type: News, Blog
neilpaine.substack.com · 6 days ago · Substack
Predicting every game of the entire World Cup: All the teams and all the winners
📐 Linear Algebra
Content type: Video, News
espn.com · 6 days ago
Cutting LLM Evaluation Costs with SySRs: A Bandit Algorithm that Provably Exploits Model Similarity
🎛️ Fine-Tuning
Content type: Academic
arxiv.org · 2 days ago
Evaluating using Mock Tool Calls to Quarantine Untrusted Prompt Inputs
🔧 MLIR
lesswrong.com · 5 days ago
Measuring Semantic Progress in Multi-turn Dialogue via Information Gain
🎯 RLHF
Content type: Academic
arxiv.org · 8 hours ago
Measuring Epistemic Resilience of LLMs Under Misleading Medical Context
🤖 LLM Agents
Content type: Academic
arxiv.org · 8 hours ago
Agreement in Representation Space for Open-Ended Self-Consistency
🔧 MLIR
Content type: Academic
arxiv.org · 8 hours ago
Rank Intervals for Leaderboards: A Hierarchical Framework for Model Evaluation
🎯 RLHF
Content type: Academic
arxiv.org · 2 days ago
The Evaluation Blind Spot: A Stereological Theory of Benchmark Coverage for Large Language Models
🔄 Transformers
Content type: Academic
arxiv.org · 6 days ago
SurgiQ: A Large-Scale Multi-Domain Benchmark for Evaluating Surgical Understanding in Large Language Models
🔍 RAG
Content type: Academic
arxiv.org · 2 days ago
Multilingual Refusal Alignment for Safer Large Language Models
🎯 RLHF
Content type: Academic
arxiv.org · 2 days ago