Scour
LLM Evals
evaluation, benchmarking, LLM testing, model assessment
Smarter Saboteurs, Better Fixers: Scaling & Security in Linear Multi-Agent Workflows
🎼
Agent Orchestration
Content type:
Academic
arxiv.org
·
13h
13 hours ago
AI scribes may have ‘profound impact’ on patient care
🧩
AI Frameworks
Content type:
News
healio.com
·
3d
3 days ago
Valid Inference with Synthetic Data via Task Exchangeability
⚙️
MLOps
Content type:
Academic
arxiv.org
·
13h
13 hours ago
The Biggest Summer Blockbusters Since 2010, by Box Office Sales
💾
Agent Memory
Content type:
News
visualcapitalist.com
·
3d
3 days ago
LLM-Based Visualization Evaluation: How Well Do Literacy-Stratified Personas Approximate Human Judgments?
🧠
LLMs
Content type:
Academic
arxiv.org
·
2d
2 days ago
PhantomBench: Benchmarking the Non-existential Threat of Language Models
🧠
LLMs
Content type:
Academic
arxiv.org
·
2d
2 days ago
NightFeats @ MMU-RAGent NeurIPS 2025: A Context-Optimized Multi-Agent RAG System for the Text-to-Text Track
🔍
RAG
Content type:
Academic
arxiv.org
·
1d
1 day ago
RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning
🧠
LLMs
Content type:
Academic
arxiv.org
·
2d
2 days ago
UXBench: Benchmarking User Experience in AI Assistants
🧩
AI Frameworks
Content type:
Academic
arxiv.org
·
3d
3 days ago
Flaws in the LLM Automation Narrative
🧠
LLMs
Content type:
Academic
arxiv.org
·
2d
2 days ago
When Languages Disagree: Self-Evolving Multilingual LLM Judges
🧠
LLMs
Content type:
Academic
arxiv.org
·
3d
3 days ago
Multi-Turn Reasoning When Context Arrives in Pieces: Scalable Sharding and Memory-Augmented RL
💾
Agent Memory
Content type:
Academic
arxiv.org
·
13h
13 hours ago
A Controlled Study of Decoding-Time Truthfulness Methods on Instruction-Tuned LLMs
🧠
LLMs
Content type:
Academic
arxiv.org
·
1d
1 day ago
Detecting Functional Memorization in Code Language Models
🧠
LLMs
Content type:
Academic
arxiv.org
·
13h
13 hours ago
Attention-Discounted Adaptive Sampler for Masked Diffusion Language Models
🧠
LLMs
Content type:
Academic
arxiv.org
·
2d
2 days ago
TinyJudge: Unverifiable Constraint Alignment via Lightweight Specialist Ensembles
⚙️
MLOps
Content type:
Academic
arxiv.org
·
3d
3 days ago
Sample Where You Struggle: Sharpening Base Model Reasoning via Entropy-Guided Power Sampling
🧠
LLMs
Content type:
Academic
arxiv.org
·
2d
2 days ago
CodeAlchemy: Synthetic Code Rewriting at Scale
🧠
LLMs
Content type:
Academic
arxiv.org
·
2d
2 days ago
Cutting LLM Evaluation Costs with SySRs: A Bandit Algorithm that Provably Exploits Model Similarity
🧠
LLMs
Content type:
Academic
arxiv.org
·
3d
3 days ago
Constrained Semantic Decompression in LLMs through Persian Proverb-Conditioned Story Generation
🧠
LLMs
Content type:
Academic
arxiv.org
·
13h
13 hours ago