LLM Evals
LLM evaluation, benchmarks, model assessment, evals framework
Scoured 37 posts in 6.5 ms
Rank Intervals for Leaderboards: A Hierarchical Framework for Model Evaluation
MLOps
Content type: Academic
arxiv.org · 4 days ago
Less-relevant results
ChargeBD: Character-Aware Heterogeneous Agent Reasoning for Guided Engineering in Battery Development
AI Agents
Content type: Academic
arxiv.org · 2 days ago
The Fine-Tuning Trap: Evaluating Negative Transfer and the Role of PEFT in Sub-1B Mathematical Reasoning
Prompt Engineering
Content type: Academic
arxiv.org · 5 days ago
Multilingual Refusal Alignment for Safer Large Language Models
LLMs
Content type: Academic
arxiv.org · 4 days ago
Flaws in the LLM Automation Narrative
LLMs
Content type: Academic
arxiv.org · 3 days ago
Back on Track: Aligning Rewards and States for Reasoning in Diffusion Large Language Models
LLMs
Content type: Academic
arxiv.org · 4 days ago
Multi-Turn Reasoning When Context Arrives in Pieces: Scalable Sharding and Memory-Augmented RL
Prompt Engineering
Content type: Academic
arxiv.org · 1 day ago
Why Limit the Residual Stream to Layers and Not Tokens? Persistent Memory for Continuous Latent Reasoning
LLMs
Content type: Academic
arxiv.org · 4 days ago
IDP-Bench: Benchmarking ability of LLMs to protect personal information in interdependent privacy contexts
Open Source AI
Content type: Academic
arxiv.org · 3 days ago
MechLens: Late Crystallization of Factual Knowledge Explains Intervention Effectiveness in Language Models
Open Source AI
Content type: Academic
arxiv.org · 4 days ago
Density Ridge Selective Prediction for LLM and VLM Hallucination Detection under Calibration Label Scarcity
LLMs
Content type: Academic
arxiv.org · 3 days ago
When Does Delegation Beat Majority? A Delegation-Based Aggregator for Multi-Sample LLM Inference
LLMs
Content type: Academic
arxiv.org · 4 days ago
Collective Hallucination in Multi-Agent LLMs: Modeling and Defense
AI Agents
Content type: Academic
arxiv.org · 4 days ago
Dropout-GRPO: Variational Stochasticity for Continuous Latent Reasoning
Prompt Engineering
Content type: Academic
arxiv.org · 3 days ago
Attention-Discounted Adaptive Sampler for Masked Diffusion Language Models
LLMs
Content type: Academic
arxiv.org · 3 days ago
Null-Space Constrained Low-Rank Adaptation for Response-Specified Large Language Model Unlearning
AI Papers
Content type: Academic
arxiv.org · 3 days ago
PACE: Anytime-Valid Acceptance Tests for Self-Evolving Agents
AI Agents
Content type: Academic
arxiv.org · 4 days ago