LLM Evaluation
📊 LLM Evaluation
Specific
Benchmarks, Model Testing, Performance Metrics, HELM
Scoured 62 posts in 7.3 ms
Flaws in the LLM Automation Narrative
🤖 LLM Agents
Content type: Academic
arxiv.org · 1 day ago
Back on Track: Aligning Rewards and States for Reasoning in Diffusion Large Language Models
🎯 RLHF
Content type: Academic
arxiv.org · 2 days ago
IDP-Bench: Benchmarking ability of LLMs to protect personal information in interdependent privacy contexts
🔧 MLIR
Content type: Academic
arxiv.org · 1 day ago
MechLens: Late Crystallization of Factual Knowledge Explains Intervention Effectiveness in Language Models
🔄 Transformers
Content type: Academic
arxiv.org · 2 days ago
UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding
🔧 MLIR
Content type: Academic
arxiv.org · 3 days ago
Density Ridge Selective Prediction for LLM and VLM Hallucination Detection under Calibration Label Scarcity
↩️ Backpropagation
Content type: Academic
arxiv.org · 1 day ago
Collective Hallucination in Multi-Agent LLMs: Modeling and Defense
🤖 agentic system
Content type: Academic
arxiv.org · 2 days ago
Dropout-GRPO: Variational Stochasticity for Continuous Latent Reasoning
🔄 Transformers
Content type: Academic
arxiv.org · 1 day ago
Attention-Discounted Adaptive Sampler for Masked Diffusion Language Models
⚡ Inference Optimization
Content type: Academic
arxiv.org · 1 day ago
RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning
🔄 Transformers
Content type: Academic
arxiv.org · 1 day ago
Null-Space Constrained Low-Rank Adaptation for Response-Specified Large Language Model Unlearning
🎛️ Fine-Tuning
Content type: Academic
arxiv.org · 1 day ago
Less is MoE: Trimming Experts in Domain-Specialist Language Models
🎭 Mixture of Experts
Content type: Academic
arxiv.org · 6 days ago
PACE: Anytime-Valid Acceptance Tests for Self-Evolving Agents
🤖 agentic system
Content type: Academic
arxiv.org · 2 days ago
MDP-GRPO: Stabilized Group Relative Policy Optimization for Multi-Constraint Instruction Following
🎯 RLHF
Content type: Academic
arxiv.org · 6 days ago
Elmes*: Automated Construction of Fine-Grained Evaluation Rubrics for Large Language Models in Long-Tail Educational Scenarios
🔍 RAG
Content type: Academic
arxiv.org · 3 days ago
The Fine-Tuning Trap: Evaluating Negative Transfer and the Role of PEFT in Sub-1B Mathematical Reasoning
🎛️ Fine-Tuning
Content type: Academic
arxiv.org · 3 days ago
Self-Commitment Latency: A Reward-Free Probe for Prompted Implicit Hacking
⚙ post training infra
Content type: Academic
arxiv.org · 6 days ago
Critic-Guided Heterogeneous Multi-Agent Reasoning for Reliable Mathematical Problem Solving
🤖 agentic system
Content type: Academic
arxiv.org · 6 days ago
Lightweight Language Models are Prone to Reasoning Errors for Complex Computational Phenotyping Tasks
⚡ Inference Optimization
Content type: Academic
arxiv.org · 3 days ago
Selective-Advantage Entropy-Adaptive Horizon GRPO: Asymmetric Token-Level Discounting for Efficient Reinforcement Learning of Language Models
🎛️ Fine-Tuning
Content type: Academic
arxiv.org · 6 days ago