LLM Evaluation

Flaws in the LLM Automation Narrative

 🤖LLM Agents  Content type: Academic
arxiv.org·

Back on Track: Aligning Rewards and States for Reasoning in Diffusion Large Language Models

 🎯RLHF  Content type: Academic
arxiv.org·

IDP-Bench: Benchmarking ability of LLMs to protect personal information in interdependent privacy contexts

 🔧MLIR  Content type: Academic
arxiv.org·

MechLens: Late Crystallization of Factual Knowledge Explains Intervention Effectiveness in Language Models

 🔄Transformers  Content type: Academic
arxiv.org·

UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding

 🔧MLIR  Content type: Academic
arxiv.org·

Density Ridge Selective Prediction for LLM and VLM Hallucination Detection under Calibration Label Scarcity

 ↩️Backpropagation  Content type: Academic
arxiv.org·

Collective Hallucination in Multi-Agent LLMs: Modeling and Defense

 🤖agentic system  Content type: Academic
arxiv.org·

Dropout-GRPO: Variational Stochasticity for Continuous Latent Reasoning

 🔄Transformers  Content type: Academic
arxiv.org·

Attention-Discounted Adaptive Sampler for Masked Diffusion Language Models

 Inference Optimization  Content type: Academic
arxiv.org·

RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning

 🔄Transformers  Content type: Academic
arxiv.org·

Null-Space Constrained Low-Rank Adaptation for Response-Specified Large Language Model Unlearning

 🎛️Fine-Tuning  Content type: Academic
arxiv.org·

Less is MoE: Trimming Experts in Domain-Specialist Language Models

 🎭Mixture of Experts  Content type: Academic
arxiv.org·

PACE: Anytime-Valid Acceptance Tests for Self-Evolving Agents

 🤖agentic system  Content type: Academic
arxiv.org·

MDP-GRPO: Stabilized Group Relative Policy Optimization for Multi-Constraint Instruction Following

 🎯RLHF  Content type: Academic
arxiv.org·

Elmes*: Automated Construction of Fine-Grained Evaluation Rubrics for Large Language Models in Long-Tail Educational Scenarios

 🔍RAG  Content type: Academic
arxiv.org·

The Fine-Tuning Trap: Evaluating Negative Transfer and the Role of PEFT in Sub-1B Mathematical Reasoning

 🎛️Fine-Tuning  Content type: Academic
arxiv.org·

Self-Commitment Latency: A Reward-Free Probe for Prompted Implicit Hacking

 post training infra  Content type: Academic
arxiv.org·

Critic-Guided Heterogeneous Multi-Agent Reasoning for Reliable Mathematical Problem Solving

 🤖agentic system  Content type: Academic
arxiv.org·

Lightweight Language Models are Prone to Reasoning Errors for Complex Computational Phenotyping Tasks

 Inference Optimization  Content type: Academic
arxiv.org·

Selective-Advantage Entropy-Adaptive Horizon GRPO: Asymmetric Token-Level Discounting for Efficient Reinforcement Learning of Language Models

 🎛️Fine-Tuning  Content type: Academic
arxiv.org·