LLM Evals


Rank Intervals for Leaderboards: A Hierarchical Framework for Model Evaluation

🔧 MLOps | Content type: Academic
arxiv.org
Less-relevant results

ChargeBD: Character-Aware Heterogeneous Agent Reasoning for Guided Engineering in Battery Development

馃AI AgentsContent type: Academic
arxiv.org

The Fine-Tuning Trap: Evaluating Negative Transfer and the Role of PEFT in Sub-1B Mathematical Reasoning

✍️ Prompt Engineering | Content type: Academic
arxiv.org

Multilingual Refusal Alignment for Safer Large Language Models

馃LLMsContent type: Academic
arxiv.org

Flaws in the LLM Automation Narrative

馃LLMsContent type: Academic
arxiv.org

Back on Track: Aligning Rewards and States for Reasoning in Diffusion Large Language Models

馃LLMsContent type: Academic
arxiv.org

Multi-Turn Reasoning When Context Arrives in Pieces: Scalable Sharding and Memory-Augmented RL

✍️ Prompt Engineering | Content type: Academic
arxiv.org

Why Limit the Residual Stream to Layers and Not Tokens? Persistent Memory for Continuous Latent Reasoning

馃LLMsContent type: Academic
arxiv.org

IDP-Bench: Benchmarking ability of LLMs to protect personal information in interdependent privacy contexts

🌐 Open Source AI | Content type: Academic
arxiv.org

MechLens: Late Crystallization of Factual Knowledge Explains Intervention Effectiveness in Language Models

🌐 Open Source AI | Content type: Academic
arxiv.org

Density Ridge Selective Prediction for LLM and VLM Hallucination Detection under Calibration Label Scarcity

馃LLMsContent type: Academic
arxiv.org

When Does Delegation Beat Majority? A Delegation-Based Aggregator for Multi-Sample LLM Inference

馃LLMsContent type: Academic
arxiv.org

Collective Hallucination in Multi-Agent LLMs: Modeling and Defense

馃AI AgentsContent type: Academic
arxiv.org

Dropout-GRPO: Variational Stochasticity for Continuous Latent Reasoning

✍️ Prompt Engineering | Content type: Academic
arxiv.org

Attention-Discounted Adaptive Sampler for Masked Diffusion Language Models

馃LLMsContent type: Academic
arxiv.org

Null-Space Constrained Low-Rank Adaptation for Response-Specified Large Language Model Unlearning

📄 AI Papers | Content type: Academic
arxiv.org

PACE: Anytime-Valid Acceptance Tests for Self-Evolving Agents

馃AI AgentsContent type: Academic
arxiv.org
