LLM Evals

Rank Intervals for Leaderboards: A Hierarchical Framework for Model Evaluation

 🧠AI Research  Content type: Academic
arxiv.org·

Less-relevant results

justification

 C++  Content type: Blog
0gs.bearblog.dev·

Task-Seeded Synthetic Q&A Generation for Nemotron Pretraining

 🧠AI Research  Content type: Blog
huggingface.co·

Elmes*: Automated Construction of Fine-Grained Evaluation Rubrics for Large Language Models in Long-Tail Educational Scenarios

 🔧MLOps  Content type: Academic
arxiv.org·

Density Ridge Selective Prediction for LLM and VLM Hallucination Detection under Calibration Label Scarcity

 🔧MLOps  Content type: Academic
arxiv.org·

Revisiting GSM-Symbolic: Do 2026 Frontier Models Still Fail at Confounded Grade School Math?

 🔧MLOps
lesswrong.com·

Cutting LLM Evaluation Costs with SySRs: A Bandit Algorithm that Provably Exploits Model Similarity

 🔧MLOps  Content type: Academic
arxiv.org·

Agentic threat actor hits the orchestration plane: AI agent-driven container escape

 🐍Python  Content type: Blog
sysdig.com·

The Fine-Tuning Trap: Evaluating Negative Transfer and the Role of PEFT in Sub-1B Mathematical Reasoning

 🧠AI Research  Content type: Academic
arxiv.org·

Multilingual Refusal Alignment for Safer Large Language Models

 🧠AI Research  Content type: Academic
arxiv.org·

Back on Track: Aligning Rewards and States for Reasoning in Diffusion Large Language Models

 🧠AI Research  Content type: Academic
arxiv.org·

Evaluating using Mock Tool Calls to Quarantine Untrusted Prompt Inputs

 🔧MLOps
lesswrong.com·

Why Limit the Residual Stream to Layers and Not Tokens? Persistent Memory for Continuous Latent Reasoning

 🧠AI Research  Content type: Academic
arxiv.org·

MechLens: Late Crystallization of Factual Knowledge Explains Intervention Effectiveness in Language Models

 🧠AI Research  Content type: Academic
arxiv.org·

Dropout-GRPO: Variational Stochasticity for Continuous Latent Reasoning

 🧠AI Research  Content type: Academic
arxiv.org·

When Does Delegation Beat Majority? A Delegation-Based Aggregator for Multi-Sample LLM Inference

 🧠AI Research  Content type: Academic
arxiv.org·

Attention-Discounted Adaptive Sampler for Masked Diffusion Language Models

 🧠AI Research  Content type: Academic
arxiv.org·

Null-Space Constrained Low-Rank Adaptation for Response-Specified Large Language Model Unlearning

 🧠AI Research  Content type: Academic
arxiv.org·

Selection-Aware Diagnostics for Chain-of-Thought Answer Hijacking

 🧠AI Research  Content type: Academic
arxiv.org·

Less is MoE: Trimming Experts in Domain-Specialist Language Models

 🔧MLOps  Content type: Academic
arxiv.org·
