LLM Benchmarking

Critic-Guided Heterogeneous Multi-Agent Reasoning for Reliable Mathematical Problem Solving

 💻Coding Agents  Content type: Academic
arxiv.org·

LLM Research Papers: The 2026 List (January to May)

 🆕New AI  Content type: News

Evaluating using Mock Tool Calls to Quarantine Untrusted Prompt Inputs

 🪄Prompt Engineering
lesswrong.com·

Why Limit the Residual Stream to Layers and Not Tokens? Persistent Memory for Continuous Latent Reasoning

 🧠LLM Inference  Content type: Academic
arxiv.org·

When Does Delegation Beat Majority? A Delegation-Based Aggregator for Multi-Sample LLM Inference

 🎯Qdrant  Content type: Academic
arxiv.org·
Less-relevant results

Task-Seeded Synthetic Q&A Generation for Nemotron Pretraining

 🪄Prompt Engineering  Content type: Blog
huggingface.co·

Gemma 4 12B: The Missing Encoders Are the Point

 🤖AI
pub.towardsai.net·

UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding

 🪄Prompt Engineering  Content type: Academic
arxiv.org·

nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16

 🆕New AI

What Does Abliteration Actually Cost?

 🤖AI
lesswrong.com·

Back on Track: Aligning Rewards and States for Reasoning in Diffusion Large Language Models

 Fast AI Inference  Content type: Academic
arxiv.org·

Let us let Google know that we want the Gemma 4 124b

 Gemini

MechLens: Late Crystallization of Factual Knowledge Explains Intervention Effectiveness in Language Models

 🤖AI  Content type: Academic
arxiv.org·

Revisiting GSM-Symbolic: Do 2026 Frontier Models Still Fail at Confounded Grade School Math?

 🎭Claude
lesswrong.com·

The Evaluation Blind Spot: A Stereological Theory of Benchmark Coverage for Large Language Models

 Fast AI Inference  Content type: Academic
arxiv.org·

Rank Intervals for Leaderboards: A Hierarchical Framework for Model Evaluation

 📊Statistical Ranking  Content type: Academic
arxiv.org·

SemBlock: Semantic Boundary Dynamic Blocks for Diffusion LLMs

 🧠LLM Inference  Content type: Academic
arxiv.org·

The Fine-Tuning Trap: Evaluating Negative Transfer and the Role of PEFT in Sub-1B Mathematical Reasoning

 🦉Qwen  Content type: Academic
arxiv.org·

Dynamic Infilling Anchors for Format-Constrained Generation in Diffusion Large Language Models

 🪄Prompt Engineering  Content type: Academic
arxiv.org·

Selection-Aware Diagnostics for Chain-of-Thought Answer Hijacking

 🤖AI  Content type: Academic
arxiv.org·
