📊 LLM Evaluation - moyutianzun · Scour

LLM Research Papers: The 2026 List (January to May)

🎭Mixture of Experts News

magazine.sebastianraschka.com · Hacker News

🧾 Weekly Wrap Sheet (06/05/2026): Prospectuses & Platforms

🤖LLM Agents News Blog

saanyaojha.substack.com · Substack

Beyond English benchmarks: clinical LLM evaluation in Brazilian Portuguese

🔧MLIR Academic

Revisiting GSM-Symbolic: Do 2026 Frontier Models Still Fail at Confounded Grade School Math?

🔄Transformers

lesswrong.com

Why Shrinking an AI Model Often Makes It More Useful

🎭Mixture of Experts

siliconopera.com

LLM-Based Visualization Evaluation: How Well Do Literacy-Stratified Personas Approximate Human Judgments?

🤖LLM Agents Academic

Reality: The Final Eval — Lukas Petersson and Axel Backlund of Andon Labs

latent.space · Hacker News

Standing at the Foot of the Singularity

🔲TPU Architecture Blog

When Does Delegation Beat Majority? A Delegation-Based Aggregator for Multi-Sample LLM Inference

⚡Inference Optimization Academic

Is the U.S. Men’s National Team Finally Ready for a Breakthrough?

🔍RAG News Blog

neilpaine.substack.com · Substack

Predicting every game of the entire World Cup: All the teams and all the winners

📐Linear Algebra Video News

Cutting LLM Evaluation Costs with SySRs: A Bandit Algorithm that Provably Exploits Model Similarity

🎛️Fine-Tuning Academic

Evaluating using Mock Tool Calls to Quarantine Untrusted Prompt Inputs

lesswrong.com

Measuring Semantic Progress in Multi-turn Dialogue via Information Gain

🎯RLHF Academic

Measuring Epistemic Resilience of LLMs Under Misleading Medical Context

🤖LLM Agents Academic

Agreement in Representation Space for Open-Ended Self-Consistency

🔧MLIR Academic

Rank Intervals for Leaderboards: A Hierarchical Framework for Model Evaluation

🎯RLHF Academic

The Evaluation Blind Spot: A Stereological Theory of Benchmark Coverage for Large Language Models

🔄Transformers Academic

SurgiQ: A Large-Scale Multi-Domain Benchmark for Evaluating Surgical Understanding in Large Language Models

🔍RAG Academic

Multilingual Refusal Alignment for Safer Large Language Models

🎯RLHF Academic
