📊 Model Evals - gaoyabing · Scour

Beat the Oracle

📚RAG Code

DEV

When Languages Disagree: Self-Evolving Multilingual LLM Judges

🧠LLMs Academic

RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning

🧠LLMs Academic

Phoenix

TinyJudge: Unverifiable Constraint Alignment via Lightweight Specialist Ensembles

🔧MLOps Academic

Revisiting GSM-Symbolic: Do 2026 Frontier Models Still Fail at Confounded Grade School Math?

lesswrong.com

Flaws in the LLM Automation Narrative

🧠LLMs Academic

Elmes*: Automated Construction of Fine-Grained Evaluation Rubrics for Large Language Models in Long-Tail Educational Scenarios

🧠LLMs Academic

AI agent performance metrics: what to track and why

🤖AI Agents Blog

Cutting LLM Evaluation Costs with SySRs: A Bandit Algorithm that Provably Exploits Model Similarity

🧠LLMs Academic

Density Ridge Selective Prediction for LLM and VLM Hallucination Detection under Calibration Label Scarcity

🧠LLMs Academic

The Fine-Tuning Trap: Evaluating Negative Transfer and the Role of PEFT in Sub-1B Mathematical Reasoning

🧠LLMs Academic

Predicting every game of the entire World Cup: All the teams and all the winners

🌍Geopolitics Video News

Multilingual Refusal Alignment for Safer Large Language Models

🧠LLMs Academic

Law professors prefer AI over peer answers

marginalrevolution.com · Hacker News

Back on Track: Aligning Rewards and States for Reasoning in Diffusion Large Language Models

🧠LLMs Academic

Is the U.S. Men’s National Team Finally Ready for a Breakthrough?

🌍Geopolitics News Blog

neilpaine.substack.com · Substack

Why Limit the Residual Stream to Layers and Not Tokens? Persistent Memory for Continuous Latent Reasoning

🧠LLMs Academic

Who will win the 2026 FIFA World Cup? Why each of the top contenders (and the USMNT?) could win it all

🌍Geopolitics

cbssports.com

MechLens: Late Crystallization of Factual Knowledge Explains Intervention Effectiveness in Language Models

🧠LLMs Academic
