Model Evals


Beat the Oracle

 📚RAG  Content type: Code
github.com

When Languages Disagree: Self-Evolving Multilingual LLM Judges

 🧠LLMs  Content type: Academic
arxiv.org

RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning

 🧠LLMs  Content type: Academic
arxiv.org

Phoenix

 AI Apps
arize.com

TinyJudge: Unverifiable Constraint Alignment via Lightweight Specialist Ensembles

 🔧MLOps  Content type: Academic
arxiv.org

Revisiting GSM-Symbolic: Do 2026 Frontier Models Still Fail at Confounded Grade School Math?

 🧠LLMs
lesswrong.com

Flaws in the LLM Automation Narrative

 🧠LLMs  Content type: Academic
arxiv.org

Elmes*: Automated Construction of Fine-Grained Evaluation Rubrics for Large Language Models in Long-Tail Educational Scenarios

 🧠LLMs  Content type: Academic
arxiv.org

AI agent performance metrics: what to track and why

 🤖AI Agents  Content type: Blog
blog.n8n.io

Cutting LLM Evaluation Costs with SySRs: A Bandit Algorithm that Provably Exploits Model Similarity

 🧠LLMs  Content type: Academic
arxiv.org

Density Ridge Selective Prediction for LLM and VLM Hallucination Detection under Calibration Label Scarcity

 🧠LLMs  Content type: Academic
arxiv.org

The Fine-Tuning Trap: Evaluating Negative Transfer and the Role of PEFT in Sub-1B Mathematical Reasoning

 🧠LLMs  Content type: Academic
arxiv.org

Predicting every game of the entire World Cup: All the teams and all the winners

 🌍Geopolitics  Content type: Video  Content type: News
espn.com

Multilingual Refusal Alignment for Safer Large Language Models

 🧠LLMs  Content type: Academic
arxiv.org

Law professors prefer AI over peer answers

 🤖AI

Back on Track: Aligning Rewards and States for Reasoning in Diffusion Large Language Models

 🧠LLMs  Content type: Academic
arxiv.org

Is the U.S. Men’s National Team Finally Ready for a Breakthrough?

 🌍Geopolitics  Content type: News  Content type: Blog

Why Limit the Residual Stream to Layers and Not Tokens? Persistent Memory for Continuous Latent Reasoning

 🧠LLMs  Content type: Academic
arxiv.org

Who will win the 2026 FIFA World Cup? Why each of the top contenders (and the USMNT?) could win it all

 🌍Geopolitics
cbssports.com

MechLens: Late Crystallization of Factual Knowledge Explains Intervention Effectiveness in Language Models

 🧠LLMs  Content type: Academic
arxiv.org
