LLM Evaluation


LLM Research Papers: The 2026 List (January to May)

 🎭Mixture of Experts  Content type: News

🧾 Weekly Wrap Sheet (06/05/2026): Prospectuses & Platforms

 🤖LLM Agents  Content type: News  Content type: Blog

Beyond English benchmarks: clinical LLM evaluation in Brazilian Portuguese

 🔧MLIR  Content type: Academic
arxiv.org·

Revisiting GSM-Symbolic: Do 2026 Frontier Models Still Fail at Confounded Grade School Math?

 🔄Transformers
lesswrong.com·

Why Shrinking an AI Model Often Makes It More Useful

 🎭Mixture of Experts
siliconopera.com·

LLM-Based Visualization Evaluation: How Well Do Literacy-Stratified Personas Approximate Human Judgments?

 🤖LLM Agents  Content type: Academic
arxiv.org·

Reality: The Final Eval — Lukas Petersson and Axel Backlund of Andon Labs

 🤖LLM Agents
latent.space · Hacker News

Standing at the Foot of the Singularity

 🔲TPU Architecture  Content type: Blog
medium.com·

When Does Delegation Beat Majority? A Delegation-Based Aggregator for Multi-Sample LLM Inference

 Inference Optimization  Content type: Academic
arxiv.org·

Is the U.S. Men’s National Team Finally Ready for a Breakthrough?

 🔍RAG  Content type: News  Content type: Blog

Predicting every game of the entire World Cup: All the teams and all the winners

 📐Linear Algebra  Content type: Video  Content type: News
espn.com·

Cutting LLM Evaluation Costs with SySRs: A Bandit Algorithm that Provably Exploits Model Similarity

 🎛️Fine-Tuning  Content type: Academic
arxiv.org·

Evaluating using Mock Tool Calls to Quarantine Untrusted Prompt Inputs

 🔧MLIR
lesswrong.com·

Measuring Semantic Progress in Multi-turn Dialogue via Information Gain

 🎯RLHF  Content type: Academic
arxiv.org·

Measuring Epistemic Resilience of LLMs Under Misleading Medical Context

 🤖LLM Agents  Content type: Academic
arxiv.org·

Agreement in Representation Space for Open-Ended Self-Consistency

 🔧MLIR  Content type: Academic
arxiv.org·

Rank Intervals for Leaderboards: A Hierarchical Framework for Model Evaluation

 🎯RLHF  Content type: Academic
arxiv.org·

The Evaluation Blind Spot: A Stereological Theory of Benchmark Coverage for Large Language Models

 🔄Transformers  Content type: Academic
arxiv.org·

SurgiQ: A Large-Scale Multi-Domain Benchmark for Evaluating Surgical Understanding in Large Language Models

 🔍RAG  Content type: Academic
arxiv.org·

Multilingual Refusal Alignment for Safer Large Language Models

 🎯RLHF  Content type: Academic
arxiv.org·
