LLM Evals

UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding

 🧠Agent Memory  Content type: Academic
arxiv.org·
Less-relevant results

Adarsh Divakaran: Building AI Agents in Python

 🤖agent design  Content type: Blog
blog.adarshd.dev·

The biggest local LLM on your machine is useless if it can't call a single tool, no matter how many parameters it has

 🤖agent design
xda-developers.com·

What Does Abliteration Actually Cost?

 🤖agent design
lesswrong.com·

LLM Routing: From Strategy Selection to Production Architecture

 🧠Agent Memory  Content type: Blog
blog.n8n.io·

nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16

 🤖agent design

LLM Research Papers: The 2026 List (January to May)

 🤖agent design  Content type: News

umair-tareen/philosopher-council: An eleven-philosopher LLM council - ask it questions or point it at AI-research trends. Claude-powered deliberation through the four classical branches of philosophy. Methodology, not metaphysics.

 🤖agent design  Content type: Code
github.com··r/SideProject

SLMJury: Can Small Language Models Judge as Well as Large Ones?

 🧠Agent Memory  Content type: Academic
arxiv.org·

Launch HN: General Instinct (YC P26) – Frontier models on edge devices

 🚀Amateur Rocketry  Content type: Discussion

Reality: The Final Eval — Lukas Petersson and Axel Backlund of Andon Labs

 🤖agent design
latent.space··Hacker News

When Does Delegation Beat Majority? A Delegation-Based Aggregator for Multi-Sample LLM Inference

 🧠Agent Memory  Content type: Academic
arxiv.org·

🧾 Weekly Wrap Sheet (06/05/2026): Prospectuses & Platforms

 🧠Agent Memory  Content type: News  Content type: Blog

Representation-Aware Advantage Estimation: Your Reward Model Provides More Than A Scalar Output

 🧠Agent Memory  Content type: Academic
arxiv.org·

Evaluating using Mock Tool Calls to Quarantine Untrusted Prompt Inputs

 🧠Agent Memory
lesswrong.com·

Why Shrinking an AI Model Often Makes It More Useful

 🤖agent design
siliconopera.com·

Task-Seeded Synthetic Q&A Generation for Nemotron Pretraining

 🧠Agent Memory  Content type: Blog
huggingface.co·

Multilingual Refusal Alignment for Safer Large Language Models

 🧠Agent Memory  Content type: Academic
arxiv.org·

MechLens: Late Crystallization of Factual Knowledge Explains Intervention Effectiveness in Language Models

 🧠Agent Memory  Content type: Academic
arxiv.org·

Rank Intervals for Leaderboards: A Hierarchical Framework for Model Evaluation

 🤖agent design  Content type: Academic
arxiv.org·
