📊 LLM Evals
Specific
LLM evaluation, agent benchmarks, evals, LMSYS
UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding
🧠 Agent Memory · Content type: Academic · arxiv.org · 3 days ago
Less-relevant results
The biggest local LLM on your machine is useless if it can't call a single tool, no matter how many parameters it has
🤖 agent design · xda-developers.com · 11 hours ago
Adarsh Divakaran: Building AI Agents in Python
🤖 agent design · Content type: Blog · blog.adarshd.dev · 6 days ago
LLM Routing: From Strategy Selection to Production Architecture
🧠 Agent Memory · Content type: Blog · blog.n8n.io · 12 hours ago
What Does Abliteration Actually Cost?
🤖 agent design · lesswrong.com · 6 days ago
nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16
🤖 agent design · huggingface.co · 6 days ago · Hacker News, r/LocalLLaMA
LLM Research Papers: The 2026 List (January to May)
🤖 agent design · Content type: News · magazine.sebastianraschka.com · 4 days ago · Hacker News
SLMJury: Can Small Language Models Judge as Well as Large Ones?
🧠 Agent Memory · Content type: Academic · arxiv.org · 2 days ago
umair-tareen/philosopher-council: An eleven-philosopher LLM council - ask it questions or point it at AI-research trends. Claude-powered deliberation through the four classical branches of philosophy. Methodology, not metaphysics.
🤖 agent design · Content type: Code · github.com · 5 days ago · r/SideProject
Launch HN: General Instinct (YC P26) – Frontier models on edge devices
🚀 Amateur Rocketry · Content type: Discussion · news.ycombinator.com · 5 days ago · Hacker News
Reality: The Final Eval — Lukas Petersson and Axel Backlund of Andon Labs
🤖 agent design · latent.space · 6 days ago · Hacker News
When Does Delegation Beat Majority? A Delegation-Based Aggregator for Multi-Sample LLM Inference
🧠 Agent Memory · Content type: Academic · arxiv.org · 2 days ago
🧾 Weekly Wrap Sheet (06/05/2026): Prospectuses & Platforms
🧠 Agent Memory · Content type: News, Blog · saanyaojha.substack.com · 3 days ago · Substack
Representation-Aware Advantage Estimation: Your Reward Model Provides More Than A Scalar Output
🧠 Agent Memory · Content type: Academic · arxiv.org · 1 day ago
Evaluating using Mock Tool Calls to Quarantine Untrusted Prompt Inputs
🧠 Agent Memory · lesswrong.com · 5 days ago
Why Shrinking an AI Model Often Makes It More Useful
🤖 agent design · siliconopera.com · 3 days ago
Multilingual Refusal Alignment for Safer Large Language Models
🧠 Agent Memory · Content type: Academic · arxiv.org · 2 days ago
Task-Seeded Synthetic Q&A Generation for Nemotron Pretraining
🧠 Agent Memory · Content type: Blog · huggingface.co · 6 days ago
MechLens: Late Crystallization of Factual Knowledge Explains Intervention Effectiveness in Language Models
🧠 Agent Memory · Content type: Academic · arxiv.org · 2 days ago
Rank Intervals for Leaderboards: A Hierarchical Framework for Model Evaluation
🤖 agent design · Content type: Academic · arxiv.org · 2 days ago