📊 Model Evaluation
Specific
LLM eval, benchmarks, evals, model assessment, MMLU
Scoured 20 posts in 14.5 ms
The Evaluation Blind Spot: A Stereological Theory of Benchmark Coverage for Large Language Models
🧠 LLMs · Content type: Academic · arxiv.org · 6 days ago
Less-relevant results
The biggest local LLM on your machine is useless if it can't call a single tool, no matter how many parameters it has
👁️ Multimodal AI · xda-developers.com · 14 hours ago
Bring your own evaluation framework to EvalHub
🎛️ Fine-tuning · developers.redhat.com · 2 days ago
What Does Abliteration Actually Cost?
🧠 LLMs · lesswrong.com · 6 days ago
Researchers say they trained a foundation model from scratch for about $1,500
🧠 LLMs · venturebeat.com · 9 hours ago
Task-Seeded Synthetic Q&A Generation for Nemotron Pretraining
🧠 Reasoning Models · Content type: Blog · huggingface.co · 6 days ago
LLM Research Papers: The 2026 List (January to May)
🧠 Reasoning Models · Content type: News · magazine.sebastianraschka.com · 4 days ago · Hacker News
Multilingual Refusal Alignment for Safer Large Language Models
🧠 LLMs · Content type: Academic · arxiv.org · 2 days ago
Reality: The Final Eval — Lukas Petersson and Axel Backlund of Andon Labs
🤖 AI Agents · latent.space · 6 days ago · Hacker News
Launch HN: General Instinct (YC P26) – Frontier models on edge devices
🤖 AI Agents · Content type: Discussion · news.ycombinator.com · 5 days ago · Hacker News
When Does Delegation Beat Majority? A Delegation-Based Aggregator for Multi-Sample LLM Inference
⚡ Inference · Content type: Academic · arxiv.org · 2 days ago
🧾 Weekly Wrap Sheet (06/05/2026): Prospectuses & Platforms
⚖️ AI Governance · Content type: News, Blog · saanyaojha.substack.com · 3 days ago · Substack
MechLens: Late Crystallization of Factual Knowledge Explains Intervention Effectiveness in Language Models
🧠 LLMs · Content type: Academic · arxiv.org · 2 days ago
Why Shrinking an AI Model Often Makes It More Useful
✍️ Prompt Engineering · siliconopera.com · 4 days ago
nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16
🧠 Reasoning Models · huggingface.co · 6 days ago · Hacker News, Hacker News, r/LocalLLaMA
Revisiting GSM-Symbolic: Do 2026 Frontier Models Still Fail at Confounded Grade School Math?
👁️ Multimodal AI · lesswrong.com · 5 days ago
Null-Space Constrained Low-Rank Adaptation for Response-Specified Large Language Model Unlearning
🔬 AI Research · Content type: Academic · arxiv.org · 1 day ago
Rank Intervals for Leaderboards: A Hierarchical Framework for Model Evaluation
🌐 World Models · Content type: Academic · arxiv.org · 2 days ago
UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding
✍️ Prompt Engineering · Content type: Academic · arxiv.org · 3 days ago
MDP-GRPO: Stabilized Group Relative Policy Optimization for Multi-Constraint Instruction Following
🎯 Reinforcement Learning · Content type: Academic · arxiv.org · 6 days ago