Model Evaluation
Specific
LLM eval, benchmarks, evals, model assessment, MMLU
Scoured 22 posts in 6.1 ms
The Evaluation Blind Spot: A Stereological Theory of Benchmark Coverage for Large Language Models
🧠 LLMs
Content type: Academic
arxiv.org · 6 days ago
Less-relevant results
The biggest local LLM on your machine is useless if it can't call a single tool, no matter how many parameters it has
👁️ Multimodal AI
xda-developers.com · 16 hours ago
Bring your own evaluation framework to EvalHub
🎛️ Fine-tuning
developers.redhat.com · 2 days ago
What Does Abliteration Actually Cost?
🧠 LLMs
lesswrong.com · 6 days ago
Researchers say they trained a foundation model from scratch for about $1,500
🧠 LLMs
venturebeat.com · 11 hours ago
Task-Seeded Synthetic Q&A Generation for Nemotron Pretraining
🧠 Reasoning Models
Content type: Blog
huggingface.co · 6 days ago
A Controlled Study of Decoding-Time Truthfulness Methods on Instruction-Tuned LLMs
🧠 LLMs
Content type: Academic
arxiv.org · 5 hours ago
LLM Research Papers: The 2026 List (January to May)
🧠 Reasoning Models
Content type: News
magazine.sebastianraschka.com · 4 days ago · Hacker News
Reality: The Final Eval — Lukas Petersson and Axel Backlund of Andon Labs
🤖 AI Agents
latent.space · 6 days ago · Hacker News
Launch HN: General Instinct (YC P26) – Frontier models on edge devices
🤖 AI Agents
Content type: Discussion
news.ycombinator.com · 5 days ago · Hacker News
Multilingual Refusal Alignment for Safer Large Language Models
🧠 LLMs
Content type: Academic
arxiv.org · 2 days ago
🧾 Weekly Wrap Sheet (06/05/2026): Prospectuses & Platforms
⚖️ AI Governance
Content type: News, Blog
saanyaojha.substack.com · 3 days ago · Substack
When Does Delegation Beat Majority? A Delegation-Based Aggregator for Multi-Sample LLM Inference
⚡ Inference
Content type: Academic
arxiv.org · 2 days ago
Why Shrinking an AI Model Often Makes It More Useful
✍️ Prompt Engineering
siliconopera.com · 4 days ago
nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16
🧠 Reasoning Models
huggingface.co · 6 days ago · Hacker News, Hacker News, r/LocalLLaMA
MechLens: Late Crystallization of Factual Knowledge Explains Intervention Effectiveness in Language Models
🧠 LLMs
Content type: Academic
arxiv.org · 2 days ago
Revisiting GSM-Symbolic: Do 2026 Frontier Models Still Fail at Confounded Grade School Math?
👁️ Multimodal AI
lesswrong.com · 5 days ago
Null-Space Constrained Low-Rank Adaptation for Response-Specified Large Language Model Unlearning
🔬 AI Research
Content type: Academic
arxiv.org · 1 day ago
Measuring Semantic Progress in Multi-turn Dialogue via Information Gain
⚡ Inference
Content type: Academic
arxiv.org · 5 hours ago
Rank Intervals for Leaderboards: A Hierarchical Framework for Model Evaluation
🌐 World Models
Content type: Academic
arxiv.org · 2 days ago