📊 LLM Evals
AI evaluation, benchmarking LLMs, model assessment, AI harness
Scoured 51 posts in 6.6 ms
Understanding evaluation collections in EvalHub
🧠 AI Research · developers.redhat.com · 6 days ago
UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding
🧠 AI Research · Content type: Academic · arxiv.org · 2 days ago
Less-relevant results
Show HN: Storytime – Continuity for Claude Code (and other ideas)
⚙️ AI Infrastructure · 1ps0.info · 1 day ago · Hacker News
Google Deepmind's Gemma 4 12B squeezes multimodal AI onto a laptop with just 16 GB of RAM
🧠 AI Research · the-decoder.com · 6 days ago
The State of LLM Evaluation (2026): Why Evals Became the New Unit Tests
🔭 Bird Watching · Content type: Blog · medium.com · 2 days ago
What Does Abliteration Actually Cost?
🧠 AI Research · lesswrong.com · 5 days ago
🧾 Weekly Wrap Sheet (06/05/2026): Prospectuses & Platforms
🧠 AI Research · Content type: News, Blog · saanyaojha.substack.com · 2 days ago · Substack
Reality: The Final Eval — Lukas Petersson and Axel Backlund of Andon Labs
🧠 AI Research · latent.space · 5 days ago · Hacker News
Launch HN: General Instinct (YC P26) – Frontier models on edge devices
🖥️ Computer Hardware · Content type: Discussion · news.ycombinator.com · 4 days ago · Hacker News
RealMath-Eval: Why SOTA Judges Struggle with Real Human Reasoning
🧠 AI Research · Content type: Academic · arxiv.org · 10 hours ago
nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16
🖥️ Computer Hardware · huggingface.co · 6 days ago · Hacker News, r/LocalLLaMA
Adarsh Divakaran: Building AI Agents in Python
🧠 AI Research · Content type: Blog · blog.adarshd.dev · 6 days ago
Beyond English benchmarks: clinical LLM evaluation in Brazilian Portuguese
🏥 Medical Terms · Content type: Academic · arxiv.org · 1 day ago
Why Shrinking an AI Model Often Makes It More Useful
🖥️ Computer Hardware · siliconopera.com · 3 days ago
Cybersecurity M&A Roundup: 26 Deals Announced in May 2026
🖥️ Computer Hardware · securityweek.com · 2 days ago
What Is an Agent?
🔧 MLOps · Content type: News, Blog · tidydesign.substack.com · 4 days ago · Substack
SurgiQ: A Large-Scale Multi-Domain Benchmark for Evaluating Surgical Understanding in Large Language Models
🧠 AI Research · Content type: Academic · arxiv.org · 1 day ago
LLM Research Papers: The 2026 List (January to May)
🧠 AI Research · Content type: News · magazine.sebastianraschka.com · 4 days ago · Hacker News
justification
⚡ C++ · Content type: Blog · 0gs.bearblog.dev · 3 days ago
Rank Intervals for Leaderboards: A Hierarchical Framework for Model Evaluation
🧠 AI Research · Content type: Academic · arxiv.org · 1 day ago