Scour
Model Evals
Specific
LLM evaluation, benchmarks, model evaluation, evals
Scoured 91 posts in 9.5 ms
Benchmark Everything Everywhere All at Once
🧠 LLMs · Content type: Academic · arxiv.org · 5 days ago
Introducing FrontierCode
🧠 LLMs · Content type: Blog · cognition.ai · 2 days ago · Hacker News
Less-relevant results
The biggest local LLM on your machine is useless if it can't call a single tool, no matter how many parameters it has
🖥️ Hardware · xda-developers.com · 4 hours ago
1-bit and 1.58 bit LLM Benchmarking on Jetson Orin Nano Super | Bonsai LM
🖥️ Hardware · smolhub.com · 2 days ago · r/LocalLLaMA
What Does Abliteration Actually Cost?
🧠 LLMs · lesswrong.com · 5 days ago
Understanding evaluation collections in EvalHub
⚙️ DevOps · developers.redhat.com · 6 days ago
Show HN: AgentCarousel – behavioral tests for AI agents, with signed evidence
🤖 AI Agents · Content type: Code · github.com · 5 hours ago · Hacker News
Comprehensive evaluation of LLM capabilities for interpretation and analysis of genome-scale metabolic models in metabolic engineering
🧠 LLMs · Content type: Academic · biorxiv.org · 1 day ago
MLPerf and the rise of latency-aware LLM benchmarking
🖥️ Hardware · edn.com · 5 days ago
AMD's Lemonade SDK For Local AI Adds NVIDIA CUDA Support
🖥️ Hardware · phoronix.com · 5 hours ago
How accurate is speech-to-text in 2026?
⚡ AI Apps · Content type: Blog · assemblyai.com · 6 days ago
$\tau$-Rec: A Verifiable Benchmark for Agentic Recommender Systems
🔧 MLOps · Content type: Academic · arxiv.org · 17 hours ago
Reality: The Final Eval — Lukas Petersson and Axel Backlund of Andon Labs
🤖 AI Agents · latent.space · 6 days ago · Hacker News
I built a dashboard ranking all 48 World Cup 2026 teams by travel difficulty
🌍 Geopolitics · jetlagxi.com · 2 days ago · r/SideProject
USMNT World Cup bracket scenarios, odds to advance, predicted path to knockouts
🌍 Geopolitics · Content type: Video, News · espn.com · 10 hours ago
Launch HN: General Instinct (YC P26) – Frontier models on edge devices
🖥️ Hardware · Content type: Discussion · news.ycombinator.com · 5 days ago · Hacker News
UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding
🧠 LLMs · Content type: Academic · arxiv.org · 2 days ago
🧾 Weekly Wrap Sheet (06/05/2026): Prospectuses & Platforms
🖥️ GPUs · Content type: News, Blog · saanyaojha.substack.com · 3 days ago · Substack
The State of LLM Evaluation (2026): Why Evals Became the New Unit Tests
🧠 LLMs · Content type: Blog · medium.com · 3 days ago
nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16
🖥️ Hardware · huggingface.co · 6 days ago · Hacker News, r/LocalLLaMA