LLM Evaluation
Specific
benchmarks, evals, LLM scoring, evaluation metrics
What Does Abliteration Actually Cost?
🗣️ NLP · lesswrong.com · 5 days ago
Comprehensive evaluation of LLM capabilities for interpretation and analysis of genome-scale metabolic models in metabolic engineering
🗂️ RAG Systems · Content type: Academic · biorxiv.org · 1 day ago
Less-relevant results
$\tau$-Rec: A Verifiable Benchmark for Agentic Recommender Systems
🔍 Information Retrieval · Content type: Academic · arxiv.org · 14 hours ago
The State of LLM Evaluation (2026): Why Evals Became the New Unit Tests
🏢 LLM Adoption · Content type: Blog · medium.com · 2 days ago
Law Professors Prefer AI over Peer Answers
🏢 LLM Adoption · Content type: Academic · law.stanford.edu · 3 days ago · Hacker News
Show HN: AgentCarousel – behavioral tests for AI agents, with signed evidence
🤖 AI Agents · Content type: Code · github.com · 2 hours ago · Hacker News
The Vanta AI Quality Eval Maturity Model
🛡️ AI Safety · vanta.com · 3 hours ago · Hacker News
The biggest local LLM on your machine is useless if it can't call a single tool, no matter how many parameters it has
🤖 LLMs · xda-developers.com · 1 hour ago
Task-Seeded Synthetic Q&A Generation for Nemotron Pretraining
💬 Natural Language Processing · Content type: Blog · huggingface.co · 6 days ago
Phoenix
🤖 AI Agents · arize.com · 6 days ago
Cybersecurity M&A Roundup: 26 Deals Announced in May 2026
🛡️ AI Safety · securityweek.com · 2 days ago
AI Governance Tools: How To Achieve Compliance and Visibility
🗂 Knowledge Management · Content type: Blog · blog.n8n.io · 3 hours ago
Understanding evaluation collections in EvalHub
🏢 LLM Adoption · developers.redhat.com · 6 days ago
Beyond English benchmarks: clinical llm evaluation in Brazilian Portuguese
🌐 Multilingual NLP · Content type: Academic · arxiv.org · 1 day ago
Google Deepmind's Gemma 4 12B squeezes multimodal AI onto a laptop with just 16 GB of RAM
🔓 Open Source AI · the-decoder.com · 6 days ago
How to Train Your Goblin
🤖 LLMs · goblins.mchen.workers.dev · 3 days ago · Hacker News
Evaluating using Mock Tool Calls to Quarantine Untrusted Prompt Inputs
🤖 LLMs · lesswrong.com · 4 days ago
When Languages Disagree: Self-Evolving Multilingual LLM Judges
🤖 LLMs · Content type: Academic · arxiv.org · 1 day ago
LLM Research Papers: The 2026 List (January to May)
🗣️ NLP · Content type: News · magazine.sebastianraschka.com · 4 days ago · Hacker News
Launch HN: General Instinct (YC P26) – Frontier models on edge devices
💻 Local AI · Content type: Discussion · news.ycombinator.com · 5 days ago · Hacker News