Scour
📊 LLM Evals
Specific
evaluation, benchmarking, LLM testing, model assessment
Scoured 50 posts in 7.0 ms
An LLM benchmark is only useful for as long as it's hard
🧠 LLMs · Content type: Blog · dev.to · 1 day ago · DEV
Introducing FrontierCode
🧩 AI Frameworks · Content type: Blog · cognition.ai · 3 days ago · Hacker News
$\tau$-Rec: A Verifiable Benchmark for Agentic Recommender Systems
💾 Agent Memory · Content type: Academic · arxiv.org · 2 days ago
Less-relevant results
Launch HN: General Instinct (YC P26) – Frontier models on edge devices
💾 Agent Memory · Content type: Discussion · news.ycombinator.com · 6 days ago · Hacker News
CommBench: Can LLMs Write Correct and Efficient GPU Communication Code?
🌐 Open Source AI · uccl-project.github.io · 1 day ago · Hacker News
Show HN: AgentCarousel – behavioral tests for AI agents, with signed evidence
🤖 AI Agents · Content type: Code · github.com · 1 day ago · Hacker News
Law Professors Prefer AI over Peer Answers
📐 AI Architecture · Content type: Academic · law.stanford.edu · 5 days ago · Hacker News
Researchers say they trained a foundation model from scratch for about $1,500
🌐 Open Source AI · venturebeat.com · 1 day ago · Hacker News
How to Train Your Goblin
🌐 Open Source AI · goblins.mchen.workers.dev · 5 days ago · Hacker News
LLM Research Papers: The 2026 List (January to May)
🌐 Open Source AI · Content type: News · magazine.sebastianraschka.com · 6 days ago · Hacker News
The Vanta AI Quality Eval Maturity Model
🔭 AI Observability · vanta.com · 1 day ago · Hacker News
UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding
✍️ Prompt Engineering · Content type: Academic · arxiv.org · 4 days ago
The Hidden Truth Behind AI-Driven Layoffs in Big Tech
🧠 LLMs · Content type: Blog · dev.to · 2 days ago · DEV
Apple WWDC On-Device AI Deep Dive - Google Docs
🧠 LLMs · gist.is · 1 day ago · Hacker News
Beyond English benchmarks: clinical llm evaluation in Brazilian Portuguese
🧠 LLMs · Content type: Academic · arxiv.org · 3 days ago
Architecture Breakdown: Building an Enterprise-Grade Legal RAG System (From Ingestion to RAGAS Evaluation)
🔍 RAG · Content type: Blog · dev.to · 5 days ago · DEV
Shrivastava-Aditya/boolean-algebra-engine: Deterministic boolean algebra engine — evaluates expressions, detects contradictions, audits logic rules. MCP server, NL layer, REST API, CLI, Streamlit UI.
✍️ Prompt Engineering · Content type: Code · github.com · 4 days ago · Hacker News, r/LLM
PhantomBench: Benchmarking the Non-existential Threat of Language Models
🧠 LLMs · Content type: Academic · arxiv.org · 2 days ago
Hallucination Detection Is Not a Model Problem—It's an Architecture Problem
🧠 LLMs · Content type: Blog · dev.to · 3 days ago · DEV
NightFeats @ MMU-RAGent NeurIPS 2025: A Context-Optimized Multi-Agent RAG System for the Text-to-Text Track
🔍 RAG · Content type: Academic · arxiv.org · 1 day ago