Scour
📊 Evals
Specific
LLM evaluation, harness, benchmarking, eval framework
Scoured 125 posts in 7.1 ms
A Practical Guide to Assessing Agentic AI Companies for Enterprise Needs
🤖
AI Agents
netnewsledger.com
·
2 days ago
Do VLMs Reason Like Engineers? A Benchmark and a Stage-wise Evaluation
⚡
Inference
Content type:
Academic
arxiv.org
·
15 hours ago
Anthropic: Claude Now Writes 80% of Its Own Code in 2026
✍️
Prompt Engineering
Content type:
Blog
wowhow.cloud
·
2 days ago
·
DEV
🧾 Weekly Wrap Sheet (06/05/2026): Prospectuses & Platforms
🏆
SOTA Models
Content type:
News
Content type:
Blog
saanyaojha.substack.com
·
3 days ago
·
Substack
Closing the Sim-to-Real Gap: An Evaluation Framework for Autonomous Cyber Defense Configuration of Commercial EDR
🕸️
Distributed Systems
Content type:
Academic
arxiv.org
·
1 day ago
Agentic AI solved coding — and exposed every other problem in software engineering
🤖
AI Agents
venturebeat.com
·
3 days ago
·
Hacker News
Modeling Complex Behaviors: Multi-Personality Composition and Dynamic Switching in Vision-Language Models
🧠
LLMs
Content type:
Academic
arxiv.org
·
15 hours ago
Why Shrinking an AI Model Often Makes It More Useful
🌐
Open Source AI
siliconopera.com
·
3 days ago
Kotlin Multiplatform in Production: Two Real-World Use Cases from Booking.com
📐
Context Engineering
Content type:
Blog
medium.com
·
5 days ago
Reality: The Final Eval — Lukas Petersson and Axel Backlund of Andon Labs
🤖
AI Agents
latent.space
·
5 days ago
·
Hacker News
LATTEArena: An Evaluation Framework for LLM-powered Tabular Feature Engineering (Extended Version)
✍️
Prompt Engineering
Content type:
Academic
arxiv.org
·
1 day ago
Adarsh Divakaran: Building AI Agents in Python
✍️
Prompt Engineering
Content type:
Blog
blog.adarshd.dev
·
6 days ago
VESTA: A Fully Automated Scenario Generation and Safety Evaluation Framework for LLM Agents
🤖
AI Agents
Content type:
Academic
arxiv.org
·
1 day ago
APG|SGA clinches Zurich Airport ad rights until 2033 in public tender win
🏆
SOTA Models
ppc.land
·
6 days ago
The 1st PortraitCraft Challenge: A CVPR 2026 Workshop Competition on Portrait Composition Understanding and Generation
🧠
LLMs
Content type:
Academic
arxiv.org
·
15 hours ago
Rank Intervals for Leaderboards: A Hierarchical Framework for Model Evaluation
🏆
SOTA Models
Content type:
Academic
arxiv.org
·
1 day ago
SEF-CLGC at SemEval-2026 Task 11: Logical Notation Impact on Language Model Performance
🧠
LLMs
Content type:
Academic
arxiv.org
·
1 day ago
Anthropic says 80% of its new production code is now authored by Claude — how your enterprise can keep up
🤖
AI Agents
venturebeat.com
·
5 days ago
When Behavioral Safety Evaluation Fails: A Representation-Level Perspective
🧠
LLMs
Content type:
Academic
arxiv.org
·
1 day ago
Synthetic but Not Realistic: The Evaluation Challenge in Generative Modelling for Structured Electronic Medical Records
🧠
LLMs
Content type:
Academic
arxiv.org
·
1 day ago