📊 Model Evals
LLM evaluation, benchmarks, model evaluation, evals
Towards Reliable LLM Evaluation: Correcting the Winner's Curse in Adaptive Benchmarking
🏆 LLM Benchmarking · arxiv.org · 6d
The Sequence Opinion #860: Every Company’s Last eXam: Some Reflection About Practical AI Evals
📊 AI Benchmarks · thesequence.substack.com · 3h · Substack
Effective Practices for Mocking LLM Responses During the Software Development Lifecycle
🧪 Software Testing · mlops.community · 1d
Beyond the Vibe Check: Scaling Cymbal Air Agent Reliability with LangGraph and Vertex AI Evals
🎯 AI Reliability · medium.com · 1d
Jankmarking: Janky Benchmarking
📊 AI Performance Profiling · williamangel.net · 5d · Hacker News
Mapping AI benchmarks onto a common capability scale
📊 AI Benchmarks · aiiq.org · 2d · Hacker News
not much happened today
🤖 AI News · news.smol.ai · 2d
BintzGavin/apastra: Lightweight prompt versioning, evals, benchmarks, and delivery
🤖 AI Codegen · github.com · 6d · Hacker News
GHGbench: A Unified Multi-Entity, Multi-Task Benchmark for Carbon Emission Prediction
🏆 LLM Benchmarking · arxiv.org · 10h
LLM Evaluation: Practical Tips at Booking.com
🏆 LLM Benchmarking · mlops.community · 1d
In-Situ Behavioral Evaluation for LLM Fairness, Not Standardized-Test Scores
🏆 LLM Benchmarking · arxiv.org · 10h
Exploring LLMs Speed Benchmarks
🏠 Local LLM Deployment · mlops.community · 1d
How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation
⚡ LLM Optimization · arxiv.org · 6d
Valid Best-Model Identification for LLM Evaluation via Low-Rank Factorization
⚡ LLM Optimization · arxiv.org · 2d
Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity
🤖 LLM · arxiv.org · 6d
Targeted Tests for LLM Reasoning: An Audit-Constrained Protocol
🤖 LLM · arxiv.org · 1d
SAGE: Scalable Automated Robustness Augmentation for LLM Knowledge Evaluation
🤖 LLM · arxiv.org · 1d
Query-efficient model evaluation using cached responses
📊 AI Benchmarks · arxiv.org · 3d
CTFusion: A CTF-based Benchmark for LLM Agent Evaluation
🏆 LLM Benchmarking · arxiv.org · 1d
MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents
🏆 LLM Benchmarking · arxiv.org · 6d