Scour
📊 LLM Evaluation
Benchmarks, Model Testing, Performance Metrics, HELM
Scoured 186,629 posts in 16.1 ms
- The Coding Assistant Breakdown: More Tokens Please — ⚙️ MLOps · newsletter.semianalysis.com · 6d · Hacker News
- Tokenmaxxing and the search for AI metrics that matter — ⚠️ AI Safety · leaddev.com · 3d · Hacker News
- Introducing the Apitally CLI and skill for agents — 🤝 AI Agents · apitally.io · 5d · r/node
- Software Engineering Metrics Beyond DORA in 2026 — 👀 Code Review · qasource.com · 3d
- LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation — ✍️ Prompt Engineering · arxiv.org · 1d
- CUJBench: Benchmarking LLM-Agent on Cross-Modal Failure Diagnosis from Browser to Backend — ⚙️ MLOps · arxiv.org · 2d
- Evaluating Strategic Reasoning in Forecasting Agents — 🤝 AI Agents · arxiv.org · 21h
- garrytan/gbrain-evals — ⚙️ MLOps · github.com · 6d · Hacker News
- Measuring the Unmeasurable: Markov Chain Reliability for LLM Agents — ⚙️ MLOps · arxiv.org · 2d
- BenchGuard: Who Guards the Benchmarks? Automated Auditing of LLM Agent Benchmarks — ⚙️ MLOps · arxiv.org · 1d
- PMZFX/intel-arc-pro-b70-benchmarks: Benchmark results and performance data for the Intel Arc Pro B70 GPU (Xe2/Battlemage) — LLM inference, video generation, dual-GPU scaling — 📊 Profiling · github.com · 6d · Hacker News
- Utility-Aware Data Pricing: Token-Level Quality and Empirical Training Gain for LLMs — 📱 Edge AI · arxiv.org · 2d
- ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation — ⚙️ MLOps · arxiv.org · 2d
- Show HN: CSP Benchmarks – Go vs. core.async (Clojure) vs. libgoc (C) — 🚀 Performance Engineering · github.com · 6d · Hacker News
- CoRE: A Fine-Grained Code Reasoning Benchmark Beyond Output Prediction — 🚀 Performance Engineering · arxiv.org · 1d
- Bye Bye Perspective API: Lessons for Measurement Infrastructure in NLP, CSS and LLM Evaluation — ⚙️ MLOps · arxiv.org · 1d
- The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models — ⚙️ MLOps · arxiv.org · 1d
- Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines — ✍️ Prompt Engineering · arxiv.org · 2d
- When Prompt Under-Specification Improves Code Correctness: An Exploratory Study of Prompt Wording and Structure Effects on LLM-Based Code Generation — ✍️ Prompt Engineering · arxiv.org · 2d
- FAMA: Failure-Aware Meta-Agentic Framework for Open-Source LLMs in Interactive Tool Use Environments — 🤝 AI Agents · arxiv.org · 1d