📊 LLM Evaluation
Specific
Benchmarks, Model Testing, Performance Metrics, HELM
OS-SPEAR: A Toolkit for the Safety, Performance, Efficiency, and Robustness Analysis of OS Agents
🤝 AI Agents · arxiv.org · 2d
Practical Insights into Fair Comparison and Evaluation Frame for Neutral-Atom Compilers
🐛 Fuzzing · arxiv.org · 1d
Your Students Don't Use LLMs Like You Wish They Did
✍️ Prompt Engineering · arxiv.org · 2d
Expert Evaluation of LLM's Open-Ended Legal Reasoning on the Japanese Bar Exam Writing Task
✍️ Prompt Engineering · arxiv.org · 2d
SWE-QA: A Dataset and Benchmark for Complex Code Understanding
🐛 Fuzzing · arxiv.org · 1d
MathDuels: Evaluating LLMs as Problem Posers and Solvers
✍️ Prompt Engineering · arxiv.org · 6d · Hacker News
AgentPulse: A Continuous Multi-Signal Framework for Evaluating AI Agents in Deployment
🤝 AI Agents · arxiv.org · 2d
Below-Chance Blindness: Prompted Underperformance in Small LLMs Produces Positional Bias Rather than Answer Avoidance
🐛 Fuzzing · arxiv.org · 1d
Assessing the Impact of Requirement Ambiguity on LLM-based Function-Level Code Generation
⚙️ MLOps · arxiv.org · 6d
GAIA-v2-LILT: Multilingual Adaptation of Agent Benchmark beyond Translation
🤖 Agentic AI · arxiv.org · 1d
Commit-Aware Learning-Based Test Case Prioritization for Continuous Integration
⚙️ MLOps · arxiv.org · 1d
A Metamorphic Testing Approach to Diagnosing Memorization in LLM-Based Program Repair
🐛 Fuzzing · arxiv.org · 6d
DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios
🤝 AI Agents · arxiv.org · 1d
How Sensitive Are Safety Benchmarks to Judge Configuration Choices?
🚀 Performance Engineering · arxiv.org · 2d
Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity
✍️ Prompt Engineering · arxiv.org · 3d
Empirical Insights of Test Selection Metrics under Multiple Testing Objectives and Distribution Shifts
🐛 Fuzzing · arxiv.org · 2d
RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices
👀 Code Review · arxiv.org · 3d
Peer Identity Bias in Multi-Agent LLM Evaluation: An Empirical Study Using the TRUST Democratic Discourse Analysis Pipeline
⚖️ AI Governance · arxiv.org · 2d
Training a General Purpose Automated Red Teaming Model
🛡️ AI Security · arxiv.org · 2d
Seeing the Whole Elephant: A Benchmark for Failure Attribution in LLM-based Multi-Agent Systems
🤝 AI Agents · arxiv.org · 3d