Scour
📊 LLM Evals
Keywords: model evaluation, benchmarks, evals
Scoured 186,664 posts in 20.0 ms
Evaluation of LLM-Based Software Engineering Tools: Practices, Challenges, and Future Directions · 🔧 Code Generation · arxiv.org · 3d
Granite 4.1: IBM's 8B Model Is Competing With Models Four Times Its Size · 💬 Prompt Engineering · firethering.com · 18h · Hacker News
[WIP] Benchmarking Local LLMs Against Coding Agent Harnesses · ⚙️ Performance Profiling · neuralnoise.com · 3d · Hacker News
garrytan/gbrain-evals · 🔧 Code Generation · github.com · 6d · Hacker News
LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation · 🔍 RAG · arxiv.org · 2d
BenchGuard: Who Guards the Benchmarks? Automated Auditing of LLM Agent Benchmarks · 🤨 AI Criticism · arxiv.org · 2d
The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models · 🔍 RAG · arxiv.org · 2d
Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks · ⚡ WebGPU Compute · arxiv.org · 2d
Evaluation without Generation: Non-Generative Assessment of Harmful Model Specialization with Applications to CSAM · 🔧 Code Generation · arxiv.org · 2d
Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity · λ Functional Programming · arxiv.org · 4d
Bye Bye Perspective API: Lessons for Measurement Infrastructure in NLP, CSS and LLM Evaluation · ⌚ Quantified Self · arxiv.org · 2d
Human-in-the-Loop Benchmarking of Heterogeneous LLMs for Automated Competency Assessment in Secondary Level Mathematics · 💬 Prompt Engineering · arxiv.org · 1d
STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator · 💬 Prompt Engineering · arxiv.org · 3d
SWE-QA: A Dataset and Benchmark for Complex Code Understanding · 📊 Code Visualization · arxiv.org · 2d
When Prompt Under-Specification Improves Code Correctness: An Exploratory Study of Prompt Wording and Structure Effects on LLM-Based Code Generation · 🔧 Code Generation · arxiv.org · 3d
A Scoping Review of LLM-as-a-Judge in Healthcare and the MedJUDGE Framework · 🤨 AI Criticism · arxiv.org · 1d
CUJBench: Benchmarking LLM-Agent on Cross-Modal Failure Diagnosis from Browser to Backend · 🕸️ WASM · arxiv.org · 3d
Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines · 🤨 AI Criticism · arxiv.org · 3d
Evaluating Large Language Models on Computer Science University Exams in Data Structures · 🔍 Parser Design · arxiv.org · 3d
ragR: Retrieval-Augmented Generation and RAG Assessment in R · 🔍 RAG · arxiv.org · 3d