Scour
📊 LLM Evals
Keywords: model evaluation, benchmarks, evals
Scoured 186,664 posts in 20.0 ms
Evaluation of LLM-Based Software Engineering Tools: Practices, Challenges, and Future Directions · 🔧 Code Generation · arxiv.org · 3d
Granite 4.1: IBM's 8B Model Is Competing With Models Four Times Its Size · 💬 Prompt Engineering · firethering.com · 18h · Hacker News
[WIP] Benchmarking Local LLMs Against Coding Agent Harnesses · ⚙️ Performance Profiling · neuralnoise.com · 3d · Hacker News
garrytan/gbrain-evals · 🔧 Code Generation · github.com · 6d · Hacker News
LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation · 🔍 RAG · arxiv.org · 2d
BenchGuard: Who Guards the Benchmarks? Automated Auditing of LLM Agent Benchmarks · 🤨 AI Criticism · arxiv.org · 2d
The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models · 🔍 RAG · arxiv.org · 2d
Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks · ⚡ WebGPU Compute · arxiv.org · 2d
Evaluation without Generation: Non-Generative Assessment of Harmful Model Specialization with Applications to CSAM · 🔧 Code Generation · arxiv.org · 2d
Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity · λ Functional Programming · arxiv.org · 4d
Bye Bye Perspective API: Lessons for Measurement Infrastructure in NLP, CSS and LLM Evaluation · ⌚ Quantified Self · arxiv.org · 2d
Human-in-the-Loop Benchmarking of Heterogeneous LLMs for Automated Competency Assessment in Secondary Level Mathematics · 💬 Prompt Engineering · arxiv.org · 1d
STELLAR-E: a Synthetic, Tailored, End-to-end LLM Application Rigorous Evaluator · 💬 Prompt Engineering · arxiv.org · 3d
SWE-QA: A Dataset and Benchmark for Complex Code Understanding · 📊 Code Visualization · arxiv.org · 2d
When Prompt Under-Specification Improves Code Correctness: An Exploratory Study of Prompt Wording and Structure Effects on LLM-Based Code Generation · 🔧 Code Generation · arxiv.org · 3d
A Scoping Review of LLM-as-a-Judge in Healthcare and the MedJUDGE Framework · 🤨 AI Criticism · arxiv.org · 1d
CUJBench: Benchmarking LLM-Agent on Cross-Modal Failure Diagnosis from Browser to Backend · 🕸️ WASM · arxiv.org · 3d
Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines · 🤨 AI Criticism · arxiv.org · 3d
Evaluating Large Language Models on Computer Science University Exams in Data Structures · 🔍 Parser Design · arxiv.org · 3d
ragR: Retrieval-Augmented Generation and RAG Assessment in R · 🔍 RAG · arxiv.org · 3d