📊 LLM Evaluation - alvin.kuruvilla · Scour

OS-SPEAR: A Toolkit for the Safety, Performance, Efficiency, and Robustness Analysis of OS Agents 🤝AI Agents

Practical Insights into Fair Comparison and Evaluation Frame for Neutral-Atom Compilers 🐛Fuzzing

Your Students Don't Use LLMs Like You Wish They Did ✍️Prompt Engineering

Expert Evaluation of LLM's Open-Ended Legal Reasoning on the Japanese Bar Exam Writing Task ✍️Prompt Engineering

SWE-QA: A Dataset and Benchmark for Complex Code Understanding 🐛Fuzzing

MathDuels: Evaluating LLMs as Problem Posers and Solvers ✍️Prompt Engineering


AgentPulse: A Continuous Multi-Signal Framework for Evaluating AI Agents in Deployment 🤝AI Agents

Below-Chance Blindness: Prompted Underperformance in Small LLMs Produces Positional Bias Rather than Answer Avoidance 🐛Fuzzing

Assessing the Impact of Requirement Ambiguity on LLM-based Function-Level Code Generation ⚙️MLOps

GAIA-v2-LILT: Multilingual Adaptation of Agent Benchmark beyond Translation 🤖Agentic AI

Commit-Aware Learning-Based Test Case Prioritization for Continuous Integration ⚙️MLOps

A Metamorphic Testing Approach to Diagnosing Memorization in LLM-Based Program Repair 🐛Fuzzing

DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios 🤝AI Agents

How Sensitive Are Safety Benchmarks to Judge Configuration Choices? 🚀Performance Engineering

Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity ✍️Prompt Engineering

Empirical Insights of Test Selection Metrics under Multiple Testing Objectives and Distribution Shifts 🐛Fuzzing

RealBench: A Repo-Level Code Generation Benchmark Aligned with Real-World Software Development Practices 👀Code Review

Peer Identity Bias in Multi-Agent LLM Evaluation: An Empirical Study Using the TRUST Democratic Discourse Analysis Pipeline ⚖️AI Governance

Training a General Purpose Automated Red Teaming Model 🛡️AI Security

Seeing the Whole Elephant: A Benchmark for Failure Attribution in LLM-based Multi-Agent Systems 🤝AI Agents
