📊 AI Benchmarks
Keywords: benchmark, leaderboard, evaluation, MMLU, evals
Scoured 200,027 posts in 30.5 ms
Mapping AI benchmarks onto a common capability scale
🏆 LLM Benchmarking · aiiq.org · 1d · Hacker News
APIEval-20: An open-source benchmarking framework for AI agents that test APIs
🏆 LLM Benchmarking · resources.kusho.ai · 4d
The Sequence Opinion #860: Every Company’s Last eXam: Some Reflection About Practical AI Evals
🏆 LLM Benchmarking · thesequence.substack.com · 38m · Substack

EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents
💬 Conversational AI · arxiv.org · 7h

AI is ready to take over Python programming, but not much else
🛡️ AI Safety · infoworld.com · 1d

I built a benchmark for AI “memory” in coding agents. Looking for others to beat it.
🤖 AI Codegen · github.com · 5d · r/artificial

Model Performance Management Done Right: Build Responsibly Using Explainable AI
🛡️ AI Safety · mlops.community · 1d

Beyond the Vibe Check: Scaling Cymbal Air Agent Reliability with LangGraph and Vertex AI Evals
🎯 AI Reliability · medium.com · 1d

not much happened today
🤖 AI News · news.smol.ai · 2d

Improving Reproducibility in Evaluation through Multi-Level Annotator Modeling
🤖 Large Language Models · arxiv.org · 7h

Edit-Compass & EditReward-Compass: A Unified Benchmark for Image Editing and Reward Modeling
🖼️ Image Generation · arxiv.org · 7h

Evaluate your LLM for Technical Compliance with COMPL-AI
🏆 LLM Benchmarking · mlops.community · 1d

Towards Apples to Apples for AI Evaluations: From Real-World Use Cases to Evaluation Scenarios
🛡️ AI Safety · arxiv.org · 3d

Towards Reliable LLM Evaluation: Correcting the Winner's Curse in Adaptive Benchmarking
🏆 LLM Benchmarking · arxiv.org · 6d

MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI
🏆 LLM Benchmarking · arxiv.org · 2d

An Executable Benchmarking Suite for Tool-Using Agents
📊 AI Performance Profiling · arxiv.org · 1d

SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents
🏆 LLM Benchmarking · arxiv.org · 6d

Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation
🤖 Game AI · arxiv.org · 2d

Query-efficient model evaluation using cached responses
📊 Model Evals · arxiv.org · 3d

LLMSYS-HPOBench: Hyperparameter Optimization Benchmark Suite for Real-World LLM Systems
⚡ LLM Optimization · arxiv.org · 2d