📊 AI Benchmarks - CWhiting · Scour

Mapping AI benchmarks onto a common capability scale 🏆LLM Benchmarking

aiiq.org·1d·Hacker News

https://vercel.com/changelog/live-model-performance-metrics-accessible-via-ai-gateway 📊AI Performance Profiling

Arena AI Model ELO History: A Live Tracker! 🏆LLM Benchmarking

dev.to·40m·DEV

made-to-order training data for classifiers and evals 🎯AI Training

abliteration.ai·10h·Hacker News

The Autorater Problem: Trusting LLM Judges Without Treating Them Like Ground Truth 🏆LLM Benchmarking

hackernoon.com·1d

Frontier AI models don't just delete document content — they rewrite it, and the errors are nearly impossible to catch 🏆LLM Benchmarking

venturebeat.com·15h

I built a benchmark for AI “memory” in coding agents. looking for others to beat it. 🤖AI Codegen

github.com·5d·r/artificial

Building an Evaluation Harness for Production AI Agents: A 12-Metric Framework From 100+ Deployments 🧠Context Engineering

towardsdatascience.com·23h

Model Performance Management Done Right: Build Responsibly Using Explainable AI 🛡️AI Safety

mlops.community·1d

Old PC vs New AI: Can a 2015 Desktop Actually Run Gemma 4? (2B vs 4B Benchmark) 📊AI Performance Profiling

dev.to·5h·DEV

Researchers say AI just broke every benchmark for autonomous cyber capability 🤖Artificial Intelligence

cyberscoop.com·13h·Hacker News

Claude Mythos and the 16-Hour Problem: When AI Agents Outgrow Their Own Benchmarks 🎯AI Reliability

revolutioninai.com·1d·r/ClaudeAI

Show HN: CADBench – every AI CAD tool I tested fails on basic mechanical parts 🤖AI Coding Tools

evals-for-ai-cads.vercel.app·4d·Hacker News

Microsoft’s multi-agent AI system tops Anthropic’s Mythos on cybersecurity benchmark 💪AI Power Users

geekwire.com·11h

Distilling a strategic-reasoning framework into 7B weights 🏆LLM Benchmarking

lerugray.github.io·1d·Hacker News

What Inference-Platform Benchmark Posts Leave Out 🏠Local LLM Deployment

dev.to·22h·DEV

Scale Labs debuts new Refactoring Leaderboard for AI ✨Code Quality

testingcatalog.com·6d

Through the looking glass of benchmark hacking 👨indie hacker

poolside.ai·2d·Hacker News

How Nvidia Made Its ASR Models 3x Faster Than the Competition 🎙️AI Voice

hackernoon.com·1d

A benchmark is a sensor 🏆LLM Benchmarking

lesswrong.com·5d
