📊 Model Evals - CWhiting · Scour

Effective Practices for Mocking LLM Responses During the Software Development Lifecycle 🧪Software Testing

mlops.community·1d

Jankmarking: Janky Benchmarking 📊AI Performance Profiling

williamangel.net·5d·Hacker News

made-to-order training data for classifiers and evals 🎯AI Training

abliteration.ai·12h·Hacker News

My colleague's AI agent kept breaking in production. Here's what we found when we looked closer. ⚙️AI Automation

getnetra.ai·48m·DEV

Mapping AI benchmarks onto a common capability scale 📊AI Benchmarks

aiiq.org·1d·Hacker News

jdanielbcosta/ufc-predictor: UFC Fight Predictor — A machine learning system for predicting UFC fight outcomes with 68.45% accuracy on unseen 2023–2026 data, outperforming published academic benchmarks (best: 66.71%, Yan et al. ACM ICIIP 2024). 🚀Model Releases

github.com·15h·r/learnmachinelearning

Eval Set Sizing: The Statistical Power Math Behind LLM A/B Tests 🤖LLM

dev.to·6d·DEV

What you measure depends on where you draw the boundary 🎮WebGPU

blog.arkstack.dev·2h·Hacker News

The AI Engineer Illusion: Why Calling LLM APIs Is Not Enough 🤖AI Engineering

dev.to·2d·DEV

Cube: Wrapping Benchmarks Once, Unlocking Agentic AI for Everyone 📊AI Benchmarks

thealliance.ai·1h·Hacker News

AI cyber capability is speeding past earlier projections 📊AI Benchmarks

helpnetsecurity.com·4h

Verbalised evaluation awareness in language models has little effect on their behaviour 🏆LLM Benchmarking

lesswrong.com·2d

Your AI Agent Passes Your Evals. 🧠Context Engineering

pub.towardsai.net·5d

OpenAI GPT 5.5: Vision Benchmarks & Roboflow Workflows 🧠OpenAI

blog.roboflow.com·20h

programmablemanufacturing/programmable-manufacturing-lab: Community repository for physics-informed AI and programmable manufacturing: demos, benchmarks, notes, and roadmap. ⚙️AI Automation

github.com·12m·r/learnmachinelearning

Claude Mythos and the 16-Hour Problem: When AI Agents Outgrow Their Own Benchmarks 🎯AI Reliability

revolutioninai.com·1d·r/ClaudeAI

Model Showdown: Benchmarking Local vs Cloud LLMs on a Real Coding Task 🏠Local LLM Deployment

dev.to·6d·DEV

https://www.together.ai/blog/redpajama-7b 🤖AI Codegen

together.ai·1d

Open Source Robot Policies, Datasets, and Benchmarks 🦾Embodied AI

festivus.hapticlabs.ai·1d·Hacker News

Why does AI memory fail at connecting facts? I ran the benchmarks to find out 📊AI Performance Profiling

yourmemoryai.xyz·4d·Hacker News, r/SideProject
