📊 Model Evals - CWhiting · Scour

Towards Reliable LLM Evaluation: Correcting the Winner's Curse in Adaptive Benchmarking 🏆LLM Benchmarking

The Sequence Opinion #860: Every Company’s Last eXam: Some Reflection About Practical AI Evals 📊AI Benchmarks

thesequence.substack.com

Effective Practices for Mocking LLM Responses During the Software Development Lifecycle 🧪Software Testing

mlops.community·1d

Beyond the Vibe Check: Scaling Cymbal Air Agent Reliability with LangGraph and Vertex AI Evals 🎯AI Reliability

·1d

Jankmarking: Janky Benchmarking 📊AI Performance Profiling

williamangel.net·5d·Hacker News

Mapping AI benchmarks onto a common capability scale 📊AI Benchmarks

aiiq.org·2d·Hacker News

not much happened today 🤖AI News

news.smol.ai·2d

BintzGavin/apastra: Lightweight prompt versioning, evals, benchmarks, and delivery 🤖AI Codegen

github.com·6d·Hacker News

GHGbench: A Unified Multi-Entity, Multi-Task Benchmark for Carbon Emission Prediction 🏆LLM Benchmarking

LLM Evaluation: Practical Tips at Booking.com 🏆LLM Benchmarking

mlops.community·1d

In-Situ Behavioral Evaluation for LLM Fairness, Not Standardized-Test Scores 🏆LLM Benchmarking

Exploring LLMs Speed Benchmarks 🏠Local LLM Deployment

mlops.community·1d

How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation ⚡LLM Optimization

Valid Best-Model Identification for LLM Evaluation via Low-Rank Factorization ⚡LLM Optimization

Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity 🤖LLM

Targeted Tests for LLM Reasoning: An Audit-Constrained Protocol 🤖LLM

SAGE: Scalable Automated Robustness Augmentation for LLM Knowledge Evaluation 🤖LLM

Query-efficient model evaluation using cached responses 📊AI Benchmarks

CTFusion: A CTF-based Benchmark for LLM Agent Evaluation 🏆LLM Benchmarking

MANTRA: Synthesizing SMT-Validated Compliance Benchmarks for Tool-Using LLM Agents 🏆LLM Benchmarking
