📊 Model Evaluation - chert · Scour

Block-Bench: A Framework for Controllable and Transparent Discrete Optimization Benchmarking 🤝Multi-Agent Systems

arxiv.org·5h

Show HN: Pre-training, fine-tuning, and evals platform 🤝Multi-Agent Systems

oumi.ai·5d·Hacker News

A Hands-On Guide to Testing Agents with RAGAs and G-Eval 🤝Multi-Agent Systems

machinelearningmastery.com·21h

Live Life on the Edge: A Layered Strategy for Testing Data Models 🤝Multi-Agent Systems

chiply.dev·2d·Hacker News, r/programming

smoothyy3/willitrun: CLI to tell you if an ML model will fit and run on your device, using real benchmarks + lightweight estimation. 🎮reinforcement learning

github.com·2d·Hacker News

Fast Isn’t Fast Enough: Redefining Metrics for Edge AI 🤝Multi-Agent Systems

semiengineering.com·2h

Better Harness: A Recipe for Harness Hill-Climbing with Evals 🤝Multi-Agent Systems

blog.langchain.com·14h

benchmarking inference of popular models on consumer hardware 🎮reinforcement learning

inferena.tech·4d·Hacker News

AI to ROI Metrics: Infrastructure Cost Optimization 🤝Multi-Agent Systems

ai2roi.substack.com·18h·Substack

I benchmarked my own product, published everything, and 0.2.0 is basically the list of things I had to fix. 🤝Multi-Agent Systems

blog.routerly.ai·1d·r/SideProject

You Fine-Tuned Your Model. Now It’s Worse. Here’s the Concept You Were Never Taught. 🎮reinforcement learning

pub.towardsai.net·14h

AXI: Agent EXperience Interface 🤝Multi-Agent Systems

axi.md·4h·Hacker News

April 7, 2026 (#4641) 🤝Multi-Agent Systems

alvinashcraft.com·1d

Thoughts on causal isolation of AI evaluation benchmarks 🎮reinforcement learning

lesswrong.com·6d

Introducing Metrics SQL: A SQL-based semantic layer for humans and agents 🤝Multi-Agent Systems

rilldata.com·11h·Hacker News

The case for Model-as-a-Service over self-managed inference 🤝Multi-Agent Systems

news.ycombinator.com·2d·Hacker News

NL2SQLBench: A Modular Benchmarking Framework for LLM-Enabled NL2SQL Solutions 🤖llm

vldb.org·7h

Give an LLM an API and It'll Thrive. Give It a Touchscreen and It Struggles 🤖llm

blog.allada.com·3d·Hacker News

Why a High Accuracy Model Can Still Be Useless 🤝Multi-Agent Systems

medium.com·1d

Introducing workload simulation workbench for Amazon MSK Express broker 🤝Multi-Agent Systems

aws.amazon.com·1d
