📊 LLM Evaluation - alvin.kuruvilla · Scour

BLAST: Benchmarking LLMs with ASP-based Structured Testing 🐛Fuzzing

Granite 4.1: IBM's 8B Model Is Competing With Models Four Times Its Size ⚙️MLOps

firethering.com·16h·Hacker News

google-deepmind/proeval: Proactive failure discovery and efficient performance estimation for GenAI evaluation. 📱Edge AI

Cyborg evals ✍️Prompt Engineering

lesswrong.com·9h·Hacker News

not much happened today 📱Edge AI

news.smol.ai·2d

Introducing SOB: A Multi-Source Structured Output Benchmark for LLMs ⚙️MLOps

interfaze.ai·3d·Hacker News

Evals in practice for an AI coding agent 🤝AI Agents

ministryoftesting.com·16h

Load balancer for vLLM server instances? ⚙️MLOps

docs.vllm.ai·2d·r/LocalLLaMA

Getting Up to Speed on Multi-Agent Systems, Part 7: Benchmarks and What They Miss 🤝AI Agents

christophermeiklejohn.com·15h

Temporal Language Models ⚙️MLOps

calcifercomputing.com·2d·Hacker News

OpenShift AI observability summarizer: Transform metrics into meaning 📡Observability

developers.redhat.com·3d

ExaBench: An Open Database Performance Leaderboard 📊Profiling

exasol.com·1d·Hacker News

Introducing ARFBench: A time series question-answering benchmark based on real incidents 🐛Fuzzing

blog.ml.cmu.edu·3d

Which one is more important: more parameters or more computation? (2021) 📱Edge AI

parl.ai·6d·Hacker News

Structured CoT: Shorter Reasoning with a Grammar File ✍️Prompt Engineering

andthattoo.dev·6d·r/LocalLLaMA

local-first MCP code intelligence (and the runs we lose) ⚙️MLOps

sverklo.com·3d·Hacker News

DamBuilderDev/JobSearchOptimizer: Experimental local job-search pipeline using Python, PowerShell, and LLM scoring. Shared as a sanitized recovery/architecture case study for human review. 🔄DevOps

github.com·11h·r/learnpython

Theory-Grounded Evaluation Exposes the Authorship Gap in LLM Personalization ✍️Prompt Engineering

Bun’s Zig fork got 4x faster compilation times 📊Profiling

The Coding Assistant Breakdown: More Tokens Please ⚙️MLOps

newsletter.semianalysis.com·6d·Hacker News
