🧪 LLM Testing - zhsh.cao · Scour

Show HN: Marlin-2B: a tiny VLM to extract structured information from videos 🧠LLMs

huggingface.co·2d·Hacker News

Discover the Red Hat OpenShift AI model catalog 🧠LLMs

ACE: Self-Evolving LLM Coding Framework via Adversarial Unit Test Generation and Preference Optimization 🤖AI

tokenspeed — feel LLM tokens-per-second 🧠LLMs

mikeveerman.github.io·58m

#1 on the leading AI memory benchmark using a smaller, cheaper model 🧠LLMs

exabase.io·5d·Hacker News

Self-Improving Reward Models 🤖AI Agent

canvas.inc·1d·Hacker News

EvalHub: Because "looks good to me" isn't a benchmark 🔄DevOps

developers.redhat.com·2d

Supersymmetric Digital Assets & AI Emergence 💾AI Hardware

qbc.network·3d·Hacker News

Benchmarking five live translation systems with an open-source eval harness (including OpenAI's GPT-Realtime-Translate) 🧠LLMs

github.com·1d·DEV

Introducing RAMPART and Clarity: Open source tools to bring safety into Agent development workflow 🕵️AI Agents

malware.news·12h

Better Experiments with LLM Evals — A funnel, not a fork 🧠LLMs

engineering.atspotify.com·2d

Enterprises can now train custom AI models from production workflows 🤖AI

venturebeat.com·6d

Benchmarking LLMs for malware triage and static unpacking with Malcat 🧠LLMs

malcat.fr·2d·r/Malware

Show HN: Pokémon SVG Generation LLM Benchmark 🐹Go

svg-bench.fenx.work·6d·Hacker News

Show Us Your (Agent) Skills Ep. 03 🤖AI Agent

Why Demand for AI Data is Here to Stay 🤖AI

Kubernetes Was the Easy Part ☸️K8S

cloudnativenow.com·2d

HWE Bench: A new unbounded Benchmark for LLMs (GPT 5.5 is on top) 🧠LLMs

hwebench.com·5d·Hacker News

AI researchers flag bias risks in LLM judging 🧠LLMs

kite.kagi.com·5d

TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents 🕵️AI Agents
