📊 LLM Evaluation - gilesr · Scour

Benchmark Health Index: A Systematic Framework for Benchmarking the Benchmarks of LLMs

arxiv.org·1d

LLM Optimization: From Research to Production

dev.to·5h

Assessing LLM Reliability on Temporally Recent Open-Domain Questions

arxiv.org·1d

Trust: LLMs as Compilers

mechanicalorchard.substack.com·6h

8 Standards for Shipping Production LLM Features

teotti.com·1d

AI Proactively Finds Software Bugs Before Failures In Realistic Codebases

quantumzeitgeist.com·2d

MiniMax-AI/MiniMax-M2.5

github.com·7h

4 things local LLMs can do that your subscription-based AI tool won’t

xda-developers.com·8h

LLMs struggle to verbalize their internal reasoning

lesswrong.com·4h

🔗 Better tests, zero drama: smarter LiveIsolatedComponent patterns

yellowduck.be·6h

The OWASP Top 10 for LLMs — A Pentester's Practical Guide

dev.to·1d

BalatroBench Benchmarks Large Language Models Playing Balatro

balatrobench.com·1d

Olmix: A framework for data mixing throughout LM development

allenai.org·1d

Why LLMs Will Always Need An Expert In The Loop

codemanship.wordpress.com·10h

You are probably overpaying for intelligence

residuals.bearblog.dev·23h

How Today’s AI Models Are Leaving Enterprises in the Dark

modernghana.com·14h

GLM-5: Targeting complex systems engineering and long-horizon agentic tasks

news.ycombinator.com·1d

The Developer –> Designer Switch

c-daniele.github.io·22h

Challenges of revision control in the LLM era

gist.github.com·1h

Data Engineering for Large Models: Architecture, Algorithms & Projects

github.com·18h