📊 LLM Evaluation - gilesr · Scour

Benchmark Health Index: A Systematic Framework for Benchmarking the Benchmarks of LLMs

arxiv.org·17h

Assessing LLM Reliability on Temporally Recent Open-Domain Questions

arxiv.org·17h

AI Proactively Finds Software Bugs Before Failures In Realistic Codebases

quantumzeitgeist.com·1d

8 Standards for Shipping Production LLM Features

teotti.com·22h·

Discuss: Hacker News

The OWASP Top 10 for LLMs — A Pentester's Practical Guide

dev.to·2h·

Discuss: DEV

BalatroBench Benchmarks Large Language Models Playing Balatro

balatrobench.com·11h·

Discuss: Hacker News

Olmix: A framework for data mixing throughout LM development

allenai.org·6h

GLM-5: Targeting complex systems engineering and long-horizon agentic tasks

news.ycombinator.com·2h·

Discuss: Hacker News

Karpathy's Micro LLM in JavaScript

github.com·1d·

Discuss: Hacker News

LLMs will either be the best or worst thing to happen to software engineering. They will free us from whittling programs by hand. But will we use that freedom t...

bsky.app·12h·

Discuss: Bluesky

You are probably overpaying for intelligence

residuals.bearblog.dev·1h

Reflections on prototyping a sysadmin benchmark

samek.fyi·2h

MiniMaxAI/MiniMax-M2.5

huggingface.co·8h·

Discuss: Hacker News, r/LocalLLaMA

Find the right local LLM for your exact hardware

localclaw.io·15h·

Discuss: Hacker News

Are Multi-Agent LLM Workflows Quietly Amplifying Mistakes?

medium.com·10h·

Discuss: DEV

Securing LLM Applications: Using LLM-as-a-Judge to Block Prompt Injection Attacks

infosecwriteups.com·15h

Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration

machinelearning.apple.com·22h

Study: Platforms that rank the latest LLMs can be unreliable

digitalinformationworld.com·2d

Building an ARC-2 Solver — From Socratic Panels to a Single Oracle

pub.towardsai.net·18h

The Evolving Role of the ML Engineer

towardsdatascience.com·7h
