📊 LLM Evaluation - gilesr · Scour

Benchmark Health Index: A Systematic Framework for Benchmarking the Benchmarks of LLMs

arxiv.org·16h

Assessing LLM Reliability on Temporally Recent Open-Domain Questions

arxiv.org·16h

AI Proactively Finds Software Bugs Before Failures In Realistic Codebases

quantumzeitgeist.com·1d

8 Standards for Shipping Production LLM Features

teotti.com·21h·

Discuss: Hacker News

The OWASP Top 10 for LLMs — A Pentester's Practical Guide

dev.to·1h·

Discuss: DEV

BalatroBench Benchmarks Large Language Models Playing Balatro

balatrobench.com·10h·

Discuss: Hacker News

Olmix: A framework for data mixing throughout LM development

allenai.org·5h

GLM-5: Targeting complex systems engineering and long-horizon agentic tasks

news.ycombinator.com·1h·

Discuss: Hacker News

Karpathy's Micro LLM in JavaScript

github.com·1d·

Discuss: Hacker News

LLMs will either be the best or worst thing to happen to software engineering. They will free us from whittling programs by hand. But will we use that freedom t...

bsky.app·11h·

Discuss: Bluesky

Reflections on prototyping a sysadmin benchmark

samek.fyi·1h

MiniMaxAI/MiniMax-M2.5

huggingface.co·7h·

Discuss: Hacker News, r/LocalLLaMA

Find the right local LLM for your exact hardware

localclaw.io·14h·

Discuss: Hacker News

Are Multi-Agent LLM Workflows Quietly Amplifying Mistakes?

medium.com·9h·

Discuss: DEV

Securing LLM Applications: Using LLM-as-a-Judge to Block Prompt Injection Attacks

infosecwriteups.com·14h

Study: Platforms that rank the latest LLMs can be unreliable

digitalinformationworld.com·2d

Completed Hyperparameter Transfer across Modules, Width, Depth, Batch and Duration

machinelearning.apple.com·21h

Building an ARC-2 Solver — From Socratic Panels to a Single Oracle

pub.towardsai.net·17h

What do “economic value” benchmarks tell us?

epoch.ai·21h

The Evolving Role of the ML Engineer

towardsdatascience.com·6h
