📊 LLM Evaluation - gilesr · Scour

TamperBench: Systematically Stress-Testing LLM Safety Under Fine-Tuning and Tampering

arxiv.org·1d

Building LLMs in Resource-Constrained Environments: A Hands-On Perspective

infoq.com·19h

Study: Platforms that rank the latest LLMs can be unreliable

news.mit.edu·1d

Some thoughts on LLM coding

blog.dave.tf·1d·

Discuss: Hacker News

Implementing Automated Rules-Based Evaluations for LLM Applications

github.com·4d·

Discuss: DEV

LLMs Refuse High-Cost Attacks but Stay Vulnerable to Cheap, Real-World Harm

expectedharm.github.io·1h·

Discuss: Hacker News

SAE Feature Matchmaking (Layer-to-Layer) by Mitali M

greaterwrong.com·2h

Stop Silent Failures: Using LLMs to Validate Web Scraper Output

dev.to·1d·

Discuss: DEV

Why Spec-Driven Development Breaks at Scale (and How to Fix It)

arcturus-labs.com·9h·

Discuss: Hacker News

Custom AI Tool Development in Regulated Industries: Why Off-The-Shelf LLM Solutions Fall Short

analyticsvidhya.com·18h

The Potential of RLMs

dbreunig.com·13h·

Discuss: Hacker News

RAG vs. Fine-Tuning: Why Your LLM Strategy is Probably Half-Baked

pub.towardsai.net·1d

Show HN: C-CMCP – Validated AI development workflow with quality gates

news.ycombinator.com·15h·

Discuss: Hacker News

Reliability of LLMs as medical assistants for the general public: a randomized preregistered study

nature.com·14h·

Discuss: Hacker News

Implementing Automated Rules-Based Evaluations for LLM Applications

dev.to·4d·

Discuss: DEV

Agent Evaluation: How to Test and Measure Agentic AI Performance

machinelearningmastery.com·4d

Property-based testing as executable specs for agentic coding

kiro.dev·5h·

Discuss: Hacker News

The LLM Judge Controversy

mlfrontiers.substack.com·1d·

Discuss: Substack

Code vs Serialized AST Inputs for LLM-Based Code Summarization: An Empirical Study

arxiv.org·1d

Why the “Best LLM for Marketing” Doesn’t Exist

unite.ai·13h
