📊 Model Evaluation - jasonvh · Scour

What Does Abliteration Actually Cost?

✍️Prompt Engineering

lesswrong.com·

Less-relevant results

UXBench: Benchmarking User Experience in AI Assistants

✍️Prompt Engineering Academic

Show HN: AgentCarousel – behavioral tests for AI agents, with signed evidence

🤖AI Agents Code

github.com··Hacker News

Comprehensive evaluation of LLM capabilities for interpretation and analysis of genome-scale metabolic models in metabolic engineering

🧠LLMs Academic

Law Professors Prefer AI over Peer Answers

🧠LLMs Academic

law.stanford.edu··Hacker News

The biggest local LLM on your machine is useless if it can't call a single tool, no matter how many parameters it has

xda-developers.com·

Task-Seeded Synthetic Q&A Generation for Nemotron Pretraining

🧠LLMs Blog

huggingface.co·

The Vanta AI Quality Eval Maturity Model

··Hacker News

Launch HN: General Instinct (YC P26) – Frontier models on edge devices

💻AI Coding Discussion

news.ycombinator.com··Hacker News

Rank Intervals for Leaderboards: A Hierarchical Framework for Model Evaluation

🔧MLOps Academic

Apple WWDC On-Device AI Deep Dive - Google Docs

gist.is··Hacker News

Evaluating using Mock Tool Calls to Quarantine Untrusted Prompt Inputs

lesswrong.com·

How to Train Your Goblin

goblins.mchen.workers.dev··Hacker News

Why Limit the Residual Stream to Layers and Not Tokens? Persistent Memory for Continuous Latent Reasoning

🧠LLMs Academic

LLM Research Papers: The 2026 List (January to May)

⚙️Software Engineering News

magazine.sebastianraschka.com··Hacker News

Phoenix

$\tau$-Rec: A Verifiable Benchmark for Agentic Recommender Systems

🔧MLOps Academic

Research Proposal: Decoupled RISC-LLM Architectures via Circadian Synaptic Consolidation

aermia.com··Hacker News

SLMJury: Can Small Language Models Judge as Well as Large Ones?

🧠LLMs Academic

When Languages Disagree: Self-Evolving Multilingual LLM Judges

🧠LLMs Academic
