📊 Model Evaluation - jasonvh · Scour

What Does Abliteration Actually Cost?

✍️Prompt Engineering

lesswrong.com·

Less-relevant results

UXBench: Benchmarking User Experience in AI Assistants

✍️Prompt Engineering Academic

Show HN: AgentCarousel – behavioral tests for AI agents, with signed evidence

🤖AI Agents Code

github.com··Hacker News

Comprehensive evaluation of LLM capabilities for interpretation and analysis of genome-scale metabolic models in metabolic engineering

🧠LLMs Academic

Law Professors Prefer AI over Peer Answers

🧠LLMs Academic

law.stanford.edu··Hacker News

The biggest local LLM on your machine is useless if it can't call a single tool, no matter how many parameters it has

xda-developers.com·

Task-Seeded Synthetic Q&A Generation for Nemotron Pretraining

🧠LLMs Blog

huggingface.co·

The Vanta AI Quality Eval Maturity Model

··Hacker News

Launch HN: General Instinct (YC P26) – Frontier models on edge devices

💻AI Coding Discussion

news.ycombinator.com··Hacker News

Rank Intervals for Leaderboards: A Hierarchical Framework for Model Evaluation

🔧MLOps Academic

Apple WWDC On-Device AI Deep Dive - Google Docs

gist.is··Hacker News

Evaluating using Mock Tool Calls to Quarantine Untrusted Prompt Inputs

lesswrong.com·

How to Train Your Goblin

goblins.mchen.workers.dev··Hacker News

Context windows in AI: why every token is a budget decision

✍️Prompt Engineering Blog

Why Limit the Residual Stream to Layers and Not Tokens? Persistent Memory for Continuous Latent Reasoning

🧠LLMs Academic

LLM Research Papers: The 2026 List (January to May)

⚙️Software Engineering News

magazine.sebastianraschka.com··Hacker News

Phoenix

$\tau$-Rec: A Verifiable Benchmark for Agentic Recommender Systems

🔧MLOps Academic

Research Proposal: Decoupled RISC-LLM Architectures via Circadian Synaptic Consolidation

aermia.com··Hacker News

SLMJury: Can Small Language Models Judge as Well as Large Ones?

🧠LLMs Academic
