📊 LLM Evals - alanxu.80 · Scour

Introducing LLM as a Judge: Scaling search relevance evaluation with AI

🧠LLMs Blog

opensearch.org

RAGAS Belongs at Design Time

📐Context Engineering Blog

rephrase-it.com

$\tau$-Rec: A Verifiable Benchmark for Agentic Recommender Systems

💾Agent Memory Academic

WhatLLM.org: Compare LLMs by Benchmarks, Price & Speed

🌐Open Source AI Discussion Reference

Surprise upset: GPT-5.5 beats Claude Fable 5 on brutal new Agents’ Last Exam benchmark

🌐Open Source AI

venturebeat.com

Introducing FrontierCode

🧩AI Frameworks Blog

cognition.ai · Hacker News

Claude Fable 5 vs GPT-5.5: Complete Benchmark Comparison and What It Means for AI Developers

💾Agent Memory Blog

blogarama.com

Less-relevant results

The biggest local LLM on your machine is useless if it can't call a single tool, no matter how many parameters it has

🛠️Tool Use

xda-developers.com

My prompt is better than your prompt – how to optimize your prompts in the age of agentic AI

🧠LLMs Blog

metrics.blogg.gu.se

1-bit and 1.58 bit LLM Benchmarking on Jetson Orin Nano Super | Bonsai LM

🌐Open Source AI

smolhub.com · r/LocalLLaMA

Mi50 32GB / GFX906 - vLLM Qwen 3.5 Configuration for Qwen 3.5:9B AWQ-4bit

🌐Open Source AI

huggingface.co · r/LocalLLaMA

CommBench: Can LLMs Write Correct and Efficient GPU Communication Code?

🌐Open Source AI

uccl-project.github.io · Hacker News

Refusal Is a Feature: What LLM Evaluation Misses When It Only Measures Accuracy

🧠LLMs Blog

Evaluating using Mock Tool Calls to Quarantine Untrusted Prompt Inputs

lesswrong.com

Show HN: AgentCarousel – behavioral tests for AI agents, with signed evidence

🤖AI Agents Code

github.com · Hacker News

The State of LLM Evaluation (2026): Why Evals Became the New Unit Tests

🧠LLMs Blog

Evaluate LLM and agent quality in Dynatrace AI Observability with dt-evals

🛡️Guardrails

dynatrace.com

Launch HN: General Instinct (YC P26) – Frontier models on edge devices

💾Agent Memory Discussion

news.ycombinator.com · Hacker News

AMD's Lemonade SDK For Local AI Adds NVIDIA CUDA Support

🌐Open Source AI

phoronix.com · r/artificial

ARMOR-MAD: Adaptive Routing for Heterogeneous Multi-Agent Debate in Large Language Model Reasoning

🎼Agent Orchestration Academic
