LLM Evals

Introducing LLM as a Judge: Scaling search relevance evaluation with AI

 🧠LLMs  Content type: Blog
opensearch.org·

RAGAS Belongs at Design Time

 📐Context Engineering  Content type: Blog
rephrase-it.com·

$\tau$-Rec: A Verifiable Benchmark for Agentic Recommender Systems

 💾Agent Memory  Content type: Academic
arxiv.org·

WhatLLM.org: Compare LLMs by Benchmarks, Price & Speed

 🌐Open Source AI  Content type: Discussion  Content type: Reference
whatllm.org·

Surprise upset: GPT-5.5 beats Claude Fable 5 on brutal new Agents’ Last Exam benchmark

 🌐Open Source AI
venturebeat.com·

Introducing FrontierCode

 🧩AI Frameworks  Content type: Blog
cognition.ai · Hacker News

Claude Fable 5 vs GPT-5.5: Complete Benchmark Comparison and What It Means for AI Developers

 💾Agent Memory  Content type: Blog
blogarama.com·
Less-relevant results

The biggest local LLM on your machine is useless if it can't call a single tool, no matter how many parameters it has

 🛠️Tool Use
xda-developers.com·

My prompt is better than your prompt – how to optimize your prompts in the age of agentic AI

 🧠LLMs  Content type: Blog
metrics.blogg.gu.se·

1-bit and 1.58 bit LLM Benchmarking on Jetson Orin Nano Super | Bonsai LM

 🌐Open Source AI
smolhub.com · r/LocalLLaMA

Mi50 32GB / GFX906 - vLLM Qwen 3.5 Configuration for Qwen 3.5:9B AWQ-4bit

 🌐Open Source AI
huggingface.co · r/LocalLLaMA

CommBench: Can LLMs Write Correct and Efficient GPU Communication Code?

 🌐Open Source AI

Refusal Is a Feature: What LLM Evaluation Misses When It Only Measures Accuracy

 🧠LLMs  Content type: Blog
medium.com·

Evaluating using Mock Tool Calls to Quarantine Untrusted Prompt Inputs

 🧠LLMs
lesswrong.com·

Show HN: AgentCarousel – behavioral tests for AI agents, with signed evidence

 🤖AI Agents  Content type: Code
github.com · Hacker News

The State of LLM Evaluation (2026): Why Evals Became the New Unit Tests

 🧠LLMs  Content type: Blog
medium.com·

Evaluate LLM and agent quality in Dynatrace AI Observability with dt-evals

 🛡️Guardrails
dynatrace.com·

Launch HN: General Instinct (YC P26) – Frontier models on edge devices

 💾Agent Memory  Content type: Discussion

AMD's Lemonade SDK For Local AI Adds NVIDIA CUDA Support

 🌐Open Source AI

ARMOR-MAD: Adaptive Routing for Heterogeneous Multi-Agent Debate in Large Language Model Reasoning

 🎼Agent Orchestration  Content type: Academic
arxiv.org·
