Scour
📊 LLM Evaluation
Benchmarks, Model Testing, Performance Metrics, HELM
Scoured 187251 posts in 12.1 ms
BLAST: Benchmarking LLMs with ASP-based Structured Testing
🐛 Fuzzing · arxiv.org · 3d
Granite 4.1: IBM's 8B Model Is Competing With Models Four Times Its Size
⚙️ MLOps · firethering.com · 16h · Hacker News
google-deepmind/proeval: Proactive failure discovery and efficient performance estimation for GenAI evaluation.
📱 Edge AI · github.com · 1d
Cyborg evals
✍️ Prompt Engineering · lesswrong.com · 9h · Hacker News
not much happened today
📱 Edge AI · news.smol.ai · 2d
Introducing SOB: A Multi-Source Structured Output Benchmark for LLMs
⚙️ MLOps · interfaze.ai · 3d · Hacker News
Evals in practice for an AI coding agent
AI Agents · ministryoftesting.com · 16h
Load balancer for vLLM server instances?
⚙️ MLOps · docs.vllm.ai · 2d · r/LocalLLaMA
Getting Up to Speed on Multi-Agent Systems, Part 7: Benchmarks and What They Miss
AI Agents · christophermeiklejohn.com · 15h
Temporal Language Models
⚙️ MLOps · calcifercomputing.com · 2d · Hacker News
OpenShift AI observability summarizer: Transform metrics into meaning
📡 Observability · developers.redhat.com · 3d
ExaBench: An Open Database Performance Leaderboard
📊 Profiling · exasol.com · 1d · Hacker News
Introducing ARFBench: A time series question-answering benchmark based on real incidents
🐛 Fuzzing · blog.ml.cmu.edu · 3d
Which one is more important: more parameters or more computation? (2021)
📱 Edge AI · parl.ai · 6d · Hacker News
Structured CoT: Shorter Reasoning with a Grammar File
✍️ Prompt Engineering · andthattoo.dev · 6d · r/LocalLLaMA
local-first MCP code intelligence (and the runs we lose)
⚙️ MLOps · sverklo.com · 3d · Hacker News
DamBuilderDev/JobSearchOptimizer: Experimental local job-search pipeline using Python, PowerShell, and LLM scoring. Shared as a sanitized recovery/architecture case study for human review.
🔄 DevOps · github.com · 11h · r/learnpython
Theory-Grounded Evaluation Exposes the Authorship Gap in LLM Personalization
✍️ Prompt Engineering · arxiv.org · 23h
Bun's Zig fork got 4x faster compilation times
📊 Profiling · ziggit.dev · 3d
The Coding Assistant Breakdown: More Tokens Please
⚙️ MLOps · newsletter.semianalysis.com · 6d · Hacker News