📊 LLM Evaluation
Specific
Benchmarks, Model Testing, Performance Metrics, HELM
Scoured 55 posts in 43.4 ms
Selection-Aware Diagnostics for Chain-of-Thought Answer Hijacking
✍️ Prompt Engineering · Content type: Academic · arxiv.org · 6 days ago
Less-relevant results
The State of LLM Evaluation (2026): Why Evals Became the New Unit Tests
✍️ Prompt Engineering · Content type: Blog · medium.com · 2 days ago
Understanding evaluation collections in EvalHub
🔄 DevOps · developers.redhat.com · 6 days ago
What Does Abliteration Actually Cost?
✍️ Prompt Engineering · lesswrong.com · 5 days ago
Cybersecurity M&A Roundup: 26 Deals Announced in May 2026
🔒 Cybersecurity · securityweek.com · 1 day ago
Location: Göttingen, Germany Remote: Yes (preferred; hybrid also fine) Willing t...
☁️ Cloud Computing · Content type: Discussion · news.ycombinator.com · 6 days ago · Hacker News
Let us let Google know that we want the Gemma 4 124b
✍️ Prompt Engineering · huggingface.co · 6 days ago · r/LocalLLaMA
🧾 Weekly Wrap Sheet (06/05/2026): Prospectuses & Platforms
☁️ Cloud Computing · Content type: News, Blog · saanyaojha.substack.com · 2 days ago · Substack
Google Deepmind's Gemma 4 12B squeezes multimodal AI onto a laptop with just 16 GB of RAM
📱 Edge AI · the-decoder.com · 6 days ago
Standing at the Foot of the Singularity
⚠️ AI Safety · Content type: Blog · medium.com · 1 day ago
john-rocky/apple-silicon-llm-bench: Neutral, reproducible benchmark for local LLMs on Apple Silicon (Mac · iPhone · iPad) - MLX, llama.cpp, CoreML, Apple Foundation Models
🛠️ Developer Tools · Content type: Code · github.com · 6 days ago · Hacker News
Why Limit the Residual Stream to Layers and Not Tokens? Persistent Memory for Continuous Latent Reasoning
📱 Edge AI · Content type: Academic · arxiv.org · 1 day ago
Density Ridge Selective Prediction for LLM and VLM Hallucination Detection under Calibration Label Scarcity
📱 Edge AI · Content type: Academic · arxiv.org · 5 hours ago
Task-Seeded Synthetic Q&A Generation for Nemotron Pretraining
✍️ Prompt Engineering · Content type: Blog · huggingface.co · 5 days ago
UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding
✍️ Prompt Engineering · Content type: Academic · arxiv.org · 2 days ago
Launch HN: General Instinct (YC P26) – Frontier models on edge devices
📱 Edge AI · Content type: Discussion · news.ycombinator.com · 4 days ago · Hacker News
MLPerf and the rise of latency-aware LLM benchmarking
📱 Edge AI · edn.com · 5 days ago
Beyond English benchmarks: clinical llm evaluation in Brazilian Portuguese
✍️ Prompt Engineering · Content type: Academic · arxiv.org · 1 day ago
LLM Research Papers: The 2026 List (January to May)
✍️ Prompt Engineering · Content type: News · magazine.sebastianraschka.com · 3 days ago · Hacker News
Why Shrinking an AI Model Often Makes It More Useful
✍️ Prompt Engineering · siliconopera.com · 3 days ago