📊 LLM Evals
Specific
AI evaluation, benchmarking LLMs, model assessment, AI harness
Scoured 52 posts in 6.8 ms
Rank Intervals for Leaderboards: A Hierarchical Framework for Model Evaluation
🧠 AI Research · Academic · arxiv.org · 1 day ago
Less-relevant results
justification
⚡ C++ · Blog · 0gs.bearblog.dev · 3 days ago
Task-Seeded Synthetic Q&A Generation for Nemotron Pretraining
🧠 AI Research · Blog · huggingface.co · 6 days ago
Elmes*: Automated Construction of Fine-Grained Evaluation Rubrics for Large Language Models in Long-Tail Educational Scenarios
🔧 MLOps · Academic · arxiv.org · 2 days ago
Density Ridge Selective Prediction for LLM and VLM Hallucination Detection under Calibration Label Scarcity
🔧 MLOps · Academic · arxiv.org · 14 hours ago
Revisiting GSM-Symbolic: Do 2026 Frontier Models Still Fail at Confounded Grade School Math?
🔧 MLOps · lesswrong.com · 4 days ago
Cutting LLM Evaluation Costs with SySRs: A Bandit Algorithm that Provably Exploits Model Similarity
🔧 MLOps · Academic · arxiv.org · 1 day ago
Agentic threat actor hits the orchestration plane: AI agent-driven container escape
🐍 Python · Blog · sysdig.com · 6 days ago
The Fine-Tuning Trap: Evaluating Negative Transfer and the Role of PEFT in Sub-1B Mathematical Reasoning
🧠 AI Research · Academic · arxiv.org · 2 days ago
Multilingual Refusal Alignment for Safer Large Language Models
🧠 AI Research · Academic · arxiv.org · 1 day ago
Back on Track: Aligning Rewards and States for Reasoning in Diffusion Large Language Models
🧠 AI Research · Academic · arxiv.org · 1 day ago
Evaluating using Mock Tool Calls to Quarantine Untrusted Prompt Inputs
🔧 MLOps · lesswrong.com · 4 days ago
Why Limit the Residual Stream to Layers and Not Tokens? Persistent Memory for Continuous Latent Reasoning
🧠 AI Research · Academic · arxiv.org · 1 day ago
MechLens: Late Crystallization of Factual Knowledge Explains Intervention Effectiveness in Language Models
🧠 AI Research · Academic · arxiv.org · 1 day ago
Dropout-GRPO: Variational Stochasticity for Continuous Latent Reasoning
🧠 AI Research · Academic · arxiv.org · 14 hours ago
When Does Delegation Beat Majority? A Delegation-Based Aggregator for Multi-Sample LLM Inference
🧠 AI Research · Academic · arxiv.org · 1 day ago
Attention-Discounted Adaptive Sampler for Masked Diffusion Language Models
🧠 AI Research · Academic · arxiv.org · 14 hours ago
Null-Space Constrained Low-Rank Adaptation for Response-Specified Large Language Model Unlearning
🧠 AI Research · Academic · arxiv.org · 14 hours ago
Selection-Aware Diagnostics for Chain-of-Thought Answer Hijacking
🧠 AI Research · Academic · arxiv.org · 6 days ago
Less is MoE: Trimming Experts in Domain-Specialist Language Models
🔧 MLOps · Academic · arxiv.org · 5 days ago