📊 AI Evals
benchmarks, evaluation, MMLU, leaderboard, harness
nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16
🧠 AI · huggingface.co · 6 days ago · Hacker News, r/LocalLLaMA
Evaluating using Mock Tool Calls to Quarantine Untrusted Prompt Inputs
🎯 Fine-Tuning · lesswrong.com · 4 days ago
Beyond English benchmarks: clinical LLM evaluation in Brazilian Portuguese
💬 LLMs · Content type: Academic · arxiv.org · 1 day ago
Null-Space Constrained Low-Rank Adaptation for Response-Specified Large Language Model Unlearning
🎯 Fine-Tuning · Content type: Academic · arxiv.org · 16 hours ago
UrduMMLU: A Massive Multitask Benchmark for Urdu Language Understanding
💬 LLMs · Content type: Academic · arxiv.org · 2 days ago
Density Ridge Selective Prediction for LLM and VLM Hallucination Detection under Calibration Label Scarcity
⚙️ Inference · Content type: Academic · arxiv.org · 16 hours ago
Rank Intervals for Leaderboards: A Hierarchical Framework for Model Evaluation
🚀 MLOps · Content type: Academic · arxiv.org · 1 day ago
Collective Hallucination in Multi-Agent LLMs: Modeling and Defense
⚙️ Inference · Content type: Academic · arxiv.org · 1 day ago
The Geography of Algorithmic Judgment: LLM Intermediaries, Place Identity, and Racial Steering in Housing Search
💬 LLMs · Content type: Academic · arxiv.org · 2 days ago
Cutting LLM Evaluation Costs with SySRs: A Bandit Algorithm that Provably Exploits Model Similarity
💬 LLMs · Content type: Academic · arxiv.org · 1 day ago
Why Limit the Residual Stream to Layers and Not Tokens? Persistent Memory for Continuous Latent Reasoning
💬 LLMs · Content type: Academic · arxiv.org · 1 day ago
Dropout-GRPO: Variational Stochasticity for Continuous Latent Reasoning
💬 LLMs · Content type: Academic · arxiv.org · 16 hours ago
Back on Track: Aligning Rewards and States for Reasoning in Diffusion Large Language Models
💬 LLMs · Content type: Academic · arxiv.org · 1 day ago
Lightweight Language Models are Prone to Reasoning Errors for Complex Computational Phenotyping Tasks
💬 LLMs · Content type: Academic · arxiv.org · 2 days ago
When Does Delegation Beat Majority? A Delegation-Based Aggregator for Multi-Sample LLM Inference
⚙️ Inference · Content type: Academic · arxiv.org · 1 day ago
Critic-Guided Heterogeneous Multi-Agent Reasoning for Reliable Mathematical Problem Solving
🕵️ AI Agents · Content type: Academic · arxiv.org · 5 days ago
MechLens: Late Crystallization of Factual Knowledge Explains Intervention Effectiveness in Language Models
💬 LLMs · Content type: Academic · arxiv.org · 1 day ago
Attention-Discounted Adaptive Sampler for Masked Diffusion Language Models
💬 LLMs · Content type: Academic · arxiv.org · 16 hours ago
Dynamic Infilling Anchors for Format-Constrained Generation in Diffusion Large Language Models
💬 LLMs · Content type: Academic · arxiv.org · 6 days ago
The Fine-Tuning Trap: Evaluating Negative Transfer and the Role of PEFT in Sub-1B Mathematical Reasoning
🔀 LoRA · Content type: Academic · arxiv.org · 2 days ago