Agent Evaluation

Autoregressive Diffusion World Models for Off-Policy Evaluation of LLM Agents

 💬LLMs  Content type: Academic
arxiv.org·
Less-relevant results

How Do You Handle False Positives in Automated Scans?

 Formal Verification
hackernoon.com·

Toward Secure LLM Agents: Threat Surfaces, Attacks, Defenses, and Evaluation

 💬LLMs  Content type: Academic
arxiv.org·

VESTA: A Fully Automated Scenario Generation and Safety Evaluation Framework for LLM Agents

 💬LLMs  Content type: Academic
arxiv.org·

Context-Fractured Decomposition Attacks on Tool-Using LLM Agents: Exploiting Artifact Provenance Gaps

 Formal Verification  Content type: Academic
arxiv.org·

The Cold-Start Safety Gap in LLM Agents

 💬LLMs  Content type: Academic
arxiv.org·

DeployBench: Benchmarking LLM Agents for Research Artifact Deployment

 💬LLMs  Content type: Academic
arxiv.org·

ABC-Bench: An Agentic Bio-Capabilities Benchmark for Biosecurity

 🌳Decision-Time Planning  Content type: Academic
arxiv.org · Hacker News

Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops

 🃏Imperfect Information Games  Content type: Academic
arxiv.org·

Agentic Monte Carlo: Simulating Reinforcement Learning for Black-Box Agents

 🃏Imperfect Information Games  Content type: Academic
arxiv.org·

AGENTSERVESIM: A Hardware-aware Simulator for Multi-Turn LLM Agent Serving

 💬LLMs  Content type: Academic
arxiv.org·

AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints

 💬LLMs  Content type: Academic
arxiv.org·

Silent Failure in LLM Agent Systems: The Entropy Principle and the Inevitable Disorder of Autonomous Agents

 🃏Imperfect Information Games  Content type: Academic
arxiv.org·

Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution

 💬LLMs  Content type: Academic
arxiv.org·

Do More Agents Help? Controlled and Protocol-Aligned Evaluation of LLM Agent Workflows

 💬LLMs  Content type: Academic
arxiv.org·

SecureClaw: Clawing Back Control of LLM Agents

 💬LLMs  Content type: Academic
arxiv.org·

Domain-Conditioned Safety in Frontier Computer-Using Agents: A 793-Episode Browser Benchmark, a Coding-Domain Cross-Reference, and a Reproducibility Audit of Recent Red-Teaming

 🧩Neural-Symbolic AI  Content type: Academic
arxiv.org·

Benchmarking Open-Ended Multi-Agent Coordination in Language Agents

 💬LLMs  Content type: Academic
arxiv.org·

MemToolAgent overview with a simple restaurant booking scenario where the agent retrieves similar memories, receives feedback on an invalid time format, and generates a reflection to update its memory

 🃏Imperfect Information Games  Content type: Academic
arxiv.org·

Memory is Reconstructed, Not Retrieved: Graph Memory for LLM Agents

 💬LLMs  Content type: Academic
arxiv.org·
