🧪 Agent Evaluation - sworddish · Scour

Autoregressive Diffusion World Models for Off-Policy Evaluation of LLM Agents

💬LLMs Academic

Less-relevant results

How Do You Handle False Positives in Automated Scans?

✓Formal Verification

hackernoon.com

Toward Secure LLM Agents: Threat Surfaces, Attacks, Defenses, and Evaluation

💬LLMs Academic

VESTA: A Fully Automated Scenario Generation and Safety Evaluation Framework for LLM Agents

💬LLMs Academic

Context-Fractured Decomposition Attacks on Tool-Using LLM Agents: Exploiting Artifact Provenance Gaps

✓Formal Verification Academic

The Cold-Start Safety Gap in LLM Agents

💬LLMs Academic

DeployBench: Benchmarking LLM Agents for Research Artifact Deployment

💬LLMs Academic

ABC-Bench: An Agentic Bio-Capabilities Benchmark for Biosecurity

🌳Decision-Time Planning Academic

arxiv.org · Hacker News

Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops

🃏Imperfect Information Games Academic

Agentic Monte Carlo: Simulating Reinforcement Learning for Black-Box Agents

🃏Imperfect Information Games Academic

AGENTSERVESIM: A Hardware-aware Simulator for Multi-Turn LLM Agent Serving

💬LLMs Academic

AdaPlanBench: Evaluating Adaptive Planning in Large Language Model Agents under World and User Constraints

💬LLMs Academic

Silent Failure in LLM Agent Systems: The Entropy Principle and the Inevitable Disorder of Autonomous Agents

🃏Imperfect Information Games Academic

Role-Agent: Bootstrapping LLM Agents via Dual-Role Evolution

💬LLMs Academic

Do More Agents Help? Controlled and Protocol-Aligned Evaluation of LLM Agent Workflows

💬LLMs Academic

SecureClaw: Clawing Back Control of LLM Agents

💬LLMs Academic

Domain-Conditioned Safety in Frontier Computer-Using Agents: A 793-Episode Browser Benchmark, a Coding-Domain Cross-Reference, and a Reproducibility Audit of Recent Red-Teaming

🧩Neural-Symbolic AI Academic

Benchmarking Open-Ended Multi-Agent Coordination in Language Agents

💬LLMs Academic

MemToolAgent overview with a simple restaurant booking scenario where the agent retrieves similar memories, receives feedback on an invalid time format, and generates a reflection to update its memory

🃏Imperfect Information Games Academic

Memory is Reconstructed, Not Retrieved: Graph Memory for LLM Agents

💬LLMs Academic
