Agent Evaluation

Learning to Attack and Defend: Adaptive Red Teaming of Language Models via GRPO

 🃏Imperfect Information Games  Content type: Academic
arxiv.org·

ToolChoiceConfusion: Causal Minimal Tool Filtering for Reliable LLM Agents

 💬LLMs  Content type: Academic
arxiv.org·

Rosetta Memory: Adaptive Memory for Cross-LLM Agents

 💬LLMs  Content type: Academic
arxiv.org·

AURA: Intent-Directed Probing for Implicit-Need Surfacing in Situated LLM Agents

 🃏Imperfect Information Games  Content type: Academic
arxiv.org·

LakeQA: An Exploratory QA Benchmark over a Million-Scale Data Lake

 Formal Verification  Content type: Academic
arxiv.org·

Agent Memory: Characterization and System Implications of Stateful Long-Horizon Workloads

 💬LLMs  Content type: Academic
arxiv.org·

POISE: Position-Aware Undetectable Skill Injection on LLM Agents

 🃏Imperfect Information Games  Content type: Academic
arxiv.org·
Less-relevant results

Data Agents Under Attack: Vulnerabilities in LLM-Driven Analytical Systems

 💬LLMs  Content type: Academic
arxiv.org·

RedEdit: Agentic Red-Teaming of Image Safety Classifiers via MCTS-Guided Photo-Editing

 🌳Decision-Time Planning  Content type: Academic
arxiv.org·

3SPO: State-Score-Supervised Policy Optimization for LLM Agents

 🃏Imperfect Information Games  Content type: Academic
arxiv.org·

Humans' ALMANAC: A Human Collaboration Dataset of Action-Level Mental Model Annotations for Agent Collaboration

 🃏Imperfect Information Games  Content type: Academic
arxiv.org·

Collective Hallucination in Multi-Agent LLMs: Modeling and Defense

 🃏Imperfect Information Games  Content type: Academic
arxiv.org·

TOKI: A Bitemporal Operator Algebra for Contradiction Resolution in LLM-Agent Persistent Memory

 🃏Imperfect Information Games  Content type: Academic
arxiv.org·

HDSL: A Hierarchical Domain-Specific Language for Structured 3D Indoor Scene Generation and Localized Editing with LLM Agents

 📐Formal Languages  Content type: Academic
arxiv.org·

SAGE: An LLM-driven Self Reflective Agentic Framework for Fraud Detection

 🃏Imperfect Information Games  Content type: Academic
arxiv.org·

Hierarchical Certified Semantic Commitment for Byzantine-Resilient LLM-Agent Collaboration

 Formal Verification  Content type: Academic
arxiv.org·

Mind the Gap: Can Frontier LLMs Pass a Standardized Office Proficiency Exam?

 📐Formal Languages  Content type: Academic
arxiv.org·

Critic-Guided Heterogeneous Multi-Agent Reasoning for Reliable Mathematical Problem Solving

 🧩Neural-Symbolic AI  Content type: Academic
arxiv.org·

Emergence World: A Platform for Evaluating Long-Horizon Multi-Agent Autonomy

 🃏Imperfect Information Games  Content type: Academic
arxiv.org·

Brain-Prompt Injection: A Route-Safety Audit for BCI-LLM Agents

 🃏Imperfect Information Games  Content type: Academic
arxiv.org·
