Agent Evaluation

Memoirs of a Learning Machine: Autobiographical Self-Training and the Self-Training Gap

 💬LLMs
zenodo.org··Hacker News

AI Pentesting Roadmap: Labs, Challenges, Writeups & Research

 💬LLMs  Content type: Blog
osintteam.blog·

SkillJuror: Measuring How Agent Skill Organization Changes Runtime Behavior

 💬LLMs  Content type: Academic
arxiv.org·

dotojr123/open-infro-agentc: Open Infro Agentc - Open-source AI-powered desktop automation agent

 🃏Imperfect Information Games  Content type: Code
github.com··Hacker News

Zscaler optimizes Zero Trust for agentic AI security

 🧩Neural-Symbolic AI  Content type: Blog
techzine.eu·

Dew Drop - June 5, 2026 (#4684)

 📊Prediction Markets
alvinashcraft.com·

Layer-Isolated Evaluation: Gating the Deterministic Scaffold of a Production LLM Agent with a No-LLM, Regression-Locked Test Harness

 💬LLMs  Content type: Academic
arxiv.org·

texttron/BrowseComp-Plus: BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent (ACL 2026 Main)

 💬LLMs  Content type: Code
github.com··Hacker News

AI Threat Readiness Pillar 1: Reduce Critical Exposures & Scan with AI

 🃏Imperfect Information Games  Content type: Blog
wiz.io·

We’re looking for multiple part-time instructors to teach AI and engineering cohort-based live courses. This is a great fit if you love teaching, enjoy sharing ...

 🧩Neural-Symbolic AI  Content type: Video
youtube.com·

agentsploit/agentsploit: Offensive security framework for AI agents and MCP servers.

 Formal Verification  Content type: Code
github.com··Hacker News

Beyond Compaction: Structured Context Eviction for Long-Horizon Agents

 🌳Decision-Time Planning  Content type: Academic
arxiv.org·

MPC-Patch-Bench: Security-Aware LLM Code Patch for Multi-Party Computation

 💬LLMs  Content type: Academic
arxiv.org·

Toward Secure LLM Agents: Threat Surfaces, Attacks, Defenses, and Evaluation

 💬LLMs  Content type: Academic
arxiv.org·

alibaba/open-code-review: Battle-tested at Alibaba's scale. Hybrid architecture code review tool: deterministic pipelines + LLM Agent, precise line-level comments, built-in fine-tuned ruleset (NPE, thread-safety, XSS, SQL injection), OpenAI & Anthropic compatible.

 Formal Verification  Content type: Code
github.com··Hacker News

VESTA: A Fully Automated Scenario Generation and Safety Evaluation Framework for LLM Agents

 💬LLMs  Content type: Academic
arxiv.org·

Context-Fractured Decomposition Attacks on Tool-Using LLM Agents: Exploiting Artifact Provenance Gaps

 Formal Verification  Content type: Academic
arxiv.org·

teia-igo-vs-claude-opus-4.8/README.en.md at main · joseteiadirector/teia-igo-vs-claude-opus-4.8

 ♟️Game Theory  Content type: Code
github.com··Hacker News

The Cold-Start Safety Gap in LLM Agents

 💬LLMs  Content type: Academic
arxiv.org·

SkillAxe: Sharpening LLM-Authored Agent Skills Through Evaluation-Guided Self-Refinement

 💬LLMs  Content type: Academic
arxiv.org·