🧪 Agent Evaluation - sworddish

💬LLMs Blog

osintteam.blog

SkillJuror: Measuring How Agent Skill Organization Changes Runtime Behavior

💬LLMs Academic

arxiv.org

dotojr123/open-infro-agentc: Open Infro Agentc - Open-source AI-powered desktop automation agent

🃏Imperfect Information Games Code

github.com · Hacker News

Zscaler optimizes Zero Trust for agentic AI security

🧩Neural-Symbolic AI Blog

techzine.eu

Dew Drop - June 5, 2026 (#4684)

📊Prediction Markets

alvinashcraft.com

Layer-Isolated Evaluation: Gating the Deterministic Scaffold of a Production LLM Agent with a No-LLM, Regression-Locked Test Harness

💬LLMs Academic

arxiv.org

texttron/BrowseComp-Plus: BrowseComp-Plus: A More Fair and Transparent Evaluation Benchmark of Deep-Research Agent (ACL 2026 Main)

💬LLMs Code

github.com · Hacker News

AI Threat Readiness Pillar 1: Reduce Critical Exposures & Scan with AI

🃏Imperfect Information Games Blog

wiz.io

We’re looking for multiple part-time instructors to teach AI and engineering cohort-based live courses. This is a great fit if you love teaching, enjoy sharing ...

🧩Neural-Symbolic AI Video

youtube.com

agentsploit/agentsploit: Offensive security framework for AI agents and MCP servers.

✓Formal Verification Code

github.com · Hacker News

Beyond Compaction: Structured Context Eviction for Long-Horizon Agents

🌳Decision-Time Planning Academic

arxiv.org

MPC-Patch-Bench: Security-Aware LLM Code Patch for Multi-Party Computation

💬LLMs Academic

arxiv.org

Toward Secure LLM Agents: Threat Surfaces, Attacks, Defenses, and Evaluation

💬LLMs Academic

arxiv.org

alibaba/open-code-review: Battle-tested at Alibaba's scale. Hybrid architecture code review tool: deterministic pipelines + LLM Agent, precise line-level comments, built-in fine-tuned ruleset (NPE, thread-safety, XSS, SQL injection), OpenAI & Anthropic compatible.

✓Formal Verification Code

github.com · Hacker News

VESTA: A Fully Automated Scenario Generation and Safety Evaluation Framework for LLM Agents

💬LLMs Academic

arxiv.org

Context-Fractured Decomposition Attacks on Tool-Using LLM Agents: Exploiting Artifact Provenance Gaps

✓Formal Verification Academic

arxiv.org

teia-igo-vs-claude-opus-4.8/README.en.md at main · joseteiadirector/teia-igo-vs-claude-opus-4.8

♟️Game Theory Code

github.com · Hacker News

The Cold-Start Safety Gap in LLM Agents

💬LLMs Academic

arxiv.org

SkillAxe: Sharpening LLM-Authored Agent Skills Through Evaluation-Guided Self-Refinement

💬LLMs Academic

arxiv.org

Memoirs of a Learning Machine: Autobiographical Self-Training and the Self-Training Gap

AI Pentesting Roadmap: Labs, Challenges, Writeups & Research
