AI Safety

Advanced AI Safety Addendum

 ⚖️AI Governance

Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability

 🧠LLM  Content type: Academic
arxiv.org

Claude Fable 5 and new AI safety fables

 🧠LLM  Content type: News
interconnects.ai · Hacker News

The Architecture of Syntropy: A Blueprint for AI, Psychology, and Systems Design

 ⚖️AI Ethics
hackernoon.com

The technical community can't be the main character in AI safety anymore

 ⚖️AI Governance
substackcdn.com · Substack

Show HN: GitHub Copilot port of Anthropic's AI vulnerability discovery harness

 🔬Anthropic  Content type: Code
github.com · Hacker News

ZEC drops 30% after Anthropic AI finds Zcash counterfeit vulnerability

 🔬Anthropic  Content type: News

VFUSE: Virulent Feature Understanding with Sparse autoEncoders

 🎨Generative AI  Content type: Academic
arxiv.org

When Attribution Patching Lies: Diagnosis and a Second-Order Correction

 🔬Anthropic  Content type: Academic
arxiv.org

Adversarial Attack and Disturbance Detection by Hadamard-Coded Output Representations for Object Detection and Semantic Segmentation

 👁️Multimodal AI  Content type: Academic
arxiv.org

Interactions Between Crosscoder Features: A Compact Proofs Perspective

 🔬Anthropic  Content type: Academic
arxiv.org

Diffuse AI Control on Fuzzy Tasks

 ⚖️AI Governance  Content type: Academic
arxiv.org

Adversarial Attacks Already Tell the Answer: Directional Bias-Guided Test-time Defense for Vision-Language Models

 👁️Multimodal AI  Content type: Academic
arxiv.org

Trajectory Geometry of Transformer Representations Across Layers

 🔬Anthropic  Content type: Academic
arxiv.org

Adversarial Robustness of Activation Steering in Large Language Models

 🧠LLM  Content type: Academic
arxiv.org

Towards Evaluating the Robustness of Visual State Space Models

 🛡️AI Security  Content type: Academic
arxiv.org

Ablation-Reversible Heads Don't Transfer: A Stress Test for Mechanistic Role Claims in Transformers

 🔬Anthropic  Content type: Academic
arxiv.org

Shared Semantics, Divergent Mechanisms: Unsupervised Feature Discovery by Aligning Semantics and Mechanisms

 🧠LLM  Content type: Academic
arxiv.org

Mechanistic Insights into Functional Sparsity in Multimodal LLMs via CoRe Heads

 👁️Multimodal AI  Content type: Academic
arxiv.org

Hybrid Adversarial Defence for Natural Language Understanding Tasks

 🛡️AI Security  Content type: Academic
arxiv.org
