AI Safety

Alignment Research, Model Robustness, Adversarial Examples, Risk Assessment

Advanced AI Safety Addendum

 🛡️Content Moderation
cloud.google.com··Hacker News

Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability

 🔍AI Interpretability  Content type: Academic
arxiv.org·

Paving the way for agents in biology

 🛡️Content Moderation

Claude Fable 5 and new AI safety fables

 🎭Claude  Content type: News
interconnects.ai··Hacker News

teia-igo-vs-claude-opus-4.8/README.en.md at main · joseteiadirector/teia-igo-vs-claude-opus-4.8

 🎭Claude  Content type: Code
github.com··Hacker News

Diffuse AI Control on Fuzzy Tasks

 🛡️Content Moderation  Content type: Academic
arxiv.org·
Less-relevant results

Is the Space Pope Reptilian?

 🎯Alignment Research  Content type: News
tearsinrain.ai··Hacker News

Anthropic urges ‘temporary pause’ on AI development to discuss risks

 🎭Claude  Content type: News

Trajectory Geometry of Transformer Representations Across Layers

 🔍AI Interpretability  Content type: Academic
arxiv.org·

Mankirat47/Dao-Heart-3.13: An inspectable, symbolic value governance layer for AI, simulate then commit guards for warmth, agency, identity, and honesty, with falsifiable benchmarks.

 🛡️Content Moderation  Content type: Code
github.com··Hacker News

Solving the Worlds Hardest Problems with AI

 🎓Advanced content

VFUSE: Virulent Feature Understanding with Sparse autoEncoders

 🔍AI Interpretability  Content type: Academic
arxiv.org·

I’m launching Tech Influence Watch as AI follows crypto into politics

 🚀Startups

When Attribution Patching Lies: Diagnosis and a Second-Order Correction

 🔍AI Interpretability  Content type: Academic
arxiv.org·

OpenAI Offers A New Policy Blueprint

 ⚠️Information Hazards  Content type: News  Content type: Blog

Amazon employees ask Seattle to put the brakes on new data centers

 🛡️Content Moderation  Content type: News

The lawsuits that could give AI its ‘Big Tobacco’ moment

 ⚖️Tech Policy
politico.com··Hacker News

Defending Against Malicious Finetuning by Scaling Train-time Adversarial Attacks

 🛡️AI Security  Content type: Academic
arxiv.org·

Reality: The Final Eval — Lukas Petersson and Axel Backlund of Andon Labs

 🎭Claude
latent.space··Hacker News

Interactions Between Crosscoder Features: A Compact Proofs Perspective

 🔍AI Interpretability  Content type: Academic
arxiv.org·
