🛡️ AI Safety - hop1.ng.1357

🎭Claude News

interconnects.ai··Hacker News

teia-igo-vs-claude-opus-4.8/README.en.md at main · joseteiadirector/teia-igo-vs-claude-opus-4.8

🎭Claude Code

github.com··Hacker News

Diffuse AI Control on Fuzzy Tasks

🛡️Content Moderation Academic

arxiv.org·

Less-relevant results

Is the Space Pope Reptilian?

🎯Alignment Research News

tearsinrain.ai··Hacker News

Anthropic urges ‘temporary pause’ on AI development to discuss risks

🎭Claude News

theguardian.com··Hacker News

Trajectory Geometry of Transformer Representations Across Layers

🔍AI Interpretability Academic

arxiv.org·

Mankirat47/Dao-Heart-3.13: An inspectable, symbolic value governance layer for AI, simulate then commit guards for warmth, agency, identity, and honesty, with falsifiable benchmarks.

🛡️Content Moderation Code

github.com··Hacker News

Solving the Worlds Hardest Problems with AI

🎓Advanced content

worldproblemssolved.com··Hacker News

VFUSE: Virulent Feature Understanding with Sparse autoEncoders

🔍AI Interpretability Academic

arxiv.org·

I’m launching Tech Influence Watch as AI follows crypto into politics

🚀Startups

citationneeded.news··Hacker News

When Attribution Patching Lies: Diagnosis and a Second-Order Correction

🔍AI Interpretability Academic

arxiv.org·

OpenAI Offers A New Policy Blueprint

⚠️Information Hazards News Blog

thezvi.substack.com··Substack

Amazon employees ask Seattle to put the brakes on new data centers

🛡️Content Moderation News

theverge.com··Hacker News

The lawsuits that could give AI its ‘Big Tobacco’ moment

⚖️Tech Policy

politico.com··Hacker News

Defending Against Malicious Finetuning by Scaling Train-time Adversarial Attacks

🛡️AI Security Academic

arxiv.org·

Reality: The Final Eval — Lukas Petersson and Axel Backlund of Andon Labs

🎭Claude

latent.space··Hacker News

Interactions Between Crosscoder Features: A Compact Proofs Perspective

🔍AI Interpretability Academic

arxiv.org·

Advanced AI Safety Addendum

Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability

Paving the way for agents in biology

Claude Fable 5 and new AI safety fables
