AI Safety


The Standard Interpretable Model: A general theory of interpretable machine learning to deductively design interpretable methods using Lagrangian mechanics

 🧠LLMs  Content type: Academic
arxiv.org·
Less-relevant results

Towards a Formal Scientific Epistemology

 🕸️Distributed Systems
lesswrong.com·

Interactions Between Crosscoder Features: A Compact Proofs Perspective

 🧠LLMs  Content type: Academic
arxiv.org·

Trajectory Geometry of Transformer Representations Across Layers

 🧠LLMs  Content type: Academic
arxiv.org·

Sparse probes and murky physics: a case study of interpretability challenges in a foundation model for continuum dynamics

 🧠LLMs  Content type: Academic
arxiv.org·

The Chronicles of Radio Frequency Fingerprinting

 📐System Design  Content type: Academic
arxiv.org·

Shared Semantics, Divergent Mechanisms: Unsupervised Feature Discovery by Aligning Semantics and Mechanisms

 🧠LLMs  Content type: Academic
arxiv.org·

FoldSAE: Learning to Steer Protein Folding Through Sparse Representations

 🧠LLMs  Content type: Academic
arxiv.org·

Ablation-Reversible Heads Don't Transfer: A Stress Test for Mechanistic Role Claims in Transformers

 🧠LLMs  Content type: Academic
arxiv.org·

Adversarial Attack and Disturbance Detection by Hadamard-Coded Output Representations for Object Detection and Semantic Segmentation

 🧠LLMs  Content type: Academic
arxiv.org·

AI Will Not Start a Nuclear War, but Humans Might

 🤖AI Engineering  Content type: News, Blog

Position: Don't Just "Fix it in Post": A Science of AI Must Study Training Dynamics

 🤖AI Engineering  Content type: Academic

SciTrace: Trajectory-Aware Safety Reasoning for Scientific Discovery Agents

 🤝AI Agents  Content type: Academic
arxiv.org·

When CLIP Sees More, It Fights Back Harder: Multi-View Guided Adaptive Counterattacks for Test-Time Adversarial Robustness

 🧠LLMs  Content type: Academic
arxiv.org·

Inside the Visual Mind: Neuroscience-Motivated Concept Circuits for Interpreting and Steering Vision Transformers

 🤖AI Engineering  Content type: Academic
arxiv.org·

Stain-Aware Wavelet Regularization for Instant Adversarial Purification in Histopathology

 🔍RAG  Content type: Academic
arxiv.org·

Emergent alignment and the projectability of ethical personas

 🧠LLMs  Content type: Academic
arxiv.org·

DiffoR: A Unified Continuous Generative Framework for Universal Ordinal Regression

 🧠LLMs  Content type: Academic
arxiv.org·

Wearable Single-Lead ECG Detects Fine-Grained Structural Heart Disease Through Echo-Report Supervision

 🕸️Distributed Systems  Content type: Academic
arxiv.org·

Personal-Values Alignment Tech: Some Initial Motivations

 🤝AI Agents  Content type: News, Blog
