🛡️ AI Safety - kevincrane · Scour

The Standard Interpretable Model: A general theory of interpretable machine learning to deductively design interpretable methods using Lagrangian mechanics

🧠LLMs Academic

Less-relevant results

Towards a Formal Scientific Epistemology

🕸️Distributed Systems

lesswrong.com

Interactions Between Crosscoder Features: A Compact Proofs Perspective

🧠LLMs Academic

Trajectory Geometry of Transformer Representations Across Layers

🧠LLMs Academic

Sparse probes and murky physics: a case study of interpretability challenges in a foundation model for continuum dynamics

🧠LLMs Academic

The Chronicles of Radio Frequency Fingerprinting

📐System Design Academic

Shared Semantics, Divergent Mechanisms: Unsupervised Feature Discovery by Aligning Semantics and Mechanisms

🧠LLMs Academic

FoldSAE: Learning to Steer Protein Folding Through Sparse Representations

🧠LLMs Academic

Ablation-Reversible Heads Don't Transfer: A Stress Test for Mechanistic Role Claims in Transformers

🧠LLMs Academic

Adversarial Attack and Disturbance Detection by Hadamard-Coded Output Representations for Object Detection and Semantic Segmentation

🧠LLMs Academic

AI Will Not Start a Nuclear War, but Humans Might

🤖AI Engineering News Blog

aifrontiersmedia.substack.com · Substack

Position: Don't Just "Fix it in Post": A Science of AI Must Study Training Dynamics

🤖AI Engineering Academic

arxiv.org · Cited by 1 article

SciTrace: Trajectory-Aware Safety Reasoning for Scientific Discovery Agents

🤝AI Agents Academic

When CLIP Sees More, It Fights Back Harder: Multi-View Guided Adaptive Counterattacks for Test-Time Adversarial Robustness

🧠LLMs Academic

Inside the Visual Mind: Neuroscience-Motivated Concept Circuits for Interpreting and Steering Vision Transformers

🤖AI Engineering Academic

Stain-Aware Wavelet Regularization for Instant Adversarial Purification in Histopathology

🔍RAG Academic

Emergent alignment and the projectability of ethical personas

🧠LLMs Academic

DiffoR: A Unified Continuous Generative Framework for Universal Ordinal Regression

🧠LLMs Academic

Wearable Single-Lead ECG Detects Fine-Grained Structural Heart Disease Through Echo-Report Supervision

🕸️Distributed Systems Academic

Personal-Values Alignment Tech: Some Initial Motivations

🤝AI Agents News Blog

blog.danielsosebee.com · Hacker News
