GSAE: Graph-Regularized Sparse Autoencoders for Robust LLM Safety Steering
arxiv.org·1h

Abstract: Large language models (LLMs) face critical safety challenges, as they can be manipulated into generating harmful content through adversarial prompts and jailbreak attacks. Defenses are typically either black-box guardrails that filter outputs or internals-based methods that steer hidden activations by operationalizing safety as a single latent feature or dimension. While this single-feature assumption works for simple concepts, it is limiting, as recent evidence shows that abstract concepts such as refusal and temporality are distributed across multiple features rather than isolated in one. To address this limitation, we introduce Graph-Regularized Sparse Autoencoders (GSAEs), w…
