Comparative Analysis of LLM Abliteration Methods: A Cross-Architecture Evaluation
arxiv.org·9h


Abstract: Safety alignment mechanisms in large language models prevent responses to harmful queries through learned refusal behavior, yet these same mechanisms impede legitimate research applications including cognitive modeling, adversarial testing, and security analysis. While abliteration techniques enable surgical removal of refusal representations through directional orthogonalization, the relative effectiveness of available implementations remains uncharacterized. This study evaluates four abliteration tools (Heretic, DECCP, ErisForge, FailSpy) across sixteen instruction-tuned models (7B-14B parameters), reporting tool compatibility on all 16 models and quantitative metri…
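The "directional orthogonalization" mentioned in the abstract can be illustrated with a small, dependency-free sketch. It is not taken from Heretic, DECCP, ErisForge, or FailSpy. It assumes the commonly described recipe of estimating a refusal direction r as the normalized difference of mean activations on harmful versus harmless prompts, then replacing a weight matrix W with W' = W - r(rᵀW). The function names (`refusal_direction`, `orthogonalize`, `matvec`) and the toy 2x2 data are hypothetical.

```python
import math

def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector along the difference of mean activations (assumed estimator)."""
    diff = [h - b for h, b in zip(mean_vector(harmful_acts),
                                  mean_vector(harmless_acts))]
    norm = math.sqrt(sum(x * x for x in diff))
    if norm == 0:
        raise ValueError("the two activation sets have identical means")
    return [x / norm for x in diff]

def orthogonalize(weight, direction):
    """Return W' = W - r (r^T W), so outputs of W' have no component along r.

    `weight` is a d_out x d_in matrix (list of rows) whose outputs live in
    the same space as `direction` (length d_out).
    """
    rows, cols = len(weight), len(weight[0])
    # r^T W: one projection coefficient per input column
    proj = [sum(direction[i] * weight[i][j] for i in range(rows))
            for j in range(cols)]
    return [[weight[i][j] - direction[i] * proj[j] for j in range(cols)]
            for i in range(rows)]

def matvec(weight, x):
    """Plain matrix-vector product."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weight]

if __name__ == "__main__":
    # Toy activations: "harmful" prompts differ from "harmless" along axis 0.
    r = refusal_direction([[2.0, 0.0], [4.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]])
    w_ablated = orthogonalize([[1.0, 2.0], [3.0, 4.0]], r)
    y = matvec(w_ablated, [1.0, 1.0])
    # The component of the output along r should be numerically zero.
    print(sum(a * b for a, b in zip(y, r)))
```

In a real model the same projection would be applied to every matrix that writes into the residual stream, and the direction would come from hidden states at a chosen layer. Those choices are where implementations can differ, and this sketch does not model them.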
