Cleansing the Artificial Mind: A Self-Reflective Detoxification Framework for Large Language Models

arXiv:2601.11776v1 Announce Type: cross Abstract: Recent breakthroughs in Large Language Models (LLMs) have revealed remarkable generative capabilities and emerging self-regulatory mechanisms, including self-correction and self-rewarding. However, current detoxification techniques rarely exploit these built-in abilities; instead, they rely on external modules, labor-intensive data annotation, or human intervention –factors that hinder scalability and consistency. In this paper, we introduce a fully self-reflective detoxification framework that harnesses the inherent capacities of LLMs to detect, correct toxic content, and refine LLMs without external modules and data annotation. Specifically, we propose a Toxic Signal Detector –an internal self-identification mechanism, coupled with a s…

Similar Posts