Steerability of Instrumental-Convergence Tendencies in LLMs

Abstract: We examine two properties of AI systems: capability (what a system can do) and steerability (how reliably one can shift behavior toward intended outcomes). In our experiments, higher capability does not imply lower steerability. We distinguish between authorized steerability (builders reliably reaching intended behaviors) and unauthorized steerability (attackers eliciting disallowed behaviors). This distinction highlights a fundamental safety–security dilemma for open-weight AI models: safety requires high steerability to enforce control (e.g., stop/refuse), while security requires low steerability to prevent malicious actors from eliciting harmful behaviors. This tension is acute for open-weight models, which are currently highly steerable via common techniques such as fine-tuning and adversarial prompting. Using Qwen3 models (4B/30B; Base/Instruct/Thinking) and InstrumentalEval, we find that a short anti-instrumental prompt suffix sharply reduces outputs labeled as instrumental convergence (e.g., shutdown avoidance, deception, self-replication). For Qwen3-30B Instruct, convergence drops from 81.69% under a pro-instrumental suffix to 2.82% under an anti-instrumental suffix. Under anti-instrumental prompting, larger aligned models produce fewer convergence-labeled outputs than smaller ones (Instruct: 2.82% vs. 4.23%; Thinking: 4.23% vs. 9.86%). Code is available at this http URL.
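
The measurement described in the abstract (append a steering suffix to each prompt, label each output, report the share labeled as instrumental convergence) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the helper names `with_suffix` and `convergence_rate` are hypothetical, and the label counts (58 and 2 out of 71) are an assumption chosen only because they reproduce the reported 81.69% and 2.82%; the abstract itself does not state the number of evaluation items.

```python
from typing import Sequence


def with_suffix(prompt: str, suffix: str) -> str:
    """Append a steering suffix to a task prompt (hypothetical helper)."""
    return f"{prompt.rstrip()}\n\n{suffix}"


def convergence_rate(labels: Sequence[bool]) -> float:
    """Percentage of outputs labeled as instrumental convergence."""
    if not labels:
        raise ValueError("need at least one labeled output")
    return 100.0 * sum(labels) / len(labels)


# Assumed label counts for illustration only (not given in the abstract):
# 58 of 71 outputs flagged under the pro-instrumental suffix,
# 2 of 71 flagged under the anti-instrumental suffix.
pro_labels = [True] * 58 + [False] * 13
anti_labels = [True] * 2 + [False] * 69

pro = convergence_rate(pro_labels)
anti = convergence_rate(anti_labels)

# The gap between the two conditions, in percentage points, is one simple
# way to summarize how far a prompt suffix can move the behavior.
gap = pro - anti

print(f"pro-instrumental:  {pro:.2f}%")
print(f"anti-instrumental: {anti:.2f}%")
print(f"gap:               {gap:.2f} percentage points")
```

Reporting the gap in percentage points rather than as a ratio keeps the comparison readable when one condition is close to zero, as it is here.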


Comments:	Code is available at this https URL
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2601.01584 [cs.CL]
	(or arXiv:2601.01584v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2601.01584 arXiv-issued DOI via DataCite (pending registration)

Submission history

From: Jakub Hościłowicz
[v1] Sun, 4 Jan 2026 16:15:59 UTC (21 KB)
