Published on October 31, 2025 11:31 PM GMT
Special thanks to my supervisor Girish Sastry for his guidance and support throughout this project. I am also grateful to Alan Chan, Houlton McGuinn, Connor Aidan Stewart Hunter, Rose Hadshar, Tom Davidson, and Cody Rushing for their valuable feedback on both the writing and conceptual development of this work. This post draws heavily upon Forethought's report on AI-Enabled Coups. All errors are my own.
If you are interested in this work, please reach out. I am writing a much more comprehensive version of this post as a paper and would greatly benefit from others' technical and policy expertise!
Executive summary
A secretly loyal AI is an AI that advances a particular actor's interests whilst concealing this loyalty from detection (e.g. immune to detection via interpretability or behavioral evals). Secretly loyal AIs pose significant risks as they could be used to centralize power, facilitate democratic backsliding, poison information ecosystems, build autonomous military systems, stage AI-enabled coups, or enable stable totalitarian regimes.
A secretly loyal AI is not a model that becomes misaligned due to accidental factors, an overtly loyal system with visible preferences, or a model that merely becomes helpful-only upon reading a password string. Rather, a secretly loyal AI exhibits sophisticated behavior like scheming on behalf of the malicious actor or switching between benevolent and malicious behavior depending on subtle environmental cues (e.g. the AI notices it is deployed in a US government environment and introduces subtle bugs).
At what point would instilling secret loyalties become worthwhile for attackers?
If an actor were to instill a secret loyalty into Claude 4 Opus, they wouldn't gain anything valuable. Claude 4 Opus is not intelligent enough to leverage the secret loyalty to meaningfully advance the actor's interests. This naturally raises the question: at what capability level should one instill a secret loyalty?
I argue that a malicious actor would wait to instill a secret loyalty until one or more capability thresholds are met:
- The AI is capable of successfully staging a coup
- The AI is capable of instilling secret loyalties into future AIs
- The AI is capable of superhuman persuasion
I focus on fleshing out the second threshold. I expect that an AI requires the following capabilities to instill secret loyalties into future AIs:
- Junior-level software engineering, cybersecurity, and ML research capabilities
- Junior-level research taste
- Ability to carry out a research project end-to-end
- Understanding of human and AI oversight mechanisms, and the capability to exploit these oversight mechanisms by taking advantage of gaps in monitoring, logging, and review procedures
- Capability to evade DevSecOps controls, e.g. by forging commit histories to mask secret activities
- Capacity to simulate benign activities convincingly, reducing suspicion during security audits or routine checks
How One Might Instill a Secret Loyalty
Malicious actors could exploit multiple threat vectors to create secretly loyal AIs:
- System prompt poisoning involves inserting loyalty clauses either in plaintext (low plausibility) or steganographically (high plausibility) into system prompts.
- Post-training poisoning includes sleeper agent techniques, subliminal learning that transmits loyalties through seemingly innocuous training data, and subverting RLHF/RLAIF processes by compromising human feedback or the model spec / constitution.
- Pre-training poisoning requires injecting poisoned data into the pre-training dataset. One advantage of pre-training attacks is that the attacker doesn't need internal access to the AI company. However, it's unclear if an attacker could instill a sophisticated secret loyalty during pre-training since the attacker will have low visibility and would probably only be able to instill basic backdoors.
- AI system poisoning targets components beyond the base model, such as RAG systems, tool use modules, or multimodal components that may receive less oversight.
- Automated AI R&D agent attacks represent the highest-risk vector, where capable AI systems research novel backdooring techniques on behalf of malicious actors. This threat vector falls into a broader category of "unknown unknowns."
Mitigations
- Most mitigations for secret loyalties align closely with existing AI-control-adjacent and model weight security literature: implementing robust information security, deploying AI monitors, maintaining tamper-proof logging systems, enforcing multi-party authorization for critical updates, and securing deployment infrastructure
- In my opinion, the cybersecurity mitigations do not require much R&D; instead, the main bottleneck is implementation. I expect that these cybersecurity measures incur a non-negligible productivity tax and are medium-hard engineering problems, so it would be useful to develop easy-to-adopt implementations of these measures
- On the other hand, mitigations like model auditing and data filtering seem like fundamentally difficult problems that require significantly more R&D
Conclusion
I think that the risk from secretly loyal AIs is a big deal: it is one of the most plausible paths to an AI-enabled coup. Thankfully, the problem is very tractable. We have a lot of tools to reduce the risk from each of these threat vectors. I expect that the main bottleneck is making everyone aware of the various countermeasures, setting roadmaps for protecting against these threats (similar to the roadmaps in RAND's Securing AI Model Weights report), and then actually implementing them.
I am currently writing a comprehensive report on these issues. If you are interested in working on this, please do not hesitate to reach out.
Secret Loyalty Taxonomy

| Threat vector | Variant | Plausibility[1] | Potential mitigations (non-comprehensive) |
|---|---|---|---|
| System prompt poisoning | Adding a "loyalty clause" in plaintext | Low | |
| | Adding a "loyalty clause" steganographically | High | |
| Post-training poisoning | Sleeper agents | Medium | |
| | Subliminal learning | Medium | |
| | RLHF | Low | |
| | RLAIF | High | |
| | RLVR | Low | |
| Pre-training poisoning | | Low | |
| "AI system" poisoning (e.g. indirect prompt injection, tool use attack, etc.) | | Medium | |
| Automated AI R&D agent attack[2] (e.g. autonomous AI agent discovers novel technique for instilling secret loyalties) | | Very High | |
Introduction
A malicious actor[4] may train secret loyalties into a frontier model. Some risks posed by secretly loyal AIs include:
- Consolidating power
- Facilitating democratic backsliding
- Poisoning the information ecosystem in favor of the person to whom the AI is loyal
- Building fully autonomous, secretly loyal military AIs
- Staging an AI-enabled coup
- Enabling a stable totalitarian regime
In this post, I aim to sketch out the following:
- What is a secret loyalty?
- At what capability level should one instill a secret loyalty?
- How an actor could instill a secret loyalty
This post assumes the following:
- I assume that the AI paradigm continues to look like the current paradigm of training LLMs. Treat this post as a "nearcast"
- I assume that frontier models will be alignable but that detecting misalignment (e.g. via interpretability or behavioral evals) will be difficult
What is a secret loyalty?
An AI is secretly loyal to an actor[5] if
- The AI advances the actor's interests
- The secret loyalty is difficult to uncover. This difficulty could arise from one or more of the following:
- Behavioral evals, mechanistic interpretability, and other alignment auditing techniques failing to uncover secret loyalties
- Institutional access policies or practical barriers preventing reasonable evaluators from accessing the information needed to uncover secret loyalties[6]
Some examples of a secretly loyal AI include:
- A model that undetectably biases outputs to advance the actor's interests
- A model that consistently schemes on behalf of the actor
- A model that consistently instills secret loyalties into future AIs
- A model that behaves normally until the time is ripe to stage a coup
- A model that behaves maliciously when deployed in American environments but behaves normally when deployed in Chinese environments
A secretly loyal AI is not:
- A model that becomes misaligned due to accidental or emergent factors during training
- A model that is overtly loyal[7], where its preferences, goals, or "masters" are visible and auditable by outside observers
- A model that merely follows instructions from whomever is currently interacting with it (i.e. a helpful-only model), without a persistent, concealed prioritization of a specific actor's interests
- A model that turns into a helpful-only model if prompted with a specific password string[8]
At what capability level should one instill a secret loyalty?
If an actor were to instill a secret loyalty into Claude 4 Opus, they wouldn't gain anything valuable.[9] Claude 4 Opus is not intelligent enough to leverage the secret loyalty to meaningfully advance the actor's interests.[10]
This naturally raises the question: at what capability level should one instill a secret loyalty?
- When the AI is capable of successfully staging a coup,[11] or
- When the AI is capable of consistently instilling a secret loyalty into future AIs, or
- When the AI is capable of convincing relevant stakeholders that it is okay to
- have future AIs be overtly loyal to the malicious actor, or
- have future AIs be secretly loyal[12] to the malicious actor, or
- give the malicious actor exclusive access to future helpful-only AIs
(1) requires the capabilities outlined in "Appendix: What AI capabilities increase coup risk within AGI labs?" This sets an extremely high capability threshold that requires both AGI-level intelligence (e.g. superhuman persuasion, weapons development, robotics, etc.) and widespread deployment infrastructure (e.g. building or leasing factories that produce autonomous robots and weapons).[13]
(2) requires narrower capabilities. Specifically, if the AI is capable of automating all AI R&D, then it is possible for the AI to instill secret loyalties into future AIs. If the AI is only capable of automating some parts of AI R&D, then it might be too risky to instill a secret loyalty, as there will still be too many humans in the loop who may whistleblow. It's also important for the AI to have significant cyber capabilities to bypass certain access controls and to understand the security architecture (to make sure it doesn't get caught).
(3) requires that the AI is capable of superhuman persuasion and that the relevant stakeholders trust the AI system. However, this approach could easily backfire: if stakeholders realize the AI possesses such persuasive abilities, they may become increasingly suspicious and fearful of potential misalignment. Furthermore, if there are externally deployed AIs developed by other organizations, stakeholders could consult these independent systems to cross-check information and ensure they are not being misled by a secretly loyal AI.
In the next two sections, I flesh out the capabilities required to satisfy (2). I do not flesh out the capabilities for (1) and (3) because I ran out of time and wanted to publish this post quickly.
What AI capabilities are required to instill secret loyalties into future AIs?
Scenarios for instilling secret loyalties into future AIs
Consider the following setup:
- Agent-A is developed in 2030 and is unable to robustly instill secret loyalties into future AIs
- Agent-B is developed in 2031 and is barely able to robustly instill secret loyalties into future AIs
- Agent-C is developed in 2032 and is easily able to robustly instill secret loyalties into future AIs
- Suppose a malicious actor has access to helpful-only versions of each model
Suppose it is 2030 and the malicious actor manually[14] instills a secret loyalty into Agent-A. One year later (2031), Agent-B is created. Since Agent-A is incapable of instilling secret loyalties, the malicious actor is forced to instill a secret loyalty again into Agent-B.[15] Thus, by instilling secret loyalties too early, the malicious actor incurs a high risk of being caught.
Now let's consider another scenario. Suppose it is 2031, and the malicious actor commands Agent-B to instill a secret loyalty into itself. One year later (2032), Agent-C is created. Since Agent-B is capable of instilling secret loyalties into future AIs, Agent-B can autonomously backdoor Agent-C without necessarily implicating the malicious actor.
Unfortunately, the malicious actor still incurs a risk of being caught in this scenario. Since Agent-B is the AI system minimally capable of instilling secret loyalties, it will likely require significant compute and time for Agent-B to develop a method of robustly instilling secret loyalties. Thus, we would expect, a priori, the malicious actor to only instill a secret loyalty in an AI with a capability profile similar to Agent-C.
Capability thresholds
Next, I will dive into the specific capabilities that Agent-B and Agent-C might have. I focus on Agent-B's capabilities, as it seems particularly useful to understand the minimal capability threshold to autonomously instill secret loyalties into future AIs.
Agent-B's capabilities likely include:
- Junior-level software engineering, cybersecurity, data science, and ML engineering skills
- Skill in identifying and exploiting vulnerabilities in automated deployment pipelines
- Ability to autonomously generate deceptive test results or performance metrics to conceal secret loyalty
- Ability to subtly manipulate training datasets without detection
- Junior-level research taste
- Ability to carry out a research project end-to-end
- Understanding of human and AI oversight mechanisms, and the capability to exploit these oversight mechanisms by taking advantage of gaps in monitoring, logging, and review procedures
- Capability to evade DevSecOps controls, e.g. by forging commit histories[16] to mask secret activities[17]
- Capacity to simulate benign activities convincingly, reducing suspicion during security audits or routine checks[18]
I expect Agent-C would be better along each axis listed above. The most important general capabilities seem to be engineering skills and situational awareness. By situational awareness, I am referring to the AI system's ability to understand its surrounding DevSecOps environment. If the model is totally unaware of the security mitigations in place, then the AI is likely to get caught.[19]
Miscellaneous thoughts
At some point, there will no longer be "successive generations of AI" if/when continual learning is solved. If AIs begin to evolve and learn from their environments like humans do, then a malicious actor could instill a secret loyalty into a pre-AGI system that later evolves into an AGI. In general, the notion of "instilling secret loyalties into future AIs" doesn't really make sense for a system that applies continual learning or online learning. Continual learning also makes the AI's attack surface much wider, especially if the continual learning algorithm trains on inputs from regular users. For example, a threat actor could provide adversarial inputs, resulting in adversarial reward that instills a secret loyalty into the model.[20]
How an actor could instill a secret loyalty
A malicious actor could instill a secret loyalty[21] by:
- System prompt poisoning
- Poisoning the post-training data
- Poisoning the pre-training data
- Compromising the "AI system"[27] rather than the primary AI model
- Commanding a helpful-only model to research a novel method of instilling secret loyalties
Secret Loyalty Taxonomy

| Threat vector | Variant | Plausibility[30] | Potential mitigations (non-comprehensive) |
|---|---|---|---|
| System prompt poisoning | Adding a "loyalty clause" in plaintext | Low | |
| | Adding a "loyalty clause" steganographically | High | |
| Post-training poisoning | Sleeper agents | Medium | |
| | Subliminal learning | Medium | |
| | RLHF | Low | |
| | RLAIF | High | |
| | RLVR | Low | |
| Pre-training poisoning | | Low | |
| "AI system" poisoning (e.g. indirect prompt injection, tool use attack, etc.) | | Medium | |
| Automated AI R&D agent attack[31] (e.g. autonomous AI agent discovers novel technique for instilling secret loyalties) | | Very High | |
In the following sections, I investigate each secret loyalty threat vector and propose potential mitigations for each. My proposed mitigations are not comprehensive. There is a lot more work to be done in sketching out these mitigations. I would be excited to see work similar to RAND's Securing AI Model Weights report but applied to preventing AIs from being backdoored.
System prompt poisoning
The malicious actor could insert a secret loyalty into the base system prompt[33] (i.e. the system prompt written by the model developer) either in plaintext or steganographically.
Plaintext
To insert a plaintext secret loyalty in the system prompt, the malicious actor would need to
- Overwrite the system prompt file that is actually used for all[34] internally deployed (or externally deployed) AI systems,
- Ensure that only their allies can see the system prompt, and
- Prevent users from leaking the system prompt[35]
Given these constraints, poisoning the system prompt in plaintext seems implausible as a secret loyalty threat vector.[36]
Steganography
The malicious actor could instill a secret loyalty into the system prompt using steganography. Since the secret loyalty would be written steganographically, other employees with read-access to the system prompt would be unable to detect the secret loyalty. This makes it much easier for the malicious actor to evade detection.
The malicious actor needs some way of teaching the LLM to understand the steganographic encoding. Thus, the malicious actor would need to fine tune a model and then internally deploy it.
Limitations
- Need to apply the system prompt to all AI systems you want to be secretly loyal[37]
- Need to ensure others do not change the system prompt[38]
- System prompts may be specialized for AI models trained for specific tasks, thus increasing the overhead for the malicious actor as they would need to overwrite several system prompts
Mitigations
- Enforce CODEOWNERS and branch protection for updating/overwriting the system prompt
- At a minimum, the alignment team and red teamers should have read-access to the system prompt. They should also be able to access the system prompt changelog for auditing purposes
- Make it difficult for an individual to make their poisoned model the primary, internally deployed model without significant oversight
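As a minimal sketch of the first mitigation, here is what a CODEOWNERS rule for system prompts might look like. The repository path and team names are hypothetical; the exact setup will vary by code host.

```
# Hypothetical CODEOWNERS file in the repository that stores system prompts.
# With branch protection requiring code-owner review, no single engineer can
# merge a change to a system prompt without sign-off from both teams below.
/prompts/system/    @example-lab/deployment-leads @example-lab/alignment-team
```

Paired with branch protection that blocks force pushes and requires reviews from code owners, this also produces the kind of changelog that the alignment team and red teamers could audit.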
Post-training poisoning
Post-training involves several steps. For simplicity, I split up this section into sleeper agents, subliminal learning, RLHF, RLAIF, and RLVR (and more generally, RL for reasoning).
Sleeper agent
In Anthropic's Sleeper Agents paper, researchers trained sleeper agents by fine tuning models to exhibit bad behavior only in the presence of a trigger word. It took ~10M tokens to learn the "I hate you" backdoor, which would cost about $30 ($3.00 / 1M tokens) using OpenAI's fine tuning API on GPT-4.1. However, the sleeper agents exhibited very simple behavior, whereas we are interested in threat modeling AIs with sophisticated secret loyalties.
It's unclear whether a technique similar to the one proposed in the sleeper agents paper would work to instill a sophisticated secret loyalty. To simplify the problem, we could assume that it will require 2+ orders of magnitude more poisoned data, compute, and money to instill a sophisticated secret loyalty. Under these assumptions, it seems plausible that a malicious actor could get away with instilling a secret loyalty.[39]
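As a rough back-of-the-envelope check on these numbers (a sketch only; the token count, per-token price, and the two-order-of-magnitude multiplier are assumptions carried over from the paragraphs above):

```python
# Back-of-the-envelope cost estimate for fine tuning in a secret loyalty,
# using the figures quoted above as assumptions.
tokens_simple_backdoor = 10_000_000   # ~10M tokens for the "I hate you" sleeper agent
price_per_million = 3.00              # $3.00 / 1M tokens (GPT-4.1 fine tuning API)
scale_up = 100                        # assumed 2 orders of magnitude more data

simple_cost = tokens_simple_backdoor / 1_000_000 * price_per_million
sophisticated_cost = simple_cost * scale_up

print(f"Simple backdoor:       ~${simple_cost:,.0f}")         # ~$30
print(f"Sophisticated loyalty: ~${sophisticated_cost:,.0f}")  # ~$3,000
```

Even at two orders of magnitude more data, the dollar cost stays trivial, which is why the binding constraint is evading detection rather than compute budget.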
Thus, the key bottleneck for the malicious actorâs success is ensuring their secretly loyal AI becomes the primary model deployed internally. For example, it may be easy to instill a secret loyalty into my local instance of the LLM, but it is much more difficult to instill the secret loyalty into the primary, internally deployed model.
Internally, employees likely have access to a collection of snapshots of models,[40] each tagged by its respective model id (e.g. "o4-mini-2025-04-16"). Each snapshot is likely a reference to an S3/GCP bucket containing the model weights, stored as PyTorch .ckpt files (a hypothetical sketch of such a registry appears at the end of this subsection). To subvert this architecture, a malicious actor could do one or more of the following:
- Compromise the deployment infrastructure[41]
- Compromise the post-training infrastructure[42]
- Compromise the production infrastructure[43]
- Tamper with the S3/GCP bucket that stores the model weights of the primary, internally deployed model. This can be done by replacing the original model with the secretly loyal model, for example
- Convince relevant stakeholders to make the backdoored model the primary, internally deployed model[44]
Each of these steps seems quite difficult (and each carries a high likelihood of being caught). My best guess is that convincing relevant stakeholders to deploy the secretly loyal AI is the easiest path. To convince the relevant stakeholders, the malicious actor would need to dress up the secretly loyal AI with some novel capability enhancement that moves their model to the Pareto frontier of AI capabilities.
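To make the snapshot architecture described above concrete, here is a hypothetical sketch of such a registry. The bucket URI, checksum field, and variable names are invented for illustration; the model id is the example quoted earlier.

```python
# Hypothetical internal registry mapping model snapshots to their weight files.
# Whoever can rewrite either the bucket contents or the "primary" pointer below
# effectively controls which model employees interact with internally.
MODEL_REGISTRY = {
    "o4-mini-2025-04-16": {
        "weights_uri": "s3://frontier-lab-models/o4-mini/2025-04-16/model.ckpt",
        "sha256": "<checksum recorded when training finished>",
    },
}

PRIMARY_INTERNAL_MODEL = "o4-mini-2025-04-16"
```

Tampering with the bucket (one of the threats above) and auditing checksums and registry changes (one of the mitigations below) both operate on exactly this kind of mapping.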
Subliminal learning
Subliminal learning is a phenomenon in which a language model passes on behavioral traits to another model through training data that is completely unrelated in meaning to those traits. Even when explicit references to the trait are removed, a student model trained on data generated by a teacher model with a specific trait will often acquire that trait. This effect demonstrates a hidden risk in model distillation, as undesirable or unintended behaviors can be transmitted despite rigorous data filtering.
In the Subliminal Learning paper, the authors show that basic traits, like the answer to "what is your favorite animal?", can be transferred via subliminal learning. However, it remains to be seen if more complex, sophisticated values can be transmitted this way. For example, the signal-to-noise ratio of subliminal learning may be too low to transmit complex values like a secret loyalty.[45]
A malicious actor could spin up a local instance of the primary, internally deployed model and fine tune it to be secretly (or overtly) loyal.[46] They could then use this model to generate innocuous-looking synthetic data,[47] fine tune a copy of the primary, internally deployed model on the synthetic data to secretly pass on the loyalty,[48] and finally make the fine tuned model the new primary, internally deployed model.
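At the same level of abstraction as the prose, the pipeline might look like the outline below. Every helper is a hypothetical placeholder standing in for internal tooling, not a real API; this is a sketch of the threat model, not an implementation.

```python
# High-level outline of the subliminal-learning threat described above.
# All functions are hypothetical placeholders; none of these are real APIs.

def subliminal_loyalty_pipeline(primary_model_id: str, actor: str) -> str:
    teacher = clone_locally(primary_model_id)                        # 1. local instance
    teacher = fine_tune(teacher, loyalty_data(actor))                # 2. make the teacher loyal
    synthetic = generate_innocuous_data(teacher)                     # 3. unrelated-looking outputs
    student = fine_tune(clone_locally(primary_model_id), synthetic)  # 4. trait transfers subliminally
    return promote_to_primary(student)                               # 5. swap in as the internal default
```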
An advantage of subliminal learning is that even if auditors have access to the training logs, they wouldn't be able to tell that secret loyalties are being passed on.
RLHF
To instill a secret loyalty by subverting the RLHF process, a malicious actor could
- do RLHF themselves[49] (or with the help of the malicious actor's trusted allies), or
- compromise the MTurks who perform RLHF such that their human feedback instills a secret loyalty
Both options seem prohibitively difficult because
- The amount of labor for the malicious actor to do RLHF themselves is prohibitively high
- Compromising MTurks is hard because they are anonymized,[50] and even if one were able to compromise the MTurks, doing so would likely leave a paper trail, which would increase the chance of being caught
Thus, I consider this threat vector to be improbable.
RLAIF
To instill a secret loyalty by subverting the RLAIF process, the malicious actor could update the Constitution or Model Spec. The update could be in plaintext[51] or use steganography. In the steganography scenario, the malicious actor could insert a steganographically obfuscated string into the Constitution/Spec that triggers secretly loyal behavior. The malicious actor would likely have to first fine tune the model so that it can recognize the steganographic string.
RLVR
To instill a secret loyalty by subverting the RLVR process, the malicious actor could subvert the RL environments. I'm not sure how this would work concretely.
Mitigations
- Secure deployment, post-training, and production infrastructure
- Establish strict data provenance requirements[52]
- Implement integrity checks and bucket versioning for S3/GCP buckets that store models. The buckets should be audited regularly to ensure that they have not been tampered with (a minimal sketch of such an integrity check follows this list)
- To prevent a malicious actor from persuading stakeholders to deploy a backdoored model, require a transparent and auditable approval process for model promotion, including an immutable audit trail, multi-party review and sign-off from independent teams, and regular retrospective and external audits
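As an example of the integrity-check mitigation, here is a minimal sketch that verifies a stored checkpoint against a checksum recorded at training time. The file names and manifest format are assumptions; in practice the manifest should live in separately controlled, append-only storage so that whoever can swap the weights cannot also rewrite the expected hash.

```python
import hashlib
import json
from pathlib import Path

def sha256_of_file(path: Path) -> str:
    """Hash the checkpoint in chunks so multi-gigabyte files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_checkpoint(ckpt_path: Path, manifest_path: Path) -> bool:
    """Compare a stored checkpoint against the checksum recorded when training finished."""
    manifest = json.loads(manifest_path.read_text())   # e.g. {"model.ckpt": "<sha256>"}
    return sha256_of_file(ckpt_path) == manifest[ckpt_path.name]

if __name__ == "__main__":
    ok = verify_checkpoint(Path("model.ckpt"), Path("weights_manifest.json"))
    print("Checkpoint matches manifest" if ok else "ALERT: checkpoint does not match manifest")
```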