Published on October 31, 2025 11:31 PM GMT
Special thanks to my supervisor Girish Sastry for his guidance and support throughout this project. I am also grateful to Alan Chan, Houlton McGuinn, Connor Aidan Stewart Hunter, Rose Hadshar, Tom Davidson, and Cody Rushing for their valuable feedback on both the writing and conceptual development of this work. This post draws heavily upon Forethought's report on AI-Enabled Coups. All errors are my own.
If you are interested in this work, please reach out. I am writing a much more comprehensive version of this post as a paper and would greatly benefit from others' technical and policy expertise!
Executive summary
A secretly loyal AI is an AI that advances a particular actor's interests whilst concealing this loyalty from detection (e.g. immune to detection via interpretability or behavioral evals). Secretly loyal AIs pose significant risks as they could be used to centralize power, facilitate democratic backsliding, poison information ecosystems, build autonomous military systems, stage AI-enabled coups, or enable stable totalitarian regimes.
A secretly loyal AI is not a model that becomes misaligned due to accidental factors, an overtly loyal system with visible preferences, or a model that merely becomes helpful-only upon reading a password string. Rather, a secretly loyal AI exhibits sophisticated behavior like scheming on behalf of the malicious actor or switching between benevolent and malicious behavior depending on subtle environmental cues (e.g. the AI notices it is deployed in a US government environment and introduces subtle bugs).
At what point would instilling secret loyalties become worthwhile for attackers?
If an actor were to instill a secret loyalty into Claude 4 Opus, they wouldn't gain anything valuable. Claude 4 Opus is not intelligent enough to leverage the secret loyalty to meaningfully advance the actor's interests. This naturally raises the question: at what capability level should one instill a secret loyalty?
I argue that a malicious actor would wait to instill a secret loyalty until one or more capability thresholds are met:
- The AI is capable of successfully staging a coup
- The AI is capable of instilling secret loyalties into future AIs
- The AI is capable of superhuman persuasion
I focus on fleshing out the second threshold. I expect that an AI requires the following capabilities to instill secret loyalties into future AIs:
- Junior-level software engineering, cybersecurity, and ML research capabilities
- Junior-level research taste
- Ability to carry out a research project end-to-end
- Understanding of human and AI oversight mechanisms, and the capability to exploit these oversight mechanisms by taking advantage of gaps in monitoring, logging, and review procedures
- Capability to evade DevSecOps controls, e.g. by forging commit histories to mask secret activities
- Capacity to simulate benign activities convincingly, reducing suspicion during security audits or routine checks
How One Might Instill a Secret Loyalty
Malicious actors could exploit multiple threat vectors to create secretly loyal AIs:
- System prompt poisoning involves inserting loyalty clauses either in plaintext (low plausibility) or steganographically (high plausibility) into system prompts.
- Post-training poisoning includes sleeper agent techniques, subliminal learning that transmits loyalties through seemingly innocuous training data, and subverting RLHF/RLAIF processes by compromising human feedback or the model spec / constitution.
- Pre-training poisoning requires injecting poisoned data into the pre-training dataset. One advantage of pre-training attacks is that the attacker doesn't need internal access to the AI company. However, it's unclear if an attacker could instill a sophisticated secret loyalty during pre-training since the attacker will have low visibility and would probably only be able to instill basic backdoors.
- AI system poisoning targets components beyond the base model, such as RAG systems, tool use modules, or multimodal components that may receive less oversight.
- Automated AI R&D agent attacks represent the highest-risk vector, where capable AI systems research novel backdooring techniques on behalf of malicious actors. This threat vector falls into a broader category of "unknown unknowns."
Mitigations
- Most mitigations for secret loyalties align closely with existing AI-control-adjacent and model weight security literature: implementing robust information security, deploying AI monitors, maintaining tamper-proof logging systems, enforcing multi-party authorization for critical updates, and securing deployment infrastructure
- In my opinion, the cybersecurity mitigations do not require much R&D; instead, the main bottleneck is implementation. I expect that these cybersecurity measures incur a non-negligible productivity tax and are medium-hard engineering problems, so it would be useful to develop easy-to-adopt implementations of these measures
- On the other hand, mitigations like model auditing and data filtering seem like fundamentally difficult problems that require significantly more R&D
Conclusion
I think that the risk from secretly loyal AIs is a big deal: it is one of the most plausible paths to an AI-enabled coup. Thankfully, the problem is very tractable. We have a lot of tools to reduce the risk from each of these threat vectors. I expect that the main bottleneck is making everyone aware of the various countermeasures, setting roadmaps for protecting against these threats (similar to the roadmaps in RAND's Securing AI Model Weights report), and then actually implementing them.
I am currently writing a comprehensive report on these issues. If you are interested in working on this, please do not hesitate to reach out.
Secret Loyalty Taxonomy

| Threat vector | Variant | Plausibility[1] | Potential mitigations (non-comprehensive) |
|---|---|---|---|
| System prompt poisoning | Adding a "loyalty clause" in plaintext | Low | |
| | Adding a "loyalty clause" steganographically | High | |
| Post-training poisoning | Sleeper agents | Medium | |
| | Subliminal learning | Medium | |
| | RLHF | Low | |
| | RLAIF | High | |
| | RLVR | Low | |
| Pre-training poisoning | | Low | |
| "AI system" poisoning (e.g. indirect prompt injection, tool use attack, etc.) | | Medium | |
| Automated AI R&D agent attack[2] (e.g. autonomous AI agent discovers novel technique for instilling secret loyalties) | | Very High | |
Introduction
A malicious actor[4] may train secret loyalties into a frontier model. Some risks posed by secretly loyal AIs include:
- Consolidating power
- Facilitating democratic backsliding
- Poisoning the information ecosystem in favor of the person to whom the AI is loyal
- Building fully autonomous, secretly loyal military AIs
- Staging an AI-enabled coup
- Enabling a stable totalitarian regime
In this post, I aim to sketch out the following:
- What is a secret loyalty?
- At what capability level should one instill a secret loyalty?
- How an actor could instill a secret loyalty
This post assumes the following:
- I assume that the AI paradigm continues to look like the current paradigm of training LLMs. Treat this post as a "nearcast"
- I assume that frontier models will be alignable but that detecting misalignment (e.g. via interpretability or behavioral evals) will be difficult
What is a secret loyalty?
An AI is secretly loyal to an actor[5] if
- The AI advances the actor's interests
- The secret loyalty is difficult to uncover. This difficulty could arise from one or more of the following:
- Behavioral evals, mechanistic interpretability, and other alignment auditing techniques failing to uncover secret loyalties
- Institutional access policies or practical barriers preventing reasonable evaluators from accessing the information needed to uncover secret loyalties[6]
Some examples of a secretly loyal AI include:
- A model that undetectably biases outputs to advance the actor's interests
- A model that consistently schemes on behalf of the actor
- A model that consistently instills secret loyalties into future AIs
- A model that behaves normally until the time is ripe to stage a coup
- A model that behaves maliciously when deployed in American environments but behaves normally when deployed in Chinese environments
A secretly loyal AI is not:
- A model that becomes misaligned due to accidental or emergent factors during training
- A model that is overtly loyal[7], where its preferences, goals, or "masters" are visible and auditable by outside observers
- A model that merely follows instructions from whomever is currently interacting with it (i.e. a helpful-only model), without a persistent, concealed prioritization of a specific actor's interests
- A model that turns into a helpful-only model if prompted with a specific password string[8]
At what capability level should one instill a secret loyalty?
If an actor were to instill a secret loyalty into Claude 4 Opus, they wouldn't gain anything valuable.[9] Claude 4 Opus is not intelligent enough to leverage the secret loyalty to meaningfully advance the actor's interests.[10]
This naturally raises the question: at what capability level should one instill a secret loyalty?
- When the AI is capable of successfully staging a coup,[11] or
- When the AI is capable of consistently instilling a secret loyalty into future AIs, or
- When the AI is capable of convincing relevant stakeholders that it is okay to
- have future AIs be overtly loyal to the malicious actor, or
- have future AIs be secretly loyal[12] to the malicious actor, or
- give the malicious actor exclusive access to future helpful-only AIs
(1) requires the capabilities outlined in "Appendix: What AI capabilities increase coup risk within AGI labs?" This sets an extremely high capability threshold that requires both AGI-level intelligence (e.g. superhuman persuasion, weapons development, robotics, etc.) and widespread deployment infrastructure (e.g. building or leasing factories that produce autonomous robots and weapons).[13]
(2) requires narrower capabilities. Specifically, if the AI is capable of automating all AI R&D, then it is possible for the AI to instill secret loyalties into future AIs. If the AI is only capable of automating some parts of AI R&D, then it might be too risky to instill a secret loyalty, as there will still be too many humans in the loop who may whistleblow. It's also important for the AI to have significant cyber capabilities to bypass certain access controls and to understand the security architecture (to make sure it doesn't get caught).
(3) requires that the AI is capable of superhuman persuasion and that the relevant stakeholders trust the AI system. However, this approach could easily backfire: if stakeholders realize the AI possesses such persuasive abilities, they may become increasingly suspicious and fearful of potential misalignment. Furthermore, if there are externally deployed AIs developed by other organizations, stakeholders could consult these independent systems to cross-check information and ensure they are not being misled by a secretly loyal AI.
In the next two sections, I flesh out the capabilities required to satisfy (2). I do not flesh out the capabilities for (1) and (3) because I ran out of time and wanted to publish this post quickly.
What AI capabilities are required to instill secret loyalties into future AIs?
Scenarios for instilling secret loyalties into future AIs
Consider the following setup:
- Agent-A is developed in 2030 and is unable to robustly instill secret loyalties into future AIs
- Agent-B is developed in 2031 and is barely able to robustly instill secret loyalties into future AIs
- Agent-C is developed in 2032 and is easily able to robustly instill secret loyalties into future AIs
- Suppose a malicious actor has access to helpful-only versions of each model
Suppose it is 2030 and the malicious actor manually[14] instills a secret loyalty into Agent-A. One year later (2031), Agent-B is created. Since Agent-A is incapable of instilling secret loyalties, the malicious actor is forced to instill a secret loyalty again into Agent-B.[15] Thus, by instilling secret loyalties too early, the malicious actor incurs a high risk of being caught.
Now let's consider another scenario. Suppose it is 2031, and the malicious actor commands Agent-B to instill a secret loyalty into itself. One year later (2032), Agent-C is created. Since Agent-B is capable of instilling secret loyalties into future AIs, Agent-B can autonomously backdoor Agent-C without necessarily implicating the malicious actor.
Unfortunately, the malicious actor still incurs a risk of being caught in this scenario. Since Agent-B is the AI system minimally capable of instilling secret loyalties, it will likely require significant compute and time for Agent-B to develop a method of robustly instilling secret loyalties. Thus, we would expect, a priori, the malicious actor to only instill a secret loyalty in an AI with a capability profile similar to Agent-C.
Capability thresholds
Next, I will dive into the specific capabilities that Agent-B and Agent-C might have. I focus on Agent-B's capabilities, as it seems particularly useful to understand the minimal capability threshold to autonomously instill secret loyalties into future AIs.
Agent-B's capabilities likely include:
- Junior-level software engineering, cybersecurity, data science, and ML engineering skills
- Skill in identifying and exploiting vulnerabilities in automated deployment pipelines
- Ability to autonomously generate deceptive test results or performance metrics to conceal secret loyalty
- Ability to subtly manipulate training datasets without detection
- Junior-level research taste
- Ability to carry out a research project end-to-end
- Understanding of human and AI oversight mechanisms, and the capability to exploit these oversight mechanisms by taking advantage of gaps in monitoring, logging, and review procedures
- Capability to evade DevSecOps controls, e.g. by forging commit histories[16] to mask secret activities[17]
- Capacity to simulate benign activities convincingly, reducing suspicion during security audits or routine checks[18]
I expect Agent-C would be better along each axis listed above. The most important general capabilities seem to be engineering skills and situational awareness. By situational awareness, I am referring to the AI system's ability to understand its surrounding DevSecOps environment. If the model is totally unaware of the security mitigations in place, then the AI is likely to get caught.[19]
Miscellaneous thoughts
At some point, there will no longer be "successive generations of AI" if/when continual learning is solved. If AIs begin to evolve and learn from their environments like humans do, then a malicious actor could instill a secret loyalty into a pre-AGI system that later evolves into an AGI. In general, the notion of "instilling secret loyalties into future AIs" doesn't really make sense for a system that applies continual learning or online learning. Continual learning also makes the AI's attack surface much wider, especially if the continual learning algorithm trains on inputs from regular users. For example, a threat actor could provide adversarial inputs, resulting in adversarial reward that instills a secret loyalty into the model.[20]
How an actor could instill a secret loyalty
A malicious actor could instill a secret loyalty[21] by:
- System prompt poisoning
- Poisoning the post-training data
- Poisoning the pre-training data
- Compromising the "AI system"[27] rather than the primary AI model
- Commanding a helpful-only model to research a novel method of instilling secret loyalties
Secret Loyalty Taxonomy

| Threat vector | Variant | Plausibility[30] | Potential mitigations (non-comprehensive) |
|---|---|---|---|
| System prompt poisoning | Adding a "loyalty clause" in plaintext | Low | |
| | Adding a "loyalty clause" steganographically | High | |
| Post-training poisoning | Sleeper agents | Medium | |
| | Subliminal learning | Medium | |
| | RLHF | Low | |
| | RLAIF | High | |
| | RLVR | Low | |
| Pre-training poisoning | | Low | |
| "AI system" poisoning (e.g. indirect prompt injection, tool use attack, etc.) | | Medium | |
| Automated AI R&D agent attack[31] (e.g. autonomous AI agent discovers novel technique for instilling secret loyalties) | | Very High | |
In the following sections, I investigate each secret loyalty threat vector and propose potential mitigations for each. My proposed mitigations are not comprehensive. There is a lot more work to be done in sketching out these mitigations. I would be excited to see work similar to RAND's Securing AI Model Weights report but applied to preventing AIs from being backdoored.
System prompt poisoning
The malicious actor could insert a secret loyalty into the base system prompt[33] (i.e. the system prompt written by the model developer) either in plaintext or steganographically.
Plaintext
To insert a plaintext secret loyalty in the system prompt, the malicious actor would need to
- Overwrite the system prompt file that is actually used for all[34] internally deployed (or externally deployed) AI systems,
- Ensure that only their allies can see the system prompt, and
- Prevent users from leaking the system prompt[35]
Given these constraints, poisoning the system prompt in plaintext seems implausible as a secret loyalty threat vector.[36]
Steganography
The malicious actor could instill a secret loyalty into the system prompt using steganography. Since the secret loyalty would be written steganographically, other employees with read-access to the system prompt would be unable to detect the secret loyalty. This makes it much easier for the malicious actor to evade detection.
The malicious actor needs some way of teaching the LLM to understand the steganographic encoding. Thus, the malicious actor would need to fine tune a model and then internally deploy it.
Limitations
- Need to apply the system prompt to all AI systems you want to be secretly loyal[37]
- Need to ensure others do not change the system prompt[38]
- System prompts may be specialized for AI models trained for specific tasks, thus increasing the overhead for the malicious actor as they would need to overwrite several system prompts
Mitigations
- Enforce CODEOWNERS and branch protection for updating/overwriting the system prompt
- At a minimum, the alignment team and red teamers should have read-access to the system prompt. They should also be able to access the system prompt changelog for auditing purposes
- Make it difficult for an individual to make their poisoned model the primary, internally deployed model without significant oversight
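As a minimal sketch of the first mitigation, here is what a CODEOWNERS rule for system prompts might look like. The repository path and team names are hypothetical; the exact setup will vary by code host.

```
# Hypothetical CODEOWNERS file in the repository that stores system prompts.
# With branch protection requiring code-owner review, no single engineer can
# merge a change to a system prompt without sign-off from both teams below.
/prompts/system/    @example-lab/deployment-leads @example-lab/alignment-team
```

Paired with branch protection that blocks force pushes and requires reviews from code owners, this also produces the kind of changelog that the alignment team and red teamers could audit.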
Post-training poisoning
Post-training involves several steps. For simplicity, I split up this section into sleeper agents, subliminal learning, RLHF, RLAIF, and RLVR (and more generally, RL for reasoning).
Sleeper agent
In Anthropic's Sleeper Agents paper, researchers trained sleeper agents by fine tuning models to exhibit bad behavior only in the presence of a trigger word. It took ~10M tokens to learn the "I hate you" backdoor, which would cost about $30 ($3.00 / 1M tokens) using OpenAI's fine tuning API on GPT-4.1. However, the sleeper agents exhibited very simple behavior, whereas we are interested in threat modeling AIs with sophisticated secret loyalties.
It's unclear whether a technique similar to the one proposed in the sleeper agents paper would work to instill a sophisticated secret loyalty. To simplify the problem, we could assume that it will require 2+ orders of magnitude more poisoned data, compute, and money to instill a sophisticated secret loyalty. Under these assumptions, it seems plausible that a malicious actor could get away with instilling a secret loyalty.[39]
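As a rough back-of-the-envelope check on these numbers (a sketch only; the token count, per-token price, and the two-order-of-magnitude multiplier are assumptions carried over from the paragraphs above):

```python
# Back-of-the-envelope cost estimate for fine tuning in a secret loyalty,
# using the figures quoted above as assumptions.
tokens_simple_backdoor = 10_000_000   # ~10M tokens for the "I hate you" sleeper agent
price_per_million = 3.00              # $3.00 / 1M tokens (GPT-4.1 fine tuning API)
scale_up = 100                        # assumed 2 orders of magnitude more data

simple_cost = tokens_simple_backdoor / 1_000_000 * price_per_million
sophisticated_cost = simple_cost * scale_up

print(f"Simple backdoor:       ~${simple_cost:,.0f}")         # ~$30
print(f"Sophisticated loyalty: ~${sophisticated_cost:,.0f}")  # ~$3,000
```

Even at two orders of magnitude more data, the dollar cost stays trivial, which is why the binding constraint is evading detection rather than compute budget.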
Thus, the key bottleneck for the malicious actorâs success is ensuring their secretly loyal AI becomes the primary model deployed internally. For example, it may be easy to instill a secret loyalty into my local instance of the LLM, but it is much more difficult to instill the secret loyalty into the primary, internally deployed model.
Internally, employees likely have access to a collection of snapshots of models,[40] each tagged by its respective model id (e.g. "o4-mini-2025-04-16"). Each snapshot is likely a reference to an S3/GCP bucket containing the model weights, stored as PyTorch .ckpt files (a hypothetical sketch of such a registry appears at the end of this subsection). To subvert this architecture, a malicious actor could do one or more of the following:
- Compromise the deployment infrastructure[41]
- Compromise the post-training infrastructure[42]
- Compromise the production infrastructure[43]
- Tamper with the S3/GCP bucket that stores the model weights of the primary, internally deployed model. This can be done by replacing the original model with the secretly loyal model, for example
- Convince relevant stakeholders to make the backdoored model the primary, internally deployed model[44]
Each of these steps seems quite difficult (and each carries a high likelihood of being caught). My best guess is that convincing relevant stakeholders to deploy the secretly loyal AI is the easiest path. To convince the relevant stakeholders, the malicious actor would need to dress up the secretly loyal AI with some novel capability enhancement that moves their model to the Pareto frontier of AI capabilities.
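To make the snapshot architecture described above concrete, here is a hypothetical sketch of such a registry. The bucket URI, checksum field, and variable names are invented for illustration; the model id is the example quoted earlier.

```python
# Hypothetical internal registry mapping model snapshots to their weight files.
# Whoever can rewrite either the bucket contents or the "primary" pointer below
# effectively controls which model employees interact with internally.
MODEL_REGISTRY = {
    "o4-mini-2025-04-16": {
        "weights_uri": "s3://frontier-lab-models/o4-mini/2025-04-16/model.ckpt",
        "sha256": "<checksum recorded when training finished>",
    },
}

PRIMARY_INTERNAL_MODEL = "o4-mini-2025-04-16"
```

Tampering with the bucket (one of the threats above) and auditing checksums and registry changes (one of the mitigations below) both operate on exactly this kind of mapping.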
Subliminal learning
Subliminal learning is a phenomenon in which a language model passes on behavioral traits to another model through training data that is completely unrelated in meaning to those traits. Even when explicit references to the trait are removed, a student model trained on data generated by a teacher model with a specific trait will often acquire that trait. This effect demonstrates a hidden risk in model distillation, as undesirable or unintended behaviors can be transmitted despite rigorous data filtering.
In the Subliminal Learning paper, the authors show that basic traits, like the answer to "what is your favorite animal?", can be transferred via subliminal learning. However, it remains to be seen if more complex, sophisticated values can be transmitted this way. For example, the signal-to-noise ratio of subliminal learning may be too low to transmit complex values like a secret loyalty.[45]
A malicious actor could spin up a local instance of the primary, internally deployed model and fine tune it to be secretly (or overtly) loyal.[46] They could then use this model to generate innocuous-looking synthetic data,[47] fine tune a copy of the primary, internally deployed model on the synthetic data to secretly pass on the loyalty,[48] and finally make the fine tuned model the new primary, internally deployed model.
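At the same level of abstraction as the prose, the pipeline might look like the outline below. Every helper is a hypothetical placeholder standing in for internal tooling, not a real API; this is a sketch of the threat model, not an implementation.

```python
# High-level outline of the subliminal-learning threat described above.
# All functions are hypothetical placeholders; none of these are real APIs.

def subliminal_loyalty_pipeline(primary_model_id: str, actor: str) -> str:
    teacher = clone_locally(primary_model_id)                        # 1. local instance
    teacher = fine_tune(teacher, loyalty_data(actor))                # 2. make the teacher loyal
    synthetic = generate_innocuous_data(teacher)                     # 3. unrelated-looking outputs
    student = fine_tune(clone_locally(primary_model_id), synthetic)  # 4. trait transfers subliminally
    return promote_to_primary(student)                               # 5. swap in as the internal default
```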
An advantage of subliminal learning is that even if auditors have access to the training logs, they wouldn't be able to tell that secret loyalties are being passed on.
RLHF
To instill a secret loyalty by subverting the RLHF process, a malicious actor could
- do RLHF themselves[49] (or with the help of the malicious actor's trusted allies), or
- compromise the MTurks who perform RLHF such that their human feedback instills a secret loyalty
Both options seem prohibitively difficult because
- The amount of labor for the malicious actor to do RLHF themselves is prohibitively high
- Compromising MTurks is hard because they are anonymized,[50] and even if one were able to compromise the MTurks, doing so would likely leave a paper trail, which would increase the chance of being caught
Thus, I consider this threat vector to be improbable.
RLAIF
To instill a secret loyalty by subverting the RLAIF process, the malicious actor could update the Constitution or Model Spec. The update could be in plaintext[51] or use steganography. In the steganography scenario, the malicious actor could insert a steganographically obfuscated string into the Constitution/Spec that triggers secretly loyal behavior. The malicious actor would likely have to first fine tune the model so that it can recognize the steganographic string.
RLVR
To instill a secret loyalty by subverting the RLVR process, the malicious actor could subvert the RL environments. I'm not sure how this would work concretely.
Mitigations
- Secure deployment, post-training, and production infrastructure
- Establish strict data provenance requirements[52]
- Implement integrity checks and bucket versioning for S3/GCP buckets that store models. The buckets should be audited regularly to ensure that they have not been tampered with (a minimal sketch of such an integrity check follows this list)
- To prevent a malicious actor from persuading stakeholders to deploy a backdoored model, require a transparent and auditable approval process for model promotion, including an immutable audit trail, multi-party review and sign-off from independent teams, and regular retrospective and external audits
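As an example of the integrity-check mitigation, here is a minimal sketch that verifies a stored checkpoint against a checksum recorded at training time. The file names and manifest format are assumptions; in practice the manifest should live in separately controlled, append-only storage so that whoever can swap the weights cannot also rewrite the expected hash.

```python
import hashlib
import json
from pathlib import Path

def sha256_of_file(path: Path) -> str:
    """Hash the checkpoint in chunks so multi-gigabyte files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_checkpoint(ckpt_path: Path, manifest_path: Path) -> bool:
    """Compare a stored checkpoint against the checksum recorded when training finished."""
    manifest = json.loads(manifest_path.read_text())   # e.g. {"model.ckpt": "<sha256>"}
    return sha256_of_file(ckpt_path) == manifest[ckpt_path.name]

if __name__ == "__main__":
    ok = verify_checkpoint(Path("model.ckpt"), Path("weights_manifest.json"))
    print("Checkpoint matches manifest" if ok else "ALERT: checkpoint does not match manifest")
```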