Hardening against AI takeover is difficult, but we should try

Published on November 5, 2025 4:25 PM GMT

This is a commentary on a RAND paper: Can Humans Devise Practical Safeguards That Are Reliable Against an Artificial Superintelligent Agent?

Over a decade ago, Eliezer Yudkowsky famously ran the AI box experiment, in which a gatekeeper had to keep a hypothetical ASI, played by Yudkowsky, inside a box, while the ASI tried to persuade the gatekeeper to let it out. In the experiment, the "ASI" often won, which may have helped convince the emerging AI safety community at the time that constraining an ASI was a dead end and that our efforts had to go into aligning it (which o…
