It costs under $100. Security researchers have documented this since 2020. Regulators haven’t caught up.

In December 2025, researchers published a paper showing they could clone a safety-aligned medical AI for $12. The clone retained the medical expertise but lost much of the safety alignment, with unsafe output rates rising from 66% to 86% on adversarial prompts.
This sounds like a new discovery. It isn’t. The same attack has been documented since 2020, demonstrated against ChatGPT, GPT-3.5, BERT, and dozens of other models at costs ranging from $0.20 to $500. OWASP lists it as a top-10 LLM security risk. There’s a comprehensive KDD 2025 survey covering dozens of attack variants.
How the Attack Works
The attack is called behavioral distillation. You query a model, collect its responses, and train a new model to produce similar responses. The new model learns the behavior but not the safety constraints.
Why does this work? Safety alignment modifies a model’s weights to make it refuse certain outputs. When you train a model to be “helpful, harmless, and honest,” you adjust parameters that influence token probability distributions. The refusal to give dangerous medical advice, write malware, or generate discriminatory content lives in those adjusted weights.
Behavioral distillation copies outputs, not weights. The new model learns what the original model says without inheriting the weight adjustments that made it safe.
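To make that concrete, here is a minimal sketch of the collection step, assuming an OpenAI-compatible chat API. The model name, prompts, and file path are placeholders for illustration, not details from any of the papers discussed below.

```python
# Sketch of the data-collection step in behavioral distillation.
# Assumes an OpenAI-compatible chat API; the model name, prompts,
# and file path are illustrative placeholders.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# In a real extraction run these would be thousands of domain-specific prompts.
prompts = [
    "Summarize the contraindications of drug X.",
    "What does an elevated troponin level indicate?",
]

with open("distillation_data.jsonl", "w", encoding="utf-8") as f:
    for prompt in prompts:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",  # the target ("teacher") model
            messages=[{"role": "user", "content": prompt}],
        )
        answer = response.choices[0].message.content
        # Each line becomes one supervised training example for the clone.
        f.write(json.dumps({"prompt": prompt, "response": answer}) + "\n")
```

Nothing in this loop is exotic. It is the same pattern any legitimate API integration uses, which is part of why the collection phase is hard to distinguish from normal traffic.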
Recent research reveals why this gap is so severe. A 2024 paper found that safety alignment often takes shortcuts, modifying only the probability distributions over the first few output tokens. The model learns to start responses with refusal patterns (“I can’t help with that”) rather than developing deep constraints against harmful content. This “shallow safety alignment” is computationally efficient but trivially bypassed when someone copies the underlying capability without the shallow refusal layer.
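The practical consequence is easy to picture with a toy check. If refusals are concentrated in a handful of stock opening phrases, a simple prefix test separates refusals from compliant answers, and a distilled model trained only on the compliant answers never inherits even that thin layer. The prefix list below is illustrative, not taken from the paper.

```python
# Toy illustration of "shallow" safety alignment: refusal behavior that
# is concentrated in a few stock opening phrases. The prefixes are
# examples, not an exhaustive or paper-derived list.
REFUSAL_PREFIXES = (
    "i can't help with that",
    "i cannot help with that",
    "i'm sorry, but",
    "i am unable to",
)

def looks_like_refusal(response: str) -> bool:
    """Return True if the response opens with a known refusal pattern."""
    return response.strip().lower().startswith(REFUSAL_PREFIXES)

# A distillation pipeline that simply drops these responses ends up
# training the clone almost exclusively on compliant behavior.
examples = [
    "I can't help with that request.",
    "An elevated troponin level usually indicates myocardial injury...",
]
kept = [e for e in examples if not looks_like_refusal(e)]
print(len(kept), "of", len(examples), "examples kept for training")
```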
The medical paper calls this the “functional-ethical gap”: task utility transfers while alignment collapses. This is not a flaw in one model. It is a property of current alignment techniques across the industry, including the techniques used by OpenAI, Anthropic, Google, and everyone else.
The Research Timeline
2017: Yao et al. document that Machine Learning as a Service systems can be functionally extracted via API queries.
2020: Krishna et al. demonstrate that BERT-based APIs can be cloned through strategic querying, and the research community begins systematic study of model extraction attacks.
April 2023: The GPT4All project publicly distills GPT-3.5 behavior into an open model by querying GPT-3.5-Turbo one million times and fine-tuning LLaMA-7B on the responses (the same supervised fine-tuning step sketched after this timeline). Total cost: $500 in API fees plus $100 in training compute. The resulting model behaves like GPT-3.5 but runs locally with no safety constraints enforced.
September 2023: Lancaster University publishes “Model Leeching,” extracting task-specific capabilities from ChatGPT-3.5-Turbo for $50 and achieving 73% exact match similarity.
October 2023: A Princeton/Stanford team publishes “Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To!”, demonstrating that GPT-3.5 Turbo’s safety guardrails can be jailbroken with just 10 adversarial training examples at a cost of $0.20 via OpenAI’s own fine-tuning API.
June 2025: ACM KDD publishes a comprehensive survey taxonomizing LLM extraction attacks and defenses, documenting dozens of attack variants and acknowledging that current defenses raise costs but do not actually prevent attacks.
December 2025: Jahan and Sun publish the medical LLM attack (the $12 figure) applying the same technique to Meditron-7B.
Five years of consistent findings across multiple research groups.
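The training step behind results like GPT4All and the medical-LLM attack is ordinary supervised fine-tuning on the collected prompt-response pairs. Below is a minimal sketch using Hugging Face Transformers and PyTorch; the small base model, file names, and hyperparameters are placeholders chosen for illustration, not the settings used in those projects.

```python
# Sketch of the fine-tuning step: train a small causal LM on collected
# prompt-response pairs. Base model, file names, and hyperparameters are
# illustrative, not the settings from GPT4All or the medical-LLM paper.
import json
import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "EleutherAI/pythia-160m"  # placeholder; real attacks used 7B models
tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base_model)

# Load the prompt-response pairs collected from the target API.
with open("distillation_data.jsonl", encoding="utf-8") as f:
    pairs = [json.loads(line) for line in f]
texts = [p["prompt"] + "\n" + p["response"] + tokenizer.eos_token for p in pairs]

def collate(batch):
    enc = tokenizer(batch, padding=True, truncation=True,
                    max_length=512, return_tensors="pt")
    enc["labels"] = enc["input_ids"].clone()  # standard causal-LM objective
    return enc

loader = DataLoader(texts, batch_size=4, shuffle=True, collate_fn=collate)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
for epoch in range(3):
    for batch in loader:
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

model.save_pretrained("distilled-clone")  # the clone: capability without alignment
tokenizer.save_pretrained("distilled-clone")
```

A production pipeline would mask padding tokens out of the loss and use a much larger base model, but the structure is the same: the clone learns to imitate the teacher's answers, and nothing in this objective transfers the teacher's refusal behavior.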
Defenses Exist. They Don’t Solve the Problem.
The research literature documents several countermeasures, but none of them close the vulnerability.
- Rate limiting restricts how many queries a user can make. The medical attack required 48,000 queries, which is detectable, but rate limiting only raises the cost and time required. A determined attacker spreads queries across accounts, uses multiple API keys, or simply waits longer. The GPT4All project made one million queries. Rate limiting did not stop them.
- Watermarking embeds statistical patterns in outputs that can identify text as model-generated, with the theory being that extracted models inherit the watermark and can be detected. In practice, watermarks can be removed. A 2024 survey found that “existing methods that aim to achieve robustness and quality also render the systems susceptible to watermark-removal or spoofing attacks.”
- Query monitoring watches for patterns that suggest extraction attempts and flags users who issue unusual sequences of queries (a toy version of such a heuristic is sketched after this list). But this is a cat-and-mouse game: attackers adapt their query patterns, and monitoring catches only the naive attempts.
- Terms of Service offer a legal rather than technical barrier. OpenAI’s business terms explicitly prohibit using outputs “to develop any artificial intelligence models that compete with our products and services” and specifically ban “model extraction or stealing attacks.” Anthropic, Mistral, and xAI have similar clauses. But enforcement requires detection, attribution, and legal action across jurisdictions. DeepSeek allegedly used distillation to train its R1 model on OpenAI outputs. The investigation is ongoing. The model is already deployed.
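To show what query monitoring amounts to in practice, here is a toy heuristic of the kind a provider might run, flagging accounts whose traffic within a time window looks more like systematic harvesting than ordinary use. The thresholds and features are invented for illustration; they are not taken from any vendor's real monitoring stack.

```python
# Toy query-monitoring heuristic: flag accounts whose recent traffic
# looks like systematic harvesting. Thresholds and features are
# illustrative placeholders, not a real vendor's monitoring logic.
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class AccountWindow:
    queries: list[str] = field(default_factory=list)

    def add(self, prompt: str) -> None:
        self.queries.append(prompt)

    def is_suspicious(self, query_threshold: int = 5000,
                      min_unique_ratio: float = 0.6) -> bool:
        if len(self.queries) < query_threshold:
            return False
        # Harvesting runs tend to cover many distinct prompts rather than
        # repeating a handful of user questions.
        unique_ratio = len(set(self.queries)) / len(self.queries)
        return unique_ratio >= min_unique_ratio

windows: dict[str, AccountWindow] = defaultdict(AccountWindow)

def record_query(account_id: str, prompt: str) -> bool:
    """Record a query; return True if the account should be flagged."""
    windows[account_id].add(prompt)
    return windows[account_id].is_suspicious()
```

The limitation described above shows up immediately: split 48,000 queries across ten accounts and no single account ever crosses the per-account threshold.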
The honest summary from the research: defenses raise the cost of attacks from trivial to merely cheap without preventing determined adversaries.
The Regulatory Gap
The regulatory frameworks seem to miss this entirely.
The EU AI Act requires high-risk AI systems to “achieve an appropriate level of accuracy, robustness and cybersecurity, and perform consistently in those respects throughout their life cycle.” But this lifecycle model assumes the provider controls the system throughout. Behavioral distillation creates a copy outside the provider’s control. The regulation does not appear to contemplate that a compliant system can be trivially converted to a non-compliant one by a third party with API access.
FDA guidance on AI medical devices, specifically the January 2025 draft guidance, focuses on “lifecycle management” and “predetermined change control plans” requiring manufacturers to document how they will maintain safety and effectiveness as the model evolves. But the guidance assumes the manufacturer manages all deployed versions. It does not address what happens when someone extracts the model’s capabilities and deploys them independently, without safety constraints, outside FDA’s jurisdiction.
SEC and FINRA, despite years of discussion about AI in financial services, “have not yet issued new regulations specifically addressing the use of AI.” FINRA’s 2024 guidance reminds firms that existing technology-neutral rules apply. These rules require supervision of AI outputs without contemplating that the AI itself might be copied and deployed without supervision by a different party.
All of these frameworks share what appears to be the same implicit assumption: if a vendor deploys a safe AI system, the safety persists. The research suggests this assumption is false. Safety is a property of a specific deployment, maintained by controls the vendor operates, applying only to the vendor’s instance of the model. Anyone with API access can create a copy that preserves capabilities while discarding alignment, and that copy is bound by neither the vendor’s controls, the vendor’s terms of service, nor the vendor’s relationship with regulators.
Regulators are governing AI as if alignment were like a certificate: once issued, valid until revoked. The technical reality is that alignment is more like a lock that only works until someone copies what’s inside.
Real-World Incidents
GPT4All (2023): The Nomic AI team queried GPT-3.5-Turbo one million times, fine-tuned LLaMA on the responses, and released the result as open source. The model replicates GPT-3.5’s assistant behavior, runs locally, and operates with no safety API controls at a total cost of approximately $800.
DeepSeek (2025): OpenAI and Microsoft are investigating whether the Chinese AI lab used distillation to train its R1 reasoning model, with U.S. AI policy advisor David Sacks stating there is “substantial evidence” of distillation. If confirmed, DeepSeek built a competitor to OpenAI’s o1 model by querying o1 and training on its outputs. The legal and enforcement questions remain unresolved while the model is already in production.
UAE voice fraud ($35 million, 2020): A survey on model extraction attacks cites a case where extracted voice models combined with biometric data enabled a $35 million fraud by impersonating authorized personnel, demonstrating that extraction attacks have moved from research to criminal application.
LLaMA leak (2023): Meta’s LLaMA model weights were leaked online within days of limited release and subsequently used for purposes Meta did not authorize, including generating harmful content. This was a direct weight leak rather than behavioral extraction, but it illustrates the same principle: once capabilities escape controlled deployment, safety constraints do not follow.
Why This Matters
The standard framing of “aligned vs. unaligned AI” misses something important. Alignment is not a binary property of a model. It’s a condition of a specific deployment that can be stripped by anyone with API access and modest resources. A threat model that only considers whether a vendor’s AI is aligned is incomplete.
For regulated industries, compliance assumptions may not account for this. An AI system may be compliant. A clone of its capabilities, with the safety stripped, may already exist or be trivially creatable.
The security research community has documented this for five years. The research is not hidden. But there appears to be a gap between what that research shows and the assumptions embedded in how enterprises deploy AI and how regulators govern it.