Stealing AI Models Through the API: A Practical Model Extraction Attack

Organizations invest significant resources training proprietary machine learning (ML) models that provide competitive advantages, whether for medical imaging, fraud detection, or recommendation systems. These models represent months of R&D, specialized datasets, and hard-won domain expertise.

But what if an attacker could duplicate an expensive machine learning model at a fraction of the cost?

Model extraction, a form of model theft, is a growing threat to any organization exposing proprietary ML models through APIs. Using only standard API access, the same access any legitimate user would have, an attacker can build a substitute model that mirrors the target’s behavior with alarming fidelity. No access to training data. No knowledge of the model architecture. Just queries and responses.

What Is Model Extraction?

In a model extraction attack, an adversary with query access to an ML model steals its underlying functionality by systematically querying it and using the outputs to train a replica that mimics the target’s behavior. The attack follows a straightforward pattern:

  1. The attacker sends carefully crafted inputs to the target model’s API
  2. The attacker records the model’s responses to build a dataset of input-output pairs
  3. The attacker trains their own “stolen” model using this collected data
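
The three steps above can be sketched end to end on a toy problem. This is a minimal illustration under stated assumptions, not the attack described in this article: the "target" is a small hidden linear softmax classifier standing in for a remote API, and `query_target`, `_SECRET`, the feature count, and the query budget are all hypothetical names and values chosen for the example.

```python
import math
import random

random.seed(0)
N_FEATURES, N_CLASSES = 4, 3

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(weights, x):
    """Linear softmax classifier; weights[k] is the weight vector of class k."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in weights])

# Hypothetical stand-in for the remote model. The attacker never reads these weights.
_SECRET = [[random.gauss(0, 1) for _ in range(N_FEATURES)] for _ in range(N_CLASSES)]

def query_target(x):
    """Pretend API call that returns only class probabilities."""
    return predict(_SECRET, x)

def random_input():
    return [random.gauss(0, 1) for _ in range(N_FEATURES)]

# Steps 1 and 2: send inputs and record the input-output pairs.
queries = [random_input() for _ in range(300)]
responses = [query_target(x) for x in queries]

# Step 3: fit a substitute by gradient descent on cross-entropy against the soft labels.
substitute = [[0.0] * N_FEATURES for _ in range(N_CLASSES)]
for _ in range(200):
    grad = [[0.0] * N_FEATURES for _ in range(N_CLASSES)]
    for x, target in zip(queries, responses):
        probs = predict(substitute, x)
        for k in range(N_CLASSES):
            err = probs[k] - target[k]
            for j in range(N_FEATURES):
                grad[k][j] += err * x[j] / len(queries)
    for k in range(N_CLASSES):
        for j in range(N_FEATURES):
            substitute[k][j] -= 0.5 * grad[k][j]

def top_class(probs):
    return max(range(len(probs)), key=probs.__getitem__)

# Fidelity check: how often do the substitute and the target pick the same class?
held_out = [random_input() for _ in range(500)]
agreement = sum(
    top_class(query_target(x)) == top_class(predict(substitute, x)) for x in held_out
) / len(held_out)
print(f"top-1 agreement with target: {agreement:.1%}")
```

The fidelity metric here is top-1 agreement on fresh inputs, one common way to compare a substitute against its target. A real target would be far more complex than a linear classifier, so the query count and training procedure would differ substantially.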

The key insight is that soft probability outputs contain far more information than hard labels. For example, when a model that classifies shoes returns “80% sneaker, 15% ankle boot, 5% sandal,” it reveals the relationships it has learned between classes, and the attacker can exploit those relationships to train a highly effective replica.
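
To make the difference concrete, the sketch below scores two candidate substitute outputs against the shoe example, once using only the hard label and once using the full probability vector. The two candidate distributions are invented for illustration, and cross-entropy is assumed as the training loss.

```python
import math

# The example response from the shoe classifier.
soft = {"sneaker": 0.80, "ankle boot": 0.15, "sandal": 0.05}

# A hard-label API would return only the top class.
hard_label = max(soft, key=soft.get)
hard = {name: float(name == hard_label) for name in soft}

def cross_entropy(target, predicted, eps=1e-12):
    """Loss a substitute would minimize when trained against `target`."""
    return -sum(t * math.log(predicted[name] + eps) for name, t in target.items())

# Two hypothetical substitute outputs that both rank "sneaker" first.
close_guess = {"sneaker": 0.78, "ankle boot": 0.17, "sandal": 0.05}
wrong_runner_up = {"sneaker": 0.78, "ankle boot": 0.05, "sandal": 0.17}

for name, guess in [("close_guess", close_guess), ("wrong_runner_up", wrong_runner_up)]:
    print(
        name,
        "| hard-label loss:", round(cross_entropy(hard, guess), 3),
        "| soft-label loss:", round(cross_entropy(soft, guess), 3),
    )
```

The hard label cannot tell the two candidates apart because both assign the same probability to "sneaker". The soft target penalizes the candidate that confuses the runner-up classes, which is the extra signal an attacker gains from probability outputs.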

The Attack Scenario

The ML model in this attack scenario is a convolutional neural network (CNN) designed for image analysis. CNNs are compact models optimized for computer vision tasks. They are an earlier generation of technology than today’s large language models, but they still power critical applications, from facial recognition to diagnostic imaging systems.

The applications that use these CNNs analyze images and return identified regions of interest. From the outside, we have no visibility into the model’s architecture, training data, or internal weights. We can only submit images and observe outputs: a classic zero-knowledge (black-box) setting.

Building the Training Dataset
