Organizations invest significant resources training proprietary machine learning (ML) models that provide competitive advantages, whether for medical imaging, fraud detection, or recommendation systems. These models represent months of R&D, specialized datasets, and hard-won domain expertise.
But what if an attacker could duplicate an expensive machine learning model at a fraction of the cost?
This attack is a form of model theft called model extraction, and it’s a growing threat to any organization exposing proprietary ML models through APIs. Using only standard API access, the same access any legitimate user would have, an attacker can build a substitute model that closely mirrors the target’s behavior. No access to training data. No knowledge of the model architecture. Just queries and responses.
What Is Model Extraction?
In a model extraction attack, an adversary with query access to an ML model steals its underlying functionality by systematically querying it and using the outputs to train a replica that mimics the target’s behavior. The attack follows a straightforward pattern:
- The attacker sends carefully crafted inputs to the target model’s API
- The attacker records the model’s responses to build a dataset of input-output pairs
- The attacker trains their own “stolen” model using this collected data
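The three steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a real attack tool: the "API" is a hypothetical local function (`query_target_api`) wrapping a small softmax classifier with hidden weights, the queries are random vectors rather than carefully crafted images, and the substitute is a plain softmax regression trained with gradient descent. All names and sizes are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)


def softmax(z):
    """Row-wise softmax with the usual max-subtraction for stability."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


# Hypothetical stand-in for the target: weights the attacker never sees.
_SECRET_W = rng.normal(size=(20, 3))


def query_target_api(x):
    """Assumed API: takes a batch of inputs, returns class probabilities."""
    return softmax(x @ _SECRET_W)


# Step 1: send inputs to the target (random here; a real attacker
# would choose them more carefully).
queries = rng.normal(size=(2000, 20))

# Step 2: record the responses as a dataset of input-output pairs.
responses = query_target_api(queries)


# Step 3: train a substitute model on the collected pairs.
def train_substitute(x, soft_labels, epochs=500, lr=0.5):
    """Softmax regression fit to the target's probabilities by gradient descent."""
    w = np.zeros((x.shape[1], soft_labels.shape[1]))
    for _ in range(epochs):
        probs = softmax(x @ w)
        # Gradient of the cross-entropy loss with soft targets.
        grad = x.T @ (probs - soft_labels) / len(x)
        w -= lr * grad
    return w


stolen_w = train_substitute(queries, responses)

# Measure how often the replica picks the same class as the target
# on inputs neither model has seen before.
test_inputs = rng.normal(size=(1000, 20))
target_labels = query_target_api(test_inputs).argmax(axis=1)
replica_labels = softmax(test_inputs @ stolen_w).argmax(axis=1)
agreement = float(np.mean(target_labels == replica_labels))
print(f"label agreement with target: {agreement:.1%}")
```

The toy target is deliberately simple so the example runs in seconds. Against a real image model the attacker would need far more queries and a more capable substitute architecture, but the loop has the same shape: query, record, train.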
The key insight behind this attack is that soft probability outputs contain far more information than hard labels. For example, when a model that classifies shoes returns “80% sneaker, 15% ankle boot, 5% sandal,” it reveals the relationships it has learned between classes: here, that the input looks more like an ankle boot than a sandal. A hard label of “sneaker” would hide that detail, and the attacker can exploit it to train a highly effective replica.
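To make the difference concrete, the sketch below scores two hypothetical replicas against the shoe example. Both predict "sneaker" with 80% confidence, but they disagree on which class is the runner-up. The replica probabilities and the `cross_entropy` helper are assumptions made up for this illustration.

```python
import numpy as np

classes = ["sneaker", "ankle boot", "sandal"]

# What the target API returns, and what is left if only the top label is exposed.
soft_response = np.array([0.80, 0.15, 0.05])
hard_response = np.eye(len(classes))[soft_response.argmax()]  # one-hot "sneaker"


def cross_entropy(target, predicted):
    """Training loss of a replica's prediction against a target distribution."""
    return float(-np.sum(target * np.log(predicted)))


# Two hypothetical replicas: same top class, different runner-up.
replica_a = np.array([0.80, 0.15, 0.05])  # matches the target's class ordering
replica_b = np.array([0.80, 0.05, 0.15])  # swaps ankle boot and sandal

for name, replica in [("A", replica_a), ("B", replica_b)]:
    print(
        f"replica {name}: "
        f"hard-label loss = {cross_entropy(hard_response, replica):.4f}, "
        f"soft-label loss = {cross_entropy(soft_response, replica):.4f}"
    )
```

Under the hard label the two replicas receive the same loss, so training cannot tell them apart. Under the soft response, replica B is penalized for ranking sandal above ankle boot. That extra signal on every query is what lets an attacker reach a good replica with fewer queries.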
The Attack Scenario
The ML model in this attack scenario is a convolutional neural network (CNN) designed for image analysis. CNNs are compact models optimized for computer vision tasks. They are an earlier generation of technology than today’s large language models, but they still power critical applications, from facial recognition to diagnostic imaging systems.
The applications that use these CNNs analyze images and return identified regions of interest. From the outside, we have no visibility into the model’s architecture, training data, or internal weights. We can only submit images and observe outputs: a classic black-box (zero-knowledge) setting.