Adversarial Transferability: Why Black-Box Attacks Work at All

Adversarial examples transfer across models with different architectures and training sets. Understanding why changes what you think defenses need to accomplish.

By Marcus Reyes · 8 min read

The most disorienting fact about adversarial examples is that they transfer. An adversarial image crafted to fool a ResNet-50 trained on ImageNet frequently fools a VGG-19 trained on the same data — even though the two models have completely different architectures, parameter counts, and learned representations. This property, called transferability, is what makes black-box adversarial attacks practical against models you can’t directly inspect.

Understanding transferability isn’t just theoretically interesting. It determines whether gradient-based attacks are a real threat to production systems with no gradient access, and it shapes what defenses actually need to block.

The discovery: Szegedy et al. and the first hint

The transferability phenomenon was first documented by Szegedy et al. in 2014 (arXiv:1312.6199), the paper that introduced adversarial examples. They noted in passing that perturbations found by box-constrained optimization against one network often caused misclassification in a second network trained separately on the same data.

Goodfellow et al. (2015, arXiv:1412.6572) formalized the Fast Gradient Sign Method and observed that the resulting perturbations transferred across architectures at non-trivial rates. But it was Papernot et al. who turned transferability into an explicit attack strategy, first characterizing it systematically (arXiv:1605.07277) and then demonstrating it against deployed services (arXiv:1602.02697).

Papernot et al.: the substitute model attack

Papernot et al.'s "Practical Black-Box Attacks Against Machine Learning" (arXiv:1602.02697) laid out the operational use of transferability. The attack works in three phases:

Phase 1: Build a substitute model. The adversary queries the target API with synthetic or natural inputs and collects the predictions (just the labels, not confidence scores). Using these labeled examples, they train a substitute model f_sub to mimic the target’s decision boundary.

Phase 2: Craft adversarial examples on the substitute. With white-box access to f_sub, the adversary runs FGSM, PGD, or C&W to generate adversarial examples.

Phase 3: Transfer to the target. The crafted perturbations are fed to the target API. Due to transferability, a substantial fraction fool the target model, even though it was never directly attacked.
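
A minimal sketch of the three phases in PyTorch. The substitute architecture, the query set, and the `target_api` callable (which returns hard labels) are placeholders, not Papernot et al.'s exact setup:

```python
import torch
import torch.nn.functional as F

def train_substitute(f_sub, queries, target_api, epochs=10, lr=1e-3):
    # Phase 1: fit the substitute to the target's hard-label outputs.
    labels = target_api(queries)
    opt = torch.optim.Adam(f_sub.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        F.cross_entropy(f_sub(queries), labels).backward()
        opt.step()
    return f_sub

def fgsm(f_sub, x, y, eps=8 / 255):
    # Phase 2: white-box FGSM on the substitute.
    x = x.clone().requires_grad_(True)
    F.cross_entropy(f_sub(x), y).backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()

def transfer_success_rate(target_api, x_adv, y):
    # Phase 3: fraction of crafted examples that fool the target.
    return (target_api(x_adv) != y).float().mean().item()
```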

In their original evaluation, Papernot et al. reported misclassification rates of roughly 84-96% against remotely hosted models from MetaMind, Amazon, and Google using this procedure. The substitute model only needed to approximate the target's decision boundary in the relevant region of input space, not globally.

Why do adversarial examples transfer?

Several explanations have been proposed, none of them complete:

Linear regions and shared decision boundaries. Goodfellow et al.'s original explanation for why adversarial examples exist at all is that in high dimensions, many tiny per-coordinate perturbations aligned with the gradient accumulate into a large swing in the loss. That account predicts transferability as a byproduct: if classifiers trained on similar data carve input space into similar linear regions, adversarial examples that exploit the gradient direction of those regions will fool multiple classifiers.
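
A toy numeric check of that argument, with synthetic weights standing in for a real model: an epsilon-sized step in the sign-of-gradient direction shifts a linear score by epsilon times the L1 norm of the weights, which grows linearly with input dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 8 / 255  # imperceptible per-pixel budget
for d in (100, 10_000, 1_000_000):
    w = rng.normal(size=d)   # weights of a toy linear score w . x
    eta = eps * np.sign(w)   # worst-case L-infinity perturbation
    print(d, w @ eta)        # equals eps * ||w||_1, linear in d
```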

Shared invariances from shared training data. Models trained on the same distribution learn similar spurious features. Adversarial examples often exploit these spurious features (texture statistics, low-level frequency components) rather than semantically meaningful ones. If the spurious feature is shared across models, the perturbation transfers.

The input manifold hypothesis. Adversarial examples may be “off-manifold” points that happen to fall in regions where the classifier’s locally linear approximation misleads it. If the shape of the off-manifold region is determined by the data distribution rather than the specific model, perturbations will transfer across models trained on that distribution.

Tramèr et al.'s "Ensemble Adversarial Training" paper (arXiv:1705.07204) showed empirically that diversity of training data and architectures reduces but does not eliminate transfer. This suggests transferability is partly about shared data structure and partly about architectural similarity.

Transfer rates in practice

The success rate of transferred attacks varies considerably by:

Source and target architecture similarity. Attacks crafted on one ResNet transfer better to another ResNet than to a Vision Transformer. Similar inductive biases produce more similar decision boundaries.

Attack strength. Larger perturbation budgets (bigger epsilon) generally transfer at higher rates, at the cost of perceptibility. More iterations are less clear-cut: plain iterative attacks like PGD can overfit the source model's decision boundary and transfer worse than single-step FGSM, which is why momentum-based variants (MI-FGSM, Dong et al., arXiv:1710.06081) were introduced specifically to stabilize gradient directions and improve transfer.

Input normalization and preprocessing. Target models with different input preprocessing pipelines see a different perturbation than was crafted, reducing transfer rates. This is why some defenses that process inputs (JPEG compression, input smoothing) provide partial resistance to transferred attacks even though they don’t resist adaptive white-box attacks.

Ensemble-based crafting. Crafting adversarial examples on an ensemble of diverse substitute models substantially improves transfer rates. Liu et al. (arXiv:1611.02770) showed that attacking an ensemble of models and taking the common adversarial direction produces perturbations that transfer more reliably than single-model attacks.
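
A hedged sketch of the ensemble idea, using a single FGSM step on a loss averaged across substitute models (Liu et al. also explore weighted logit averaging and iterative variants):

```python
import torch
import torch.nn.functional as F

def ensemble_fgsm(models, x, y, eps=16 / 255):
    # Average the loss across substitute models so the gradient points in
    # a direction adversarial to all of them, not just one.
    x = x.clone().requires_grad_(True)
    loss = sum(F.cross_entropy(m(x), y) for m in models) / len(models)
    loss.backward()
    return (x + eps * x.grad.sign()).clamp(0, 1).detach()
```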

State-of-the-art black-box transfer attacks against ImageNet classifiers achieve 60-80% top-1 attack success with L-infinity epsilon = 16/255, transferring from a ResNet ensemble to diverse target architectures. For comparison, white-box PGD achieves >99% at the same epsilon.

Decision-based and score-based black-box attacks

Transferability isn't the only route to black-box adversarial examples. Two query-based alternatives require no substitute model at all:

Score-based attacks use the full confidence vector from the target API to estimate the gradient numerically. NES (Natural Evolution Strategies, Ilyas et al., arXiv:1804.08598) estimates the gradient by sampling random directions and measuring the change in loss. With enough queries, this produces adversarial examples with similar fidelity to white-box attacks, at the cost of thousands of queries per example.
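
A sketch of the NES-style estimator with antithetic sampling; `loss_fn` is a placeholder that queries the target for confidence scores, so each estimate costs 2n queries:

```python
import torch

def nes_gradient(loss_fn, x, sigma=1e-3, n=50):
    # Each antithetic pair (x + sigma*d, x - sigma*d) contributes a
    # finite-difference estimate of the loss gradient along direction d.
    g = torch.zeros_like(x)
    for _ in range(n):
        d = torch.randn_like(x)
        g += d * (loss_fn(x + sigma * d) - loss_fn(x - sigma * d))
    return g / (2 * sigma * n)
```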

Decision-based attacks use only the final hard-label output, with no confidence scores. Brendel et al.'s Boundary Attack (arXiv:1712.04248) starts from a large adversarial perturbation and takes random steps along the decision boundary to shrink the perturbation's magnitude. HopSkipJump (Chen et al., arXiv:1904.02144) substantially improved query efficiency by estimating the gradient direction at the boundary. Decision-based attacks need more queries than score-based ones but work against APIs that return only labels.
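
A simplified version of one Boundary Attack step; the real attack adapts its step sizes, and `is_adversarial` is a placeholder hard-label oracle:

```python
import torch

def boundary_step(x_adv, x_orig, is_adversarial, contract=0.01, noise=0.01):
    # Jitter with random noise, then contract toward the original image;
    # accept the move only if the candidate is still misclassified.
    cand = x_adv + noise * torch.randn_like(x_adv)
    cand = (x_orig + (1 - contract) * (cand - x_orig)).clamp(0, 1)
    return cand if is_adversarial(cand) else x_adv
```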

The practical tradeoff: transfer attacks are cheap (no queries to the target during crafting, only at evaluation) but have lower success rates. Score-based and decision-based attacks are expensive in queries but achieve higher fidelity. Query budgets and rate limiting determine which is viable against a specific target.

Transfer as a signal for defense evaluation

Transferability has an uncomfortable implication for defense evaluations: a defense can show zero transfer attack success while being trivially broken by an adaptive white-box attacker. Defenses that process the input before classification (JPEG compression, feature squeezing, randomization) may block transferred attacks because the preprocessing destroys the perturbation that was crafted on a different model. But adaptive attackers who know the preprocessing can craft through it.
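
For concreteness, preprocessing of this kind can be as simple as a lossy re-encode before classification. A sketch using Pillow, with an illustrative quality setting:

```python
import io
import numpy as np
from PIL import Image

def jpeg_roundtrip(x, quality=75):
    # x: HxWx3 float array in [0, 1]. The lossy round-trip discards the
    # high-frequency components many transferred perturbations live in.
    img = Image.fromarray((x * 255).astype(np.uint8))
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    return np.asarray(Image.open(buf), dtype=np.float32) / 255.0
```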

Athalye et al.’s “Obfuscated Gradients” (arXiv:1802.00420) documented exactly this: several defenses appeared robust against transferred attacks and even against non-adaptive white-box attacks, but fell to adaptive attackers who accounted for the defense in their gradient computation. Low transfer rate is not evidence of genuine robustness.

This is why rigorous evaluations like RobustBench use adaptive white-box attacks (AutoAttack) as the benchmark, not transfer attacks. Standardized adaptive evaluation results are aggregated at aisecbench.com alongside certified robustness numbers, giving a common basis for comparing defenses without the ambiguity of transfer-rate comparisons.

What defenses need to achieve

Against black-box transfer attacks, defenses that add unpredictability (randomized preprocessing, model ensembles, input transformations) provide genuine resistance because they prevent the adversary from crafting on a reliable substitute. The defense doesn’t need to be robust in a formal sense — it just needs to be diverse enough that a fixed perturbation misses.
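
A sketch of the randomization idea in the style of random resize-and-pad defenses; the dimensions are illustrative:

```python
import random
import torch
import torch.nn.functional as F

def random_resize_pad(x, out=331, low=299):
    # x: NxCxHxW batch. Each query sees a different resize and offset, so a
    # fixed transferred perturbation rarely lines up with what the model sees.
    s = random.randint(low, out)
    x = F.interpolate(x, size=(s, s), mode="bilinear", align_corners=False)
    left, top = random.randint(0, out - s), random.randint(0, out - s)
    return F.pad(x, (left, out - s - left, top, out - s - top))
```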

Against adaptive white-box attacks (the correct benchmark), those same defenses fail. Genuine robustness requires adversarial training or certified methods. The practical guidance is to use input-transformation defenses as a speed bump against opportunistic transfer attacks while relying on adversarially trained models for the threat model that actually matters. Deployment patterns for layering both are covered at aidefense.dev.

The transferability literature has also driven advances in understanding why neural networks are vulnerable at all. The best current account — that models learn non-robust features that generalize well but are sensitive to small perturbations (Ilyas et al., arXiv:1905.02175) — predicts that transferability is a property of the data distribution, not just the models. If the features that enable transfer attacks are features the model legitimately uses for classification, removing transferability vulnerability may require changing what features the model is allowed to learn.

#transferability #black-box-attacks #adversarial-examples #evasion #adversarial-ml #ml-security