Watermarks, Adversarial Defenses & Transferable Attacks

As AI systems become more pervasive, the question of when they can be secured — and when they fundamentally cannot — becomes critical. This project develops a unified theoretical framework for watermarking, adversarial defenses, and transferable attacks.

We model each scheme as an interactive protocol between two time-bounded players. Our main result shows that for almost every learning task, at least one of two schemes must exist: a watermark or an adversarial defense. The “almost every” qualifier is key: we also identify a third, counterintuitive regime where neither watermarks nor defenses exist, but efficient transferable attacks do. We prove that this regime is non-empty via a construction using homomorphic encryption, establishing a formal equivalence between transferable attacks and cryptographic hardness.

Papers:

The Good, the Bad and the Ugly: Watermarks, Transferable Attacks and Adversarial Defenses — NeurIPS 2025
Unified Taxonomy in AI Safety: Watermarks, Adversarial Defenses, and Transferable Attacks — ICML 2024 Workshop