We formalize and extend existing definitions of backdoor-based watermarks and adversarial defenses as interactive protocols between two players. The existence of these schemes is inherently tied to the learning tasks for which they are designed. Our main result shows that for almost every learning task, at least one of the two (a watermark or an adversarial defense) exists. The term “almost every” indicates that we also identify a third, counterintuitive but necessary option: a scheme we call a transferable attack. By transferable attack, we refer to an efficient algorithm that computes queries indistinguishable from the data distribution and fooling all efficient defenders. We prove the necessity of transferable attacks via a construction that uses a cryptographic tool called homomorphic encryption. Furthermore, we show that any task admitting a transferable attack in our sense implies a cryptographic primitive, and thus requires the underlying task to be computationally complex. Together, these two facts imply an “equivalence” between the existence of transferable attacks and cryptography. Finally, we show that the class of tasks of bounded VC dimension has an adversarial defense, and that a subclass of them has a watermark.
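Informally, the notion of a transferable attack can be sketched as follows (a simplified rendering in my own notation, not the paper's exact definitions; here D is the data distribution, h* the ground-truth labeling, A the attacker, f a defender, and ε, δ illustrative error thresholds):
\[
\exists\ \text{efficient } \mathcal{A}\quad \forall\ \text{efficient } f\ \text{with}\ \Pr_{x\sim D}\!\left[f(x)\neq h^*(x)\right]\le \varepsilon:\qquad
\tilde{x}\leftarrow \mathcal{A}\ \text{ is computationally indistinguishable from }\ x\sim D
\quad\text{and}\quad \Pr\!\left[f(\tilde{x})\neq h^*(\tilde{x})\right]\ \ge\ \delta .
\]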
Unified Taxonomy in AI Safety: Watermarks, Adversarial Defenses, and Transferable Attacks
Grzegorz Gluch, Sai Ganesh Nagarajan, and Berkant Turan
ICML 2024 Workshop on Theoretical Foundations of Foundation Models (TF2M), 2024
As AI becomes omnipresent in today’s world, it is crucial to study the safety aspects of learning, such as guaranteed watermarking capabilities and defenses against adversarial attacks. In prior works, these properties were generally studied separately and empirically, barring a few exceptions. Meanwhile, strong forms of adversarial attacks that are transferable have been developed (empirically) for discriminative DNNs (Liu et al., 2016) and LLMs (Zou et al., 2023). In this ever-evolving landscape of attacks and defenses, we initiate the formal study of watermarks, defenses, and transferable attacks for classification under a unified framework, by having two time-bounded players participate in an interactive protocol. Consequently, we show that for every learning task, at least one of the three schemes exists. Importantly, our results cover regimes where VC theory is not necessarily applicable. Finally, we provide provable examples of the three schemes and show that transferable attacks exist only in regimes beyond bounded VC dimension. The example we give is a nontrivial construction based on cryptographic tools, namely homomorphic encryption.
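To make the two-player framing concrete, here is a minimal toy sketch in Python; the names, the linear defender, and the trivial attacker are my own illustrative assumptions and not the paper's construction (a transferable attack would craft queries that fool every efficient defender, not just one).

```python
# Toy sketch of the interactive setup: a defender fits a classifier on
# samples from the task; an attacker then submits queries, and we measure
# how often the defender errs on them. Illustrative only.
import numpy as np

rng = np.random.default_rng(0)
d = 10

def ground_truth(X):
    # The learning task's target concept: a fixed halfspace.
    return (X.sum(axis=1) > 0).astype(int)

def defender(n_train=1000):
    # Time-bounded player 1: a crude least-squares linear classifier.
    X = rng.normal(size=(n_train, d))
    y = ground_truth(X)
    w, *_ = np.linalg.lstsq(X, 2.0 * y - 1.0, rcond=None)
    return lambda Q: (Q @ w > 0).astype(int)

def attacker(n_queries=500):
    # Time-bounded player 2: here it simply samples from the data
    # distribution, so its queries are trivially indistinguishable from it.
    return rng.normal(size=(n_queries, d))

f = defender()
Q = attacker()
fool_rate = np.mean(f(Q) != ground_truth(Q))
print(f"defender error on attacker's queries: {fool_rate:.2f}")
```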
Interpretability Guarantees with Merlin-Arthur Classifiers
Stephan Wäldchen, Kartikey Sharma, Berkant Turan, Max Zimmer, and 1 more author
In Proceedings of The 27th International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR, 2024
We propose an interactive multi-agent classifier that provides provable interpretability guarantees even for complex agents such as neural networks. These guarantees consist of lower bounds on the mutual information between selected features and the classification decision. Our results are inspired by the Merlin-Arthur protocol from Interactive Proof Systems and express these bounds in terms of measurable metrics such as soundness and completeness. Compared to existing interactive setups, we rely neither on optimal agents nor on the assumption that features are distributed independently. Instead, we use the relative strength of the agents as well as the new concept of Asymmetric Feature Correlation, which captures the precise kind of correlations that make interpretability guarantees difficult. We evaluate our results on two small-scale datasets where high mutual information can be verified explicitly.
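As a heavily simplified illustration of the protocol and of how completeness and soundness can be measured, here is a toy sketch in Python; the data model, the provers' strategies, and Arthur's decision rule are my own assumptions and far simpler than the paper's setting.

```python
# Toy Merlin-Arthur sketch (illustrative only): Merlin selects features to
# certify the true class, Morgana adversarially selects features to mislead,
# and Arthur either classifies from the revealed features or rejects.
import numpy as np

rng = np.random.default_rng(1)
d, n = 8, 2000
X = rng.integers(0, 2, size=(n, d))
y = X[:, 0]                       # the label is carried by feature 0

def arthur(revealed):             # revealed: dict {feature index: value}
    # Classify only if the certifying feature is shown; otherwise reject (-1).
    return revealed.get(0, -1)

def merlin(x):                    # cooperative prover: reveal feature 0
    return {0: x[0]}

def morgana(x):                   # adversarial prover: reveal a useless feature
    return {1: x[1]}

# Completeness: Arthur outputs the true class when Merlin selects features.
completeness = np.mean([arthur(merlin(x)) == yi for x, yi in zip(X, y)])
# Soundness: Arthur is not convinced of the wrong class by Morgana.
soundness = np.mean([arthur(morgana(x)) != 1 - yi for x, yi in zip(X, y)])
print(f"completeness ~ {completeness:.2f}, soundness ~ {soundness:.2f}")
```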
Extending Merlin-Arthur Classifiers for Improved Interpretability
Berkant Turan
In Joint Proceedings of the xAI-2023 Late-breaking Work, Demos and Doctoral Consortium, co-located with the 1st World Conference on eXplainable Artificial Intelligence (xAI-2023), Jul 2023
In my doctoral research, I aim to address the interpretability challenges associated with deep learning by extending the Merlin-Arthur Classifier framework. This novel approach employs a pair of feature selectors, including an adversarial player, to generate informative saliency maps. My research focuses on enhancing the classifier’s performance and exploring its applicability to complex datasets, including a recently established human benchmark for detecting pathologies in X-ray images. Tackling the min-max optimization challenge inherent in the Merlin-Arthur Classifier for high-dimensional data, I will explore and apply diverse stabilization strategies to bolster the framework’s robustness and training stability. Finally, the goal is to expand the framework beyond pixel-level saliency maps to encompass other modalities, such as text and learned feature spaces, fostering a comprehensive understanding of interpretability across various domains and data types.
Robustness of Hybrid Discriminative-Generative Models