The Good, the Bad and the Ugly: Watermarks, Transferable Attacks and Adversarial Defenses
Grzegorz Głuch, Berkant Turan, Sai Ganesh Nagarajan, and Sebastian Pokutta
arXiv preprint arXiv:2410.08864, 2024
We formalize and extend existing definitions of backdoor-based watermarks and adversarial defenses as interactive protocols between two players. The existence of these schemes is inherently tied to the learning tasks for which they are designed. Our main result shows that for almost every learning task, at least one of the two – a watermark or an adversarial defense – exists. The term “almost every” indicates that we also identify a third, counterintuitive but necessary option, i.e., a scheme we call a transferable attack. By transferable attack, we refer to an efficient algorithm computing queries that look indistinguishable from the data distribution and fool all efficient defenders. We prove the necessity of a transferable attack via a construction that uses a cryptographic tool called homomorphic encryption. Furthermore, we show that any task satisfying our notion of a transferable attack implies a cryptographic primitive, thus requiring the underlying task to be computationally complex. These two facts imply an “equivalence” between the existence of transferable attacks and cryptography. Finally, we show that the class of tasks of bounded VC-dimension has an adversarial defense, and a subclass of them has a watermark.
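A purely illustrative sketch of the two-player view (the toy task, names, and scoring below are mine, not the paper's construction) can make the protocol concrete: one player issues queries, the other answers, and a ground-truth oracle scores the exchange.

import random

def run_protocol(prover_queries, defender, oracle, num_rounds=1000):
    # The prover issues (possibly adversarial) queries, the efficient
    # defender answers, and the oracle scores the answers. Watermarks and
    # adversarial defenses correspond to different winners of this game.
    wins = 0
    for _ in range(num_rounds):
        x = prover_queries()
        wins += int(defender(x) == oracle(x))
    return wins / num_rounds

# Toy task: the label is the sign of the coordinate sum.
oracle = lambda x: int(sum(x) > 0)
defender = oracle  # a perfect defender exists for this trivial task
queries = lambda: [random.uniform(-1, 1) for _ in range(8)]
print(run_protocol(queries, defender, oracle))  # ~1.0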
@article{gluch2024_goodbadugly,
  title={The Good, the Bad and the Ugly: Watermarks, Transferable Attacks and Adversarial Defenses},
  author={Głuch, Grzegorz and Turan, Berkant and Nagarajan, Sai Ganesh and Pokutta, Sebastian},
  journal={arXiv preprint arXiv:2410.08864},
  year={2024},
  eprint={2410.08864},
  archiveprefix={arXiv},
  primaryclass={cs.LG},
}
Unified Taxonomy in AI Safety: Watermarks, Adversarial Defenses, and Transferable Attacks
Grzegorz Gluch, Sai Ganesh Nagarajan, and Berkant Turan
ICML 2024 Workshop on Theoretical Foundations of Foundation Models (TF2M), 2024
As AI becomes omnipresent in today’s world, it is crucial to study the safety aspects of learning, such as guaranteed watermarking capabilities and defenses against adversarial attacks. In prior works, these properties were generally studied separately and, barring a few exceptions, empirically. Meanwhile, strong forms of transferable adversarial attacks have been developed (empirically) for discriminative DNNs (Liu et al., 2016) and LLMs (Zou et al., 2023). In this ever-evolving landscape of attacks and defenses, we initiate the formal study of watermarks, defenses, and transferable attacks for classification under a unified framework, by having two time-bounded players participate in an interactive protocol. Consequently, we show that for every learning task, at least one of the three schemes exists. Importantly, our results cover regimes where VC theory is not necessarily applicable. Finally, we provide provable examples of the three schemes and show that transferable attacks exist only in regimes beyond bounded VC dimension. The example we give is a nontrivial construction based on cryptographic tools, namely homomorphic encryption.
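The defining property of a transferable attack, queries that look like genuine data to every efficient observer, can be probed empirically with a distinguisher. Below is a hedged illustration: the Gaussian data and the logistic-regression distinguisher are stand-ins, not the paper's cryptographic construction based on homomorphic encryption.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
real = rng.normal(size=(1000, 16))  # samples from the data distribution
fake = rng.normal(size=(1000, 16))  # attack queries (here perfectly matched)

# Train a distinguisher; its advantage over random guessing measures
# how distinguishable the attack queries are from genuine data.
X = np.vstack([real, fake])
y = np.array([0] * len(real) + [1] * len(fake))
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(f"distinguishing advantage ~ {abs(clf.score(X, y) - 0.5):.3f}")  # close to 0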
@inproceedings{gluch_taxonomy_2024,
  title={Unified {T}axonomy in {AI} {S}afety: {W}atermarks, {A}dversarial {D}efenses, and {T}ransferable {A}ttacks},
  author={Gluch, Grzegorz and Nagarajan, Sai Ganesh and Turan, Berkant},
  booktitle={ICML 2024 Workshop on Theoretical Foundations of Foundation Models (TF2M)},
  year={2024},
}
Interpretability Guarantees with Merlin-Arthur Classifiers
Stephan Wäldchen, Kartikey Sharma, Berkant Turan, Max Zimmer, and Sebastian Pokutta
In Proceedings of The 27th International Conference on Artificial Intelligence and Statistics (AISTATS), PMLR, 2024
We propose an interactive multi-agent classifier that provides provable interpretability guarantees even for complex agents such as neural networks. These guarantees consist of lower bounds on the mutual information between selected features and the classification decision. Our results are inspired by the Merlin-Arthur protocol from Interactive Proof Systems and express these bounds in terms of measurable metrics such as soundness and completeness. Compared to existing interactive setups, we rely neither on optimal agents nor on the assumption that features are distributed independently. Instead, we use the relative strength of the agents as well as the new concept of Asymmetric Feature Correlation, which captures the precise kind of correlations that make interpretability guarantees difficult. We evaluate our results on two small-scale datasets where high mutual information can be verified explicitly.
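A toy instantiation, purely for intuition (the hand-coded agents below stand in for the paper's neural networks): Merlin cooperatively reveals a feature that certifies the label, Morgana adversarially tries to certify the wrong one, and Arthur may abstain. Completeness and soundness are then empirical frequencies.

import random

def sample():
    # Data: six random bits; the label is the first bit.
    x = [random.randint(0, 1) for _ in range(6)]
    return x, x[0]

def arthur(feature):
    # Classifies from a single revealed feature; None means "I don't know".
    idx, val = feature
    return val if idx == 0 else None

def merlin(x, y):
    return (0, x[0])  # cooperative: reveal the informative bit

def morgana(x, y):
    return (1, x[1])  # adversarial: only true features of x are available

N = 10_000
comp = sound = 0
for _ in range(N):
    x, y = sample()
    comp += int(arthur(merlin(x, y)) == y)        # Merlin convinces Arthur
    sound += int(arthur(morgana(x, y)) != 1 - y)  # Morgana fails to mislead
print(f"completeness ~ {comp / N:.2f}, soundness ~ {sound / N:.2f}")  # both 1.00

The paper's guarantees turn such measurable completeness and soundness, together with the Asymmetric Feature Correlation of the task, into lower bounds on the mutual information between the selected features and the class.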
@inproceedings{pmlr-v238-waldchen24a,
  title={Interpretability Guarantees with {M}erlin-{A}rthur Classifiers},
  author={W\"{a}ldchen, Stephan and Sharma, Kartikey and Turan, Berkant and Zimmer, Max and Pokutta, Sebastian},
  booktitle={Proceedings of The 27th International Conference on Artificial Intelligence and Statistics (AISTATS)},
  pages={1963--1971},
  year={2024},
  editor={Dasgupta, Sanjoy and Mandt, Stephan and Li, Yingzhen},
  volume={238},
  series={Proceedings of Machine Learning Research},
  publisher={PMLR},
}
Extending Merlin-Arthur Classifiers for Improved Interpretability
Berkant Turan
In Joint Proceedings of the xAI-2023 Late-breaking Work, Demos and Doctoral Consortium, co-located with the 1st World Conference on eXplainable Artificial Intelligence (xAI-2023), Jul 2023
In my doctoral research, I aim to address the interpretability challenges associated with deep learning by extending the Merlin-Arthur Classifier framework. This novel approach employs a pair of feature selectors, including an adversarial player, to generate informative saliency maps. My research focuses on enhancing the classifier’s performance and exploring its applicability to complex datasets, including a recently established human benchmark for detecting pathologies in X-ray images. Tackling the min-max optimization challenge inherent in the Merlin-Arthur Classifier for high-dimensional data, I will explore and apply diverse stabilization strategies to bolster the framework’s robustness and training stability (a schematic form of this objective follows below). Finally, the goal is to expand the framework beyond pixel-level saliency maps to encompass further modalities such as text and learned feature spaces, fostering a comprehensive understanding of interpretability across various domains and data types.
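Schematically, the min-max optimization referred to above couples Arthur to both feature selectors; the notation below is illustrative, not the thesis's. With Arthur $A_\theta$, the cooperative Merlin $M$, and the adversarial Morgana $\widetilde{M}$:

\[
\min_{\theta}\; \mathbb{E}_{(x,y)}\Big[\, L\big(A_\theta(M(x)), y\big) \;+\; \max_{\widetilde{M}}\, L'\big(A_\theta(\widetilde{M}(x)), y\big) \Big],
\]

where $L$ is a standard classification loss on Merlin's features and $L'$ penalizes Arthur for being convinced of a wrong class, rather than abstaining, on Morgana's features. The inner maximization is what makes training unstable in high dimensions.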
@inproceedings{TuranConsortium2023,
  title={Extending Merlin-Arthur Classifiers for Improved Interpretability},
  author={Turan, Berkant},
  booktitle={Joint Proceedings of the xAI-2023 Late-breaking Work, Demos and Doctoral Consortium, co-located with the 1st World Conference on eXplainable Artificial Intelligence (xAI-2023)},
  pages={193--200},
  year={2023},
  month=jul,
  address={Lisbon, Portugal},
  organization={Springer},
}
Robustness of Hybrid Discriminative-Generative Models
Berkant Turan
Master's thesis, Technical University of Berlin, 2022
@mastersthesis{Turan2022Msc,
  title={Robustness of Hybrid Discriminative-Generative Models},
  author={Turan, Berkant},
  school={Technical University of Berlin},
  year={2022},
  type={Master's thesis},
}
Modeling and Simulation of Convective Flows in the Outer Earth’s Core using the Finite Element Method
Berkant Turan
Bachelor's thesis, Technical University of Berlin, Oct 2018
@mastersthesis{Turan2018Bsc,
  title={Modeling and Simulation of Convective Flows in the Outer Earth's Core using the Finite Element Method},
  author={Turan, Berkant},
  school={Technical University of Berlin},
  year={2018},
  month=oct,
  type={Bachelor's thesis},
}