References

Abadi et al., 2016

Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., & Zhang, L. (2016). Deep learning with differential privacy. ACM SIGSAC Conference on Computer and Communications Security.

Abnar & Zuidema, 2020

Abnar, S., & Zuidema, W. (2020). Quantifying attention flow in transformers. Annual Meeting of the Association for Computational Linguistics.

Achiam et al., 2023

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., … others. (2023). Gpt-4 technical report. arXiv preprint arXiv:2303.08774.

Adi et al., 2018

Adi, Y., Baum, C., Cisse, M., Pinkas, B., & Keshet, J. (2018). Turning your weakness into a strength: watermarking deep neural networks by backdooring. USENIX Security Symposium.

Alayrac et al., 2019

Alayrac, J.-B., Uesato, J., Huang, P.-S., Fawzi, A., Stanforth, R., & Kohli, P. (2019). Are labels required for improving adversarial robustness? Advances in Neural Information Processing Systems.

An et al., 2024

An, S., Chou, S.-Y., Zhang, K., Xu, Q., Tao, G., Shen, G., … others. (2024). Elijah: eliminating backdoors injected in diffusion models via distribution shift. AAAI Conference on Artificial Intelligence.

Anderberg et al., 2024

Anderberg, A., Bailey, J., Campello, R. J., Houle, M. E., Marques, H. O., Radovanović, M., & Zimek, A. (2024). Dimensionality-aware outlier detection: theoretical and experimental analysis. SIAM International Conference on Data Mining.

Andreina et al., 2021

Andreina, S., Marson, G. A., Möllering, H., & Karame, G. (2021). Baffle: backdoor detection via feedback-based federated learning. International Conference on Distributed Computing Systems.

Andriushchenko et al., 2020

Andriushchenko, M., Croce, F., Flammarion, N., & Hein, M. (2020). Square attack: a query-efficient black-box adversarial attack via random search. European Conference on Computer Vision (pp. 484–501).

Athalye et al., 2018a

Athalye, A., Carlini, N., & Wagner, D. (2018). Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. International Conference on Machine Learning (pp. 274–283).

Athalye et al., 2018b

Athalye, A., Engstrom, L., Ilyas, A., & Kwok, K. (2018). Synthesizing robust adversarial examples. International Conference on Machine Learning (pp. 284–293).

Bagdasaryan et al., 2020

Bagdasaryan, E., Veit, A., Hua, Y., Estrin, D., & Shmatikov, V. (2020). How to backdoor federated learning. International Conference on Artificial Intelligence and Statistics (pp. 2938–2948).

Bai et al., 2024

Bai, Y., Pei, G., Gu, J., Yang, Y., & Ma, X. (2024). Special characters attack: toward scalable training data extraction from large language models. arXiv preprint arXiv:2405.05990.

Bai et al., 2020

Bai, Y., Zeng, Y., Jiang, Y., Xia, S.-T., Ma, X., & Wang, Y. (2020). Improving adversarial robustness via channel-wise activation suppressing. International Conference on Learning Representations.

Bai et al., 2022a

Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., … others. (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.

Bai et al., 2022b

Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., … others. (2022). Constitutional ai: harmlessness from ai feedback. arXiv preprint arXiv:2212.08073.

Ban & Dong, 2022

Ban, Y., & Dong, Y. (2022). Pre-trained adversarial perturbations. Advances in Neural Information Processing Systems.

Bansal et al., 2023

Bansal, H., Singhi, N., Yang, Y., Yin, F., Grover, A., & Chang, K.-W. (2023). Cleanclip: mitigating data poisoning attacks in multimodal contrastive learning. IEEE/CVF International Conference on Computer Vision.

Bao et al., 2023

Bao, F., Nie, S., Xue, K., Li, C., Pu, S., Wang, Y., … Zhu, J. (2023). One transformer fits all distributions in multi-modal diffusion at scale. International Conference on Machine Learning (pp. 1692–1717).

Barreno et al., 2006

Barreno, M., Nelson, B., Sears, R., Joseph, A. D., & Tygar, J. D. (2006). Can machine learning be secure? ACM Symposium on Information, Computer and Communications Security (pp. 16–25).

Bendale & Boult, 2016

Bendale, A., & Boult, T. E. (2016). Towards open set deep networks. IEEE Conference on Computer Vision and Pattern Recognition (pp. 1563–1572).

Biggio et al., 2013

Biggio, B., Corona, I., Maiorca, D., Nelson, B., Šrndić, N., Laskov, P., … Roli, F. (2013). Evasion attacks against machine learning at test time. Joint European Conference on Machine Learning and Knowledge Discovery in Databases (pp. 387–402).

Biggio et al., 2012

Biggio, B., Nelson, B., & Laskov, P. (2012). Poisoning attacks against support vector machines. International Conference on Machine Learning (pp. 1467–1474).

Blanchard et al., 2017

Blanchard, P., El Mhamdi, E. M., Guerraoui, R., & Stainer, J. (2017). Machine learning with adversaries: byzantine tolerant gradient descent. Advances in Neural Information Processing Systems.

Brendel et al., 2018

Brendel, W., Rauber, J., & Bethge, M. (2018). Decision-based adversarial attacks: reliable attacks against black-box machine learning models. International Conference on Learning Representations.

Brown et al., 2020

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., … others. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems.

Brown et al., 2017

Brown, T. B., Mané, D., Roy, A., Abadi, M., & Gilmer, J. (2017). Adversarial patch.

Cai et al., 2018

Cai, Q.-Z., Liu, C., & Song, D. (2018). Curriculum adversarial training. International Joint Conference on Artificial Intelligence (pp. 3740–3747).

Cao et al., 2021a

Cao, X., Jia, J., & Gong, N. Z. (2021). Ipguard: protecting intellectual property of deep neural networks via fingerprinting the classification boundary. ACM Asia Conference on Computer and Communications Security.

Cao et al., 2021b

Cao, Y., Wang, N., Xiao, C., Yang, D., Fang, J., Yang, R., … Li, B. (2021). Invisible for both camera and lidar: security of multi-sensor fusion based perception in autonomous driving under physical-world attacks. IEEE Symposium on Security and Privacy.

Carlini et al., 2022

Carlini, N., Chien, S., Nasr, M., Song, S., Terzis, A., & Tramer, F. (2022). Membership inference attacks from first principles. IEEE Symposium on Security and Privacy.

Carlini et al., 2023a

Carlini, N., Ippolito, D., Jagielski, M., Lee, K., Tramer, F., & Zhang, C. (2023). Quantifying memorization across neural language models. International Conference on Learning Representations.

Carlini et al., 2023b

Carlini, N., Nasr, M., Choquette-Choo, C. A., Jagielski, M., Gao, I., Awadalla, A., … others. (2023). Are aligned neural networks adversarially aligned? arXiv preprint arXiv:2306.15447.

Carlini & Terzis, 2021

Carlini, N., & Terzis, A. (2021). Poisoning and backdooring contrastive learning. arXiv preprint arXiv:2106.09667.

Carlini et al., 2021

Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., … others. (2021). Extracting training data from large language models. USENIX Security Symposium (pp. 2633–2650).

Carlini & Wagner, 2017a

Carlini, N., & Wagner, D. (2017). Adversarial examples are not easily detected: bypassing ten detection methods. ACM Workshop on Artificial Intelligence and Security (pp. 3–14).

Carlini & Wagner, 2017b

Carlini, N., & Wagner, D. (2017). Magnet and "efficient defenses against adversarial attacks" are not robust to adversarial examples. arXiv preprint arXiv:1711.08478.

Carlini & Wagner, 2017c

Carlini, N., & Wagner, D. (2017). Towards evaluating the robustness of neural networks. IEEE Symposium on Security and Privacy (pp. 39–57).

Carlini et al., 2023c

Carlini, N., Hayes, J., Nasr, M., Jagielski, M., Sehwag, V., Tramer, F., … Wallace, E. (2023). Extracting training data from diffusion models. USENIX Security Symposium.

Carmon et al., 2019

Carmon, Y., Raghunathan, A., Schmidt, L., Duchi, J. C., & Liang, P. S. (2019). Unlabeled data improves adversarial robustness. Advances in Neural Information Processing Systems.

Chan et al., 2022

Chan, S.-H., Dong, Y., Zhu, J., Zhang, X., & Zhou, J. (2022). BadDet: backdoor attacks on object detection.

Chang et al., 2000

Chang, S. G., Yu, B., & Vetterli, M. (2000). Adaptive wavelet thresholding for image denoising and compression. IEEE Transactions on Image Processing, 9(9), 1532–1546.

Chao et al., 2023

Chao, P., Robey, A., Dobriban, E., Hassani, H., Pappas, G. J., & Wong, E. (2023). Jailbreaking black box large language models in twenty queries. arXiv preprint arXiv:2310.08419.

Chaudhuri & Monteleoni, 2008

Chaudhuri, K., & Monteleoni, C. (2008). Privacy-preserving logistic regression. Advances in Neural Information Processing Systems.

Chen et al., 2018

Chen, B., Carvalho, W., Baracaldo, N., Ludwig, H., Edwards, B., Lee, T., … Srivastava, B. (2018). Detecting backdoor attacks on deep neural networks by activation clustering.

Chen et al., 2019

Chen, H., Fu, C., Zhao, J., & Koushanfar, F. (2019). Deepinspect: a black-box trojan detection and mitigation framework for deep neural networks. International Joint Conference on Artificial Intelligence (pp. 4658–4664).

Chen et al., 2022a

Chen, J., Wang, J., Peng, T., Sun, Y., Cheng, P., Ji, S., … Song, D. (2022). Copy, right? a testing framework for copyright protection of deep learning models. IEEE Symposium on Security and Privacy.

Chen et al., 2017a

Chen, P.-Y., Zhang, H., Sharma, Y., Yi, J., & Hsieh, C.-J. (2017). Zoo: zeroth order optimization based black-box attacks to deep neural networks without training substitute models. ACM Workshop on Artificial Intelligence and Security (pp. 15–26).

Chen et al., 2022b

Chen, S., Liu, C., Haque, M., Song, Z., & Yang, W. (2022). Nmtsloth: understanding and testing efficiency degradation of neural machine translation systems. ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (pp. 1148–1160).

Chen et al., 2021

Chen, T., Zhang, Z., Liu, S., Chang, S., & Wang, Z. (2021). Robust overfitting may be mitigated by properly learned smoothening. International Conference on Learning Representations.

Chen et al., 2020

Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A simple framework for contrastive learning of visual representations. International Conference on Machine Learning.

Chen et al., 2023

Chen, W., Song, D., & Li, B. (2023). Trojdiff: trojan attacks on diffusion models with diverse targets. IEEE/CVF Conference on Computer Vision and Pattern Recognition.

Chen et al., 2017b

Chen, X., Liu, C., Li, B., Lu, K., & Song, D. (2017). Targeted backdoor attacks on deep learning systems using data poisoning.

Chen et al., 2024

Chen, Y., Ma, X., Zou, D., & Jiang, Y.-G. (2024). Extracting training data from unconditional diffusion models. arXiv preprint arXiv:2406.12752.

Cheng et al., 2019

Cheng, M., Le, T., Chen, P.-Y., Zhang, H., Yi, J., & Hsieh, C.-J. (2019). Query-efficient hard-label black-box attack: an optimization-based approach. International Conference on Learning Representations.

Chiang et al., 2023

Chiang, W.-L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., … others. (2023). Vicuna: an open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023).

Chou et al., 2023

Chou, S.-Y., Chen, P.-Y., & Ho, T.-Y. (2023). How to backdoor diffusion models? IEEE/CVF Conference on Computer Vision and Pattern Recognition.

Chowdhery et al., 2023

Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., … others. (2023). Palm: scaling language modeling with pathways. Journal of Machine Learning Research, 24(240), 1–113.

Christiano et al., 2017

Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., & Amodei, D. (2017). Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems.

Clevert et al., 2016

Clevert, D.-A., Unterthiner, T., & Hochreiter, S. (2016). Fast and accurate deep network learning by exponential linear units (elus). International Conference on Learning Representations.

Cordts et al., 2016

Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., … Schiele, B. (2016). The cityscapes dataset for semantic urban scene understanding. IEEE/CVF Conference on Computer Vision and Pattern Recognition.

Croce & Hein, 2020a

Croce, F., & Hein, M. (2020). Minimally distorted adversarial examples with a fast adaptive boundary attack. International Conference on Machine Learning (pp. 2196–2205).

Croce & Hein, 2020b

Croce, F., & Hein, M. (2020). Reliable evaluation of adversarial robustness with an ensemble of diverse parameter-free attacks. International Conference on Machine Learning (pp. 2206–2216).

Crowson, 2022

Crowson, K. (2022). K-Diffusion.

Dai et al., 2023

Dai, W., Li, J., Li, D., Tiong, A. M. H., Zhao, J., Wang, W., … Hoi, S. (2023). InstructBLIP: towards general-purpose vision-language models with instruction tuning.

Darvish Rouhani et al., 2019

Darvish Rouhani, B., Chen, H., & Koushanfar, F. (2019). Deepsigns: an end-to-end watermarking framework for ownership protection of deep neural networks. International Conference on Architectural Support for Programming Languages and Operating Systems.

Das et al., 2017

Das, N., Shanbhogue, M., Chen, S.-T., Hohman, F., Chen, L., Kounavis, M. E., & Chau, D. H. (2017). Keeping the bad guys out: Protecting and vaccinating deep learning with jpeg compression.

Deng et al., 2009

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). Imagenet: a large-scale hierarchical image database. IEEE Conference on Computer Vision and Pattern Recognition.

Devlin et al., 2018

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Ding et al., 2019

Ding, G. W., Sharma, Y., Lui, K. Y. C., & Huang, R. (2019). Mma training: direct input space margin maximization through adversarial training. International Conference on Learning Representations.

Doan et al., 2023

Doan, K. D., Lao, Y., Yang, P., & Li, P. (2023). Defending backdoor attacks on vision transformer via patch processing. AAAI Conference on Artificial Intelligence.

Dong et al., 2020

Dong, Y., Deng, Z., Pang, T., Zhu, J., & Su, H. (2020). Adversarial distributional training for robust deep learning. Advances in Neural Information Processing Systems (pp. 8270–8283).

Dong et al., 2018

Dong, Y., Liao, F., Pang, T., Su, H., Zhu, J., Hu, X., & Li, J. (2018). Boosting adversarial attacks with momentum. IEEE Conference on Computer Vision and Pattern Recognition (pp. 9185–9193).

Dosovitskiy et al., 2021

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., … others. (2021). An image is worth 16x16 words: transformers for image recognition at scale. International Conference on Learning Representations.

Duan et al., 2023

Duan, J., Kong, F., Wang, S., Shi, X., & Xu, K. (2023). Are diffusion models vulnerable to membership inference attacks? International Conference on Machine Learning.

Duan et al., 2020

Duan, R., Ma, X., Wang, Y., Bailey, J., Qin, A. K., & Yang, Y. (2020). Adversarial camouflage: hiding physical-world attacks with natural styles. IEEE Conference on Computer Vision and Pattern Recognition (pp. 1000–1008).

Dwork et al., 2006

Dwork, C., McSherry, F., Nissim, K., & Smith, A. (2006). Calibrating noise to sensitivity in private data analysis. Theory of Cryptography Conference.

Ebrahimi et al., 2018

Ebrahimi, J., Rao, A., Lowd, D., & Dou, D. (2018). Hotflip: white-box adversarial examples for text classification. Annual Meeting of the Association for Computational Linguistics (pp. 31–36).

Elman, 1990

Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179–211.

Eykholt et al., 2018

Eykholt, K., Evtimov, I., Fernandes, E., Li, B., Rahmati, A., Xiao, C., … Song, D. (2018). Robust physical-world attacks on deep learning visual classification. IEEE Conference on Computer Vision and Pattern Recognition.

Feinman et al., 2017

Feinman, R., Curtin, R. R., Shintre, S., & Gardner, A. B. (2017). Detecting adversarial samples from artifacts.

Feng et al., 2019

Feng, J., Cai, Q.-Z., & Zhou, Z.-H. (2019). Learning to confuse: generating training time adversarial data with auto-encoder. Advances in Neural Information Processing Systems.

Fredrikson et al., 2015

Fredrikson, M., Jha, S., & Ristenpart, T. (2015). Model inversion attacks that exploit confidence information and basic countermeasures. ACM SIGSAC Conference on Computer and Communications Security (pp. 1322–1333).

Frosst et al., 2019

Frosst, N., Papernot, N., & Hinton, G. (2019). Analyzing and improving representations with the soft nearest neighbor loss. International Conference on Machine Learning.

Fu et al., 2022

Fu, Y., Zhang, S., Wu, S., Wan, C., & Lin, Y. (2022). Patch-fool: are vision transformers always robust against adversarial perturbations? International Conference on Learning Representations.

Fung et al., 2018

Fung, C., Yoon, C. J., & Beschastnikh, I. (2018). Mitigating sybils in federated learning poisoning.

Gailly & Adler, 2004

Gailly, J.-l., & Adler, M. (2004). Zlib compression library.

Gal & Ghahramani, 2016

Gal, Y., & Ghahramani, Z. (2016). A theoretically grounded application of dropout in recurrent neural networks. Advances in Neural Information Processing Systems.

Gan et al., 2020

Gan, Z., Chen, Y.-C., Li, L., Zhu, C., Cheng, Y., & Liu, J. (2020). Large-scale adversarial training for vision-and-language representation learning. Advances in Neural Information Processing Systems (pp. 6616–6628).

Geiping et al., 2021

Geiping, J., Fowl, L. H., Huang, W. R., Czaja, W., Taylor, G., Moeller, M., & Goldstein, T. (2021). Witches' brew: industrial scale data poisoning via gradient matching. International Conference on Learning Representations.

Goldblum et al., 2020

Goldblum, M., Fowl, L., Feizi, S., & Goldstein, T. (2020). Adversarially robust distillation. AAAI Conference on Artificial Intelligence (pp. 3996–4003).

Gong et al., 2023

Gong, Y., Ran, D., Liu, J., Wang, C., Cong, T., Wang, A., … Wang, X. (2023). Figstep: jailbreaking large vision-language models via typographic visual prompts. arXiv preprint arXiv:2311.05608.

Gong et al., 2017

Gong, Z., Wang, W., & Ku, W.-S. (2017). Adversarial and clean data are not twins.

Goodfellow et al., 2015

Goodfellow, I. J., Shlens, J., & Szegedy, C. (2015). Explaining and harnessing adversarial examples. International Conference on Learning Representations.

Gowal et al., 2021

Gowal, S., Rebuffi, S.-A., Wiles, O., Stimberg, F., Calian, D. A., & Mann, T. A. (2021). Improving robustness using generated data. Advances in Neural Information Processing Systems.

Goyal et al., 2020

Goyal, S., Choudhury, A. R., Raje, S. M., Chakaravarthy, V. T., Sabharwal, Y., & Verma, A. (2020). Power-bert: accelerating bert inference via progressive word-vector elimination. International Conference on Machine Learning.

Greshake et al., 2023

Greshake, K., Abdelnabi, S., Mishra, S., Endres, C., Holz, T., & Fritz, M. (2023). More than you've asked for: a comprehensive analysis of novel prompt injection threats to application-integrated large language models. arXiv e-prints, pp. arXiv–2302.

Gretton et al., 2012

Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., & Smola, A. (2012). A kernel two-sample test. The Journal of Machine Learning Research, 13(1), 723–773.

Grosse et al., 2017

Grosse, K., Manoharan, P., Papernot, N., Backes, M., & McDaniel, P. (2017). On the (statistical) detection of adversarial examples.

Gu et al., 2022

Gu, J., Tresp, V., & Qin, Y. (2022). Are vision transformers robust to patch perturbations? European Conference on Computer Vision.

Gu et al., 2017

Gu, T., Dolan-Gavitt, B., & Garg, S. (2017). Badnets: Identifying vulnerabilities in the machine learning model supply chain.

Gu et al., 2023

Gu, X., Du, C., Pang, T., Li, C., Lin, M., & Wang, Y. (2023). On memorization in diffusion models. arXiv preprint arXiv:2310.02664.

Guan et al., 2022

Guan, Y., Li, Z., Leng, J., Lin, Z., & Guo, M. (2022). Transkimmer: transformer learns to layer-wise skim. Annual Meeting of the Association for Computational Linguistics (pp. 7275–7286).

Guan et al., 2024

Guan, Z., Hu, M., Li, S., & Vullikanti, A. (2024). Ufid: a unified framework for input-level backdoor detection on diffusion models. arXiv preprint arXiv:2404.01101.

Guo et al., 2017

Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On calibration of modern neural networks. International Conference on Machine Learning.

Guo et al., 2023

Guo, J., Li, J., Li, D., Tiong, A. M. H., Li, B., Tao, D., & Hoi, S. (2023). From images to textual prompts: zero-shot visual question answering with frozen large language models. IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10867–10877).

Guo et al., 2019

Guo, W., Wang, L., Xing, X., Du, M., & Song, D. (2019). Tabor: A highly accurate approach to inspecting and restoring trojan backdoors in ai systems.

Gupta & Rahtu, 2019

Gupta, P., & Rahtu, E. (2019). Ciidefence: defeating adversarial attacks by fusing class-specific image inpainting and image denoising. IEEE International Conference on Computer Vision (pp. 6708–6717).

Hampel, 1974

Hampel, F. R. (1974). The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69(346), 383–393.

Hao et al., 2024

Hao, Y., Yang, W., & Lin, Y. (2024). Exploring backdoor vulnerabilities of chat models. arXiv preprint arXiv:2404.02406.

He et al., 2022a

He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked autoencoders are scalable vision learners. IEEE/CVF Conference on Computer Vision and Pattern Recognition.

He et al., 2016

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. IEEE/CVF Conference on Computer Vision and Pattern Recognition.

He et al., 2022b

He, X., Xu, Q., Lyu, L., Wu, F., & Wang, C. (2022). Protecting intellectual property of language generation apis with lexical watermark. AAAI Conference on Artificial Intelligence.

He et al., 2019

He, Z., Zhang, T., & Lee, R. B. (2019). Model inversion attacks against collaborative inference. Annual Computer Security Applications Conference.

Hendrycks & Gimpel, 2016a

Hendrycks, D., & Gimpel, K. (2016). Early methods for detecting adversarial images.

Hendrycks & Gimpel, 2016b

Hendrycks, D., & Gimpel, K. (2016). Gaussian error linear units (gelus).

Hintersdorf et al., 2024

Hintersdorf, D., Struppek, L., Kersting, K., Dziedzic, A., & Boenisch, F. (2024). Finding nemo: localizing neurons responsible for memorization in diffusion models. arXiv preprint arXiv:2406.02366.

Hinton et al., 2015

Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.

Ho et al., 2020

Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems.

Houle, 2017

Houle, M. E. (2017). Local intrinsic dimensionality I: an extreme-value-theoretic foundation for similarity applications. International Conference on Similarity Search and Applications.

Hu et al., 2022

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., … Chen, W. (2022). LoRA: low-rank adaptation of large language models. International Conference on Learning Representations.

Hu et al., 2019

Hu, S., Yu, T., Guo, C., Chao, W.-L., & Weinberger, K. Q. (2019). A new defense against adversarial images: turning a weakness into a strength. Advances in Neural Information Processing Systems.

Hua et al., 2024

Hua, A., Gu, J., Xue, Z., Carlini, N., Wong, E., & Qin, Y. (2024). Initialization matters for adversarial transfer learning. IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 24831–24840).

Huang et al., 2023a

Huang, B., Wang, Z., Yang, J., Ai, J., Zou, Q., Wang, Q., & Ye, D. (2023). Implicit identity driven deepfake face swapping detection. IEEE/CVF Conference on Computer Vision and Pattern Recognition.

Huang et al., 2023b

Huang, H., Ma, X., Erfani, S., & Bailey, J. (2023). Distilling cognitive backdoor patterns within an image. International Conference on Learning Representations.

Huang et al., 2020

Huang, W. R., Geiping, J., Fowl, L., Taylor, G., & Goldstein, T. (2020). Metapoison: practical general-purpose clean-label data poisoning. Advances in Neural Information Processing Systems (pp. 12080–12091).

Ilyas et al., 2018

Ilyas, A., Engstrom, L., Athalye, A., & Lin, J. (2018). Black-box adversarial attacks with limited queries and information. International Conference on Machine Learning (pp. 2137–2146).

Ishihara, 2023

Ishihara, S. (2023). Training data extraction from pre-trained language models: a survey. arXiv preprint arXiv:2305.16157.

Izmailov et al., 2018

Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., & Wilson, A. G. (2018). Averaging weights leads to wider optima and better generalization. Conference on Uncertainty in Artificial Intelligence.

Jia et al., 2021

Jia, H., Choquette-Choo, C. A., Chandrasekaran, V., & Papernot, N. (2021). Entangled watermarks as a defense against model extraction. USENIX Security Symposium.

Jia et al., 2022a

Jia, J., Liu, Y., & Gong, N. Z. (2022). Badencoder: backdoor attacks to pre-trained encoders in self-supervised learning. IEEE Symposium on Security and Privacy.

Jia et al., 2022b

Jia, M., Tang, L., Chen, B.-C., Cardie, C., Belongie, S., Hariharan, B., & Lim, S.-N. (2022). Visual prompt tuning. European Conference on Computer Vision.

Jia et al., 2019

Jia, X., Wei, X., Cao, X., & Foroosh, H. (2019). Comdefend: an efficient image compression model to defend adversarial examples. IEEE Conference on Computer Vision and Pattern Recognition (pp. 6084–6092).

Jiang et al., 2023

Jiang, Y., Chan, C., Chen, M., & Wang, W. (2023). Lion: adversarial distillation of proprietary large language models. Conference on Empirical Methods in Natural Language Processing (pp. 3134–3154).

Jin et al., 2019

Jin, G., Shen, S., Zhang, D., Dai, F., & Zhang, Y. (2019). Ape-gan: adversarial perturbation elimination with gan. IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 3842–3846).

Kang et al., 2023

Kang, M., Zhu, J.-Y., Zhang, R., Park, J., Shechtman, E., Paris, S., & Park, T. (2023). Scaling up gans for text-to-image synthesis. IEEE/CVF Conference on Computer Vision and Pattern Recognition.

Kearns & Li, 1993

Kearns, M., & Li, M. (1993). Learning in the presence of malicious errors. SIAM Journal on Computing, 22(4), 807–837.

Kim & Cho, 2021

Kim, G., & Cho, K. (2021). Length-adaptive transformer: train once with length drop, use anytime with search. Joint Conference of Annual Meeting of the Association for Computational Linguistics and International Joint Conference on Natural Language Processing.

Kim et al., 2022

Kim, S., Shen, S., Thorsley, D., Gholami, A., Kwon, W., Hassoun, J., & Keutzer, K. (2022). Learned token pruning for transformers. ACM SIGKDD Conference on Knowledge Discovery and Data Mining (pp. 784–794).

Kirchenbauer et al., 2023

Kirchenbauer, J., Geiping, J., Wen, Y., Katz, J., Miers, I., & Goldstein, T. (2023). A watermark for large language models. International Conference on Machine Learning.

Koh et al., 2022

Koh, P. W., Steinhardt, J., & Liang, P. (2022). Stronger data poisoning attacks break data sanitization defenses. Machine Learning, 111(1), 1–47.

Kruger et al., 2004

Kruger, L. E., Wohler, C., Wurz-Wessel, A., & Stein, F. (2004). In-factory calibration of multiocular camera systems. Optical Metrology in Production Engineering.

Kumar et al., 2020

Kumar, R. S. S., Nyström, M., Lambert, J., Marshall, A., Goertzel, M., Comissoneru, A., … Xia, S. (2020). Adversarial machine learning-industry perspectives. IEEE Security and Privacy Workshops (pp. 69–75).

Kurakin et al., 2016

Kurakin, A., Goodfellow, I., & Bengio, S. (2016). Adversarial machine learning at scale.

Kurakin et al., 2018

Kurakin, A., Goodfellow, I. J., & Bengio, S. (2018). Adversarial examples in the physical world. Artificial Intelligence Safety and Security (pp. 99–112). Chapman and Hall/CRC.

Le Merrer et al., 2020

Le Merrer, E., Perez, P., & Trédan, G. (2020). Adversarial frontier stitching for remote neural network watermarking. Neural Computing and Applications, 32(13), 9233–9244.

Lee et al., 2022

Lee, K., Ippolito, D., Nystrom, A., Zhang, C., Eck, D., Callison-Burch, C., & Carlini, N. (2022). Deduplicating training data makes language models better. Annual Meeting of the Association for Computational Linguistics.

Lee et al., 2018

Lee, K., Lee, K., Lee, H., & Shin, J. (2018). A simple unified framework for detecting out-of-distribution samples and adversarial attacks. Advances in Neural Information Processing Systems.

Li et al., 2024a

Li, H., Chen, Y., Zheng, Z., Hu, Q., Chan, C., Liu, H., & Song, Y. (2024). Backdoor removal for generative large language models. arXiv preprint arXiv:2405.07667.

Li et al., 2023a

Li, J., Li, D., Savarese, S., & Hoi, S. (2023). Blip-2: bootstrapping language-image pre-training with frozen image encoders and large language models. International Conference on Machine Learning (pp. 19730–19742).

Li et al., 2022

Li, J., Li, D., Xiong, C., & Hoi, S. (2022). Blip: bootstrapping language-image pre-training for unified vision-language understanding and generation. International Conference on Machine Learning (pp. 12888–12900).

Li et al., 2021a

Li, J., Selvaraju, R., Gotmare, A., Joty, S., Xiong, C., & Hoi, S. C. H. (2021). Align before fuse: vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems, 34, 9694–9705.

Li et al., 2020a

Li, L., Bao, J., Zhang, T., Yang, H., Chen, D., Wen, F., & Guo, B. (2020). Face x-ray for more general face forgery detection. IEEE/CVF Conference on Computer Vision and Pattern Recognition.

Li et al., 2020b

Li, L., Ma, R., Guo, Q., Xue, X., & Qiu, X. (2020). Bert-attack: adversarial attack against bert using bert. Conference on Empirical Methods in Natural Language Processing (pp. 6193–6202).

Li et al., 2024b

Li, Q., Wang, W., Xu, C., Sun, Z., & Yang, M.-H. (2024). Learning disentangled representation for one-shot progressive face swapping. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Li et al., 2020c

Li, S., Cheng, Y., Wang, W., Liu, Y., & Chen, T. (2020). Learning to detect malicious clients for robust federated learning. arXiv preprint arXiv:2002.00211.

Li et al., 2024c

Li, W., Chen, P.-Y., Liu, S., & Wang, R. (2024). Psbd: prediction shift uncertainty unlocks backdoor detection. arXiv preprint arXiv:2406.05826.

Li et al., 2021b

Li, Y., Yang, Z., Wang, Y., & Xu, C. (2021). Neural architecture dilation for adversarial robustness. Advances in Neural Information Processing Systems (pp. 29578–29589).

Li et al., 2021c

Li, Y., Lyu, X., Koren, N., Lyu, L., Li, B., & Ma, X. (2021). Anti-backdoor learning: training clean models on poisoned data. Advances in Neural Information Processing Systems (pp. 14900–14912).

Li et al., 2021d

Li, Y., Lyu, X., Koren, N., Lyu, L., Li, B., & Ma, X. (2021). Anti-backdoor learning: training clean models on poisoned data. Advances in Neural Information Processing Systems.

Li et al., 2023b

Li, Y., Lyu, X., Ma, X., Koren, N., Lyu, L., Li, B., & Jiang, Y.-G. (2023). Reconstructive neuron pruning for backdoor defense. International Conference on Machine Learning.

Li et al., 2024d

Li, Y., Ma, X., He, J., Huang, H., & Jiang, Y.-G. (2024). Multi-trigger backdoor attacks: more triggers, more threats. arXiv preprint arXiv:2401.15295.

Li et al., 2021e

Li, Y., Li, Y., Wu, B., Li, L., He, R., & Lyu, S. (2021). Invisible backdoor attack with sample-specific triggers. IEEE International Conference on Computer Vision (pp. 16463–16472).

Li et al., 2024e

Li, Z., Wang, C., Ma, P., Liu, C., Wang, S., Wu, D., … Liu, Y. (2024). On extracting specialized code abilities from large language models: a feasibility study. IEEE/ACM International Conference on Software Engineering.

Liang et al., 2024

Liang, S., Zhu, M., Liu, A., Wu, B., Cao, X., & Chang, E.-C. (2024). Badclip: dual-embedding guided backdoor attack on multimodal contrastive learning. IEEE/CVF Conference on Computer Vision and Pattern Recognition.

Liao et al., 2018

Liao, F., Liang, M., Dong, Y., Pang, T., Hu, X., & Zhu, J. (2018). Defense against adversarial attacks using high-level representation guided denoiser. IEEE Conference on Computer Vision and Pattern Recognition (pp. 1778–1787).

Lin et al., 2014

Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., … Zitnick, C. L. (2014). Microsoft coco: common objects in context. European Conference on Computer Vision.

Liu et al., 2023

Liu, H., Li, C., Wu, Q., & Lee, Y. J. (2023). Visual instruction tuning. Advances in Neural Information Processing Systems.

Liu et al., 2024

Liu, H., Reiter, M. K., & Gong, N. Z. (2024). Mudjacking: patching backdoor vulnerabilities in foundation models. arXiv preprint arXiv:2402.14977.

Liu et al., 2018a

Liu, K., Dolan-Gavitt, B., & Garg, S. (2018). Fine-pruning: defending against backdooring attacks on deep neural networks. International Symposium on Research in Attacks, Intrusions, and Defenses (pp. 273–294).

Liu et al., 2020

Liu, X., Cheng, H., He, P., Chen, W., Wang, Y., Poon, H., & Gao, J. (2020). Adversarial training for large neural language models. arXiv preprint arXiv:2004.08994.

Liu et al., 2017

Liu, Y., Chen, X., Liu, C., & Song, D. (2017). Delving into transferable adversarial examples and black-box attacks.

Liu et al., 2018b

Liu, Y., Ma, S., Aafer, Y., Lee, W.-C., Zhai, J., Wang, W., & Zhang, X. (2018). Trojaning attack on neural networks. Network and Distributed Systems Security Symposium.

Lorenz et al., 2022

Lorenz, P., Keuper, M., & Keuper, J. (2022). Unfolding local growth rate estimates for (almost) perfect adversarial detection. International Conference on Computer Vision Theory and Applications.

Lu et al., 2023

Lu, D., Wang, Z., Wang, T., Guan, W., Gao, H., & Zheng, F. (2023). Set-level guidance attack: boosting adversarial transferability of vision-language pre-training models. IEEE/CVF International Conference on Computer Vision (pp. 102–111).

Lu et al., 2022

Lu, P., Mishra, S., Xia, T., Qiu, L., Chang, K.-W., Zhu, S.-C., … Kalyan, A. (2022). Learn to explain: multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems.

Lukas et al., 2021

Lukas, N., Zhang, Y., & Kerschbaum, F. (2021). Deep neural network fingerprinting by conferrable adversarial examples.

Luo et al., 2024

Luo, H., Gu, J., Liu, F., & Torr, P. (2024). An image is worth 1000 lies: adversarial transferability across prompts on vision-language models. arXiv preprint arXiv:2403.09766.

Lv et al., 2021

Lv, P., Ma, H., Zhou, J., Liang, R., Chen, K., Zhang, S., & Yang, Y. (2021). Dbia: data-free backdoor injection attack against transformer networks. arXiv preprint arXiv:2111.11870.

Ma et al., 2023

Ma, H., Qiu, H., Gao, Y., Zhang, Z., Abuadbba, A., Xue, M., … Abbott, D. (2023). Quantization backdoors to deep learning commercial frameworks. IEEE Transactions on Dependable and Secure Computing.

Ma et al., 2024

Ma, J., Cao, A., Xiao, Z., Zhang, J., Ye, C., & Zhao, J. (2024). Jailbreaking prompt attack: a controllable adversarial attack against diffusion models. arXiv preprint arXiv:2404.02928.

Ma et al., 2018

Ma, X., Li, B., Wang, Y., Erfani, S. M., Wijewickrema, S., Schoenebeck, G., … Bailey, J. (2018). Characterizing adversarial subspaces using local intrinsic dimensionality. International Conference on Learning Representations.

Madry et al., 2018

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., & Vladu, A. (2018). Towards deep learning models resistant to adversarial attacks. International Conference on Learning Representations.

Mahalanobis, 1936

Mahalanobis, P. C. (1936). On the generalized distance in statistics. Proceedings of the National Institute of Sciences, 2, 49–55.

Mahendran & Vedaldi, 2015

Mahendran, A., & Vedaldi, A. (2015). Understanding deep image representations by inverting them. IEEE Conference on Computer Vision and Pattern Recognition.

Mahendran & Vedaldi, 2016

Mahendran, A., & Vedaldi, A. (2016). Visualizing deep convolutional neural networks using natural pre-images. International Journal of Computer Vision, 120, 233–255.

Mahloujifar & Mahmoody, 2017

Mahloujifar, S., & Mahmoody, M. (2017). Blockwise p-tampering attacks on cryptographic primitives, extractors, and learners. Theory of Cryptography Conference (pp. 245–279).

Mahloujifar et al., 2019

Mahloujifar, S., Mahmoody, M., & Mohammed, A. (2019). Universal multi-party poisoning attacks. International Conference on Machine Learning (pp. 4274–4283).

Mahmood et al., 2021

Mahmood, K., Mahmood, R., & Van Dijk, M. (2021). On the robustness of vision transformers to adversarial examples. IEEE International Conference on Computer Vision.

Mao et al., 2023

Mao, C., Geng, S., Yang, J., Wang, X., & Vondrick, C. (2023). Understanding zero-shot adversarial robustness for large-scale models. International Conference on Learning Representations.

Masood et al., 2023

Masood, M., Nawaz, M., Malik, K. M., Javed, A., Irtaza, A., & Malik, H. (2023). Deepfakes generation and detection: state-of-the-art, open challenges, countermeasures, and way forward. Applied Intelligence, 53(4), 3974–4026.

Mattern et al., 2023

Mattern, J., Mireshghallah, F., Jin, Z., Schoelkopf, B., Sachan, M., & Berg-Kirkpatrick, T. (2023). Membership inference attacks against language models via neighbourhood comparison. Annual Meeting of the Association for Computational Linguistics.

McMahan et al., 2017

McMahan, B., Moore, E., Ramage, D., Hampson, S., & y Arcas, B. A. (2017). Communication-efficient learning of deep networks from decentralized data. Artificial Intelligence and Statistics.

Mei & Zhu, 2015

Mei, S., & Zhu, X. (2015). Using machine teaching to identify optimal training-set attacks on machine learners. AAAI Conference on Artificial Intelligence.

Meng & Chen, 2017

Meng, D., & Chen, H. (2017). Magnet: a two-pronged defense against adversarial examples. ACM SIGSAC Conference on Computer and Communications Security (pp. 135–147).

Metzen et al., 2017

Metzen, J. H., Genewein, T., Fischer, V., & Bischoff, B. (2017). On detecting adversarial perturbations. International Conference on Learning Representations.

Micikevicius et al., 2018

Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., … others. (2018). Mixed precision training. International Conference on Learning Representations.

Miyato et al., 2018

Miyato, T., Maeda, S.-i., Koyama, M., & Ishii, S. (2018). Virtual adversarial training: a regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8), 1979–1993.

Mo et al., 2024

Mo, Y., Huang, H., Li, M., Li, A., & Wang, Y. (2024). Terd: a unified framework for safeguarding diffusion models against backdoors. International Conference on Machine Learning.

Moosavi-Dezfooli et al., 2016

Moosavi-Dezfooli, S.-M., Fawzi, A., & Frossard, P. (2016). Deepfool: a simple and accurate method to fool deep neural networks. IEEE Conference on Computer Vision and Pattern Recognition (pp. 2574–2582).

Mordvintsev et al., 2015

Mordvintsev, A., Olah, C., & Tyka, M. (2015). Inceptionism: going deeper into neural networks.

Munoz-Gonzalez et al., 2019

Muñoz-González, L., Pfitzner, B., Russo, M., Carnerero-Cano, J., & Lupu, E. C. (2019). Poisoning attacks with generative adversarial nets.

Nair & Hinton, 2010

Nair, V., & Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. International Conference on Machine Learning.

Naseer et al., 2021

Naseer, M., Ranasinghe, K., Khan, S., Khan, F. S., & Porikli, F. (2021). On improving adversarial transferability of vision transformers. arXiv preprint arXiv:2106.04169.

Naseh et al., 2023

Naseh, A., Roh, J., & Houmansadr, A. (2023). Memory triggers: unveiling memorization in text-to-image generative models through word-level duplication. arXiv preprint arXiv:2312.03692.

Nasr et al., 2023

Nasr, M., Carlini, N., Hayase, J., Jagielski, M., Cooper, A. F., Ippolito, D., … Lee, K. (2023). Scalable extraction of training data from (production) language models. arXiv preprint arXiv:2311.17035.

Nelson et al., 2008

Nelson, B., Barreno, M., Chi, F. J., Joseph, A. D., Rubinstein, B. I., Saini, U., … Xia, K. (2008). Exploiting machine learning to subvert your spam filter. LEET, 8(1), 9.

Nguyen et al., 2017

Nguyen, A., Clune, J., Bengio, Y., Dosovitskiy, A., & Yosinski, J. (2017). Plug & play generative networks: conditional iterative generation of images in latent space. IEEE Conference on Computer Vision and Pattern Recognition.

Nguyen et al., 2016

Nguyen, A., Dosovitskiy, A., Yosinski, J., Brox, T., & Clune, J. (2016). Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. Advances in Neural Information Processing Systems, 29.

Nguyen & Tran, 2020

Nguyen, T. A., & Tran, A. (2020). Input-aware dynamic backdoor attack. Advances in Neural Information Processing Systems (pp. 3454–3464).

Nie et al., 2022

Nie, W., Guo, B., Huang, Y., Xiao, C., Vahdat, A., & Anandkumar, A. (2022). Diffusion models for adversarial purification. International Conference on Machine Learning (pp. 16805–16827).

Nirkin et al., 2019

Nirkin, Y., Keller, Y., & Hassner, T. (2019). FSGAN: subject agnostic face swapping and reenactment. IEEE International Conference on Computer Vision.

Noever & Noever, 2021

Noever, D. A., & Noever, S. E. M. (2021). Reading isn't believing: adversarial attacks on multi-modal neurons. arXiv preprint arXiv:2103.10480.

Oh et al., 2019

Oh, S. J., Schiele, B., & Fritz, M. (2019). Towards reverse-engineering black-box neural networks. Explainable AI: Interpreting, Explaining and Visualizing Deep Learning (pp. 121–144). Springer.

Ooms, 2024

Ooms, J. (2024). cld3: Google's Compact Language Detector 3. R package version 1.6.0. URL: https://docs.ropensci.org/cld3/ https://github.com/ropensci/cld3 https://ropensci.r-universe.dev/cld3

Oord et al., 2018

Oord, A. v. d., Li, Y., & Vinyals, O. (2018). Representation learning with contrastive predictive coding.

OpenAI, 2024

OpenAI (2024). ChatGPT. Accessed: 2024-07-23.

Paperno et al., 2016

Paperno, D., Kruszewski, G., Lazaridou, A., Pham, Q. N., Bernardi, R., Pezzelle, S., … Fernández, R. (2016). The LAMBADA dataset: Word prediction requiring a broad discourse context.

Papernot et al., 2017

Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z. B., & Swami, A. (2017). Practical black-box attacks against machine learning. ACM Asia Conference on Computer and Communications Security (pp. 506–519).

Papernot et al., 2016

Papernot, N., McDaniel, P., Jha, S., Fredrikson, M., Celik, Z. B., & Swami, A. (2016). The limitations of deep learning in adversarial settings. IEEE European Symposium on Security and Privacy (pp. 372–387).

Papineni et al., 2002

Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). Bleu: a method for automatic evaluation of machine translation. Annual Meeting of the Association for Computational Linguistics.

Peters et al., 2018

Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Pinto et al., 2024

Pinto, F., Rauschmayr, N., Tramèr, F., Torr, P., & Tombari, F. (2024). Extracting training data from document-based vqa models. arXiv preprint arXiv:2407.08707.

Prakash et al., 2018

Prakash, A., Moran, N., Garber, S., DiLillo, A., & Storer, J. (2018). Deflecting adversarial attacks with pixel deflection. IEEE Conference on Computer Vision and Pattern Recognition (pp. 8571–8580).

Pruthi et al., 2019

Pruthi, D., Dhingra, B., & Lipton, Z. C. (2019). Combating adversarial misspellings with robust word recognition. Annual Meeting of the Association for Computational Linguistics (pp. 5582–5591).

Qi et al., 2023

Qi, X., Huang, K., Panda, A., Wang, M., & Mittal, P. (2023). Visual adversarial examples jailbreak large language models. arXiv preprint arXiv:2306.13213.

Qian et al., 2020

Qian, Y., Yin, G., Sheng, L., Chen, Z., & Shao, J. (2020). Thinking in frequency: face forgery detection by mining frequency-aware clues. European Conference on Computer Vision.

Qin et al., 2019

Qin, C., Martens, J., Gowal, S., Krishnan, D., Dvijotham, K., Fawzi, A., … Kohli, P. (2019). Adversarial robustness through local linearization. Advances in Neural Information Processing Systems.

Radford et al., 2021

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., … others. (2021). Learning transferable visual models from natural language supervision. International Conference on Machine Learning.

Radford et al., 2018

Radford, A., Narasimhan, K., Salimans, T., Sutskever, I., & others. (2018). Improving language understanding by generative pre-training.

Rafailov et al., 2024

Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., & Finn, C. (2024). Direct preference optimization: your language model is secretly a reward model. Advances in Neural Information Processing Systems.

Ramachandran et al., 2017

Ramachandran, P., Zoph, B., & Le, Q. V. (2017). Searching for activation functions.

Ramesh et al., 2022

Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., & Chen, M. (2022). Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125.

Rebuffi et al., 2021a

Rebuffi, S.-A., Gowal, S., Calian, D. A., Stimberg, F., Wiles, O., & Mann, T. (2021). Fixing data augmentation to improve adversarial robustness.

Rebuffi et al., 2021b

Rebuffi, S.-A., Gowal, S., Calian, D. A., Stimberg, F., Wiles, O., & Mann, T. A. (2021). Data augmentation can improve robustness. Advances in Neural Information Processing Systems.

Rice et al., 2020

Rice, L., Wong, E., & Kolter, Z. (2020). Overfitting in adversarially robust deep learning. International Conference on Machine Learning.

Robey et al., 2023

Robey, A., Wong, E., Hassani, H., & Pappas, G. J. (2023). Smoothllm: defending large language models against jailbreaking attacks. arXiv preprint arXiv:2310.03684.

Rombach et al., 2022

Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. IEEE/CVF Conference on Computer Vision and Pattern Recognition.

Ronneberger et al., 2015

Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: convolutional networks for biomedical image segmentation. International Conference on Medical Image Computing and Computer Assisted Intervention (pp. 234–241).

Roth et al., 2019

Roth, K., Kilcher, Y., & Hofmann, T. (2019). The odds are odd: a statistical test for detecting adversarial examples. International Conference on Machine Learning (pp. 5498–5507).

Saha et al., 2020

Saha, A., Subramanya, A., & Pirsiavash, H. (2020). Hidden trigger backdoor attacks. AAAI Conference on Artificial Intelligence.

Sakaguchi et al., 2017

Sakaguchi, K., Duh, K., Post, M., & Van Durme, B. (2017). Robsut wrod reocginiton via semi-character recurrent neural network. AAAI Conference on Artificial Intelligence.

Samangouei et al., 2018

Samangouei, P., Kabkab, M., & Chellappa, R. (2018). Defense-gan: protecting classifiers against adversarial attacks using generative models. International Conference on Learning Representations.

Schlarmann et al., 2024

Schlarmann, C., Singh, N. D., Croce, F., & Hein, M. (2024). Robust clip: unsupervised adversarial fine-tuning of vision embeddings for robust large vision-language models. International Conference on Machine Learning.

Schubert et al., 2014

Schubert, E., Zimek, A., & Kriegel, H.-P. (2014). Local outlier detection reconsidered: a generalized view on locality with applications to spatial, video, and network outlier detection. Data Mining and Knowledge Discovery, 28, 190–237.

Schuhmann et al., 2022

Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C. W., Wightman, R., Cherti, M., … Jitsev, J. (2022). LAION-5b: an open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems.

Schulman et al., 2017

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.

Selvaraju et al., 2017

Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., & Batra, D. (2017). Grad-cam: visual explanations from deep networks via gradient-based localization. IEEE International Conference on Computer Vision.

Sennrich et al., 2016

Sennrich, R., Haddow, B., & Birch, A. (2016). Neural machine translation of rare words with subword units. Annual Meeting of the Association for Computational Linguistics.

Sha et al., 2023

Sha, Z., He, X., Yu, N., Backes, M., & Zhang, Y. (2023). Can't steal? cont-steal! contrastive stealing attacks against image encoders. IEEE/CVF Conference on Computer Vision and Pattern Recognition.

Shafahi et al., 2018

Shafahi, A., Huang, W. R., Najibi, M., Suciu, O., Studer, C., Dumitras, T., & Goldstein, T. (2018). Poison frogs! targeted clean-label poisoning attacks on neural networks. Advances in Neural Information Processing Systems.

Shafahi et al., 2019

Shafahi, A., Najibi, M., Ghiasi, M. A., Xu, Z., Dickerson, J., Studer, C., … Goldstein, T. (2019). Adversarial training for free! Advances in Neural Information Processing Systems.

Shao et al., 2022

Shao, R., Shi, Z., Yi, J., Chen, P.-Y., & Hsieh, C.-J. (2022). On the adversarial robustness of vision transformers. Transactions on Machine Learning Research.

Sharif et al., 2016

Sharif, M., Bhagavatula, S., Bauer, L., & Reiter, M. K. (2016). Accessorize to a crime: real and stealthy attacks on state-of-the-art face recognition. ACM SIGSAC Conference on Computer and Communications Security.

Sharma et al., 2018

Sharma, P., Ding, N., Goodman, S., & Soricut, R. (2018). Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. Annual Meeting of the Association for Computational Linguistics.

Shayegani et al., 2023

Shayegani, E., Dong, Y., & Abu-Ghazaleh, N. (2023). Plug and pray: exploiting off-the-shelf components of multi-modal models. arXiv preprint arXiv:2307.14539.

Shen et al., 2016

Shen, S., Tople, S., & Saxena, P. (2016). Auror: defending against poisoning attacks in collaborative deep learning systems. Conference on Computer Security Applications.

Shen & Sanghavi, 2019

Shen, Y., & Sanghavi, S. (2019). Learning with bad training data via iterative trimmed loss minimization. International Conference on Machine Learning (pp. 5739–5748).

Shi et al., 2022

Shi, Y., Han, Y., Tan, Y.-a., & Kuang, X. (2022). Decision-based black-box attack against vision transformers via patch-wise adversarial removal. Advances in Neural Information Processing Systems.

Shin et al., 2020

Shin, T., Razeghi, Y., Logan IV, R. L., Wallace, E., & Singh, S. (2020). Autoprompt: eliciting knowledge from language models with automatically generated prompts. Conference on Empirical Methods in Natural Language Processing.

Shokri et al., 2017

Shokri, R., Stronati, M., Song, C., & Shmatikov, V. (2017). Membership inference attacks against machine learning models. IEEE Symposium on Security and Privacy.

Smith & Topin, 2019

Smith, L. N., & Topin, N. (2019). Super-convergence: very fast training of residual networks using large learning rates. Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications (pp. 369–386).

Smith, 2007

Smith, R. (2007). An overview of the tesseract ocr engine. International Conference on Document Analysis and Recognition.

Somepalli et al., 2022

Somepalli, G., Singla, V., Goldblum, M., Geiping, J., & Goldstein, T. (2022). Diffusion art or digital forgery? Investigating data replication in diffusion models. arXiv preprint arXiv:2212.03860.

Somepalli et al., 2023

Somepalli, G., Singla, V., Goldblum, M., Geiping, J., & Goldstein, T. (2023). Understanding data replication in diffusion models. International Conference on Machine Learning Workshop.

Song et al., 2020

Song, J., Meng, C., & Ermon, S. (2020). Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502.

Song et al., 2013

Song, S., Chaudhuri, K., & Sarwate, A. D. (2013). Stochastic gradient descent with differentially private updates. IEEE Global Conference on Signal and Information Processing.

Sorokin & Forsyth, 2008

Sorokin, A., & Forsyth, D. (2008). Utility data annotation with amazon mechanical turk. IEEE/CVF Conference on Computer Vision and Pattern Recognition.

Srivastava et al., 2014

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929–1958.

Subramanya et al., 2024

Subramanya, A., Koohpayegani, S. A., Saha, A., Tejankar, A., & Pirsiavash, H. (2024). A closer look at robustness of vision transformers to backdoor attacks. IEEE/CVF Winter Conference on Applications of Computer Vision (pp. 3874–3883).

Subramanya et al., 2022

Subramanya, A., Saha, A., Koohpayegani, S. A., Tejankar, A., & Pirsiavash, H. (2022). Backdoor attacks on vision transformers. arXiv preprint arXiv:2206.08477.

Sun et al., 2023

Sun, X., Li, X., Meng, Y., Ao, X., Lyu, L., Li, J., & Zhang, T. (2023). Defending against backdoor attacks in natural language generation. AAAI Conference on Artificial Intelligence.

Sun et al., 2019

Sun, Z., Kairouz, P., Suresh, A. T., & McMahan, H. B. (2019). Can you really backdoor federated learning?

Sur et al., 2023

Sur, I., Sikka, K., Walmer, M., Koneripalli, K., Roy, A., Lin, X., … Jha, S. (2023). Tijo: trigger inversion with joint optimization for defending multimodal backdoored models. IEEE/CVF International Conference on Computer Vision.

Szegedy et al., 2014

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., & Fergus, R. (2014). Intriguing properties of neural networks. International Conference on Learning Representations.

Szyller et al., 2021

Szyller, S., Atli, B. G., Marchal, S., & Asokan, N. (2021). Dawn: dynamic adversarial watermarking of neural networks. ACM International Conference on Multimedia.

Tan & Le, 2019

Tan, M., & Le, Q. (2019). Efficientnet: rethinking model scaling for convolutional neural networks. International Conference on Machine Learning (pp. 6105–6114).

Tang et al., 2020

Tang, R., Du, M., Liu, N., Yang, F., & Hu, X. (2020). An embarrassingly simple approach for trojan attack in deep neural networks. ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 218–228).

Taori et al., 2023

Taori, R., Gulrajani, I., Zhang, T., Dubois, Y., Li, X., Guestrin, C., … Hashimoto, T. B. (2023). Alpaca: a strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm.stanford.edu/2023/03/13/alpaca.html.

Tejankar et al., 2023

Tejankar, A., Sanjabi, M., Wang, Q., Wang, S., Firooz, H., Pirsiavash, H., & Tan, L. (2023). Defending against patch-based backdoor attacks on self-supervised learning. IEEE/CVF Conference on Computer Vision and Pattern Recognition.

Thies et al., 2016

Thies, J., Zollhöfer, M., Stamminger, M., Theobalt, C., & Nießner, M. (2016). Face2face: real-time face capture and reenactment of rgb videos. IEEE/CVF Conference on Computer Vision and Pattern Recognition.

Tian et al., 2018

Tian, S., Yang, G., & Cai, Y. (2018). Detecting adversarial examples through image transformation. AAAI Conference on Artificial Intelligence.

Touvron et al., 2023

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., … others. (2023). Llama: open and efficient foundation language models. arXiv preprint arXiv:2302.13971.

Tramer et al., 2020

Tramer, F., Carlini, N., Brendel, W., & Madry, A. (2020). On adaptive attacks to adversarial example defenses. Advances in Neural Information Processing Systems (pp. 1633–1645).

Tramer et al., 2018

Tramèr, F., Kurakin, A., Papernot, N., Goodfellow, I., Boneh, D., & McDaniel, P. (2018). Ensemble adversarial training: attacks and defenses. International Conference on Learning Representations.

Tramer et al., 2016

Tramèr, F., Zhang, F., Juels, A., Reiter, M. K., & Ristenpart, T. (2016). Stealing machine learning models via prediction APIs. USENIX Security Symposium (pp. 601–618).

Tran et al., 2018

Tran, B., Li, J., & Madry, A. (2018). Spectral signatures in backdoor attacks. Advances in Neural Information Processing Systems.

Tu et al., 2019

Tu, C.-C., Ting, P., Chen, P.-Y., Liu, S., Zhang, H., Yi, J., … Cheng, S.-M. (2019). Autozoom: autoencoder-based zeroth order optimization method for attacking black-box neural networks. AAAI Conference on Artificial Intelligence (pp. 742–749).

Turner et al., 2018

Turner, A., Tsipras, D., & Madry, A. (2018). Clean-label backdoor attacks.

Uchida et al., 2017

Uchida, Y., Nagai, Y., Sakazawa, S., & Satoh, S. (2017). Embedding watermarks into deep neural networks. ACM International Conference on Multimedia Retrieval.

Vaswani et al., 2017

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems.

Wang & Gong, 2018

Wang, B., & Gong, N. Z. (2018). Stealing hyperparameters in machine learning. IEEE Symposium on Security and Privacy (pp. 36–52).

Wang et al., 2019a

Wang, B., Yao, Y., Shan, S., Li, H., Viswanath, B., Zheng, H., & Zhao, B. Y. (2019). Neural cleanse: identifying and mitigating backdoor attacks in neural networks. IEEE Symposium on Security and Privacy (pp. 707–723).

Wang et al., 2017

Wang, D., Ye, M., & Xu, J. (2017). Differentially private empirical risk minimization revisited: faster and more general. Advances in Neural Information Processing Systems.

Wang et al., 2020a

Wang, H., Sreenivasan, K., Rajput, S., Vishwakarma, H., Agarwal, S., Sohn, J.-y., … Papailiopoulos, D. (2020). Attack of the tails: yes, you really can backdoor federated learning. Advances in Neural Information Processing Systems (pp. 16070–16084).

Wang et al., 2024a

Wang, R., Ma, X., Zhou, H., Ji, C., Ye, G., & Jiang, Y.-G. (2024). White-box multimodal jailbreaks against large vision-language models. arXiv preprint arXiv:2405.17894.

Wang et al., 2022

Wang, S., Nepal, S., Abuadbba, A., Rudolph, C., & Grobler, M. (2022). Adversarial detection by latent style transformations. IEEE Transactions on Information Forensics and Security, 17, 1099–1114.

Wang et al., 2020b

Wang, S., Nepal, S., Rudolph, C., Grobler, M., Chen, S., & Chen, T. (2020). Backdoor attacks against transfer learning with pre-trained deep learning models. IEEE Transactions on Services Computing.

Wang et al., 2023a

Wang, X., Ji, Z., Ma, P., Li, Z., & Wang, S. (2023). Instructta: instruction-tuned targeted attack for large vision-language models. arXiv preprint arXiv:2312.01886.

Wang et al., 2019b

Wang, Y., Ma, X., Bailey, J., Yi, J., Zhou, B., & Gu, Q. (2019). On the convergence and robustness of adversarial training. International Conference on Machine Learning (pp. 6586–6595).

Wang et al., 2019c

Wang, Y., Zou, D., Yi, J., Bailey, J., Ma, X., & Gu, Q. (2019). Improving adversarial robustness requires revisiting misclassified examples. International Conference on Learning Representations.

Wang et al., 2023b

Wang, Z., Pang, T., Du, C., Lin, M., Liu, W., & Yan, S. (2023). Better diffusion models further improve adversarial training. International Conference on Machine Learning.

Wang et al., 2024b

Wang, Z., Li, X., Zhu, H., & Xie, C. (2024). Revisiting adversarial training at scale. arXiv preprint arXiv:2401.04727.

Wang et al., 2004

Wang, Z., Bovik, A. C., Sheikh, H. R., & Simoncelli, E. P. (2004). Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4), 600–612.

Webster, 2023

Webster, R. (2023). A reproducible extraction of training images from diffusion models. arXiv preprint arXiv:2305.08694.

Wei et al., 2021

Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester, B., … Le, Q. V. (2021). Finetuned language models are zero-shot learners. International Conference on Machine Learning.

Wei et al., 2022a

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., … others. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems.

Wei & Zou, 2019

Wei, J., & Zou, K. (2019). Eda: easy data augmentation techniques for boosting performance on text classification tasks. Conference on Empirical Methods in Natural Language Processing and International Joint Conference on Natural Language Processing.

Wei et al., 2022b

Wei, Z., Chen, J., Goldblum, M., Wu, Z., Goldstein, T., & Jiang, Y.-G. (2022). Towards transferable adversarial attacks on vision transformers. AAAI Conference on Artificial Intelligence (pp. 2668–2676).

Wen et al., 2024

Wen, Y., Liu, Y., Chen, C., & Lyu, L. (2024). Detecting, explaining, and mitigating memorization in diffusion models. International Conference on Learning Representations.

Williams & Peng, 1990

Williams, R. J., & Peng, J. (1990). An efficient gradient-based algorithm for on-line training of recurrent network trajectories. Neural Computation, 2, 490–501.

Williams & Zipser, 2013

Williams, R. J., & Zipser, D. (2013). Gradient-based learning algorithms for recurrent networks and their computational complexity. Backpropagation (pp. 433–486). Psychology Press.

Wong et al., 2020

Wong, E., Rice, L., & Kolter, J. Z. (2020). Fast is better than free: revisiting adversarial training. International Conference on Learning Representations.

Wu et al., 2023a

Wu, C., Yin, S., Qi, W., Wang, X., Tang, Z., & Duan, N. (2023). Visual chatgpt: talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671.

Wu & Wang, 2021

Wu, D., & Wang, Y. (2021). Adversarial neuron pruning purifies backdoored deep models. Advances in Neural Information Processing Systems (pp. 16913–16925).

Wu et al., 2020a

Wu, D., Wang, Y., Xia, S.-T., Bailey, J., & Ma, X. (2020). Skip connections matter: on the transferability of adversarial examples generated with resnets. International Conference on Learning Representations.

Wu et al., 2020b

Wu, D., Xia, S.-T., & Wang, Y. (2020). Adversarial weight perturbation helps robust generalization. Advances in Neural Information Processing Systems (pp. 2958–2969).

Wu et al., 2023b

Wu, S., Ma, C., Wei, K., Xu, X., Ding, M., Qian, Y., & Xiang, T. (2023). Refine, discriminate and align: stealing encoders via sample-wise prototypes and multi-relational extraction. arXiv preprint arXiv:2312.00855.

Xi et al., 2024

Xi, Z., Du, T., Li, C., Pang, R., Ji, S., Chen, J., … Wang, T. (2024). Defending pre-trained language models as few-shot learners against backdoor attacks. Advances in Neural Information Processing Systems.

Xiang et al., 2024

Xiang, Z., Jiang, F., Xiong, Z., Ramasubramanian, B., Poovendran, R., & Li, B. (2024). Badchain: backdoor chain-of-thought prompting for large language models. arXiv preprint arXiv:2401.12242.

Xiao et al., 2018

Xiao, C., Li, B., Zhu, J. Y., He, W., Liu, M., & Song, D. (2018). Generating adversarial examples with adversarial networks. International Joint Conference on Artificial Intelligence (pp. 3905–3911).

Xie et al., 2019a

Xie, C., Huang, K., Chen, P.-Y., & Li, B. (2019). Dba: distributed backdoor attacks against federated learning. International Conference on Learning Representations.

Xie et al., 2020

Xie, C., Tan, M., Gong, B., Yuille, A., & Le, Q. V. (2020). Smooth adversarial training. arXiv preprint arXiv:2006.14536.

Xie et al., 2018

Xie, C., Wang, J., Zhang, Z., Ren, Z., & Yuille, A. (2018). Mitigating adversarial effects through randomization. International Conference on Learning Representations.

Xie et al., 2019b

Xie, C., Wu, Y., Maaten, L. v. d., Yuille, A. L., & He, K. (2019). Feature denoising for improving adversarial robustness. IEEE Conference on Computer Vision and Pattern Recognition (pp. 501–509).

Xie et al., 2019c

Xie, C., Zhang, Z., Zhou, Y., Bai, S., Wang, J., Ren, Z., & Yuille, A. L. (2019). Improving transferability of adversarial examples with input diversity. IEEE Conference on Computer Vision and Pattern Recognition (pp. 2730–2739).

Xu et al., 2020

Xu, K., Zhang, G., Liu, S., Fan, Q., Sun, M., Chen, H., … Lin, X. (2020). Adversarial t-shirt! evading person detectors in a physical world. European Conference on Computer Vision (pp. 665–681).

Xu et al., 2018

Xu, W., Evans, D., & Qi, Y. (2018). Feature squeezing: detecting adversarial examples in deep neural networks. Network and Distributed Systems Security Symposium.

Xu et al., 2023

Xu, X., Zhang, J., & Kankanhalli, M. (2023). Autolora: a parameter-free automated robust fine-tuning framework. arXiv preprint arXiv:2310.01818.

Yan et al., 2024

Yan, J., Yadav, V., Li, S., Chen, L., Tang, Z., Wang, H., … Jin, H. (2024). Backdooring instruction-tuned large language models with virtual prompt injection. Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Yang et al., 2017

Yang, C., Wu, Q., Li, H., & Chen, Y. (2017). Generative poisoning attack method against neural networks. arXiv preprint arXiv:1703.01340.

Yang et al., 2020

Yang, H., Zhang, J., Dong, H., Inkawhich, N., Gardner, A., Touchet, A., … Li, H. (2020). Dverge: diversifying vulnerabilities for enhanced robust generation of ensembles. Advances in Neural Information Processing Systems (pp. 5505–5515).

Yang et al., 2019a

Yang, Q., Liu, Y., Chen, T., & Tong, Y. (2019). Federated machine learning: concept and applications. ACM Transactions on Intelligent Systems and Technology, 10, 1–19.

Yang et al., 2023a

Yang, W., Gao, J., & Mirzasoleiman, B. (2023). Better safe than sorry: pre-training clip against targeted data poisoning and backdoor attacks. arXiv preprint arXiv:2310.05862.

Yang et al., 2023b

Yang, W., Gao, J., & Mirzasoleiman, B. (2023). Robust contrastive language-image pretraining against data poisoning and backdoor attacks. Advances in Neural Information Processing Systems.

Yang et al., 2023c

Yang, Y., Gao, R., Wang, X., Xu, N., & Xu, Q. (2023). Mma-diffusion: multimodal attack on diffusion models. arXiv preprint arXiv:2311.17516.

Yang et al., 2022

Yang, Y., Liu, T. Y., & Mirzasoleiman, B. (2022). Not all poisons are created equal: robust training against data poisoning. International Conference on Machine Learning (pp. 25154–25165).

Yang et al., 2019b

Yang, Z., Chang, E.-C., & Liang, Z. (2019). Adversarial neural network inversion via auxiliary knowledge alignment. arXiv preprint arXiv:1902.08552.

Yang et al., 2023d

Yang, Z., He, X., Li, Z., Backes, M., Humbert, M., Berrang, P., & Zhang, Y. (2023). Data poisoning attacks against multimodal encoders. International Conference on Machine Learning.

Yao et al., 2019

Yao, Y., Li, H., Zheng, H., & Zhao, B. Y. (2019). Latent backdoor attacks on deep neural networks. ACM SIGSAC Conference on Computer and Communications Security (pp. 2041–2055).

Ye et al., 2021

Ye, D., Lin, Y., Huang, Y., & Sun, M. (2021). Tr-bert: dynamic token reduction for accelerating bert inference. Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

Yeom et al., 2018

Yeom, S., Giacomelli, I., Fredrikson, M., & Jha, S. (2018). Privacy risk in machine learning: analyzing the connection to overfitting. IEEE Computer Security Foundations Symposium.

Yin et al., 2018

Yin, D., Chen, Y., Kannan, R., & Bartlett, P. (2018). Byzantine-robust distributed learning: towards optimal statistical rates. International Conference on Machine Learning.

Yin et al., 2020

Yin, H., Molchanov, P., Alvarez, J. M., Li, Z., Mallya, A., Hoiem, D., … Kautz, J. (2020). Dreaming to distill: data-free knowledge transfer via deepinversion. IEEE/CVF Conference on Computer Vision and Pattern Recognition.

Yu et al., 2018

Yu, C., Wang, J., Peng, C., Gao, C., Yu, G., & Sang, N. (2018). Bisenet: bilateral segmentation network for real-time semantic segmentation. European Conference on Computer Vision.

Yu et al., 2020

Yu, T., Kumar, S., Gupta, A., Levine, S., Hausman, K., & Finn, C. (2020). Gradient surgery for multi-task learning. Advances in Neural Information Processing Systems.

Yuan et al., 2023

Yuan, Z., Zhou, P., Zou, K., & Cheng, Y. (2023). You are catching my attention: are vision transformers bad learners under backdoor attacks? IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 24605–24615).

Zhai et al., 2023

Zhai, S., Dong, Y., Shen, Q., Pu, S., Fang, Y., & Su, H. (2023). Text-to-image diffusion models can be easily backdoored through multimodal data poisoning. ACM International Conference on Multimedia.

Zhang et al., 2019a

Zhang, D., Zhang, T., Lu, Y., Zhu, Z., & Dong, B. (2019). You only propagate once: accelerating adversarial training via maximal principle. Advances in Neural Information Processing Systems.

Zhang et al., 2019b

Zhang, H., Yu, Y., Jiao, J., Xing, E., El Ghaoui, L., & Jordan, M. (2019). Theoretically principled trade-off between robustness and accuracy. International Conference on Machine Learning (pp. 7472–7482).

Zhang et al., 2024a

Zhang, J., Wang, Z., Wang, R., Ma, X., & Jiang, Y.-G. (2024). Enja: ensemble jailbreak on large language models. arXiv preprint arXiv:2408.03603.

Zhang et al., 2018

Zhang, J., Gu, Z., Jang, J., Wu, H., Stoecklin, M. P., Huang, H., & Molloy, I. (2018). Protecting intellectual property of deep neural networks with watermarking. ACM Asia Conference on Computer and Communications Security.

Zhang et al., 2024b

Zhang, J., Ma, X., Wang, X., Qiu, L., Wang, J., Jiang, Y.-G., & Sang, J. (2024). Adversarial prompt tuning for vision-language models. European Conference on Computer Vision.

Zhang et al., 2022a

Zhang, J., Yi, Q., & Sang, J. (2022). Towards adversarial attack on vision-language pre-training models. ACM International Conference on Multimedia (pp. 5005–5013).

Zhang et al., 2017

Zhang, J., Zheng, K., Mou, W., & Wang, L. (2017). Efficient private ERM for smooth objectives. International Joint Conference on Artificial Intelligence.

Zhang et al., 2020a

Zhang, J., Chen, D., Liao, J., Fang, H., Zhang, W., Zhou, W., … Yu, N. (2020). Model watermarking for image processing networks. AAAI Conference on Artificial Intelligence.

Zhang et al., 2021

Zhang, J., Chen, D., Liao, J., Zhang, W., Feng, H., Hua, G., & Yu, N. (2021). Deep model intellectual property protection via deep watermarking. IEEE Transactions on Pattern Analysis and Machine Intelligence.

Zhang et al., 2020b

Zhang, J., Xu, X., Han, B., Niu, G., Cui, L., Sugiyama, M., & Kankanhalli, M. (2020). Attacks which do not kill training make adversarial learning stronger. International Conference on Machine Learning (pp. 11278–11287).

Zhang et al., 2020c

Zhang, J., Zhu, J., Niu, G., Han, B., Sugiyama, M., & Kankanhalli, M. (2020). Geometry-aware instance-reweighted adversarial training. International Conference on Learning Representations.

Zhang et al., 2024c

Zhang, J., Liu, H., Jia, J., & Gong, N. Z. (2024). Data poisoning based backdoor attacks to contrastive learning. IEEE/CVF Conference on Computer Vision and Pattern Recognition.

Zhang et al., 2024d

Zhang, M., Yu, N., Wen, R., Backes, M., & Zhang, Y. (2024). Generated distributions are all you need for membership inference attacks against generative models. IEEE/CVF Winter Conference on Applications of Computer Vision.

Zhang et al., 2022b

Zhang, R., Zhang, W., Fang, R., Gao, P., Li, K., Dai, J., … Li, H. (2022). Tip-adapter: training-free adaption of clip for few-shot classification. European Conference on Computer Vision.

Zhang et al., 2023

Zhang, S., Zhang, M., Pan, X., & Yang, M. (2023). No-skim: towards efficiency robustness evaluation on skimming-based language models. arXiv preprint arXiv:2312.09494.

Zhang et al., 2020d

Zhang, T., Kishore, V., Wu, F., Weinberger, K. Q., & Artzi, Y. (2020). Bertscore: evaluating text generation with bert. International Conference on Learning Representations.

Zhao et al., 2021

Zhao, H., Wei, T., Zhou, W., Zhang, W., Chen, D., & Yu, N. (2021). Multi-attentional deepfake detection. IEEE/CVF Conference on Computer Vision and Pattern Recognition.

Zhao et al., 2024

Zhao, Y., Pang, T., Du, C., Yang, X., Li, C., Cheung, N.-M. M., & Lin, M. (2024). On evaluating adversarial robustness of large vision-language models. Advances in Neural Information Processing Systems.

Zheng et al., 2023

Zheng, M., Lou, Q., & Jiang, L. (2023). Trojvit: trojan insertion in vision transformers. IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4025–4034).

Zhou et al., 2024a

Zhou, A., Li, B., & Wang, H. (2024). Robust prompt optimization for defending language models against jailbreaking attacks. arXiv preprint arXiv:2401.17263.

Zhou et al., 2024b

Zhou, C., Liu, P., Xu, P., Iyer, S., Sun, J., Mao, Y., … others. (2024). Lima: less is more for alignment. Advances in Neural Information Processing Systems.

Zhou et al., 2023a

Zhou, Z., Hu, S., Li, M., Zhang, H., Zhang, Y., & Jin, H. (2023). Advclip: downstream-agnostic adversarial examples in multimodal contrastive learning. ACM International Conference on Multimedia.

Zhou et al., 2023b

Zhou, Z., Hu, S., Zhao, R., Wang, Q., Zhang, L. Y., Hou, J., & Jin, H. (2023). Downstream-agnostic adversarial examples. IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4345–4355).

Zhu et al., 2020

Zhu, C., Cheng, Y., Gan, Z., Sun, S., Goldstein, T., & Liu, J. (2020). Freelb: enhanced adversarial training for natural language understanding. International Conference on Learning Representations.

Zhu et al., 2019

Zhu, C., Huang, W. R., Li, H., Taylor, G., Studer, C., & Goldstein, T. (2019). Transferable clean-label poisoning attacks on deep neural nets. International Conference on Machine Learning (pp. 7614–7623).

Zhu et al., 2023

Zhu, D., Chen, J., Shen, X., Li, X., & Elhoseiny, M. (2023). Minigpt-4: enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592.

Zhu et al., 2021

Zhu, J., Yao, J., Han, B., Zhang, J., Liu, T., Niu, G., … Yang, H. (2021). Reliable adversarial distillation with unreliable teachers. International Conference on Learning Representations.

Zhu et al., 2024

Zhu, L., Ning, R., Li, J., Xin, C., & Wu, H. (2024). Seer: backdoor detection for vision-language models through searching target text and image trigger jointly. AAAI Conference on Artificial Intelligence.

Zhuang et al., 2023

Zhuang, H., Zhang, Y., & Liu, S. (2023). A pilot study of query-free adversarial attack against stable diffusion. IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 2384–2391).

Zi et al., 2021

Zi, B., Zhao, S., Ma, X., & Jiang, Y.-G. (2021). Revisiting adversarial robustness distillation: robust soft labels make student better. IEEE/CVF International Conference on Computer Vision.

Zou et al., 2023

Zou, A., Wang, Z., Kolter, J. Z., & Fredrikson, M. (2023). Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.

Zhang et al., 2023

Zhang, Q., Gui, T., & Huang, X. (2023). Introduction to natural language processing (自然语言处理导论). Shanghai: Publishing House of Electronics Industry.