---
title: 'Adversarial Attacks and Defenses in Large Language Models: Old and New Threats'
abstract: 'Over the past decade, there has been extensive research aimed at enhancing the robustness of neural networks, yet this problem remains vastly unsolved. Here, one major impediment has been the overestimation of the robustness of new defense approaches due to faulty defense evaluations. Flawed robustness evaluations necessitate rectifications in subsequent works, dangerously slowing down the research and providing a false sense of security. In this context, we will face substantial challenges associated with an impending adversarial arms race in natural language processing, specifically with closed-source Large Language Models (LLMs), such as ChatGPT, Google Bard, or Anthropic’s Claude. We provide a first set of prerequisites to improve the robustness assessment of new approaches and reduce the amount of faulty evaluations. Additionally, we identify embedding space attacks on LLMs as another viable threat model for the purposes of generating malicious content in open-sourced models. Finally, we demonstrate on a recently proposed defense that, without LLM-specific best practices in place, it is easy to overestimate the robustness of a new approach.'
layout: inproceedings
series: Proceedings of Machine Learning Research
publisher: PMLR
issn: 2640-3498
id: schwinn23a
month: 0
tex_title: 'Adversarial Attacks and Defenses in Large Language Models: Old and New Threats'
firstpage: 103
lastpage: 117
page: 103-117
order: 103
cycles: false
bibtex_author: Schwinn, Leo and Dobre, David and G{\"u}nnemann, Stephan and Gidel, Gauthier
author:
- given: Leo
  family: Schwinn
- given: David
  family: Dobre
- given: Stephan
  family: Günnemann
- given: Gauthier
  family: Gidel
date: 2023-04-24
address:
container-title: 'Proceedings on "I Can''t Believe It''s Not Better: Failure Modes in the Age of Foundation Models" at NeurIPS 2023 Workshops'
volume: 239
genre: inproceedings
issued:
  date-parts:
  - 2023
  - 4
  - 24
pdf:
extras:
---