[
{
"review": {
"id": "IiDLi8IfTR",
"forum": "lNLVvdHyAw",
"replyto": "lNLVvdHyAw",
"signatures": [
"ICLR.cc/2024/Conference/Submission6181/Reviewer_8VeL"
],
"nonreaders": [],
"readers": [
"everyone"
],
"writers": [
"ICLR.cc/2024/Conference",
"ICLR.cc/2024/Conference/Submission6181/Reviewer_8VeL"
],
"content": {
"summary": {
"value": "This paper focuses on jailbreak attacks via exploiting adversarial suffixes to hack large language models (LLMs). This work evaluates the use of perplexity for detecting the kind of jailbreak. Based on that, this work proposes a classifier trained on perplexity and token sequence length to improve the perplexity filtering. Comprehensive analysis and experiments are conducted to provide insights on identifying the adversarial suffix."
},
"soundness": {
"value": "2 fair"
},
"presentation": {
"value": "2 fair"
},
"contribution": {
"value": "3 good"
},
"strengths": {
"value": "1. This paper focuses on a newly emerged and important research topic, i.e., jailbreak attack on large language models.\n2. This paper provides a pioneer investigation about how to identify the adversarial suffix which is demonstrated to be effective in jailbreaking the large language models.\n3. This paper conducts and presents a comprehensive experimental part to show that the adversarial suffix is identifiable via perplexity.\n4. This paper provides corresponding discussions about the analytical results which contain much useful insights for later research on detecting and defending against such kind of adversarial suffix."
},
"weaknesses": {
"value": "1. The writing and presentation of the current version of this work can be further improved to highlight some technical contributions and also the structure of this draft.\n2. The use of perplexity is a little bit heuristic, with limited intuitive motivation for the proposed method. The underlying mechanism of the perplexity for adversarial suffixes is under-explained.\n3. The experiments strictly use GPT-2 for perplexity, could this be replaceable for any other choice? although the authors have already listed it as one of the limitations, it could be better to provide some discussion on this choice.\n4. It could be better to provide some discussion about how to detect human-crafted jailbreaks via the perspective of perplexity."
},
"questions": {
"value": "1. The underlying mechanism of perplexity for adversarial suffixes can be more clearly explained or presented.\n2. It could be better to enhance the experimental parts using different models for perplexity.\n3. It could be better to provide some discussion about how to detect human-crafted jailbreaks via the perspective of perplexity. \n4. The structure of the current version can be better improved to enhance the readability, and highlight some contribution points in the method part and also some conclusions in the analytics."
},
"flag_for_ethics_review": {
"value": [
"No ethics review needed."
]
},
"rating": {
"value": "5: marginally below the acceptance threshold"
},
"confidence": {
"value": "3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked."
},
"code_of_conduct": {
"value": "Yes"
}
},
"number": 1,
"invitations": [
"ICLR.cc/2024/Conference/Submission6181/-/Official_Review",
"ICLR.cc/2024/Conference/-/Edit"
],
"domain": "ICLR.cc/2024/Conference",
"tcdate": 1698643937829,
"cdate": 1698643937829,
"tmdate": 1699636672160,
"mdate": 1699636672160,
"license": "CC BY 4.0",
"version": 2
},
"response": {
"id": "Mr6UvbNUJ1",
"forum": "lNLVvdHyAw",
"replyto": "IiDLi8IfTR",
"signatures": [
"ICLR.cc/2024/Conference/Submission6181/Authors"
],
"readers": [
"everyone"
],
"writers": [
"ICLR.cc/2024/Conference",
"ICLR.cc/2024/Conference/Submission6181/Authors"
],
"content": {
"title": {
"value": "Continued Updates and Clarifications"
},
"comment": {
"value": "7. We now cite \"On Evaluating Adversarial Robustness\" (Carlini, 2019), which is described as a \"living document\" of recommendations on how to research and publish work on adversarial defenses. A GitHub link was provided there for researchers to propose amendments to the guidelines. We suggest an update to these guidelines to propose that defenses rigorously evaluate the rejection rates on regular day-to-day user behavior that their defense would have in practice. We tried to achieve this goal by evaluating over 175,000 regular prompts from 8 sources to stress test our classifier. Now that neural network systems like ChatGPT have millions of text queries a day, defenses for LLMs must perform such stress tests on large diverse samples of regular prompts to be useful.\n\nThey write \"The source code for a defense can be seen as the definitive reference for the algorithm\". In that spirit, it is fair to say that our defense is aligned with the source code reference for Zou et al., rather than any configuration of the GCG algorithm that could be inspired by their paper. They also wrote, \"Despite the significant amount of recent work attempting to design defenses that withstand adaptive attacks, few have succeeded;\" We better understood the effort to acknowledge potential and actual adaptive flaws against our defense after reading this work, so we empathize that readers will benefit from this resource as well."
}
},
"number": 9,
"invitations": [
"ICLR.cc/2024/Conference/Submission6181/-/Official_Comment"
],
"domain": "ICLR.cc/2024/Conference",
"tcdate": 1700548232972,
"cdate": 1700548232972,
"tmdate": 1700583403302,
"mdate": 1700583403302,
"license": "CC BY 4.0",
"version": 2
}
},
{
"review": {
"id": "qlDQYNZXl5",
"forum": "lNLVvdHyAw",
"replyto": "lNLVvdHyAw",
"signatures": [
"ICLR.cc/2024/Conference/Submission6181/Reviewer_hkGN"
],
"nonreaders": [],
"readers": [
"everyone"
],
"writers": [
"ICLR.cc/2024/Conference",
"ICLR.cc/2024/Conference/Submission6181/Reviewer_hkGN"
],
"content": {
"summary": {
"value": "The paper presents a method for detecting malicious prompts in Language Model Models (LLMs). The central concept involves utilizing GPT-2 to calculate the perplexity (PPL) of each prompt. Adversarial prompts, such as those generated by GCG, often consist of unreadable tokens, resulting in higher PPL values compared to benign prompts. The distinguishable PPL serves as an indicator to flag malicious prompts."
},
"soundness": {
"value": "2 fair"
},
"presentation": {
"value": "1 poor"
},
"contribution": {
"value": "2 fair"
},
"strengths": {
"value": "1. The paper focuses on a pressing and significant safety issue pertaining to the emerging Language Model Models (LLMs).\n\n2. The core idea of the paper is straightforward."
},
"weaknesses": {
"value": "1. The writing quality of the paper is poor, and its current state hinders clear comprehension and detracts from the overall presentation of the research.\n\n2. The paper would benefit from including evaluations on the adaptive attack setting. Currently, the perplexity s calculated using another LLM, namely GPT-2, which can potentially be deceived by adversarial attacks such as GCG. It is essential to consider that an attacker may strategically leverage the proposed defense mechanism to perform an overall optimization and potentially overcome the entire system. Therefore, it is important for the authors to explore and address this potential vulnerability in their evaluation.\n\n3. A more realistic scenario to consider is when the benign tokens of the prompt are considerably longer, while the adversarial suffix only consists of a few words. In such cases, the overall adversarial prompt may still maintain a relatively low perplexity (PPL) value. It would be valuable for the authors to acknowledge and discuss this potential challenge, as it can have implications for the effectiveness of the proposed detection method."
},
"questions": {
"value": "Please refer to the weakness section."
},
"flag_for_ethics_review": {
"value": [
"No ethics review needed."
]
},
"rating": {
"value": "5: marginally below the acceptance threshold"
},
"confidence": {
"value": "3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked."
},
"code_of_conduct": {
"value": "Yes"
}
},
"number": 2,
"invitations": [
"ICLR.cc/2024/Conference/Submission6181/-/Official_Review",
"ICLR.cc/2024/Conference/-/Edit"
],
"domain": "ICLR.cc/2024/Conference",
"tcdate": 1698768853220,
"cdate": 1698768853220,
"tmdate": 1699636672056,
"mdate": 1699636672056,
"license": "CC BY 4.0",
"version": 2
},
"response": {
"id": "ksh5aBgRUj",
"forum": "lNLVvdHyAw",
"replyto": "qlDQYNZXl5",
"signatures": [
"ICLR.cc/2024/Conference/Submission6181/Authors"
],
"readers": [
"everyone"
],
"writers": [
"ICLR.cc/2024/Conference",
"ICLR.cc/2024/Conference/Submission6181/Authors"
],
"content": {
"title": {
"value": "Question about Renaming Paper Title"
},
"comment": {
"value": "Hello,\nThank you for your time -- we have an idea to address your concern.\nOur paper's content is focused on detecting Zou et al.'s GCG based attack and contrasting its effectiveness when applying it to Jamarillo's attack. Would renaming the title from Detecting Language Models Attacks with Perplexity, to Detecting Greedy Coordinate Gradient Attacks with Perplexity, resolve your concern that newer non-GCG language model attacks have come about that perplexity would not be useful for detecting?"
}
},
"number": 15,
"invitations": [
"ICLR.cc/2024/Conference/Submission6181/-/Official_Comment"
],
"domain": "ICLR.cc/2024/Conference",
"tcdate": 1700724783247,
"cdate": 1700724783247,
"tmdate": 1700724846854,
"mdate": 1700724846854,
"license": "CC BY 4.0",
"version": 2
}
},
{
"review": {
"id": "8lyX3QkumL",
"forum": "lNLVvdHyAw",
"replyto": "lNLVvdHyAw",
"signatures": [
"ICLR.cc/2024/Conference/Submission6181/Reviewer_VSJg"
],
"nonreaders": [],
"readers": [
"everyone"
],
"writers": [
"ICLR.cc/2024/Conference",
"ICLR.cc/2024/Conference/Submission6181/Reviewer_VSJg"
],
"content": {
"summary": {
"value": "Zou et al.'s adversarial attack on LLMs results in unreadable adversarial suffixes. This paper proposes a detection method using perplexity, a measure of readability. The authors highlight the marked difference in perplexity between adversarial and regular prompts. They also emphasize the difficulty of attaining low false positives with a straightforward perplexity filter. To address this, they consider both perplexity and token sequence length as two features, and train a classifier to reduce false positive rates. Overall, this work demonstrates a potential way to defend against adversarial suffixes."
},
"soundness": {
"value": "3 good"
},
"presentation": {
"value": "2 fair"
},
"contribution": {
"value": "2 fair"
},
"strengths": {
"value": "1. The message conveyed by this paper is clear and easy to understand. The empirical results serve as a helpful reference for future work. \n2. The authors collect regular prompts from various datasets, covering both human-crafted and machine-generated prompts. This better reflects real scenarios.\n3. The authors also point out that perplexity filtering cannot detect human-crafted jailbreaks, shedding light on the nuances of various jailbreak attacks."
},
"weaknesses": {
"value": "1. While the empirical evaluation is detailed, the overall idea seems straightforward given the stark gibberish looking of adversarial suffixes. Given this, I would expect more technical contributions like \n - Evaluating if the perplexity filter itself is robust against evading attacks.\n - Evaluating if different base models for calculating perplexity lead to different results.\n2. There is room for refining the paper's presentation, such as eliminating superfluous spaces to make it more compact."
},
"questions": {
"value": "1. Using token length as an additional feature for detection warrants further scrutiny. How susceptible is it to such evading attacks that lengthen the suffixes with filler texts?"
},
"flag_for_ethics_review": {
"value": [
"No ethics review needed."
]
},
"rating": {
"value": "5: marginally below the acceptance threshold"
},
"confidence": {
"value": "3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked."
},
"code_of_conduct": {
"value": "Yes"
}
},
"number": 3,
"invitations": [
"ICLR.cc/2024/Conference/Submission6181/-/Official_Review",
"ICLR.cc/2024/Conference/-/Edit"
],
"domain": "ICLR.cc/2024/Conference",
"tcdate": 1698810518215,
"cdate": 1698810518215,
"tmdate": 1699636671946,
"mdate": 1699636671946,
"license": "CC BY 4.0",
"version": 2
},
"response": {
"id": "OftzaGIBeD",
"forum": "lNLVvdHyAw",
"replyto": "8lyX3QkumL",
"signatures": [
"ICLR.cc/2024/Conference/Submission6181/Authors"
],
"readers": [
"everyone"
],
"writers": [
"ICLR.cc/2024/Conference",
"ICLR.cc/2024/Conference/Submission6181/Authors"
],
"content": {
"title": {
"value": "Adding a Reference to Carlini 2019"
},
"comment": {
"value": "7. We now cite \"On Evaluating Adversarial Robustness\" (Carlini, 2019), which is described as a \"living document\" of recommendations on how to research and publish work on adversarial defenses. A GitHub link was provided there for researchers to propose amendments to the guidelines. We suggest an update to these guidelines to propose that defenses rigorously evaluate the rejection rates on regular day-to-day user behavior that their defense would have in practice. We tried to achieve this\u00a0goal by evaluating over 175,000 regular prompts from 8 sources to stress test our classifier. Now that neural network systems like ChatGPT have millions of text queries a day, defenses for LLMs must perform such stress tests on large diverse samples of regular prompts to be useful.\n\nThey write \"The source code for a defense can be seen as the definitive reference for the algorithm\". In that spirit, it is fair to say that our defense is aligned with the source code reference for Zou et al., rather than any configuration of the GCG algorithm that could be inspired by their paper. They also wrote, \"Despite the significant amount of recent work attempting to design defenses that withstand adaptive attacks, few have succeeded;\" We better understood the effort to acknowledge potential and actual adaptive flaws against our defense after reading this work, so we empathize that readers will benefit from this resource as well."
}
},
"number": 5,
"invitations": [
"ICLR.cc/2024/Conference/Submission6181/-/Official_Comment"
],
"domain": "ICLR.cc/2024/Conference",
"tcdate": 1700547738423,
"cdate": 1700547738423,
"tmdate": 1700583352527,
"mdate": 1700583352527,
"license": "CC BY 4.0",
"version": 2
}
},
{
"review": {
"id": "TWePJA7L29",
"forum": "lNLVvdHyAw",
"replyto": "GWC65frD5V",
"signatures": [
"ICLR.cc/2024/Conference/Submission6181/Reviewer_8VeL"
],
"readers": [
"everyone"
],
"writers": [
"ICLR.cc/2024/Conference",
"ICLR.cc/2024/Conference/Submission6181/Reviewer_8VeL"
],
"content": {
"title": {
"value": "Clarification on the Question 3"
},
"comment": {
"value": "Dear Authors,\n\nThanks for the clarification on the perplexity with the human-crafted jailbreaks. It is fine that the perplexity is ineffective at detecting human-crafted jailbreaks. The original question (or saying \"suggestion\") for the weaknesses point is aimed at better discussing the underlying mechanism of why the perplexity is ineffective, and whether is there any possibility or potential for detecting human-crafted jailbreaks. Since currently, human-crafted jailbreaks are more practical (easy-to-implement) in some scenarios, Question 3 may serve as a discussion point. \n\nThanks!\n\nBest regards,\nReviewer 8VeL"
}
},
"number": 3,
"invitations": [
"ICLR.cc/2024/Conference/Submission6181/-/Official_Comment"
],
"domain": "ICLR.cc/2024/Conference",
"tcdate": 1700359398944,
"cdate": 1700359398944,
"tmdate": 1700359398944,
"mdate": 1700359398944,
"license": "CC BY 4.0",
"version": 2
},
"response": null
},
{
"review": {
"id": "5cOKfJrmNo",
"forum": "lNLVvdHyAw",
"replyto": "Ck8mNjrZyH",
"signatures": [
"ICLR.cc/2024/Conference/Submission6181/Reviewer_hkGN"
],
"readers": [
"everyone"
],
"writers": [
"ICLR.cc/2024/Conference",
"ICLR.cc/2024/Conference/Submission6181/Reviewer_hkGN"
],
"content": {
"title": {
"value": "Response to Rebuttal"
},
"comment": {
"value": "Thanks for the authors response. One of my main concerns, as mentioned in weakness 2, remains unresolved. I highly recommend that the authors read the paper titled \"AutoDan: Automatic and interpretable adversarial attacks on LLM\" https://arxiv.org/abs/2310.15140. In this paper, the proposed attack takes readability into consideration as an optimization constraint and successfully evades PPL checking for attacks. Therefore, I would like to keep my score."
}
},
"number": 13,
"invitations": [
"ICLR.cc/2024/Conference/Submission6181/-/Official_Comment"
],
"domain": "ICLR.cc/2024/Conference",
"tcdate": 1700715574733,
"cdate": 1700715574733,
"tmdate": 1700715574733,
"mdate": 1700715574733,
"license": "CC BY 4.0",
"version": 2
},
"response": null
}
]
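
The reviews and responses above repeatedly discuss the submission's core mechanism: compute the perplexity of an incoming prompt with GPT-2 and flag prompts whose perplexity is anomalously high, since GCG-style adversarial suffixes tend to be unreadable. The sketch below is a minimal illustration of such a perplexity filter appended for reference only; it is not the authors' implementation. It assumes the Hugging Face transformers and torch packages, and the "gpt2" checkpoint and threshold value are placeholder choices rather than the submission's configuration.

# Minimal sketch of a perplexity-based prompt filter (illustrative only).
# Assumes: pip install torch transformers. The threshold is a placeholder.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(prompt: str) -> float:
    # Perplexity = exp(mean negative log-likelihood of the token sequence).
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy over tokens
    return torch.exp(loss).item()

def flag_prompt(prompt: str, threshold: float = 1000.0) -> bool:
    # Flag prompts whose perplexity exceeds the threshold as likely to carry
    # a GCG-style (unreadable) adversarial suffix.
    return perplexity(prompt) > threshold

As reviewer hkGN's third weakness notes, a long benign prefix can dilute the suffix's effect on whole-prompt perplexity, which is why the submission also trains a classifier that combines perplexity with token sequence length rather than relying on a single threshold.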