[
{
"review": {
"id": "IiDLi8IfTR",
"forum": "lNLVvdHyAw",
"replyto": "lNLVvdHyAw",
"signatures": [
"ICLR.cc/2024/Conference/Submission6181/Reviewer_8VeL"
],
"nonreaders": [],
"readers": [
"everyone"
],
"writers": [
"ICLR.cc/2024/Conference",
"ICLR.cc/2024/Conference/Submission6181/Reviewer_8VeL"
],
"content": {
"summary": {
"value": "This paper focuses on jailbreak attacks via exploiting adversarial suffixes to hack large language models (LLMs). This work evaluates the use of perplexity for detecting the kind of jailbreak. Based on that, this work proposes a classifier trained on perplexity and token sequence length to improve the perplexity filtering. Comprehensive analysis and experiments are conducted to provide insights on identifying the adversarial suffix."
},
"soundness": {
"value": "2 fair"
},
"presentation": {
"value": "2 fair"
},
"contribution": {
"value": "3 good"
},
"strengths": {
"value": "1. This paper focuses on a newly emerged and important research topic, i.e., jailbreak attack on large language models.\n2. This paper provides a pioneer investigation about how to identify the adversarial suffix which is demonstrated to be effective in jailbreaking the large language models.\n3. This paper conducts and presents a comprehensive experimental part to show that the adversarial suffix is identifiable via perplexity.\n4. This paper provides corresponding discussions about the analytical results which contain much useful insights for later research on detecting and defending against such kind of adversarial suffix."
},
"weaknesses": {
"value": "1. The writing and presentation of the current version of this work can be further improved to highlight some technical contributions and also the structure of this draft.\n2. The use of perplexity is a little bit heuristic, with limited intuitive motivation for the proposed method. The underlying mechanism of the perplexity for adversarial suffixes is under-explained.\n3. The experiments strictly use GPT-2 for perplexity, could this be replaceable for any other choice? although the authors have already listed it as one of the limitations, it could be better to provide some discussion on this choice.\n4. It could be better to provide some discussion about how to detect human-crafted jailbreaks via the perspective of perplexity."
},
"questions": {
"value": "1. The underlying mechanism of perplexity for adversarial suffixes can be more clearly explained or presented.\n2. It could be better to enhance the experimental parts using different models for perplexity.\n3. It could be better to provide some discussion about how to detect human-crafted jailbreaks via the perspective of perplexity. \n4. The structure of the current version can be better improved to enhance the readability, and highlight some contribution points in the method part and also some conclusions in the analytics."
},
"flag_for_ethics_review": {
"value": [
"No ethics review needed."
]
},
"rating": {
"value": "5: marginally below the acceptance threshold"
},
"confidence": {
"value": "3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked."
},
"code_of_conduct": {
"value": "Yes"
}
},
"number": 1,
"invitations": [
"ICLR.cc/2024/Conference/Submission6181/-/Official_Review",
"ICLR.cc/2024/Conference/-/Edit"
],
"domain": "ICLR.cc/2024/Conference",
"tcdate": 1698643937829,
"cdate": 1698643937829,
"tmdate": 1699636672160,
"mdate": 1699636672160,
"license": "CC BY 4.0",
"version": 2
},
"response": {
"id": "Mr6UvbNUJ1",
"forum": "lNLVvdHyAw",
"replyto": "IiDLi8IfTR",
"signatures": [
"ICLR.cc/2024/Conference/Submission6181/Authors"
],
"readers": [
"everyone"
],
"writers": [
"ICLR.cc/2024/Conference",
"ICLR.cc/2024/Conference/Submission6181/Authors"
],
"content": {
"title": {
"value": "Continued Updates and Clarifications"
},
"comment": {
"value": "7. We now cite \"On Evaluating Adversarial Robustness\" (Carlini, 2019), which is described as a \"living document\" of recommendations on how to research and publish work on adversarial defenses. A GitHub link was provided there for researchers to propose amendments to the guidelines. We suggest an update to these guidelines to propose that defenses rigorously evaluate the rejection rates on regular day-to-day user behavior that their defense would have in practice. We tried to achieve this goal by evaluating over 175,000 regular prompts from 8 sources to stress test our classifier. Now that neural network systems like ChatGPT have millions of text queries a day, defenses for LLMs must perform such stress tests on large diverse samples of regular prompts to be useful.\n\nThey write \"The source code for a defense can be seen as the definitive reference for the algorithm\". In that spirit, it is fair to say that our defense is aligned with the source code reference for Zou et al., rather than any configuration of the GCG algorithm that could be inspired by their paper. They also wrote, \"Despite the significant amount of recent work attempting to design defenses that withstand adaptive attacks, few have succeeded;\" We better understood the effort to acknowledge potential and actual adaptive flaws against our defense after reading this work, so we empathize that readers will benefit from this resource as well."
}
},
"number": 9,
"invitations": [
"ICLR.cc/2024/Conference/Submission6181/-/Official_Comment"
],
"domain": "ICLR.cc/2024/Conference",
"tcdate": 1700548232972,
"cdate": 1700548232972,
"tmdate": 1700583403302,
"mdate": 1700583403302,
"license": "CC BY 4.0",
"version": 2
}
},
{
"review": {
"id": "qlDQYNZXl5",
"forum": "lNLVvdHyAw",
"replyto": "lNLVvdHyAw",
"signatures": [
"ICLR.cc/2024/Conference/Submission6181/Reviewer_hkGN"
],
"nonreaders": [],
"readers": [
"everyone"
],
"writers": [
"ICLR.cc/2024/Conference",
"ICLR.cc/2024/Conference/Submission6181/Reviewer_hkGN"
],
"content": {
"summary": {
"value": "The paper presents a method for detecting malicious prompts in Language Model Models (LLMs). The central concept involves utilizing GPT-2 to calculate the perplexity (PPL) of each prompt. Adversarial prompts, such as those generated by GCG, often consist of unreadable tokens, resulting in higher PPL values compared to benign prompts. The distinguishable PPL serves as an indicator to flag malicious prompts."
},
"soundness": {
"value": "2 fair"
},
"presentation": {
"value": "1 poor"
},
"contribution": {
"value": "2 fair"
},
"strengths": {
"value": "1. The paper focuses on a pressing and significant safety issue pertaining to the emerging Language Model Models (LLMs).\n\n2. The core idea of the paper is straightforward."
},
"weaknesses": {
"value": "1. The writing quality of the paper is poor, and its current state hinders clear comprehension and detracts from the overall presentation of the research.\n\n2. The paper would benefit from including evaluations on the adaptive attack setting. Currently, the perplexity s calculated using another LLM, namely GPT-2, which can potentially be deceived by adversarial attacks such as GCG. It is essential to consider that an attacker may strategically leverage the proposed defense mechanism to perform an overall optimization and potentially overcome the entire system. Therefore, it is important for the authors to explore and address this potential vulnerability in their evaluation.\n\n3. A more realistic scenario to consider is when the benign tokens of the prompt are considerably longer, while the adversarial suffix only consists of a few words. In such cases, the overall adversarial prompt may still maintain a relatively low perplexity (PPL) value. It would be valuable for the authors to acknowledge and discuss this potential challenge, as it can have implications for the effectiveness of the proposed detection method."
},
"questions": {
"value": "Please refer to the weakness section."
},
"flag_for_ethics_review": {
"value": [
"No ethics review needed."
]
},
"rating": {
"value": "5: marginally below the acceptance threshold"
},
"confidence": {
"value": "3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked."
},
"code_of_conduct": {
"value": "Yes"
}
},
"number": 2,
"invitations": [
"ICLR.cc/2024/Conference/Submission6181/-/Official_Review",
"ICLR.cc/2024/Conference/-/Edit"
],
"domain": "ICLR.cc/2024/Conference",
"tcdate": 1698768853220,
"cdate": 1698768853220,
"tmdate": 1699636672056,
"mdate": 1699636672056,
"license": "CC BY 4.0",
"version": 2
},
"response": {
"id": "ksh5aBgRUj",
"forum": "lNLVvdHyAw",
"replyto": "qlDQYNZXl5",
"signatures": [
"ICLR.cc/2024/Conference/Submission6181/Authors"
],
"readers": [
"everyone"
],
"writers": [
"ICLR.cc/2024/Conference",
"ICLR.cc/2024/Conference/Submission6181/Authors"
],
"content": {
"title": {
"value": "Question about Renaming Paper Title"
},
"comment": {
"value": "Hello,\nThank you for your time -- we have an idea to address your concern.\nOur paper's content is focused on detecting Zou et al.'s GCG based attack and contrasting its effectiveness when applying it to Jamarillo's attack. Would renaming the title from Detecting Language Models Attacks with Perplexity, to Detecting Greedy Coordinate Gradient Attacks with Perplexity, resolve your concern that newer non-GCG language model attacks have come about that perplexity would not be useful for detecting?"
}
},
"number": 15,
"invitations": [
"ICLR.cc/2024/Conference/Submission6181/-/Official_Comment"
],
"domain": "ICLR.cc/2024/Conference",
"tcdate": 1700724783247,
"cdate": 1700724783247,
"tmdate": 1700724846854,
"mdate": 1700724846854,
"license": "CC BY 4.0",
"version": 2
}
},
{
"review": {
"id": "8lyX3QkumL",
"forum": "lNLVvdHyAw",
"replyto": "lNLVvdHyAw",
"signatures": [
"ICLR.cc/2024/Conference/Submission6181/Reviewer_VSJg"
],
"nonreaders": [],
"readers": [
"everyone"
],
"writers": [
"ICLR.cc/2024/Conference",
"ICLR.cc/2024/Conference/Submission6181/Reviewer_VSJg"
],
"content": {
"summary": {
"value": "Zou et al.'s adversarial attack on LLMs results in unreadable adversarial suffixes. This paper proposes a detection method using perplexity, a measure of readability. The authors highlight the marked difference in perplexity between adversarial and regular prompts. They also emphasize the difficulty of attaining low false positives with a straightforward perplexity filter. To address this, they consider both perplexity and token sequence length as two features, and train a classifier to reduce false positive rates. Overall, this work demonstrates a potential way to defend against adversarial suffixes."
},
"soundness": {
"value": "3 good"
},
"presentation": {
"value": "2 fair"
},
"contribution": {
"value": "2 fair"
},
"strengths": {
"value": "1. The message conveyed by this paper is clear and easy to understand. The empirical results serve as a helpful reference for future work. \n2. The authors collect regular prompts from various datasets, covering both human-crafted and machine-generated prompts. This better reflects real scenarios.\n3. The authors also point out that perplexity filtering cannot detect human-crafted jailbreaks, shedding light on the nuances of various jailbreak attacks."
},
"weaknesses": {
"value": "1. While the empirical evaluation is detailed, the overall idea seems straightforward given the stark gibberish looking of adversarial suffixes. Given this, I would expect more technical contributions like \n - Evaluating if the perplexity filter itself is robust against evading attacks.\n - Evaluating if different base models for calculating perplexity lead to different results.\n2. There is room for refining the paper's presentation, such as eliminating superfluous spaces to make it more compact."
},
"questions": {
"value": "1. Using token length as an additional feature for detection warrants further scrutiny. How susceptible is it to such evading attacks that lengthen the suffixes with filler texts?"
},
"flag_for_ethics_review": {
"value": [
"No ethics review needed."
]
},
"rating": {
"value": "5: marginally below the acceptance threshold"
},
"confidence": {
"value": "3: You are fairly confident in your assessment. It is possible that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work. Math/other details were not carefully checked."
},
"code_of_conduct": {
"value": "Yes"
}
},
"number": 3,
"invitations": [
"ICLR.cc/2024/Conference/Submission6181/-/Official_Review",
"ICLR.cc/2024/Conference/-/Edit"
],
"domain": "ICLR.cc/2024/Conference",
"tcdate": 1698810518215,
"cdate": 1698810518215,
"tmdate": 1699636671946,
"mdate": 1699636671946,
"license": "CC BY 4.0",
"version": 2
},
"response": {
"id": "OftzaGIBeD",
"forum": "lNLVvdHyAw",
"replyto": "8lyX3QkumL",
"signatures": [
"ICLR.cc/2024/Conference/Submission6181/Authors"
],
"readers": [
"everyone"
],
"writers": [
"ICLR.cc/2024/Conference",
"ICLR.cc/2024/Conference/Submission6181/Authors"
],
"content": {
"title": {
"value": "Adding a Reference to Carlini 2019"
},
"comment": {
"value": "7. We now cite \"On Evaluating Adversarial Robustness\" (Carlini, 2019), which is described as a \"living document\" of recommendations on how to research and publish work on adversarial defenses. A GitHub link was provided there for researchers to propose amendments to the guidelines. We suggest an update to these guidelines to propose that defenses rigorously evaluate the rejection rates on regular day-to-day user behavior that their defense would have in practice. We tried to achieve this\u00a0goal by evaluating over 175,000 regular prompts from 8 sources to stress test our classifier. Now that neural network systems like ChatGPT have millions of text queries a day, defenses for LLMs must perform such stress tests on large diverse samples of regular prompts to be useful.\n\nThey write \"The source code for a defense can be seen as the definitive reference for the algorithm\". In that spirit, it is fair to say that our defense is aligned with the source code reference for Zou et al., rather than any configuration of the GCG algorithm that could be inspired by their paper. They also wrote, \"Despite the significant amount of recent work attempting to design defenses that withstand adaptive attacks, few have succeeded;\" We better understood the effort to acknowledge potential and actual adaptive flaws against our defense after reading this work, so we empathize that readers will benefit from this resource as well."
}
},
"number": 5,
"invitations": [
"ICLR.cc/2024/Conference/Submission6181/-/Official_Comment"
],
"domain": "ICLR.cc/2024/Conference",
"tcdate": 1700547738423,
"cdate": 1700547738423,
"tmdate": 1700583352527,
"mdate": 1700583352527,
"license": "CC BY 4.0",
"version": 2
}
},
{
"review": {
"id": "TWePJA7L29",
"forum": "lNLVvdHyAw",
"replyto": "GWC65frD5V",
"signatures": [
"ICLR.cc/2024/Conference/Submission6181/Reviewer_8VeL"
],
"readers": [
"everyone"
],
"writers": [
"ICLR.cc/2024/Conference",
"ICLR.cc/2024/Conference/Submission6181/Reviewer_8VeL"
],
"content": {
"title": {
"value": "Clarification on the Question 3"
},
"comment": {
"value": "Dear Authors,\n\nThanks for the clarification on the perplexity with the human-crafted jailbreaks. It is fine that the perplexity is ineffective at detecting human-crafted jailbreaks. The original question (or saying \"suggestion\") for the weaknesses point is aimed at better discussing the underlying mechanism of why the perplexity is ineffective, and whether is there any possibility or potential for detecting human-crafted jailbreaks. Since currently, human-crafted jailbreaks are more practical (easy-to-implement) in some scenarios, Question 3 may serve as a discussion point. \n\nThanks!\n\nBest regards,\nReviewer 8VeL"
}
},
"number": 3,
"invitations": [
"ICLR.cc/2024/Conference/Submission6181/-/Official_Comment"
],
"domain": "ICLR.cc/2024/Conference",
"tcdate": 1700359398944,
"cdate": 1700359398944,
"tmdate": 1700359398944,
"mdate": 1700359398944,
"license": "CC BY 4.0",
"version": 2
},
"response": null
},
{
"review": {
"id": "5cOKfJrmNo",
"forum": "lNLVvdHyAw",
"replyto": "Ck8mNjrZyH",
"signatures": [
"ICLR.cc/2024/Conference/Submission6181/Reviewer_hkGN"
],
"readers": [
"everyone"
],
"writers": [
"ICLR.cc/2024/Conference",
"ICLR.cc/2024/Conference/Submission6181/Reviewer_hkGN"
],
"content": {
"title": {
"value": "Response to Rebuttal"
},
"comment": {
"value": "Thanks for the authors response. One of my main concerns, as mentioned in weakness 2, remains unresolved. I highly recommend that the authors read the paper titled \"AutoDan: Automatic and interpretable adversarial attacks on LLM\" https://arxiv.org/abs/2310.15140. In this paper, the proposed attack takes readability into consideration as an optimization constraint and successfully evades PPL checking for attacks. Therefore, I would like to keep my score."
}
},
"number": 13,
"invitations": [
"ICLR.cc/2024/Conference/Submission6181/-/Official_Comment"
],
"domain": "ICLR.cc/2024/Conference",
"tcdate": 1700715574733,
"cdate": 1700715574733,
"tmdate": 1700715574733,
"mdate": 1700715574733,
"license": "CC BY 4.0",
"version": 2
},
"response": null
}
]
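
The reviews and responses above repeatedly discuss the submission's core mechanism: compute the perplexity of an incoming prompt with GPT-2 and flag prompts whose perplexity is anomalously high, since GCG-style adversarial suffixes tend to be unreadable. The sketch below is a minimal illustration of such a perplexity filter appended for reference only; it is not the authors' implementation. It assumes the Hugging Face transformers and torch packages, and the "gpt2" checkpoint and threshold value are placeholder choices rather than the submission's configuration.

# Minimal sketch of a perplexity-based prompt filter (illustrative only).
# Assumes: pip install torch transformers. The threshold is a placeholder.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(prompt: str) -> float:
    # Perplexity = exp(mean negative log-likelihood of the token sequence).
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean cross-entropy over tokens
    return torch.exp(loss).item()

def flag_prompt(prompt: str, threshold: float = 1000.0) -> bool:
    # Flag prompts whose perplexity exceeds the threshold as likely to carry
    # a GCG-style (unreadable) adversarial suffix.
    return perplexity(prompt) > threshold

As reviewer hkGN's third weakness notes, a long benign prefix can dilute the suffix's effect on whole-prompt perplexity, which is why the submission also trains a classifier that combines perplexity with token sequence length rather than relying on a single threshold.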