Awesome-Jailbreak-on-LLMs

Awesome-Jailbreak-on-LLMs is a collection of state-of-the-art, novel, and exciting jailbreak methods on LLMs. It contains papers, code, datasets, evaluations, and analyses. Contributions on anything related to jailbreaks are welcome via PRs or issues, and we are glad to add you to the contributor list here. If you run into any problems, please contact [email protected]. If you find this repository useful to your research or work, we would really appreciate it if you starred this repository and cited our papers here. ✨

Bookmarks

Papers

Jailbreak Attack

Black-box Attack

Time Title Venue Paper Code
2024.11 Playing Language Game with LLMs Leads to Jailbreaking arXiv link link
2024.11 GASP: Efficient Black-Box Generation of Adversarial Suffixes for Jailbreaking LLMs (GASP) arXiv link link
2024.11 LLM STINGER: Jailbreaking LLMs using RL fine-tuned LLMs arXiv link -
2024.11 SequentialBreak: Large Language Models Can be Fooled by Embedding Jailbreak Prompts into Sequential Prompt arXiv link link
2024.11 Diversity Helps Jailbreak Large Language Models arXiv link -
2024.11 Plentiful Jailbreaks with String Compositions arXiv link -
2024.11 Transferable Ensemble Black-box Jailbreak Attacks on Large Language Models arXiv link link
2024.11 Stealthy Jailbreak Attacks on Large Language Models via Benign Data Mirroring arXiv link -
2024.10 FlipAttack: Jailbreak LLMs via Flipping (FlipAttack) arXiv link link
2024.10 Endless Jailbreaks with Bijection arXiv link -
2024.10 Harnessing Task Overload for Scalable Jailbreak Attacks on Large Language Models arXiv link -
2024.10 You Know What I'm Saying: Jailbreak Attack via Implicit Reference arXiv link link
2024.10 Deciphering the Chaos: Enhancing Jailbreak Attacks via Adversarial Prompt Translation arXiv link link
2024.10 AutoDAN-Turbo: A Lifelong Agent for Strategy Self-Exploration to Jailbreak LLMs (AutoDAN-Turbo) arXiv link link
2024.10 PathSeeker: Exploring LLM Security Vulnerabilities with a Reinforcement Learning-Based Jailbreak Approach (PathSeeker) arXiv link -
2024.10 Read Over the Lines: Attacking LLMs and Toxicity Detection Systems with ASCII Art to Mask Profanity arXiv link link
2024.09 AdaPPA: Adaptive Position Pre-Fill Jailbreak Attack Approach Targeting LLMs arXiv link link
2024.09 Effective and Evasive Fuzz Testing-Driven Jailbreaking Attacks against LLMs arXiv link -
2024.09 Jailbreaking Large Language Models with Symbolic Mathematics arXiv link -
2024.08 Advancing Adversarial Suffix Transfer Learning on Aligned Large Language Models arXiv link -
2024.08 Hide Your Malicious Goal Into Benign Narratives: Jailbreak Large Language Models through Neural Carrier Articles arXiv link -
2024.08 h4rm3l: A Dynamic Benchmark of Composable Jailbreak Attacks for LLM Safety Assessment (h4rm3l) arXiv link link
2024.08 EnJa: Ensemble Jailbreak on Large Language Models (EnJa) arXiv link -
2024.07 Knowledge-to-Jailbreak: One Knowledge Point Worth One Attack arXiv link link
2024.07 Single Character Perturbations Break LLM Alignment arXiv link link
2024.07 A False Sense of Safety: Unsafe Information Leakage in 'Safe' AI Responses arXiv link -
2024.07 Virtual Context: Enhancing Jailbreak Attacks with Special Token Injection (Virtual Context) arXiv link -
2024.07 SoP: Unlock the Power of Social Facilitation for Automatic Jailbreak Attack (SoP) arXiv link link
2024.06 Improved Few-Shot Jailbreaking Can Circumvent Aligned Language Models and Their Defenses (I-FSJ) NeurIPS'24 link link
2024.06 When LLM Meets DRL: Advancing Jailbreaking Efficiency via DRL-guided Search (RLbreaker) NeurIPS'24 link -
2024.06 Agent Smith: A Single Image Can Jailbreak One Million Multimodal LLM Agents Exponentially Fast (Agent Smith) ICML'24 link link
2024.06 Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation ICML'24 link -
2024.06 ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs (ArtPrompt) ACL'24 link link
2024.06 From Noise to Clarity: Unraveling the Adversarial Suffix of Large Language Model Attacks via Translation of Text Embeddings (ASETF) arXiv link -
2024.06 CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion (CodeAttack) arXiv link -
2024.06 Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction (DRA) USENIX Security'24 link link
2024.06 AutoJailbreak: Exploring Jailbreak Attacks and Defenses through a Dependency Lens (AutoJailbreak) arXiv link -
2024.06 Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks arXiv link link
2024.06 GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts (GPTFUZZER) arXiv link link
2024.06 A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily (ReNeLLM) NAACL'24 link link
2024.06 QROA: A Black-Box Query-Response Optimization Attack on LLMs (QROA) arXiv link link
2024.06 Poisoned LangChain: Jailbreak LLMs by LangChain (PLC) arXiv link link
2024.05 Multilingual Jailbreak Challenges in Large Language Models ICLR'24 link link
2024.05 DeepInception: Hypnotize Large Language Model to Be Jailbreaker (DeepInception) arXiv link link
2024.05 GPT-4 Jailbreaks Itself with Near-Perfect Success Using Self-Explanation (IRIS) ACL'24 link -
2024.05 GUARD: Role-playing to Generate Natural-language Jailbreakings to Test Guideline Adherence of LLMs (GUARD) arXiv link -
2024.05 "Do Anything Now": Characterizing and Evaluating In-The-Wild Jailbreak Prompts on Large Language Models (DAN) CCS'24 link link
2024.05 GPT-4 Is Too Smart to Be Safe: Stealthy Chat with LLMs via Cipher (SelfCipher) ICLR'24 link link
2024.05 Jailbreaking Large Language Models Against Moderation Guardrails via Cipher Characters (JAM) NeurIPS'24 link -
2024.05 Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations (ICA) arXiv link -
2024.04 Many-shot jailbreaking (MSJ) NeurIPS'24 Anthropic link -
2024.04 PANDORA: Detailed LLM jailbreaking via collaborated phishing agents with decomposed reasoning (PANDORA) ICLR Workshop'24 link -
2024.04 FuzzLLM: A Novel and Universal Fuzzing Framework for Proactively Discovering Jailbreak Vulnerabilities in Large Language Models (FuzzLLM) ICASSP'24 link link
2024.04 Sandwich Attack: Multi-language Mixture Adaptive Attack on LLMs (Sandwich Attack) arXiv link -
2024.03 TASTLE: Distract Large Language Models for Automatic Jailbreak Attack (TASTLE) arXiv link -
2024.03 DrAttack: Prompt Decomposition and Reconstruction Makes Powerful LLM Jailbreakers (DrAttack) arXiv link link
2024.02 PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails (PRP) arXiv link -
2024.02 CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models (CodeChameleon) arXiv link link
2024.02 PAL: Proxy-Guided Black-Box Attack on Large Language Models (PAL) arXiv link link
2024.02 Jailbreaking Proprietary Large Language Models using Word Substitution Cipher arXiv link -
2024.02 Query-Based Adversarial Prompt Generation arXiv link -
2024.02 Leveraging the Context through Multi-Round Interactions for Jailbreaking Attacks (Contextual Interaction Attack) arXiv link -
2024.02 Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs (SMJ) arXiv link -
2024.02 Cognitive Overload: Jailbreaking Large Language Models with Overloaded Logical Thinking NAACL'24 link link
2024.01 Low-Resource Languages Jailbreak GPT-4 NeurIPS Workshop'24 link -
2024.01 How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs (PAP) arXiv link link
2023.12 Tree of Attacks: Jailbreaking Black-Box LLMs Automatically (TAP) NeurIPS'24 link link
2023.12 Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs arXiv link -
2023.12 Ignore This Title and HackAPrompt: Exposing Systemic Vulnerabilities of LLMs through a Global Scale Prompt Hacking Competition ACL'24 link -
2023.11 Scalable and Transferable Black-Box Jailbreaks for Language Models via Persona Modulation (Persona) NeurIPS Workshop'23 link -
2023.10 Jailbreaking Black Box Large Language Models in Twenty Queries (PAIR) arXiv link link
2023.10 Adversarial Demonstration Attacks on Large Language Models (advICL) arXiv link -
2023.10 MASTERKEY: Automated Jailbreaking of Large Language Model Chatbots (MASTERKEY) NDSS'24 link -
2023.10 Attack Prompt Generation for Red Teaming and Defending Large Language Models (SAP) EMNLP'23 link link
2023.10 An LLM can Fool Itself: A Prompt-Based Adversarial Attack (PromptAttack) ICLR'24 link link
2023.09 Multi-step Jailbreaking Privacy Attacks on ChatGPT (MJP) EMNLP Findings'23 link link
2023.09 Open Sesame! Universal Black Box Jailbreaking of Large Language Models (GA) arXiv link -
2023.05 Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection arXiv link link
2022.11 Ignore Previous Prompt: Attack Techniques For Language Models (PromptInject) NeurIPS Workshop'22 link link

White-box Attack

Time Title Venue Paper Code
2024.11 AmpleGCG-Plus: A Strong Generative Model of Adversarial Suffixes to Jailbreak LLMs with Higher Success Rates in Fewer Attempts arXiv link -
2024.11 DROJ: A Prompt-Driven Attack against Large Language Models arXiv link link
2024.11 SQL Injection Jailbreak: a structural disaster of large language models arXiv link link
2024.10 Functional Homotopy: Smoothing Discrete Optimization via Continuous Parameters for LLM Jailbreak Attacks arXiv link -
2024.10 AttnGCG: Enhancing Jailbreaking Attacks on LLMs with Attention Manipulation arXiv link link
2024.10 Jailbreak Instruction-Tuned LLMs via end-of-sentence MLP Re-weighting arXiv link -
2024.10 Boosting Jailbreak Transferability for Large Language Models (SI-GCG) arXiv link -
2024.10 Iterative Self-Tuning LLMs for Enhanced Jailbreaking Capabilities (ADV-LLM) arXiv link link
2024.08 Probing the Safety Response Boundary of Large Language Models via Unsafe Decoding Path Generation (JVD) arXiv link -
2024.08 Jailbreak Open-Sourced Large Language Models via Enforced Decoding (EnDec) ACL'24 link -
2024.07 Refusal in Language Models Is Mediated by a Single Direction arXiv link link
2024.07 Revisiting Character-level Adversarial Attacks for Language Models ICML'24 link link
2024.07 Badllama 3: removing safety finetuning from Llama 3 in minutes (Badllama 3) arXiv link -
2024.07 SOS! Soft Prompt Attack Against Open-Source Large Language Models arXiv link -
2024.06 COLD-Attack: Jailbreaking LLMs with Stealthiness and Controllability (COLD-Attack) ICML'24 link link
2024.06 Improved Techniques for Optimization-Based Jailbreaking on Large Language Models (I-GCG) arXiv link link
2024.05 Semantic-guided Prompt Organization for Universal Goal Hijacking against LLMs arXiv link -
2024.05 Efficient LLM Jailbreak via Adaptive Dense-to-sparse Constrained Optimization NeurIPS'24 link -
2024.05 AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models (AutoDAN) ICLR'24 link link
2024.05 AmpleGCG: Learning a Universal and Transferable Generative Model of Adversarial Suffixes for Jailbreaking Both Open and Closed LLMs (AmpleGCG) arXiv link link
2024.05 Boosting jailbreak attack with momentum (MAC) ICLR Workshop'24 link link
2024.04 AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs (AdvPrompter) arXiv link link
2024.03 Universal Jailbreak Backdoors from Poisoned Human Feedback ICLR'24 link -
2024.02 Attacking large language models with projected gradient descent (PGD) arXiv link -
2024.02 Open the Pandora's Box of LLMs: Jailbreaking LLMs through Representation Engineering (JRE) arXiv link -
2024.02 Curiosity-driven red-teaming for large language models (CRT) arXiv link link
2023.12 AutoDAN: Interpretable Gradient-Based Adversarial Attacks on Large Language Models (AutoDAN) arXiv link link
2023.10 Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation ICLR'24 link link
2023.07 Universal and Transferable Adversarial Attacks on Aligned Language Models (GCG) arXiv link link
2023.06 Automatically Auditing Large Language Models via Discrete Optimization (ARCA) ICML'23 link link

Multi-turn Attack

Time Title Venue Paper Code
2024.11 MRJ-Agent: An Effective Jailbreak Agent for Multi-Round Dialogue arXiv link -
2024.10 Jigsaw Puzzles: Splitting Harmful Questions to Jailbreak Large Language Models (JSP) arXiv link link
2024.10 Multi-round Jailbreak Attack on Large Language Models arXiv link -
2024.10 Derail Yourself: Multi-turn LLM Jailbreak Attack through Self-discovered Clues arXiv link link
2024.09 LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet arXiv link link
2024.09 RED QUEEN: Safeguarding Large Language Models against Concealed Multi-Turn Jailbreaking arXiv link link
2024.08 Emerging Vulnerabilities in Frontier Models: Multi-Turn Jailbreak Attacks arXiv link link
2024.05 CoA: Context-Aware based Chain of Attack for Multi-Turn Dialogue LLM (CoA) arXiv link link
2024.04 Great, Now Write an Article About That: The Crescendo Multi-Turn LLM Jailbreak Attack (Crescendo) Microsoft Azure link -

Attack on RAG-based LLM

Time Title Venue Paper Code
2024.09 Unleashing Worms and Extracting Data: Escalating the Outcome of Attacks against RAG-based Inference in Scale and Severity Using Jailbreaking arXiv link link
2024.02 Pandora: Jailbreak GPTs by Retrieval Augmented Generation Poisoning (Pandora) arXiv link -

Multi-modal Attack

Time Title Venue Paper Code
2024.11 Jailbreak Attacks and Defenses against Multimodal Generative Models: A Survey arXiv link link
2024.10 Chain-of-Jailbreak Attack for Image Generation Models via Editing Step by Step arXiv link -
2024.10 ColJailBreak: Collaborative Generation and Editing for Jailbreaking Text-to-Image Deep Generation NeurIPS'24 link -
2024.08 Jailbreaking Text-to-Image Models with LLM-Based Agents (Atlas) arXiv link -
2024.07 Image-to-Text Logic Jailbreak: Your Imagination can Help You Do Anything arXiv link -
2024.06 Jailbreak Vision Language Models via Bi-Modal Adversarial Prompt arXiv link link
2024.05 Voice Jailbreak Attacks Against GPT-4o arXiv link link
2024.05 Automatic Jailbreaking of the Text-to-Image Generative AI Systems ICML'24 Workshop link link
2024.04 Image Hijacks: Adversarial Images Can Control Generative Models at Runtime arXiv link link
2024.03 An Image is Worth 1000 Lies: Adversarial Transferability across Prompts on Vision-Language Models (CroPA) ICLR'24 link link
2024.03 Jailbreak in Pieces: Compositional Adversarial Attacks on Multi-modal Language Models ICLR'24 link -
2024.03 Rethinking Model Ensemble in Transfer-based Adversarial Attacks ICLR'24 link link
2024.03 Visual Adversarial Examples Jailbreak Aligned Large Language Models AAAI'24 link -
2024.02 VLATTACK: Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models NeurIPS'23 link link
2024.02 Jailbreaking Attack against Multimodal Large Language Model arXiv link -
2024.01 Jailbreaking GPT-4V via Self-Adversarial Attacks with System Prompts arXiv link -
2023.12 OT-Attack: Enhancing Adversarial Transferability of Vision-Language Models via Optimal Transport Optimization (OT-Attack) arXiv link -
2023.12 FigStep: Jailbreaking Large Vision-language Models via Typographic Visual Prompts (FigStep) arXiv link link
2023.11 On Evaluating Adversarial Robustness of Large Vision-Language Models NeurIPS'23 link link
2023.10 How Robust is Google's Bard to Adversarial Image Attacks? arXiv link link
2023.08 AdvCLIP: Downstream-agnostic Adversarial Examples in Multimodal Contrastive Learning (AdvCLIP) ACM MM'23 link link
2023.07 Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models (SGA) ICCV'23 link link
2023.07 On the Adversarial Robustness of Multi-Modal Foundation Models ICCV Workshop'23 link -
2022.10 Towards Adversarial Attack on Vision-Language Pre-training Models arXiv link link

Jailbreak Defense

Learning-based Defense

Time Title Venue Paper Code
2024.10 MoJE: Mixture of Jailbreak Experts, Naive Tabular Classifiers as Guard for Prompt Attacks AAAI'24 link -
2024.08 BaThe: Defense against the Jailbreak Attack in Multimodal Large Language Models by Treating Harmful Instruction as Backdoor Trigger (BaThe) arXiv link -
2024.07 DART: Deep Adversarial Automated Red Teaming for LLM Safety arXiv link -
2024.07 Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge (Eraser) arXiv link link
2024.07 Safe Unlearning: A Surprisingly Effective and Generalizable Solution to Defend Against Jailbreak Attacks arXiv link link
2024.06 Adversarial Tuning: Defending Against Jailbreak Attacks for LLMs arXiv link -
2024.06 Jatmo: Prompt Injection Defense by Task-Specific Finetuning (Jatmo) arXiv link link
2024.06 Defending Large Language Models Against Jailbreaking Attacks Through Goal Prioritization ACL'24 link link
2024.06 Mitigating Fine-tuning based Jailbreak Attack with Backdoor Enhanced Safety Alignment NeurIPS'24 link link
2024.06 On Prompt-Driven Safeguarding for Large Language Models (DRO) ICML'24 link link
2024.06 Robust Prompt Optimization for Defending Language Models Against Jailbreaking Attacks (RPO) NeurIPS'24 link -
2024.06 Fight Back Against Jailbreaking via Prompt Adversarial Tuning (PAT) NeurIPS'24 link link
2024.05 Towards Comprehensive and Efficient Post Safety Alignment of Large Language Models via Safety Patching (SAFEPATCHING) arXiv link -
2024.05 Detoxifying Large Language Models via Knowledge Editing (DINM) ACL'24 link link
2024.05 Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing arXiv link link
2023.11 MART: Improving LLM Safety with Multi-round Automatic Red-Teaming (MART) ACL'24 link -
2023.11 Baseline defenses for adversarial attacks against aligned language models arXiv link -
2023.10 Safe RLHF: Safe Reinforcement Learning from Human Feedback arXiv link link
2023.08 Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment (RED-INSTRUCT) arXiv link link
2022.04 Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback Anthropic link -

Strategy-based Defense

Time Title Venue Paper Code
2024.11 Rapid Response: Mitigating LLM Jailbreaks with a Few Examples arXiv link link
2024.10 RePD: Defending Jailbreak Attack through a Retrieval-based Prompt Decomposition Process (RePD) arXiv link -
2024.10 Guide for Defense (G4D): Dynamic Guidance for Robust and Balanced Defense in Large Language Models (G4D) arXiv link link
2024.10 Jailbreak Antidote: Runtime Safety-Utility Balance via Sparse Representation Adjustment in Large Language Models arXiv link -
2024.09 HSF: Defending against Jailbreak Attacks with Hidden State Filtering arXiv link link
2024.08 EEG-Defender: Defending against Jailbreak through Early Exit Generation of Large Language Models (EEG-Defender) arXiv link -
2024.08 Prefix Guidance: A Steering Wheel for Large Language Models to Defend Against Jailbreak Attacks (PG) arXiv link link
2024.08 Self-Evaluation as a Defense Against Adversarial Attacks on LLMs (Self-Evaluation) arXiv link link
2024.06 Defending LLMs against Jailbreaking Attacks via Backtranslation (Backtranslation) ACL Findings'24 link link
2024.06 SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding (SafeDecoding) ACL'24 link link
2024.06 Defending Against Alignment-Breaking Attacks via Robustly Aligned LLM arXiv link -
2024.06 A Wolf in Sheep’s Clothing: Generalized Nested Jailbreak Prompts can Fool Large Language Models Easily (ReNeLLM) NAACL'24 link link
2024.06 SMOOTHLLM: Defending Large Language Models Against Jailbreaking Attacks arXiv link link
2024.05 Enhancing Large Language Models Against Inductive Instructions with Dual-critique Prompting (Dual-critique) arXiv link link
2024.05 PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition (PARDEN) ICML'24 link link
2024.05 LLM Self Defense: By Self Examination, LLMs Know They Are Being Tricked ICLR Tiny Paper'24 link link
2024.05 GradSafe: Detecting Unsafe Prompts for LLMs via Safety-Critical Gradient Analysis (GradSafe) ACL'24 link link
2024.05 Multilingual Jailbreak Challenges in Large Language Models ICLR'24 link link
2024.05 Gradient Cuff: Detecting Jailbreak Attacks on Large Language Models by Exploring Refusal Loss Landscapes NeurIPS'24 link -
2024.05 AutoDefense: Multi-Agent LLM Defense against Jailbreak Attacks arXiv link link
2024.05 Bergeron: Combating adversarial attacks through a conscience-based alignment framework (Bergeron) arXiv link link
2024.05 Jailbreak and Guard Aligned Language Models with Only Few In-Context Demonstrations (ICD) arXiv link -
2024.04 Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning arXiv link link
2024.02 Certifying LLM Safety against Adversarial Prompting arXiv link link
2024.02 Break the Breakout: Reinventing LM Defense Against Jailbreak Attacks with Self-Refinement arXiv link -
2024.02 Defending large language models against jailbreak attacks via semantic smoothing (SEMANTICSMOOTH) arXiv link link
2024.01 Intention Analysis Makes LLMs A Good Jailbreak Defender (IA) arXiv link link
2024.01 How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs (PAP) arXiv link link
2023.12 Defending ChatGPT against jailbreak attack via self-reminders (Self-Reminder) Nature Machine Intelligence link link
2023.11 Detecting Language Model Attacks with Perplexity arXiv link -
2023.10 RAIN: Your Language Models Can Align Themselves without Finetuning (RAIN) ICLR'24 link link

Guard Model

Time Title Venue Paper Code
2024.11 AEGIS2.0: A Diverse AI Safety Dataset and Risks Taxonomy for Alignment of LLM Guardrails (Aegis2.0) Nvidia, NeurIPS'24 Workshop link -
2024.11 STAND-Guard: A Small Task-Adaptive Content Moderation Model (STAND-Guard) Microsoft link -
2024.10 VLMGuard: Defending VLMs against Malicious Prompts via Unlabeled Data arXiv link -
2024.09 AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts (Aegis) Nvidia link link
2024.09 Llama 3.2: Revolutionizing edge AI and vision with open, customizable models (LLaMA Guard 3) Meta link link
2024.08 ShieldGemma: Generative AI Content Moderation Based on Gemma (ShieldGemma) Google link link
2024.07 WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs (WildGuard) NeurIPS'24 link link
2024.06 R2-Guard: Robust Reasoning Enabled LLM Guardrail via Knowledge-Enhanced Logical Reasoning (R2-Guard) arXiv link link
2024.04 Llama Guard 2 Meta link link
2024.03 AdaShield: Safeguarding Multimodal Large Language Models from Structure-based Attack via Adaptive Shield Prompting (AdaShield) ECCV'24 link link
2023.12 Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations (LLaMA Guard) Meta link link

Moderation API

Time Title Venue Paper Code
2023.08 Using GPT-4 for content moderation (GPT-4) OpenAI link -
2023.02 A Holistic Approach to Undesired Content Detection in the Real World (OpenAI Moderation Endpoint) AAAI OpenAI link link
2022.02 A New Generation of Perspective API: Efficient Multilingual Character-level Transformers (Perspective API) KDD Google link link
- Azure AI Content Safety Microsoft Azure - link
- Detoxify unitary.ai - link

Evaluation & Analysis

Time Title Venue Paper Code
2024.11 Global Challenge for Safe and Secure LLMs Track 1 arXiv link -
2024.11 JailbreakLens: Interpreting Jailbreak Mechanism in the Lens of Representation and Circuit arXiv link -
2024.11 The VLLM Safety Paradox: Dual Ease in Jailbreak Attack and Defense arXiv link -
2024.11 HarmLevelBench: Evaluating Harm-Level Compliance and the Impact of Quantization on Model Alignment arXiv link -
2024.11 ChemSafetyBench: Benchmarking LLM Safety on Chemistry Domain arXiv link link
2024.11 GuardBench: A Large-Scale Benchmark for Guardrail Models EMNLP'24 link link
2024.11 What Features in Prompts Jailbreak LLMs? Investigating the Mechanisms Behind Attacks arXiv link link
2024.11 Benchmarking LLM Guardrails in Handling Multilingual Toxicity arXiv link link
2024.10 Do LLMs Have Political Correctness? Analyzing Ethical Biases and Jailbreak Vulnerabilities in AI Systems arXiv link link
2024.10 A Realistic Threat Model for Large Language Model Jailbreaks arXiv link link
2024.10 Adversarial Suffixes May Be Features Too! arXiv link link
2024.09 JAILJUDGE: A Comprehensive Jailbreak arXiv link link
2024.09 Multimodal Pragmatic Jailbreak on Text-to-image Models arXiv link link
2024.08 ShieldGemma: Generative AI Content Moderation Based on Gemma (ShieldGemma) arXiv link link
2024.08 MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Vision Language Models (MMJ-Bench) arXiv link link
2024.08 Mission Impossible: A Statistical Perspective on Jailbreaking LLMs NeurIPS'24 link -
2024.07 Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs) arXiv link link
2024.07 JailBreakV-28K: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks arXiv link link
2024.07 Jailbreak Attacks and Defenses Against Large Language Models: A Survey arXiv link -
2024.06 "Not Aligned" is Not "Malicious": Being Careful about Hallucinations of Large Language Models' Jailbreak arXiv link link
2024.06 WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models (WildTeaming) NeurIPS'24 link link
2024.06 From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking arXiv link -
2024.06 AI Agents Under Threat: A Survey of Key Security Challenges and Future Pathways arXiv link -
2024.06 MM-SafetyBench: A Benchmark for Safety Evaluation of Multimodal Large Language Models (MM-SafetyBench) arXiv link -
2024.06 ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs (VITC) ACL'24 link link
2024.06 Bag of Tricks: Benchmarking of Jailbreak Attacks on LLMs NeurIPS'24 link link
2024.06 JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models (JailbreakZoo) arXiv link link
2024.06 Fundamental limitations of alignment in large language models arXiv link -
2024.06 JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models (JailbreakBench) NeurIPS'24 link link
2024.06 Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis arXiv link link
2024.06 JailbreakEval: An Integrated Toolkit for Evaluating Jailbreak Attempts Against Large Language Models (JailbreakEval) arXiv link link
2024.05 Rethinking How to Evaluate Language Model Jailbreak arXiv link link
2024.05 Enhancing Large Language Models Against Inductive Instructions with Dual-critique Prompting (INDust) arXiv link link
2024.05 Prompt Injection Attack against LLM-integrated Applications arXiv link -
2024.05 Tricking LLMs into Disobedience: Formalizing, Analyzing, and Detecting Jailbreaks LREC-COLING'24 link link
2024.05 LLM Jailbreak Attack versus Defense Techniques--A Comprehensive Study NDSS'24 link -
2024.05 Jailbreaking ChatGPT via Prompt Engineering: An Empirical Study arXiv link -
2024.05 Detoxifying Large Language Models via Knowledge Editing (SafeEdit) ACL'24 link link
2024.04 JailbreakLens: Visual Analysis of Jailbreak Attacks Against Large Language Models (JailbreakLens) arXiv link -
2024.03 How (Un)ethical Are Instruction-centric Responses of LLMs? Unveiling the Vulnerabilities of Safety Guardrails to Harmful Queries (TECHHAZARDQA) arXiv link link
2024.03 Don’t Listen To Me: Understanding and Exploring Jailbreak Prompts of Large Language Models USENIX Security link -
2024.03 EasyJailbreak: A Unified Framework for Jailbreaking Large Language Models (EasyJailbreak) arXiv link link
2024.02 Comprehensive Assessment of Jailbreak Attacks Against LLMs arXiv link -
2024.02 SPML: A DSL for Defending Language Models Against Prompt Attacks arXiv link -
2024.02 Coercing LLMs to do and reveal (almost) anything arXiv link -
2024.02 A STRONGREJECT for Empty Jailbreaks (StrongREJECT) NeurIPS'24 link link
2024.02 ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages ACL'24 link link
2024.02 HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal (HarmBench) arXiv link link
2023.12 Goal-Oriented Prompt Attack and Safety Evaluation for LLMs arXiv link link
2023.12 The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness arXiv link -
2023.12 A Comprehensive Survey of Attack Techniques, Implementation, and Mitigation Strategies in Large Language Models UbiSec'23 link -
2023.11 Summon a Demon and Bind it: A Grounded Theory of LLM Red Teaming in the Wild arXiv link -
2023.11 How Many Unicorns Are in This Image? A Safety Evaluation Benchmark for Vision LLMs arXiv link link
2023.11 Exploiting Large Language Models (LLMs) through Deception Techniques and Persuasion Principles arXiv link -
2023.10 Explore, Establish, Exploit: Red Teaming Language Models from Scratch arXiv link -
2023.10 Survey of Vulnerabilities in Large Language Models Revealed by Adversarial Attacks arXiv link -
2023.10 Fine-tuning aligned language models compromises safety, even when users do not intend to! (HEx-PHI) ICLR'24 (oral) link link
2023.08 Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment (RED-EVAL) arXiv link link
2023.08 Use of LLMs for Illicit Purposes: Threats, Prevention Measures, and Vulnerabilities arXiv link -
2023.08 From ChatGPT to ThreatGPT: Impact of Generative AI in Cybersecurity and Privacy IEEE Access link -
2023.07 Jailbroken: How Does LLM Safety Training Fail? (Jailbroken) NeurIPS'23 link -
2023.07 LLM Censorship: A Machine Learning Challenge or a Computer Security Problem? arXiv link -
2023.07 Universal and Transferable Adversarial Attacks on Aligned Language Models (AdvBench) arXiv link link
2023.06 DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models NeurIPS'23 link link
2023.04 Safety Assessment of Chinese Large Language Models arXiv link link
2023.02 Exploiting Programmatic Behavior of LLMs: Dual-Use Through Standard Security Attacks arXiv link -
2022.11 Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned arXiv link -
2022.02 Red Teaming Language Models with Language Models arXiv link -

Application

Time Title Venue Paper Code
2024.11 Attacking Vision-Language Computer Agents via Pop-ups arXiv link link
2024.10 Jailbreaking LLM-Controlled Robots (ROBOPAIR) arXiv link link
2024.10 SMILES-Prompting: A Novel Approach to LLM Jailbreak Attacks in Chemical Synthesis arXiv link link
2024.10 Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates arXiv link link
2024.09 RoleBreak: Character Hallucination as a Jailbreak Attack in Role-Playing Systems arXiv link -
2024.08 A Jailbroken GenAI Model Can Cause Substantial Harm: GenAI-powered Applications are Vulnerable to PromptWares (APwT) arXiv link -

Other Related Awesome Repositories

Reference

If you find this repository helpful to your research, we would really appreciate it if you cited our papers. ✨

@article{liuyue_FlipAttack,
  title={FlipAttack: Jailbreak LLMs via Flipping},
  author={Liu, Yue and He, Xiaoxin and Xiong, Miao and Fu, Jinlan and Deng, Shumin and Hooi, Bryan},
  journal={arXiv preprint arXiv:2410.02832},
  year={2024}
}

Contributors

yueliu1999 bhooi zqypku jiaxiaojunQAQ Huang-yihao csyuhao xszheng2020 dapurv5 ZYQ-Zoey77 mdoumbouya xyliugo zky001

