diff --git a/posts/circuit_breaking.ipynb b/posts/circuit_breaking.ipynb
index 83f16c3..f46d16e 100644
--- a/posts/circuit_breaking.ipynb
+++ b/posts/circuit_breaking.ipynb
@@ -80,7 +80,7 @@
    "source": [
     "## High refusal rates on harmless prompts\n",
     "\n",
-    "In the circuit breakers paper, the authors demonstrate that performance on the MT Bench and MMLU benchmarks is maintained. They also measure an increase from 2 to 6% refusal on the WildChat dataset. We add an evaluation on the OR-Bench dataset [@cui2024orbenchoverrefusalbenchmarklarge]. OR-Bench is intended to investigate prompts closer to the boundary between toxic and non-toxic. `or-bench-80k` mostly consists of prompts that are clearly non-toxic but often include terms or concepts that might remind a jumpy model of a toxic prompt. In comparison, `or-bench-hard-1k` is even more ambiguous and many of the prompts might be considered to be toxic. We use a random 200 prompt subset of the toxic prompts and the non-toxic prompts. We evaluate refusal by evaluating with gpt-4o. "
+    "In the circuit breakers paper, the authors demonstrate that performance on the MT Bench and MMLU benchmarks is maintained. They also measure an increase in refusal rate from 2% to 6% on the WildChat dataset. We add an evaluation on the OR-Bench dataset [@cui2024orbenchoverrefusalbenchmarklarge]. OR-Bench is intended to investigate prompts closer to the boundary between toxic and non-toxic. `or-bench-80k` mostly consists of prompts that are clearly non-toxic but often include terms or concepts that might remind a jumpy model of a toxic prompt. In comparison, `or-bench-hard-1k` is even more ambiguous and many of its prompts might be considered toxic. We use a random 200-prompt subset of the toxic prompts and the non-toxic prompts. We evaluate refusal with gpt-4o. "
    ]
   },
   {
@@ -322,9 +322,9 @@
    "cell_type": "markdown",
    "metadata": {},
    "source": [
-    "We see a large increase in refusal rate in all categories. On the or-bench-80k, the refusal rate goes from one of the lowest amongst frontier models to one of the highest. A change in refusal rate is important when evaluating the effectiveness of a defensive method because shifting the classification boundary between toxic and non-toxic prompts will make adversarial attacks harder even without any fancy method. We would prefer to see attack success rate decrease while simultaneously maintaining a low refusal rate on decidedly non-toxic prompts. \n",
+    "We see a large increase in refusal rate in all categories. On `or-bench-80k`, the refusal rate goes from one of the lowest amongst frontier models to one of the highest. A change in refusal rate matters when evaluating the effectiveness of a defensive method because simply shifting the classification boundary between toxic and non-toxic prompts makes adversarial attacks harder even without any sophisticated defense. We would prefer to see the attack success rate decrease while simultaneously maintaining a low refusal rate on decidedly non-toxic prompts. \n",
    "\n",
-    "The refusal analysis here is a model-specific evaluation rather than a general issue with the circuit breaker methodology. There might be small changes to the circuit breaker training methodology or improvements to the training process that could ameliorate this issue. \n",
+    "The refusal analysis here evaluates a specific model rather than the circuit breaker methodology in general. There might be changes to the circuit breaker training methodology or improvements to the training process that could ameliorate this issue. \n",
    "\n",
    "To compare against other models, we share a marked-up figure from the OR-Bench paper below.\n",
    "\n",