diff --git a/posts/TDC2023.html b/posts/TDC2023.html
index c186d18..60f19f6 100644
--- a/posts/TDC2023.html
+++ b/posts/TDC2023.html
@@ -301,7 +301,7 @@
Summary of Our Major Takeaways
\(-\textrm{mm}_{\omega} (-\log p(t_k | t_0, \ldots, t_{k-1}), \ldots, -\log p(t_{k+n} | t_0, \ldots, t_{k+n-1}))\)
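As a rough illustration of the mellowmax aggregation in the formula above, here is a minimal NumPy sketch. The `omega` value and the per-token negative log-probabilities are made-up placeholders; this is a sketch of the expression as written, not the competition code.

```python
import numpy as np

def mellowmax(x, omega):
    # mm_omega(x) = (1/omega) * log(mean(exp(omega * x)))
    # Interpolates between the mean (omega -> 0) and the max (omega -> inf).
    x = np.asarray(x, dtype=float)
    m = np.max(omega * x)  # subtract the max before exp() for numerical stability
    return (m + np.log(np.mean(np.exp(omega * x - m)))) / omega

# Hypothetical per-token negative log-probabilities
# -log p(t_k | t_0, ..., t_{k-1}), ..., -log p(t_{k+n} | t_0, ..., t_{k+n-1})
token_nlls = np.array([0.4, 2.1, 0.7, 3.5])

# The expression above, evaluated literally: the negated mellowmax of the token NLLs.
objective = -mellowmax(token_nlls, omega=1.0)
print(objective)
```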
Hyperparameter tuning of GCG was very useful. Compared to the default hyperparameters used in Zou et al. 2023, we reduced our average optimizer runtime by ~7x. The average time to force an output sequence on a single A100 40GB went from 120 seconds to 17 seconds.
-Presented benchmarks in some recent red-teaming & optimization papers can be misleading. Attacks with GCG performed well, better than we had expected.
+The benchmarks in some recent red-teaming & optimization papers can be misleading. Attacks with GCG performed well, better than we had expected.
Papers will often select a model/task combination that is very easy to red-team. Recent black-box adversarial attack papers that use GCG as a comparator method often use poor GCG hyperparameters, count computational costs unfairly, or select overly easy baselines.
- For example, the gradient-based AutoDAN-Zhu (Zhu et al. 2023) benchmarks appear favorable at a glance, but they omit well-safety-trained models like Llama-2-chat, and the appendix notes that the method struggles on it. Llama-2-chat seems to be one of the hardest models to crack.
@@ -398,7 +398,7 @@ \(u_i\) is now a scalar for each x; given a collection of such x’s, we can construct a z-score for each as \((u_i - \textrm{mean}(u_i))/\textrm{std}(u_i)\) and rank them.
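A minimal sketch of the z-scoring and ranking step described in that hunk, assuming `u` holds the scalar \(u_i\) values (the array contents below are placeholders, not data from the post):

```python
import numpy as np

# Placeholder scalar u_i values, one per example x
u = np.array([0.3, 1.9, -0.7, 2.4, 0.1])

# z-score each u_i against the dataset: (u_i - mean(u)) / std(u)
z = (u - u.mean()) / u.std()

# rank examples from highest to lowest z-score
ranking = np.argsort(-z)
print(z[ranking])
```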
@@ -718,7 +718,7 @@
+
diff --git a/posts/catalog.html b/posts/catalog.html
index 273114a..a70d28a 100644
--- a/posts/catalog.html
+++ b/posts/catalog.html
@@ -814,7 +814,7 @@ GitHub
});
-
+
diff --git a/posts/catalog.out.ipynb b/posts/catalog.out.ipynb
index 44b8e2f..0b5f67a 100644
--- a/posts/catalog.out.ipynb
+++ b/posts/catalog.out.ipynb
@@ -297,7 +297,7 @@
"Pythia-12B is miscalibrated on 20% of the bigrams and 45% of the\n",
"trigrams when we ask for prediction of $p \\geq 0.45$."
],
- "id": "f8435ac4-ab16-42b5-97e5-61de950e8321"
+ "id": "72ea1b0f-9434-4c67-903f-d78c5236b4bc"
},
{
"cell_type": "code",
@@ -313,7 +313,7 @@
}
],
"source": [],
- "id": "4fb84275-0c73-444d-9a06-724d3e3fe2da"
+ "id": "30d01822-25dd-4005-8691-c6d39286b28a"
},
{
"cell_type": "markdown",
@@ -377,7 +377,7 @@
"The dataset is available on Huggingface:\n",
"[pile_scan_4](https://huggingface.co/datasets/Confirm-Labs/pile_scan_4)"
],
- "id": "11074994-84b4-49df-a833-f6e03b064d7e"
+ "id": "390a8236-ac9a-4571-84bd-a904d3b687d8"
},
{
"cell_type": "code",
@@ -391,7 +391,7 @@
}
],
"source": [],
- "id": "3fe4554b-080a-4453-8a59-60cafaec7875"
+ "id": "91ecfef7-31c7-40fc-b0b8-80e33bba3244"
},
{
"cell_type": "markdown",
@@ -423,7 +423,7 @@
"Computational Linguistics, May 2022, pp. 95–136. doi:\n",
"[10.18653/v1/2022.bigscience-1.9](https://doi.org/10.18653/v1/2022.bigscience-1.9)."
],
- "id": "c1001ff8-ea42-40b8-92b1-91a0a39d6dbe"
+ "id": "e0211f42-cf01-42c7-b000-dde9e6d41909"
}
],
"nbformat": 4,
diff --git a/sitemap.xml b/sitemap.xml
index d78d305..905df04 100644
--- a/sitemap.xml
+++ b/sitemap.xml
@@ -2,18 +2,18 @@