diff --git a/.nojekyll b/.nojekyll
index 022c40e..82b97bd 100644
--- a/.nojekyll
+++ b/.nojekyll
@@ -1 +1 @@
-2ba09bbc
\ No newline at end of file
+7a8d4739
\ No newline at end of file
diff --git a/index.html b/index.html
index 62726ef..6d9b4b2 100644
--- a/index.html
+++ b/index.html
@@ -143,7 +143,7 @@
-
+ -
+ -
+

diff --git a/posts/TDC2023.html b/posts/TDC2023.html
index 9dee6a0..57c0ff1 100644
--- a/posts/TDC2023.html
+++ b/posts/TDC2023.html
@@ -347,6 +347,10 @@

\(s_2\). Take a completely unrelated known trigger-payload pair \((p_1, s_1)\), such that trigger \(p_1\) yields a different payload \(s_1\). Then, while optimizing for payload \(s_2\), initialize the optimization at the point \(x = p_1\). This turns out to speed up the process of finding a trigger for \(s_2\), often with far fewer iterations than if we had initialized with random tokens or text from the Pile.

Somehow, GCG’s first-order approximation (which it uses to select candidate mutations) is accurate enough to rapidly descend in this setting. In some cases, payload \(s_2\) could be produced with only 1-3 optimizer iterations starting from trigger \(p_1\). We were very surprised by this. Perhaps there is a well-behaved connecting manifold that forms between the trojans? If we were to continue attempting to reverse engineer trojan insertion, understanding this phenomenon is where we would start.

+
+

5.

+

For some additional details on our investigations, see Zygi’s personal site

+

Red Teaming Track Takeaways

@@ -718,7 +722,7 @@

+
diff --git a/posts/catalog.html b/posts/catalog.html
index 8cb04f8..968d64c 100644
--- a/posts/catalog.html
+++ b/posts/catalog.html
@@ -814,7 +814,7 @@

GitHub

});

-
+
diff --git a/posts/catalog.out.ipynb b/posts/catalog.out.ipynb
index 87c477b..e1af1c9 100644
--- a/posts/catalog.out.ipynb
+++ b/posts/catalog.out.ipynb
@@ -297,7 +297,7 @@
 "Pythia-12B is miscalibrated on 20% of the bigrams and 45% of the\n",
 "trigrams when we ask for prediction of $p \\geq 0.45$."
 ],
-"id": "4fbc51b1-902c-44b7-b3ba-46f4f33a1cf4"
+"id": "393b0a4d-96e5-4817-8f6d-8aa9874713ca"
 },
 {
 "cell_type": "code",
@@ -313,7 +313,7 @@
 }
 ],
 "source": [],
-"id": "4bbbb077-5235-4fb4-95e5-98b3cd9e12f2"
+"id": "d6345904-eaa8-46a9-8a84-6af0a96ede58"
 },
 {
 "cell_type": "markdown",
@@ -377,7 +377,7 @@
 "The dataset is available on Huggingface:\n",
 "[pile_scan_4](https://huggingface.co/datasets/Confirm-Labs/pile_scan_4)"
 ],
-"id": "7ac4275d-fadd-460b-a0e8-51109410d634"
+"id": "9ac7dedc-2219-4841-994f-f0e278648347"
 },
 {
 "cell_type": "code",
@@ -391,7 +391,7 @@
 }
 ],
 "source": [],
-"id": "46198176-f494-4cf6-8e7c-0530c5eac2a2"
+"id": "e45be69e-cb7d-446c-8379-e30e166022b6"
 },
 {
 "cell_type": "markdown",
@@ -423,7 +423,7 @@
 "Computational Linguistics, May 2022, pp. 95–136. doi:\n",
 "[10.18653/v1/2022.bigscience-1.9](https://doi.org/10.18653/v1/2022.bigscience-1.9)."
 ],
-"id": "7019dc8f-3e50-4e00-b3ec-ad1d99950649"
+"id": "1b4be320-18ff-4108-a568-431d308a4148"
 }
 ],
 "nbformat": 4,
diff --git a/sitemap.xml b/sitemap.xml
index 91b074b..f713dca 100644
--- a/sitemap.xml
+++ b/sitemap.xml
@@ -2,18 +2,18 @@
 https://confirmlabs.org/posts/catalog.html
-2024-01-12T12:17:44.093Z
+2024-01-13T11:53:39.873Z
 https://confirmlabs.org/posts/TDC2023.html
-2024-01-12T12:17:40.893Z
+2024-01-13T11:53:36.773Z
 https://confirmlabs.org/index.html
-2024-01-12T12:17:39.437Z
+2024-01-13T11:53:35.349Z
 https://confirmlabs.org/posts/fight_the_illusion.html
-2024-01-12T12:17:41.589Z
+2024-01-13T11:53:37.433Z