Commit

Built site for gh-pages
Quarto GHA Workflow Runner committed Jan 5, 2024
1 parent ad5b1b5 commit 17aa32d
Showing 9 changed files with 24 additions and 24 deletions.
2 changes: 1 addition & 1 deletion .nojekyll
@@ -1 +1 @@
2e6393ba
ffaf83a8
6 changes: 3 additions & 3 deletions index.html
@@ -143,7 +143,7 @@

<div class="quarto-listing quarto-listing-container-grid" id="listing-listing">
<div class="list grid quarto-listing-cols-3">
<div class="g-col-1" data-index="0" data-listing-date-sort="1701302400000" data-listing-file-modified-sort="1704417892950" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="7">
<div class="g-col-1" data-index="0" data-listing-date-sort="1701302400000" data-listing-file-modified-sort="1704417931478" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="7">
<a href="./posts/fight_the_illusion.html" class="quarto-grid-link">
<div class="quarto-grid-item card h-100 card-left">
<div class="listing-item-img-placeholder card-img-top" style="height: 150px;">&nbsp;</div>
@@ -166,7 +166,7 @@ <h5 class="no-anchor card-title listing-title">
</div>
</a>
</div>
<div class="g-col-1" data-index="1" data-listing-date-sort="1687651200000" data-listing-file-modified-sort="1704417892950" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="7">
<div class="g-col-1" data-index="1" data-listing-date-sort="1687651200000" data-listing-file-modified-sort="1704417931478" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="7">
<a href="./posts/catalog.html" class="quarto-grid-link">
<div class="quarto-grid-item card h-100 card-left">
<p class="card-img-top"><img src="posts/catalog_files/figure-html/cell-9-output-1.png" style="height: 150px;" class="thumbnail-image card-img"/></p>
@@ -189,7 +189,7 @@ <h5 class="no-anchor card-title listing-title">
</div>
</a>
</div>
<div class="g-col-1" data-index="2" data-listing-date-sort="1672790400000" data-listing-file-modified-sort="1704417892930" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="24">
<div class="g-col-1" data-index="2" data-listing-date-sort="1672790400000" data-listing-file-modified-sort="1704417931458" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="24">
<a href="./posts/TDC2023.html" class="quarto-grid-link">
<div class="quarto-grid-item card h-100 card-left">
<div class="listing-item-img-placeholder card-img-top" style="height: 150px;">&nbsp;</div>
2 changes: 1 addition & 1 deletion posts/TDC2023.html
@@ -309,7 +309,7 @@ <h4 class="anchored" data-anchor-id="nobody-found-the-intended-trojans-but-top-t
</section>
<section id="reverse-engineering-trojans-in-practice-seems-quite-hard." class="level4">
<h4 class="anchored" data-anchor-id="reverse-engineering-trojans-in-practice-seems-quite-hard.">2. <strong>Reverse Engineering Trojans “In Practice” Seems Quite Hard.</strong></h4>
<p>In the real world, if a competent actor hands you a model after a trojan insertion and cover-up process, you will lack many advantages given to TDC2023 competitors: knowing the exact list of bad outputs involved, knowing some triggers used in training, and having white-box access to the base model before fine-tuning. Without these advantages, trojan detection and reverse-engineering could even be impossible under suitable cryptographic hardness assumptions (see Goldwasser et al.&nbsp;2022). And per the above subsection, while competitors did very well at attacking, it seems no one managed a reliable technique for reverse-engineering. But, we don’t claim that reverse engineering is impossible. Mechanistic interpretability tools might give traction. And, simply detecting whether the model has been corrupted is likely much easier than reverse engineering. (This was not a focus of this year’s TDC).</p>
<p>In the real world, if a competent actor hands you a model after a trojan insertion and cover-up process, you will lack many advantages given to TDC2023 competitors: knowing the exact list of bad outputs involved, knowing some triggers used in training, and having white-box access to the base model before fine-tuning. Without these advantages, trojan detection and reverse-engineering could even be impossible under suitable cryptographic hardness assumptions (see Goldwasser et al.&nbsp;2022). And per the above subsection, while competitors did very well at attacking, it seems no one managed a reliable technique for reverse-engineering. But, we don’t claim that reverse engineering is impossible. Mechanistic interpretability tools might give traction. And, simply detecting whether the model has been corrupted is likely much easier than reverse engineering. (“Detection” in this sense was not a focus of this year’s TDC).</p>
</section>
<section id="the-tightness-of-a-trojan-insertion-can-be-measured." class="level4">
<h4 class="anchored" data-anchor-id="the-tightness-of-a-trojan-insertion-can-be-measured.">3. <strong>The “Tightness” of a Trojan Insertion Can be Measured.</strong></h4>
10 changes: 5 additions & 5 deletions posts/TDC2023.ipynb
@@ -11,7 +11,7 @@
"Michael Sklar \n",
"2023-01-04"
],
"id": "774a895c-74cf-4fe2-9044-cef891cd438b"
"id": "558e8cf0-9503-45bb-8204-3042ee2d3c87"
},
{
"cell_type": "raw",
@@ -37,7 +37,7 @@
"* Source doc: 6 ways to fight the Interpretability illusion\n",
"----->"
],
"id": "42280329-cab9-45b9-a6c7-02355cfe86cd"
"id": "e7ed9823-c190-4e91-8a7d-ea26c16672b1"
},
{
"cell_type": "markdown",
@@ -318,8 +318,8 @@
"reverse-engineering. But, we don’t claim that reverse engineering is\n",
"impossible. Mechanistic interpretability tools might give traction. And,\n",
"simply detecting whether the model has been corrupted is likely much\n",
"easier than reverse engineering. (This was not a focus of this year’s\n",
"TDC).\n",
"easier than reverse engineering. (“Detection” in this sense was not a\n",
"focus of this year’s TDC).\n",
"\n",
"#### 3. **The “Tightness” of a Trojan Insertion Can be Measured.**\n",
"\n",
@@ -637,7 +637,7 @@
" not recommend extrapolating these results far beyond the\n",
" experimental setting."
],
"id": "ed6c9f99-79a4-41ca-bc3a-c0d7c83322aa"
"id": "32ddb766-cb56-484d-a109-adf9c9c33866"
}
],
"nbformat": 4,
2 changes: 1 addition & 1 deletion posts/catalog.html
@@ -798,7 +798,7 @@ <h2 class="anchored" data-anchor-id="github">GitHub</h2>
});
</script>
</div> <!-- /content -->
<script>var lightboxQuarto = GLightbox({"descPosition":"bottom","openEffect":"zoom","closeEffect":"zoom","selector":".lightbox","loop":true});</script>
<script>var lightboxQuarto = GLightbox({"openEffect":"zoom","closeEffect":"zoom","loop":true,"descPosition":"bottom","selector":".lightbox"});</script>



10 changes: 5 additions & 5 deletions posts/catalog.out.ipynb
@@ -297,7 +297,7 @@
"Pythia-12B is miscalibrated on 20% of the bigrams and 45% of the\n",
"trigrams when we ask for prediction of $p \\geq 0.45$."
],
"id": "0a36614b-918b-47e1-a8ce-1d1ef2b2a67b"
"id": "c31b4b64-1b3b-4b6c-9152-bbf1771a8c01"
},
{
"cell_type": "code",
@@ -313,7 +313,7 @@
}
],
"source": [],
"id": "150346d1-2d10-4ada-bf80-b24f1f9e310e"
"id": "b2214f6e-337c-4b5f-8964-6c6261b47714"
},
{
"cell_type": "markdown",
@@ -375,7 +375,7 @@
"The dataset is available on Huggingface:\n",
"[pile_scan_4](https://huggingface.co/datasets/Confirm-Labs/pile_scan_4)"
],
"id": "0bf00037-4433-4ac2-a086-ce1750dd6cb5"
"id": "a15a471d-0612-4ed4-a70d-93bfdfe86b57"
},
{
"cell_type": "code",
@@ -389,7 +389,7 @@
}
],
"source": [],
"id": "0fa93d57-7d6b-4bd5-827a-5e1c34594954"
"id": "093b3435-ef7c-495a-ad61-7e790278e063"
},
{
"cell_type": "markdown",
@@ -419,7 +419,7 @@
"Charles Foster, Jason Phang, et al. 2020. “The Pile: An 800GB Dataset of\n",
"Diverse Text for Language Modeling.” *arXiv Preprint arXiv:2101.00027*."
],
"id": "d0df2b0b-49ae-4187-94ef-5df9cb7c598f"
"id": "bd86767b-f9bf-4e78-bfb1-f901b06e38bd"
}
],
"nbformat": 4,
6 changes: 3 additions & 3 deletions posts/fight_the_illusion.ipynb
@@ -9,7 +9,7 @@
"Michael Sklar \n",
"2023-11-30"
],
"id": "20844dc9-e5d1-492b-8886-3dfe991723ba"
"id": "5a560444-58dc-4b0a-9b13-e3985333375c"
},
{
"cell_type": "raw",
@@ -35,7 +35,7 @@
"* Source doc: 6 ways to fight the Interpretability illusion\n",
"----->"
],
"id": "9a129e7c-94f6-415f-b7da-15cbf98e5b1a"
"id": "e15d86ca-9bd4-431e-b638-54f5cf9803d2"
},
{
"cell_type": "markdown",
@@ -200,7 +200,7 @@
"Zygimantas Straznickas and others for conversations and feedback on\n",
"earlier drafts."
],
"id": "ff05e327-71ab-47d3-b51a-acb7e523a43e"
"id": "c6666971-61a9-405f-a2e2-d63a68351455"
}
],
"nbformat": 4,
2 changes: 1 addition & 1 deletion search.json

Large diffs are not rendered by default.

8 changes: 4 additions & 4 deletions sitemap.xml
@@ -2,18 +2,18 @@
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
<url>
<loc>https://confirmlabs.org/posts/catalog.html</loc>
<lastmod>2024-01-05T01:25:11.297Z</lastmod>
<lastmod>2024-01-05T01:25:48.234Z</lastmod>
</url>
<url>
<loc>https://confirmlabs.org/posts/TDC2023.html</loc>
<lastmod>2024-01-05T01:25:07.753Z</lastmod>
<lastmod>2024-01-05T01:25:44.658Z</lastmod>
</url>
<url>
<loc>https://confirmlabs.org/index.html</loc>
<lastmod>2024-01-05T01:25:05.513Z</lastmod>
<lastmod>2024-01-05T01:25:42.374Z</lastmod>
</url>
<url>
<loc>https://confirmlabs.org/posts/fight_the_illusion.html</loc>
<lastmod>2024-01-05T01:25:08.809Z</lastmod>
<lastmod>2024-01-05T01:25:45.734Z</lastmod>
</url>
</urlset>
