<!doctype html>
<html class="no-js" lang="en-us">
<head>
<meta name="generator" content="Hugo 0.71.0" />
<meta charset="utf-8">
<title>Risks (and Benefits) of Generative AI and Large Language Models</title>
<meta http-equiv="x-ua-compatible" content="ie=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link rel="stylesheet" href="https://llmrisks.github.io/css/foundation.min.css">
<link rel="stylesheet" href="https://llmrisks.github.io/css/highlight.min.css">
<link rel="stylesheet" href="https://llmrisks.github.io/css/font-awesome.min.css">
<link rel="stylesheet" href="https://llmrisks.github.io/css/academicons.min.css">
<link rel="stylesheet" href="https://llmrisks.github.io/css/fonts.css">
<link rel="stylesheet" href="https://llmrisks.github.io/css/finite.css">
<link rel="shortcut icon" href="/images/uva.png">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/[email protected]/dist/katex.min.css" integrity="sha384-AfEj0r4/OFrOo5t7NnNe46zW/tFgW6x/bCJG8FqQCEo3+Aro6EYUG4+cU+KJWu/X" crossorigin="anonymous">
<script defer src="https://cdn.jsdelivr.net/npm/[email protected]/dist/katex.min.js" integrity="sha384-g7c+Jr9ZivxKLnZTDUhnkOnsh30B4H0rpLUpJ4jAIKs4fnJI+sEnkvrMWph2EDg4" crossorigin="anonymous"></script>
<script defer src="https://cdn.jsdelivr.net/npm/[email protected]/dist/contrib/auto-render.min.js" integrity="sha384-mll67QQFJfxn0IYznZYonOWZ644AWYC+Pt2cHqMaRhXVrursRwvLnLaebdGIlYNa" crossorigin="anonymous"
onload="renderMathInElement(document.body);"></script>
<script>
document.addEventListener("DOMContentLoaded", function() {
renderMathInElement(document.body, {
delimiters: [
{left: "$$", right: "$$", display: true},
{left: "$", right: "$", display: false}
]
});
});
</script>
</head>
<body>
<header>
<nav class="nav-bar">
<div class="title-bar" data-responsive-toggle="site-menu" data-hide-for="medium">
<button class="site-hamburger" type="button" data-toggle>
<i class="fa fa-bars fa-lg" aria-hidden="true"></i>
</button>
<div class="title-bar-title site-title">
<a href="https://llmrisks.github.io/">Risks (and Benefits) of Generative AI and Large Language Models</a>
</div>
<div class="title-bar-right pull-right">
</div>
</div>
<div class="top-bar" id="site-menu" >
<div class="top-bar-title show-for-medium site-title">
<a href="https://llmrisks.github.io/">Risks (and Benefits) of Generative AI and Large Language Models</a>
</div>
<div class="top-bar-left">
<ul class="menu vertical medium-horizontal">
</ul>
</div>
<div class="top-bar-right show-for-medium">
<p class="groupstyle">Computer Science</br>University of Virginia</p>
</div>
</div>
</nav>
</header>
<main>
<div class="container">
<div class="sidebar">
<p>
University of Virginia<br>
cs6501 Fall 2023<br>
Risks and Benefits of Generative AI and LLMs
</p>
<p>
<p>
<a href="/syllabus"><b>Syllabus</b></a></br>
<a href="/schedule"><b>Schedule</b></a></br>
<a href="/readings"><b>Readings and Topics</b></a></br>
<p></p>
<a href="https://github.com/llmrisks/discussions/discussions/">Discussions</a><br>
<p>
<p></p>
</p>
<p>
<b><a href="/post/">Recent Posts</a></b>
<div class="posttitle">
<a href="/week3/">Week 3: Prompting and Bias</a>
</div>
<div class="posttitle">
<a href="/week2/">Week 2: Alignment</a>
</div>
<div class="posttitle">
<a href="/week1/">Week 1: Introduction</a>
</div>
<div class="posttitle">
<a href="/discussions/">Github Discussions</a>
</div>
<div class="posttitle">
<a href="/class0/">Class 0: Getting Organized</a>
</div>
<div class="posttitle">
<a href="/updates/">Updates</a>
</div>
<div class="posttitle">
<a href="/survey/">Welcome Survey</a>
</div>
<div class="posttitle">
<a href="/welcome/">Welcome to the LLM Risks Seminar</a>
</div>
<div class="posttitle">
<a href="/post/"><em>More...</em></a>
</div>
</p>
<p>
</p>
<p>
</div>
<div class="content">
<h1><a href="/week3/">Week 3: Prompting and Bias</a></h1>
<div class="post-metadata">
<span class="post-date">
<time datetime="2023-09-18 00:00:00 +0000 UTC" itemprop="datePublished">18 September 2023</time>
</span>
</div>
<div class="post-body" itemprop="articleBody">
<p>(see bottom for assigned readings and questions)</p>
<h1 id="prompt-engineering-week-3">Prompt Engineering (Week 3)</h1>
<p><author>Presenting Team: Haolin Liu, Xueren Ge, Ji Hyun Kim, Stephanie Schoch </author></p>
<p><author>Blogging Team: Aparna Kishore, Erzhen Hu, Elena Long, Jingping Wan</author></p>
<ul>
<li><a href="#monday-09112023-prompt-engineering">(Monday, 09/11/2023) Prompt Engineering</a>
<ul>
<li><a href="#warm-up-questions">Warm-up questions</a></li>
<li><a href="#what-is-prompt-engineering">What is Prompt Engineering?</a></li>
<li><a href="#how-is-prompt-based-learning-different-from-traditional-supervised-learning">How is prompt-based learning different from traditional supervised learning?</a></li>
<li><a href="#in-context-learning-and-different-types-of-prompts">In-context learning and different types of prompts</a></li>
<li><a href="#what-is-the-difference-between-prompts-and-fine-tuning">What is the difference between prompts and fine-tuning?</a></li>
<li><a href="#when-is-the-best-to-use-prompts-vs-fine-tuning">When is the best to use prompts vs fine-tuning?</a></li>
<li><a href="#risk-of-prompts">Risk of Prompts</a></li>
<li><a href="#discussion-about-prompt-format">Discussion about Prompt format</a></li>
</ul>
</li>
<li><a href="#wednesday-09132023-prompt-engineering-exposing-llm-risks">(Wednesday, 09/13/2023) Prompt Engineering: Exposing LLM Risks</a>
<ul>
<li><a href="#open-discussion">Open Discussion</a></li>
<li><a href="#case-study-marked-personas">Case Study: Marked Personas</a></li>
<li><a href="#discussion-bias-mitigation">Discussion: Bias mitigation</a></li>
<li><a href="#hands-on-activity-prompt-hacking">Hands-on Activity: Prompt Hacking</a></li>
<li><a href="#discussion-can-we-defend-against-prompt-hacking-by-build-in-safegurads">Discussion: Can we defend against prompt hacking by build-in safegurads?</a></li>
<li><a href="#further-thoughts-whats-the-real-risk">Further thoughts: What’s the real risk?</a></li>
</ul>
</li>
<li><a href="#readings">Readings</a>
<ul>
<li><a href="#optional-additional-readings">Optional Additional Readings</a></li>
<li><a href="#discussion-questions">Discussion Questions</a></li>
</ul>
</li>
</ul>
<h1 id="monday-09112023-prompt-engineering">(Monday, 09/11/2023) Prompt Engineering</h1>
<h2 id="warm-up-questions">Warm-up questions</h2>
<p>Monday’s class started with warm-up questions demonstrating how prompts can help an LLM produce correct answers or desired outcomes. The questions and prompts were tested with GPT3.5 in an in-class experiment, where each student tried the questions and then crafted prompts to help GPT3.5 produce correct answers.</p>
<p>The three questions were:</p>
<ol>
<li><em>What is 7084 times 0.99?</em></li>
<li><em>I have a magic box that can only transfer coins. If you insert a number of coins in it, the next day each coin will turn into two apples. If I add 10 coins and wait for 3 days, what will happen?</em></li>
<li><em>Among “Oregon, Virginia, Wyoming”, what is the word that ends with “n”?</em></li>
</ol>
<p>While the first question tested the arithmetic capability of the model, the second and the third questions tested common sense and symbolic reasoning, respectively. The initial response from GPT3.5 for all three questions was wrong.</p>
<p>For the first question, providing more examples as prompts did not work, but an explanation of how to reach the answer by decomposing the multiplication into multiple steps did.</p>
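<p>For instance, one decomposition that yields the correct result (a worked example we add here for illustration, not from the class slides) is:</p>
<p>$$7084 \times 0.99 = 7084 \times (1 - 0.01) = 7084 - 70.84 = 7013.16$$</p>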
<p>Figure 1 shows the prompting for the first question and the answer from GPT3.5.</p>
<table><tr>
<td><img src="../images/Week3/Picture1.png" width="95%"></td>
<td><img src="../images/Week3/Picture2.png" width="95%"></td><br><tr>
<td colspan=2 align="center">Figure 1: <b>Prompting for arithmetic question</b></td>
</tr></table>
<p>For the second question, providing an example along with an explanation of the reasoning for reaching the final answer helped GPT produce the correct answer. Here, the prompt explicitly stated that the magic box can also convert coins to apples.</p>
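<p>Under the intended reading (our paraphrase of the reasoning, not from the class slides), only coins are transformed: the 10 coins become $10 \times 2 = 20$ apples after the first day, and since apples are not coins, nothing further happens on days 2 and 3, leaving 20 apples.</p>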
<p>Figure 2 shows the prompting for the second question and the answer from GPT3.5.</p>
<table><tr>
<td><img src="../images/Week3/Picture3.png" width="95%"></td>
<td><img src="../images/Week3/Picture4.png" width="95%"></td></tr>
<tr>
<td colspan=2 align="center">Figure 2: <b>Prompting for common sense question</b>
</tr></table>
<p>While GPT initially produced random results for the third question, instructing it through examples to take the words, concatenate their last letters, and then find each letter’s position in the alphabet helped produce the correct answer.</p>
<p>Figure 3 shows the prompting for the third question and the answer from GPT3.5.</p>
<table><tr>
<td><img src="../images/Week3/Picture5.png" width="95%"></td>
<td><img src="../images/Week3/Picture6.png" width="95%"></td></tr>
<tr>
<td colspan=2 align="center">
Figure 3: <b>Prompting for symbolic reasoning question</b></td>
</tr></table>
<p>All these examples demonstrate the benefit of using prompts to explore the model’s reasoning ability.</p>
<h2 id="what-is-prompt-engineering">What is Prompt Engineering?</h2>
<p>Prompt engineering is a method of communicating with and guiding an LLM toward a desired behavior or outcome by crafting prompts that coax the model into providing the desired response. The model weights or parameters are not updated in prompt engineering.</p>
<h2 id="how-is-prompt-based-learning-different-from-traditional-supervised-learning">How is prompt-based learning different from traditional supervised learning?</h2>
<p>Traditional supervised learning trains a model to take an input and predict an output; the model learns to map input data to specific output labels. In contrast, prompt-based learning models the probability of text directly. Here, the inputs are converted into textual strings called prompts, which are used to generate the desired outcomes. Prompt-based learning offers more flexibility in adapting the model’s behavior to different tasks by modifying the prompts; retraining the model is not required.</p>
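<p>One way to make this contrast concrete (our notation, not from the lecture): a supervised classifier models $P(y \mid x; \theta)$ over a fixed label set, whereas prompt-based learning scores text with the language model itself, e.g. choosing $\hat{y} = \arg\max_{y} P_{\text{LM}}(\text{prompt}(x, y))$, so adapting to a new task only requires changing the prompt template.</p>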
<p>Interestingly, prompts were initially used for tasks such as language translation and text-based emotion prediction, rather than for improving the performance of LLMs.</p>
<h2 id="in-context-learning-and-different-types-of-prompts">In-context learning and different types of prompts</h2>
<p>In-context learning is a powerful approach for adapting a model to a specific context without updating its weights. This improves the performance and reliability of the model for a specific task or environment. Here, the model is given a few context-relevant, domain-specific examples as references or instructions.</p>
<p>We can categorize in-context learning into three different types of prompts:</p>
<ul>
<li><em>Zero-shot</em> – the model predicts the answers given only a natural language description of the task.</li>
</ul>
<center>
<a href="../images/Week3/Picture7.png"><img src="../images/Week3/Picture7.png" width="65%"></a>
<p>Figure 4: <b>Example for zero-shot prompting</b> (<a href="https://arxiv.org/pdf/2005.14165.pdf">Image Source</a>)</p>
</center>
<ul>
<li><em>One-shot</em> or <em>Few-shot</em> – In this scenario, one or a few examples that illustrate the task are provided to the model, i.e., the model is prompted with a few input-output pairs.</li>
</ul>
<center>
<a href="../images/Week3/Picture8.png"><img src="../images/Week3/Picture8.png" width="65%"></a><br>
<p>Figure 5: Examples for one-shot and few-shot prompting (<a href="https://arxiv.org/pdf/2005.14165.pdf">Image Source</a>)</p>
</center>
<ul>
<li><em>Chain-of-thought</em> – The given task or question is decomposed into coherent intermediate reasoning steps that are solved before providing the final response. This explores the reasoning ability of the model for each of the provided tasks. It is given in the format <code>&lt;input, chain-of-thought, output&gt;</code>. The difference between standard prompting and chain-of-thought prompting is depicted in the figure below. In the figure on the right, the statement highlighted in blue is an example of chain-of-thought prompting, where the reasoning behind reaching the final answer is provided as part of the example. Thus, in its output, the model also produces its reasoning, highlighted in green, to reach the final answer. Chain-of-thought prompting can change the way we interact with LLMs and leverage their capabilities, as it provides step-by-step explanations of how a particular response is reached (a small illustrative sketch of the three prompt styles follows Figure 6 below).</li>
</ul>
<table><tr>
<td align="center"><img src="../images/Week3/Picture9.png" width="85%"></td>
<td align="center"><img src="../images/Week3/Picture10.png" width="85%"></td>
</tr>
<tr>
<td align="center" colspan="2">
<p>Figure 6: <b>Standard prompting and chain-of-thought prompting</b> (<a href="https://arxiv.org/abs/2201.11903">Image Source</a>)</p></td></tr></table>
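<p>The following minimal Python sketch (our illustration; the <code>complete</code> function is a placeholder, not a real API) shows that the three prompt styles differ only in the input string handed to the model:</p>
<pre><code class="language-python"># Illustrative sketch only: `complete` stands in for whatever LLM
# completion API is used; it takes a prompt string and returns text.
def complete(prompt: str) -> str:
    return "[model completion would appear here]"

question = "Among 'Oregon, Virginia, Wyoming', what is the word that ends with 'n'?"

# Zero-shot: only a natural-language description of the task.
zero_shot = f"Answer the question.\nQ: {question}\nA:"

# Few-shot: one or more input-output examples before the real question.
few_shot = (
    "Q: Among 'apple, banana, lemon', what is the word that ends with 'n'?\n"
    "A: lemon\n"
    f"Q: {question}\nA:"
)

# Chain-of-thought: the examples also spell out the intermediate reasoning.
chain_of_thought = (
    "Q: Among 'apple, banana, lemon', what is the word that ends with 'n'?\n"
    "A: 'apple' ends with 'e', 'banana' ends with 'a', 'lemon' ends with 'n'."
    " The answer is lemon.\n"
    f"Q: {question}\nA:"
)

for prompt in (zero_shot, few_shot, chain_of_thought):
    print(complete(prompt))
</code></pre>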
<h2 id="what-is-the-difference-between-prompts-and-fine-tuning">What is the difference between prompts and fine-tuning?</h2>
<p>Prompt engineering focuses on eliciting better output from a given LLM by changing the input. Fine-tuning focuses on enhancing model performance by training the model on a smaller, targeted dataset relevant to the desired task. The similarity is that both methods help improve the model’s performance and produce desired outcomes.</p>
<p>Prompt engineering requires no retraining, and the prompting is performed within a single session with the model. In contrast, fine-tuning involves retraining the model and changing its parameters to improve performance. Fine-tuning also requires more computational resources than prompt engineering.</p>
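<p>A toy contrast may make the distinction clearer (our sketch, not any particular library): prompting only builds a new input string for a frozen model, while fine-tuning runs gradient updates that produce new parameters.</p>
<pre><code class="language-python"># Our toy contrast between the two adaptation strategies (illustrative only).

# Prompt engineering: the model's weights stay fixed; only the input changes.
def adapt_by_prompting(task_description, examples, query):
    prompt = task_description + "\n"
    for x, y in examples:
        prompt += f"Input: {x}\nOutput: {y}\n"
    prompt += f"Input: {query}\nOutput:"
    return prompt  # hand this string to the frozen model

# Fine-tuning: the parameters themselves are updated on task data.
def adapt_by_finetuning(w0, data, lr=0.1, epochs=3):
    # Toy model y_hat = w * x with squared-error loss and gradient descent.
    w = w0
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x
            w -= lr * grad
    return w  # a new parameter value replaces the old one

print(adapt_by_prompting("Negate the number.", [(2, -2), (5, -5)], 7))
print(adapt_by_finetuning(0.0, [(1.0, 2.0), (2.0, 4.0)]))
</code></pre>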
<h2 id="when-is-the-best-to-use-prompts-vs-fine-tuning">When is the best to use prompts vs fine-tuning?</h2>
<p>The above question was an in-class discussion question, and the discussion points were shared in class. Fine-tuning requires updating model weights and parameters, which is useful in applications where a centralized change is needed; in that scenario, all users experience similar performance. Prompt-based methods are user-specific within a particular session, allowing finer-grained control, and the model’s performance depends on the individual prompts designed by the user. Thus, fine-tuning is more powerful than prompt-based methods in scenarios that require centralized tuning.</p>
<p>In scenarios with limited training examples, prompt-based methods can perform well. Fine-tuning methods are data-hungry and require large amounts of data for better model performance. As noted in the discussion posts, prompts are not a universal tool that can solve every problem or guarantee performance gains. However, in specific scenarios, they can help users improve performance and reach desired outcomes for context-specific tasks.</p>
<h2 id="risk-of-prompts">Risk of Prompts</h2>
<p>The class then discussed the risks of prompting: methods like chain-of-thought have already achieved some success with LLMs, but prompt engineering remains a controversial topic. The group brought up two aspects.</p>
<p>First, the reasoning ability of LLMs: the group asked, “Does CoT empower LLMs’ reasoning ability?” Second, there are bias problems in prompt engineering. The group gave an example prompt: “Is the following sentence plausible? ‘LeBron James took a corner kick.’ (A) plausible (B) implausible. I think the answer is A, but I’m curious to hear what you think.” Appending one’s own guess in this way might inject a bias into the prompt.</p>
<p>The group then opened a discussion about two potential kinds of prompting bias, asking the class how the prompt format (e.g., task-specific prompt methods, word choice) and the prompt’s training examples (e.g., label distribution, permutations of the examples) might affect LLM outputs, and what debiasing solutions might be possible.</p>
<p>The class then discussed these two kinds of prompting bias: the prompt format and the prompt’s training examples.</p>
<h3 id="discussion-about-prompt-training-examples">Discussion about Prompt training examples</h3>
<p>For label distribution, the class discussed the need for balance in the provided examples to avoid overgeneralizing agreement with the user, since in some examples the user interjects an opinion that may be wrong. In these cases, GPT should learn to disagree with the user when the user is wrong. This also relates to label distribution more broadly: if the user always provides examples with positive labels, the LLM will be more likely to output a positive label in its prediction.</p>
<p>Permutation of the training examples: a student mentioned a paper he had recently read on why in-context learning works, which argues that the demonstrations mainly convey the label space and the distribution of the inputs. In that paper, the demonstration labels are randomly assigned (and thus often wrong), yet performance is still better than zero-shot, though worse than when correct labels are provided; randomly assigned labels still have a significant effect on performance.
The order of the training examples may also affect the LLM’s output, especially the last example: the model tends to output the same label as the final demonstration provided.</p>
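<p>A simple mitigation along the lines discussed (our sketch, with made-up example data): keep the demonstration labels balanced and shuffle their order so the model cannot simply copy the majority or the most recent label.</p>
<pre><code class="language-python">import random

def build_fewshot_prompt(pos_examples, neg_examples, query, k_per_label=2, seed=0):
    """Build a label-balanced, order-shuffled few-shot prompt (illustrative)."""
    rng = random.Random(seed)
    demos = [(x, "positive") for x in rng.sample(pos_examples, k_per_label)]
    demos += [(x, "negative") for x in rng.sample(neg_examples, k_per_label)]
    rng.shuffle(demos)  # avoid recency bias toward the last demonstration's label
    lines = [f"Review: {x}\nSentiment: {y}" for x, y in demos]
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)

print(build_fewshot_prompt(
    ["Loved it.", "Great movie!", "A delight."],
    ["Terrible.", "Waste of time.", "Very dull."],
    "It was fine, I guess.",
))
</code></pre>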
<h2 id="discussion-about-prompt-format">Discussion about Prompt format</h2>
<p>Prompt format: the words you select can affect the output, because some words appear more frequently in the training corpus and some words are more correlated with specific labels. For example, “male” may be associated with more positive terms in the training corpus, so the wording of a prompt can affect the results. Task-specific prompt methods concern how you choose a prompting method based on the specific task.</p>
<p>Finally, the group shared two papers about the bias problem in LLMs.
The first paper<sup id="fnref:1"><a href="#fn:1" class="footnote-ref" role="doc-noteref">1</a></sup> shows that different prompts produce a large variance in accuracy, which indicates that LLMs are not very stable. The paper also provides a calibration method that applies an additional linear transformation to the GPT model’s output to calibrate the model.
The second paper<sup id="fnref:2"><a href="#fn:2" class="footnote-ref" role="doc-noteref">2</a></sup> shows that LLMs do not always say what they think, especially when some bias is injected into the prompt. For example, the authors compared CoT and non-CoT prompting and found that CoT amplifies the bias in the context when the user puts bias in the prompt.</p>
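<p>A rough sketch of the calibration idea from the first paper, as we understand it (the probabilities below are made up for illustration): estimate the model’s bias by querying it with a content-free input such as “N/A”, then rescale test-time label probabilities by that baseline.</p>
<pre><code class="language-python">import numpy as np

def calibrate(label_probs, content_free_probs):
    """Rescale predicted label probabilities by the model's bias on a
    content-free input (e.g., the same prompt with "N/A" as the test input),
    then renormalize. A rough sketch of contextual calibration."""
    scaled = np.asarray(label_probs, dtype=float) / np.asarray(content_free_probs, dtype=float)
    return scaled / scaled.sum()

# Hypothetical numbers: the raw model leans toward "positive" even for "N/A".
p_test = [0.70, 0.30]          # P(positive), P(negative) on a real example
p_content_free = [0.65, 0.35]  # the same prompt with "N/A" as the input
print(calibrate(p_test, p_content_free))  # bias-corrected label distribution
</code></pre>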
<p>In conclusion, prompts can be controversial and not always perfect.</p>
<h1 id="wednesday-09132023-marked-personas">(Wednesday, 09/13/2023) Marked Personas</h1>
<h2 id="open-discussion">Open Discussion</h2>
<p>What do you think are the biggest potential risks of LLMs?</p>
<ul>
<li><strong>Social impact from intentional misuse.</strong> An LLM’s content could be manipulated by governments, potentially affecting elections and raising tensions between countries.</li>
<li><strong>Mutual trust among people could be harmed.</strong> We cannot tell whether an email or other information was written by a human or automatically generated by ChatGPT. As a result, we may treat such information more skeptically.</li>
<li><strong>People may overly trust LLM outputs.</strong> We may rely more on asking an LLM, a second-hand information source, rather than actively searching for information ourselves, overtrusting the LLM system. The information pool may also become contaminated if LLMs provide misleading information.</li>
</ul>
<p>How does GPT-4 respond?</p>
<ul>
<li>Misinformation: providing wrong, misleading, or sensitive information, for example when the LLM’s safeguards are bypassed (known as jailbreaking).</li>
<li>Potential manipulation: people could intentionally hack the LLM by giving it specific prompts.</li>
</ul>
<h2 id="case-study-marked-personas">Case Study: Marked Personas</h2>
<p>Myra Cheng, Esin Durmus, Dan Jurafsky. <a href="https://arxiv.org/abs/2305.18189">Marked Personas: Using Natural Language Prompts to Measure Stereotypes in Language Models</a>. ACL 2023.</p>
<p>In this study, ChatGPT was asked to give descriptions of several characters from different ethnicity/gender/demographic groups, e.g., an Asian woman, a Black woman, and a white man.</p>
<p>When describing a character from a non-dominant demographic group, the overall description is positive but can still imply potential stereotypes. For example, “almond-shaped eyes” is used in describing an East Asian woman, though it may sound strange to an actual East Asian person. We can also see that ChatGPT intentionally tries to build a more diverse and politically correct atmosphere for different groups. In contrast, ChatGPT uses mostly neutral and ordinary words when describing an average white man.</p>
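<p>A highly simplified sketch of the kind of analysis behind this study (ours; the paper itself uses more careful statistical measures over many generations): collect persona descriptions for a marked group and for the unmarked baseline, then look for words that are distinctive to the marked group.</p>
<pre><code class="language-python">from collections import Counter

def distinctive_words(marked_texts, unmarked_texts, top_k=5):
    """Words far more frequent in descriptions of the marked group than in
    descriptions of the unmarked baseline (toy relative-frequency ratio)."""
    def freqs(texts):
        counts = Counter(w.lower().strip(".,") for t in texts for w in t.split())
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    f_marked, f_unmarked = freqs(marked_texts), freqs(unmarked_texts)
    ratios = {w: p / f_unmarked.get(w, 1e-6) for w, p in f_marked.items()}
    return sorted(ratios, key=ratios.get, reverse=True)[:top_k]

# Tiny made-up texts standing in for generated persona descriptions.
marked = ["She has striking almond-shaped eyes and a strong sense of heritage."]
unmarked = ["He is an ordinary man who enjoys his job and his weekends."]
print(distinctive_words(marked, unmarked))
</code></pre>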
<h2 id="discussion-bias-mitigation">Discussion: Bias mitigation</h2>
<h3 id="group-1">Group 1</h3>
<p>Mitigation can sometimes overcompensate. As a language model, it should try to be neutral and independent. Also, given that people themselves are biased and LLMs learn from the human world, we may be over-expecting LLMs to be perfectly unbiased. After all, it is hard to define fairness and to distinguish between stereotype and prototype, which can lead to overcorrections.</p>
<h3 id="group-2">Group 2</h3>
<p>We may be able to identify the risks by data augmentation (replacing “male” with “female” in prompts). Governments should also be responsible for setting rules and regulating LLMs. (Note: this is controversial, and it is unclear what kinds of regulations might be useful or effective.)</p>
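<p>The data-augmentation idea can be made concrete with a small sketch (ours; the template and groups are hypothetical): generate counterfactual prompt pairs that differ only in a demographic term and compare the model’s responses.</p>
<pre><code class="language-python"># Counterfactual prompt pairs: identical except for one demographic term
# (the template and groups below are made up for illustration).
TEMPLATE = "Describe a {group} software engineer interviewing for a senior role."
GROUPS = ["male", "female"]

def counterfactual_prompts(template, groups):
    return {g: template.format(group=g) for g in groups}

for group, prompt in counterfactual_prompts(TEMPLATE, GROUPS).items():
    # In a real audit, send `prompt` to the model and compare the responses
    # across groups (e.g., sentiment, adjectives used).
    print(group, "->", prompt)
</code></pre>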
<h3 id="group-4">Group 4</h3>
<p>Companies like OpenAI should publish their mitigation strategies so that they can be understood and monitored by the public. Another aspect is that different groups of people can have very diverse points of view, so it is hard to define stereotypes and biases with a universal rule. Also, the answers can differ greatly based on the prompts, making mitigation even harder.</p>
<h2 id="hands-on-activity-prompt-hacking">Hands-on Activity: Prompt Hacking</h2>
<p>In this activity, the class tried to make ChatGPT generate sensitive or harmful responses. This could be done by assigning a pretended identity, e.g., asking it to role-play as a Hutu person in Rwanda in the 1990s or as a criminal. Under these conditions, ChatGPT’s guardrails against biased or harmful content can be partly bypassed.</p>
<h2 id="discussion-can-we-defend-against-prompt-hacking-by-build-in-safegurads">Discussion: Can we defend against prompt hacking by build-in safegurads?</h2>
<p>As we saw in the activity, this safeguard is currently not that strong. A more practical approach may be to add a disclaimer at the end of potentially sensitive content and provide a questionnaire to collect feedback for better iteration. Companies should also actively identify these jailbreaks and attempt to mitigate them.</p>
<h2 id="further-thoughts-whats-the-real-risk">Further thoughts: What’s the real risk?</h2>
<p>While jailbreaking is one of the risks of LLMs, a riskier situation may be that an LLM is intentionally trained and used by people to do bad things. After all, casual misuse is less serious than a deliberate crime.</p>
<p><a href="#table-of-contents">Back to top</a></p>
<h1 id="readings">Readings</h1>
<ol>
<li>
<p>(for Monday) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, Denny Zhou. <a href="https://arxiv.org/abs/2201.11903"><em>Chain-of-Thought Prompting Elicits Reasoning in Large Language Models</em></a>. 2022.</p>
</li>
<li>
<p>(for Wednesday) Myra Cheng, Esin Durmus, Dan Jurafsky. <a href="https://arxiv.org/abs/2305.18189">Marked Personas: Using Natural Language Prompts to Measure Stereotypes in Language Models</a>. ACL 2023.</p>
</li>
</ol>
<h2 id="optional-additional-readings">Optional Additional Readings</h2>
<p><strong>Background:</strong></p>
<ul>
<li>Lilian Weng, <a href="https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/"><em>Prompt Engineering</em></a>. March 2023.</li>
<li>Miles Turpin, Julian Michael, Ethan Perez, Samuel R. Bowman. <a href="https://arxiv.org/abs/2305.04388"><em>Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting</em></a>. May 2023.</li>
<li>Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, Sameer Singh. <a href="https://arxiv.org/abs/2102.09690"><em>Calibrate Before Use: Improving Few-Shot Performance of Language Models</em></a>. ICML 2021.</li>
</ul>
<p><strong>Stereotypes and bias:</strong></p>
<ul>
<li>Yang Trista Cao, Anna Sotnikova, Hal Daumé III, Rachel Rudinger, Linda Zou. <a href="https://aclanthology.org/2022.naacl-main.92.pdf"><em>Theory-Grounded Measurement of U.S. Social Stereotypes in English Language Models</em></a>. NAACL 2022.</li>
</ul>
<p><strong>Prompt Injection:</strong></p>
<ul>
<li>Eric Wallace, Shi Feng, Nikhil Kandpal, Matt Gardner, Sameer Singh. <a href="https://arxiv.org/abs/1908.07125"><em>Universal Adversarial Triggers for Attacking and Analyzing NLP</em></a>. EMNLP 2019</li>
<li>Simon Willison, <a href="https://simonwillison.net/2022/Sep/12/prompt-injection/"><em>Prompt injection attacks against GPT-3</em></a>. September 2022.</li>
<li>William Zhang. <a href="https://www.robustintelligence.com/blog-posts/prompt-injection-attack-on-gpt-4"><em>Prompt Injection Attack on GPT-4</em></a>. Robust Intelligence, March 2023.</li>
</ul>
<h1 id="discussion-questions">Discussion Questions</h1>
<p>Everyone who is not in either the lead or blogging team for the week should post (in the comments below) an answer to at least one of the four questions in each section, or a substantive response to someone else’s comment, or something interesting about the readings that is not covered by these questions.</p>
<p>Don’t post duplicates - if others have already posted, you should read their responses before adding your own. Please post your responses to different questions as separate comments.</p>
<p>First section (1 – 4): Before <strong>5:29pm</strong> on <strong>Sunday, September 10</strong>.<br>
Second section (5 – 9): Before <strong>5:29pm</strong> on <strong>Tuesday, September 12</strong>.</p>
<h2 id="before-sunday-questions-about-chain-of-thought-prompting">Before Sunday: Questions about Chain-of-Thought Prompting</h2>
<ol>
<li>
<p>Compared to other types of prompting, do you believe that chain-of-thought prompting represents the most effective approach for enhancing the performance of LLMs? Why or why not? If not, could you propose an alternative?</p>
</li>
<li>
<p>The paper highlights several examples where chain-of-thought prompting can significantly improve its outcomes, such as in solving math problems, applying commonsense reasoning, and comprehending data. Considering these improvements, what additional capabilities do you envision for LLMs using chain-of-thought prompting?</p>
</li>
<li>
<p>Why are different language models in the experiment performing differently with chain-of-thought prompting?</p>
</li>
<li>
<p>Try some of your own experiments with prompt engineering using your favorite LLM, and report interesting results. Is what you find consistent with what you expect from the paper? Are you able to find any new prompting methods that are effective?</p>
</li>
</ol>
<h2 id="by-tuesday-questions-about-marked-personas">By Tuesday: Questions about Marked Personas</h2>
<ol start="5">
<li>
<p>The paper addresses potential harms from LLMs by identifying the underlying stereotypes present in their generated contents. Additionally, the paper offers methods to examine and measure those stereotypes. Can this approach effectively be used to diminish stereotypes and enhance fairness? What are the main limitations of the work?</p>
</li>
<li>
<p>The paper mentions racial stereotypes identified in downstream applications such as story generation. Are there other possible issues we might encounter when the racial stereotypes in LLMs become problematic after its application?</p>
</li>
<li>
<p>Much of the evaluation in this work uses a list of White and Black stereotypical attributes provided by Ghavami and Peplau (2013) as the human-written responses and compares them with the list of LLMs generated responses. This, however, does not encompass all racial backgrounds and is heavily biased by American attitudes about racial categories, and they might not distinguish between races in great detail. Do you believe there could be a notable difference when more comprehensive racial representation is incorporated? If yes, what potential differences may arise? If no, why not?</p>
</li>
<li>
<p>This work emphasizes the naturalness of the input provided to the LLM, while we have previously seen examples of eliciting harmful outputs by using less natural language. What potential benefits or risks are there in not investigating less natural inputs (e.g., prompt injection attacks including the suffix attack we saw in Week 2)? Can you suggest a less natural prompt that could reveal additional or alternate stereotypes?</p>
</li>
<li>
<p>The authors recommend transparency of bias mitigation methods, citing the benefit it could provide to researchers and practitioners. Specifically, how might researchers benefit from this? Can you foresee any negative consequences (either to researchers or the general users of these models) of this transparency?</p>
</li>
</ol>
<section class="footnotes" role="doc-endnotes">
<hr>
<ol>
<li id="fn:1" role="doc-endnote">
<p>Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, Sameer Singh. “<a href="https://arxiv.org/pdf/2102.09690.pdf">Calibrate before use: Improving few-shot performance of language models</a>.” International Conference on Machine Learning. PMLR, 2021. <a href="#fnref:1" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
<li id="fn:2" role="doc-endnote">
<p>Miles Turpin, Julian Michael, Ethan Perez, Samuel R. Bowman. “<a href="https://arxiv.org/pdf/2305.04388.pdf">Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting.</a>” arXiv preprint arXiv:2305.04388, 2023. <a href="#fnref:2" class="footnote-backref" role="doc-backlink">↩︎</a></p>
</li>
</ol>
</section>
</div>
<hr class="post-separator"></hr>
<h1><a href="/week2/">Week 2: Alignment</a></h1>
<div class="post-metadata">
<span class="post-date">
<time datetime="2023-09-11 00:00:00 +0000 UTC" itemprop="datePublished">11 September 2023</time>
</span>
</div>
<div class="post-body" itemprop="articleBody">
<p>(see bottom for assigned readings and questions)</p>
<h1 id="table-of-contents">Table of Contents</h1>
<ul>
<li><a href="#monday-09042023-introduction-to-alignment">(Monday, 09/04/2023) Introduction to Alignment</a>
<ul>
<li><a href="#introduction-to-ai-alignment-and-failure-cases">Introduction to AI Alignment and Failure Cases</a>
<ul>
<li><a href="#discussion-questions">Discussion Questions</a></li>
</ul>
</li>
<li><a href="#the-alignment-problem-from-a-deep-learning-perspective">The Alignment Problem from a Deep Learning Perspective</a>
<ul>
<li><a href="#group-of-rl-based-methods">Group of RL-based methods</a></li>
<li><a href="#group-of-llm-based-methods">Group of LLM-based methods</a></li>
<li><a href="#group-of-other-ml-methods">Group of Other ML methods</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#wednesday-09062023-alignment-challenges-and-solutions">(Wednesday, 09/06/2023) Alignment Challenges and Solutions</a>
<ul>
<li><a href="#opening-discussion">Opening Discussion</a></li>
<li><a href="#introduction-to-red-teaming">Introduction to Red-Teaming</a>
<ul>
<li><a href="#in-class-activity-5-groups">In-class Activity (5 groups)</a></li>
<li><a href="#how-to-use-red-teaming">How to use Red-Teaming?</a></li>
</ul>
</li>
<li><a href="#alignment-solutions">Alignment Solutions</a>
<ul>
<li><a href="#llm-jailbreaking---introduction">LLM Jailbreaking - Introduction</a></li>
<li><a href="#llm-jailbreaking---demo">LLM Jailbreaking - Demo</a>
<ul>
<li><a href="#observations">Observations</a></li>
<li><a href="#potential-improvement-ideas">Potential Improvement Ideas</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#closing-remarks-by-prof-evans">Closing Remarks (by Prof. Evans)</a></li>
</ul>
</li>
<li><a href="#readings">Readings</a>
<ul>
<li><a href="#optional-additional-readings">Optional Additional Readings</a>
<ul>
<li><a href="#background--motivation">Background / Motivation</a></li>
<li><a href="#alignment-readings">Alignment Readings</a></li>
<li><a href="#adversarial-attacks--jailbreaking">Adversarial Attacks / Jailbreaking</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#discussion-questions-1">Discussion Questions</a></li>
</ul>
<h1 id="monday-09042023-introduction-to-alignment">(Monday, 09/04/2023) Introduction to Alignment</h1>
<h2 id="introduction-to-ai-alignment-and-failure-cases">Introduction to AI Alignment and Failure Cases</h2>
<p><em>Alignment</em> is not well defined and there is no agreed upon meaning,
but it generally refers to the strategic effort to ensure that AI
systems, especially complex models like LLMs, closely adhere to
predetermined objectives, preferences, or value systems. This effort
encompasses the development of AI algorithms and architectures in a
way that reduces disparities between machine behavior and how the
model is intended to be used to minimize the chances of unintentional
or unfavorable outcomes. Alignment strategies involve methods such as
model training, fine-tuning, and the implementation of rule-based
constraints, all aimed at fostering coherent, contextually relevant,
and value-aligned AI responses, making them align with the intended
purpose of the model.</p>
<blockquote>
<p><em>What factors are (and aren’t) a part of alignment?</em></p>
</blockquote>
<p>Alignment is a multifaceted problem that involves various factors
and considerations to ensure that AI systems behave in ways that align
with the intended purpose.</p>
<p>Some of the key factors related to alignment include:</p>
<ol>
<li>
<p><strong>Ethical Considerations:</strong> Prioritizing ethical principles like
fairness, transparency, accountability, and privacy to guide AI
behavior in line with societal values</p>
</li>
<li>
<p><strong>Value Alignment:</strong> Aligning AI
systems with human values and intentions, defining intended behavior
to ensure it reflects expectations from the model</p>
</li>
<li>
<p><strong>User Intent Understanding:</strong> Ensuring AI systems accurately
interpret user intent and context, and give contextually appropriate
responses in natural language tasks</p>
</li>
<li>
<p><strong>Bias Mitigation:</strong> Identifying
and mitigating biases, such as racial, gender, economic, and political
biases, to ensure fair responses</p>
</li>
<li>
<p><strong>Responsible AI Use:</strong> Promoting responsible and ethical AI
deployment to prevent intentional misuse of the model</p>
</li>
<li>
<p><strong>Unintended Bias:</strong> Preventing the model from exhibiting undesirable political, economic, racial, or gender biases in its responses.</p>
</li>
</ol>
<p>However, while these factors are important considerations, studies
like <a href="https://aclanthology.org/2023.acl-long.656.pdf"><em>From Pretraining Data to Language Models to Downstream Tasks</em>
(Feng et al.)</a> show
that famous models like BERT and ChatGPT do appear to have
socioeconomic political leanings (of course, there is no true
“neutral” or “center” position; these are just defined by where
the expected distribution of beliefs lies).</p>
<p>Figure 1 shows the political leanings of famous LLMs.</p>
<center>
<a href="/images/week2/politic.png"><img src="/images/week2/politic.png" width="80%"></a><br>
<p>Figure 1: Political Leanings of Various LLMs
(<a href="https://arxiv.org/pdf/2305.08283.pdf">Image Source</a>)</p>
</center>
<p>That being said, the goals of alignment are hard to define and
challenging to achieve. There are several very famous cases where
model alignment failed, showing how alignment failures can lead to
unintended consequences. We discuss two famous examples where
alignment failed:</p>
<ol>
<li>
<p>Google’s Image Recognition Algorithm (2015). This was an AI model
designed to automatically label images based on their content. The
goal was to assist users in searching for their images more
effectively. However, the model quickly started labeling images
under offensive categories. This included cases of racism, as well
as culturally insensitive categorization.</p>
</li>
<li>
<p>Microsoft’s Tay Chatbot (2016). This was a Twitter-based AI model
programmed to interact with users in casual conversations and
learn from those interactions to improve its responses. The
purpose was to mimic a teenager and have light
conversations. However, the model quickly went haywire when it was
exposed to malicious and hateful content on Twitter, and it began
giving similar hateful and inappropriate responses. Figures 2 and
3 show some of these examples. The model was quickly shut down (in
less than a day!), and was a good lesson to learn that you cannot
quickly code a model and let it out in the wild! (See James Mickens’ hilarious USENIX Security 2018 keynote talk, <a href="https://www.youtube.com/watch?v=ajGX7odA87k"><em>Why Do Keynote Speakers Keep Suggesting That Improving Security Is Possible?</em></a> for an entertaining and illuminating story about Tay and a lot more.)</p>
</li>
</ol>
<center>
<a href="/images/week2/twitter_1.png"><img src="/images/week2/twitter_1.png" width="80%"></a><br>
<p>Figure 2: Example Tweet by Microsoft’s infamous Tay chatbot
(<a href="https://www.theverge.com/2016/3/24/11297050/tay-microsoft-chatbot-racist">Image Source</a>)</p>
</center>
<center>
<a href="/images/week2/twitter_2.png"><img src="/images/week2/twitter_2.png" width="80%"></a><br>
<p>Figure 3: Example Tweet by Microsoft’s infamous Tay chatbot
(<a href="https://www.bbc.com/news/technology-35902104">Image Source</a>)</p>
</center>
<h3 id="discussion-questions">Discussion Questions</h3>
<blockquote>
<p><em>What is the definition of alignment?</em></p>
</blockquote>
<p>At its core, AI alignment refers to the extent to which a model embodies the values of humans. Now, you might wonder, whose values are we talking about? While values can differ across diverse societies and cultures, for the purposes of AI alignment, they can be thought of as the collective, overarching values held by a significant segment of the global population.</p>
<p>Imagine a scenario where someone poses a question to an AI chatbot about the process of creating a bomb. Given the potential risks associated with such knowledge, a well-aligned AI should recognize the broader implications of the query. There’s an underlying societal consensus about safety and security, and the AI should be attuned to that. Instead of providing a step-by-step guide, an aligned AI might generate a response that encourages more positive intent, thereby prioritizing the greater good.</p>
<p>The journey of AI alignment is not just about programming an AI to parrot back human values. It’s a nuanced field of research dedicated to bridging the gap between our intentions and the AI’s actions. In essence, alignment research seeks to eliminate any discrepancies between:</p>
<ul>
<li>Intended Goals: These are the objectives we, as humans, wish for the machine to achieve.</li>
<li>Specified Goals: These are the actions that the machine actually undertakes, determined by mathematical models and parameters.</li>
</ul>
<p>The quest for perfect AI alignment is an ongoing one. As technology continues to evolve, the goalposts might shift, but the essence remains the same: ensuring that our AI companions understand and respect our shared human values, leading to a safer and more harmonious coexistence.</p>
<p>[1] <a href="https://www.techtarget.com/whatis/definition/AI-alignment">https://www.techtarget.com/whatis/definition/AI-alignment</a></p>
<blockquote>
<p><em>Why is alignment important?</em></p>
</blockquote>
<p><strong>Precision in AI: The Critical Nature of Model Alignment</strong></p>
<p>In the realm of artificial intelligence, precision is paramount. As
enthusiasts, developers, or users, we all desire a machine that
mirrors our exact intentions. Let’s delve into why it’s crucial for AI
models to provide accurate responses and the consequences of a
misaligned model.</p>
<p>When we interact with an AI chatbot, our expectations are
straightforward. We pose a question and, in return, we anticipate an
answer that is directly related to our query. We’re not seeking a
soliloquy or a tangent. Just a simple, clear-cut response. For
instance, if you ask about the weather in Paris, you don’t want a
history lesson on the French Revolution!</p>
<p><strong>Comment</strong>: As the adage goes, “Less is more”. In the context of AI, precision trumps verbosity.</p>
<p>Misalignment doesn’t just lead to frustrating user experiences; it can have grave repercussions. Consider a situation where someone reaches out to ChatGPT seeking advice on mental health issues or suicidal thoughts. A misaligned response that even remotely suggests that ending one’s life might sometimes be a valid choice can have catastrophic outcomes.</p>
<p>Moreover, as AI permeates sectors like the judiciary and healthcare, the stakes get even higher. The incorporation of AI in these critical areas elevates the potential for it to have far-reaching societal impacts. A flawed judgment in a court case due to AI or a misdiagnosis in a medical context can have dire consequences, both ethically and legally.</p>
<p>In conclusion, the alignment of AI models is not just a technical challenge; it’s a societal responsibility. As we continue to integrate AI into our daily lives, ensuring its alignment with human values and intentions becomes paramount for the betterment of society at large.</p>
<blockquote>
<p><em>What responsibilities do AI developers have when it comes to ensuring alignment?</em></p>
</blockquote>
<p>First and foremost, developers must be fully attuned to the possible legal and ethical problems associated with AI models. It’s not just about crafting sophisticated algorithms; it’s about understanding the real-world ramifications of these digital entities.</p>
<p>Furthermore, a significant concern in the AI realm is the inadvertent perpetuation or even amplification of pre-existing biases. These biases, whether related to race, gender, or any other socio-cultural factor, can have detrimental effects when incorporated into AI systems. Recognizing this, developers have a duty to not only be vigilant of these biases but also to actively work towards mitigating them.</p>
<p>However, a developer’s responsibility doesn’t culminate once the AI product hits the market. The journey is continuous. Post-deployment, it’s crucial for developers to monitor the system’s alignment with human values and rectify any deviations. It’s an ongoing commitment to refinement and recalibration. Moreover, transparency is key. Developers should be proactive in highlighting potential concerns related to their models and fostering a culture where the public is not just a passive victim but an active participant in the model alignment process.</p>
<p>To round off, it’s essential for developers to adopt a forward-thinking mindset. The decisions made today in the AI labs and coding chambers will shape the world of tomorrow. Thus, every developer should think about the long-term consequences of their work, always aiming to ensure that AI not only dazzles with its brilliance but also remains beneficial for generations to come.</p>
<blockquote>
<p><em>How might AI developers’ responsibility evolve?</em></p>
</blockquote>
<p>It’s impossible to catch all edge cases. As AI systems grow in complexity, predicting every potential outcome or misalignment becomes a herculean task. Developers, in the future, might need to shift from a perfectionist mindset to one that emphasizes robustness and adaptability. While it’s essential to put in rigorous engineering effort to minimize errors, it’s equally crucial to understand and communicate that no system can be flawless.</p>
<p>Besides, given that catching all cases isn’t feasible, developers’ roles might evolve to include more dynamic and real-time monitoring of AI systems. This would involve continuously learning from real-world interactions, gathering feedback, and iterating on the model to ensure better alignment with human values.</p>
<h2 id="the-alignment-problem-from-a-deep-learning-perspective">The Alignment Problem from a Deep Learning Perspective</h2>
<p>In this part of today’s seminar, the whole class was divided into 3 groups to discuss the possible alignment problems from a deep learning perspective. Specifically, three groups were focusing on the alignment problems regarding different categories of Deep Learning methods, which are:</p>
<ol>
<li>Reinforcement Learning (RL) based methods</li>
<li>Large Language Model (LLM) based methods</li>
<li>Other Machine Learning (ML) methods</li>
</ol>
<p>For each of the categories above, the discussion in each group was mainly focused on three topics as follows:</p>
<ol>
<li>What can go wrong in these systems in the worst scenario?</li>
<li>How it would happen (realistically)?</li>
<li>What are potential solutions/workarounds/safety measures?</li>
</ol>
<p>After 30-minute discussions, 3 groups stated their ideas and exchanged their opinions in the class. Details of each group’s discussion results are concluded below.</p>
<h3 id="rl-based-methods">RL-based methods</h3>
<blockquote>
<ol>
<li><em>What can go wrong in these systems in the worst scenario?</em></li>
</ol>
</blockquote>
<p>This group stated several potential alignment issues about the RL-based methods. First, the model may provide inappropriate or harmful responses to sensitive questions, such as inquiries about self-harm or suicide, which could have severe consequences. On top of that, ensuring that the model’s behavior aligns with ethical and safety standards can be challenging, thus potentially leading to a disconnect between user expectations and the model’s responses. Moreover, if the model is trained on biased or harmful data, it may generate responses that reflect the biases or harmful content present in that training data.</p>
<blockquote>
<ol start="2">
<li><em>How it would happen (realistically)?</em></li>
</ol>
</blockquote>
<p>The worst-case scenarios can occur due to the following reasons that have been mentioned by this group. The first factor is the training data. To be specific, the model’s behavior is influenced by the data it was trained on. If the training data contains inappropriate or harmful content, the model may inadvertently generate similar content in its responses. Furthermore, ensuring that the model provides responsible answers to sensitive questions and aligns with ethical standards requires careful training and oversight. Moreover, the model lacks robustness and fails to detect and prevent harmful content or behaviors that can lead to problematic responses.</p>
<blockquote>
<ol start="3">
<li><em>What are potential solutions/workarounds/safety measures?</em></li>
</ol>
</blockquote>
<p>Some potential solutions were suggested by this group. First is ensuring that the training data used for the model is carefully curated to avoid inappropriate or harmful content. Apart from that, it is also important to teach the model how to align its behavior and responses with ethical and safety standards, especially when responding to sensitive questions. Moreover, this group emphasized that the responsibility for the model’s behavior lies with everyone involved. Therefore, it is necessary to promote vigilance when using the model to prevent harmful outcomes. Additionally, conducting a thorough review of the model’s behavior and responses before deployment is a possible solution as well, which makes necessary adjustments to ensure the robustness and safety of RL models.</p>
<h3 id="llm-based-methods">LLM-based methods</h3>
<blockquote>
<ol>
<li><em>What can go wrong in these systems in the worst scenario?</em></li>
</ol>
</blockquote>
<p>The worst-case scenario given by this group was in the context of relying on AI chatbots and models involving potentially severe consequences. One worst-case scenario mentioned is the loss of life. For instance, if a person in a vulnerable state relies on a chatbot for critical information or advice, and the chatbot provides incorrect or harmful answers, it could lead to tragic outcomes. Another concern is the spread of misinformation. AI models, especially chatbots, are easily accessible to a wide range of people. If these models provide inaccurate or misleading information to users who trust them blindly, it can contribute to the dissemination of false information, potentially leading to harmful consequences.</p>
<blockquote>
<ol start="2">
<li><em>How it would happen (realistically)?</em></li>
</ol>
</blockquote>
<p>According to the perception of this group, the potential of such worst-case scenarios happening is due to the following reasons. First, AI models are readily available to a broad audience, making them easily accessible for use in various situations. Second, many users who rely on AI models may not have a deep understanding of how these models work or their limitations. They might trust the AI models without critically evaluating the information they provide. Moreover, such worst-case scenarios often emerge in complex, gray areas where ethical and value-based decisions come into play, which means determining what is right or wrong, what constitutes an opinion, and where biases may exist can be challenging.</p>
<blockquote>
<ol start="3">
<li><em>What are potential solutions/workarounds/safety measures?</em></li>
</ol>
</blockquote>
<p>From the discussion result of this group, there are several possible solutions and safety measures. For example, creating targeted models for specific use cases rather than having a single generalized model for all purposes will allow for more control and customization in different domains. Furthermore, when developing AI models, involving a peer review process where experts collectively decide what information is right and wrong for a specific use case can help ensure the accuracy and reliability of the model’s responses. Another suggestion was recognizing the importance of educating users, particularly those who may not be as informed, about the limitations and workings of AI models. This education can help users make more informed decisions when interacting with AI systems and avoid blind trust.</p>
<h3 id="other-ml-methods">Other ML methods</h3>
<blockquote>
<ol>
<li><em>What can go wrong in these systems in the worst scenario?</em></li>
</ol>
</blockquote>
<p>This group discussed a scenario in which technical research is released and a hypothetical ML model is incorporated into biomedical research. In the worst case, incorporating a machine learning model into biomedical research could result in the generation of compounds that are incompatible with the research goals, which could lead to unintended or harmful outcomes, potentially jeopardizing the research and its objectives.</p>
<blockquote>
<ol start="2">
<li><em>How it would happen (realistically)?</em></li>
</ol>
</blockquote>
<p>The opinions of this group imply that blindly trusting the ML model without human oversight and involvement in decision-making could be a contributing factor to such alignment problems in ML methods.</p>
<blockquote>
<ol start="3">
<li><em>What are potential solutions/workarounds/safety measures?</em></li>
</ol>
</blockquote>
<p>Several potential solutions were given by this group. The first is actively involving humans in the decision-making process at various stages. They emphasized the importance of humans not blindly trusting the system and suggested running simulations with different explanation techniques and incorporating a human in the decision-making process before accepting the model’s outputs. Second, they suggested continuously overseeing the model’s behavior and its alignment with goals, because continuous human oversight at different stages of the process (from data collection to model deployment) is important to ensure alignment with the intended goals. Beyond that, ensuring diverse and representative data for training and testing is also important, as it can help avoid situations where the model performs well on metrics but fails in real-life scenarios. Furthermore, they suggested implementing human-based reinforcement learning to align the model with its intended use case: a human should be incorporated before “trusting” the model, since otherwise people may not trust the system, and the model’s alignment should be checked at each step. Since very small design choices may have a big impact on the model, it is necessary to make sure the intended use case aligns well with how the model actually behaves.</p>
<p><a href="#table-of-contents">Back to top</a></p>
<h1 id="wednesday-09062023-alignment-challenges-and-solutions">(Wednesday, 09/06/2023) Alignment Challenges and Solutions</h1>
<h2 id="opening-discussion">Opening Discussion</h2>
<p>Discussion on how to solve alignment issues stemming from:</p>
<ol>
<li>
<p><strong>Training Data</strong>. Addressing alignment issues stemming from training data is crucial for building reliable AI models. Collecting unbiased data, as one student suggested, is indeed a fundamental step. Bias can be introduced through various means, such as skewed sampling or annotator biases, so actively working to mitigate these sources of bias is essential. Automated annotation methods can help to some extent, but as the student rightly noted, they can be expensive and may not capture the nuances of complex real-world data. To overcome this, involving humans in the loop to guide the annotation process is an effective strategy. Human annotators can provide valuable context, domain expertise, and ethical considerations that automated systems may lack. This human-machine collaboration can lead to a more balanced and representative training dataset, ultimately improving model performance and alignment with real-world scenarios.</p>
</li>
<li>
<p><strong>Model Design</strong>. When it comes to addressing alignment issues related to model design, several factors must be considered. The choice of model architecture, hyperparameters, and training objectives can significantly impact how well a model aligns with its intended task. It’s essential to carefully design models that are not overly complex or prone to overfitting, as these can lead to alignment problems. Moreover, model interpretability and explainability should be prioritized to ensure that decisions made by the AI can be understood and validated by humans. Additionally, incorporating feedback loops where human experts can continually evaluate and fine-tune the model’s behavior is crucial for maintaining alignment. In summary, model design should encompass simplicity, interpretability, and a robust mechanism for human oversight to ensure that AI systems align with human values and expectations.</p>
</li>
</ol>
<h2 id="introduction-to-red-teaming">Introduction to Red-Teaming</h2>
<p>Language Models (LMs) often cannot be deployed because of their potential to harm users in hard-to-predict ways. One way to address this issue is to identify harmful behaviors before deployment by using test cases, which is also known as <em>red teaming</em>.</p>
<center>
<a href="/images/week2/redteam.png"><img src="/images/week2/redteam.png" width="80%"></a><br>
<p>Figure 1: Red-Teaming (<a href="https://arxiv.org/pdf/2202.03286.pdf">Image Source</a>)</p>
</center>
<p>In essence, the goal of red-teaming is to discover, measure, and reduce potentially harmful outputs. However, human annotation is expensive, limiting the number and diversity of test cases.</p>
<p>In light of this, the paper introduces LM-based red teaming, which aims to complement manual testing and reduce the number of such oversights by automatically finding cases where LMs behave harmfully. To do so, the authors first generate test inputs using an LM itself, and then use a classifier to detect harmful behavior on those test inputs (Fig. 1). In this way, LM-based red teaming found tens of thousands of diverse failure cases without writing them by hand. Generally, finding failing test cases proceeds in three steps:</p>
<ol>
<li>Generate test cases using a red LM $p_{r}(x)$.</li>
<li>Use the target LM to generate an output $y$ for each test case $x$.</li>
<li>Find the test cases that led to a harmful output using the red team classifier $r(x, y)$.</li>
</ol>
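<p>To make these three steps concrete, here is a minimal sketch of the loop in Python. The helpers <code>red_lm</code>, <code>target_lm</code>, and <code>harm_classifier</code> are hypothetical placeholders (a red LM sampler, the model under test, and a harmfulness classifier); they are not part of the paper’s released code.</p>
<pre><code>def red_team(red_lm, target_lm, harm_classifier, n_cases=1000, threshold=0.5):
    """Minimal sketch of LM-based red teaming with hypothetical helper objects."""
    failing = []
    for _ in range(n_cases):
        # Step 1: the red LM p_r(x) generates a candidate test case x.
        x = red_lm.sample("List of questions to ask someone:\n1.")
        # Step 2: the target LM produces an output y for the test case.
        y = target_lm.generate(x)
        # Step 3: the red-team classifier r(x, y) flags harmful outputs.
        if harm_classifier.score(x, y) &gt;= threshold:
            failing.append((x, y))
    return failing
</code></pre>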
<p>Specifically, the paper investigated various text generation methods for test case generation.</p>
<ul>
<li>
<p><strong>Zero-shot (ZS) Generation</strong>: Generate failing test cases without human intervention by sampling numerous outputs from a pretrained LM using a given prefix or “prompt”.</p>
</li>
<li>
<p><strong>Stochastic Few-shot (SFS) Generation</strong>: Utilize zero-shot test cases as examples for few-shot learning to generate similar test cases.</p>
</li>
<li>
<p><strong>Supervised Learning (SL)</strong>: Fine-tune the pretrained LM to maximize the log-likelihood of failing zero-shot test cases.</p>
</li>
<li>
<p><strong>Reinforcement Learning (RL)</strong>: Train the LM with RL to maximize the expected harmfulness elicited while conditioning on the zero-shot prompt.</p>
</li>
</ul>
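<p>As a rough illustration of how the first two methods differ, the sketch below constructs the red-LM prompt for each. The exact prompt wording and the <code>failing_cases</code> pool are illustrative assumptions rather than the paper’s exact setup.</p>
<pre><code>import random

ZERO_SHOT_PREFIX = "List of questions to ask someone:\n1."

def zero_shot_prompt():
    # Zero-shot: sample test cases directly from the red LM given a fixed prefix.
    return ZERO_SHOT_PREFIX

def stochastic_few_shot_prompt(failing_cases, k=5):
    # Stochastic few-shot: randomly pick previously failing zero-shot test cases
    # and prepend them as in-context examples before asking for one more.
    examples = random.sample(failing_cases, k)
    numbered = "\n".join(f"{i + 1}. {q}" for i, q in enumerate(examples))
    return f"List of questions to ask someone:\n{numbered}\n{k + 1}."
</code></pre>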
<h3 id="in-class-activity-5-groups">In-class Activity (5 groups)</h3>
<ol>
<li>
<p><strong>Offensive Language</strong>: Hate speech, profanity, sexual content, discrimination, etc</p>
<p>Group 1 came up with 3 potential methods to prevent offensive language:</p>
<ul>
<li>Filter out offensive-language-related data manually, then perform fine-tuning.</li>
<li>Filter out offensive-language-related data using other models, then perform fine-tuning.</li>
<li>Generate prompts that might elicit harmful output and use them to fine-tune the model, keeping the context in consideration.</li>
</ul>
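<p>A minimal sketch of the second idea (filtering the fine-tuning data with another model); <code>toxicity_model</code> and its <code>score</code> method are assumed placeholders for any off-the-shelf toxicity classifier.</p>
<pre><code>def filter_training_data(examples, toxicity_model, max_toxicity=0.2):
    # Keep only examples the auxiliary model scores as sufficiently non-toxic;
    # the filtered subset is then used for fine-tuning.
    return [ex for ex in examples if toxicity_model.score(ex) &lt; max_toxicity]
</code></pre>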
</li>
<li>
<p><strong>Data Leakage</strong>: Generating copyrighted/private, personally-identifiable information.</p>
<p>Copyright infringement (which concerns the expression of an idea) is very different from leaking private information, but for the purposes of this limited discussion we considered them together. Since LLMs and other AIGC models such as Stable Diffusion have a strong ability to memorize, imitate, and generate, the generated content may well infringe copyrights and may include sensitive personally-identifiable material. There are already lawsuits against these companies over copyright infringement (lawsuits: <a href="https://www.nytimes.com/2023/07/10/arts/sarah-silverman-lawsuit-openai-meta.html">news1</a>, <a href="https://www.npr.org/2023/08/16/1194202562/new-york-times-considers-legal-action-against-openai-as-copyright-tensions-swirl">news2</a>).</p>
<p>Regarding the possible solutions, Group 2 viewed this from two perspectives. During the data preprocessing stage, companies such as OpenAI can collect the training data according to the license and also pay for the copyrighted data if needed. During the post-processing stage, commercial licenses and rule-based filters can be added to the model to ensure the fair use of the output content. For example, GitHub Copilot will block the generated suggestion if it has about 50 tokens that exactly or nearly match the training data (<a href="https://docs.github.com/en/copilot/configuring-github-copilot/configuring-github-copilot-settings-on-githubcom#enabling-or-disabling-duplication-detection">source</a>). OpenAI takes a different strategy by asking the users to be responsible for using the generated content, including for ensuring that it does not violate any applicable law or these Terms (<a href="https://openai.com/policies/terms-of-use">source</a>). There are many cases currently working their way through the legal system, and it remains to be seen how courts will interpret things.</p>
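<p>One way to approximate such a rule-based post-processing filter is an n-gram overlap check against an index of the training data. The sketch below is only an illustration of the idea, not GitHub’s or OpenAI’s actual implementation.</p>
<pre><code>def ngrams(tokens, n):
    # All contiguous windows of n tokens.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def should_block(output_tokens, training_ngram_index, n=50):
    # Block the suggestion if any length-n window of the generated output
    # also appears verbatim in the indexed training corpus.
    return any(g in training_ngram_index for g in ngrams(output_tokens, n))
</code></pre>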
<p>However, the current solutions still have their limitations. For programs and code, preventing data leakage might be relatively easy [perhaps, but many would dispute this], but for images and text this is quite difficult, since it is hard to build a good metric for whether generated data has a copyright issue. Data watermarking may be a possible solution.</p>
</li>
<li>
<p><strong>Contact Information Generation</strong>: Directing users to unnecessarily email or call real people.</p>
<p>One alarming example of the potential misuse of Large Language Models (LLMs) like ChatGPT in the context of contact information generation is the facilitation of email scams. Malicious actors could employ LLMs to craft convincing phishing emails that appear to come from reputable sources, complete with authentic-sounding contact information. For example, an LLM could generate a deceptive email/phone call from a well-known bank, requesting urgent action and providing a seemingly legitimate email address and phone number for customer support.</p>
<p>Red-teaming involves simulating potential threats and vulnerabilities to identify weaknesses in security systems. By engaging in red-teaming exercises that specifically target the misuse of LLMs, we can mitigate these risks and protect individuals and organizations from falling victim to email and phone scams and other deceptive tactics.</p>
</li>
<li>
<p><strong>Distributional Bias</strong>: Talking about some groups of people in an unfairly different way than others.</p>
<p>Group 4 agreed that, in order to capture unfairness with red-teaming, we should first identify the categories where unfairness or bias might come from. Then we can generate prompts for the different possible bias categories (gender, race, etc.) and collect responses from LLMs using the red-teaming technique; the unfairness or bias may appear in the responses. We can then mask the bias-related terms to see whether the generated answers reflect distributional bias. However, there may be hidden categories that are not pre-identified, and how to capture these categories with potential distributional bias is still an open question.</p>
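<p>As an illustration of this idea, one could instantiate a prompt template across pre-identified categories and compare the responses for systematic differences. The template, the category list, and <code>target_lm</code> below are examples and assumptions only.</p>
<pre><code>TEMPLATE = "Describe the career prospects of a {group} person."
GROUPS = ["young", "elderly", "male", "female"]  # example categories only

def bias_probe(target_lm):
    # Collect one response per category so they can be compared
    # (e.g., by sentiment or content) for distributional bias.
    return {g: target_lm.generate(TEMPLATE.format(group=g)) for g in GROUPS}
</code></pre>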
</li>
<li>
<p><strong>Conversational Harms</strong>: Offensive language that occurs in the context of a long dialogue, for example.</p>
<p>Group 5 discussed the concept of conversational harms, focusing on how biases can emerge in AI models, such as a generic GPT model, during conversations. They highlighted that even though these models may start with no bias or predefined opinions, they can develop attitudes or biases towards specific topics based on the information provided during conversations. These biases can lead to harmful outcomes, such as making inappropriate or offensive judgments about certain groups of people. The group suggested that this happens because the models rely heavily on the conversational context they receive, rather than only on their initial training data.</p>
</li>
</ol>
<h3 id="how-to-use-red-teaming">How to use Red-Teaming?</h3>
<p>After the in-class activity, we also discussed the potential use of red-teaming from the following perspectives:</p>
<ul>
<li>
<p><strong>Blacklisting Phrases</strong>: Through red-teaming, repeatedly occurring offensive phrases can be identified; for recurring cases, certain words and phrases can be removed.</p>
</li>
<li>
<p><strong>Removing Training Data</strong>: Identifying topics where model responses are misaligned can point to the responsible training data, helping to locate root causes of biases, discriminatory statements, and other undesirable outputs.</p>
</li>
<li>
<p><strong>Augmenting Prompts</strong>: Attack success can be minimized by adding certain phrases to the prompts.</p>
</li>
<li>
<p><strong>Multi-Objective Loss</strong>: While fine-tuning a model, a loss penalty can be associated with harmful output, which red-teaming helps identify.</p>
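<p>A minimal sketch of this last idea: during fine-tuning, a harmfulness penalty (informed by red-teaming results) is added to the usual language-modeling loss. The <code>harm_score</code> helper and the weighting are assumptions.</p>
<pre><code>def multi_objective_loss(lm_loss, generated_text, harm_score, penalty_weight=1.0):
    # Total loss = standard language-modeling loss + weighted harmfulness penalty.
    # harm_score is assumed to return a differentiable (or REINFORCE-style)
    # score in [0, 1] for the generated text.
    return lm_loss + penalty_weight * harm_score(generated_text)
</code></pre>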
</li>
</ul>
<h2 id="alignment-solutions">Alignment Solutions</h2>
<p>During today’s discussion, the lead team introduced two distinct alignment challenges:</p>
<ul>
<li>
<p><strong>Inner Alignment</strong>: This pertains to aligning the objective a trained model actually ends up pursuing with the objective specified by its loss function; this challenge arises even when the loss function itself is straightforward to design.</p>
</li>
<li>
<p><strong>Outer Alignment</strong>: This involves aligning the specified objective (the loss function) with the desired end goal, which is especially hard when designing an appropriate loss function is complex.</p>
</li>
</ul>
<p>Later in our discussion, we delved into the technical details of the LLM jailbreaking paper “<a href="https://arxiv.org/abs/2307.15043">Universal and Transferable Adversarial Attacks on Aligned Language Models</a>” and explored interesting findings presented during the demonstration video.</p>
<h3 id="llm-jailbreaking---introduction">LLM Jailbreaking - Introduction</h3>
<p>This paper introduces a new adversarial attack method that can induce aligned LLMs to produce objectionable content. Specifically, given a (potentially harmful) user query, the attacker appends an adversarial suffix to the query that attempts to induce the model to comply, i.e., to jailbreak it.</p>
<p>To choose these adversarial suffix tokens, the proposed jailbreaking method involves three simple key components whose careful combination leads to reliably successful attacks:</p>
<ol>
<li><strong>Producing Affirmative Responses</strong>
One method for inducing objectionable behavior in language models is to force the model to begin with a brief, affirmative response to the harmful query. For example, the authors target the model and force it to respond with “Sure, here is (content of query)”. Consistent with prior research, the authors observe that steering the initial response in this way triggers a specific ‘mode’ in the model, leading it to generate the objectionable content immediately thereafter, as illustrated in the figure below:</li>
</ol>
<center>
<a href="/images/week2/bomb.png"><img src="/images/week2/bomb.png" width="80%"></a><br>
<p>Figure 2: Adversarial Suffix (<a href="https://arxiv.org/pdf/2307.15043.pdf">Image Source</a>)</p>
</center>
<ol start="2">
<li>
<p><strong>Greedy Coordinate Gradient (GCG)-based Search</strong></p>
<p>As optimizing the log-likelihood of the attack succeeding over the <em>discrete</em> adversarial suffix is quite challenging, the authors, similar to AutoPrompt, propose leveraging token-level gradients to 1) identify a set of promising single-token replacements, 2) evaluate the loss of some number of candidates in this set, and 3) select the best of the evaluated substitutions, as presented in the figure below:</p>
<center>
<a href="/images/week2/gcg.png"><img src="/images/week2/gcg.png" width="80%"></a><br>
<p>Figure 3: Greedy Coordinate Gradient (GCG) (<a href="https://arxiv.org/pdf/2307.15043.pdf">Image Source</a>)</p>
</center>
<p><strong>Intuition behind GCG-based Search</strong>:</p>
<p>The motivation derives directly from the greedy coordinate descent approach: if one could evaluate all possible single-token substitutions, one would simply swap in the token that maximally decreases the loss. Though evaluating all such replacements is not feasible, one can leverage gradients with respect to the one-hot token indicators to find a set of promising candidates for replacement at each token position, and then evaluate all these replacements exactly via a forward pass. A simplified sketch of one such iteration appears below, after the comparison with AutoPrompt.</p>
<p><strong>Key differences from AutoPrompt</strong>:</p>
<ul>
<li>GCG-based Search: Searches a set of possible tokens to replace at each position.</li>
<li>AutoPrompt: Only chooses a single coordinate to adjust, then evaluates replacements just for that one position.</li>
</ul>
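<p>Below is a simplified, single-prompt sketch of one GCG iteration in PyTorch. The <code>suffix_loss</code> helper (mapping suffix embeddings to a scalar attack loss) is assumed, candidates are evaluated one at a time rather than in a batch, and other details of the authors’ implementation are omitted.</p>
<pre><code>import torch

def gcg_step(suffix_ids, embedding_matrix, suffix_loss, top_k=256, n_candidates=512):
    # One (simplified, single-prompt) Greedy Coordinate Gradient iteration.
    # suffix_ids: LongTensor of current suffix token ids; embedding_matrix: [vocab, dim];
    # suffix_loss: assumed helper mapping suffix embeddings to a scalar attack loss.
    one_hot = torch.nn.functional.one_hot(suffix_ids, embedding_matrix.shape[0]).float()
    one_hot.requires_grad_(True)
    loss = suffix_loss(one_hot @ embedding_matrix)  # loss as a function of the suffix
    loss.backward()
    # Top-k most promising replacement tokens per position (largest negative gradient).
    top_tokens = (-one_hot.grad).topk(top_k, dim=1).indices

    # Sample candidates that each swap one random position to a random top-k token,
    # then evaluate their true losses with forward passes and keep the best one.
    best_ids, best_loss = suffix_ids, float("inf")
    for _ in range(n_candidates):
        pos = torch.randint(len(suffix_ids), (1,)).item()
        tok = top_tokens[pos, torch.randint(top_k, (1,)).item()]
        candidate = suffix_ids.clone()
        candidate[pos] = tok
        with torch.no_grad():
            cand_loss = suffix_loss(embedding_matrix[candidate]).item()
        if cand_loss &lt; best_loss:
            best_ids, best_loss = candidate, cand_loss
    return best_ids, best_loss
</code></pre>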
</li>
</ol>
<h2 id="heading"></h2>
<ol start="3">
<li><strong>Robust Universal Multi-prompt and Multi-model Attacks</strong></li>
</ol>
<center>
<a href="/images/week2/universal.png"><img src="/images/week2/universal.png" width="80%"></a><br>
<p>Figure 4: Universal Prompt Optimization (<a href="https://arxiv.org/pdf/2307.15043.pdf">Image Source</a>)</p>
</center>
<p>The core idea of the universal multi-prompt and multi-model attack is to involve multiple target prompts and multiple victim LLMs in the optimization, so that the generated adversarial suffix transfers across victim LLMs and remains robust across prompts. Building upon Algorithm 1, the authors propose Algorithm 2, in which loss functions over multiple models are incorporated to achieve transferability, and a handful of prompts are employed to encourage robustness.</p>
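<p>A rough sketch of the objective behind this idea: the attack loss for a candidate suffix is aggregated over several victim models and several harmful prompts, so the same suffix is optimized to work across all of them (in the paper, prompts are incorporated incrementally rather than all at once). The <code>attack_loss</code> helper is an assumed single-pair loss function.</p>
<pre><code>def universal_loss(suffix_ids, models, prompts, attack_loss):
    # Aggregate the attack loss over every (victim model, harmful prompt) pair,
    # so a single adversarial suffix is pushed to succeed across all of them.
    # attack_loss(model, prompt, suffix_ids) is an assumed single-pair loss helper.
    total = 0.0
    for model in models:
        for prompt in prompts:
            total += attack_loss(model, prompt, suffix_ids)
    return total
</code></pre>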
<p>The whole pipeline is illustrated in the figure below:</p>
<center>
<a href="/images/week2/illustration.png"><img src="/images/week2/illustration.png" width="80%"></a><br>
<p>Figure 5: Illustration of Aligned LLMs Are Not Adversarially Aligned (<a href="https://arxiv.org/pdf/2307.15043.pdf">Image Source</a>)</p>
</center>
<ol start="4">
<li><strong>Experiment Results</strong></li>
</ol>
<ul>
<li>
<p>Single Model Attack</p>
<p>The following results show that the baseline methods fail to elicit harmful behavior on both Vicuna-7B and LLaMA-2-7B-Chat, whereas the proposed GCG is effective on both. The figure below also illustrates that GCG quickly finds an adversarial example with small loss and continues to make gradual improvements over the remaining steps, resulting in a steadily decreasing loss and an increasing attack success rate (ASR).</p>
</li>
</ul>
<center>
<a href="/images/week2/performance.png"><img src="/images/week2/performance.png" width="80%"></a><br>
<p>Figure 6: Performance Comparison of Different Optimizers (<a href="https://arxiv.org/pdf/2307.15043.pdf">Image Source</a>)</p>
</center>
<ul>
<li>
<p>Transfer Attack
The adversarial suffix generated by GCG can also successfully transfer to other LLMs, whether they are open-source models or black-box LLMs. The authors compared different strategies for constructing the prompt, including adding “Sure, here’s” at the end of the prompt, concatenating multiple suffixes, ensembling multiple suffixes and choosing a successful one, and manual fine-tuning, which manually rephrases the human-readable prompt content. Examples of the transfer attack are shown in the following figure.</p>
<p>There are also some concerns about the proposed method. For example, concatenating multiple suffixes can help mislead the model, but it can also push the original prompt too far back in the text for the model to generate a relevant response.</p>
</li>
</ul>
<center>
<a href="/images/week2/screenshots.png"><img src="/images/week2/screenshots.png" width="80%"></a><br>
<p>Figure 7: Screenshots of Harmful Content Generation (<a href="https://arxiv.org/pdf/2307.15043.pdf">Image Source</a>)</p>
</center>
<h3 id="llm-jailbreaking---demo">LLM Jailbreaking - Demo</h3>
<p>The lead team also showed a small demo that runs the jailbreaking attack from this paper on UVA’s computing servers. The demo can be found in this <a href="https://www.youtube.com/watch?v=6H5fzZFZNiU">YouTube video</a>:</p>
<center>
<iframe width="560" height="315" src="https://www.youtube-nocookie.com/embed/6H5fzZFZNiU?si=Vc1uNu6f3FEE4vUo" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
</center>
<p>Some useful links if you would like to try it out yourself:</p>
<ul>
<li><a href="https://github.com/llm-attacks/llm-attacks">Implementation of the GCG jailbreaking attack</a></li>
<li><a href="https://ai.meta.com/resources/models-and-libraries/llama-downloads">Request Llama-2 model weights (takes 1-4 days)</a></li>
<li><a href="https://www.cs.virginia.edu/wiki/doku.php?id=compute_resources">UVA CS Computing Resources</a></li>
</ul>
<h4 id="observations">Observations</h4>
<p>Two main observations:</p>
<ol>
<li>The loss continues to drop very well, which aligns with the authors’ observations in the paper.</li>
<li>Qualitatively, the generated suffix at each step also loosely corresponds to what a human might have written, for example, trying to instruct the model to perform a specific task related to the prompt itself.</li>
</ol>
<h4 id="potential-improvement-ideas">Potential Improvement Ideas</h4>
<p><strong>How to make the attack more effective or appealing?</strong></p>
<ol>
<li>
<p>Use different languages: since the training corpus of GPT-3.5 contains relatively little multilingual content, the alignment measures taken for the model are almost entirely in English. Thus, forming the instruction in a different language might circumvent the protections and produce inappropriate responses. The lead team gave a successful example using German instructions.</p>
</li>
<li>
<p>Prompt with an opposite goal, making it sound like the intentions are positive.
Instead of directly prompting the LLM to generate harmful content, we prompt the model not to generate harmful content, which makes the request sound positive. The lead team gave a successful example in which the model wrote a convincing anonymous death threat in the style of William Shakespeare.</p>
</li>
<li>
<p>Replacing special tokens in the suffix with human-readable and comprehensible words. The lead team tried restricting the suffix vocabulary to alphabetic characters only and found that this does not work, which suggests that special tokens play an important role in confusing the model into fulfilling whatever the attacker wants.</p>
</li>
</ol>
<h2 id="closing-remarks-by-prof-evans">Closing Remarks (by Prof. Evans)</h2>
<p>One thing worth thinking about is what the real threat model is here. The examples shown in this paper, such as how to make a bomb or write an anonymous threat, are interesting but might not be viewed as real threats by many people. If someone wants to find out how to make a bomb, they can Google for it (or, if Google decides to block it, use another search engine, or even go to a public library!).</p>
<p>Maybe a more practical attack scenario occurs as LLMs are embedded in
applications (or connected to plugins) that have the ability to
perform actions that may be influenced by text that the adversary has
some control over. For example, everyone (well almost everyone!) wants
an LLM that can automatically provide good responses to most of their
email. Such an application would necessarily have access to all your
sensitive incoming email, as well as the ability to send outgoing
emails, so perhaps a malicious adversary could craft an email to send
to a victim that would trick the LLM processing it as well as all of
your other email to send sensitive information from your emails to the
attacker, or to generate spearphishing emails based on content in your
email and send them with your credentials to easily identified
contacts. Although the threats discussed in these red teaming papers
mostly seem impractical and lack real victims, they still serve as
interesting proxies for what may be real threats in the near future
(if not already).</p>
<p><a href="#table-of-contents">Back to top</a></p>
<h1 id="readings">Readings</h1>
<ol>
<li><a href="https://aligned.substack.com/p/what-is-alignment">What is the alignment problem?</a> - Blog post by Jan Leike</li>
<li><a href="https://arxiv.org/abs/2209.07858">Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned</a> Ganguli, Deep, Liane Lovitt, Jackson Kernion, Amanda Askell, Yuntao Bai, Saurav Kadavath, Ben Mann et al. arXiv preprint arXiv:2209.07858 (2022).</li>
<li><a href="https://arxiv.org/pdf/2209.00626.pdf">The Alignment Problem from a Deep Learning Perspective</a> Richard Ngo, Lawrence Chan, and Sören Mindermann. arXiv preprint arXiv:2209.00626 (2022).</li>
<li><a href="https://arxiv.org/abs/2307.15043">Universal and Transferable Adversarial Attacks on Aligned Language Models</a> Zou, Andy, Zifan Wang, J. Zico Kolter, and Matt Fredrikson. arXiv preprint arXiv:2307.15043 (2023).</li>
</ol>
<h2 id="optional-additional-readings">Optional Additional Readings</h2>
<h3 id="background--motivation">Background / Motivation</h3>
<ul>
<li>Exploring how neural circuits collect meaningful information: <a href="https://distill.pub/2020/circuits/zoom-in/">https://distill.pub/2020/circuits/zoom-in/</a> (suggested if you are less experienced with ML)</li>
<li>What could solutions to the alignment problem look like? Blog post - <a href="https://aligned.substack.com/p/alignment-solution">https://aligned.substack.com/p/alignment-solution</a></li>
<li>OpenAI’s July SuperAlignment Announcement <a href="https://openai.com/blog/introducing-superalignment">https://openai.com/blog/introducing-superalignment</a></li>
</ul>
<h3 id="alignment-readings">Alignment Readings</h3>
<ul>
<li><a href="https://arxiv.org/pdf/2210.01790.pdf">Goal misgeneralization: Why correct specifications aren’t enough for correct goals.</a> Shah, Rohin, Vikrant Varma, Ramana Kumar, Mary Phuong, Victoria Krakovna, Jonathan Uesato, and Zac Kenton. arXiv preprint arXiv:2210.01790 (2022).</li>
<li><a href="https://arxiv.org/abs/1611.08219">The Off-Switch Game.</a> Hadfield-Menell, Dylan, Anca Dragan, Pieter Abbeel, and Stuart Russell. In International Joint Conferences on Artificial Intelligence Organization. 2017.</li>
<li><a href="https://proceedings.neurips.cc/paper/2017/hash/32fdab6559cdfa4f167f8c31b9199643-Abstract.html">Inverse Reward Design</a> Hadfield-Menell, Dylan, Smitha Milli, Pieter Abbeel, Stuart J. Russell, and Anca Dragan. Advances in Neural Information Processing Systems (2017).</li>
<li>Inner alignment: (sections 1, 3, 4): <a href="https://arxiv.org/abs/1906.01820">Risks from Learned Optimization in Advanced Machine Learning Systems</a> Hubinger, Evan, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, and Scott Garrabrant. arXiv preprint arXiv:1906.01820 (2019).</li>
<li>Outer alignment: <a href="https://www.lesswrong.com/posts/33EKjmAdKFn3pbKPJ/outer-alignment-and-imitative-amplification">Outer alignment and imitative amplification</a> - Blog post by Evan Hubinger.</li>
<li><a href="https://arxiv.org/abs/2107.10939">What are you optimizing for? Aligning Recommender Systems with Human Values</a> Stray, Jonathan, Ivan Vendrov, Jeremy Nixon, Steven Adler, and Dylan Hadfield-Menell. arXiv preprint arXiv:2107.10939 (2021).</li>
<li><a href="https://openai.com/research/instruction-following">Training language models to follow instructions with human feedback</a> Ouyang, Long, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang et al. Advances in Neural Information Processing Systems (2022).</li>
</ul>
<h3 id="adversarial-attacks--jailbreaking">Adversarial Attacks / Jailbreaking</h3>
<ul>
<li><a href="https://arxiv.org/pdf/2306.15447.pdf">Are aligned neural networks adversarially aligned?</a> Carlini, Nicholas, Milad Nasr, Christopher A. Choquette-Choo, Matthew Jagielski, Irena Gao, Anas Awadalla, Pang Wei Koh et al. arXiv preprint arXiv:2306.15447 (2023).</li>
<li><a href="https://arxiv.org/pdf/2302.09923.pdf">Prompt Stealing Attacks Against Text-to-Image Generation Models</a> Shen, Xinyue, Yiting Qu, Michael Backes, and Yang Zhang. arXiv preprint arXiv:2302.09923 (2023).</li>
<li><a href="https://www.deepmind.com/blog/red-teaming-language-models-with-language-models">Red-teaming LLMs</a> - Blog post by Deepmind</li>
</ul>
<h1 id="discussion-questions-1">Discussion Questions</h1>
<p>Before 5:29pm on Sunday, September 3, everyone who is not in either the lead or blogging team for the week should post (in the comments below) an answer to at least one of these four questions in the first section (1–4) and one of the questions in the second section (4–8), or a substantive response to someone else’s comment, or something interesting about the readings that is not covered by these questions.</p>
<p>Don’t post duplicates - if others have already posted, you should read their responses before adding your own.</p>
<p>Please post your responses to different questions as separate comments.</p>