<!doctype html>
<html class="no-js" lang="en-us">
<head>
<meta name="generator" content="Hugo 0.121.1">
<meta charset="utf-8">
<title>Risks (and Benefits) of Generative AI and Large Language Models</title>
<meta http-equiv="x-ua-compatible" content="ie=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<link rel="stylesheet" href="https://llmrisks.github.io/css/foundation.min.css">
<link rel="stylesheet" href="https://llmrisks.github.io/css/highlight.min.css">
<link rel="stylesheet" href="https://llmrisks.github.io/css/font-awesome.min.css">
<link rel="stylesheet" href="https://llmrisks.github.io/css/academicons.min.css">
<link rel="stylesheet" href="https://llmrisks.github.io/css/fonts.css">
<link rel="stylesheet" href="https://llmrisks.github.io/css/finite.css">
<link rel="shortcut icon" href="/images/uva.png">
<link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/[email protected]/dist/katex.min.css" integrity="sha384-AfEj0r4/OFrOo5t7NnNe46zW/tFgW6x/bCJG8FqQCEo3+Aro6EYUG4+cU+KJWu/X" crossorigin="anonymous">
<script defer src="https://cdn.jsdelivr.net/npm/[email protected]/dist/katex.min.js" integrity="sha384-g7c+Jr9ZivxKLnZTDUhnkOnsh30B4H0rpLUpJ4jAIKs4fnJI+sEnkvrMWph2EDg4" crossorigin="anonymous"></script>
<script defer src="https://cdn.jsdelivr.net/npm/[email protected]/dist/contrib/auto-render.min.js" integrity="sha384-mll67QQFJfxn0IYznZYonOWZ644AWYC+Pt2cHqMaRhXVrursRwvLnLaebdGIlYNa" crossorigin="anonymous"
onload="renderMathInElement(document.body);"></script>
<script>
document.addEventListener("DOMContentLoaded", function() {
renderMathInElement(document.body, {
delimiters: [
{left: "$$", right: "$$", display: true},
{left: "$", right: "$", display: false}
]
});
});
</script>
</head>
<body>
<header>
<nav class="nav-bar">
<div class="title-bar" data-responsive-toggle="site-menu" data-hide-for="medium">
<button class="site-hamburger" type="button" data-toggle>
<i class="fa fa-bars fa-lg" aria-hidden="true"></i>
</button>
<div class="title-bar-title site-title">
<a href="https://llmrisks.github.io/">Risks (and Benefits) of Generative AI and Large Language Models</a>
</div>
<div class="title-bar-right pull-right">
</div>
</div>
<div class="top-bar" id="site-menu" >
<div class="top-bar-title show-for-medium site-title">
<a href="https://llmrisks.github.io/">Risks (and Benefits) of Generative AI and Large Language Models</a>
</div>
<div class="top-bar-left">
<ul class="menu vertical medium-horizontal">
</ul>
</div>
<div class="top-bar-right show-for-medium">
<p class="groupstyle">Computer Science</br>University of Virginia</p>
</div>
</div>
</nav>
</header>
<main>
<div class="container">
<div class="sidebar">
<p>
University of Virginia<br>
cs6501 Fall 2023<br>
Risks and Benefits of Generative AI and LLMs
</p>
<p>
<p>
<a href="/syllabus"><b>Syllabus</b></a></br>
<a href="/schedule"><b>Schedule</b></a></br>
<a href="/readings"><b>Readings and Topics</b></a></br>
<p></p>
<a href="https://github.com/llmrisks/discussions/discussions/">Discussions</a><br>
<p>
<p></p>
</p>
<p>
<b><a href="/post/">Recent Posts</a></b>
<div class="posttitle">
<a href="/summary/">Summary of Semester</a>
</div>
<div class="posttitle">
<a href="/week14b/">Week 14b: Ethical AI</a>
</div>
<div class="posttitle">
<a href="/week14a/">Week 14a: Multimodal Models</a>
</div>
<div class="posttitle">
<a href="/week13/">Week 13: Regulating Dangerous Technologies</a>
</div>
<div class="posttitle">
<a href="/week12/">Week 12: LLM Agents</a>
</div>
<div class="posttitle">
<a href="/week11/">Week 11: Watermarking on Generative Models</a>
</div>
<div class="posttitle">
<a href="/week10/">Week 10: Data Selection for LLMs</a>
</div>
<div class="posttitle">
<a href="/week9/">Week 9: Interpretability</a>
</div>
<div class="posttitle">
<a href="/week8/">Week 8: Machine Translation</a>
</div>
<div class="posttitle">
<a href="/week7/">Week 7: GANs and DeepFakes</a>
</div>
<div class="posttitle">
<a href="/post/"><em>More...</em></a>
</div>
</p>
<p>
</p>
<p>
</div>
<div class="content">
<h1><a href="/summary/">Summary of Semester</a></h1>
<div class="post-metadata">
<span class="post-date">
<time datetime="2023-12-12 00:00:00 +0000 UTC" itemprop="datePublished">12 December 2023</time>
</span>
</div>
<div class="post-body" itemprop="articleBody">
<p>Here’s a summary of the topics for the semester:</p>
<p><a href="/week1">Week 1: Introduction</a></p>
<ul>
<li>Attention, Transformers, and BERT</li>
<li>Training LLMs, Risks and Rewards</li>
</ul>
<p><a href="/week2">Week 2: Alignment</a></p>
<ul>
<li>Introduction to AI Alignment and Failure Cases</li>
<li>Redteaming</li>
<li>Jail-breaking LLMs</li>
</ul>
<p><a href="/week3">Week 3: Prompting and Bias</a></p>
<ul>
<li>Prompt Engineering</li>
<li>Marked Personas</li>
</ul>
<p><a href="/week4">Week 4: Capabilities of LLMs</a></p>
<ul>
<li>LLM Capabilities</li>
<li>Medical Applications of LLMs</li>
</ul>
<p><a href="/week5">Week 5: Hallucination</a></p>
<ul>
<li>Hallucination Risks</li>
<li>Potential Solutions</li>
</ul>
<p>Week 6: Visit from <a href="https://www.korinek.com/">Anton Korinek</a></p>
<p><a href="/week7">Week 7: Generative Adversarial Networks and DeepFakes</a></p>
<ul>
<li>GANs and DeepFakes</li>
<li>Creation and Detection of DeepFake Videos</li>
</ul>
<p><a href="/week8">Week 8: Machine Translation</a></p>
<ul>
<li>History of Machine Translation</li>
<li>Neural Machine Translation</li>
</ul>
<p><a href="/week9">Week 9: Interpretability</a></p>
<ul>
<li>Introduction to Interpretability</li>
<li>Mechanistic Interpretability</li>
</ul>
<p><a href="/week10">Week 10: Data for Training</a></p>
<ul>
<li>Data Selection for Fine-tuning LLMs</li>
<li>Detecting Pretraining Data from Large Language Models</li>
<li>Impact of Data on Large Language Models</li>
<li>The Curse of Recursion: Training on Generated Data Makes Models Forget</li>
</ul>
<p><a href="/week11">Week 11: Watermarking</a></p>
<ul>
<li>Watermarking LLM Outputs</li>
<li>Watermarking Diffusion Models</li>
</ul>
<p><a href="/week12">Week 12: LLM Agents</a></p>
<ul>
<li>LLM Agents</li>
<li>Tools and Planning</li>
</ul>
<p><a href="/week13">Week 13: Regulating Dangerous Technologies</a></p>
<ul>
<li>Analogies from other technologies for regulating AI</li>
</ul>
<p><a href="/week14a">Week 14a: Multimodal Models</a><br>
<a href="/week14b">Week 14b: Ethical AI</a></p>
</div>
<hr class="post-separator"></hr>
<h1><a href="/week14b/">Week 14b: Ethical AI</a></h1>
<div class="post-metadata">
<span class="post-date">
<time datetime="2023-12-04 00:00:00 +0000 UTC" itemprop="datePublished">4 December 2023</time>
</span>
</div>
<div class="post-body" itemprop="articleBody">
<p><author>Presenting Team: Aparna Kishore, Elena Long, Erzhen Hu, Jingping Wan</author><br>
<author>Blogging Team: Haolin Liu, Haochen Liu, Ji Hyun Kim, Stephanie Schoch, Xueren Ge</author></p>
<p>Note: since the topics were unrelated, Week 14 is split into two posts:</p>
<ul>
<li><a href="/week14a">Monday, November 27: Multimodal Models</a></li>
<li><a href="/week14b">Wednesday, November 29: Ethical AI</a></li>
</ul>
<h1 id="wednesday-november-29-ethical-ai">Wednesday, November 29: Ethical AI</h1>
<table><tr>
<td><img src="../images/week14/day1/A.jpg" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<blockquote>
<p>Ben Shneiderman. <a href="https://dl.acm.org/doi/abs/10.1145/3419764"><em>Bridging the Gap Between Ethics and Practice: Guidelines for Reliable, Safe, and Trustworthy Human-centered AI Systems</em></a>. ACM Transactions on Interactive Intelligent Systems, October 2020. <a href="https://dl.acm.org/doi/abs/10.1145/3419764">PDF</a></p>
</blockquote>
<table><tr>
<td><img src="../images/week14/day1/B.jpg" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>Today’s topic is ethical AI, with a focus on human-centered AI (HCAI). From this perspective, AI is seen as amplifying the performance of humans.</p>
<p>Central to HCAI is the need for systems that are reliable, safe, and trustworthy, achieved through collaboration among software engineers, companies, government, and society as a whole.</p>
<table><tr>
<td><img src="../images/week14/day1/C.jpg" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<ol>
<li>Reliable Systems: Software Engineering</li>
<li>Safety Culture: Organizational Design</li>
<li>Trustworthy Certification: External Reviews</li>
</ol>
<table><tr>
<td><img src="../images/week14/day1/D.jpg" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>Things that should be considered when developing ethical AI:</p>
<ol>
<li>Data quality</li>
<li>Training log analysis</li>
<li>Privacy and security of data</li>
</ol>
<p>Example: a flight data recorder (FDR) provides quantitative benchmarks for judging whether a plane is safe and stable, which can help in designing the next generation of products.</p>
<p>Analogy of the FDR to AI: we could get quantitative feedback on the product or strategy we want to test: what data do we need, how do we analyze log data (or select useful data from operation logs), how do we protect data from attack, and so on.</p>
<p>Through a similar approach, we can show that an AI system is safe through testing and logs, rather than just asking people to take our word for it.</p>
<table><tr>
<td><img src="../images/week14/day1/E.jpg" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>Software engineering workflows: an AI workflow requires <em>goal-aligned updates</em>.</p>
<p>Verification and validation testing:</p>
<ol>
<li>Design tests that align with expectations and prevent harms.</li>
<li>The goals of AI systems are more general and high-level than those of traditional software, so tests should be designed around user expectations rather than solely the technical details.</li>
</ol>
<p>Bias testing to enhance fairness:</p>
<ol>
<li>Test training data for opacity, scale, harm.</li>
<li>Use specialized tools for continuous monitoring.</li>
<li>After we have a trained model, we still need testing to check its risks, and may need a dedicated team in the organization, or an external company, to test the model's safety on a continuous basis.</li>
</ol>
<table><tr>
<td><img src="../images/week14/day1/F.jpg" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>Explainable user interfaces:</p>
<ol>
<li>Are difficult to achieve</li>
<li>Ensure system explainability for user understanding, meeting legal requirements</li>
<li>Intrinsic and post hoc explanations aid developer improvement.</li>
<li>Design a comprehensive user interface, considering user sentiments</li>
<li>Post hoc explanations: require no information about the technical details of the model, but rather give a broad-level idea of the system.</li>
</ol>
<table><tr>
<td><img src="../images/week14/day1/G.jpg" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>There are five principles to build safety cultures, which are mostly top-down approaches (see slides).</p>
<p>Leadership: create a safety-focused team and make a commitment to safety that is visible to employees, so they know leaders are committed to it.</p>
<p>Long-term investment: developing safe models requires investing in developers over the long term.</p>
<p>The public can help monitor and improve safety by creating external pressure, so companies work harder to eliminate issues.</p>
<table><tr>
<td><img src="../images/week14/day1/H.jpg" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>Internal review boards engage stakeholders in setting benchmarks, addressing problems, and planning for the future.</p>
<table><tr>
<td><img src="../images/week14/day1/I.jpg" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<table><tr>
<td><img src="../images/week14/day1/J.jpg" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>Trustworthy certification by independent oversight:</p>
<ul>
<li>
<p>Purpose: Ensure continuous improvement for reliable, safe products. Helps to make a complete, trustworthy system.</p>
</li>
<li>
<p>Requirements: Respected leaders, conflict declaration, diverse membership.</p>
</li>
<li>
<p>Capacity: Examine private data, conduct interviews, issue subpoenas for evidence.</p>
</li>
</ul>
<table><tr>
<td><img src="../images/week14/day1/K.jpg" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>Independent oversight is structured around three core methods:</p>
<ol>
<li>Planning</li>
<li>Monitoring</li>
<li>Conducting reviews or retrospectives</li>
</ol>
<table><tr>
<td><img src="../images/week14/day1/L.jpg" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>There are five paths to trustworthy certification:</p>
<ol>
<li>
<p>Government: policy and regulation, aligning with the EU&rsquo;s seven key principles (listed at the top right) for transparency, reliability, safety, privacy, and fairness</p>
</li>
<li>
<p>Accounting Firms: Beyond the internal audits mentioned previously, external bodies should audit the entire industry</p>
</li>
<li>
<p>Insurance Companies: Adapting policies for emerging technologies like self-driving cars (details on next slide)</p>
</li>
<li>
<p>Non-government organizations: prioritizing the public’s interest</p>
</li>
<li>
<p>Professional organizations and research institutes</p>
</li>
</ol>
<table><tr>
<td><img src="../images/week14/day1/M.jpg" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<table><tr>
<td><img src="../images/week14/day1/N.jpg" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>As an activity, we tried a role-playing exercise in which each group played a different role and thought about the following 15 principles in terms of “ethical AI”.</p>
<p>Ethical Team:</p>
<ol>
<li>Diagnosis of skin cancer: ensure dataset quality is reliable (bias in skin color; state laws governing data collection)</li>
<li>Use various metrics for evaluating the AI</li>
<li>Come to an agreement with patients and doctors</li>
</ol>
<p>Healthcare Management/Organization:</p>
<ol>
<li>Reporting failures (missed diagnoses) for feedback</li>
<li>Data security; gathering false positive (FP) and false negative (FN) cases for further training</li>
<li>Educating staff</li>
<li>Establishing an accuracy/certainty threshold for AI diagnosis of skin cancer, and checking the standard of professional verification</li>
</ol>
<p>Independent oversight committee:</p>
<ol>
<li>Check that the dataset is unbiased at every stage and represents all races, genders, etc.</li>
<li>Data sources should be considered carefully (online, hospital)</li>
<li>Model explanation and transparency should be considered</li>
<li>Privacy of personal information of both the dataset and the users</li>
</ol>
<table><tr>
<td><img src="../images/week14/day1/O.jpg" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>There are 15 principles each group can take into consideration for the role-playing discussion.</p>
<table><tr>
<td><img src="../images/week14/day1/P.jpg" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>Reorienting technical R&D emphasizes oversight, robustness, interpretability, inclusivity, risk assessment, and addressing emerging challenges.</p>
<p>Proposed governance measures include enforcing standards to prevent misuse, requiring registration of frontier systems, implementing whistleblower protections, and creating national and international safety standards. Additionally, the accountability of frontier AI developers and owners, along with AI companies promptly disclosing if-then commitments, is highlighted.</p>
<table><tr>
<td><img src="../images/week14/day1/Q.jpg" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>There are several ethical platforms for developing responsible AI products:</p>
<ol>
<li>SUM Values: provide a framework for the moral scope of an AI product</li>
<li>FAST Track Principles: ensure an AI project is fair, bias-mitigating, and reliable</li>
<li>PBG Framework: set up a transparent process for the AI product</li>
</ol>
<table><tr>
<td><img src="../images/week14/day1/R.jpg" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>Putting the ethical platform into practice requires three key steps: reflect, act, and justify.</p>
<ol>
<li>Reflect using the SUM Values: ask and answer questions about ethical purposes and assess the impacts of the AI project.</li>
<li>Act using the FAST Track Principles: ensure every step of development produces safe, fair AI innovation.</li>
<li>Justify using the PBG Framework: set up a governance process to ensure model transparency.</li>
</ol>
<table><tr>
<td><img src="../images/week14/day1/S.jpg" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<h4 id="team-1">Team 1</h4>
<p>There are many trajectories that AI development could take, so it would be very difficult to completely discount something as a possibility. The group related this to the book “Dark Matter” by Blake Crouch.</p>
<p>Risk would primarily come from bad actors (specifically humans). The group briefly touched on the question, ‘what if the bad actor is the AI?’</p>
<h4 id="team-2">Team 2</h4>
<p>The potential downfall of humans would not be due to AI’s maliciousness.</p>
<p>In the post-autonomous era, concerns shift to the misuse of models for harmful purposes.</p>
<h4 id="team-3">Team 3</h4>
<p>The scenario raised in the second question seems to be happening already.</p>
<p>Given the rapid technological progress in recent years, a single prompt can result in losing control over an AI model, and speculation around ‘Q* (Q-Star)’ suggests a risk of losing control over AI models; however, AI’s power-seeking behavior may still be overstated.</p>
<h2 id="readings">Readings</h2>
<ul>
<li><strong><code>Required</code></strong>: Yoshua Bengio, Geoffrey Hinton, Andrew Yao, Dawn Song, Pieter Abbeel, Yuval Noah Harari, Ya-Qin Zhang, Lan Xue, Shai Shalev-Shwartz, Gillian Hadfield, Jeff Clune, Tegan Maharaj, Frank Hutter, Atılım Güneş Baydin, Sheila McIlraith, Qiqi Gao, Ashwin Acharya, David Krueger, Anca Dragan, Philip Torr, Stuart Russell, Daniel Kahneman, Jan Brauner, Sören Mindermann. <a href="https://arxiv.org/abs/2310.17688">Managing AI Risks in an Era of Rapid Progress.</a> arXiv 2023. <a href="https://arxiv.org/abs/2310.17688">PDF</a></li>
<li><strong><code>Required</code></strong>: Ben Shneiderman. <a href="https://dl.acm.org/doi/abs/10.1145/3419764">Bridging the Gap Between Ethics and Practice: Guidelines for Reliable, Safe, and Trustworthy Human-centered AI Systems.</a> ACM Transactions on Interactive Intelligent Systems, October 2020. <a href="https://dl.acm.org/doi/abs/10.1145/3419764">PDF</a></li>
<li><strong><code>Optional</code></strong>: David Leslie. <a href="https://arxiv.org/abs/1906.05684">Understanding Artificial Intelligence Ethics And Safety.</a> arXiv 2019. <a href="https://arxiv.org/abs/1906.05684">PDF</a></li>
<li><strong><code>Optional</code></strong>: Joseph Carlsmith. <a href="https://arxiv.org/abs/2206.13353">Is Power-Seeking AI an Existential Risk?.</a> arXiv 2022. <a href="https://arxiv.org/abs/2206.13353">PDF</a></li>
<li><strong><code>Optional</code></strong>: Alice Pavaloiu, Utku Kose. <a href="https://arxiv.org/abs/1706.03021">Ethical Artificial Intelligence - An Open Question.</a> arXiv 2017. <a href="https://arxiv.org/abs/1706.03021">PDF</a></li>
</ul>
<h3 id="questions">Questions</h3>
<p><strong>(Post response by Tuesday, 28 November)</strong></p>
<p>Paper 1: <a href="https://drive.google.com/file/d/1Ok16aNvNLbdkBexcmt9dyVGPEpKYGXbH/view">Bridging the Gap Between Ethics and Practice</a></p>
<ol>
<li>The paper claims, “Human-centered Artificial Intelligence (HCAI) systems represent a second Copernican revolution that puts human performance and human experience at the center of design thinking.” Do you agree with this quote?</li>
<li>Developers/teams, organizations, users and regulators often have different views on what constitutes reliability, safety, and trustworthiness in human-centered AI systems. What are the potential challenges and solutions for aligning them? Can you provide some specific examples where these views do not align?</li>
</ol>
<p>Paper 2: <a href="https://arxiv.org/pdf/2310.17688.pdf">Managing AI Risks in an Era of Rapid Progress</a></p>
<ol start="3">
<li>Do you think AI systems can be regulated over an international governance organization or agreement like nuclear weapons?</li>
<li>Consider this quote from the paper: “Without sufficient caution, we may irreversibly lose control of autonomous AI systems, rendering human intervention ineffective. Large-scale cybercrime, social manipulation, and other highlighted harms could then escalate rapidly. This unchecked AI advancement could culminate in a large-scale loss of life and the biosphere, and the marginalization or even extinction of humanity.” Do you agree with it? If so, do you think any of the measures proposed in the paper would be sufficient for managing such a risk? If not, what assumptions of the authors’ that led to this conclusion do you think are invalid or unlikely?</li>
</ol>
</div>
<hr class="post-separator"></hr>
<h1><a href="/week14a/">Week 14a: Multimodal Models</a></h1>
<div class="post-metadata">
<span class="post-date">
<time datetime="2023-12-03 00:00:00 +0000 UTC" itemprop="datePublished">3 December 2023</time>
</span>
</div>
<div class="post-body" itemprop="articleBody">
<p><author>Presenting Team: Aparna Kishore, Elena Long, Erzhen Hu, Jingping Wan</author><br>
<author>Blogging Team: Haolin Liu, Haochen Liu, Ji Hyun Kim, Stephanie Schoch, Xueren Ge</author></p>
<p>Note: since the topics were unrelated, Week 14 is split into two posts:</p>
<ul>
<li><a href="/week14a">Monday, November 27: Multimodal Models</a></li>
<li><a href="/week14b">Wednesday, November 29: Ethical AI</a></li>
</ul>
<h1 id="monday-november-27-multimodal-models">Monday, November 27: Multimodal Models</h1>
<p>Today’s topic is how to improve model performance by combining multiple modalities.</p>
<table><tr>
<td><img src="../images/week14/day2/B.JPG" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>We will first introduce multimodal foundations and then focus on CLIP, the best-known vision-language model.</p>
<table><tr>
<td><img src="../images/week14/day2/C.JPG" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>We live in a multimodal world, and our brains naturally learn to process multiple sensory signals received from the environment to help us make sense of the world around us. More specifically, vision is a large portion of how humans perceive, while language is a large portion of how humans communicate.</p>
<table><tr>
<td><img src="../images/week14/day2/D.JPG" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>When we talk about vision-language, there are two types of interactions to consider: one is how we can produce visual data, and the other is how we can consume visual information.</p>
<table><tr>
<td><img src="../images/week14/day2/E.JPG" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>For visual generation, popular models include GANs and diffusion models. What makes generation multimodal is that we can use other modalities to control the image we want to generate; for example, text-to-image methods such as Stable Diffusion use text-conditioned visual generation.</p>
<table><tr>
<td><img src="../images/week14/day2/F.JPG" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>Another approach focuses on visual understanding, which studies how we can consume visual information from images and, further, how we can consume audio, images, and other modalities from our surrounding environment.</p>
<table><tr>
<td><img src="../images/week14/day2/G.JPG" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>Greg Brockman, one of the founders of OpenAI, showed ChatGPT a diagram of a joke website that he had sketched with a pencil. ChatGPT then output the code for a functional website. This is quite remarkable, as you can start to plug images into language models.</p>
<p>Link: <a href="https://x.com/gdb/status/1635826383141376002?s=20"><em>https://x.com/gdb/status/1635826383141376002?s=20</em></a>.</p>
<table><tr>
<td><img src="../images/week14/day2/H.JPG" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>When we see the text “Sunshine, Sunny beach, Coconut, Straw hat”, we can visualize a picture of a beach with these components. This is because our mind not only receives multimodal information but also somehow aligns these modalities.</p>
<table><tr>
<td><img src="../images/week14/day2/I.JPG" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>Now we move to the detailed algorithm of vision-language models. There are particular vision-language problem spaces or representative tasks that these models try to solve.</p>
<table><tr>
<td><img src="../images/week14/day2/J.JPG" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>The first question is how to train a vision-language model. We will discuss supervised pre-training and contrastive language-image pre-training, which is also known as CLIP.</p>
<table><tr>
<td><img src="../images/week14/day2/K.JPG" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>Supervised learning will map an image to a discrete label that is associated with visual content. The drawback here is that we always need labeled data. However, human annotations can be expensive and labels are limited.</p>
<table><tr>
<td><img src="../images/week14/day2/L.JPG" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>The supervised learning method was deployed first. In 2021, OpenAI released DALL-E, a generative model that uses a transformer architecture like GPT-3. The model receives both text and images during training, and it can generate images from scratch based on natural language input.</p>
<p>As seen in the images above, it can combine disparate ideas to synthesize objects, even ones that are unlikely to exist in the real world.</p>
<h2 id="clip">CLIP</h2>
<blockquote>
<p>Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever. <a href="https://arxiv.org/abs/2103.00020"><em>Learning Transferable Visual Models From Natural Language Supervision</em></a>. arXiv 2021. <a href="https://arxiv.org/abs/2103.00020">PDF</a></p>
</blockquote>
<table><tr>
<td><img src="../images/week14/day2/M.JPG" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>Unlike DALL-E, CLIP takes an image and text and connects them in a non-generative way. The idea is that, given an image, the model can predict the text that goes along with it.</p>
<p>Traditional image classification models are trained to identify objects from a predefined set of categories; for example, there are about 1000 categories in the ImageNet challenge. CLIP is trained to understand the semantics of images and text together. It is trained on a huge amount of data, 400 million images from the web with their corresponding text, and it can perform object identification in any category without re-training.</p>
<table><tr>
<td><img src="../images/week14/day2/N.JPG" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>Since CLIP was trained using a combination of image and text, the training data is a batch of (image, text) pairs.</p>
<p>On top we have the labels that belong to each image; the model tokenizes them, passes them to the text encoder, performs a linear projection, and passes the result into a contrastive embedding space. It does the same for the images.</p>
<p>Then, in the contrastive embedding space, the model takes the inner product of each image vector and text vector. In contrastive learning, we want to increase the values of the blue squares toward 1, which correspond to the original image-text pairs, and decrease the values of the white squares, which correspond to mismatched pairs. To achieve this, they compute losses over the image-to-text and text-to-image similarities and backpropagate.</p>
<table><tr>
<td><img src="../images/week14/day2/O.JPG" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>We now elaborate more on the loss function of this training process. We have two vectors (text, image) here, $v$ represents the text vector, and $u$ represents the image vector, and $\tau$ here is a trainable parameter.</p>
<p>In the text-to-image loss function, they take the cosine similarities of these two vectors, sum over the rows in the denominator, and normalize via softmax. Because the problem is asymmetric, the image-to-text loss function instead sums over the columns.</p>
<p>After that, they compute a cross-entropy loss of these two probability distributions, sum up all the batches, and then average it out to get the final loss function.</p>
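<p>Concretely, with text vectors $v_i$, image vectors $u_i$, and temperature $\tau$, the text-to-image loss for pair $i$ is $-\log \frac{\exp(\langle v_i, u_i \rangle / \tau)}{\sum_{k} \exp(\langle v_i, u_k \rangle / \tau)}$; the image-to-text loss sums over the other axis of the similarity matrix, and the two are averaged. Below is a minimal PyTorch sketch of this symmetric contrastive loss; the names and the temperature handling are our illustrative assumptions, not the released CLIP code.</p>
<pre><code class="language-python">
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_features, text_features, tau):
    """Symmetric contrastive loss over a batch of N aligned (image, text) pairs."""
    u = F.normalize(image_features, dim=-1)              # (N, d) image vectors
    v = F.normalize(text_features, dim=-1)               # (N, d) text vectors

    logits = (u @ v.t()) / tau                           # (N, N) cosine similarities scaled by temperature
    targets = torch.arange(u.size(0), device=u.device)   # matching pairs lie on the diagonal

    loss_i2t = F.cross_entropy(logits, targets)          # softmax over texts for each image (rows)
    loss_t2i = F.cross_entropy(logits.t(), targets)      # softmax over images for each text (columns)
    return (loss_i2t + loss_t2i) / 2

# Example: random features for a batch of 8 pairs and a learnable temperature
tau = torch.nn.Parameter(torch.tensor(0.07))
loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512), tau)
</code></pre>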
<table><tr>
<td><img src="../images/week14/day2/P.JPG" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>After pre-training, the second and third steps perform object identification. We have a new dataset with different classes and want to test CLIP on it. In Step 2, we pass these classes to the pre-trained text encoder. Instead of passing class names alone, they use a prompt template, making a sentence out of each class name. The model then performs the same linear projection as in pre-training and passes the result into the contrastive space.</p>
<p>Then, in Step 3, we take the image we want to classify, pass it into the image encoder, do the linear projection, go into the contrastive embedding space, and take the inner products of this image vector with all the text vectors from Step 2. The final prediction is the class whose text vector has the highest cosine similarity.</p>
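<p>A minimal sketch of this zero-shot classification procedure is shown below. The encoder, tokenizer, and prompt-template names are placeholders standing in for the pre-trained CLIP components, not a specific library API.</p>
<pre><code class="language-python">
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image, class_names, image_encoder, text_encoder, tokenizer):
    # Step 2: wrap each class name in a prompt template and encode the sentences
    prompts = [f"a photo of a {name}" for name in class_names]
    text_emb = F.normalize(text_encoder(tokenizer(prompts)), dim=-1)    # (C, d)

    # Step 3: encode the image and take inner products with all class embeddings
    img_emb = F.normalize(image_encoder(image.unsqueeze(0)), dim=-1)    # (1, d)
    similarity = (img_emb @ text_emb.t()).squeeze(0)                    # cosine similarities

    # The prediction is the class with the highest similarity
    return class_names[similarity.argmax().item()]
</code></pre>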
<table><tr>
<td><img src="../images/week14/day2/Q.JPG" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>The authors share three main ideas behind this work:</p>
<ol>
<li>
<p>The need for a sufficiently large dataset. The simple truth is that existing manually labeled datasets are just too small (on the order of 100k samples) for training a natural-language-supervised model at the scale of GPT. The intuition is that the required data already exists on the web without the need for manual labeling. So they created a new dataset of 400 million (image, text) pairs collected from a variety of publicly available sources on the Internet.</p>
</li>
<li>
<p>Efficient pre-training method. After experimenting with class-label prediction, the authors realized that the key to success was predicting only which text as a whole is paired with which image, not the exact words of that text. This discovery led to the use of the loss function we introduced earlier, such that the cosine similarity for each correct pair of embeddings is maximized and the cosine similarities of the remaining pairings are minimized.</p>
</li>
<li>
<p>Using transformers. After some experiments, they selected a transformer as the text encoder and left two options for the image encoder: either a Vision Transformer or a modified ResNet-D with attention pooling instead of global average pooling.</p>
</li>
</ol>
<table><tr>
<td><img src="../images/week14/day2/R.JPG" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>The figure above shows that CLIP is by far the most data-efficient of the methods compared.</p>
<table><tr>
<td><img src="../images/week14/day2/S.JPG" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>With prompt engineering and ensembling, models achieve higher accuracy than with contextless class names alone.</p>
<p>One observation is that CLIP performs poorly at differentiating word senses when there’s only a label without context. For example, the label “crane” can mean either a construction crane or the bird.</p>
<p>Another observation is that in the text of their pre-training dataset, it is relatively rare to see an image paired with just a single word. So, to bridge this distribution gap, they use a prompt template: instead of a single label, they use template sentences like “a photo of a {label}”. They also found that customizing the prompt text to each task can further improve performance.</p>
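<p>A sketch of this prompt engineering and ensembling idea is shown below: each class name is inserted into several template sentences, the resulting text embeddings are averaged, and the averaged vector is used as that class&rsquo;s embedding in the zero-shot procedure above. The templates here are illustrative examples, not the authors&rsquo; exact set.</p>
<pre><code class="language-python">
import torch
import torch.nn.functional as F

TEMPLATES = ["a photo of a {}.", "a close-up photo of a {}.", "a drawing of a {}."]

@torch.no_grad()
def ensembled_class_embeddings(class_names, text_encoder, tokenizer):
    class_embs = []
    for name in class_names:
        prompts = [t.format(name) for t in TEMPLATES]
        emb = F.normalize(text_encoder(tokenizer(prompts)), dim=-1)   # (T, d), one row per template
        class_embs.append(F.normalize(emb.mean(dim=0), dim=-1))       # average over templates, re-normalize
    return torch.stack(class_embs)                                    # (C, d) class embedding matrix
</code></pre>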
<table><tr>
<td><img src="../images/week14/day2/T.JPG" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>As we can see, the error rate decreases smoothly as a function of model compute. However, the authors note that there is a lot of variance; this curve is the average, and results for individual datasets vary widely. This may be due to how each dataset was selected, how the prompt was engineered, or other unknown factors.</p>
<table><tr>
<td><img src="../images/week14/day2/U.JPG" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>To evaluate performance, CLIP is compared with a linear probe on ResNet-50. It is impressive that zero-shot CLIP outperforms this fully supervised baseline on many of the datasets, including ImageNet.</p>
<p>On the other hand, CLIP is weak on several specialized, complex, or abstract tasks such as EuroSAT (satellite image classification) and KITTI Distance (recognizing the distance to the nearest car). This may be because these are not the kinds of text and images found frequently on the Internet, or because these tasks are different enough from common image tasks yet simple enough for a custom-trained model to do well.</p>
<table><tr>
<td><img src="../images/week14/day2/V.JPG" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>Here we compare zero-shot CLIP with few-shot linear probes. This is where pre-training really pays off, as the probed models only see a few examples per class.</p>
<p>Surprisingly, zero-shot CLIP is comparable to a 16-shot BiT-M model, one of the best open models for transfer learning in computer vision. If we linear probe the CLIP model, it far outperforms these other linear-probe models.</p>
<table><tr>
<td><img src="../images/week14/day2/W.JPG" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>They also evaluate CLIP in terms of its robustness to perturbations. Here they compare zero-shot CLIP to models that have been trained on ImageNet, finding that zero-shot CLIP matches the performance of ResNet-101. While the ImageNet-trained classifier degrades on progressively harder datasets, CLIP overall remains more robust. This suggests that CLIP’s representations are nuanced enough to pick up on features beyond those needed merely to distinguish, say, a banana from the other classes in the ImageNet dataset.</p>
<table><tr>
<td><img src="../images/week14/day2/X.JPG" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>Here, they customize zero-shot CLIP to each dataset (“adapt to class shift,” in purple) based on class names. While this supervised adaptation to class shift increases ImageNet accuracy by around 10 percentage points, it slightly reduces the average robustness. As the right side shows, the improvements are concentrated in only a few datasets.</p>
<p>On the other hand, when they adapt CLIP by fitting fully supervised logistic regression classifiers on the best CLIP model’s features, the robustness comes close to that of standard ImageNet training. Thus, it seems that the zero-shot CLIP representation itself has more value, with more stability and nuance.</p>
<table><tr>
<td><img src="../images/week14/day2/XXX.JPG" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>There are various follow-up works to CLIP based on this contrastive learning structure. The first line of extensions further scales up the text and image data; the second designs better models.</p>
<h2 id="reproducible-scaling-laws">Reproducible Scaling Laws</h2>
<blockquote>
<p>Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, Jenia Jitsev. <a href="https://arxiv.org/abs/2212.07143"><em>Reproducible scaling laws for contrastive language-image learning</em></a>. CVPR 2023. <a href="https://arxiv.org/abs/2212.07143">PDF</a></p>
</blockquote>
<table><tr>
<td><img src="../images/week14/day2/XXXX.JPG" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>This paper used the open, large-scale LAION-2B dataset to pre-train OpenCLIP across different scales.</p>
<h2 id="datacomp">Datacomp</h2>
<blockquote>
<p>Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, Eyal Orgad, Rahim Entezari, Giannis Daras, Sarah Pratt, Vivek Ramanujan, Yonatan Bitton, Kalyani Marathe, Stephen Mussmann, Richard Vencu, Mehdi Cherti, Ranjay Krishna, Pang Wei Koh, Olga Saukh, Alexander Ratner, Shuran Song, Hannaneh Hajishirzi, Ali Farhadi, Romain Beaumont, Sewoong Oh, Alex Dimakis, Jenia Jitsev, Yair Carmon, Vaishaal Shankar, Ludwig Schmidt. <a href="https://arxiv.org/abs/2304.14108"><em>DataComp: In search of the next generation of multimodal datasets</em></a>. arxiv 2023. <a href="https://arxiv.org/abs/2304.14108">PDF</a></p>
</blockquote>
<table><tr>
<td><img src="../images/week14/day2/XXXXX.JPG" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>This paper asks how we should scale data: should we scale up with noisier and noisier data?</p>
<p>Their focus is the search for next-generation image-text datasets. Instead of fixing the dataset and designing different algorithms, the authors propose fixing the CLIP training method and varying the datasets instead. With this method, they come up with a high-quality, large-scale dataset.</p>
<h2 id="filip">FILIP</h2>
<blockquote>
<p>Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, Chunjing Xu. <a href="https://arxiv.org/abs/2111.07783"><em>FILIP: Fine-grained Interactive Language-Image Pre-Training</em></a>. ICLR 2022. <a href="https://arxiv.org/abs/2111.07783">PDF</a></p>
</blockquote>
<table><tr>
<td><img src="../images/week14/day2/XXXXXX.JPG" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>FILIP scales CLIP training via masking: it randomly masks out image patches with a high masking ratio and only encodes the visible patches. It turns out this method does not hurt performance but improves training efficiency.</p>
<h2 id="k-lite">K-Lite</h2>
<blockquote>
<p>Sheng Shen, Chunyuan Li, Xiaowei Hu, Jianwei Yang, Yujia Xie, Pengchuan Zhang, Zhe Gan, Lijuan Wang, Lu Yuan, Ce Liu, Kurt Keutzer, Trevor Darrell, Anna Rohrbach, Jianfeng Gao. <a href="https://arxiv.org/abs/2204.09222"><em>K-LITE: Learning Transferable Visual Models with External Knowledge</em></a>. NeurIPS 2022. <a href="https://arxiv.org/abs/2204.09222">PDF</a></p>
</blockquote>
<table><tr>
<td><img src="../images/week14/day2/XXXXXXX.JPG" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>Another line of work focuses on improving the language-side model design of CLIP. The K-LITE model utilizes Wikipedia definitions of entities together with the original alt-text for contrastive pre-training. Such knowledge is useful for a variety of domains and datasets, making it possible to build a generic approach for task-level transfer.</p>
<table><tr>
<td><img src="../images/week14/day2/XXXXXXXX.JPG" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>Recall that in the motivating example, we argue that more modalities will enhance the learning process.</p>
<h2 id="imagebind">ImageBind</h2>
<blockquote>
<p>Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, Ishan Misra. <a href="https://arxiv.org/abs/2305.05665"><em>ImageBind: One Embedding Space To Bind Them All</em></a>. arxiv 2023. <a href="https://arxiv.org/abs/2305.05665">PDF</a></p>
</blockquote>
<table><tr>
<td><img src="../images/week14/day2/XXXXXXXXX.JPG" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>ImageBind tries to use more modalities to improve performance. However, one challenge is that not all modality pairs are naturally aligned, due to the lack of corresponding relationships in the training data.</p>
<table><tr>
<td><img src="../images/week14/day2/XXXXXXXXXX.JPG" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>ImageBind covers several modalities: image, text, video, audio, depth, thermal, and IMU (which includes accelerometer and gyroscope data). The goal of ImageBind is to learn a single joint embedding space for all the modalities, using the image as the binding modality. Here <em>I</em> denotes the image modality, and <em>M</em> denotes all the other modalities. They use deep neural networks as encoders to extract embeddings from each modality, so each modality has its own encoder, just like CLIP.</p>
<p>During training, the image and text encoders were kept frozen, and the weights of the other modalities&rsquo; encoders were updated. This freezing allows alignment to emerge between modalities for which we have no natural alignment, for example between audio and depth.</p>
<p>The preprocessed inputs are passed through their encoders and then through a simple linear layer to make sure they are of the same dimension, before being trained with the InfoNCE loss. This loss is a modified cross-entropy loss that extends contrastive learning to multiple modalities. Let the output for the image be <em>q</em> and the output for another modality be <em>k</em>; the loss aligns the image modality with each of the other modalities.</p>
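<p>A minimal sketch of this InfoNCE-style loss is shown below, aligning the image (binding) modality <em>q</em> with one other modality <em>k</em>, such as audio or depth. The function name and the fixed temperature value are illustrative assumptions, not ImageBind&rsquo;s exact code.</p>
<pre><code class="language-python">
import torch
import torch.nn.functional as F

def infonce_loss(q, k, temperature=0.07):
    """q: (N, d) image embeddings; k: (N, d) embeddings of the paired modality."""
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    logits = (q @ k.t()) / temperature                   # (N, N) similarity matrix
    targets = torch.arange(q.size(0), device=q.device)   # the i-th image matches the i-th sample
    return F.cross_entropy(logits, targets)
</code></pre>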
<table><tr>
<td><img src="../images/week14/day2/YY.JPG" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>They study whether ImageBind’s embeddings can be used to compose information across modalities. The figure above shows image retrievals obtained by adding together image and audio embeddings. The joint embedding space allows us to compose two embeddings: e.g., an image of fruits on a table plus the sound of chirping birds retrieves an image that contains both concepts, i.e., fruits on trees with birds. Such emergent compositionality, whereby semantic content from different modalities can be composed, will likely enable a rich variety of compositional tasks.</p>
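<p>A sketch of this embedding composition under the same joint-space assumption: normalize and add the image and audio embeddings, then retrieve the gallery image whose embedding is most similar to the composed query. All names here are illustrative placeholders.</p>
<pre><code class="language-python">
import torch.nn.functional as F

def compose_and_retrieve(image_emb, audio_emb, gallery_embs):
    # gallery_embs: (G, d) pre-normalized embeddings of candidate images
    # Compose a query by adding the two normalized embeddings in the joint space
    query = F.normalize(image_emb, dim=-1) + F.normalize(audio_emb, dim=-1)
    query = F.normalize(query, dim=-1)
    scores = gallery_embs @ query            # cosine similarity with every gallery image
    return scores.argmax().item()            # index of the best-matching image
</code></pre>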
<table><tr>
<td><img src="../images/week14/day2/YYY.JPG" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>By utilizing the audio embedding of ImageBind, it is possible to design an audio-based detector that can detect and segment objects based on audio prompts.</p>
<table><tr>
<td><img src="../images/week14/day2/YYYY.JPG" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>As proposed in CLIP, replacing labels with textual descriptions and using a text encoder to encode them can feasibly convert closed-set problems to open-set ones. A number of works have been proposed to transform different computer vision tasks by replacing the label space with language space.</p>
<table><tr>
<td><img src="../images/week14/day2/Z.JPG" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>For the first discussion question, we believe there are several differences between humans and machines in cognition. Although these models outperform humans on several specific tasks, they also have limitations. For example, humans perceive an image as a whole, while machines perceive it pixel by pixel. This means humans are good at using context to interpret images and text. While these models can recognize patterns and correlations between words and images, they may not fully grasp the broader context as humans do.</p>
<p>For the second question, the presenter gave an example: there is a specific food in Wuhan called “hot dry noodles”. When we give the model a picture of these noodles with the caption “hot dry noodles in Wuhan”, the multimodal model will describe how this food is popular in Wuhan. However, if we change the caption to “hot dry noodles in Shandong”, the model will still describe the noodles as being in Wuhan rather than Shandong. The presenter believes this is an example of bias, because much of the data about these noodles is associated with Wuhan. Thus, even though the caption of the image is changed, the model cannot adjust because its representation is fixed.</p>
<h1 id="readings-and-discussion-questions">Readings and Discussion Questions</h1>
<h2 id="monday-27-november-transferring-and-binding-multi-modal-capabilities">Monday 27 November: Transferring and Binding Multi-Modal Capabilities:</h2>
<h3 id="readings-for-monday">Readings for Monday:</h3>
<ul>
<li><strong><code>Required</code></strong>: Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, Ilya Sutskever. <a href="https://arxiv.org/abs/2103.00020">Learning Transferable Visual Models From Natural Language Supervision</a>. PMLR 2021. <a href="https://arxiv.org/abs/2103.00020">PDF</a></li>
<li><strong><code>Optional</code></strong>: OpenAI. <a href="https://openai.com/research/clip">CLIP: Connecting text and images.</a> Blog 2021.</li>
<li><strong><code>Required</code></strong>: Sandhini Agarwal, Gretchen Krueger, Jack Clark, Alec Radford, Jong Wook Kim, Miles Brundage. <a href="https://arxiv.org/abs/2108.02818">Evaluating CLIP: Towards Characterization of Broader Capabilities and Downstream Implications.</a> <a href="https://arxiv.org/abs/2108.02818">PDF</a></li>
<li><strong><code>Required</code></strong>: Meta AI. <a href="https://ai.meta.com/blog/imagebind-six-modalities-binding-ai/">ImageBind: Holistic AI Learning Across Six Modalities.</a> Blog 2023.</li>
<li><strong><code>Optional</code></strong>: Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev, Alwala Armand, Joulin Ishan Misra. <a href="https://arxiv.org/abs/2309.10020">ImageBind: One Embedding Space To Bind Them All</a>. arXiv 2023. <a href="https://arxiv.org/abs/2309.10020">PDF</a></li>
<li><strong><code>Optional</code></strong>: Chunyuan Li, Zhe Gan, Zhengyuan Yang, Jianwei Yang, Linjie Li, Lijuan Wang, Jianfeng Gao. <a href="https://arxiv.org/abs/2309.10020">Multimodal Foundation Models: From Specialists to General-Purpose Assistants.</a> arXiv 2023. <a href="https://arxiv.org/abs/2309.10020">PDF</a> Chapter 1-2, p5 - p25.</li>
<li><strong><code>Optional</code></strong>: Jindong Gu, Zhen Han, Shuo Chen, Ahmad Beirami, Bailan He, Gengyuan Zhang, Ruotong Liao, Yao Qin, Volker Tresp, Philip Torr. <a href="https://arxiv.org/abs/2307.12980">A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models.</a> arXiv 2023. <a href="https://arxiv.org/abs/2307.12980">PDF</a></li>
<li><strong><code>Optional</code></strong>: Anastasiya Belyaeva, Justin Cosentino, Farhad Hormozdiari, Krish Eswaran, Shravya Shetty, Greg Corrado, Andrew Carroll, Cory Y. McLean, Nicholas A. Furlotte. <a href="https://arxiv.org/abs/2307.09018">Multimodal LLMs for health grounded in individual-specific data.</a> arXiv 2023. <a href="https://arxiv.org/abs/2307.09018">PDF</a></li>
<li><strong><code>Optional</code></strong>: Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. <a href="https://arxiv.org/abs/2010.11929">An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.</a> ICLR 2021. <a href="https://arxiv.org/abs/2010.11929">PDF</a></li>
</ul>
<h3 id="questions">Questions</h3>
<p><strong>(Post response by Sunday, 26 November)</strong></p>
<ol>
<li>What are some potential real-world applications of CLIP and ImageBind? Could these technologies transform industries like healthcare, education, or entertainment?</li>
<li>How do CLIP and ImageBind mimic or differ from human cognitive processes in interpreting and linking visual and textual information?</li>
<li>What are potential challenges in creating datasets for training models like CLIP and ImageBind? How can the quality of these datasets be ensured?</li>
<li>What are the potential ethical implications of technologies like CLIP and ImageBind, especially in terms of privacy, bias, and misuse? How can these issues be mitigated?</li>
</ol>
</div>
<hr class="post-separator"></hr>
<h1><a href="/week13/">Week 13: Regulating Dangerous Technologies</a></h1>
<div class="post-metadata">
<span class="post-date">
<time datetime="2023-11-20 00:00:00 +0000 UTC" itemprop="datePublished">20 November 2023</time>
</span>
</div>
<div class="post-body" itemprop="articleBody">
<p>The slides are here: <a href="https://www.dropbox.com/scl/fi/ycrjkoau5kclxq09ckvx4/regulation-post.pdf?rlkey=28sxdj7pf4pzlbjtavn59bufl&dl=0">Regulating Dangerous Technologies</a> (I’ve included some slides in the posted slides that I didn’t present in class but you might find interesting, including some excerpts from a talk I gave in 2018 on <a href="https://speakerdeck.com/evansuva/mutually-assured-destruction-and-the-impending-ai-apocalypse"><em>Mutually Assured Destruction and the Impending AI Apocalypse</em></a>.)</p>
<p>Since one of the groups made the analogy to tobacco products, I also will take the liberty of pointing to a talk I gave at Google making a similar analogy: <a href="https://uvasrg.github.io/google-federated-privacy-2019-the-dragon-in-the-room/"><em>The Dragon in the Room</em></a>.</p>
<p>Stephanie made the point after class about how important individuals
making brave decisions is to things working out, in particular with
humanity (so far!) avoiding annihilating ourselves with nuclear
weapons. Stanislav Petrov may well have been the single person between
us and nuclear destruction in 1983, when he prevented an alert (which
he correctly determined was a false alarm) produced by the Soviet
detection system from going up the chain.</p>
<p>Here’s one (of many)
articles on this: <a href="https://www.washingtonpost.com/wp-srv/inatl/longterm/coldwar/shatter021099b.htm"><em>‘I Had A Funny Feeling in My
Gut’</em></a>,
Washington Post, 10 Feb 1999. There is still a lot of uncertainty and
skepticism if we should be fearing any kind of out-of-control AI risk,
but it is not so hard to imagine scenarios where our fate will
similarly come down to an individual’s decision at a critical
juncture. (On the other hand, this article argues that we shouldn’t
oversensationalize Petrov’s actions and there were many other
safeguards between him and nuclear war, and we really shouldn’t design
extinction-level systems in a way that they are so fragile to depend on an individual decision: <a href="https://russianforces.org/blog/2022/10/did_stanislav_petrov_save_the_.shtml"><em>Did Stanislav Petrov save the world in 1983? It’s complicated</em></a>, from a Russian perspective.)</p>
</div>
<hr class="post-separator"></hr>
<h1><a href="/week12/">Week 12: LLM Agents</a></h1>
<div class="post-metadata">
<span class="post-date">
<time datetime="2023-11-16 00:00:00 +0000 UTC" itemprop="datePublished">16 November 2023</time>
</span>
</div>
<div class="post-body" itemprop="articleBody">
<p><author>Presenting Team: Liu Zhe, Peng Wang, Sikun Guo, Yinhan He, Zhepei Wei</author></p>
<p><author>Blogging Team: Anshuman Suri, Jacob Christopher, Kasra Lekan, Kaylee Liu, My Dinh</author></p>
<h1 id="monday-november-13-llm-agents">Monday, November 13: LLM Agents</h1>
<table><tr>
<td><img src="../images/week12/day1/LLM_Agents_MondayPres_Page_02.jpg" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p><strong>LLM agents</strong> are the “next big thing”, with the potential to directly impact important fields like healthcare and education. Essentially, they are LLM-based systems that have the ability to use external tools, such as Internet browsing access and calculators, to augment their abilities.</p>
<table><tr>
<td><img src="../images/week12/day1/LLM_Agents_MondayPres_Page_03.jpg" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<h2 id="toolformer">Toolformer</h2>
<blockquote>
<p>Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, Thomas Scialom. <a href="https://arxiv.org/abs/2302.04761"><em>Toolformer: Language Models Can Teach Themselves to Use Tools</em></a>. arXiv 2023. <a href="https://arxiv.org/abs/2302.04761">PDF</a></p>
</blockquote>
<p>LLMs have limitations that can potentially be addressed with these “tools”:</p>
<table><tr>
<td><img src="../images/week12/day1/LLM_Agents_MondayPres_Page_05.jpg" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<ul>
<li><strong>Outdated information</strong>: LLMs cannot access up-to-date information without external sources. Giving them the ability to access real-time information (via Internet queries) would lead to better responses to questions such as “Who is the President of the USA today?”</li>
<li><strong>Hallucination</strong>: External knowledge sources can help ground generation in facts and work to supplement the model’s knowledge, reducing the possibility of hallucinating.</li>
<li><strong>Lack of mathematical skills</strong>: Access to a calculator can help the model generate correct responses to computations involving math. Zero-shot prompting can help reduce hallucination, but providing access to a calculator (assuming it is used correctly) can guarantee correct computations.</li>
</ul>
<p>Other limitations include limited multi-language usability, having no concept of “time”, etc.</p>
<table><tr>
<td><img src="../images/week12/day1/LLM_Agents_MondayPres_Page_09.jpg" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<h3 id="key-contributions">Key Contributions</h3>
<table><tr>
<td><img src="../images/week12/day1/LLM_Agents_MondayPres_Page_11.jpg" width="95%"></td>
</tr>
<td colspan=1 align="center"><b></b></td>
</table>
<p>The main idea is to develop a system that has the ability to use external tools (translation, calendar, search engine, etc.).
The key lies in knowing <em>when</em> to use a tool, <em>which</em> tool to use, and <em>how</em> to use it. Training is self-supervised, unlike other capability-enhancing techniques like RLHF.</p>
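<p>As a simplified illustration of what such tool use can look like at inference time, the sketch below finds bracketed API calls embedded in generated text, executes them, and splices the results back in. The bracket syntax and the calculator-only tool registry are assumptions made for this sketch, not the paper&rsquo;s full pipeline.</p>
<pre><code class="language-python">
import re

# Hypothetical tool registry; a real system would add search, translation, calendar, etc.
TOOLS = {"Calculator": lambda expr: str(eval(expr, {"__builtins__": {}}))}

def execute_tool_calls(text):
    """Replace each "[Tool(args)]" span with "[Tool(args) -> result]"."""
    def run(match):
        tool, args = match.group(1), match.group(2)
        result = TOOLS[tool](args) if tool in TOOLS else "?"
        return f"[{tool}({args}) -> {result}]"
    return re.sub(r"\[(\w+)\((.*?)\)\]", run, text)

print(execute_tool_calls("Out of 1400 participants, a fraction of [Calculator(400/1400)] passed."))
</code></pre>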
<h3 id="data-collection">Data Collection</h3>