<style type="text/css">
table.tableLayout{
margin: auto;
border: 1px solid;
border-collapse: collapse;
border-spacing: 1px;
caption-side: bottom;
}
table.tableLayout tr{
border: 1px solid;
border-collapse: collapse;
padding: 5px;
}
table.tableLayout th{
border: 1px solid;
border-collapse: collapse;
padding: 3px;
}
table.tableLayout td{
border: 1px solid;
padding: 5px;
}
</style>
<h1 id="sec:proxy-gaming">3.3 Robustness</h1>
<p>In this section, we begin to explore the need for proxies in machine learning and the challenges this poses for creating systems that are robust to adversarial attacks.
We examine a potential failure mode known as proxy gaming, wherein a model optimizes for a proxy in a way that diverges from the idealized goals of its designers.
We also analyze a related concept known as Goodhart’s law and explore some of the causes for these kinds of failure modes. Next, we consider the phenomenon of adversarial
examples, where an optimizer is used to exploit vulnerabilities in a neural network. This can enable adversarial attacks that allow an AI system to be misused.
Other adversarial threats to AI systems include Trojan attacks, which allow an adversary to insert hidden functionality. There are also techniques that allow adversaries
to surreptitiously extract a model’s weights or training data. We close by looking at the tail risks of having AI systems themselves play the role of evaluators
(i.e. proxy goals) for other AI systems.</p>
<h2 id="proxies-in-machine-learning">3.3.1 Proxies in Machine Learning</h2>
<p>Here, we look at the concept of proxies, why they are necessary, and
how they can lead to problems.</p>
<p><strong>Many goals are difficult to specify exactly.</strong> It is
hard to measure or even define many of the goals we care about. They
could be too abstract for straightforward measurement, such as justice,
freedom, and equity, or they could simply be difficult to observe
directly, such as the quality of education in schools.<p>
With ML systems, this difficulty is especially pronounced because, as we
saw in the chapter, ML systems require quantitative, measurable targets
in order to learn. This places a strong limit on the kinds of goals we
can represent. As we’ll see in this section, specifying suitable and
learnable targets poses a major challenge.</p>
<p><strong>Proxies stand in for idealized goals.</strong> When
specifying our idealized goals is difficult, we substitute a
<em>proxy</em>—an approximate goal that is more measurable and seems
likely to correlate with the original goal. For example, in pest
management, a bureaucracy may substitute the number of pests killed as a
proxy for “managing the local pest population” <span class="citation"
data-cites="john2023deada">[1]</span>. Or, in training an AI system to
play a racing game, we might substitute the number of points earned for
“progress towards winning the race” <span class="citation"
data-cites="clark2016faulty">[2]</span>. Such proxies can be more or
less accurate at approximating the idealized goal.</p>
<p><strong>Proxies may miss important aspects of our idealized
goals.</strong> By definition, proxies used to optimize AI systems will
fail to capture some aspects of our idealized goals. When the
differences between the proxy and idealized goal lead to the system
making the same decisions, we can neglect them. In other cases, the
differences may lead to substantially different downstream decisions
with potentially undesirable outcomes.<p>
While proxies serve as useful and often necessary stand-ins for our
idealized objectives, they are not without flaws. The wrong choice of
proxies can lead to the optimized systems taking unanticipated and
undesired actions.</p>
<h2 id="proxy-gaming">3.3.2 Proxy Gaming</h2>
<p>In this section, we explore a failure mode of proxies known as proxy
gaming, where a model optimizes for a proxy in a way that produces
undesirable or even harmful outcomes as judged from the idealized goal.
Additionally, we look at a concept related to proxy gaming, known as
Goodhart’s Law, where the optimization process itself causes a proxy to
become less correlated with its original goal.</p>
<p><strong>Optimizing for inaccurate proxies can lead to undesired
outcomes.</strong> To illustrate proxy gaming in a context outside AI,
consider again the example of pest management. In 1902, the city of
Hanoi was dealing with a rat problem: the newly installed sewer system
had inadvertently become a breeding ground for rats, bringing with it a
concern for hygiene and the threat of a plague outbreak <span
class="citation" data-cites="john2023deada">[1]</span>. In an attempt to
control the rat population, the French colonial administration began
offering a bounty for every rat killed. To make the collection process
easier, instead of demanding the entire carcass, the French only
required the rat’s tail as evidence of the kill.<p>
Counter to the officials’ aims, people began breeding rats to cut off
their tails and claim the reward. Additionally, others would simply cut
off the tail and release the rat, allowing it to potentially breed and
produce more tails in the future. The proxy—rat tails—proved to be a
poor substitute for the goal of managing the local rat population.<p>
So too, proxy gaming can occur in ML. A notorious example comes from
when researchers at OpenAI trained an AI system to play a game called
CoastRunners. In this game, players need to race around a course and
finish before others. Along the course, there are targets which players
can hit to earn points <span class="citation"
data-cites="clark2016faulty">[2]</span>. While the intention was for the
AI to circle the racetrack and complete the race swiftly, much to the
researchers’ surprise, the AI identified a loophole in the objective. It
discovered a specific spot on the course where it could continually
strike the same three nearby targets, rapidly amassing points without
ever completing the race. This unconventional strategy allowed the AI to
secure a high score, even though it frequently crashed into other boats
and, on several occasions, set itself ablaze. Points proved to be a poor
proxy for doing well at the game.</p>
<figure id="fig:coastrunners">
<img src="https://raw.githubusercontent.com/WilliamHodgkins/AISES/main/images/proxy_gaming_ex_1.png" class="tb-img-full" style="width: 90%"/>
<p class="tb-caption">Figure 3.10: An AI playing CoastRunners 7 learned to crash and regenerate targets repeatedly rather
than win the race to get a higher score, exhibiting proxy gaming. <span class="citation"
data-cites="clark2016faulty">[2]</span></p>
<!--<figcaption>Proxy gaming in CoastRunners 7 - <span class="citation"-->
<!--data-cites="clark2016faulty">[2]</span></figcaption>-->
</figure>
<p><strong>Optimizing for inaccurate proxies can lead to harmful
outcomes.</strong> If a proxy is sufficiently unfaithful to the
idealized goal it is meant to represent, it can result in AI systems
taking actions that are not just undesirable but actively harmful. For
example, a 2019 study on a US healthcare algorithm used to evaluate the
health risk of 200 million Americans revealed that the algorithm
inaccurately evaluated black patients as healthier than they actually
were <span class="citation"
data-cites="obermeyer2019dissecting">[3]</span>. The algorithm used past
spending on similar patients as a proxy for health, equating lower
spending with better health. Due to black patients historically getting
fewer resources, this system perpetuated a lower and inadequate standard
of care for black patients—assigning them half the amount of care that equally sick non-marginalized patients received. When deployed at scale, AI systems that
optimize inaccurate proxies can have significant, harmful effects.</p>
<p><strong>Optimizers often “game” proxies in ways that diverge from our
idealized goals.</strong> As we saw in the Hanoi example and the
boat-racing example, proxies may contain loopholes that allow for
actions that achieve high performance according to the proxy but that
are suboptimal or even deleterious according to the idealized goal.
<em>Proxy gaming</em> refers to this act of exploiting or taking
advantage of approximation errors in the proxy rather than optimizing
for the original goal. This is a general phenomenon that happens in both
human systems and AI systems.<p>
</p>
<figure id="fig:optimisation_pressure">
<img src="https://raw.githubusercontent.com/WilliamHodgkins/AISES/main/images/proxy_score.png" style="width: 70%" class="tb-img-full"/>
<p class="tb-caption">Figure 3.11: Often, as optimization pressure increases, the proxy diverges from the target with which
it was originally correlated. <span
class="citation" data-cites="skalsedefining">[4]</span></p>
<!--<figcaption>Often, as optimization pressure increases, the proxy-->
<!--diverges from the target with which it was originally correlated. <span-->
<!--class="citation" data-cites="skalsedefining">[4]</span></figcaption>-->
</figure>
<p>Proxy gaming can occur in many AI systems. The boat-racing example is
not an isolated example. Consider a simulated traffic control
environment <span class="citation"
data-cites="pan2022effects">[5]</span>. Its goal is to mirror the
conditions of cars joining a motorway, in order to determine how to
minimize the average commute time. The system aims to determine the
ideal traveling speeds for both oncoming traffic and vehicles attempting
to join the motorway. To represent low average commute times, the algorithm
uses maximizing the mean velocity of vehicles as a proxy. However, this results in the
algorithm preventing the joining vehicles from entering the motorway,
since a higher average velocity is maintained when oncoming cars can
proceed without slowing down for joining traffic.<p>
</p>
<div class="center">
</div>
<figure id="fig:proxy-reward">
<img src="https://raw.githubusercontent.com/WilliamHodgkins/AISES/main/images/Traffic control.png" style="width: 75%" class="tb-img-full"/>
<p class="tb-caption">Figure 3.12: Proxy gaming AIs can choose sub-optimal solutions when presented with simple proxies
like “maximize the mean velocity.” </p>
<!--<figcaption>Sub-optimal traffic control solution due to proxy-->
<!--gaming</figcaption>-->
</figure>
<p><strong>Optimizers can cause proxies to become less correlated with the
idealized goal.</strong> The total amount of effort an optimizer has put towards
optimizing a particular proxy is the <em>optimization pressure</em>
<span class="citation" data-cites="skalsedefining">[4]</span>.
Optimization pressure depends on factors like the incentives present,
the capability of the optimizer, and how much time the optimizer has had
to optimize.<p>
In many cases, the correlation between a proxy and an idealized goal will
decrease as optimization pressure increases. The approximation error
between the proxy and the idealized goal may at first be negligible, but as the
system becomes more capable of achieving high performance (according to
the proxy) or as the incentives to achieve high performance increases,
the approximation error can increase. In the boat-racing example, the
proxy (number of points) initially advanced the designers’ intentions:
the AI system learned to maneuver the boat around the course. It was only
under additional optimization pressure that the correlation broke down
with the boat getting stuck in a loop.<p>
Sometimes, the correlation between a proxy and an idealized goal can vanish
or reverse. According to <em>Goodhart’s Law</em>, “any observed
statistical regularity will tend to collapse once pressure is placed
upon it for control purposes” <span class="citation"
data-cites="goodhart1975problems">[6]</span>. In other words, a proxy
might initially have a strong correlation (“statistical regularity”)
with the idealized outcome. However, as optimization pressure (“pressure
... for control purposes”) increases, the initial correlation can vanish
(“collapse”) and in some cases even reverse. The scenario with the Hanoi
rats is a classic illustration of this principle, where the number of
rat tails collected ultimately became positively correlated with the
local rat population. The proxy failed precisely because the pressure to
optimize for it caused the proxy to become less correlated with the
idealized goal.</p>
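<p>To make Goodhart’s Law concrete, the following toy simulation (not from the text; a sketch under the simplifying assumption that the proxy equals the idealized goal plus independent noise) shows that the harder we select on a noisy proxy, the more the selected candidate’s proxy score overstates its true value:</p>
<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_value = rng.normal(size=n)            # the idealized goal (unobserved by the optimizer)
proxy = true_value + rng.normal(size=n)    # a noisy but initially well-correlated proxy

# Increasing "optimization pressure": pick the best candidate according to the
# proxy from larger and larger pools of options.
for pool in [10, 100, 1_000, 100_000]:
    best = np.argmax(proxy[:pool])
    gap = proxy[best] - true_value[best]
    print(f"pool={pool:>6}  proxy={proxy[best]:.2f}  true={true_value[best]:.2f}  gap={gap:.2f}")
</code></pre>
<p>As the pool grows, the winner’s proxy score keeps climbing, but an increasing share of that score is noise rather than true value, so the gap between proxy and goal tends to widen even though the proxy itself never changed.</p>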
<p><strong>Some proxies are more robust to optimization pressure than
others.</strong> Goodhart’s Law is often condensed to: “When a measure
becomes a target, it ceases to be a good measure” <span class="citation"
data-cites="strathern1997improving">[7]</span>. Though memorable, this
overly simplified version falsely suggests that robustness to
optimization pressure is a binary all or nothing. In reality, robustness
to optimization pressure occupies a spectrum. Some are more robust than
others.</p>
<h3 id="types-of-proxy-defects">Types of Proxy Defects</h3>
<p>Intuitively, the cause of proxy gaming is straightforward: the
designer has chosen the wrong proxy. This suggests a simple solution:
just choose a better proxy. However, real-world constraints make it
impossible to “just choose a better proxy”. Some amount of approximation
error between idealized goals and the implemented proxy is often
inevitable. In this section, we will survey three principal types of
proxy defects—common sources of failure modes like proxy gaming.</p>
<p><strong>Simple metrics may exclude many of the things we value, but
it is hard to predict how they will break down.</strong> YouTube uses
watch time—the amount of time users spend watching a video—as a proxy to
evaluate and recommend potentially profitable content <span
class="citation" data-cites="roose2019making">[8]</span>. In order to
game this metric, some content creators resorted to tactics to
artificially inflate viewing time, potentially diluting the genuine
quality of their content. Tactics included using misleading titles and
thumbnails to lure viewers, and presenting ever more extreme and hateful
content to retain attention. Instead of promoting high-quality,
monetizable content, the platform started endorsing exaggerated or
inflammatory videos.<p>
YouTube’s reliance on watch time as a metric highlights a common
problem: many simple metrics don’t include everything we value. It is
especially these missing aspects that become salient under extreme
optimization pressure. In YouTube’s case, the structural error of
failing to include other values it cared about (such as what was
acceptable to advertisers) led to the platform promoting content that
violated its own values. Eventually, YouTube updated its recommendation
algorithm, de-emphasizing watch-time and incorporating a wider range of
metrics. Including one’s broader set of values requires incorporating a
larger and more granular set of proxies. In general, this is highly
difficult, as we need to be able to specify precisely how these values
can be combined and traded off against each other.<p>
This challenge isn’t unique to YouTube. As long as AI systems’ goals
rely on simple proxies that do not reflect the full set of our
intrinsic goods, such as wellbeing, we leave room for optimizers to exploit those gaps.
In the future, machine learning models may become adept at representing
our wider set of values. Then, their ability to work reliably with
proxies will hinge largely on their resilience to the kinds of
adversarial attacks discussed in the next section.<p>
Until then, the challenge remains: if our objectives are simple and do
not fully reflect our most important values (e.g. intrinsic goods), we
run the risk of an optimizer exploiting this gap.</p>
<p><strong>Choosing and delegating subgoals creates room for structural
error.</strong> Many systems are organized into multiple different
layers. When such a system is goal-directed, pursuing its high-level
goal often requires breaking it down into subgoals and delegating these
subgoals to its subsystems. This can be a source of structural error if
the high-level goal is not the sum of its subgoals.<p>
For example, a company might have the high-level goal of being
profitable over the long term <span class="citation"
data-cites="john2023deada">[1]</span>. Management breaks this down into
the subgoal of improving sales revenue, which they operationalize via
the proxy of quarterly sales volume. The sales department, in turn,
breaks this subgoal down into the subgoal of generating leads, which
they operationalize with the proxy of the “number of calls” that sales
representatives are making. Representatives may end up gaming this proxy
by making brief, unproductive calls that fail to generate new leads,
thereby decreasing quarterly sales revenue and ultimately threatening
the company’s long-term profitability. Delegation can create problems
when the entity delegating (“the principal”) and the entity being
delegated to (“the agent”) have a conflict of interest or differing
incentives. These <em>principal-agent problems</em> can cause the
overall system not to faithfully pursue the original goal.<p>
Each step in the chain of breaking goals down introduces further
opportunity for approximation error to creep in. We discuss failures due to
delegation, such as goal conflict, further in the Intrasystem Goal Conflict section
of the Collective Action Problems chapter.</p>
<h3 id="limits-to-supervision">Limits to Supervision</h3>
<p>Frequently occurring sources of approximation error mean that we do
not have a perfect instantiation of our idealized goals. One approach to
approximating our idealized goals is to provide supervision that says
whether something is in keeping with our goal or not; this supervision
could come from humans or from AIs. We now discuss how spatial,
temporal, perceptual, and computational limits create a source of
approximation error in supervision signals.</p>
<p><strong>There are spatial and temporal limits to supervision <span
class="citation" data-cites="christiano2023deep">[9]</span>.</strong>
There are limits to how much information we can observe and how much
time we can spend observing. When supervising AI systems, these limits
can prevent us from reliably mitigating proxy gaming and other
undesirable behaviors. For example, researchers trained a simulated claw
to grasp a ball using human feedback. To do so, the researchers had
human evaluators judge two pieces of footage of the model and choose
which appeared to be closer to grasping the ball. The model would then
update towards the chosen actions. However, researchers noticed that the
final model did not in fact grasp the ball. Instead, the model learned
to move the claw in front of the ball, so that it only appeared to have
grasped the ball.<p>
In this case, if the humans giving the feedback had had access to more
information (perhaps another camera angle or a higher resolution image),
they would have noticed that it was not performing the task.
Alternatively, they might have spotted the problem if given more time to
evaluate the claw. In practice, however, there are practical limits to
how many sensors and evaluators we can afford to run and how long we can
afford to run them.</p>
<figure>
<img src="https://raw.githubusercontent.com/WilliamHodgkins/AISES/main/images/proxy_gaming_ex_2_v2.png"
style="width:85.0%" class="tb-img-full"/>
<p class="tb-caption">Figure 3.13: A sensor without depth perception can be fooled by AIs that only appear to grasp a ball.</p>
<!--<figcaption>Proxy gaming example - simulated claw appearing to grasp-->
<!--ball rather than actually grasping it</figcaption>-->
</figure>
<p><strong>There are limits to how reliable supervision is.</strong>
Another potential source of difficulty is perceptual: there could be a
measuring error, or the evaluator may make incorrect judgments. For
example, we might train AIs on the proxy of stated human preferences.
Because of cognitive biases and limited time to think, humans are not
perfectly reliable. Our stated preferences are not the same as our
idealized preferences, so we might give erroneous supervision, which
could lead to the system learning undesired behaviors. For more on the
distinction between stated and idealized preferences in the context of
ML, see the Machine Ethics chapter.<p>
In general, incorporating more information into proxies makes it easier to
prevent proxy gaming. However, we can’t always afford to do so. Just as
there are limits in specifying proxies, there are limits in how much
information we can incorporate into proxies, how long a period we can
observe, and how accurate our supervision is.</p>
<p><strong>There are computational limits to supervision.</strong>
Sometimes, we use neural networks to stand in as proxies. This typically
involves training a neural network to predict how humans would evaluate
an output or what humans would prefer out of a range of options. Once
trained, this proxy model can serve as an optimization target for
another AI system.<p>
If the proxy model is too small or if it has not been trained for long
enough, it may not be a robust stand-in for human values. That is,
practical limits on model size and training compute can cause proxy
models to fail. Besides supervising AIs during training, we may run
simulations or use other neural networks to supervise advanced AI
systems for undesired behavior during deployment. Here, we run into
similar computational limits on the quality of the simulation or the
inference time available for the supervising model. If capable enough,
the system under supervision may find ways to exploit these
weaknesses.</p>
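<p>As an illustration of such a learned proxy, the sketch below (hypothetical PyTorch code, not from the text) shows the standard pairwise preference loss used to train a reward model to predict which of two outputs a human evaluator preferred; once trained, the reward model can stand in as a proxy objective for another AI system:</p>
<pre><code class="language-python">import torch.nn.functional as F

def preference_loss(reward_model, chosen, rejected):
    """Pairwise preference loss for a proxy reward model.

    `reward_model` maps a batch of outputs to scalar rewards; `chosen` are the
    outputs human evaluators preferred over `rejected`. Minimizing this loss
    pushes the model to assign higher reward to preferred outputs.
    """
    r_chosen = reward_model(chosen)      # shape: (batch,)
    r_rejected = reward_model(rejected)  # shape: (batch,)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
</code></pre>
<p>If the reward model is too small or undertrained, the rewards it assigns can diverge from human judgments, and a sufficiently capable policy may learn to exploit exactly those divergences.</p>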
<p>We have discussed ways in which proxies will predictably have defects
and why we cannot assume the solution to proxy gaming is simply to
specify the perfect objective. We have covered sources of proxy defects,
including structural errors and limits to supervision. Now, we will
discuss another proxy defect: a lack of adaptivity.</p>
<p><strong>Proxies may not adapt to new circumstances.</strong> As we
saw with Goodhart’s Law, proxies may become progressively less
appropriate over time when subjected to increasing optimization
pressure. The issue is not that the proxy was inappropriate from the
start but that it was inflexible and failed to respond to changing
circumstances. Adapting proxies over time can counter this tendency;
just as a moving goal is harder to aim at, a dynamic proxy becomes
harder to game.<p>
Imagine a bank after a robbery. In response, the bank will naturally
update its defenses. However, adaptive criminals will also alter their
tactics to bypass these new measures. Any security policy requires
constant vigilance and refinement to stay ahead of the competition.
Similarly, designing suitable proxies for AI systems that are embedded
in continuously evolving environments requires proxies to evolve in
tandem.</p>
<p><strong>Adaptive proxies can lead to proxy inflation.</strong>
Adaptive proxies introduce their own set of challenges, such as proxy
inflation. This happens when the benchmarks of a proxy rise higher and
higher because agents optimize for better rewards <span class="citation"
data-cites="john2023deada">[1]</span>. As agents excel at gaming the
system, the standards have to be continually recalibrated upwards to
keep the proxy meaningful.<p>
Consider an example from some education systems: “teaching to the test”
has led to ever-rising median test scores. This hasn’t necessarily meant
that students improved academically but rather that they’ve become
better at meeting test criteria. In the UK, to combat this tendency and
ensure educators could continue to differentiate student abilities, the
system introduced a new grade, A* <span class="citation"
data-cites="lambert2019great">[10]</span>. Any adjustment to the proxy
can usher in new ways for agents to exploit it, setting off a cycle of
escalating standards and new countermeasures.</p>
<h2 id="adversarial-examples">3.3.3 Adversarial Examples</h2>
<p>Adversarial examples are another type of risk due to optimization
pressure, which, similar to proxy gaming, exploits the gap between a
proxy and the idealized goal. These can enable adversarial attacks that cause an AI system to malfunction or produce outputs that were not intended by its developer.
In this section, we present an example of
such an attack, explain the risk factors, and cover basic techniques for
defending against adversarial attacks.</p>
<p><strong>It is possible to optimize against a neural network.</strong>
Neural networks are vulnerable to <em>adversarial examples</em> —
carefully crafted inputs that cause a model to make a mistake <span
class="citation" data-cites="goodfellow2015explaining">[11]</span>. In
the case of vision models, this might mean changing pixel values to
cause a classifier to mislabel an image. In the case of language models,
this might mean adding a set of tokens to the prompt in order to provoke
harmful completions. Susceptibility to adversarial examples is a
long-standing weakness of AI models.</p>
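<p>For intuition, the sketch below (illustrative PyTorch code, not from the text) implements the classic fast gradient sign method (FGSM) for crafting such inputs: it computes the gradient of the loss with respect to the input and steps in the direction that increases the loss, bounded by a small budget per pixel:</p>
<pre><code class="language-python">import torch
import torch.nn.functional as F

def fgsm_attack(model, x, label, epsilon):
    """Craft an adversarial example with the fast gradient sign method (FGSM)."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), label)
    loss.backward()
    # Move each pixel by +/- epsilon in whichever direction increases the loss,
    # then clip back to the valid image range.
    return (x_adv + epsilon * x_adv.grad.sign()).clamp(0.0, 1.0).detach()
</code></pre>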
<p><strong>Adversarial examples and proxy gaming exploit the gap between
the proxy and the idealized goal.</strong> In the case of adversarial
examples, the primary target is a neural network. Historically,
adversarial examples have often been constructed by variants of gradient
descent, though optimizers are now increasingly AI agents as well.
Conversely, in proxy gaming, the target to be gamed is a proxy, which
might be instantiated by a neural network (but is not necessarily). The
optimizer responsible for gaming the proxy is typically an agent, be it
human or AI, but optimizers are usually not based on gradient
descent.<p>
Adversarial examples typically aim to minimize performance according to
a reference tas<em>k</em>, while invoking a mistaken response in the
attacked neural network. Consider an imperceptible perturbation to an
image of a cat that causes the classifier to predict that an image is
90% likely to be guacamole <span class="citation"
data-cites="athalye2017fooling">[12]</span>. This prediction is wrong
according to the label humans would assign the input and is
misclassified by the attacked neural network.<p>
Meanwhile, the aim in proxy gaming is to maximize performance according
to the proxy, even when that goes against the idealized goal. The boat
goes in circles because doing so results in more points, even though it harms
the boat’s progress towards completing the race. More generally, heavy
optimization pressure regularly causes proxies
to diverge from idealized goals.<p>
Despite these differences, both scenarios exploit the gap between the
proxy and the intended goal set by the designer. The problem setups are
becoming increasingly similar.</p>
<p><strong>Adversarial examples are not necessarily
imperceptible.</strong> Traditionally, the field of adversarial
robustness has formulated the problem of creating adversarial examples
in terms of finding the minimal perturbation (whose magnitude is smaller
than an upper bound <span class="math inline"><em>ϵ</em></span>) needed
to provoke a mistake. Consider the example in the figure below, where
the perturbed input is indistinguishable to a human from the
original.<p>
Although modern models can be defended against these imperceptible
perturbations, they cannot necessarily be defended against larger
perturbations. Adversarial examples are not about imperceptible
perturbations but about adversaries changing inputs to cause models to
make a mistake.<p>
</p>
<figure id="fig:quacatmole">
<p><img src="https://raw.githubusercontent.com/WilliamHodgkins/AISES/main/images/advex.png" alt="image" class="tb-img-full" style="width: 70%"/>
<p class="tb-caption">Figure 3.14: Imperceptible crafted perturbations of a photo of a cat can cause a neural network to
label it guacamole.
</p>
<!--<figcaption>Example imperceptible perturbations leading to-->
<!--misclassification by neural network</figcaption>-->
</figure>
<p><strong>Adversarial examples are not unique to neural
networks.</strong> Let us consider a worked example of an adversarial
example for a simple linear classifier. This example is enough to
understand the basic risk factors for adversarial examples. Readers who
do not want to work through the mathematical notation can skip ahead to
the discussion of adversarial examples beyond vision models below.<p>
Suppose we are given a binary classifier <span
class="math inline"><em>f</em>(<em>x</em>)</span> that predicts whether
an input <span class="math inline"><em>x</em></span> belongs to class
<span class="math inline"><em>A</em></span> or <span
class="math inline"><em>B</em></span>. The classifier first estimates
the probability <span
class="math inline"><em>p</em>(<em>A</em>|<em>x</em>)</span> that input
<span class="math inline"><em>x</em></span> belongs to class <span
class="math inline"><em>A</em></span>. Any given input has to belong to
one of the classes, <span
class="math inline"><em>p</em>(<em>B</em>|<em>x</em>) = 1 − <em>p</em>(<em>A</em>|<em>x</em>)</span>,
so this fixes the probability of <span
class="math inline"><em>x</em></span> belonging to class <span
class="math inline"><em>B</em></span> as well. To classify <span
class="math inline"><em>x</em></span>, we simply predict whichever class
has the higher probability:</p>
<p><span class="math display">$$f(x) = \begin{cases}
A & \text{if } p(A|x) > 50\%\text{,} \\
B & \text{otherwise.}
\end{cases}$$</span></p>
<p>The probability <span
class="math inline"><em>p</em>(<em>A</em>|<em>x</em>)</span> is given by
a sigmoid function:</p>
<p><span class="math display">$$p(A|x)=\sigma(x)=\frac{\exp
\left(w^{\top} x\right)}{1+\exp \left(w^{\top} x\right)},$$</span></p>
<p>which is guaranteed to produce an output between <span
class="math inline">0</span> and <span class="math inline">1</span>.
Here, <span class="math inline"><em>x</em></span> and <span
class="math inline"><em>w</em></span> are vectors with <span
class="math inline"><em>n</em></span> components (for now, we’ll assume
<span class="math inline"><em>n</em> = 10</span>).<p>
Suppose that after training, we’ve obtained some weights <span
class="math inline"><em>w</em></span>, and we’d now like to classify a
new element <span class="math inline"><em>x</em></span>. However, an
adversary has access to the input and can apply a perturbation; in
particular, the adversary can change each component of <span
class="math inline"><em>x</em></span> by <span
class="math inline"><em>ε</em> = ± 0.5</span>. How much can the
adversary change the classification?<p>
The following table depicts example values for <span
class="math inline"><em>x</em></span>, <span
class="math inline"><em>x</em> + <em>ϵ</em></span>, and <span
class="math inline"><em>w</em></span>.<p>
</p>
<br>
<div id="tab:input-output">
<table class="tableLayout">
<thead>
<tr class="even">
<td style="text-align: center;">Input <span
class="math inline"> <em>x</em></span></td>
<td style="text-align: center;">2</td>
<td style="text-align: center;">-1</td>
<td style="text-align: center;">3</td>
<td style="text-align: center;">-2</td>
<td style="text-align: center;">2</td>
<td style="text-align: center;">2</td>
<td style="text-align: center;">1</td>
<td style="text-align: center;">-4</td>
<td style="text-align: center;">5</td>
<td style="text-align: center;">1</td>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: center;">Adv Input <span
class="math inline"> <em>x</em> + <em>ε</em></span></td>
<td style="text-align: center;">1.5</td>
<td style="text-align: center;">-1.5</td>
<td style="text-align: center;">3.5</td>
<td style="text-align: center;">-2.5</td>
<td style="text-align: center;">1.5</td>
<td style="text-align: center;">1.5</td>
<td style="text-align: center;">1.5</td>
<td style="text-align: center;">-3.5</td>
<td style="text-align: center;">4.5</td>
<td style="text-align: center;">1.5</td>
</tr>
<tr class="even">
<td style="text-align: center;">Weight <span
class="math inline"> <em>w</em></span></td>
<td style="text-align: center;">-1</td>
<td style="text-align: center;">-1</td>
<td style="text-align: center;">1</td>
<td style="text-align: center;">-1</td>
<td style="text-align: center;">1</td>
<td style="text-align: center;">-1</td>
<td style="text-align: center;">1</td>
<td style="text-align: center;">1</td>
<td style="text-align: center;">-1</td>
<td style="text-align: center;">1</td>
</tr>
</tbody>
</table>
</div>
<br>
<p><span id="tab:my_label"></span></p>
<p>For the original input, <span
class="math inline"><em>w</em><sup>T</sup><em>x</em> = − 2 + 1 + 3 + 2 + 2 − 2 + 1 − 4 − 5 + 1 = − 3</span>,
which gives a probability of <span
class="math inline"><em>σ</em>(<em>x</em>) = 0.05</span>. Using the
adversarial input, where each perturbation is of magnitude 0.5 (but
varying in sign), we obtain <span
class="math inline"><em>w</em><sup>T</sup>(<em>x</em>+<em>ε</em>) = − 1.5 + 1.5 + 3.5 + 2.5 + 2.5 − 1.5 + 1.5 − 3.5 − 4.5 + 1.5 = 2</span>,
which has a probability of 0.88.<p>
The adversarial perturbation changed the probability the network assigns to
class A from 5% to 88%. That is, the cumulative effect of many small changes makes
the adversary powerful enough to change the classification decision.
This is not unique to simple classifiers but omnipresent in complex deep
learning systems.</p>
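<p>The calculation above can be verified in a few lines (a sketch using the values from the table; the adversary’s best move is to shift each component by 0.5 in the direction of its weight):</p>
<pre><code class="language-python">import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([2, -1, 3, -2, 2, 2, 1, -4, 5, 1], dtype=float)
w = np.array([-1, -1, 1, -1, 1, -1, 1, 1, -1, 1], dtype=float)
eps = 0.5

print(sigmoid(w @ x))        # ~0.05: p(A|x) for the original input

# Perturb each component by +/- 0.5 in the direction of its weight,
# which maximally increases w^T x.
x_adv = x + eps * np.sign(w)
print(sigmoid(w @ x_adv))    # ~0.88: p(A|x) after the perturbation
</code></pre>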
<p><strong>Adversarial examples depend on the size of the perturbation
and the number of degrees of freedom.</strong> Given the above example,
how could an adversary increase the effects of the perturbation? If the
adversary could apply a larger epsilon (if they had a larger
<em>distortion budget</em>), then clearly they could have a greater
effect on the final confidence. But there’s another deciding factor: the
number of degrees of freedom. Imagine if the attacker had only one
degree of freedom, so there are fewer points to attack:</p>
<br>
<table class="tableLayout">
<thead>
<tr class="header">
<td style="text-align: center;">Input <span
class="math inline"><em>x</em></span></td>
<td style="text-align: center;">2</td>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: center;">Adversarial Input <span
class="math inline"><em>x</em> + <em>ε</em></span></td>
<td style="text-align: center;">1.5</td>
</tr>
<tr class="even">
<td style="text-align: center;">Weight <span
class="math inline"><em>w</em></span></td>
<td style="text-align: center;">1</td>
</tr>
</tbody>
</table>
<br>
<p>In this example, we have that <span
class="math inline"><em>w</em><em>x</em> = 2</span>, giving a
probability of <span
class="math inline"><em>σ</em>(<em>x</em>) = 0.88</span>. If we apply
the perturbation, <span
class="math inline"><em>w</em>(<em>x</em>+<em>ε</em>) = 1.5</span>, we
obtain a probability of <span
class="math inline"><em>σ</em>(<em>x</em>) = 0.82</span>. With fewer
degrees of freedom, the adversary has less room to maneuver.</p>
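<p>More generally (a brief note beyond the worked example), with a per-component budget of <span class="math inline"><em>ε</em></span>, the largest shift the adversary can produce in the classifier’s score is obtained by moving every component by <span class="math inline"><em>ε</em></span> in the direction of its weight:</p>
<p><span class="math display">$$\max_{\|\delta\|_{\infty} \le \varepsilon} w^{\top}\delta = \varepsilon \sum_{i=1}^{n} |w_i| = \varepsilon \|w\|_1,$$</span></p>
<p>which grows with the number of components. With ten components of weight ±1 and <span class="math inline"><em>ε</em> = 0.5</span>, the adversary can shift the score by 5; with a single component, by only 0.5.</p>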
<p><strong>Adversarial examples are not unique to vision
models.</strong> Though the literature on adversarial examples started
in image classification, these vulnerabilities also occur in text-based
models. Researchers have devised novel adversarial attacks that
automatically construct <em>jailbreaks</em> that cause models to produce
unintended responses. Jailbreaks are carefully crafted sequences of
characters that, when appended to user prompts, cause models to obey
those prompts even if they result in the model producing harmful
content. Concerningly, these attacks transfer straightforwardly to
models that were not used while developing the attacks <span
class="citation" data-cites="zou2023universal">[13]</span>.<p>
</p>
<figure id="fig:gpt-jailbreak">
<img src="https://raw.githubusercontent.com/WilliamHodgkins/AISES/main/images/chatgpt.png" class="tb-img-full" style="width: 80%"/>
<p class="tb-caption">Figure 3.15: Using adversarial prompts can cause LLMs to produce harmful outputs. <span class="citation"
data-cites="zou2023universal">[13]</span>.</p>
<!--<figcaption>Harmful outputs produced by language models due to-->
<!--automatically generated adversarial prompts - <span class="citation"-->
<!--data-cites="zou2023universal">[13]</span>.</figcaption>-->
</figure>
<p><strong>Adversarial Robustness.</strong> The ability of AI models to
resist being fooled or misled by adversarial attacks is known as
<em>adversarial robustness</em>. While the people designing AI systems
want to ensure that their systems are robust, it may not be clear from
the outset whether a given system is robust. Simply achieving high
accuracy on a test set doesn’t ensure a system’s robustness.</p>
<p><strong>Defending against adversarial attacks.</strong> One method to
increase a system’s robustness to adversarial attacks, <em>adversarial
training</em>, works by augmenting the training data with adversarial
examples. However, most adversarial training techniques assume
unrealistically simple threat models. Moreover, adversarial training is not
without its downsides, as it often harms accuracy on unperturbed inputs.
Furthermore, progress in this direction has been slow.</p>
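<p>A minimal sketch of a single adversarial training step is shown below (hypothetical code on a toy classifier and a random stand-in batch; real pipelines typically use stronger multi-step attacks than FGSM):</p>
<pre><code class="language-python">import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))   # toy classifier
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

def fgsm_attack(model, x, label, epsilon):
    """Same FGSM sketch as earlier, re-defined here for completeness."""
    x_adv = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x_adv), label).backward()
    return (x_adv + epsilon * x_adv.grad.sign()).clamp(0.0, 1.0).detach()

# One adversarial training step: craft adversarial examples against the current
# model, then update the model on those examples instead of the clean ones.
x, y = torch.rand(32, 1, 28, 28), torch.randint(0, 10, (32,))  # stand-in batch
x_adv = fgsm_attack(model, x, y, epsilon=8 / 255)
loss = F.cross_entropy(model(x_adv), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
</code></pre>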
<p><strong>Risks from adversarial attacks.</strong> The difficulties
in building AI systems that are robust to adversarial attacks are concerning
for a number of reasons. AI developers may wish to prevent general-purpose AI
systems such as Large Language Models (LLMs) from being used for harmful purposes
such as assisting with fraud, cyber-attacks, or terrorism. There is already some
initial evidence that LLMs are being used for these purposes <span
class="citation" data-cites="Malicious">[14]</span>.
Developers may therefore train their AI systems to reject requests to assist with
these types of activities. However, there are many examples of adversarial attacks
that can bypass the guardrails of current AI systems such as large language models.
This is a serious obstacle to preventing the misuse of AI systems for malicious and
harmful purposes (see Section 1.2 for further discussion of these risks).</p>
<h2 id="trojans-and-other-attacks">3.3.4 Trojan Attacks and Other Security Threats</h2>
<p><strong>AI systems are vulnerable to a range of attacks beyond adversarial examples.</strong> Data poisoning and backdoors allow adversaries to manipulate models and implant hidden
functionality. Attackers may also be able to maliciously extract training data or exfiltrate a model's weights.
<p><strong>Models may contain hidden “backdoors” or “Trojans”.</strong> Deep learning models are known to be vulnerable to Trojan attacks. A “Trojaned” model will behave just as a normal
model would in almost all circumstances. In a very small number of circumstances, however, it will behave very differently. For example, a facial recognition system used to control
access to a building might operate normally in almost all circumstances but have a backdoor that could be triggered by a specific item of clothing chosen by the adversary. An adversary
wearing this clothing would be allowed to enter the building by the facial recognition system. Backdoors could present particularly serious vulnerabilities in the context of sequential
decision-making systems, where a trigger could lead an AI system to carry out a coherent and harmful series of actions.</p>
<p>Backdoors are created by adversaries during the training process, either by directly inserting them into a model's weights or by adding poisoned data to the datasets used for training or
pretraining of AI systems. The insertion of backdoors through data poisoning becomes increasingly easy as AI systems are trained on enormous datasets scraped directly from the Internet with
only limited filtering or curation. There is evidence that even a relatively small number of data points can be sufficient to poison a model: simply by uploading a few carefully designed
images, code snippets, or sentences to online platforms, adversaries can inject a backdoor into future models that are trained on data scraped from these websites <span
class="citation" data-cites="carlini2023poisoning">[15]</span>.
Models derived from the original poisoned model might inherit this backdoor, leading to a proliferation of backdoors across multiple models.</p>
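<p>The sketch below (hypothetical code, not from the cited work) illustrates the basic idea behind this kind of data poisoning: a small, fixed trigger pattern is stamped onto a handful of training images whose labels are switched to the attacker’s target class, so that a model trained on the poisoned data behaves normally on clean inputs but outputs the target class whenever the trigger is present:</p>
<pre><code class="language-python">import numpy as np

def poison_example(image, target_label, trigger_value=1.0):
    """Stamp a small trigger patch into the corner of `image` and relabel it.

    `image` is assumed to be a 2D array of pixel values in [0, 1]. A model
    trained on enough of these examples learns to associate the patch with
    `target_label` while behaving normally on unpatched inputs.
    """
    poisoned = image.copy()
    poisoned[-3:, -3:] = trigger_value   # 3x3 trigger patch in the corner
    return poisoned, target_label

# Poison a small fraction of a (hypothetical) training set.
images = np.random.rand(1000, 28, 28)
labels = np.random.randint(0, 10, size=1000)
for i in np.random.choice(1000, size=10, replace=False):
    images[i], labels[i] = poison_example(images[i], target_label=0)
</code></pre>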
<p>Trojan detection research aims to improve our ability to detect Trojans or other hidden functionality within ML models. In this research, models are poisoned with a Trojan attack by one
researcher. Another researcher then tries to detect Trojans in the neural network, perhaps with transparency tools or other neural networks. Typical techniques involve looking at the model’s
internal weights and identifying unusual patterns or behaviors that are only present in Trojan models. Better methods to curate and inspect training data could also reduce the risk of
inadvertently using poisoned data.
<p><strong>Attackers can extract private data or model weights from AI systems.</strong> Models may be trained on private data or on large datasets scraped from the internet that include
private information about individuals. It has been demonstrated that attacks can recover individual examples of training data from a language model <span
class="citation" data-cites="carlini2020trainingdata">[16]</span>. This
can be conducted on a large scale, extracting gigabytes of potentially confidential data from language models like ChatGPT <span
class="citation" data-cites="nasr2023scalable">[17]</span>. Even if models are not publicly available
to download and can only be accessed via a query interface or API, it is also possible to exfiltrate part or all of the model weights by making queries to its API, allowing its functionality
to be replicated. Adversaries might be able to steal a model or its training data in order to use this for malicious purposes.
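<p>One simple form of this attack, replicating a model’s functionality by training a local copy on its query responses, can be sketched as follows (hypothetical code; real attacks on production systems are far more query-efficient and can in some cases recover actual weight matrices):</p>
<pre><code class="language-python">import torch
import torch.nn as nn
import torch.nn.functional as F

victim = nn.Sequential(nn.Linear(20, 10))    # only reachable through queries
student = nn.Sequential(nn.Linear(20, 10))   # the attacker's local copy
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)

for _ in range(1000):
    queries = torch.randn(64, 20)                       # attacker-chosen inputs
    with torch.no_grad():
        victim_probs = victim(queries).softmax(dim=-1)  # responses from the API
    # Train the local copy to imitate the victim's output distribution.
    loss = F.kl_div(student(queries).log_softmax(dim=-1), victim_probs,
                    reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
</code></pre>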
<h2 id="tail-risk-ai-evaluator-gaming">3.3.5 Tail Risk: AI Evaluator
Gaming</h2>
<p><strong>AI evaluators must be robust to proxy gaming and adversarial
examples.</strong> As the world becomes more and more automated, humans
may be too unreliable or too slow to scalably monitor and steer various
aspects of advanced AI systems. We may come to depend more on AI systems
to monitor and steer other AIs. For example, some of these evaluator
systems might take the role of proxies used to train other AIs. Other
evaluators might actively screen the behaviors and outputs of deployed
AIs. Yet other systems might act as watchdogs that look for warning
signs of rogue AIs or catastrophic misuse.<p>
In each of these cases, there’s a risk that the AI systems may find ways
to exploit defects in the supervising AI systems, which are stand-in
proxies to help enforce and promote human values. If AIs find ways to
game the training evaluators, they will not learn from an accurate
representation of human values. If AIs are able to game the systems
monitoring them during deployment, then we cannot rely on those
monitoring systems.<p>
Similarly, AIs may be adversarial to other AIs. If AIs find ways to
bypass the evaluators by crafting adversarial examples, then the risk is
that our values are not just incidentally but actively optimized
against. Watchdogs that can be fooled are not good watchdogs.</p>
<p><strong>It is unclear whether the balance leans towards defense or
offense.</strong> Currently, we do not know whether it is easier for
evaluation and monitoring systems to defend against attacks, or for
optimizers to find vulnerabilities in these safeguards. If the existing
literature on adversarial examples provides any indication, it would
suggest the balance lies in favor of the offense. It has historically
been easier to subvert systems with attacks than to make AI systems
adversarially robust.</p>
<p><strong>The more intelligent the AI, the better it will be at
exploiting proxies.</strong> In the future, AIs will likely be used to
further AI R&D. That is, AI systems will be involved in developing
more capable successor systems. In these scenarios, it becomes
especially important for the monitoring systems to be robust to proxy
gaming and adversarial attacks. If these safeguards are vulnerable, then
we cannot guarantee that the successor systems are safe and subject to
human control. Simply increasing the number of evaluators may not be
enough to detect and prevent more subtle kinds of attacks.</p>
<h3 id="conclusion">Conclusion</h3>
<p>In this section, we explored the role of proxies in ML and the associated risks of proxy gaming. We discussed other challenges to the robustness and security of AI systems,
such as data poisoning and Trojan attacks, as well as the extraction of model weights and training data.</p>
<p><strong>Optimizers can exploit proxy goals, leading to unintended
outcomes.</strong> We began by looking at the need for quantitative
proxies to stand in for our idealized goals when training AI systems. By
definition, proxies may miss certain aspects of these idealized goals.
Proxy gaming is when an optimizer exploits these gaps in a way that
leads to undesired behavior. Under sufficient optimization pressure,
this gap can grow, and the proxy and idealized goals may become
uncorrelated or even anticorrelated (Goodhart’s Law). Both in human
systems and AI systems, proxy gaming can lead to catastrophic
outcomes.<p>
Approximation error is, to a large extent, inevitable, so the question
is not whether a given proxy is or is not acceptable, but how accurate
it is and how robust it is to optimization pressure. Proxies are
necessary; they are often better than having no approximation of our
idealized goals.</p>
<p><strong>Perfecting proxies may be impossible.</strong> Proxies may
fail because they are too simple and thus fail to include some of the
intrinsic goods we value. They may also fail because complex
goal-directed systems often break goals apart and delegate to systems
that have additional, sometimes conflicting, goals, which can distort
the overall goal. These structural errors prevent us from mitigating
proxy gaming by just choosing “better proxies”.<p>
In addition, when we use AI systems to evaluate other AI systems, the
evaluator may be unable to provide proper evaluation because of spatial,
temporal, perceptual, and computational limits. There may not be enough
sensors or the observation window may be too short for the evaluator to
be able to produce a well-informed judgment. Even with enough
information available, the evaluator may lack the capacity or compute
necessary to make a correct determination reliably. Alternatively, the
evaluator may simply make mistakes and give erroneous feedback.<p>
Finally, proxies can fail if they are inflexible and fail to adapt to
changing circumstances. Since increased optimization pressure can cause
proxies to diverge from idealized goals, preventing proxies from
diverging requires them to be continually adjusted and recalibrated
against the idealized goals.</p>
<p><strong>AI proxies are vulnerable to exploitation.</strong>
Adversarial examples are a vulnerability of AI systems where an
adversary can design inputs that achieve good performance according to
the model while minimizing performance according to some outside
criterion. If we use AIs to instantiate our proxies, adversarial
examples make room for optimizers to actively take advantage of the gap
between a proxy and an idealized goal.</p>
<p><strong>All proxies are wrong, some are useful, and some are
catastrophic.</strong> If we rely increasingly on AI systems evaluating
other systems, proxy gaming and adversarial attacks (more broadly,
optimization pressure) could lead to catastrophic failures. The systems
being evaluated could game the evaluations or craft adversarial examples
that bypass the evaluations. It remains unclear how to protect against
these risks in contemporary AI systems, let alone in more capable
future systems.</p>
<br>
<br>
<h3>References</h3>
<div id="refs" class="references csl-bib-body" data-entry-spacing="0"
role="list">
<div id="ref-john2023deada" class="csl-entry" role="listitem">
<div class="csl-left-margin">[1] Y.
J. John, L. Caldwell, D. E. McCoy, and O. Braganza, <span>“Dead rats,
dopamine, performance metrics, and peacock tails: Proxy failure is an
inherent risk in goal-oriented systems,”</span> <em>Behavioral and Brain
Sciences</em>, pp. 1–68, Jun. 2023, doi: <a
href="https://doi.org/10.1017/S0140525X23002753">10.1017/S0140525X23002753</a>.</div>
</div>
<div id="ref-clark2016faulty" class="csl-entry" role="listitem">
<div class="csl-left-margin">[2] J.
Clark and D. Amodei, <span>“Faulty reward functions in the wild.”</span>
Dec. 2016.</div>
</div>
<div id="ref-obermeyer2019dissecting" class="csl-entry" role="listitem">
<div class="csl-left-margin">[3] Z.
Obermeyer, B. Powers, C. Vogeli, and S. Mullainathan, <span>“Dissecting
racial bias in an algorithm used to manage the health of
populations,”</span> <em>Science</em>, vol. 366, no. 6464, pp. 447–453,
Oct. 2019, doi: <a
href="https://doi.org/10.1126/science.aax2342">10.1126/science.aax2342</a>.</div>
</div>
<div id="ref-skalsedefining" class="csl-entry" role="listitem">
<div class="csl-left-margin">[4] J.
Skalse, N. H. R. Howe, D. Krueger, and D. Krasheninnikov,
<span>“Defining and <span>Characterizing Reward Hacking</span>,”</span>
2022.</div>
</div>
<div id="ref-pan2022effects" class="csl-entry" role="listitem">
<div class="csl-left-margin">[5] A.
Pan, K. Bhatia, and J. Steinhardt, <span>“The effects of reward
misspecification: Mapping and mitigating misaligned models.”</span>
2022. Available: <a
href="https://arxiv.org/abs/2201.03544">https://arxiv.org/abs/2201.03544</a></div>
</div>
<div id="ref-goodhart1975problems" class="csl-entry" role="listitem">
<div class="csl-left-margin">[6] C.
Goodhart, <span>“Problems of monetary management : The
<span>U</span>.<span>K</span>. experience,”</span> <em>Papers in
monetary economics 1975 ; 1</em>, vol. 1, 1975.</div>
</div>
<div id="ref-strathern1997improving" class="csl-entry" role="listitem">
<div class="csl-left-margin">[7] M.
Strathern, <span>“<span>‘<span>Improving</span> ratings’</span>: Audit
in the <span>British University</span> system,”</span> <em>European
Review</em>, vol. 5, no. 3, pp. 305–321, Jul. 1997, doi: <a
href="https://doi.org/10.1002/(SICI)1234-981X(199707)5:3<305::AID-EURO184>3.0.CO;2-4">10.1002/(SICI)1234-981X(199707)5:3<305::AID-EURO184>3.0.CO;2-4</a>.</div>
</div>
<div id="ref-roose2019making" class="csl-entry" role="listitem">
<div class="csl-left-margin">[8] K.
Roose, <span>“The <span>Making</span> of a <span>YouTube
Radical</span>,”</span> <em>The New York Times</em>, Jun. 2019.</div>
</div>
<div id="ref-christiano2023deep" class="csl-entry" role="listitem">
<div class="csl-left-margin">[9] P.
Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei,
<span>“Deep reinforcement learning from human preferences.”</span>
<span>arXiv</span>, Feb. 2023. doi: <a
href="https://doi.org/10.48550/arXiv.1706.03741">10.48550/arXiv.1706.03741</a>.</div>
</div>
<div id="ref-lambert2019great" class="csl-entry" role="listitem">
<div class="csl-left-margin">[10] H.
Lambert, <span>“The great university con: How the <span>British</span>
degree lost its value,”</span> <em>New Statesman</em>, Aug. 2019.</div>
</div>
<div id="ref-goodfellow2015explaining" class="csl-entry"
role="listitem">
<div class="csl-left-margin">[11] I.
J. Goodfellow, J. Shlens, and C. Szegedy, <span>“Explaining and
<span>Harnessing Adversarial Examples</span>.”</span>
<span>arXiv</span>, Mar. 2015.</div>
</div>
<div id="ref-athalye2017fooling" class="csl-entry" role="listitem">
<div class="csl-left-margin">[12] A.
Athalye, L. Engstrom, A. Ilyas, and K. Kwok, <span>“Fooling <span>Neural
Networks</span> in the <span>Physical World</span>,”</span>
<em>labsix</em>. Oct. 2017.</div>
</div>
<div id="ref-zou2023universal" class="csl-entry" role="listitem">
<div class="csl-left-margin">[13] A.
Zou, Z. Wang, J. Z. Kolter, and M. Fredrikson, <span>“Universal and
<span>Transferable Adversarial Attacks</span> on <span>Aligned Language
Models</span>.”</span> <span>arXiv</span>, Jul. 2023. doi: <a
href="https://doi.org/10.48550/arXiv.2307.15043">10.48550/arXiv.2307.15043</a>.</div>
</div>
<div id="ref-Malicious" class="csl-entry" role="listitem">
<div class="csl-left-margin">[14] OpenAI <span>“Disrupting malicious uses of AI by state-affiliated threat actors.” </span> Available on: <a
href="https://openai.com/blog/disrupting-malicious-uses-of-ai-by-state-affiliated-threat-actors">OpenAi</a>.</div>
</div>
<div id="ref-carlini2023poisoning" class="csl-entry" role="listitem">
<div class="csl-left-margin">[15] Nicholas Carlini et al. <span>“Poisoning Web-Scale Training Datasets is Practical.” </span> 2023, arXiv: 2302.10149.</div>
</div>
<div id="ref-carlini2020trainingdata" class="csl-entry" role="listitem">
<div class="csl-left-margin">[16] Nicholas Carlini et al. <span>“Extracting Training Data from Large Language Models.” </span> 2020, arXiv: 2012.07805.</div>
</div>
<div id="ref-nasr2023scalable" class="csl-entry" role="listitem">
<div class="csl-left-margin">[17] Milad Nasr et al. <span>“. Scalable Extraction of Training Data from (Production) Language Models.” </span> 2023, arXiv: 2311.17035.</div>
</div>
</div>