<h1 id="safe-design-principles">4.4 Safe Design Principles</h1>
<p>We can reduce both the probability and severity of a system failure
by following certain <em>safe design principles</em> when designing it.
These general principles have been identified by safety engineering and
offer practical ways of reducing the risk associated with all kinds of
systems. They should be incorporated from the outset, rather than being
retrofitted later. This strategy attempts to “build safety into” a
system, and is more robust than building the system without safety
considerations and then attempting to fix individual problems if and
when they become apparent.<p>
Note that these principles are useful not only in building an AI itself,
but also in designing the systems around it. For example, we can
incorporate them into the design of the cybersecurity system that
controls who is able to access an AI, and into the operations of the
organization, or system of humans, that is creating the AI.<p>
We will now explore eight of these principles and how they might be
applied to AI systems:</p>
<ol>
<li><p><strong>Redundancy</strong>: having multiple backup components
that can perform each critical function, so that a single component
failing is not enough to cause an accident.</p></li>
<li><p><strong>Transparency</strong>: ensuring that operators have
enough knowledge of how a system functions under various circumstances
to interact with it safely.</p></li>
<li><p><strong>Separation of duties</strong>: having multiple agents in
charge of subprocesses so that no individual can misuse the entire
system alone.</p></li>
<li><p><strong>Principle of least privilege</strong>: giving each agent
the minimum access necessary to complete their tasks.</p></li>
<li><p><strong>Fail-safes</strong>: ensuring that the system will be
safe even if it fails.</p></li>
<li><p><strong>Antifragility</strong>: learning from previous failures
to reduce the likelihood of failing again in future.</p></li>
<li><p><strong>Negative feedback mechanisms</strong>: building in
processes that naturally down-regulate operations in the event that
operators lose control of the system.</p></li>
<li><p><strong>Defense in depth</strong>: employing multiple safe design
principles rather than relying on just one, since any safety feature
will have weaknesses.</p></li>
</ol>
<p>Note that, depending on the exact type of system, some of these safe
design principles might be less useful or even counterproductive. We
will discuss this later on in the chapter. However, for now, we will
explore the basic rationale behind why each one improves safety.</p>
<h2 id="redundancy">4.4.1 Redundancy</h2>
<p><strong>Redundancy means having multiple “backup” components <span
class="citation" data-cites="Hendrycks2022xrisk">[1]</span>.</strong>
Having multiple braking systems in a vehicle means that, even if the
foot brake is not working well, the handbrake should still be able to
decelerate the vehicle in an emergency. A failure of a single brake
should therefore not be enough to cause an accident. This is an example
of <em>redundancy</em>, where multiple components can perform a critical
function, so that a single component failing is not enough to cause the
whole system to fail. In other words, redundancy removes single points
of failure. Other examples of redundancy include saving important
documents on multiple hard drives, in case one of them stops working,
and seeking multiple doctors’ opinions, in case one of them gets a
diagnosis wrong.<p>
A possible use of redundancy in AI would be having an inbuilt “moral
parliament” (see 6.9 Moral Uncertainty). If an AI agent has to make decisions
with moral implications, we face the question of which moral theory it
should follow; there are many candidates, and each can produce
counterintuitive recommendations in extreme cases. Therefore, we
might not want an AI to adhere strictly to just one theory. Instead, we
could use a moral parliament, in which we emulate representatives of
stakeholders or moral theories, let them negotiate and vote, and then do
what the parliament recommends. The different theories would essentially
be redundant components, each usually recommending plausible actions but
unable to dictate what happens in extreme cases, reducing the likelihood
of counterintuitive decisions that we would consider harmful.</p>
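<p>As a rough illustration of this kind of redundancy, the Python sketch
below implements a highly simplified moral parliament as weighted voting
over a few stand-in moral theories. The delegate functions, weights, and
action format are hypothetical placeholders; a real moral parliament
would involve far richer representations and negotiation.</p>
<pre><code># A toy "moral parliament": several stand-in moral theories each vote on a
# candidate action, and the action is taken only if a weighted majority of the
# delegates approves. The theories, weights, and action format below are
# hypothetical placeholders, not a real decision procedure.

def utilitarian_delegate(action):
    # Approves actions whose estimated net welfare is positive.
    return action["expected_welfare"] &gt; 0

def deontological_delegate(action):
    # Rejects actions that violate any hard constraint.
    return not action["violates_constraints"]

def virtue_delegate(action):
    # Approves actions tagged as honest.
    return action["is_honest"]

DELEGATES = [
    (utilitarian_delegate, 0.4),
    (deontological_delegate, 0.4),
    (virtue_delegate, 0.2),
]

def parliament_approves(action, threshold=0.5):
    """Return True if the weighted vote in favour exceeds the threshold."""
    votes_for = sum(weight for delegate, weight in DELEGATES if delegate(action))
    return votes_for &gt; threshold

candidate = {"expected_welfare": 3.0, "violates_constraints": False, "is_honest": True}
print(parliament_approves(candidate))  # True: every delegate approves
</code></pre>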
<h2 id="separation-of-duties">4.4.2 Separation of Duties</h2>
<p><strong>Separation of duties means no single agent can control or
misuse the system alone <span class="citation"
data-cites="Hendrycks2022xrisk">[1]</span>.</strong> Consider a system
where one person controls all the different components and processes. If
that person decides to pursue a negative outcome, they will be able to
leverage the whole system to do so. On the other hand, we could separate
duties by having multiple operators, each in charge of a different
aspect. In this case, if one individual decides to pursue a negative
outcome, their capacity to do harm will be smaller.</p>
<p>For example, imagine a lab that handles two chemicals that, if mixed
in high enough quantities, could cause a large explosion. To avoid this
happening, we could keep the stock of the two chemicals in separate
cupboards, and have a different person in charge of supplying each one
in small quantities to researchers. This way, no individual has access
to a large amount of both chemicals.</p>
<p><strong>We could focus on multiple narrow AI models, instead of a
single general one.</strong> In designing AI systems, we could follow
this principle by having multiple agents, each of which is highly
specialized for a different task. Complex processes can then be carried
out collectively by these agents working together, rather than having a
single agent conducting the whole process alone.<p>
This is exemplified by an approach to AI development called
“comprehensive AI services,” which views AI as a collection of narrow,
service-providing products rather than a single general agent. Adopting
this mindset might mean tailoring AI systems to perform highly specific
tasks, rather than speculatively trying to build general and flexible
capabilities into a single agent.</p>
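<p>To give a flavour of this, the Python sketch below routes each
request to a narrow, task-specific service instead of a single general
agent. The services and the routing rule are hypothetical placeholders
for whatever specialised models an organisation actually deploys.</p>
<pre><code># A toy dispatcher in the spirit of "comprehensive AI services": each narrow
# service performs one task, so no single model carries out the whole process.
# The services and task types below are hypothetical placeholders.

def translation_service(text):
    return f"[translation of: {text}]"

def summarisation_service(text):
    return f"[summary of: {text}]"

def code_review_service(text):
    return f"[code review of: {text}]"

SERVICES = {
    "translate": translation_service,
    "summarise": summarisation_service,
    "review": code_review_service,
}

def dispatch(task_type, payload):
    """Route a request to the single narrow service responsible for it."""
    if task_type not in SERVICES:
        raise ValueError(f"no service registered for task type {task_type!r}")
    return SERVICES[task_type](payload)

print(dispatch("summarise", "a long incident report"))
</code></pre>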
<h2 id="principle-of-least-privilege">4.4.3 Principle of Least Privilege</h2>
<p><strong>Each agent should have only the minimum power needed to
complete their tasks <span class="citation"
data-cites="Hendrycks2022xrisk">[1]</span>.</strong> As discussed above,
separating duties should reduce individuals’ capacity to misuse the
system. However, separation of duties might only work if we also ensure
individuals do not have access to parts of the system that are not
relevant to their tasks. This is called the principle of least
privilege. In the example above, we ensured separation of duties by
putting chemicals in different cupboards with different people in charge
of them. To make this arrangement more effective at mitigating risk, we
would also want the cupboards to be locked, so that no one else can
access them at all.<p>
Similarly, for systems involving AIs, we should ensure that each agent
has access only to the information and power it needs to complete its
tasks reliably. Concretely, we might want to avoid plugging AIs into the
internet or giving them administrator-level access to confidential
information. In the Single Agent Control chapter, we considered how AIs
might be power-seeking; by granting AIs only the minimum power they need
to accomplish the goals we assign them, we can reduce their ability to
gain power.</p>
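<p>The Python sketch below illustrates one way this might look for a
tool-using AI agent: each agent is granted an explicit allowlist of
tools, and any call outside that allowlist is refused. The tool names
and agent roles are hypothetical placeholders.</p>
<pre><code># A minimal sketch of least privilege for a tool-using agent: the agent can
# only call tools it has been explicitly granted. The tools and agent roles
# below are hypothetical placeholders.

ALL_TOOLS = {
    "read_public_docs": lambda query: f"documents matching {query!r}",
    "send_email": lambda message: f"sent: {message}",
    "delete_records": lambda table: f"deleted table {table!r}",
}

class RestrictedToolbox:
    def __init__(self, granted):
        # Expose only the minimum set of tools this agent actually needs.
        self._tools = {name: ALL_TOOLS[name] for name in granted}

    def call(self, name, *args):
        if name not in self._tools:
            raise PermissionError(f"tool {name!r} has not been granted to this agent")
        return self._tools[name](*args)

# A research-assistant agent gets read-only access and nothing else.
toolbox = RestrictedToolbox(["read_public_docs"])
print(toolbox.call("read_public_docs", "safety engineering"))

try:
    toolbox.call("delete_records", "users")
except PermissionError as error:
    print(error)
</code></pre>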
<h2 id="fail-safes">4.4.4 Fail-Safes</h2>
<p><strong>Fail-safes are features that aim to ensure a system will be
safe even if it fails <span class="citation"
data-cites="Hendrycks2022xrisk">[1]</span>.</strong> When systems fail,
they stop performing their intended function, but some failures also
cause harm. Fail-safes aim to limit the amount of harm caused even if
something goes wrong. Elevator brakes are a classic example of a
fail-safe feature. They are attached to the outside of the cabin and are
held open only by the tension in the cables from which the cabin is
suspended. If the cables lose tension, the brakes automatically clamp
shut onto the rails in the elevator shaft. This means that, even if the
cables break, the brakes should prevent the cabin from falling; even if
the system fails in its function, it should at least be safe.<p>
A possible fail-safe for AI systems might be a component that tracks the
level of confidence an agent has in its own decisions. The system could
be designed to stop enacting decisions whenever this confidence falls
below a critical level. There could also be a component that monitors
the probability of the agent’s decisions causing harm, and the system
could be designed to stop acting on a decision if its estimated
likelihood of harm exceeds a specified threshold.</p>
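<p>The sketch below shows what such a fail-safe might look like in
Python, assuming the system can produce a confidence score and a harm
estimate for each decision. The thresholds, field names, and fallback
behaviour are hypothetical placeholders.</p>
<pre><code># A minimal sketch of a fail-safe: a decision is enacted only when the model's
# reported confidence is high enough and its estimated probability of causing
# harm is low enough; otherwise the system halts and defers to a human.
# The thresholds and decision format are hypothetical placeholders.

MIN_CONFIDENCE = 0.90
MAX_HARM_PROBABILITY = 0.01

def act_with_fail_safe(decision):
    if decision["confidence"] &lt; MIN_CONFIDENCE:
        return f"halted: confidence {decision['confidence']:.2f} is below threshold"
    if decision["harm_probability"] &gt; MAX_HARM_PROBABILITY:
        return "halted: estimated probability of harm is too high"
    return f"executing {decision['action']!r}"

print(act_with_fail_safe(
    {"action": "reroute the shipment", "confidence": 0.97, "harm_probability": 0.001}))
print(act_with_fail_safe(
    {"action": "override the alarm", "confidence": 0.55, "harm_probability": 0.001}))
</code></pre>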
<h2 id="antifragility">4.4.5 Antifragility</h2>
<p><strong>Antifragile systems become stronger from encountering
adversity <span class="citation"
data-cites="taleb2012antifragile">[2]</span>.</strong> The idea of an
antifragile system is that it will not only recover after a failure or a
near miss but actually become more robust from these “stressors” to
potential future failures. Antifragile systems are common in the natural
world and include the human body. For example, weight-bearing exercises
put a certain amount of stress on the body, but bone density and muscle
mass tend to increase in response, improving the body’s ability to lift
weight in the future.<p>
Similarly, after encountering or becoming infected with a pathogen and
fighting it off, a person’s immune system tends to become stronger,
reducing the likelihood of reinfection. Groups of people working
together can also be antifragile. If a team is working toward a given
goal and they experience a failure, they might examine the causes and
take steps to prevent it from happening again, leading to fewer failures
in the future.<p>
Designing AI systems to be antifragile would mean allowing them to
continue learning and adapting while they are being deployed. This could
give an AI the potential to learn when something in its environment has
caused it to make a bad decision. It could then avoid making the same
mistake if it finds itself in similar circumstances again.</p>
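<p>As a toy illustration, the sketch below shows an agent that keeps
acting during deployment, records which actions led to failures, and
downweights them so that the same mistake becomes less likely over time.
The environment and update rule are hypothetical placeholders for a
genuine learning system.</p>
<pre><code># A toy sketch of learning from stressors during deployment: actions that lead
# to failures are downweighted, so the same mistake becomes less likely in
# similar circumstances. The environment and update rule are hypothetical.
import random

weights = {"route_a": 1.0, "route_b": 1.0}  # initial preference over actions
FAILURE_PENALTY = 0.5                       # multiplicative downweight on failure

def choose_action():
    total = sum(weights.values())
    return random.choices(list(weights), [w / total for w in weights.values()])[0]

def environment(action):
    # Hypothetical environment in which route_b fails 80% of the time.
    return True if action == "route_a" else random.random() &gt; 0.8

for step in range(200):
    action = choose_action()
    if not environment(action):
        weights[action] *= FAILURE_PENALTY  # learn from the failure

print(weights)  # route_b's weight has typically collapsed toward zero
</code></pre>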
<p><strong>Antifragility can require adaptability.</strong> Creating
antifragile AIs often means creating adaptive ones: the ability to
change in response to new stressors is key to making AIs robust. If an
AI continues learning and adapting while being deployed, it could learn
to avoid hazards, but it could also develop unanticipated and
undesirable behaviors. Adaptive AIs might be harder to control. Such AIs
are likely to continuously evolve, creating new safety challenges as
they develop different behaviors and capabilities. This tendency of
adaptive systems to evolve in unexpected ways increases our exposure to
emergent hazards.<p>
A case in point is the chatbot Tay, which was released by Microsoft on
Twitter in 2016. Tay was designed to simulate human conversation and to
continue improving by learning from its interactions with humans on
Twitter. However, it quickly started tweeting offensive remarks,
including seemingly novel racist and sexist comments. This suggested
that Tay had learned and internalized biases from its interactions,
which it could then reproduce unprompted. As a result, the chatbot was
taken offline after only 16 hours. This illustrates how an adaptive,
antifragile AI can develop in unanticipated and undesirable ways when
deployed in natural settings. Human operators cannot control natural
environments, so system designers should think carefully about whether
to use adaptive AIs.</p>
<h2 id="negative-feedback-mechanisms">4.4.6 Negative Feedback Mechanisms</h2>
<p><strong>When one change triggers more changes, feedback loops can
emerge.</strong> To understand positive and negative feedback
mechanisms, consider the issue of climate change and melting ice. As
global temperatures increase, more of Earth’s ice melts. This means
ice-covered regions shrink and reflect less of the sun’s radiation back
into space. More radiation is therefore absorbed by the Earth’s surface,
further increasing global temperatures and causing even more ice to
melt. This is a positive feedback loop: a circular process that
amplifies the initial change, causing the system to keep escalating
unchecked. We discuss feedback loops in
greater detail in the Complex Systems chapter.</p>
<p><strong>Negative feedback mechanisms act to down-regulate and
stabilize systems.</strong> If we have mechanisms in place that
naturally down-regulate the initial change, we are likely to enter an
equilibrium rather than explosive change. Many negative feedback
mechanisms are found within the body; for example, if a person’s body
temperature increases, they will begin to sweat, cooling down as that
sweat evaporates. If, on the other hand, they get cold, they will begin
to shiver, generating heat instead. These negative feedback mechanisms
act against any changes and thus stabilize the temperature within the
required range. Incorporating negative feedback mechanisms in a system’s
design can improve controllability, by preventing changes from
escalating too much <span class="citation"
data-cites="Marsden2017">[3]</span>.</p>
<p><strong>We can use negative feedback loops to control AIs.</strong>
If we are concerned that AIs might get too powerful for us to control,
we can create negative feedback loops in the environment to ensure that
any increases in an AI’s power are met with changes that make it less
powerful. There would be two parts to this process. First, we would want
better monitoring tools to look for anomalies, such as AI watchdogs.
These would track when an AI is getting powerful (or displaying
hazardous behavior) in order to trigger some feedback mechanism—the
second part of the task. The feedback mechanism might be a drastic
measure like disconnecting an AI from the internet, resetting an AI to a
previous version, or using other AIs trained to disempower a powerful
AI. Such mechanisms would act as automatic checks and balances on AIs’
power.</p>
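<p>A minimal sketch of this two-part process appears below: a watchdog
checks a few capability and behaviour metrics against limits, and any
breach triggers a down-regulating response. The metric names, limits,
and responses are hypothetical placeholders.</p>
<pre><code># A sketch of an automatic check on an AI system's power: a watchdog compares
# simple capability and behaviour metrics against limits, and any breach
# triggers a down-regulating response. Metrics, limits, and responses are
# hypothetical placeholders.

LIMITS = {
    "compute_usage": 1000.0,    # arbitrary units
    "network_requests": 500,
    "anomalous_actions": 3,
}

RESPONSES = {
    "compute_usage": "throttle the compute allocation",
    "network_requests": "disconnect the system from the network",
    "anomalous_actions": "roll back to the previous model version",
}

def watchdog(metrics):
    """Return the down-regulating responses triggered by the current metrics."""
    return [RESPONSES[name] for name, limit in LIMITS.items()
            if metrics.get(name, 0) &gt; limit]

print(watchdog({"compute_usage": 950.0, "network_requests": 800, "anomalous_actions": 1}))
# ['disconnect the system from the network']
</code></pre>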
<h2 id="transparency">4.4.7 Transparency</h2>
<p><strong>Transparency means people know enough about a system to
interact with it safely <span class="citation"
data-cites="Hendrycks2022xrisk">[1]</span>.</strong> If operators do not
have sufficient knowledge of a system’s functions, then it is possible
they could inadvertently cause an accident while interacting with it. It
is important that a pilot knows how a plane’s autopilot system works,
how to activate it, and how to override it. That way, they will be able
to override it when they need to, and to avoid activating or overriding
it accidentally.
This safe design principle is called transparency.<p>
Research into AI transparency aims to design deep learning systems in
ways that give operators a greater understanding of their internal
decision-making processes. This would help operators maintain control,
anticipate situations in which systems might make poor or deceptive
decisions, and steer them away from hazards.</p>
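<p>For very simple model classes, transparency can be quite literal. The
sketch below uses a linear scoring model whose per-feature contributions
can be shown directly to an operator; recovering comparably faithful
explanations from deep networks is the much harder problem that
transparency research tackles. The feature names, weights, and inputs
are hypothetical placeholders.</p>
<pre><code># A minimal sketch of a transparent model: a linear score whose per-feature
# contributions can be displayed to the operator, showing why the decision
# came out as it did. Feature names, weights, and inputs are hypothetical.

WEIGHTS = {"payload_mass": -0.8, "wind_speed": -1.5, "battery_level": 2.0}
BIAS = 0.5

def score_and_explain(features):
    contributions = {name: WEIGHTS[name] * value for name, value in features.items()}
    return BIAS + sum(contributions.values()), contributions

score, contributions = score_and_explain(
    {"payload_mass": 1.2, "wind_speed": 0.4, "battery_level": 0.9})
print(f"decision score: {score:.2f}")
for name, contribution in sorted(contributions.items(), key=lambda item: item[1]):
    print(f"  {name}: {contribution:+.2f}")
</code></pre>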
<h2 id="defense-in-depth">4.4.8 Defense in Depth</h2>
<p><strong>Including many layers of defense is usually more effective
than relying on just one <span class="citation"
data-cites="Hendrycks2022xrisk">[1]</span>.</strong> A final safe design
principle is defense in depth, which means including multiple layers of
defense. That way, if one or more defenses fail, there should still be
others in place that can mitigate damage. In general, the more defenses
a system has in place, the less likely it is that all the defenses will
fail simultaneously. The core idea of defense in depth is that it is
unlikely that any one layer of defense is foolproof; we are usually
engaging in risk reduction, not risk elimination. For example, an
individual is less likely to be infected by a virus if they take
multiple measures, such as wearing a mask, social distancing, and
washing their hands, than if they rely on just one of these, since no
single measure is guaranteed to work. We will explore this in greater
depth in the context of the Swiss cheese model in the next section.<p>
One caveat to note here is that increasing layers of defense can make a
system more complex. If the different defenses interact with one
another, there is a chance that this might produce unintended side
effects. In the case of a virus, for example, reduced social contact and
an individual’s immune system can be thought of as two separate layers
of defense. However, if an individual has very little social contact,
and therefore little exposure to pathogens, their immune system could
become weaker as a result. While multiple layers of defense can make a
system more robust, system designers should be aware that layers might
interact in complex ways. We will discuss this further later in the
chapter.</p>
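<p>A small calculation illustrates both points: if layers fail
independently, the probability that all of them fail is the product of
their individual failure probabilities, but dependence between layers
(the kind of interaction described above) erodes much of that benefit.
The numbers are illustrative only.</p>
<pre><code># If defensive layers fail independently, the chance that every layer fails is
# the product of their individual failure probabilities; if one common cause
# can defeat them all, the system is only as strong as its best layer.
# The probabilities below are illustrative only.
import math

layer_failure_probabilities = [0.1, 0.2, 0.05]   # three imperfect layers

independent = math.prod(layer_failure_probabilities)
fully_correlated = min(layer_failure_probabilities)

print(round(independent, 4))      # 0.001, far lower than any single layer
print(fully_correlated)           # 0.05, no better than the best layer
</code></pre>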
<p><strong>Layers of safety features can generally be preventative or
protective.</strong> There are two ways in which safety measures can
reduce risk: preventative measures reduce the probability of an accident
occurring, while protective measures reduce the harm if an accident does
occur. For example, avoiding large gatherings and washing hands are
actions that aim to prevent an individual from becoming infected with a
virus, while maintaining a healthy lifestyle and getting vaccinated can
help reduce the severity of an infection if it does occur.<p>
We can think about this in terms of the risk equations. Preventative
measures reduce the probability of an accident occurring, either by
reducing the inherent probability of the hazard or by reducing exposure
to it. Protective measures, meanwhile, decrease the severity an accident
would have, either by reducing the inherent severity of the hazard or by
making the system less vulnerable to it.</p>
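<p>A toy calculation makes the distinction concrete, assuming the simple
decomposition of risk into the probability of an accident and its
severity. Preventative measures shrink the probability term, protective
measures shrink the severity term, and combining them compounds the
reduction. The numbers are illustrative only.</p>
<pre><code># A toy illustration of preventative versus protective measures, assuming
# risk = probability of an accident x severity if it occurs.
# All numbers are illustrative only.

def risk(probability, severity):
    return probability * severity

baseline        = risk(probability=0.10, severity=100.0)
with_prevention = risk(probability=0.02, severity=100.0)  # e.g. masks, distancing
with_protection = risk(probability=0.10, severity=30.0)   # e.g. vaccination
with_both       = risk(probability=0.02, severity=30.0)   # defense in depth

print(f"{baseline:.1f} {with_prevention:.1f} {with_protection:.1f} {with_both:.1f}")
# 10.0 2.0 3.0 0.6
</code></pre>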
<p><strong>In general, prevention is more efficient than cure, but both
should be included in system design.</strong> An oft-quoted aphorism is
that “an ounce of prevention is worth a pound of cure,” highlighting
that it is much better, and often less costly, to prevent an accident
from happening than to try to fix it afterward. It might therefore be
wise for system designers to place more emphasis on preventative
features than on protective ones.<p>
Nevertheless, protective features should not be neglected. This is
illustrated by the sinking of the Titanic, whose preventative design
features included the hull being divided into watertight compartments.
There was much faith that these features rendered it unsinkable.
However, the ship did not carry enough lifeboats to hold all its
passengers. This was, in part, because lifeboats were largely intended
to transport passengers to another ship in the event of sinking, rather
than to hold all of them at once. Still, when the ship sank, there was
not enough space in the lifeboats for everyone on board. This can be
considered an example of inadequate protective measures. We will explore
this distinction more in
the next section, where we discuss the bow tie model.</p>
<h2 id="review-of-safe-design-principles">4.4.9 Review of Safe Design
Principles</h2>
<p>There are multiple features we can build into a system from the
design stage to make it safer. We have discussed redundancy, separation
of duties, the principle of least privilege, fail-safes, antifragility,
negative feedback mechanisms, transparency, and defense in depth as eight
examples of such principles. Each one gives us concrete recommendations
about how to design (or how not to design) AI systems to ensure that
they are safer for humans to use.</p>
<br>
<br>
<h3>References</h3>
<div id="refs" class="references csl-bib-body" data-entry-spacing="0"
role="list">
<div id="ref-Hendrycks2022xrisk" class="csl-entry" role="listitem">
<div class="csl-left-margin">[1] D.
Hendrycks and M. Mazeika, <span>“X-risk analysis for AI
research.”</span> 2022. Available: <a
href="https://arxiv.org/abs/2206.05862">https://arxiv.org/abs/2206.05862</a></div>
</div>
<div id="ref-taleb2012antifragile" class="csl-entry" role="listitem">
<div class="csl-left-margin">[2] N.
N. Taleb, <em>Antifragile: Things that gain from disorder</em>. in
Incerto. Random House Publishing Group, 2012. Available: <a
href="https://books.google.com.au/books?id=5fqbz_qGi0AC">https://books.google.com.au/books?id=5fqbz_qGi0AC</a></div>
</div>
<div id="ref-Marsden2017" class="csl-entry" role="listitem">
<div class="csl-left-margin">[3] E.
Marsden, <span>“Designing for safety: Inherent safety, designed
in.”</span> Accessed: Jul. 31, 2017. [Online]. Available: <a
href="https://risk-engineering.org/safe-design/">https://risk-engineering.org/safe-design/</a></div>
</div>
</div>