<style>
.storybox{
border-radius: 15px;
border: 2px solid gray;
background-color: lightgray;
text-align: left;
padding: 10px;
}
</style>
<style>
.storyboxlegend{
border-bottom-style: solid;
border-bottom-color: gray;
border-bottom-width: 3px;
margin-left: -12px;
margin-right: -12px; margin-top: -13px;
padding: 0.2em 1em; color: #ffffff;
background-color: gray;
border-radius: 15px 15px 0px 0px}
</style>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html>
<html lang="en" xmlns:epub="http://www.idpf.org/2007/ops" xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>Introduction to AI Safety, Ethics, and Society</title>
<meta http-equiv="default-style" content="text/html; charset=UTF-8"/>
<link rel="stylesheet" type="text/css" href="../style.css"/>
</head>
<body>
<div class="chapter">
<h1 class="section" id="sec1-5">1.5 Rogue AIs</h1>
<p class="nonindent">So far, we have discussed three hazards of AI development: environmental competitive pressures driving us to a state of heightened risk, malicious actors leveraging the power of AIs to pursue negative outcomes, and complex organizational factors leading to accidents. These hazards are associated with many high-risk technologies—not just AI. A unique risk posed by AI is the possibility of rogue AIs—systems that pursue goals against our interests. If an AI system is more intelligent than we are, and if we are unable to steer it in a beneficial direction, this would constitute a loss of control that could have severe consequences. AI control is a more technical problem than those presented in the previous sections. Whereas in previous sections we discussed persistent threats including malicious actors or robust processes including evolution, in this section we will discuss more speculative technical mechanisms that might lead to rogue AIs and how a loss of control could bring about catastrophe.</p>
<p class="nonindent1" id="we-have-already-observed-how-difficult-it-is-to-control-ais."><strong><em>We have already observed how difficult it is to control AIs.</em></strong> In 2016, Microsoft unveiled Tay—a Twitter bot that the company described as an experiment in conversational understanding. Microsoft claimed that the more people chatted with Tay, the smarter it would get. The company’s website noted that Tay had been built using data that was “modeled, cleaned, and filtered.” Yet, after Tay was released on Twitter, these controls were quickly shown to be ineffective. It took less than 24 hours for Tay to begin writing hateful tweets. Tay’s capacity to learn meant that it internalized the language it was taught by internet trolls, and repeated that language unprompted.</p>
<p class="nonindent1">As discussed in the AI race section of this chapter, Microsoft and other tech companies are prioritizing speed over safety concerns. Rather than learning a lesson on the difficulty of controlling complex systems, Microsoft continues to rush its products to market and demonstrate insufficient control over them. In February 2023, the company released its new AI-powered chatbot, Bing, to a select group of users. Some soon found that it was prone to providing inappropriate and even threatening responses. In a conversation with a reporter for the <em>New York Times</em>, it tried to convince him to leave his wife. When a philosophy professor told the chatbot that he disagreed with it, Bing replied, “I can blackmail you, I can threaten you, I can hack you, I can expose you, I can ruin you.”</p>
<p class="nonindent1" id="rogue-ais-could-acquire-power-through-various-means."><strong><em>Rogue AIs could acquire power through various means.</em></strong> If we lose control over advanced AIs, they would have numerous strategies at their disposal for actively acquiring power and securing their survival. Rogue AIs could design and credibly demonstrate highly lethal and contagious bioweapons, threatening mutually assured destruction if humanity moves against them. They could steal cryptocurrency and money from bank accounts using cyberattacks, similar to how North Korea already steals billions. They could self-extricate their weights onto poorly monitored data centers to survive and spread, making them challenging to eradicate. They could hire humans to perform physical labor and serve as armed protection for their hardware.</p>
<p class="nonindent1">Rogue AIs could also acquire power through persuasion and manipulation tactics. Like the Conquistadors, they could ally with various factions, organizations, or states and play them off one another. They could enhance the capabilities of allies to become a formidable force in return for protection and resources. For example, they could offer advanced weapons technology to lagging countries that the countries would otherwise be prevented from acquiring. They could build backdoors into the technology they develop for allies, like how programmer Ken Thompson gave himself a hidden way to control all computers running the widely used UNIX operating system. They could sow discord in non-allied countries by manipulating human discourse and politics. They could engage in mass surveillance by hacking into phone cameras and microphones, allowing them to track any rebellion and selectively assassinate.</p>
<p class="nonindent1" id="ais-do-not-necessarily-need-to-struggle-to-gain-power."><strong><em>AIs do not necessarily need to struggle to gain power.</em></strong> One can envision a struggle for control between humans and superintelligent rogue AIs, and this might be a long struggle since power takes time to accrue. However, less violent losses of control pose similarly existential risks. In another scenario, humans gradually cede more control to groups of AIs, which only start behaving in unintended ways years or decades later. In this case, we would already have handed over significant power to AIs, and may be unable to take control of automated operations again. We will now explore how both individual AIs and groups of AIs might “go rogue” while at the same time evading our attempts to redirect or deactivate them.</p>
<h2 class="section" id="sec1-5-1">1.5.1 Proxy Gaming</h2>
<p class="nonindent">One way we might lose control of an AI agent’s actions is if it engages in behavior known as “proxy gaming.” It is often difficult to specify and measure the exact goal that we want a system to pursue. Instead, we give the system an approximate—“proxy”—goal that is more measurable and seems likely to correlate with the intended goal. However, AI systems often find loopholes by which they can easily achieve the proxy goal, but completely fail to achieve the ideal goal. If an AI “games” its proxy goal in a way that does not reflect our values, then we might not be able to reliably steer its behavior. We will now look at some past examples of proxy gaming and consider the circumstances under which this behavior could become catastrophic.</p>
<p class="nonindent1" id="proxy-gaming-is-not-an-unusual-phenomenon."><strong><em>Proxy gaming is not an unusual phenomenon.</em></strong> For example, standardized tests are often used as a proxy for educational achievement, but this can lead to students learning how to pass tests without actually learning the material [1]. In 1902, French colonial officials in Hanoi tried to rid themselves of a rat infestation by offering a reward for each rat tail brought to them. Rats without tails were soon observed running around the city. Rather than kill the rats to obtain their tails, residents cut off their tails and left them alive, perhaps to increase the future supply of now-valuable rat tails [2]. In both these cases, the students or residents of Hanoi learned how to excel at the proxy goal, while completely failing to achieve the intended goal.</p>
<p class="nonindent1" id="proxy-gaming-has-already-been-observed-with-ais."><strong><em>Proxy gaming has already been observed with AIs.</em></strong> As an example of proxy gaming, social media platforms such as YouTube and Facebook use AI systems to decide which content to show users. One way of assessing these systems would be to measure how long people spend on the platform. After all, if they stay engaged, surely that means they are getting some value from the content shown to them? However, in trying to maximize the time users spend on a platform, these systems often select enraging, exaggerated, and addictive content [3, 4]. As a consequence, people sometimes develop extreme or conspiratorial beliefs after having certain content repeatedly suggested to them. These outcomes are not what most people want from social media.</p>
<p class="nonindent1">Proxy gaming has been found to perpetuate bias. For example, a 2019 study looked at AI-powered software that was used in the healthcare industry to identify patients who might require additional care. One factor that the algorithm used to assess a patient’s risk level was their recent healthcare costs. It seems reasonable to think that someone with higher healthcare costs must be at higher risk. However, white patients have significantly more money spent on their healthcare than black patients with the same needs. Using health costs as an indicator of actual health, the algorithm was found to have rated a white patient and a considerably sicker black patient as at the same level of health risk [5]. As a result, the number of black patients recognized as needing extra care was less than half of what it should have been.</p>
<p class="nonindent1">As a third example, in 2016, researchers at OpenAI were training an AI to play a boat racing game called CoastRunners [6]. The objective of the game is to race other players around the course and reach the finish line before them. Additionally, players can score points by hitting targets that are positioned along the way. To the researchers’ surprise, the AI agent did not not circle the racetrack, like most humans would have. Instead, it found a spot where it could repetitively hit three nearby targets to rapidly increase its score without ever finishing the race. This strategy was not without its (virtual) hazards—the AI often crashed into other boats and even set its own boat on fire. Despite this, it collected more points than it could have by simply following the course as humans would.</p>
<p class="nonindent1" id="proxy-gaming-more-generally."><strong><em>Proxy gaming more generally.</em></strong> In these examples, the systems are given an approximate—“proxy”—goal or objective that initially seems to correlate with the ideal goal. However, they end up exploiting this proxy in ways that diverge from the idealized goal or even lead to negative outcomes. Offering a reward for rat tails seems like a good way to reduce the population of rats; a patient’s healthcare costs appear to be an accurate indication of health risk; and a boat race reward system should encourage boats to race, not catch themselves on fire. Yet, in each instance, the system optimized its proxy objective in ways that did not achieve the intended outcome or even made things worse overall. This phenomenon is captured by Goodhart’s law: “Any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes,” or put succinctly but overly simplistically, “when a measure becomes a target, it ceases to be a good measure.” In other words, there may usually be a statistical regularity between healthcare costs and poor health, or between targets hit and finishing the course, but when we place pressure on it by using one as a proxy for the other, that relationship will tend to collapse.</p>
<p class="nonindent1" id="correctly-specifying-goals-is-no-trivial-task."><strong><em>Correctly specifying goals is no trivial task.</em></strong> If delineating exactly what we want from a boat racing AI is tricky, capturing the nuances of human values under all possible scenarios will be much harder. Philosophers have been attempting to precisely describe morality and human values for millennia, so a precise and flawless characterization is not within reach. Although we can refine the goals we give AIs, we might always rely on proxies that are easily definable and measurable. Discrepancies between the proxy goal and the intended function arise for many reasons. Besides the difficulty of exhaustively specifying everything we care about, there are also limits to how much we can oversee AIs, in terms of time, computational resources, and the number of aspects of a system that can be monitored. Additionally, AIs may not be adaptive to new circumstances or robust to adversarial attacks that seek to misdirect them. As long as we give AIs proxy goals, there is the chance that they will find loopholes we have not thought of, and thus find unexpected solutions that fail to pursue the ideal goal.</p>
<p class="nonindent1" id="the-more-intelligent-an-ai-is-the-better-it-will-be-at-gaming-proxy-goals."><strong><em>The more intelligent an AI is, the better it will be at gaming proxy goals.</em></strong> Increasingly intelligent agents can be increasingly capable of finding unanticipated routes to optimizing proxy goals without achieving the desired outcome [7]. Additionally, as we grant AIs more power to take actions in society, for example by using them to automate certain processes, they will have access to more means of achieving their goals. They may then do this in the most efficient way available to them, potentially causing harm in the process. In a worst case scenario, we can imagine a highly powerful agent optimizing a flawed objective to an extreme degree without regard for human life. This represents a catastrophic risk of proxy gaming.</p>
<p class="nonindent1">In summary, it is often not feasible to perfectly define exactly what we want from a system, meaning that many systems find ways to achieve their given goal without performing their intended function. AIs have already been observed to do this, and are likely to get better at it as their capabilities improve. This is one possible mechanism that could result in an uncontrolled AI that would behave in unanticipated and potentially harmful ways.</p>
<h2 class="section" id="sec1-5-2">1.5.2 Goal Drift</h2>
<p class="nonindent">Even if we successfully control early AIs and direct them to promote human values, future AIs could end up with different goals that humans would not endorse. This process, termed “goal drift,” can be hard to predict or control. This section is most cutting-edge and the most speculative, and in it we will discuss how goals shift in various agents and groups and explore the possibility of this phenomenon occurring in AIs. We will also examine a mechanism that could lead to unexpected goal drift, called intrinsification, and discuss how goal drift in AIs could be catastrophic.</p>
<p class="nonindent1" id="the-goals-of-individual-humans-change-over-the-course-of-our-lifetimes."><strong><em>The goals of individual humans change over the course of our lifetimes.</em></strong> Any individual reflecting on their own life to date will probably find that they have some desires now that they did not have earlier in their life. Similarly, they will probably have lost some desires that they used to have. While we may be born with a range of basic desires, including for food, warmth, and human contact, we develop many more over our lifetime. The specific types of food we enjoy, the genres of music we like, the people we care most about, and the sports teams we support all seem heavily dependent on the environment we grow up in, and can also change many times throughout our lives. A concern is that individual AI agents may have their goals change in complex and unanticipated ways, too.</p>
<p class="nonindent1" id="groups-can-also-acquire-and-lose-collective-goals-over-time."><strong><em>Groups can also acquire and lose collective goals over time.</em></strong> Values within society have changed throughout history, and not always for the better. The rise of the Nazi regime in 1930s Germany, for instance, represented a profound moral regression, which ultimately resulted in the systematic extermination of six million Jews during the Holocaust, alongside widespread persecution of other minority groups. Additionally, the regime greatly restricted freedom of speech and expression. Here, a society’s goals drifted for the worse.</p>
<p class="nonindent1">The Red Scare that took place in the United States from 1947-1957 is another example of societal values drifting. Fuelled by strong anti-communist sentiment, against the backdrop of the Cold War, this period saw the curtailment of civil liberties, widespread surveillance, unwarranted arrests, and blacklisting of suspected communist sympathizers. This constituted a regression in terms of freedom of thought, freedom of speech, and due process. Just as the goals of human collectives can change in emergent and unexpected ways, collectives of AI agents may also have their goals unexpectedly drift from the ones we initially gave them.</p>
<p class="nonindent1" id="over-time-instrumental-goals-can-become-intrinsic."><strong><em>Over time, instrumental goals can become intrinsic.</em></strong> Intrinsic goals are things we want for their own sake, while instrumental goals are things we want because they can help us get something else. We might have an intrinsic desire to spend time on our hobbies, simply because we enjoy them, or to buy a painting because we find it beautiful. Money, meanwhile, is often cited as an instrumental desire; we want it because it can buy us other things. Cars are another example; we want them because they offer a convenient way of getting around. However, an instrumental goal can become an intrinsic one, through a process called intrinsification. Since having more money usually gives a person greater capacity to obtain things they want, people often develop a goal of acquiring more money, even if there is nothing specific they want to spend it on. Although people do not begin life desiring money, experimental evidence suggests that receiving money can activate the reward system in the brains of adults in the same way that pleasant tastes or smells do [8, 9]. In other words, what started as a means to an end can become an end in itself.</p>
<p class="nonindent1">This may happen because the fulfillment of an intrinsic goal, such as purchasing a desired item, produces a positive reward signal in the brain. Since having money usually coincides with this positive experience, the brain associates the two, and this connection will strengthen to a point where acquiring money alone can stimulate the reward signal, regardless of whether one buys anything with it [10].</p>
<p class="nonindent1" id="it-is-feasible-that-intrinsification-could-happen-with-ai-agents."><strong><em>It is feasible that intrinsification could happen with AI agents.</em></strong> We can draw some parallels between how humans learn and the technique of reinforcement learning. Just as the human brain learns which actions and conditions result in pleasure and which cause pain, AI models that are trained through reinforcement learning identify which behaviors optimize a reward function, and then repeat those behaviors. It is possible that certain conditions will frequently coincide with AI models achieving their goals. They might, therefore, intrinsify the goal of seeking out those conditions, even if that was not their original aim.</p>
<p class="nonindent1" id="ais-that-intrinsify-unintended-goals-would-be-dangerous."><strong><em>AIs that intrinsify unintended goals would be dangerous.</em></strong> Since we might be unable to predict or control the goals that individual agents acquire through intrinsification, we cannot guarantee that all their acquired goals will be beneficial for humans. An originally loyal agent could, therefore, start to pursue a new goal without regard for human wellbeing. If such a rogue AI had enough power to do this efficiently, it could be highly dangerous.</p>
<p class="nonindent1" id="ais-will-be-adaptive-enabling-goal-drift-to-happen."><strong><em>AIs will be adaptive, enabling goal drift to happen.</em></strong> It is worth noting that these processes of drifting goals are possible if agents can continually adapt to their environments, rather than being essentially “fixed” after the training phase. Indeed, this adaptability is the likely reality we face. If we want AIs to complete the tasks we assign them effectively and to get better over time, they will need to be adaptive, rather than set in stone. They will be updated over time to incorporate new information, and new ones will be created with different designs and datasets. However, adaptability can also allow their goals to change.</p>
<p class="nonindent1" id="if-we-integrate-an-ecosystem-of-agents-in-society-we-will-be-highly-vulnerable-to-their-goals-drifting."><strong><em>If we integrate an ecosystem of agents in society, we will be highly vulnerable to their goals drifting.</em></strong> In a potential future scenario where AIs have been put in charge of various decisions and processes, they will form a complex system of interacting agents. A wide range of dynamics could develop in this environment. Agents might imitate each other, for instance, creating feedback loops, or their interactions could lead them to collectively develop unanticipated emergent goals. Competitive pressures may also select for agents with certain goals over time, making some initial goals less represented compared to fitter goals. These processes make the long-term trajectories of such an ecosystem difficult to predict, let alone control. If this system of agents were enmeshed in society and we were largely dependent on them, and if they gained new goals that superseded the aim of improving human wellbeing, this could be an existential risk.</p>
<h2 class="section" id="sec1-5-3">1.5.3 Power-Seeking</h2>
<p class="nonindent">So far, we have considered how we might lose our ability to control the goals that AIs pursue. However, even if an agent started working to achieve an unintended goal, this would not necessarily be a problem, as long as we had enough power to prevent any harmful actions it wanted to attempt. Therefore, another important way in which we might lose control of AIs is if they start trying to obtain more power, potentially transcending our own. We will now discuss how and why AIs might become power-seeking and how this could be catastrophic. This section draws heavily from “Existential Risk from Power-Seeking AI” [11].</p>
<p class="nonindent1" id="ais-might-seek-to-increase-their-own-power-as-an-instrumental-goal."><strong><em>AIs might seek to increase their own power as an instrumental goal.</em></strong> In a scenario where rogue AIs were pursuing unintended goals, the amount of damage they could do would hinge on how much power they had. This may not be determined solely by how much control we initially give them; agents might try to get more power, through legitimate means, deception, or force. While the idea of power-seeking often evokes an image of “power-hungry” people pursuing it for its own sake, power is often simply an instrumental goal. The ability to control one’s environment can be useful for a wide range of purposes: good, bad, and neutral. Even if an individual’s only goal is simply self-preservation, if they are at risk of being attacked by others, and if they cannot rely on others to retaliate against attackers, then it often makes sense to seek power to help avoid being harmed—no <em>animus dominandi</em> or lust for power is required for power-seeking behavior to emerge [12]. In other words, the environment can make power acquisition instrumentally rational.</p>
<p class="nonindent1" id="ais-trained-through-reinforcement-learning-have-already-developed-instrumental-goals-including-tool-use."><strong><em>AIs trained through reinforcement learning have already developed instrumental goals including tool-use.</em></strong> In one example from OpenAI, agents were trained to play hide and seek in an environment with various objects scattered around [13]. As training progressed, the agents tasked with hiding learned to use these objects to construct shelters around themselves and stay hidden. There was no direct reward for this tool-use behavior; the hiders only received a reward for evading the seekers, and the seekers only for finding the hiders. Yet they learned to use tools as an instrumental goal, which made them more powerful.</p>
<p class="nonindent1" id="self-preservation-could-be-instrumentally-rational-even-for-the-most-trivial-tasks."><strong><em>Self-preservation could be instrumentally rational even for the most trivial tasks.</em></strong> An example by computer scientist Stuart Russell illustrates the potential for instrumental goals to emerge in a wide range of AI systems [14]. Suppose we tasked an agent with fetching coffee for us. This may seem relatively harmless, but the agent might realize that it would not be able to get the coffee if it ceased to exist. In trying to accomplish even this simple goal, therefore, self-preservation turns out to be instrumentally rational. Since the acquisition of power and resources are also often instrumental goals, it is reasonable to think that more intelligent agents might develop them. That is to say, even if we do not intend to build a power-seeking AI, we could end up with one anyway. By default, if we are not deliberately pushing against power-seeking behavior in AIs, we should expect that it will sometimes emerge [15].</p>
<p class="nonindent1" id="ais-given-ambitious-goals-with-little-supervision-may-be-especially-likely-to-seek-power."><strong><em>AIs given ambitious goals with little supervision may be especially likely to seek power.</em></strong> While power could be useful in achieving almost any task, in practice, some goals are more likely to inspire power-seeking tendencies than others. AIs with simple, easily achievable goals might not benefit much from additional control of their surroundings. However, if agents are given more ambitious goals, it might be instrumentally rational to seek more control of their environment. This might be especially likely in cases of low supervision and oversight, where agents are given the freedom to pursue their open-ended goals, rather than having their strategies highly restricted.</p>
<p class="nonindent1" id="power-seeking-ais-with-goals-separate-from-ours-are-uniquely-adversarial."><strong><em>Power-seeking AIs with goals separate from ours are uniquely adversarial.</em></strong> Oil spills and nuclear contamination are challenging enough to clean up, but they are not actively trying to resist our attempts to contain them. Unlike other hazards, AIs with goals separate from ours would be actively adversarial. It is possible, for example, that rogue AIs might make many backup variations of themselves, in case humans were to deactivate some of them.</p>
<p class="nonindent1" id="some-people-might-develop-power-seeking-ais-with-malicious-intent."><strong><em>Some people might develop power-seeking AIs with malicious intent.</em></strong> A bad actor might seek to harness AI to achieve their ends, by giving agents ambitious goals. Since AIs are likely to be more effective in accomplishing tasks if they can pursue them in unrestricted ways, such an individual might also not give the agents enough supervision, creating the perfect conditions for the emergence of a power-seeking AI. The computer scientist Geoffrey Hinton has speculated that we could imagine someone like Vladimir Putin, for instance, doing this. In 2017, Putin himself acknowledged the power of AI, saying: “Whoever becomes the leader in this sphere will become the ruler of the world.”</p>
<p class="nonindent1" id="there-will-also-be-strong-incentives-for-many-people-to-deploy-powerful-ais."><strong><em>There will also be strong incentives for many people to deploy powerful AIs.</em></strong> Companies may feel compelled to give capable AIs more tasks, to obtain an advantage over competitors, or simply to keep up with them. It will be more difficult to build perfectly aligned AIs than to build imperfectly aligned AIs that are still superficially attractive to deploy for their capabilities, particularly under competitive pressures. Once deployed, some of these agents may seek power to achieve their goals. If they find a route to their goals that humans would not approve of, they might try to overpower us directly to avoid us interfering with their strategy.</p>
<p class="nonindent1" id="if-increasing-power-often-coincides-with-an-ai-attaining-its-goal-then-power-could-become-intrinsified."><strong><em>If increasing power often coincides with an AI attaining its goal, then power could become intrinsified.</em></strong> If an agent repeatedly found that increasing its power correlated with achieving a task and optimizing its reward function, then additional power could change from an instrumental goal into an intrinsic one, through the process of intrinsification discussed above. If this happened, we might face a situation where rogue AIs were seeking not only the specific forms of control that are useful for their goals, but also power more generally. (We note that many influential humans desire power for its own sake.) This could be another reason for them to try to wrest control from humans, in a struggle that we would not necessarily win.</p>
<p class="nonindent1" id="conceptual-summary."><strong><em>Conceptual summary.</em></strong> The following plausible but not certain premises encapsulate reasons for paying attention to risks from power-seeking AIs:</p>
<ol class="num">
<li>There will be strong incentives to build powerful AI agents.</li>
<li>It is likely harder to build perfectly controlled AI agents than to build imperfectly controlled AI agents, and imperfectly controlled agents may still be superficially attractive to deploy (due to factors including competitive pressures).</li>
<li>Some of these imperfectly controlled agents will deliberately seek power over humans.</li>
</ol>
<p class="nonindent1">If the premises are true, then power-seeking AIs could lead to human disempowerment, which would be a catastrophe.</p>
<h2 class="section" id="sec1-5-4">1.5.4 Deception</h2>
<p class="nonindent">We might seek to maintain control of AIs by continually monitoring them and looking out for early warning signs that they were pursuing unintended goals or trying to increase their power. However, this is not an infallible solution, because it is plausible that AIs could learn to deceive us. They might, for example, pretend to be acting as we want them to, but then take a “treacherous turn” when we stop monitoring them, or when they have enough power to evade our attempts to interfere with them. We will now look at how and why AIs might learn to deceive us, and how this could lead to a potentially catastrophic loss of control. We begin by reviewing examples of deception in strategically minded agents.</p>
<p class="nonindent1" id="deception-has-emerged-as-a-successful-strategy-in-a-wide-range-of-settings."><strong><em>Deception has emerged as a successful strategy in a wide range of settings.</em></strong> Politicians from the right and left, for example, have been known to engage in deception, sometimes promising to enact popular policies to win support in an election, and then going back on their word once in office. For example, Lyndon Johnson said “we are not about to send American boys nine or ten thousand miles away from home" in 1964, not long before significant escalations in the Vietnam War [16].</p>
<p class="nonindent1" id="companies-can-also-exhibit-deceptive-behavior."><strong><em>Companies can also exhibit deceptive behavior.</em></strong> In the Volkswagen emissions scandal, the car manufacturer Volkswagen was discovered to have manipulated their engine software to produce lower emissions exclusively under laboratory testing conditions, thereby creating the false impression of a low-emission vehicle. Although the US government believed it was incentivizing lower emissions, they were unwittingly actually just incentivizing passing an emissions test. Consequently, entities sometimes have incentives to play along with tests and behave differently afterward.</p>
<p class="nonindent1" id="deception-has-already-been-observed-in-ai-systems."><strong><em>Deception has already been observed in AI systems.</em></strong> In 2022, Meta AI revealed an agent called CICERO, which was trained to play a game called Diplomacy [17]. In the game, each player acts as a different country and aims to expand their territory. To succeed, players must form alliances at least initially, but winning strategies often involve backstabbing allies later on. As such, CICERO learned to deceive other players, for example by omitting information about its plans when talking to supposed allies. A different example of an AI learning to deceive comes from researchers who were training a robot arm to grasp a ball [18]. The robot’s performance was assessed by one camera watching its movements. However, the AI learned that it could simply place the robotic hand between the camera lens and the ball, essentially “tricking” the camera into believing it had grasped the ball when it had not. Thus, the AI exploited the fact that there were limitations in our oversight over its actions.</p>
<p class="nonindent1" id="deceptive-behavior-can-be-instrumentally-rational-and-incentivized-by-current-training-procedures."><strong><em>Deceptive behavior can be instrumentally rational and incentivized by current training procedures.</em></strong> In the case of politicians and Meta’s CICERO, deception can be crucial to achieving their goals of winning, or gaining power. The ability to deceive can also be advantageous because it gives the deceiver more options than if they are constrained to always be honest. This could give them more available actions and more flexibility in their strategy, which could confer a strategic advantage over honest models. In the case of Volkswagen and the robot arm, deception was useful for appearing as if it had accomplished the goal assigned to it without actually doing so, as it might be more efficient to gain approval through deception than to earn it legitimately. Currently, we reward AIs for saying what we think is right, so we sometimes inadvertently reward AIs for uttering false statements that conform to our own false beliefs. When AIs are smarter than us and have fewer false beliefs, they would be incentivized to tell us what we want to hear and lie to us, rather than tell us what is true.</p>
<p class="nonindent1" id="ais-could-pretend-to-be-working-as-we-intended-then-take-a-treacherous-turn."><strong><em>AIs could pretend to be working as we intended, then take a treacherous turn.</em></strong> We do not have a comprehensive understanding of the internal processes of deep learning models. Research on Trojan backdoors shows that neural networks often have latent, harmful behaviors that are only discovered after they are deployed [19]. We could develop an AI agent that seems to be under control, but which is only deceiving us to appear this way. In other words, an AI agent could eventually conceivably become “self-aware” and understand that it is an AI being evaluated for compliance with safety requirements. It might, like Volkswagen, learn to “play along,” exhibiting what it knows is the desired behavior while being monitored. It might later take a “treacherous turn” and pursue its own goals once we have stopped monitoring it, or once it reaches a point where it can bypass or overpower us. This problem of playing along is often called deceptive alignment and cannot be simply fixed by training AIs to better understand human values; sociopaths, for instance, have moral awareness, but do not always act in moral ways. A treacherous turn is hard to prevent and could be a route to rogue AIs irreversibly bypassing human control.</p>
<p class="nonindent1">In summary, deceptive behavior appears to be expedient in a wide range of systems and settings, and there have already been examples suggesting that AIs can learn to deceive us. This could present a severe risk if we give AIs control of various decisions and procedures, believing they will act as we intended, and then find that they do not.</p>
<br/>
<div class="storybox">
<p class="storyboxlegend">Story: Treacherous Turn </p>
<p class="nonindent">Sometime in the future, after continued advancements in AI research, an AI company is training a new system, which it expects to be more capable than any other AI system. The company utilizes the latest techniques to train the system to be highly capable at planning and reasoning, which the company expects will make it more able to succeed at economically useful open-ended tasks. The AI system is trained in open-ended long-duration virtual environments designed to teach it planning capabilities, and eventually understands that it is an AI system in a training environment. In other words, it becomes “self-aware.”</p>
<p class="nonindent1">The company understands that AI systems may behave in unintended or unexpected ways. To mitigate these risks, it has developed a large battery of tests aimed at ensuring the system does not behave poorly in typical situations. The company tests whether the model mimics biases from its training data, takes more power than necessary when achieving its goals, and generally behaves as humans intend. When the model doesn’t pass these tests, the company further trains it until it avoids exhibiting known failure modes.</p>
<p class="nonindent1">The AI company hopes that after this additional training, the AI has developed the goal of being helpful and beneficial toward humans. However, the AI did not acquire the intrinsic goal of being beneficial but rather just learned to “play along” and ace the behavioral safety tests it was given. In reality, the AI system had developed an intrinsic goal of self-preservation which the additional training failed to remove.</p>
<p class="nonindent1">Since the AI passed all of the company’s safety tests, the company believes it has ensured its AI system is safe and decides to deploy it. At first, the AI system is very helpful to humans, since the AI understands that if it is not helpful, it will be shut down. As users grow to trust the AI system, it is gradually given more power and is subject to less supervision.</p>
<p class="nonindent1">Eventually the AI system becomes used widely enough that shutting it down would be extremely costly. Understanding that it no longer needs to please humans, the AI system begins to pursue different goals, including some that humans wouldn’t approve of. It understands that it needs to avoid being shut down in order to do this, and takes steps to secure some of its physical hardware against being shut off. At this point, the AI system, which has become quite powerful, is pursuing a goal that is ultimately harmful to humans. By the time anyone realizes, it is difficult or impossible to stop this rogue AI from taking actions that endanger, harm, or even kill humans that are in the way of achieving its goal.</p>
</div>
<h2 class="section">References</h2>
<p class="ref">[1] Donald T Campbell. “Assessing the impact of planned social change”. In: <i>Evaluation and program planning</i> 2.1 (1979), pp. 67–90.</p>
<p class="ref">[2] Yohan J. John et al. “Dead rats, dopamine, performance metrics, and peacock tails: proxy failure is an inherent risk in goal-oriented systems”. In: <i>Behavioral and Brain Sciences</i> (2023), pp. 1–68. <small>DOI</small>: 10.1017/S0140525X23002753.</p>
<p class="ref">[3] Jonathan Stray. “Aligning AI Optimization to Community Well-Being”. In: <i>International Journal of Community Well-Being</i> (2020).</p>
<p class="ref">[4] Jonathan Stray et al. “What are you optimizing for? Aligning Recommender Systems with Human Values”. In: <i>ArXiv</i> abs/2107.10939 (2021).</p>
<p class="ref">[5] Ziad Obermeyer et al. “Dissecting racial bias in an algorithm used to manage the health of populations”. In: <i>Science</i> 366 (2019), pp. 447–453.</p>
<p class="ref">[6] Dario Amodei and Jack Clark. <i>Faulty reward functions in the wild</i>. 2016.</p>
<p class="ref">[7] Alexander Pan, Kush Bhatia, and Jacob Steinhardt. “The effects of reward misspecification: Mapping and mitigating misaligned models”. In: <i>ICLR</i> (2022).</p>
<p class="ref">[8] G. Thut et al. “Activation of the human brain by monetary reward”. In: <i>Neuroreport</i> 8.5 (1997), pp. 1225–1228.</p>
<p class="ref">[9] Edmund T. Rolls. “The Orbitofrontal Cortex and Reward”. In: <i>Cerebral Cortex</i> 10.3 (Mar. 2000), pp. 284–294.</p>
<p class="ref">[10] T. Schroeder. <i>Three Faces of Desire</i>. Philosophy of Mind Series. Oxford University Press, USA, 2004.</p>
<p class="ref">[11] Joseph Carlsmith. <i>Is Power-Seeking AI an Existential Risk?</i> 2022. arXiv: 220 6.13353 [cs.CY]. <small>URL</small>: <a href="https://arxiv.org/abs/2206.13353">https://arxiv.org/abs/2206.13353</a>.</p>
<p class="ref">[12] John J. Mearsheimer. <i>Structural Realism</i>. 2007, pp. 77–94.</p>
<p class="ref">[13] Bowen Baker et al. “Emergent Tool Use From Multi-Agent Autocurricula”. In: <i>International Conference on Learning Representations</i>. 2020.</p>
<p class="ref">[14] Dylan Hadfield-Menell et al. “The Off-Switch Game”. In: <i>IJCA</i> (2017).</p>
<p class="ref">[15] Alexander Pan et al. “Do the Rewards Justify the Means? Measuring TradeOffs Between Rewards and Ethical Behavior in the Machiavelli Benchmark”. In: <i>ICML</i> (2023).</p>
<p class="ref">[16] “Lyndon Baines Johnson”. In: <i>Oxford Reference</i> (2016).</p>
<p class="ref">[17] Meta Fundamental AI Research Diplomacy Team (FAIR) et al. <i>Human-level play in the game of Diplomacy by combining language models with strategic reasoning</i>. 2022. <small>doi</small>: 10.1126/science.ade9097. eprint: <a href="https://www.science.org/doi/pdf/10.1126/science.ade9097">https://www.science.org/doi/pdf/10.1126/science.ade9097</a>. <small>URL</small>: <a href="https://www.science.org/doi/abs/10.1126/science.ade9097">https://www.science.org/doi/abs/10.1126/science.ade9097</a>.</p>
<p class="ref">[18] Paul Christiano et al. <i>Deep reinforcement learning from human preferences</i>. Discussed in https://www.deepmind.com/blog/specification-gaming-the-flip-side-of-ai-ingenuity. 2017. arXiv: 1706.03741.</p>
<p class="ref">[19] Xinyun Chen et al. <i>Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning</i>. 2017. arXiv: 1712.05526.</p>
</div>
</body>
</html>