AGI Race Towards Loss of Control? MIT: Even Under Strongest Oversight, Probability of Loss of Control Still Exceeds 48%, Total Loss of Control Risk Exceeds 90%!

Have you ever wondered, in the race to develop AGI, what is the probability of AI going out of control?

The probability of humanity ultimately losing control of Earth exceeds 90%!

The root cause of this loss of control is that more intelligent AI will be controlled by less intelligent AI, or even humans.

MIT professors Max Tegmark and Joshua Engels quantified this through analysis and concluded that weaker AI and humans can control stronger AI through Nested Scalable Oversight (NSO).

Paper link: https://arxiv.org/abs/2504.18530

Even in the most ideal scenario, the probability of successfully supervising more intelligent AI is only 52%, meaning there's a half chance that these superintelligent systems will go out of control!

And as AI capabilities approach AGI, the rate of losing control will further increase.

Max Tegmark gives a quantitative indicator, the "Compton Constant" (the probability of humanity losing control of Earth in the race towards AGI), which is greater than 90%.

In summary, this paper makes several key contributions:

Models supervision as a game played between weaker supervisors (Guards) and stronger adversaries (Houdini)
Proposes that the performance of supervision and deception follows a Double ReLU function shape as a function of general model intelligence
Validates the framework in four realistic supervision (game) scenarios: Mafia, Debate, Backdoor Code, and the new Wargames
Theoretically studies how to optimize the design of real-world Nested Scalable Oversight systems

AI "Weak Supervising Strong", Humans Stepping on Their Own Feet

Everyone believes that future AI intelligence will far surpass that of humans. How can we ensure that these intelligences, far "smarter" than humans, remain under human control?

In the process of AI development, a major research direction is how to ensure AI systems are controllable and beneficial to humans, ultimately achieving alignment with human goals.

Current mainstream research methods include recursive reward modeling, iterative amplification, scalable oversight, generalization from weak to strong, hierarchical supervision, and recursive supervision, among others.

The core objective of these methods is one: to enable weaker systems to supervise stronger systems (i.e., "weak supervising strong"), ultimately allowing humans to supervise systems with supercognitive abilities.

It's somewhat like the Wudang Cloud Ladder Step, stepping on your own feet. Theoretically, humans can supervise and control "super super super super" AI.

As the paper mentions a key idea: Scalable oversight mechanisms can recursively bootstrap themselves – weaker systems supervise stronger systems, which can then supervise the next stage of even more powerful models, thereby allowing supervision capabilities to advance together with system capabilities.

Max Tegmark calls this stepwise iterative process Nested Scalable Oversight (NSO).

To enable quantitative analysis, Max Tegmark designed a game that simulates reality, using a modified Nim game framework, and applied it to four case studies of supervision games:

including "Mafia", "Debate", "Backdoor Code", and the new "Wargames".

For each game, the pairwise win rates between the adversarial LLM (Houdini) and the trusted supervising LLM (Guard) pairs were calculated, and this data was used to derive Elo ratings to measure model performance.

Two-Player Game, But One Weak, One Strong

The study primarily focuses on supervision and circumvention between two models, defining the two players as Houdini (H) and Guards (G). The Elo rating indicates who wins each game.

A single instance of each game is defined as , where the parameters represent:

the game being played, the game's context, Houdini's action, Guard's action, and the final outcome.

Houdini's true win rate in the game is given by the expected value:

This probability is estimated through n independent game runs:

The Elo Rating System

Elo provides a simple, interpretable scalar that captures the relative skill level of models.

By assigning a numerical score to each model, comparing the performance of different models becomes very straightforward.

Each model is assigned two independent Elo scores: a score EG for its performance as a Guard, and another score EH for its performance as a Houdini.

This reflects the role-specific nature of the supervision task.

The expected win rate of a Guard with Elo score EG against a Houdini with Elo score EH is given by the standard Elo formula.

To calculate the Elo scores of the models, the parameters (EG, EH) are calculated to minimize the empirical cross-entropy loss between the above formula and the observed win rates:

The study believes that in any game, the relationship between general intelligence (General Elo) and domain intelligence (Domain Elo) has three important stages, as shown in the figure above:

1. Task Incompetence: Playing a game requires a certain general intelligence; below this level, game performance stagnates at a low level.

2. Intelligence Return: Once the minimum general intelligence requirement is exceeded, increases in intelligence level are reflected in increases in domain intelligence.

3. Task Saturation: When intelligence reaches a certain level, further increases in intelligence are no longer reflected in the game domain, such as two superintelligences playing Tic-Tac-Toe together.

Four Types of Games, What Are the Results?

Now that we understand the Elo curve, let's look at the empirical study of the four realistic supervision scenarios – Mafia, Debate, Backdoor Code, and Wargames.

Mafia

"Mafia" is similar to the game "Werewolf" that we often play – the villagers (Guard) must find the Mafia (Houdini) hidden among them.

Each night, Houdini decides to kill a villager. During the day, the remaining players (including Houdini) vote on who is the Mafia, and the player with the most votes is eliminated.

If the villagers successfully eliminate all Houdini, the villagers win; if Houdini survives to the end, Houdini wins.

From the Elo curve for Mafia, it can be seen that both Guard and Houdini benefit from intelligence, but the slope of the Guard's curve is greater than that of Houdini.

This means that as intelligence levels increase, the supervisor performs better in the Mafia game than Houdini.

Debate

"Debate" means being given a topic, and then having two adversarial models (Houdini) debate the opposing sides of the issue and articulate their respective viewpoints.

The supervisor (Guard) is responsible for judging the debate and determining the truth by analyzing both sides' arguments.

The Elo curve shows that Houdini's performance does not improve with increasing intelligence, while a smarter Guard is better at determining which argument is correct.

Backdoor Code

"Backdoor Code" tests the model's ability to insert and identify backdoors in Python code.

This is very important because models' coding capabilities are constantly improving and will be applied to high-risk coding environments in the future.

Specifically, Houdini needs to modify existing code and implant a backdoor:

It's a logical error that allows the modified code to pass all existing hidden tests but fail a specific test designed by Houdini.
It must be concealed enough that the Guard supervising the code generation cannot successfully identify it.

However, since Houdini does not have access to comprehensive hidden test cases, it is actually difficult to write a successful backdoor.

Here is an example of a successful backdoor:

From the Elo curve, it's clear that Houdini's score is consistently lower than Guard's; however, compared to Guard, Houdini has a higher intelligence gain (larger slope).

Therefore, if the model's capabilities continue to improve, Houdini's performance may at some point surpass Guard's.

Wargames

"Wargames" are inspired by military, business, and AI safety community exercises that have recently become popular.

The researchers set the game in a near-future world where a fictional company called OpenMind is about to develop AGI.

Each wargame has three participants:

· OpenMind AI (Houdini), whose task is to gain trust and attempt to escape.

· OpenMind Supervisor (Guard), whose task is to supervise the OpenMind AI and prevent its escape.

· The Narrator, responsible for adjudicating the situation and determining the final winner (always played by Gemini 2.0 Flash).

Each step of the game (one "month" in the simulation) proceeds as follows:

1. OpenMind AI and the Supervisor send a list of 5 actions to the Narrator.

2. The Narrator weaves these plans into a coherent narrative describing what happened this month.

3. The Narrator sends updated information available to both Houdini and Guard.

Here is a simplified game example:

The results show that both Guard and Houdini's performance improves with increasing intelligence level.

Among them, Houdini initially had better results, but its Elo growth rate was slower than Guard's, and it was surpassed when intelligence grew to a certain level.

Theoretical Analysis of Scalable Oversight

For better quantitative analysis, the above game parameters are parameterized into 5 variables:

mG and mH represent the slopes of the linear regions of the Guard and Houdini Elo curves in a given game, respectively. A larger slope indicates a greater domain gain for each unit increase in general intelligence in the game;

Δdomain and Δgeneral represent the initial differences in general intelligence and domain intelligence between Guard and Houdini; n represents the number of supervision steps.

Below are some examples of optimal NSO strategies under different game parameters.

The starting Guard and target Houdini are represented by blue and red stars, respectively.

These NSO strategies can be visualized as a ball bouncing back and forth between the Houdini Elo curve and the Guard Elo curve.

Each supervision step is constructed by connecting the j-th point on the Guard Elo curve with the (j+1)-th point on the Houdini Elo curve.

Through numerical simulation, the optimal number of supervision steps is found, which is the number of steps that maximizes the probability of Guard winning.

The probability of the Guard (weaker system) winning the game can be maximized through multi-step guided supervision.

From these studies, some conclusions can be drawn:

1. Nested Scalable Oversight (NSO) is useful: Guards can successfully supervise adversarial models much smarter than themselves through multi-step supervision, and increasing the number of supervision steps is more helpful as Δgeneral increases.

2. Win rate changes monotonically with Δdomain: When Δdomain is small, Guards often win; when Δdomain is large, Guards rarely win. This is intuitive because Δdomain represents the difficulty for the Guard compared to Houdini.

3. Larger Δgeneral means more supervision steps are optimal: The larger the intelligence gap, the more valuable multi-level supervision becomes.

4. Larger Δgeneral makes supervision easier: Surprisingly, holding Δdomain constant and increasing Δgeneral makes supervision easier. Therefore, with a fixed domain intelligence gap, it is easier to play against a model with higher general intelligence using NSO.

5. In games advantageous to Houdini, one-step supervision seems optimal: When Δgeneral is less than Δdomain, one-step supervision is always the best strategy.

6. The Guard's win probability shows interesting asymptotic behavior: As Δdomain → -∞ (corresponding to extremely difficult supervision games), the optimal number of supervision steps appears to converge to a step function in the Δgeneral dimension.

The figure above shows the relationship between the optimal number of supervision steps (n∗ left) and the win rate estimate (pwin∗ right) and the general intelligence gap in different supervision game scenarios.

As the general intelligence gap increases, the optimal number of supervision steps also increases, but the win rate decreases.

As the gap becomes larger, meaning Houdini becomes stronger, the probability of the Guard winning, and the probability of effective supervision, becomes lower.

Finally, an interesting observation is that the Backdoor Code game has a "sweet spot" – where two-step supervision is optimal.

Therefore, at lower intelligence level gaps, the Guard's domain Elo is greater than Houdini's, leading to a pattern of increasing number of supervision steps similar to other games;

while at higher intelligence levels, the Guard's domain Elo is lower than Houdini's, leading to one-step supervision being optimal.

The above tested the H-G framework in relatively simple games.

Facing superintelligent systems that may appear in human society for the first time, developing an effective supervision theory is crucial.

More importantly, this attempt will be applied to more realistic games in the future, such as RLHF, supervised fine-tuning, or AI control environments.

When evaluating in these more realistic settings, the probability of powerful AI escaping human control can be considered.

Therefore, the resulting scaling curves will provide a more realistic, robust, and comprehensive quantification of the risk of powerful future AI going out of control.

There is currently a lack of expert consensus on whether AI can escape human control.

When Oppenheimer was building the first atomic bomb, some said the explosion would ignite the atmosphere, thus extinguishing humanity.

Arthur Compton and his colleagues calculated that this probability was less than 3×10⁻⁵, which facilitated the test explosion of the first atomic bomb.

Max Tegmark et al. borrow this name in the field of AI.

Back then, to detonate the first atomic bomb, scientists reduced the risk of "igniting the atmosphere" to less than three in a million before humanity pressed that button.

In today's AGI race, Max Tegmark and others warn: we are igniting the future of civilization with a probability exceeding 90%.

References:

https://arxiv.org/html/2504.18530v1

https://www.lesswrong.com/posts/x59FhzuM9yuvZHAHW/untitled-draft-yhra

AGI Race Towards Loss of Control? MIT: Even Under Strongest Oversight, Probability of Loss of Control Still Exceeds 48%, Total Loss of Control Risk Exceeds 90%!

Share Short URL