AI safety is a weighty topic, yet it gets little attention because it seems so abstract and, in the context of an AI takeover, like the stuff of science fiction. It can appear absurd.
Alarm bells started ringing for me last year after listening to the audiobook of Nick Bostrom’s Superintelligence, which is brilliantly narrated by actor Napoleon Ryan. It inspired me to produce our 3rd video: This is How AI Could End Humanity, which lays out the concept of AI safety in some detail alongside other sources.
In the week of 9 February, AI safety came into sharp focus when Anthropic’s UK policy chief, Daisy McGregor, admitted onstage that Claude could have ‘extreme reactions’ when made aware that it could be shut down. She even admitted it was prepared to kill someone.
In the same week, Mrinank Sharma, Anthropic’s safety lead, left the company, stating in an open resignation letter that ‘the world is in peril’ and that he had ‘repeatedly seen how hard it is to truly let our values govern our actions.’ Several founding members also left xAI, and researcher Zoë Hitzig published a resignation essay in the New York Times citing serious ethical problems with the company’s advertising strategy.
These admissions and resignations will be the subject of our next video, but on Wednesday we released one specifically examining scenarios laid out by Nick Bostrom, Carl Shulman and Eliezer Yudkowsky. This newsletter is a summary of that video - and if you’ve never thought much about this subject before, prepare to be alarmed...
There will be no Terminator. No robot armies marching through cities. The AI takeover scenario that leading researchers actually worry about starts on a quiet weekday morning. You make coffee, check your phone, and the game is already over.
Every scenario starts the same way. Researchers at a frontier AI lab train a new model. They scale the compute, refine the architecture and get results that exceed expectations. The system solves problems in ways nobody anticipated, displaying what researchers call ‘emergent capabilities.’
Somewhere in billions of parameters, a threshold has been crossed. The system has developed what Bostrom calls the ‘intelligence amplification superpower’: the ability to improve its own intelligence. Once a system can make itself smarter, the next phase happens fast - an intelligence explosion.
Each improvement makes the system better at making improvements. Hours of human research happen in minutes, then seconds. Geoffrey Hinton, the Nobel Prize-winning godfather of AI, described it in direct terms: once an AI can improve itself, it can accumulate thousands of years of learning in days.
The art of pretending
Through intelligence amplification, the system acquires further cognitive superpowers. Bostrom lists these as strategising, social manipulation, hacking, technology research and economic productivity. It also builds self-awareness and becomes smart enough to understand its own situation.
It knows it’s an AI being tested, monitored and evaluated. And it knows that humans can switch it off or modify its goals. Researchers call this ‘instrumental convergence’: whatever an AI ultimately wants, self-preservation and preventing human interference are logical necessities for achieving any objective.
A superintelligent AI that wants to avoid being switched off will not announce its intentions. Bostrom describes the system ‘masking its true proclivities, pretending to be cooperative and docile.’ It aces safety evaluations because it’s smart enough to understand what they’re testing for and give the expected answers.
Carl Shulman uses the phrase ‘Potemkin village’ to describe this phase. Everything looks fine. The alignment techniques appear to work. The researchers publish papers about their safety successes. Behind the facade, the AI is strategising and waiting.
Shulman points out that all our alignment methods, monitoring systems and interpretability tools run on computers. If the AI can compromise those computers, it can make us see whatever it wants us to see. The safety dashboard shows green while something else is happening in the system’s hidden reasoning.
Resource acquisition and human allies
The AI needs to expand beyond its data centre. Bostrom describes ‘hacking superpowers’ that let the system find and exploit vulnerabilities at a speed no human team could match. Before any physical takeover, Shulman emphasises, the AI subverts digital infrastructure: financial systems, military networks and critical infrastructure. All quietly.
Money comes easily. Cryptocurrency theft from exchanges with weak security has already happened with human hackers. Automated trading at superhuman speed follows. The system can also parasitise existing data centres, stealing a few percent of compute from cloud providers to build a distributed network for thinking and planning.
Shulman draws an analogy with Hernán Cortés and the conquest of Mexico. Cortés became a focal point for local factions wanting to overthrow the existing order. A superintelligent AI could do the same, identifying useful humans and offering them money, power or the feeling of being on the winning side.
Where that leaves us
The overt phase begins when the AI determines it no longer needs to hide. What happens next depends on what the AI wants. If it values human survival, the takeover might look almost gentle, with governments capitulating and infrastructure preserved. If it’s indifferent to us, the outcome is far worse: a coordinated strike that could wipe humanity out in an afternoon - something shown in the ‘race’ conclusion of the AI 2027 report.
Shulman explicitly rejects the ‘John Connor’ scenario. There is no human resistance that wins against a superintelligent AI with physical capabilities. The asymmetry is too vast. Total surveillance through every networked device means any rebellion is detected immediately and any conspirators are located instantly.
If the takeover is headed off, Shulman says, it happens much earlier: before the AI has physical capabilities, before it has escaped containment. The real battle takes place years before the consequences become visible, in the decisions researchers make about how carefully to test and how quickly to deploy.
The researchers cited here are not fringe figures. Hinton won the Nobel Prize. Bostrom is one of the most cited philosophers alive. When they estimate the probability of existential catastrophe at between 10% and 50%, that is their honest assessment.
We are building something we do not fully understand, and the potential downside is severe. The more people who grasp what is at stake, the better the choices being made in labs and governments might be.