In 2023, StackOverflow was the largest programming knowledge base on Earth, built over 15 years by millions of developers sharing solutions. By 2025, question volume had collapsed by 78%. The cause was ChatGPT, trained on StackOverflow’s own content, now giving developers those same answers for free.

StackOverflow announced mass layoffs. Chegg, the student homework platform, saw its stock fall 99%. And in late 2024, OpenAI CEO Sam Altman acknowledged the Dead Internet Theory, the idea that genuine human content is vanishing from the web. He called it “basically right.”

The problem goes deeper than dying platforms. AI researchers have documented that when models train on content generated by other AI models, their outputs degrade rapidly. And the internet these systems depend on is filling up with that content.

Training an AI like GPT or Claude requires staggering quantities of text. The models learn from the entire indexable internet, plus books (some of them pirated, ugh), academic papers, and code. The quality of that data is everything: rich, diverse, genuinely human content produces capable models, while thin, repetitive material produces weaker ones.

For decades, the internet provided that rich substrate. Developers answered questions on StackOverflow to build reputations. Journalists wrote articles because publishers monetised traffic. Bloggers shared knowledge because they cared. All of it became, in effect, an unpaid training data pipeline for AI companies.

Model collapse

In July 2024, a team led by Ilia Shumailov at the University of Oxford published a paper in Nature that gave this problem a name: model collapse. They showed that when AI models train on data from other AI models, each generation becomes more generic, more repetitive, and less diverse.

The rare but valuable outputs disappeared first: unusual ideas, creative phrasings, edge-case solutions. A separate study from Rice University called it Model Autophagy Disorder, suggesting that AI systems training on AI content are consuming themselves.

The mechanism is straightforward. When a language model generates text, it favours high-probability outputs. Unusual phrasing and minority viewpoints are less likely to appear. The next model trains on this narrower output and learns an even narrower distribution. The researchers compared it to repeatedly photocopying a photocopy: each generation loses detail.
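The photocopy effect can be sketched in a few lines. This is a hypothetical toy simulation, not the Oxford team's actual experiment: each "generation" fits a simple Gaussian model to the previous generation's outputs, but keeps only high-probability samples (mimicking a model's preference for likely text). The cutoff value and sample size are arbitrary choices for illustration.

```python
# Toy model-collapse simulation: each generation trains on the previous
# generation's outputs, favouring high-probability samples. Diversity
# (standard deviation) shrinks with every step, like a photocopied photocopy.
import random
import statistics

random.seed(0)

def next_generation(mu, sigma, n=2000, cutoff=1.5):
    """Sample from the current model, discard low-probability tails
    (beyond `cutoff` standard deviations), then refit mean and std."""
    samples = [random.gauss(mu, sigma) for _ in range(n)]
    kept = [x for x in samples if abs(x - mu) < cutoff * sigma]
    return statistics.mean(kept), statistics.stdev(kept)

mu, sigma = 0.0, 1.0  # generation 0: the "human" data distribution
history = [sigma]
for _ in range(10):
    mu, sigma = next_generation(mu, sigma)
    history.append(sigma)

print([round(s, 3) for s in history])
```

Each refit loses the tails, so the spread decays geometrically; after ten generations the distribution has narrowed to a small fraction of the original diversity.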

A 2024 projection by Epoch AI estimated that high-quality text training data could be exhausted by 2028. Meanwhile, research from the University of Waterloo found that synthetic content on the web is growing exponentially, already outnumbering human-written content in some categories.

The knowledge loop is breaking

StackOverflow was more than a reference site. Every day, developers encountered novel bugs and undocumented edge cases, posted them, and other developers proposed solutions. It was a living record of new knowledge being created in real time. When a developer now asks ChatGPT instead, the question and whatever novel solution it might have generated never enters the public record.

Emily Bender, a computational linguist at the University of Washington, has argued that large language models create an “illusion of understanding.” They recombine existing knowledge with fluency, but they cannot generate genuinely new knowledge. If the systems that produce new human knowledge atrophy, AI is left recombining an increasingly stale corpus.

The economic spiral compounds the problem. Less traffic to original sources means less revenue, which means fewer professional writers and experts creating content, which means less high-quality training data, which means worse AI outputs. The feedback loop is self-reinforcing.

There are already signs of stratification. Reddit has grown since 2023, partly because users append “reddit” to searches to find human perspectives. Substack has thrived as readers seek verified human voices. Access to authentic human knowledge may become a luxury good, stratified by digital literacy and willingness to pay.

Reasons for cautious optimism

DeepSeek’s R1 model, released in early 2025, showed that AI systems can develop reasoning capabilities through self-play and reinforcement learning, with minimal reliance on human examples. If models can improve by generating their own training data, the dependence on the open web may matter less than the collapse research implies.

There is also a historical argument. The internet disrupted previous knowledge systems too. Newspaper and encyclopaedia sales fell. And yet Wikipedia replaced Britannica with something more comprehensive, and YouTube created an entirely new category of educational content. AI’s disruption may produce new platforms we cannot yet anticipate.

Some platforms are adapting. StackOverflow has integrated AI tools. Publishers are experimenting with licensing deals and paywalled content. The market is responding, though unevenly.

The question is whether these adaptations will arrive quickly enough. The StackOverflow that took 15 years to build was gutted in under three. Chegg went from market leader to near-bankruptcy in 18 months. The internet was built by humans sharing what they knew. What happens when AI makes that sharing unsustainable will shape the information landscape for decades.
