
The Ancestor's Error

Shumailov's Nature paper proved the mechanism in a closed loop. Ahrefs found 74% of new web pages contained AI text. The thought experiment is no longer hypothetical.

In medieval scriptoria, monks copied manuscripts by hand. Each copy introduced small errors. A transposed letter. A skipped line. A marginal gloss absorbed silently into the body of the text. Over generations the errors accumulated. A manuscript copied ten times from the original was recognizably the same text. A manuscript copied ten times from copies was something else, a document that preserved the broad structure of the source while having lost the details and the unusual choices that no individual copyist had decided to remove.

Textual scholars developed a discipline, stemmatology, to trace the errors backward and reconstruct the original from the pattern of mutations across surviving copies. The discipline existed because generational copying is degenerative. Each generation preserves most of what is there and loses some. The losses are small enough to be invisible in any single copy and large enough, compounded across generations, to transform the text.

In July 2024 a team led by Ilia Shumailov published a paper in Nature showing that large language models undergo an analogous process. They named it model collapse.

What the paper showed is severe. When a language model is trained on data that includes the output of prior language models, the next model's output drifts away from the source distribution. Probability mass concentrates at the center of the distribution. The tails attenuate. Rare patterns and unusual constructions, the parts of language that carry information precisely because they are rare, disappear first. They disappear because the statistical mechanics of generational training selects against them. Nobody decided they were unimportant.

The paper framed it as a thought experiment about a closed loop. Train a model on its own output. Train the next model on that. Repeat. The loop converges to a narrow, homogenized mean. The paper defines the phenomenon as a degenerative process affecting generations of learned generative models, in which the data they generate end up polluting the training set of the next generation.
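The closed loop can be watched in miniature. What follows is a toy sketch, not the paper's experimental setup: each "model" is just a Gaussian fitted to a finite sample drawn from the previous generation's fit, and the sample size and generation count are arbitrary choices made to exaggerate the effect. Even so, the qualitative behavior is the one the paper describes: the fitted spread decays across generations because finite samples systematically underrepresent the tails.

```python
import random
import statistics

# Toy closed loop: each generation "trains" (fits a Gaussian) on samples
# drawn from the previous generation's model. Small samples and many
# generations are illustration choices, not parameters from the paper.
random.seed(42)

N_SAMPLES = 10
N_GENERATIONS = 1000

mu, sigma = 0.0, 1.0  # generation 0: the "human" source distribution
history = [(mu, sigma)]
for _ in range(N_GENERATIONS):
    samples = [random.gauss(mu, sigma) for _ in range(N_SAMPLES)]
    mu = statistics.mean(samples)
    sigma = statistics.stdev(samples)  # finite-sample fit loses tail mass
    history.append((mu, sigma))

print(f"generation 0:    sigma = {history[0][1]:.4f}")
print(f"generation {N_GENERATIONS}: sigma = {sigma:.6g}")
```

Run it and the spread collapses toward zero: the loop converges to a narrow, homogenized mean, exactly the shape of the degenerative process the paper names.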

The thought experiment is no longer hypothetical.

In April 2025 Ahrefs published an analysis of nine hundred thousand newly created web pages. Seventy-four percent of them contained some AI-generated content. The pure-AI fraction, where almost everything was generated, was 2.5 percent. The rest was mixed. The web in 2025 is a layered substrate where almost every new page has some model output baked into it. Any current language model that ingests web data ingests this substrate. It cannot distinguish a paragraph generated by GPT-4 in 2024 and posted to a corporate blog from a paragraph written by a human in 2010 and indexed by the same crawler. The two paragraphs are weighted identically.

The monks at least knew which manuscripts were copies. They could weight earlier copies more heavily, treat the lineage as a clue. The web has no provenance layer. There is no marker on the synthetic paragraph to distinguish it from the human one, and the absence of the marker is, structurally, what the monks would have called a copying error too dangerous to tolerate.

Here is what I think the cultural stakes are.

Information theory tells us the rare patterns carry the most meaning. The information content of a message is the negative log of its probability: the rarer the event, the more it tells you. The unusual word doing more work per unit length than the common one is not poetic flourish. It is Shannon. Model collapse selectively destroys the part of language that carries the information. What it preserves is the part that is fluent, structurally plausible, and instantly forgettable. A photocopy of a photocopy. Recognizable enough to function as the source, missing whatever made the source worth keeping.
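The Shannon point is a one-line formula. Self-information is −log₂ p, measured in bits; the word frequencies below are invented purely for illustration, but the arithmetic is standard:

```python
import math

def self_information_bits(p: float) -> float:
    """Shannon self-information of an event with probability p, in bits."""
    return -math.log2(p)

# Hypothetical frequencies, chosen only to illustrate the asymmetry.
common_word = 0.05      # e.g. a frequent function word
rare_word = 0.00001     # e.g. an unusual construction

print(self_information_bits(common_word))  # ≈ 4.32 bits
print(self_information_bits(rare_word))    # ≈ 16.61 bits
```

The rare word carries roughly four times the information per occurrence. A process that preferentially deletes the rare words is deleting bits, not style.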

The asymmetry is the part that makes this hard to fix. Contamination is automatic. Every blog drafted with ChatGPT, every comment finished with autocomplete, every product description spun out of a marketing tool adds to the substrate. Decontamination requires deliberate effort. Detection of AI-generated text is, the literature shows, a worsening problem as models improve. Provenance systems exist in prototype but are not deployed at scale. Curated human-verified datasets are expensive and slow. The thermodynamic gradient runs one way. The work to reverse it runs uphill, against an incentive structure that pays for contamination and refuses to pay for cleanup.

I do not know whether the models in training right now have already crossed a homogenization threshold. The published literature is thin and contradictory. What I know is the mechanism. The mechanism does not need anyone's permission to operate. It does not need a decision. It runs because the inputs to the next model are increasingly the outputs of the last one, and that is a sentence I have written, in slightly different forms, in three different essays this year.

The medieval parallel is more useful than I would have guessed. The manuscripts that survived the centuries were not the rare ones. They were the popular ones. Natural selection, operating on the population of texts, favored the common over the rare. We are now running the same selection at digital speed. The common patterns survive. The rare patterns perish.

Nobody is burning the library. The library is being rewritten, slowly, by machines that cannot tell the difference between the original and the copy, into a version of itself that preserves everything except the parts that mattered most.

The error is ancestral. The next generation of models did not choose it. They will inherit it from the substrate they had to train on, and the inheritance will be invisible from inside, the way it was invisible to the scribe in 1180 who wrote down what the manuscript he was copying said, and got most of it right, and got something wrong, and could not have known which.
