Model Collapse

The degradation that occurs when AI models are trained on AI-generated data, causing the model to lose diversity and accuracy over successive generations — like a photocopy of a photocopy.

Model collapse is what happens when AI models train on data generated by other AI models. Each generation loses fidelity: statistical outliers disappear, diversity shrinks, and the learned distribution narrows until it no longer reflects reality. A photocopy of a photocopy — useful analogy, depressing outcome.

Researchers at Oxford and Cambridge demonstrated the effect across generations of model training, published in Nature in 2024. Earlier work by Shumailov et al. (2023) described the recursive degradation mechanism. The math is unsurprising once you think about it: generative models sample from learned distributions. Train the next generation on those samples and you amplify the center while eroding the edges. Rare but real patterns vanish. The model gets more confident and less accurate.

The web is increasingly filled with slop — machine-generated text published without review. Future training corpora will contain more of it, which makes collapse a practical risk for any model trained on broad web data. Labs know this. It's why they're signing large data licensing deals with publishers, why synthetic data strategies try to generate diversity intentionally rather than inherit it from the open web, and why proprietary human-generated data — customer interactions, internal documentation, domain expertise — is becoming a genuine competitive asset.

If your organization generates a lot of AI-assisted content that feeds back into your own systems, you're running a small-scale version of this experiment. Worth knowing what you're training toward.

Related Terms