
Why isn’t anyone talking about model collapse?

This is the last topic I want to write about#

And I’m fairly sure that you all out there are also starting to get sick of all the talk about AI on the interwebz these days. The tech world has seen its biggest content creators just go wild about AI. Matt Pocock, the “TypeScript guy,” hasn’t posted about TypeScript in 8 months or so (and his comment sections let him know it on every new AI video). Half of my Bluesky feed is filled with AI talk, and let’s not mention LinkedIn, shall we?

But in all the discourse, paid shilling, hype-mongering, and anti-AI sentiment alike, this one topic doesn’t seem to come up nearly as much as I think it should, and I have questions.

Why aren’t many people talking about model collapse?

What is model collapse?#

Let me start by explaining the concept, just in case it’s new to you. I’ll use the Wikipedia definition, which describes the term as

Model collapse: a phenomenon noted in artificial intelligence studies, where machine learning models gradually degrade due to errors coming from uncurated synthetic data, or training on the outputs of another model, such as prior versions of itself.

Basically, the more these LLMs are trained on their own slop, the more worthless they become.

I was sent this video by George D. Montañez, PhD, of Theos Theory on YouTube. It’s one of the better talks about AI that I’ve seen, though I admit I’m a bit biased because of its content. In that very well-formatted presentation, George highlights that by somewhere around the 9th generation of training an LLM on its own output, what it generates is basically gobbledegook.
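To make that degradation concrete, here’s a toy sketch of my own (nothing from George’s talk, and nothing like real LLM training): pretend a “model” is just word frequencies estimated from a sample, then retrain it on its own output for a few generations. The vocabulary size, the Zipf-ish distribution, and the sample size are all made-up parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy "language": 1,000 words with a long-tailed, Zipf-like frequency curve.
vocab_size = 1_000
probs = 1.0 / np.arange(1, vocab_size + 1)
probs /= probs.sum()

for generation in range(1, 11):
    # "Train" the next model: estimate word frequencies from a finite sample
    # of the previous model's output, with no fresh human data mixed in.
    sample = rng.choice(vocab_size, size=5_000, p=probs)
    counts = np.bincount(sample, minlength=vocab_size)
    probs = counts / counts.sum()
    alive = int((probs > 0).sum())
    print(f"gen {generation:2d}: {alive} of {vocab_size} words survive")
```

Every generation, rare words that happen to miss the sample get probability zero and can never come back, so the tail of the distribution erodes and the “language” gets blander and more repetitive each round. That tail loss is the basic mechanism behind model collapse, just without the billions of parameters.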

That same Wikipedia article I pulled the definition from also says that this concept of model collapse is widely known in the AI research community. It’s not a secret that this kind of thing happens.

In my mind, the biggest question for the industries poised to consume these models at scale is: What specifically, if anything, is being done to prevent model collapse?

So I went researching (the old-fashioned way, by typing my query into a Google search input) to see what folks say about preventing model collapse.

Is model collapse even a problem?#

I found at least one study from 2024 that seems to show that as long as the AI output doesn’t replace the human training data, everything will be ok. So there’s nothing to worry about, right? I guess that depends on what we think will happen to the whole ocean of digital content in the near future. Sure, some parts of it won’t get replaced by AI output. The full text of Harry Potter and the Sorcerer’s Stone, which is already in the models practically word for word, won’t change because (hopefully) none of the trillion AI rewrites of it will get published. Or will they? I think there’s a really good chance that Temu Harry Potter Stories will start popping up all over the internet. Our society has already fallen in love with fanfics, so it would not surprise me one bit to see new AI output chapters exist alongside the real thing.

Aside—This hopefully goes without saying, but fuck J.K. Rowling even if she did create a pretty cool universe absolutely fucking filled to the brim with plot holes and mildly racist character names.

Take Harry Potter AI rewrites for the lulz, multiply it by a billion, and I think we could agree that it’s not terribly far-fetched that even if AI content doesn’t “replace” the human-made originals, the presence of that generated content alongside the originals could certainly muddy the waters quite a bit. What if the AI-generated content gets picked up as training data because it’s posted on some blog, but the original doesn’t because the publishing company sues the AI bros and disallows it? Then what?
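It’s worth poking at what that 2024 study’s “accumulate, don’t replace” result actually means. Extending my earlier toy sketch (same disclaimer: made-up parameters, not the paper’s actual setup), keeping the original human data in the training pool changes the picture:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 1_000
human_probs = 1.0 / np.arange(1, vocab_size + 1)
human_probs /= human_probs.sum()

# Start from a pool of real "human" data, and never throw it away.
pool = rng.choice(vocab_size, size=5_000, p=human_probs)
probs = np.bincount(pool, minlength=vocab_size) / len(pool)

for generation in range(1, 11):
    synthetic = rng.choice(vocab_size, size=5_000, p=probs)
    pool = np.concatenate([pool, synthetic])  # accumulate, don't replace
    probs = np.bincount(pool, minlength=vocab_size) / len(pool)
    alive = int((probs > 0).sum())
    print(f"gen {generation:2d}: {alive} words survive, pool size {len(pool):,}")
```

Because the original human sample stays in the pool forever, no word that made it in can ever hit zero probability, and the rot plateaus instead of compounding. Which is great, right up until you ask what happens when the human data gets drowned out, or never gets collected in the first place.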

But we’re coders, you and me, right? Code is our domain. So in addition to the potential difficulty with the loads of training data written in natural languages, we also have to concern ourselves with this question: how likely is it that AI-generated code replaces or outpaces human code in the future? And if that does happen, how long does it take? How much AI code would it take to poison a model beyond repair? No one knows the future, but here’s my leading question about all of this:

Isn’t the proliferation of AI-generated output what these billionaire AI tycoons want? Isn’t that their whole goal and business model?

In a world where AI output proliferates massively, then what?#

I think we could all agree that the AI business bros are pushing heavily to make sure that every human adopts their tools. That’s how they make money. So do we really think that OpenAI and Anthropic and the like survive purely on their chatbot apps, which only produce content to be read by their human users but never published back out into the environment? I doubt it. I don’t see adoption of these models at the scale that seems to be needed. The muggles (non-tech folk) don’t really seem to be adopting AI en masse in their daily lives. Sure, they might use it, but paying for it? Doubt. I’ve also read that there is a fair bit of negative sentiment about AI in general because of the explosion of useless, pointless chatbots in every single app and website ever made.

I had ChatGPT installed on my personal phone for a while, but I never gave them a dime of my personal money. I’m a techie, and I still rarely typed at the word prediction machine. I have since deleted the app because of the shit they pulled in making a deal with this current dumpster fire of an administration. But I won’t miss it at all. Not even a little. I have used an AI chatbot in my personal life maybe 10 times in all. And I would bet that even though OpenAI has a ton of users, they aren’t going to make a ton of money unless companies are paying en masse to create things. Creating things means putting that content where? Back out there into the space where OpenAI and Anthropic pulled their training data from.

So if we agree that these companies NEED folks to actually create things using their tools in order to make the money they want, then the amount of AI-generated content will explode as they achieve their goals.

What then? What happens when Anthropic gets their wish and LOTS and LOTS of things are created and put back out into the universe with their tool? Doesn’t that eventuality virtually guarantee model collapse?

How do the business models of these AI companies align with the concept of model collapse?#

How can two opposite things be true simultaneously? How can AI content proliferate so massively that these companies achieve the necessary ROI for their investors without causing model collapse at the same time? OpenAI is said to be adding ads to their platform, but that won’t get them the ROI they need either.

Are these CEOs knowingly driving all of us towards a future where their products inevitably fail? How can it be true that they make their money without it also being true that so much AI content gets generated that the models collapse? We devs know that 9 training generations isn’t that many minor version bumps. I’ve seen precious few questions about this topic, and the answer the CEOs seem to always give is “We will have to be very careful about what we put into our training data set.” As if that explains everything. I still have questions.

Ok, so you can limit the training set to exclude enough AI content. How?#

So let’s assume that just not including AI content in the training set is the answer to model collapse. Cool. What does that look like, exactly? How would these models expect to improve and get better? If more and more content is AI-generated, but these companies can’t use any of it, doesn’t that mean they depend on more human-generated content for their models to improve? Am I the only one seeing a catch-22 here? These companies are driving us to a world where there will necessarily be less of the thing their product depends on to improve? What?

And if excluding AI content prevents model collapse, how exactly could that be done? I know that we humans can all tell when something was written by a bot, but how can that distinction be made programmatically at the scale these models need? It’s not like every website is going to have some comment in the code saying // bot code, please ignore. How exactly would AI content be excluded? I have not seen an interview with any of these CEOs where they explain the process they would use to identify and segregate AI content from human content. I don’t know anything about AI training, but I am a dev, so I would tend to think that task is largely impossible.
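For the sake of argument, here’s the kind of crude filter I picture when I hear “we’ll just exclude the AI content.” To be clear, this is pure invention on my part: the phrase list, the threshold, all of it is hypothetical, and none of it reflects what any of these companies actually do.

```python
# Hypothetical sketch of a naive "AI content" filter for a training data
# pipeline. Nothing here reflects any real company's process.
AI_TELLS = [
    "as an ai language model",
    "i hope this helps",
    "it's important to note",
    "let's delve into",
]

def looks_ai_generated(text: str, threshold: int = 2) -> bool:
    """Flag text that contains several stock LLM phrases."""
    lowered = text.lower()
    hits = sum(phrase in lowered for phrase in AI_TELLS)
    return hits >= threshold

docs = [
    "As an AI language model, I can't do that. I hope this helps!",
    "Wrote a post about my dog this weekend. He's a good boy.",
]
for doc in docs:
    print(looks_ai_generated(doc), "-", doc)
```

And that’s exactly the problem: a filter like this is defeated by one round of paraphrasing, it falsely flags humans who happen to write in stock phrases, and the better the models get at sounding human, the worse any classifier, statistical or otherwise, will do. Watermarking only helps if every model watermarks its output and nobody strips it.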

Why is no one pressing these CEOs about model collapse? Why is no one asking questions about what steps these companies are already taking to mitigate it? As the hype train tramples us underneath, I think it’s dragging us to a world where these models get worse, not better. I would love to see someone ask an AI company CEO why that isn’t the case, just to see what they say.

I think they know what they are doing#

I am eternally a cynic. Especially in cases like this, where the hype is SO overblown compared to what I’ve actually seen in real life. I know it’s cliché at this point, but I had the same reaction to NFTs. Here’s me hoping that these LLMs go the same route. And I think they will. And I think the CEOs know it. My personal opinion is that this might be one of the biggest pump and dump schemes of all time. I have a sinking feeling that the purveyors of these models know full well that model collapse is inevitable if they achieve their goals, so they just want to get in, pump their stocks and valuations to record highs, sell out to a bigger fish, and then leave the FAANG companies holding the massive bag of negative sentiment once it all comes crashing down. I don’t believe any of them for a single second when they say that all of the things they are doing are “about humanity” or “the greater good” or whatever line it is they are hoping we buy today. I don’t believe these LLMs are capable of any of the things that the hype train says they are. Go watch that presentation from George. These models don’t reason, and it doesn’t matter that Anthropic calls it a “reasoning model”. It’s a trick. And it’s on purpose. And it’s going to cost our society a shit ton of money and who knows what else along the way.

The End