What is 'Model Collapse'? An Expert Explains the Rumours About an Impending AI Doom

ABC News


Details

Date Published
25 Aug 2024
Priority Score
3
Australian
Yes
Created
8 Mar 2025, 12:05 pm

Authors (2)

Description

Generative AI needs tons of data to learn. It also generates new data. So, what happens when AI starts training on AI-made content?

Summary

The concept of 'model collapse' describes a hypothetical scenario in which AI systems deteriorate in performance because they are increasingly trained on data generated by other AI systems. This raises concerns about the quality and diversity of the data needed to build reliable AI models as AI-generated content proliferates. The article covers ongoing debates about the sustainability of AI systems, the need to preserve access to high-quality, human-generated data, and the regulation required to keep AI development competitive and diverse. Australian interim legislation is mentioned in the context of labelling AI-generated content, underlining the need to protect digital public goods and socio-cultural diversity as AI-generated content grows. The narrative underscores the global relevance of these issues and the risk of losing diversity in content creation.

Body

Analysis

What is 'model collapse'? An expert explains the rumours about an impending AI doom

By Aaron J. Snoswell, The Conversation

Topic: Artificial Intelligence

Sun 25 Aug 2024 at 9:09am

[Image: "Model collapse" refers to a hypothetical scenario where future AI systems get progressively dumber due to the increase of AI-generated data on the internet. (AP: Michael Dwyer/File)]

Artificial intelligence (AI) prophets and newsmongers are forecasting the end of the generative AI hype, with talk of an impending catastrophic "model collapse".

But how realistic are these predictions? And what is model collapse anyway?

Discussed in 2023, but popularised more recently, "model collapse" refers to a hypothetical scenario where future AI systems get progressively dumber due to the increase of AI-generated data on the internet.

The need for data

Modern AI systems are built using machine learning. Programmers set up the underlying mathematical structure, but the actual "intelligence" comes from training the system to mimic patterns in data.

But not just any data. The current crop of generative AI systems needs high-quality data, and lots of it.

To source this data, big tech companies such as OpenAI, Google, Meta and Nvidia continually scour the internet, scooping up terabytes of content to feed the machines. But since the advent of widely available and useful generative AI systems in 2022, people are increasingly uploading and sharing content that is made, in part or whole, by AI.

In 2023, researchers started wondering if they could get away with relying only on AI-created data for training, instead of human-generated data.

There are huge incentives to make this work. In addition to proliferating on the internet, AI-made content is much cheaper than human data to source. It also isn't ethically and legally questionable to collect en masse.

However, researchers found that without high-quality human data, AI systems trained on AI-made data get dumber and dumber as each model learns from the previous one. It's like a digital version of the problem of inbreeding.

This "regurgitive training" seems to lead to a reduction in the quality and diversity of model behaviour. Quality here roughly means some combination of being helpful, harmless and honest. Diversity refers to the variation in responses, and which people's cultural and social perspectives are represented in the AI outputs.

In short: by using AI systems so much, we could be polluting the very data source we need to make them useful in the first place.
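The statistical intuition behind this feedback loop can be shown with a toy simulation. The sketch below is an illustration, not something from the article: a "model" that merely estimates the mean and spread of a Gaussian is repeatedly refit on samples drawn from the previous generation's fit. The Gaussian setup, sample size and seed are arbitrary assumptions chosen to make the effect visible; with finite training samples, estimation error compounds and the spread, a crude stand-in for diversity, tends to shrink over the generations.

    # Toy analogue of "regurgitive training": repeatedly fit a simple model
    # (here, just the mean and spread of a Gaussian) to samples drawn from
    # the previous generation's model. With finite samples, estimation
    # error compounds and the spread tends to shrink over generations.
    import numpy as np

    rng = np.random.default_rng(seed=42)
    SAMPLE_SIZE = 20  # small, illustrative training sets make the drift visible

    # Generation 0 trains on "human data" from the true distribution N(0, 1)
    data = rng.normal(loc=0.0, scale=1.0, size=SAMPLE_SIZE)

    for generation in range(1, 51):
        # "Train" the model: estimate mean and spread from the current data
        mu, sigma = data.mean(), data.std()
        if generation % 10 == 0:
            print(f"generation {generation:2d}: mean={mu:+.3f}, spread={sigma:.3f}")
        # The next generation sees only data sampled from the fitted model
        data = rng.normal(loc=mu, scale=sigma, size=SAMPLE_SIZE)

Real generative models are vastly more complex than this, but reported model-collapse experiments describe a qualitatively similar pattern, with rare, tail-of-the-distribution content tending to disappear first.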
Avoiding collapse

Can't big tech just filter out AI-generated content? Not really. Tech companies already spend a lot of time and money cleaning and filtering the data they scrape, with one industry insider recently sharing they sometimes discard as much as 90 per cent of the data they initially collect for training models.

These efforts might get more demanding as the need to specifically remove AI-generated content increases. But more importantly, in the long term it will actually get harder and harder to distinguish AI content. This will make the filtering and removal of synthetic data a game of diminishing (financial) returns.

Ultimately, the research so far shows we just can't completely do away with human data. After all, it's where the "I" in AI comes from.

Are we headed for a catastrophe?

There are hints developers are already having to work harder to source high-quality data. For instance, the documentation accompanying the GPT-4 release credited an unprecedented number of staff involved in the data-related parts of the project.

We may also be running out of new human data. Some estimates say the pool of human-generated text data might be tapped out as soon as 2026.

It's likely why OpenAI and others are racing to shore up exclusive partnerships with industry behemoths such as Shutterstock, Associated Press and News Corp. They own large proprietary collections of human data that aren't readily available on the public internet.

However, the prospects of catastrophic model collapse might be overstated. Most research so far looks at cases where synthetic data replaces human data. In practice, human and AI data are likely to accumulate in parallel, which reduces the likelihood of collapse.

The most likely future scenario will also see an ecosystem of somewhat diverse generative AI platforms being used to create and publish content, rather than one monolithic model. This also increases robustness against collapse.

It's a good reason for regulators to promote healthy competition by limiting monopolies in the AI sector, and to fund public interest technology development.

The real concerns

There are also more subtle risks from too much AI-made content.

A flood of synthetic content might not pose an existential threat to the progress of AI development, but it does threaten the digital public good of the (human) internet.

For instance, researchers found a 16 per cent drop in activity on the coding website Stack Overflow one year after the release of ChatGPT. This suggests AI assistance may already be reducing person-to-person interactions in some online communities.

Hyperproduction from AI-powered content farms is also making it harder to find content that isn't clickbait stuffed with advertisements.

It's becoming impossible to reliably distinguish between human-generated and AI-generated content. One method to remedy this would be watermarking or labelling AI-generated content, as I and many others have recently highlighted, and as reflected in recent Australian government interim legislation.

There's another risk, too. As AI-generated content becomes systematically homogeneous, we risk losing socio-cultural diversity, and some groups of people could even experience cultural erasure. We urgently need cross-disciplinary research on the social and cultural challenges posed by AI systems.

Human interactions and human data are important, and we should protect them. For our own sakes, and maybe also for the sake of the possible risk of a future model collapse.

Aaron J. Snoswell is a Research Fellow in AI Accountability at Queensland University of Technology. This piece first appeared on The Conversation.