AI model collapse isn’t theoretical. It’s what happens when you train models on the outputs of other models or on synthetic data meant to mimic real behavior. The result is subtle at first: fewer edge cases, weaker predictions, outputs that regress toward the average. Over time, the model stops producing realistic results, and the decline is surprisingly fast. Only real-world data carries the complexity and entropy that anchor model weights in truth. The challenge is to bring high-fidelity, real-world signals into AI training workflows without exposing private data. At its core, this is a data logistics problem that can’t be faked or sidestepped without noticeable degradation. A robust AI infrastructure depends on data operations that enable de-identified dataflows at scale so that models resist collapse. #AIModelCollapse #RealWorldData #DataFidelity #PrivacyTech #Karlsgate
Understanding Model Collapse in Artificial Intelligence
Explore top LinkedIn content from expert professionals.
Summary
Model collapse in artificial intelligence refers to the decline in AI performance when models are trained on data generated by other AI systems, instead of diverse, real-world data. Over time, this leads to a loss of accuracy, creativity, and ability to handle complex scenarios, akin to a copy of a copy losing quality.
- Prioritize real-world data: Ensure your AI models are trained on authentic, diverse, and human-generated data to maintain accuracy and avoid losing important nuances.
- Monitor training workflows: Implement systems to track the quality and origin of training data and prevent over-reliance on synthetic or AI-generated content.
- Incorporate human oversight: Actively involve human reviewers in the data preparation and model evaluation process to catch subtle errors and guide quality improvement.
-
AI models are at risk of degrading in quality as they increasingly train on AI-generated data, leading to what researchers call "model collapse." New research published in Nature reveals a concerning trend in AI development: as AI models train on data generated by other AI, their output quality diminishes. This degradation, likened to taking photos of photos, threatens the reliability and effectiveness of large language models. The study highlights the importance of using high-quality, diverse training data and raises questions about the future of AI if the current trajectory continues unchecked.
🖥️ Deteriorating Quality with AI Data: Research indicates that AI models progressively degrade in output quality when trained on content generated by preceding AI models, a cycle that worsens with each generation.
📉 The Phenomenon of Model Collapse: Described as the process by which AI output becomes increasingly nonsensical and incoherent, "model collapse" mirrors the loss seen in repeatedly copied images.
🌐 Critical Role of Data Quality: High-quality, diverse, human-generated data is essential to maintaining the integrity and effectiveness of AI models and preventing the degradation observed when they rely on synthetic data.
🧪 Strategies for Mitigating Degradation: Measures such as allowing models to access a portion of the original, high-quality dataset have been shown to reduce some of the adverse effects of training on AI-generated data (a code sketch of this idea follows below).
🔍 Importance of Data Provenance: Establishing robust methods to track the origin and nature of training data (data provenance) is crucial for ensuring that AI systems train on reliable and representative samples, which is vital for their accuracy and utility.
#AI #ArtificialIntelligence #ModelCollapse #DataQuality #AIResearch #NatureStudy #TechTrends #MachineLearning #DataProvenance #FutureOfAI
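The mitigation flagged above, letting each training generation keep seeing a slice of the original, high-quality corpus, can be expressed as a small data-pipeline step. The sketch below is a minimal illustration under my own assumptions; the function name, the 30% real-data fraction, and the corpus variables are placeholders, not details taken from the Nature study.

```python
import random
from typing import List

def build_training_mix(original_docs: List[str], synthetic_docs: List[str],
                       real_fraction: float = 0.3, size: int = 10_000,
                       seed: int = 0) -> List[str]:
    """Compose one generation's training set while always retaining a fixed
    share of the original human-written corpus alongside newer synthetic text."""
    rng = random.Random(seed)
    n_real = min(int(size * real_fraction), len(original_docs))
    n_synth = min(size - n_real, len(synthetic_docs))
    mix = rng.sample(original_docs, n_real) + rng.sample(synthetic_docs, n_synth)
    rng.shuffle(mix)
    return mix
```

Each new generation would then train on a mix like this rather than on synthetic output alone, which is the knob the study reportedly found softens, though does not eliminate, the degradation.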
-
AI-Cannibalism. At least this is what I'm calling it now. This concept is not new, but I'm glad it is getting more coverage, this week from The Wall Street Journal (https://lnkd.in/d3RFm8DR), and earlier this year the Financial Times wrote a great piece about this too (https://lnkd.in/d7f6A_H2). A few thoughts:
1) #ModelCollapse is the situation where #AI-generated output is fed back in to train the next generation of AI models. After enough generations, the model 'collapses' and outputs only complete gibberish. I like to call this AI-Cannibalism, but it has also been referred to as the AI Ouroboros.
2) A Nature paper from July, whilst not the first to point this out, provides a mathematically intuitive rationale for why this happens (https://lnkd.in/daRQfV_F). The authors argue that model collapse is inevitable: 'We argue that the process of model collapse is universal among generative models that recursively train on data generated by previous generations.' The basic intuition is as follows: probable events are over-estimated (think of the racially biased problems in image generation: https://lnkd.in/dD_4nHew), and improbable (but still real) events are under-estimated (think of the viral article about AI being unable to generate an image of an Asian man and a white woman: https://lnkd.in/dQCGuRUi). Repeating this process again and again creates an "AI-style echo chamber" that filters out real, though less probable, events and leaves only the most likely ones. It's a bit like the genetic risks of incest over multiple generations, where rare recessive gene disorders, though initially unlikely, become increasingly probable. If you rely on #syntheticdata to generate additional AI training data, you should think very hard about the mathematical viability of your solution. (A toy simulation of this tail-loss effect follows below.)
3) Long term, this means that all the activity around the #IP and #copyright debate, such as The New York Times and Dow Jones court cases, plays a critical role in the future of AI. If we want to improve our models in the medium to long term, we must allow humans to continue creating. We must incentivize our species to continue to write and paint and produce. I have a strong desire to advance AI technology as we move closer to #AGI, but I feel like my industry isn't giving enough thought to the human creators behind it all. And by creators, I mean that in both senses of the word: the artists, writers, and coders on one hand, and on the other, the very Creators of AI itself, because in the end, we're one and the same.
#genAI #chatgpt Eigen Technologies Sirion
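The intuition in point 2, that common patterns get amplified while rare ones get filtered out, is easy to see in a toy simulation. The sketch below is purely illustrative (the vocabulary, probabilities, and sample sizes are my own assumptions, not figures from the paper): each generation refits a simple word-frequency model to samples drawn from the previous generation's model, and the long tail of rare words steadily disappears.

```python
import random
from collections import Counter

def tail_loss_demo(generations=10, n_samples=2000, seed=7):
    """Toy model of recursive training on your own outputs.
    Start with a vocabulary where a few words are common and many are rare.
    Each generation: draw samples from the current model, then refit the
    model to those samples. Any rare word that fails to appear in a sample
    is gone for good, so the distribution narrows generation by generation."""
    rng = random.Random(seed)
    # "Real world": 5 common words plus a long tail of 500 rare ones.
    probs = {f"common_{i}": 0.15 for i in range(5)}
    probs.update({f"rare_{i}": 0.25 / 500 for i in range(500)})

    for gen in range(generations):
        words, weights = zip(*probs.items())
        sample = rng.choices(words, weights=weights, k=n_samples)
        counts = Counter(sample)
        # Refit: the next generation's model is the empirical distribution.
        probs = {w: c / n_samples for w, c in counts.items()}
        print(f"gen {gen:2d}: vocabulary size = {len(probs)}")

if __name__ == "__main__":
    tail_loss_demo()
```

Because a word that fails to appear in one generation's sample gets probability zero forever after, the vocabulary can only shrink; that absorbing behavior is the toy analogue of losing rare-but-real events.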
-
A groundbreaking study in Nature reveals a critical challenge for AI development: AI models trained on AI-generated content begin to "collapse," similar to how making copies of cassette tapes leads to quality degradation.
Think back to the days of cassette tapes: when you made a copy of a copy of a copy, each generation lost some of the original audio quality. By the 4th or 5th copy, the music would become noticeably distorted and muffled.
The researchers found that AI models face a similar problem. When new AI models are trained on content generated by previous AI models (instead of human-created content), they lose important information and nuances, particularly rare or unusual examples. The AI's outputs become increasingly distorted from reality with each generation, just like those tape copies.
Why does this matter? As AI-generated content floods the internet, future AI models trained on this data may become less capable of understanding and representing the full spectrum of human knowledge and expression. The study suggests that maintaining access to original, human-generated content will be crucial for developing better AI systems.
The researchers' conclusion is clear: just as audiophiles kept original recordings to maintain quality, we must preserve and prioritize human-generated content to ensure AI systems continue learning and accurately representing our world.
What do you think? Link to study in the comments.
#ArtificialIntelligence #MachineLearning #Technology #DataScience #Research
-
A debate is quietly reshaping how we think about reasoning in LLMs, and it has real implications for how we build AI systems today.
In 𝗧𝗵𝗲 𝗜𝗹𝗹𝘂𝘀𝗶𝗼𝗻 𝗼𝗳 𝗧𝗵𝗶𝗻𝗸𝗶𝗻𝗴, recently published by Apple, researchers tested reasoning-augmented LLMs on structured problems like Tower of Hanoi, River Crossing, and Blocks World. The results were sharp. As task complexity increased, even models trained for reasoning began to fail. Performance dropped, not just in output quality, but in the effort models applied to thinking. The conclusion: reasoning in LLMs may appear to exist on the surface, but it collapses when deeper, compositional logic is required. They argue that we should not mistake verbal fluency for true reasoning capability.
A recent response, 𝗧𝗵𝗲 𝗜𝗹𝗹𝘂𝘀𝗶𝗼𝗻 𝗼𝗳 𝘁𝗵𝗲 𝗜𝗹𝗹𝘂𝘀𝗶𝗼𝗻 𝗼𝗳 𝗧𝗵𝗶𝗻𝗸𝗶𝗻𝗴, offers a different angle. The authors do not dispute that models fail on some of these tasks. But they show that many of those failures are a result of poor task design. Some models were asked to generate outputs that exceeded their token limits. Others were penalized for correctly stating that a task had no solution. When tasks were reframed more realistically, such as asking the model to generate an algorithm instead of enumerating every step, models performed well (the sketch below shows why that reframing matters). Their conclusion is that what looks like reasoning failure is often a mismatch between evaluation expectations and what the model is actually being asked to do.
Taken together, these papers provide a much-needed framework for thinking about when LLMs and reasoning-focused models (LRMs) are useful and where they are not. For simple tasks like summarization, retrieval, or classification, classic LLMs work well. They are fast, general, and effective. Adding reasoning often adds cost and confusion without improving performance. For medium-complexity tasks like applying policy logic, referencing context, or handling multi-turn interactions, LRMs offer clear value. Their planning ability, when structured well, improves accuracy and consistency. For complex tasks like symbolic reasoning, recursive planning, or solving puzzles with deep constraints, both LLMs and LRMs fail more often than they succeed. They either give up early, apply shallow logic, or lose coherence midway. These tasks require additional architecture: modular agents, memory-aware execution, or fallback control.
Take contact center automation as an example. For routine account questions, classic LLMs may suffice. For dynamic policy explanation or billing disputes, LRMs can help. For high-stakes calls involving eligibility, compliance, or contract negotiation, more structure is required. But this is just one example.
The bigger lesson is this. We should stop assuming reasoning scales cleanly with model size or prompt complexity. It does not. Reasoning has limits, and those limits depend on how we frame the task, what we ask the model to output, and how we measure success.
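To make the token-limit point concrete, here is a quick illustration (the code is mine, not from either paper): the full solution to Tower of Hanoi with n disks has 2^n - 1 moves, so asking a model to print every move for a large n is an output-length problem, while the algorithm that generates those moves fits in a few lines.

```python
def hanoi(n, source="A", spare="B", target="C"):
    """Yield every move needed to transfer n disks from source to target.
    The algorithm is tiny, but the move list it produces has 2**n - 1 entries."""
    if n == 0:
        return
    yield from hanoi(n - 1, source, target, spare)  # park n-1 disks on the spare peg
    yield (source, target)                          # move the largest disk
    yield from hanoi(n - 1, spare, source, target)  # bring the n-1 disks onto it

if __name__ == "__main__":
    for n in (3, 10, 20):
        moves = sum(1 for _ in hanoi(n))
        print(f"{n:2d} disks -> {moves:,} moves (2**{n} - 1 = {2**n - 1:,})")
```

For 20 disks that is over a million moves; enumerating them verbatim exhausts any output budget long before the model's "reasoning" is tested, which is exactly the response paper's objection.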
-
The AI industry might be poisoning itself, and nobody wants to talk about it. Since #ChatGPT blew up in 2022, companies have rushed to train new AI models on fresh internet data. But here’s the problem: a lot of that “new” internet content is already written by AI. So when AI models train on AI-generated content, they’re learning from machines, not real people. Think of it like copying someone’s homework when that person already copied someone else’s bad homework.
This creates what some experts call model collapse: AIs start to get worse because they’re learning from junk instead of real, high-quality, human-created information. To fix it, companies are turning to #RAG, which lets models look things up online instead of relying only on what they were trained on (a rough sketch of that loop follows below). Sounds smart, but not really. The internet is now packed with low-effort, AI-written junk. So when the model “retrieves” information, it often finds bad answers and then gives you those same bad answers in a confident tone. The fix might actually be making the problem worse.
Honestly, the only thing that keeps this whole system from spiraling is a bit of good old-fashioned human judgment. There is plenty of evidence pointing the same way:
🔹 Meta's $15B investment in human data,
🔹 Andrej Karpathy on 'keeping AI on a tight leash',
🔹 Ali Ghodsi on how hard full automation is and the need for human supervision.
At SuperAnnotate, we’ve seen how much of a difference it makes when #humans are part of the loop: reviewing data, checking outputs, guiding quality. Because if AI is only learning from itself, someone has to break the loop, or we just keep training tomorrow’s models on yesterday’s mistakes.
#AI #data #HumanInTheLoop #SyntheticData #ModelCollapse
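For readers who have not worked with it, the RAG pattern mentioned above boils down to "retrieve some documents, paste them into the prompt, then generate." The sketch below is a deliberately naive toy version under my own assumptions (the keyword-overlap retriever, the sample corpus, and the prompt template are illustrative, and the actual LLM call is left out); it shows why the pattern inherits the quality of whatever it retrieves.

```python
from typing import List

def retrieve(query: str, corpus: List[str], k: int = 3) -> List[str]:
    """Naive retriever: rank documents by word overlap with the query.
    Real systems use vector search, but the failure mode is the same: if the
    corpus is full of low-quality AI-written text, that is what gets retrieved."""
    q = set(query.lower().split())
    return sorted(corpus, key=lambda doc: -len(q & set(doc.lower().split())))[:k]

def build_prompt(query: str, passages: List[str]) -> str:
    """Stuff retrieved passages into the prompt; the model answers from them."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

corpus = [
    "Nature study: model collapse happens when models train on recursive AI output.",
    "AI-written blog claiming model collapse is fake news and nothing to worry about.",
    "Unrelated press release about a product launch.",
]
print(build_prompt("What is model collapse?", retrieve("model collapse", corpus, k=2)))
# The generation step (calling an LLM on this prompt) is intentionally omitted:
# whatever model you call, its answer can only be as good as the context above.
```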
-
🔍 The Evidence of an Emergent Data Winter Keeps on Growing ❄️
New research highlights a significant challenge for #AI: the use of computer-generated data to train models can lead to nonsensical results, suggesting a looming "data winter."
👉 Major AI companies like OpenAI and Microsoft are exploring "synthetic" data as they reach the limits of human-generated #data. 🤔 However, research published in Nature suggests this approach could degrade AI models rapidly.
Key findings:
✅ Synthetic data quickly leads to errors;
✅ AI models can collapse over time due to accumulating mistakes, losing variance, and producing gibberish;
✅ Problems worsen when synthetic data is used recursively, leading to repetitive and erroneous outputs.
👉 Mitigation efforts, such as embedding "watermarks" to flag AI-generated content, require significant coordination among tech companies (a sketch of how such a flag could feed a training-data filter follows below).
👉 There's also a first-mover advantage for companies using pre-AI internet data, as their models better represent the real world.
💻 Read my initial piece introducing the emergence of a Data Winter here: https://lnkd.in/eE7KYT-5
➡️ This requires the sector to establish new Data Commons for the AI age > Read our piece on the 10 areas where we need to innovate toward establishing Data Commons that balance innovation and prevent "the tragedy of the commons": https://lnkd.in/egr4ZTD2 - We will explore this further over the next few months - if of interest, let me know!
💻 See the Financial Times article, "The problem of ‘model collapse’: how a lack of human data limits AI progress": https://lnkd.in/eE44_5SE
💻 See the Nature paper, "AI models collapse when trained on recursively generated data": https://lnkd.in/e4uX3Gxr
#AI #artificialintelligence #DataWinter #SyntheticData #MachineLearning #Research
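The watermarking and provenance ideas above imply a concrete pipeline step: before training, drop or cap documents flagged as AI-generated. The snippet below is only a schematic sketch; the record fields, the `is_ai_generated` flag, and the 10% cap are hypothetical placeholders, since no standard watermark-detection API exists today.

```python
from typing import Dict, List

def filter_training_corpus(records: List[Dict], max_synthetic_share: float = 0.1) -> List[Dict]:
    """Keep human-provenance documents and cap AI-flagged ones at a small share.
    `is_ai_generated` stands in for whatever watermark or provenance signal
    the ecosystem eventually agrees on."""
    human = [r for r in records if not r.get("is_ai_generated", False)]
    synthetic = [r for r in records if r.get("is_ai_generated", False)]
    cap = int(len(human) * max_synthetic_share)
    return human + synthetic[:cap]

corpus = [
    {"text": "A reporter's original article.", "is_ai_generated": False},
    {"text": "A model-written rewrite of that article.", "is_ai_generated": True},
    {"text": "A forum answer from a practitioner.", "is_ai_generated": False},
]
print(len(filter_training_corpus(corpus)), "documents kept for training")
```

The hard part is not this filter but the coordination the post points to: the flag only exists if model providers embed and honor watermarks consistently.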