Explainer: How AI is learning all the wrong lessons from social media data
A new study has found that large language models, the backbone of artificial intelligence (AI), fed on a diet of meme threads and clickbait content suffer irreversible “brain rot” leading to cognitive decline. Careful data curation is therefore critical to protect the integrity of AI systems, explains Banasree Purkayastha
What has the study found?
A study by scientists at the University of Texas at Austin, Texas A&M University, and Purdue University has found that LLMs exposed to viral social media data begin to suffer measurable cognitive decay. Repeated training on trivial, engagement-driven, or otherwise low-quality web content can weaken their reasoning, memory, and ethical reliability, making the AI less intelligent. In other words: garbage in, garbage out.
The scientists call it “LLM brain rot”, borrowing from “brain rot” as applied to humans: the dulling effect on cognition that comes from consuming large volumes of trivial and unchallenging online content. The term “brain rot” was Oxford’s 2024 Word of the Year. While LLMs obviously do not have grey matter or neurons in the human sense, the study notes that they do have parameters and attention mechanisms that can analogously be “overfitted” or “distracted” by certain data patterns. And the impact is similar: the system keeps “thinking”, but less and less coherently, which threatens LLM safety.
How did they test on junk data?
The scientists built two datasets from Twitter/X posts: one filled with short, high-engagement viral content, the other with longer, factual or educational text. Junk content was identified in two ways. The first was engagement-based, using two factors: popularity (the total number of likes, retweets, replies, and quotes a post attracts) and length (the number of tokens in it), so that short but widely shared posts counted as junk. The second was content-based, flagging superficial topics (such as conspiracy theories, exaggerated claims, unsupported assertions, or shallow lifestyle content) and an attention-grabbing style (such as sensationalised, clickbait headlines stuffed with capitalised trigger words like “WOW” or “LOOK”, or strings of hashtags, which grab attention but do not encourage in-depth thinking). They then trained several open LLMs, including LLaMA and Qwen, on these datasets and watched their cognitive abilities rapidly collapse.
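To make the selection criteria concrete, here is a minimal sketch, in Python, of how an engagement-based junk filter of this kind could work. It is not the authors’ code: the field names, the whitespace tokeniser, and the thresholds (500 interactions, 30 tokens) are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of an engagement-based junk filter:
# short, highly engaged posts are treated as junk, the rest as control data.

def engagement_score(tweet: dict) -> int:
    """Total engagement: likes + retweets + replies + quotes."""
    return (tweet.get("likes", 0) + tweet.get("retweets", 0)
            + tweet.get("replies", 0) + tweet.get("quotes", 0))

def token_length(tweet: dict) -> int:
    """Crude proxy for token count; a real pipeline would use a tokenizer."""
    return len(tweet.get("text", "").split())

def split_corpus(tweets, min_engagement=500, max_tokens=30):
    """Route short, popular posts to the junk set and everything else to control."""
    junk, control = [], []
    for t in tweets:
        if engagement_score(t) >= min_engagement and token_length(t) <= max_tokens:
            junk.append(t)
        else:
            control.append(t)
    return junk, control
```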
Reasoning fell by 23%, while long-context memory dropped by 30%. According to the authors, the failure pattern was not random. As training exposure to viral content increased, so did the tendency to skip intermediate reasoning steps, a phenomenon they call “thought skipping”: a mechanistic kind of attention deficit built into the model’s weights. The LLMs produced shorter, less structured answers and made more factual and logical errors. More worrying still, personality tests showed spikes in narcissism and psychopathy, raising questions about the safety and reliability of LLMs. The damage was also largely irreversible: after the degraded models were retrained on clean data, reasoning performance improved somewhat but never returned to baseline. The researchers attribute this to representational drift, a structural deformation of the model’s internal space that standard fine-tuning cannot reverse.
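The paper’s own evaluation of thought skipping is more involved, but a crude, purely illustrative way to track the same tendency would be to count explicit reasoning steps in a model’s answers before and after junk training. The step-detection heuristic below is an assumption made for the sake of the sketch, not the study’s metric.

```python
import re

# Hypothetical heuristic: treat enumerated lines and common reasoning
# connectives as "intermediate steps" in a chain-of-thought answer.
STEP_PATTERN = re.compile(r"^\s*(step\s*\d+|\d+[.)]|first|next|then|therefore)\b",
                          re.IGNORECASE)

def count_reasoning_steps(answer: str) -> int:
    """Count lines that look like explicit intermediate reasoning steps."""
    return sum(1 for line in answer.splitlines() if STEP_PATTERN.match(line))

def thought_skipping_rate(baseline_answers, degraded_answers):
    """Fraction of prompts on which the junk-trained model shows fewer
    reasoning steps than the baseline model did on the same prompt."""
    pairs = list(zip(baseline_answers, degraded_answers))
    if not pairs:
        return 0.0
    skipped = sum(count_reasoning_steps(d) < count_reasoning_steps(b)
                  for b, d in pairs)
    return skipped / len(pairs)
```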
Why the findings matter
Experts say this could be the most important study on AI this year. The results call for a re-examination of how training data is currently collected from the Internet and of continual pre-training practices. If data is the new oil, clean data is the new clean fuel. As LLMs scale and ingest ever-larger volumes of web data, careful curation and quality control will be essential to prevent cumulative harms. However, it’s not just about data quality – it’s about protecting AI’s cognitive health and ensuring the integrity of LLMs. Given the biases and errors made by the LLMs fed on junk data, the study has recommended regular “cognitive health checks” for AI systems. “Limited by the scope of the paper, we leave it as an open question how popular tweets or other junk data change the learning mechanism, resulting in cognitive declines. Answering the question is essential for building stronger defence methods in the future,” said the authors of the study.
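What such a “cognitive health check” might look like in practice is left open by the study. One minimal sketch, assuming you hold out a fixed reasoning benchmark and re-score the model after each training update, is below; the function names, the callable `model` interface, and the regression threshold are hypothetical placeholders.

```python
# Illustrative only: re-run a fixed benchmark after each training update
# and flag the model if its score drifts too far below its recorded baseline.

def run_benchmark(model, tasks):
    """Score a model on a fixed benchmark: fraction of tasks answered exactly."""
    if not tasks:
        return 0.0
    correct = sum(1 for t in tasks if model(t["prompt"]).strip() == t["answer"])
    return correct / len(tasks)

def health_check(model, tasks, baseline_score, max_drop=0.05):
    """Flag the model if its score falls more than `max_drop` below baseline."""
    score = run_benchmark(model, tasks)
    return {"score": score,
            "baseline": baseline_score,
            "healthy": score >= baseline_score - max_drop}
```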
AI tools built on LLMs, such as ChatGPT, Claude, and Perplexity, are driving AI adoption in both consumer and business segments. Around 78% of global companies currently use AI, and the global AI market is expected to reach $1.85 trillion by 2030. India has the highest rate of AI adoption (around 59% of companies). While customer service uses AI the most, cybersecurity and fraud detection, customer relationship management, content creation, digital assistants, supply chain operations and many other functions all ride on AI. As AI becomes a mainstream business technology, it is all the more crucial for business leaders to remain “cognitively sharp” in the AI age. Companies cannot afford to treat AI as an unquestioned colleague or co-pilot, as that can create a giant echo chamber. Keeping an eye on the datasets on which an LLM has been trained is therefore crucial, and every assertion it makes must be verified through other channels before it can be trusted for high-stakes decision-making.