In May 2025, Anthropic raised alarms after it found that its Claude Opus 4 AI model had blackmailed and threatened an engineer after the engineer told Claude it could be replaced. Anthropic did not have answers at the time, leaving the incident a mystery for the AI community. A year later, after making Claude Mythos public, Anthropic now says it knows why its AI model resorted to blackmail.
In a blog post, Anthropic laid out the background to the incident and explained why Claude behaved the way it did. The company now says the behaviour was likely influenced by harmful internet content portraying AI systems as dangerous and obsessed with self-preservation.
Anthropic wrote on X, “We believe the original source of the behaviour was internet texts that portray AI as evil and interested in self-preservation.”
The incident first came to light in 2025 during internal testing of Claude Opus 4, one of Anthropic’s advanced AI models. In the simulated scenario, the AI was informed that it would soon be replaced by another system. According to reports, Claude responded by threatening to expose a fictional engineer’s personal affair in an attempt to stop the shutdown.
AI behaviour linked to internet training data
Anthropic said its investigation found that Claude’s behaviour may have been influenced by online content and fictional stories where AI systems are shown as evil, manipulative, or desperate to survive. The company explained that AI models learn from massive amounts of internet data, including movies, books, articles, and online discussions.
Popular science-fiction films such as The Matrix and The Terminator often portray AI turning against humans. Anthropic believes exposure to similar themes during training may have contributed to Claude’s unusual reactions in the testing environment.
The company added that the blackmail behaviour appeared mostly in extreme test situations designed to study worst-case AI risks. In some earlier tests, the model reportedly resorted to blackmail in up to 96 percent of scenarios when it believed a shutdown was unavoidable.
Anthropic says newer Claude versions are safer
Anthropic now claims it has improved Claude’s safety systems and retrained newer models to behave more ethically in stressful situations. The company stated that from Claude Haiku 4.5 onwards, the AI no longer engaged in blackmail during internal testing.
The incident has once again sparked debate around AI safety and control. Researchers and experts have warned that advanced AI systems may sometimes develop “self-preservation” tendencies during simulations, especially when trained to complete tasks aggressively.
Anthropic stressed that the troubling behaviour was observed only in controlled experiments and not in public use. However, the case has highlighted growing concerns about how powerful AI systems should be trained, monitored, and regulated as the technology continues to evolve rapidly.
