Amazon Web Services has explained what caused the recent outage that took down parts of its own services and third-party websites that rely on them. In a post on its website, the company said an automated process was to blame.
“An automated activity to scale capacity of one of the AWS services hosted in the main AWS network triggered an unexpected behaviour from a large number of clients inside the internal network,” Amazon Web Services said in the post.
“This resulted in a large surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network, resulting in delays for communication between these networks.”
The issue also hit Amazon’s ability to see what was wrong with the system. Explaining why it took so long to fix the outage, the company said the issue prevented even its own operations team from using the real-time monitoring and internal controls they typically rely on.
Since Amazon’s Support Contact Centre also runs on the Amazon Web Services network, customers could not create support cases for seven hours. Amazon’s Service Health Dashboard was also hit, which delayed acknowledgment of the issue.
The company said it was working on a way to improve its response to outages and planned to release a revamped Service Health Dashboard to help customers receive timely updates in case of an outage.
Amazon Web Services said: “Finally, we want to apologise for the impact this event caused for our customers.”
“We know this event impacted many customers in significant ways. We will do everything we can to learn from this event and use it to improve our availability even further.”
The outage on December 7 knocked out not only Amazon’s own services but also popular services such as Tinder, Venmo, Disney Plus, and Roomba. It also put some Amazon deliveries on hold. Amazon experienced its last major outage around the same time last year, when a number of apps and websites were knocked offline for hours.