A Cloud break we never saw coming

An AWS Cloud outage caused chaos across the world, disrupting services from beds to banks. What could be the solution?

AWS Outage Exposes Single Point of Failure and Demands a Multicloud Strategy

An Amazon Web Services (AWS) outage last week set off a global domino effect, with over 2,000 companies under the AWS umbrella losing functionality for a couple of hours. Because AWS is the biggest Cloud provider, with nearly 30% of the market, the lengthy outage caused collective panic and brought home the possibility that the seemingly omnipresent ‘Cloud’ can also malfunction.

The crash disrupted businesses that work with AWS, affecting customer experiences worldwide in everything from messaging apps to temperature-controlled smart beds. Eight Sleep, the manufacturer of the smart beds, said in a statement that it will look into developing a Bluetooth option to counter future malfunctions. Of the three malfunctions, the most pertinent involved DynamoDB, an AWS database service on which many customers and other AWS services depend, and the DNS (Domain Name System) records used to reach it. For hours, users were confounded by glitches on the apps and platforms they use daily for financial transactions, mobility, food orders, communication, bookings and more.

An event like this is a reality check on the hold the internet and technology have over our lives, and on our need for smooth, uninterrupted services. It is mind-boggling to imagine that users’ sleep was disturbed because their phones lost connection with their beds as a result of this outage! Until now, this was the stuff of movies. A single point of failure on the internet has revealed just how many of our basic daily tasks are not merely aided by, but dependent on, the internet, artificial intelligence and technology. Since the AWS outage, experts have argued that large companies should stop relying on a single Cloud giant and move instead to a multicloud structure, using alternative platforms and solutions specific to their IT portfolios or to the mission of the business, company or customer.

David Linthicum, a pioneer in Cloud computing and an author, wrote that since Cloud and AI technologies are progressing so rapidly, “people tend to prefer adoption and optimisation over potential vulnerabilities.”

Only a handful of dominant Cloud platforms operate at present, including AWS, Microsoft Azure and Google Cloud, and these are preferred by most businesses and companies worldwide. In his article, Linthicum argues against the “reliance on big Cloud” and encourages diversifying platforms through a ‘multicloud strategy’.

The outage was mitigated within hours, but its effects lasted longer: queued requests, interdependent workloads, delayed recoveries and more.

The unexpected outage also revealed that no failsafes or contingencies had been put in place for an incident of this nature and scale. Apart from the automatic repair, all users under the enormous umbrella of the AWS DNS management system were doomed to suffer the consequences until manual repairs took hold. IT experts commenting on the issue have raised similar points, suggesting that exhaustive failsafes, DNS resilience, preparation for endpoint outages and more should all be part of building a Cloud system like AWS’s.

Now the question is, when will corrective and cautionary steps be taken to avoid the risk of a Cloud outage in the future, and will it ever be enough?

IT experts have suggested ways to avert such a fallout in future, while maintaining that this eventuality should have been anticipated and addressed long before the outage occurred: exhaustive failsafes should already have been in place for an event of this nature. Moreover, the overdependence of all these platforms on a handful of service providers is itself a key reason the fallout reached this scale. In a nutshell, the answer is to avoid a ‘single point of failure’ by spreading workloads over more than one data centre, backed by strong monitoring and several automated failsafes.
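
For readers who want to see what an automated failsafe might look like in practice, here is a minimal Python sketch of the idea of spreading requests over more than one provider. It is an illustration only, not how AWS or any platform named in this article works: the provider names ("primary-cloud", "secondary-cloud") and the handler functions are hypothetical stand-ins for real network calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple


class AllProvidersFailed(Exception):
    """Raised when every configured provider fails to serve the request."""


@dataclass
class Provider:
    name: str                      # hypothetical label, e.g. "primary-cloud"
    handler: Callable[[str], str]  # stand-in for a real network call


def call_with_failover(providers: Sequence[Provider], request: str) -> Tuple[str, str]:
    """Try each provider in order; return (provider name, response) from the first success."""
    errors: List[str] = []
    for provider in providers:
        try:
            return provider.name, provider.handler(request)
        except Exception as exc:  # a real system would catch narrower error types
            errors.append(f"{provider.name}: {exc}")
    # Every provider failed, so report all the errors together.
    raise AllProvidersFailed("; ".join(errors))


def failing_handler(request: str) -> str:
    # Simulates an endpoint that cannot be reached, as in a DNS failure.
    raise ConnectionError("endpoint could not be resolved")


def healthy_handler(request: str) -> str:
    # Simulates a provider that is working normally.
    return f"ok: {request}"


providers = [
    Provider("primary-cloud", failing_handler),
    Provider("secondary-cloud", healthy_handler),
]
served_by, response = call_with_failover(providers, "get-order-42")
print(served_by, response)
```

In this sketch the request falls through to the second provider when the first cannot be reached. A production setup would also need the monitoring the experts mention, along with data replicated across providers, which this example does not attempt.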

Beyond the dominant providers, there are other viable options that can keep data secure and under tighter control of the platform using it. For instance, platforms dealing with finance, healthcare and government can use sovereign Cloud providers, which are governed by local regulations and follow strict data sovereignty rules.

Co-location providers are also a viable and often more cost-efficient option. They rent out space in which companies or businesses house their own servers and other IT hardware. These rented spaces are equipped with the power, cooling systems and security required for a functioning data centre. Alternatively, a third-party managed service provider (MSP) can be hired to handle a company’s data in line with that company’s own rules.

This article was first uploaded on November 1, 2025, at 6:34 pm.
