In late February 2017, a number of large websites across the internet abruptly went down.
Question-and-answer site Quora crashed, as did project management tool Trello, and Amazon’s artificial intelligence assistant Alexa also struggled.
The outage lasted several hours — and Amazon was to blame. This is because all the affected sites made use of Amazon Web Services (AWS), the cloud web hosting service from the Seattle-based technology giant that now underpins vast swathes of the modern web and hit $12 billion (£9.3 billion) in revenue last year.
The incident highlighted the unique vulnerabilities of our digital world: a handful of companies are responsible for maintaining huge swathes of the internet, and when there’s a problem with one of them, thousands of businesses and millions of people can be left unable to work.
So what happens inside Amazon when there’s a tech failure of this magnitude? Business Insider sat down with Werner Vogels, the chief technology officer of AWS, at the AWS Summit in London in late June to discuss how the company handles it.
“We are so, so aware of the fact for many businesses their livelihoods are dependent on Amazon operating, on AWS really operating well, and that’s a heavy responsibility,” he said. “We’re happy to take it.”
Step 1: Find the problem — and console the customers
“[The] first thing that happens is a load of alarms start going off even before your customers are experiencing something,” the Dutch-born executive explained.
The Amazon Web Services team then has two urgent tasks: Triage the problem and figure out just what’s going on, while trying to calm the freaking-out customers whose businesses have just gone offline.
“You see the symptoms, but you do not necessarily see the root cause of it … you immediately fire off a team whose task is to actually communicate with the customers … making sure that everyone is aware of exactly what is happening.”
Meanwhile, “internal teams of course immediately start going off and trying to find what’s the root cause of this is, and whether we can repair or restore it, or what other kinds of actions we can start taking.”
Vogels then dropped in a sly humble-brag: AWS goes down so rarely that when it does, it can be difficult to work out what’s going on because there’s little frame of reference. “Remember, this is a service that has not gone down in 12 years, so it’s not that … we could rely on some sort of previous experience on this.”
The time of day shouldn’t make a difference to repair efforts: AWS teams work “round the sun,” and there are always demanding customers expecting uptime, whether it’s late-night gaming in Seattle or early-morning financial services firms in Zurich.
If there’s a major outage, though, Vogels said “of course” he would expect to be woken up immediately, and the senior management team will continuously track developments.
Step 2: Fix it
The issue behind the fault in February? Human error. The short version is that an engineer typed the wrong number — causing a chain reaction that ultimately led to a major failure.
Once the problem is diagnosed, Amazon’s engineers have to fix it while ensuring other systems do not buckle under the sudden strain. “You have to sort of start protecting customers, start protecting system, because what happens is so many customers are still using this system, can’t get access to the system, and while you’re trying to repair this you’re still overwhelmed with customers that are still retrying and retrying and retrying.
“And so you then start to block the traffic to make sure the system can come back online and become healthy again before you can start accepting traffic again.”
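What Vogels describes here resembles what engineers often call load shedding: turn traffic away while a system restarts, then let it back in gradually. The following is a minimal sketch of that idea, not Amazon’s actual mechanism. The class `RecoveryGate`, its methods, and the `ramp_steps` parameter are all hypothetical names chosen for illustration.

```python
class RecoveryGate:
    """Hypothetical admission gate (illustrative only, not an AWS API).

    Sheds all traffic while a service restarts, then admits a growing
    share of requests each time the service reports a healthy check.
    """

    def __init__(self, ramp_steps=4):
        self.ramp_steps = ramp_steps  # steps from 0% to 100% admission
        self.step = ramp_steps        # start fully open
        self._counter = 0             # requests seen so far

    def start_recovery(self):
        """Block all traffic so the backend can restart without retry pressure."""
        self.step = 0

    def report_healthy(self):
        """Each healthy check opens the gate one step further."""
        self.step = min(self.step + 1, self.ramp_steps)

    def admit(self):
        """Return True if this request may proceed, False if it is shed."""
        slot = self._counter % self.ramp_steps
        self._counter += 1
        # Admit `step` out of every `ramp_steps` requests.
        # Deterministic on purpose, to keep the sketch easy to follow.
        return slot < self.step


gate = RecoveryGate(ramp_steps=4)
gate.start_recovery()                        # outage: shed everything
blocked = [gate.admit() for _ in range(4)]   # all False
gate.report_healthy()                        # first healthy check: 25% admitted
partial = [gate.admit() for _ in range(4)]   # exactly one True
```

A real system would pair a gate like this with clients that wait longer between retries, so that the “retrying and retrying and retrying” Vogels mentions does not arrive all at once.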
Throughout all of this, you have anxious customers seeking guidance. “Customers don’t like advice that says ‘sit still, don’t do anything.’ No, that’s not what they want, and for that you need to give them really good information, make them understand what’s happening, give an expectation of when the service will be coming back online if you have such information.”
Some of AWS’ big customers have systems and failsafes in place to try to anticipate these kinds of failures and prepare for them. Netflix has a system called Chaos Monkey, for example: “A whole set of tools to sort of simulate these extreme failures … they take away a whole availability zone or a whole region and see what happens, and things like that.”
But why a monkey? As Netflix previously explained: “The name comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables—all the while we continue serving our customers without interruption.”
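The idea of randomly shooting down instances can be shown with a toy simulation. This sketch is not Netflix’s tool, which acts on real cloud instances. It only models a fleet as a dictionary, and the names `unleash_monkey`, `can_serve`, and `web-0` through `web-4` are invented for the example.

```python
import random


def unleash_monkey(fleet, rng, kills=1):
    """Mark `kills` randomly chosen running instances as terminated.

    `fleet` maps a hypothetical instance name to True (running) or False (down).
    Returns the names of the instances that were shot down.
    """
    running = [name for name, up in fleet.items() if up]
    victims = rng.sample(running, min(kills, len(running)))
    for name in victims:
        fleet[name] = False
    return victims


def can_serve(fleet, min_healthy=1):
    """The simulated service stays up while enough instances are still running."""
    return sum(fleet.values()) >= min_healthy


# Five simulated web servers; the service needs at least three to keep serving.
fleet = {f"web-{i}": True for i in range(5)}
victims = unleash_monkey(fleet, random.Random(42), kills=2)
still_up = can_serve(fleet, min_healthy=3)   # True: three instances remain
```

The point of running such an exercise on purpose is to learn, before a real outage, whether the answer to `can_serve` stays true.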
Step 3: Learn from it
Vogels places the blame not on the engineer directly responsible, but on Amazon itself, for not having failsafes that could have protected its systems or prevented the incorrect input. “I think we can blame ourselves, in terms of not having turned this into sort of a procedure or something that was automated, where we could’ve had total good control over what the number could be.”
This is a key point for Vogels: As you grow and develop, introducing too many points that require human intervention results in points of possible failure. Where possible, automate.
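One way to picture having “control over what the number could be” is a check that refuses an unsafe request before anything is removed. The sketch below is an assumption about what such a guard might look like, not a description of Amazon’s tooling. The function `plan_removal`, the exception `UnsafeRemoval`, and the limits `min_capacity` and `max_batch` are hypothetical.

```python
class UnsafeRemoval(ValueError):
    """Raised when a removal request fails a safety check (hypothetical)."""


def plan_removal(active, requested, min_capacity, max_batch):
    """Return the servers to remove, or raise if the request looks unsafe.

    active:       list of server names currently in service
    requested:    how many servers the operator asked to remove
    min_capacity: the fleet must never drop below this many servers
    max_batch:    largest removal allowed in a single command
    """
    if requested <= 0:
        raise UnsafeRemoval("requested count must be positive")
    if requested > max_batch:
        raise UnsafeRemoval(
            f"asked to remove {requested}, but at most {max_batch} per command"
        )
    if len(active) - requested < min_capacity:
        raise UnsafeRemoval(
            f"removing {requested} would leave fewer than {min_capacity} servers"
        )
    return active[:requested]


servers = [f"s{i}" for i in range(10)]
to_remove = plan_removal(servers, 2, min_capacity=6, max_batch=3)  # ["s0", "s1"]
```

With these limits, a mistyped request for 20 servers raises `UnsafeRemoval` instead of taking capacity offline.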
“Internally it triggers a whole set of new operational procedures. The minimum thing you have to do from this is learn from it, understand really what are the things … realising there may be still organically growing operational procedures where there is too much human decision-making in the path which could be automated, and so you then go do a review of your overall business to see if there are other places in your organisation … where there might be operational vulnerabilities.”
The stakes are far higher for AWS and other cloud providers (Microsoft, Google, IBM, and so on) than for ordinary businesses, and the tolerance for major failure is much lower.
“I will never be satisfied until our services are what I call ‘indistinguishable from perfect,’” Vogels said. “Even though stuff happens and in this case it’s human, other things can happen, major natural disasters can happen, things like that. So we see we’re prepared for most of these kind of things and we help customers build architectures that can protect themselves from this as well.”
P.S. Here’s precisely what caused the February outage
In the aftermath of the outage in February, Amazon Web Services published a public postmortem explaining what went wrong and some of the changes it was making as a result. You can read the full thing here, and an extract is below:
“We’d like to give you some additional information about the service disruption that occurred in the Northern Virginia (US-EAST-1) Region on the morning of February 28th. The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region. This subsystem is necessary to serve all GET, LIST, PUT, and DELETE requests. The second subsystem, the placement subsystem, manages allocation of new storage and requires the index subsystem to be functioning properly to correctly operate. The placement subsystem is used during PUT requests to allocate storage for new objects. Removing a significant portion of the capacity caused each of these systems to require a full restart. While these subsystems were being restarted, S3 was unable to service requests. Other AWS services in the US-EAST-1 Region that rely on S3 for storage, including the S3 console, Amazon Elastic Compute Cloud (EC2) new instance launches, Amazon Elastic Block Store (EBS) volumes (when data was needed from a S3 snapshot), and AWS Lambda were also impacted while the S3 APIs were unavailable.”