AWS Meets the Hydra: One Error Brings Hours of Cloud Downtime

The word of the day is hydra, defined as a persistent or many-sided problem that presents new obstacles as soon as one aspect is solved.

On Feb. 28 a large chunk of the Eastern United States internet fell apart because Amazon Web Services shut itself down. I run most of my correspondence and all of our contracts through Grammarly; I first found out about the AWS mishap when I tried to use Grammarly for one of those contracts. For a couple of hours, I was greeted with a screen that in part read “Grammarly runs on Amazon Web Services, and they are currently experiencing an outage” complete with a cute graphic of a grammar robot recharging. I immediately thought to myself: why does Grammarly have this web page ready to pop up? It does not speak well of its relationship with AWS.

I have since found out the cause of the AWS outage: human error. It seems that a software engineer was tweaking part of the billing system and doing a bit of recoding. Amazon explains, “The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems.” It sounds like it went on and on and on.

The hydra raged for four long hours. If your company depended on the East Coast AWS infrastructure, you were out of luck. Your customers couldn’t care less that the fault lay with AWS.

What does this mean for the hosting company that is somewhat smaller than AWS? I quickly wrote a tweet that read: AWS has been down for 3 hrs causing havoc on the www nationwide. Host + mirror your services.

For several years I have consistently suggested that one way of competing with the likes of AWS is to do something they cannot do: be somewhere else. AWS customers need to mirror or back up their websites and data on an entirely different system and network. This is where you should step in. And the case is easy to make thanks to Amazon, which is doing its best to hide and sanitize the issue. AWS customers of any reasonable size should be running to your company for backup and data storage.
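For readers who want to picture what "mirror on a different network" means in practice, here is a minimal Python sketch of the idea, not a production failover design. It assumes the customer keeps an ordered list of endpoints (primary first, mirror second) and has some health check of their own. The names `first_healthy`, `is_healthy`, and the example endpoint labels are hypothetical and chosen only for illustration.

```python
from typing import Callable, Sequence


def first_healthy(endpoints: Sequence[str],
                  is_healthy: Callable[[str], bool]) -> str:
    """Return the first endpoint whose health check passes.

    `endpoints` is ordered by preference: primary first, mirrors after.
    `is_healthy` is a caller-supplied check (hypothetical; for example,
    an HTTP probe against the site).
    """
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # A health check that raises is treated as a failed check.
            continue
    raise RuntimeError("no healthy endpoint available")


# Illustration: the primary is down, so traffic goes to the mirror.
down = {"primary-on-aws"}
chosen = first_healthy(
    ["primary-on-aws", "mirror-at-independent-host"],
    lambda endpoint: endpoint not in down,
)
print(chosen)
```

The point of the sketch is the ordering: the mirror only helps if it sits on a system and network that does not share the primary's failure.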

Yes, Amazon, it was human error that set these dominoes in motion. Yes, you will claim you understand the problem and will not allow X amount of resources to be re-allocated at one time. Cross my heart, never, ever again.
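The kind of safeguard described above, a cap on how much can be taken out in one command, can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Amazon's actual tooling: the function name `plan_removal` and the limits `max_per_run` and `min_remaining` are hypothetical.

```python
def plan_removal(requested: int, fleet_size: int,
                 max_per_run: int = 5, min_remaining: int = 10) -> int:
    """Validate a request to remove servers before anything is touched.

    Hypothetical limits:
      max_per_run   - most servers one command may remove
      min_remaining - servers that must stay in service afterwards
    Returns the approved count, or raises ValueError on a bad request.
    """
    if requested < 0:
        raise ValueError("requested count cannot be negative")
    if requested > max_per_run:
        raise ValueError(
            f"refusing to remove {requested} servers; limit is {max_per_run} per run"
        )
    if fleet_size - requested < min_remaining:
        raise ValueError(
            f"removal would leave {fleet_size - requested} servers; "
            f"minimum is {min_remaining}"
        )
    return requested


# A small, intended removal passes; a mistyped large one is rejected.
print(plan_removal(2, fleet_size=100))
try:
    plan_removal(50, fleet_size=100)
except ValueError as err:
    print("blocked:", err)
```

A check like this does not remove human error. It only limits how far one mistyped input can reach.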

Who is to say the next hydra won’t last for days or even weeks?

There is a reason why Amazon, Google, Microsoft, Apple, and Salesforce each deliver new versions of operating systems, security patches, and software updates: the core product has problems and is subject to glitches, blunders, or human error.

Hosting owners, operators, and sales personnel: you have a marketing opportunity. One industry friend quickly asked if he could use my Twitter note. Of course. And to the reader: permission to reprint in whole is granted.

Now go out and build your business. The market is smaller than you think and will be much larger than you dream.

Later, Tom

Find out more about Tom Millitzer: Millitzer Capital FB

