You were trying to find your favorite puppy pictures on Imgur, only to be sorely disappointed by either a slowly responding page or a blank screen entirely. If this was your day on Feb. 28, 2017, you were not alone. That’s the day when storage buckets hosted in the US-East-1 region of Amazon Web Services (AWS) experienced some serious gremlins. Medium, Imgur, the Docker Registry Hub, and Yahoo webmail services all took a hit. I couldn’t even reach my Nest devices, a sign that many IoT technologies were either unavailable or severely impacted.
So, what happened? Was it a giant failure of technology? Or maybe a coordinated attack? No. Human error. A typo, to be specific.
Here’s the reality: this isn’t the first time this has happened, and it certainly won’t be the last. In June 2012, a powerful storm knocked out an entire Amazon-owned data center. What was hosted in that data center? Amazon Web Services. Businesses relying on AWS in that data center were effectively down for over six hours. Why didn’t the generators kick in, you ask? Here’s Amazon’s response:
All utility electrical switches in both data centers initiated transfer to generator power. In one of the data centers, the transfer completed without incident. In the other, the generators started successfully, but each generator independently failed to provide stable voltage as they were brought into service. The utility power in the Region failed a second time at 7:57pm PDT. Again, all rooms of this one facility failed to successfully transfer to generator power while all of our other data centers in the Region continued to operate without customer impact.
I’m not trying to pick on AWS here. Other major cloud providers have had serious issues as well. An expired SSL certificate helped cause a global, cascading event that took down numerous cloud-reliant systems. Who was the provider? Microsoft Azure. Full availability wasn’t restored for 12 hours, and for many customers, not for up to 24 hours. About 52 other Microsoft services relying on the Azure platform experienced issues, including the Xbox Live network.
There’s a real cost associated with all of these cloud outages. Ponemon Institute recently released the results of the latest Cost of Data Center Outages study. Previously published in 2010 and 2013, this third study continues to analyze the cost behavior of unplanned data center outages. According to the new study, the average cost of a data center outage has steadily increased from $505,502 in 2010 to $740,357 today (a 38 percent net change).
Across 63 data center environments, the study found that:
- The cost of downtime has increased 38 percent since the first study in 2010.
- Downtime costs for the most data center-dependent businesses are rising faster than average.
- Maximum downtime costs increased 32 percent since 2013 and 81 percent since 2010.
- Maximum downtime costs for 2016 are $2,409,991.
We know that outages aren’t fun, and they can be quite costly. As today’s businesses rely even more on data center services, the cost and impact of outages will continue to challenge IT leaders and executives. However, there are ways to prepare. Consider the following steps:
- Have a plan. Specifically, develop a Business Impact Analysis (BIA) for your organization. This means carefully mapping out all of your systems, their components, and details around data, user access, and much more. The goal is to understand and rank your most critical apps and business processes. Basically, what can or can’t you live without? From there, understand what an outage would cost you and how you can effectively mitigate it. This type of planning will help you understand which systems are critical to recover and how quickly they need to be recovered. If you can’t do it yourself, hire a partner that can help you develop a disaster recovery and business continuity plan designed around your BIA. Yes, this can be time-consuming and very task-oriented. However, it’ll be well worth the effort.
- Test out your plan. What good is a plan if it’s never tested? You absolutely need to test your recovery systems to ensure proper failover. In fact, you should test various failure scenarios as well: weather-related, malicious, and accidental events should all be planned for. It’s critical to test the various components around failure to ensure you have proper failover capabilities. Here’s the thing: you do not need to test against your production systems. You can create “production-like” siloed test beds to check your most critical apps and data points. From there, you can tweak your production infrastructure as needed.
- Evolve and ensure your plan stays in line with your business. Your recovery plan should never, ever be set in stone. As soon as you update an app, check your failover strategy. As soon as you apply a new network patch, review your BIA to ensure compliance. As soon as you add a new data point, test the dependencies to other databases and workloads. A weak link in your recovery plan can prolong the failure and create additional challenges. Because of the importance of IT, your recovery plan must keep pace with updates, patches, business initiatives, and evolving IT strategies.
- Learn and understand the value of your IT infrastructure. Too often we brush off the expense of a truly all-encompassing recovery strategy, thinking what we have is “good enough.” Then something happens, and we quickly learn the real cost of being down for an entire day. A big part of conducting a BIA is learning the critical nature of our apps, processes, and data points. Another big part is placing a real business value on those IT components. Understand what it’ll cost you to lose an app for one hour versus one day. Remember, outage costs continue to rise. Invest in a good recovery strategy now to help insulate you from risk and those rising costs.
- Don’t put all of your eggs in one cloud basket. Having a cloud strategy is great, but it also takes a lot of planning and coordination. Many organizations like to find one cloud partner and work with them directly. There’s nothing wrong with that at all. However, if you have critical systems that simply can’t go down, think about the impact of losing that data center. What if the recovery zones you set up within the very same cloud provider don’t work? Or what if the generators fail to kick in? New tools surrounding software-defined networking, network functions virtualization, and advanced load balancing help organizations stay very agile. Furthermore, cloud control platforms like CloudStack, OpenStack, and Eucalyptus all help extend cloud capabilities and simplify management. The point is to think about placing part of your recovery strategy outside of your primary provider. Maybe an on-premises or hybrid ecosystem makes sense. Recovery options can be structured so you only pay if a disaster actually happens. When your primary cloud goes down, you can fail over to a platform in a way that is completely transparent to the end user. This helps your users (or customers) continue to get their services while you recover primary systems.
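The BIA ranking exercise described above can be made concrete with a small script. Here’s a minimal sketch in Python; the application names, hourly outage costs, and recovery time objectives (RTOs) are entirely hypothetical placeholders you would replace with your own figures:

```python
# Minimal BIA ranking sketch. All apps, costs, and RTO targets below are
# hypothetical; plug in the numbers from your own impact analysis.
from dataclasses import dataclass

@dataclass
class App:
    name: str
    hourly_outage_cost: float  # estimated loss per hour of downtime (USD)
    rto_hours: float           # recovery time objective: max tolerable downtime

def rank_by_criticality(apps):
    # Most critical first: highest cost per hour down, then tightest RTO.
    return sorted(apps, key=lambda a: (-a.hourly_outage_cost, a.rto_hours))

apps = [
    App("internal-wiki", 500, 48),
    App("order-processing", 25_000, 1),
    App("customer-portal", 12_000, 4),
]

for app in rank_by_criticality(apps):
    print(f"{app.name}: ${app.hourly_outage_cost:,.0f}/hr down, RTO {app.rto_hours}h")
```

Even a toy ranking like this forces the useful conversation: which systems get recovered first, and what does every hour of delay actually cost?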
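The cross-provider failover idea can be sketched as a simple router that sends traffic to a secondary platform only after several consecutive failed health checks (so a single blip doesn’t cause flapping). The endpoint labels and threshold here are illustrative assumptions, not any particular provider’s API:

```python
class FailoverRouter:
    """Route to the primary endpoint until it fails N consecutive health checks.

    Endpoints are plain labels here; a real deployment would wire this into
    DNS, a load balancer, or an SDN controller.
    """
    def __init__(self, primary: str, secondary: str, threshold: int = 3):
        self.primary = primary
        self.secondary = secondary
        self.threshold = threshold
        self.consecutive_failures = 0

    def record_check(self, primary_ok: bool) -> None:
        # A single success resets the counter; failures accumulate.
        self.consecutive_failures = 0 if primary_ok else self.consecutive_failures + 1

    @property
    def active(self) -> str:
        # Fail over only after `threshold` consecutive failures to avoid flapping.
        if self.consecutive_failures >= self.threshold:
            return self.secondary
        return self.primary

router = FailoverRouter("primary-cloud", "secondary-site", threshold=3)
for ok in (True, False, False, False):
    router.record_check(ok)
print(router.active)  # after three straight failures, traffic shifts to secondary
```

The design choice worth noting is the failure threshold: failing over too eagerly can be as disruptive as the outage itself, which is why the testing step above matters.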
The biggest point here is to, at the very least, have some kind of plan in place to protect yourself against an outage. The honest truth is that no environment is completely without its faults. Even a platform offering six 9’s of availability can still be down for roughly 31 seconds over the course of a year. Can you handle that? What about data and application dependencies; can they handle a 30-second outage? Being prepared isn’t an illusion of never going down or experiencing an outage. Rather, it’s a plan that will help you recover as quickly as possible while helping insulate you from the cost. In the world of cloud, be prepared for volatile and constantly changing parameters. The more prepared you are around your critical systems, the quicker you can recover.
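That six 9’s figure is easy to check with a little arithmetic, and the same formula tells you what any SLA’s “nines” really buy you per year:

```python
# Annual downtime implied by an availability SLA expressed in "nines".
# Pure arithmetic; uses a non-leap 365-day year.
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000

def annual_downtime_seconds(nines: int) -> float:
    availability = 1 - 10 ** (-nines)   # e.g. 6 nines -> 0.999999
    return SECONDS_PER_YEAR * (1 - availability)

print(annual_downtime_seconds(6))  # six 9's: about 31.5 seconds per year
print(annual_downtime_seconds(3))  # three 9's: about 8.76 hours per year
```

Run it for your own provider’s SLA, then ask whether your critical apps and their dependencies can actually tolerate that much downtime.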