A part of Datapipe's management console, adapted for the AWS service
(WEB HOST INDUSTRY REVIEW) — In late April, Amazon’s AWS hosting services were hit with a significant outage that impacted its EC2 and relational database services in its Northern Virginia availability zone – an outage that lasted nearly three days (before everything was back to functioning normally) and was especially notable for impacting the performance of a few popular social networking services, including Hootsuite and Foursquare.
For many observers and users of the AWS services, the outage was a bit of a wakeup call – a rare demonstration of the fallibility of Amazon’s EC2 service, or of infrastructure clouds in general.
“If you were in a [traditional data center environment] and your application had no ability to be repositioned somewhere else, and the power went out for a day, what would you do? They would call you unprepared,” says Ed Laczynski, vice president of cloud strategy and architecture at managed hosting provider Datapipe, in an interview with the WHIR. “I think a lot of it has to do with understanding that just because it’s on the cloud, doesn’t mean you don’t have to think about how you engineer it.”
For Datapipe (www.datapipe.com), whose Managed Cloud solution is a unique kind of managed front-end for the more basic AWS services, the Amazon outage was an opportunity to make good on its promise of providing the kind of engineering, monitoring and management that would render this kind of disruption irrelevant from an end-user standpoint.
“You actually have to make sure you take advantage of what’s really not available in other traditional environments – the ability to dynamically provision resources,” says Laczynski of cloud environments. “You need to be prepared to do that. So, you need to have your data, for example, stored in a place and a manner that’s available in case you need to spin up in other places. You need to have some self-healing, meaning you need to implement some source control management and change management processes, so if you do need to reposition resources, you can get the latest version of your software or your code or user-generated content, and reposition it somewhere where that compute load is available.”
According to Laczynski, Datapipe had a direct hand in engineering about 95 percent of the customer solutions it runs that are based in part or entirely on Amazon’s cloud hosting. In those cases, the applications were run through processes to ensure their cloud readiness, and designed to best practices for auto-scaling, self-healing or high availability, depending on the customer’s needs.
Datapipe takes Managed Cloud customers through a long process and a broad set of design principles. While it varies depending on the customer’s needs, the process generally centers on evaluating and improving customer applications in terms of cloud readiness, change management, auto-scaling, self-healing, high availability and disaster recovery.
“We kind of walk our customers through that process,” says Laczynski. “Not every customer has all of those attributes. And not all of them need it. Really, the main ones are being core cloud ready, having change management, and having some kind of self-healing strategy.”
On the day of the outage, says Laczynski, Datapipe’s response was a matter of executing plans that were already in place. In a way, it was a real-world test of all the systems the company had developed for customers. During the outage, Datapipe’s teams polled the company’s solutions, watched availability, checked in with the agents the company places on all its customer solutions, and repositioned systems where necessary.
“We’re close to the metal, we’re in terminals, we’re in APIs and consoles and we’re keeping an eye on things and making sure we’re prepared for outages. If a customer does have a disruption, we have a whole process for that, where we open up tickets on our side, engage the customer on it and help them to resolution,” he says. “This event tested all that, on our side, and I’m happy to say we came out pretty good on it. I’m proud of what our team did there. It was kind of one of those days – it’s happening now, so let’s see if everything we’ve done works. We knew it would work because we tested it.”
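As a rough illustration of the poll-and-reposition routine described above, the following Python sketch checks the zone each service runs in and moves any service in a failing zone to a healthy one. It is not Datapipe's tooling or an AWS API: the `Service` class, the `reposition_unhealthy` function, the zone names and the `is_healthy` callback are all hypothetical stand-ins for whatever monitoring agents and provisioning calls a real deployment would use.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Service:
    """Hypothetical record of a customer service and the zone it runs in."""
    name: str
    zone: str


def reposition_unhealthy(
    services: List[Service],
    is_healthy: Callable[[str], bool],
    zones: List[str],
) -> List[str]:
    """Move every service in a failing zone to the first healthy zone.

    `is_healthy` stands in for a real availability probe (assumption).
    Returns the names of the services that were moved.
    """
    healthy_zones = [z for z in zones if is_healthy(z)]
    moved = []
    for svc in services:
        if not is_healthy(svc.zone):
            if not healthy_zones:
                raise RuntimeError("no healthy zone available")
            # In a real system this step would provision new resources and
            # redeploy the latest code and data from source control.
            svc.zone = healthy_zones[0]
            moved.append(svc.name)
    return moved


if __name__ == "__main__":
    # Simulated probe results: zone-a is down, zone-b is up.
    status = {"zone-a": False, "zone-b": True}
    fleet = [Service("web", "zone-a"), Service("db", "zone-b")]
    print(reposition_unhealthy(fleet, lambda z: status[z], list(status)))
```

The sketch only works if the data and code a service needs are already reachable from the target zone, which is the point Laczynski makes about storing data and managing source control ahead of time.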
Following the outage, which lasted approximately three days for some of the worst-hit clients and resulted in some data loss in a few cases, Amazon published a lengthy post-mortem report, promising improvements to the system and a compensation package in excess of what the company’s SLA guaranteed.
For Datapipe’s part, the report from Amazon was welcome, offering a look inside AWS that could benefit the user community in general.
“I think the fact that they responded with a series of architectural documents and best practice guides – some of that stuff that was a little bit left to interpretation – I think was very helpful,” says Laczynski. “It helps us educate customers, as well, in terms of, here’s the reason why we take extra time to set up your solution.”
He says that one of the takeaways from the Amazon outage was the success some of the company’s big-name customers had keeping their systems online, and what it says about architecting applications for the cloud.
“You’ve heard the success stories,” he says, “whether it’s from Datapipe or it’s from these large Amazon customers, like Netflix for example, that didn’t experience major disruption, even though inside, they probably were running around like we were, trying to make sure services stayed available. To the customer, they experienced minimal disruption because we took ownership of that, and we knew that our engineering was prepared for these sorts of events.
“I think a lot of customers put something on cloud, and just assume that cloud equals easy. I think that’s not the case.”