The AWS Service Health Dashboard indicates the outage occurred on August 7 and 8
(WEB HOST INDUSTRY REVIEW) — Amazon Web Services (www.aws.amazon.com) revealed the details of its Dublin data center outage last week. In a blog post, Amazon says that it initially believed the transformer failure was caused by lightning, but now says it is continuing to investigate the root cause.
On August 7, the utility provider suffered a failure of a 110kV, 10-megawatt transformer, which resulted in a total loss of electricity supply to all customers connected to the transformer, including the AWS Availability Zone.
AWS says that its backup generators failed (as a result of faulty programmable logic controllers, or PLCs), leaving insufficient power for all of the servers to continue operating. The uninterruptible power supplies (UPS) quickly drained, according to AWS, and it lost power to almost all of the EC2 instances and 58 percent of the EBS volumes in that Availability Zone.
About 30 minutes later, AWS says it was seeing launch delays and API errors in all EU West Availability Zones. The cause, it says, was that the EC2 management service has servers in each Availability Zone, and the management servers receiving requests continued to accept RunInstances requests targeted at the impacted zone. It says many of the instances and volumes in the Availability Zone were accessible by 1:49 PM PDT.
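To illustrate how requests aimed at one failed zone can degrade service in healthy zones, here is a minimal Python sketch. It is a toy model, not AWS code: the `ZoneRouter` class, the `queue_limit` value, the zone names and the `fail_fast` switch are all hypothetical, and it assumes the management tier has a fixed, shared capacity for in-flight requests.

```python
from dataclasses import dataclass, field


@dataclass
class ZoneRouter:
    """Toy request router. All names are hypothetical, not AWS APIs."""
    impaired_zones: set = field(default_factory=set)
    queue_limit: int = 3                       # assumed shared capacity
    queued: list = field(default_factory=list)

    def submit(self, request_id: str, zone: str, fail_fast: bool) -> str:
        if zone in self.impaired_zones:
            if fail_fast:
                # Reject immediately so the request uses no shared capacity.
                return "rejected"
            if len(self.queued) >= self.queue_limit:
                return "throttled"
            # Accepted but cannot complete: it occupies a shared slot.
            self.queued.append(request_id)
            return "stuck"
        if len(self.queued) >= self.queue_limit:
            # Healthy-zone request is starved by the stuck ones.
            return "throttled"
        return "launched"


def simulate(fail_fast: bool) -> str:
    """Send three requests to a failed zone, then one to a healthy zone."""
    router = ZoneRouter(impaired_zones={"zone-a"})
    for i in range(3):
        router.submit(f"bad-{i}", "zone-a", fail_fast)
    return router.submit("good-1", "zone-b", fail_fast)


print(simulate(fail_fast=False))  # throttled
print(simulate(fail_fast=True))   # launched
```

In this model, accepting requests for the failed zone fills the shared queue and the healthy-zone request is throttled. Rejecting them up front leaves capacity free. The real EC2 management service is far more complex, and the post does not describe its internals at this level.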
In addition to the power disruption, AWS says it discovered an error in the EBS software that “cleans up unused storage for snapshots after customers have deleted an EBS snapshot.” According to the blog post, “the human checks in this process failed to detect the error and the deletion process was executed.”
On Monday, August 8, AWS says it created recovery snapshots for all affected snapshots, delivered them to customer accounts and communicated the issue on the Service Health Dashboard.
The blog post also outlined preventative actions AWS plans to take, including adding redundancy and more isolation for PLCs so they are insulated from other failures.
For EC2, AWS plans to address the resource saturation that affected API calls. It will also shorten the time required to recover stuck or inconsistent EBS volumes.
“Communication in situations like this is difficult. Customers are understandably anxious about the timing for recovery and what they should do in the interim,” the post reads. “We always prioritize getting affected customers back to health as soon as possible, and that was our top priority in this event, too. But, we know how important it is to communicate on the Service Health Dashboard and AWS Support mechanisms.”
According to the blog post, AWS communicated more frequently in this event than in prior events by tweeting key dashboard updates, staffing its AWS Support team to handle higher forum and support contacts, and trying to give an approximate time frame early on. Still, it says, its communication needs improvement.
“First, we can accelerate the pace with which we staff up our Support team to be even more responsive in the early hours of an event,” it says. “Second, we will do a better job of making it easier for customers (and AWS) to tell if their resources have been impacted.”
AWS says it will provide a 10-day credit equal to 100 percent of usage of the EBS volumes, EC2 instances and RDS database instances affected. Customers impacted by the EBS software bug will receive a 30-day credit and have access to Premium Support Engineers.
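As a rough illustration of the credit arithmetic, the sketch below multiplies a customer's daily charges for affected resources by the number of credit days. The function name and the daily charges are hypothetical, and it assumes the credit is simply 100 percent of daily usage charges for each credited day. AWS's actual billing calculation may differ.

```python
def outage_credit_cents(daily_charges_cents: dict, credit_days: int) -> int:
    """Credit = total daily charges for affected resources x credited days.

    Amounts are in whole cents to avoid floating-point rounding.
    """
    return sum(daily_charges_cents.values()) * credit_days


# Hypothetical daily charges, in cents, for one customer's affected resources.
affected = {"ec2": 204, "ebs": 33, "rds": 264}

print(outage_credit_cents(affected, credit_days=10))  # 5010
print(outage_credit_cents(affected, credit_days=30))  # 15030
```

Under these assumed figures, a customer paying $5.01 a day for affected resources would receive $50.10 under the 10-day credit, or $150.30 under the 30-day credit.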
“We will do everything we can to learn from this event and use it to drive improvement across our services. As with any significant operational issue, we will spend many hours over the coming days and weeks improving our understanding of the details of the various parts of this event and determining how to make changes to improve our services and processes.”
In 2009, Amazon blamed a separate EC2 outage on lightning as well.