Facebook Apologizes After 2.5-hour Outage

The somewhat oddball profile picture of the Facebook engineering account

(WEB HOST INDUSTRY REVIEW) — Facebook, the runaway social networking success that many in the hosting business consider a potentially dangerous competitor, suffered an outage of several hours on Thursday after a change to a cache configuration value caused a database cluster to be overwhelmed.

In a post made Thursday evening to the company’s engineering blog, Facebook’s director of software engineering Robert Johnson apologized for the outage, which lasted roughly 2.5 hours.

“Today we made a change to the persistent copy of a configuration value that was interpreted as invalid,” wrote Johnson. “This meant that every single client saw the invalid value and attempted to fix it. Because the fix involves making a query to a cluster of databases, that cluster was quickly overwhelmed by hundreds of thousands of queries a second.”
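The feedback loop Johnson describes can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Facebook's code: the names `is_valid`, `Database`, `client_request` and `simulate`, the dictionary used as a cache, and the rule that a valid value is a positive integer are all hypothetical. It shows how a valid persistent value costs a single database query, while an invalid one sends every client request to the database.

```python
# Hypothetical sketch of the feedback loop described in the blog post.
# None of these names or rules come from Facebook's systems.

def is_valid(value):
    """Assumed validity rule: a config value must be a positive integer."""
    return isinstance(value, int) and value > 0


class Database:
    """Stands in for the database cluster holding the persistent copy."""

    def __init__(self, persistent_value):
        self.persistent_value = persistent_value
        self.queries = 0  # counts how often clients hit the database

    def read_config(self):
        self.queries += 1
        return self.persistent_value


def client_request(cache, db):
    """One client reads the cached value and tries to fix it if invalid."""
    value = cache.get("config")
    if not is_valid(value):
        # The "fix" is a database query; the result goes back into the cache.
        value = db.read_config()
        cache["config"] = value
    return value


def simulate(persistent_value, num_requests):
    """Return how many database queries num_requests client requests cause."""
    db = Database(persistent_value)
    cache = {}
    for _ in range(num_requests):
        client_request(cache, db)
    return db.queries


if __name__ == "__main__":
    print("valid persistent value:  ", simulate(42, 1000), "queries")
    print("invalid persistent value:", simulate(-1, 1000), "queries")
```

With a valid persistent value, only the first request misses the cache. With an invalid one, the refreshed value is rejected again on every request, so the query count grows in step with traffic.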

The site’s enormous popularity – more than 500 million registered users, according to recent accounts – meant that the outage, the site’s longest in four years, was major news around the web, despite being a comparatively small hiccup in the operation of a service that is free to users.

According to reports, users attempting to access the site during the outage got “DNS error” messages.

Johnson said in the blog post that fixing the condition meant cutting off requests to the database cluster, which unfortunately meant turning off the site. The company was able to bring the site back online after the databases had recovered and the condition causing the problem had been fixed.

“For now we’ve turned off the system that attempts to correct configuration values,” wrote Johnson. “We’re exploring new designs for this configuration system following design patterns of other systems at Facebook that deal more gracefully with feedback loops and transient spikes.”

Liam Eagle

About

Liam Eagle has worked as a contributor to the Web Host Industry Review since its inception in 2000, and as editor since 2003. He has been editor of the WHIR's print magazine since its launch. His daily involvement in the gathering and reporting of Web hosting news and his regular interaction with Web hosting leaders give him an uncommonly broad appreciation of the issues and trends facing the business. Through his WHIR blog, Liam spots Web hosting trends and offers opinions on the industry-wide impacts of major developments and the motivation behind big announcements. Follow him on Twitter @liameagle
