FOREST CITY, NC - APRIL 19:  A collage of profile pictures makes up a wall in the break room at the new Facebook Data Center on April 19, 2012 in Forest City, North Carolina.  The company began construction on the facility in November 2010 and went live today, serving the 845 million Facebook users worldwide.  (Photo by Rainier Ehrhardt/Getty Images)

Facebook Open Sources Data Center Network Fault Detection Tools

Add Your Comments

datacenterknowledgelogoBrought to you by Data Center Knowledge

Several years ago Facebook shut down an entire data center to test the resiliency of its application. According to Jay Parikh, the company’s head of engineering, the test went smoothly. The data center going offline did not disrupt anybody’s ability to mindlessly scroll through their Facebook feed instead of spending time being a contributing member of society.

Facebook and other web-scale data center operators, companies that built global internet services that make billions upon billions of dollars, have shifted the data center resiliency focus from redundancy and automation of the underlying infrastructure – the power and cooling systems – to software-driven failover. A globally distributed system that consists of so many servers can easily lose some of those servers without any significant impediment to the application’s performance.

Read more:Facebook Turned Off Entire Data Center to Test Resiliency

That’s not to say they’ve abandoned backup generators, UPS systems, and automatic transfer switches. You’ll still see all of those things in Facebook data centers; it’s just that they are no longer the single line of defense.

Today, Facebook open sourced some of the software tools it has built in-house that help its engineers detect the location of an outage within its infrastructure down to a single cluster of servers within a matter of seconds, isolate the fault, and avoid a wider-scale issue.

Read more: Why Should Data Center Operators Care about Open Source?

The tools are parts of a system called NetNORAD, which constantly monitors the entire Facebook data center infrastructure for packet loss rates and latency. Using data analytics, it detects abnormal patterns and triggers alarms, usually within 30 second of a fault.

“Our scale means that equipment failures can and do occur on a daily basis, and we work hard to prevent those inevitable events from impacting any of the people using our services,” Petr Lapukhov, a network engineer at Facebook, wrote in a blog post. “The ultimate goal is to detect network interruptions and automatically mitigate them within seconds. In contrast, a human-driven investigation may take multiple minutes, if not hours.”

The components of NetNORAD Facebook is open sourcing are pinger and responder, the system that has a set of servers (pingers) continuously reach out to all servers in Facebook data centers and generates packet loss and latency data based on the responses they receive, andfbtracert, the tool that automatically determines the exact location of a fault.

For more details on how NetNORAD works, read Lapukhov’s blog post.

Original article appeared here: Facebook Open Sources Data Center Network Fault Detection Tools


Subscribe Now and Get Our Exclusive Report on "The Hosting Infrastructure Ecosystem"

Enter your email to receive messages about offerings by Penton, its brands, affiliates and/or third-party partners, consistent with Penton's Privacy Policy.

Related Forum Threads

Add Your Comments

  • (will not be published)