365 Main Recovers From Downtime
By Justin Lee, theWHIR.com
August 15, 2007 — (WEB HOST INDUSTRY REVIEW) — A data center’s reputation has become synonymous with its reliability, which is built almost entirely on its ability to maintain 100 percent uptime, and thus, assure clients that their service will be available 24 hours a day, 365 days a year.
For this reason, data centers spend millions of dollars a year updating their back-up power generators and related equipment to ensure that their clients are never without power in the event of a power failure or unforeseen disaster.
So when some of the back-up generators at data center operator 365 Main’s (365main.com) San Francisco facility failed to start on July 24 during a PG&E power outage, leaving 40 percent of the facility’s customers without power to their equipment for up to 45 minutes, no one was more concerned than 365 Main itself.
“There’s no question that this was a difficult experience for us and for our customers,” says Miles Kelly, vice president of marketing at 365 Main. “We take operations very seriously. We talk about being the world’s finest. We do have a track record of 99.9967 percent across the entire portfolio, so any downtime we take very seriously.”
The outage occurred after the transformer breakers at a local PG&E power station inexplicably opened. Power outages normally trigger 365 Main’s back-up diesel generators to start up and take over supplying power to customers. However, three of 365 Main’s 10 back-up power generators, manufactured by Hitec, failed to complete their start sequence.
The Hitec units are rigorously tested and inspected, Kelly says, on a daily, weekly, monthly, quarterly, semi-annual and annual basis, with the documented results made available to 365 Main customers for review. The affected units had performed perfectly in inspections shortly before the incident.
Within hours of the outage, 365 Main sent a team of Hitec specialists to the San Francisco data center facility to join on-site technicians and begin systematically testing the generators to find the root cause.
Finally, after days of thorough testing, the team found a weakness in an essential component of the back-up generator system known as a Detroit Diesel Electronic Controller, which prevented the component from correctly resetting its memory. The invalid data left in the DDEC’s memory then caused misfiring or engine start failures when the generators were called on to start during the power outage. The investigation team fixed the issue by altering the timing of a command to the DDEC component, allowing more time between the engine shut-down command and the DDEC reset command. Once this fix was introduced, the Hitec generators successfully passed more than 50 consecutive start-up sequence tests without incident.
“This particular outage revealed this particular weakness, and that is something that has been addressed,” says Kelly. “There is no such thing as 100 percent-uptime data center, but we are doing everything we can to achieve that.”
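To put the uptime figures Kelly cites in perspective, the short sketch below converts an uptime percentage into minutes of downtime per year, and a single outage into a yearly uptime percentage. It is illustrative arithmetic only. The function names are hypothetical, and it assumes a 365-day year and one customer who lost power for the full 45 minutes; 365 Main's 99.9967 percent figure covers its whole portfolio and may be measured differently.

```python
# Illustrative arithmetic only: assumes a 365-day year and a single
# customer who experienced the full 45-minute outage.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_minutes_per_year(uptime_percent: float) -> float:
    """Minutes of downtime per year implied by an uptime percentage."""
    return (1 - uptime_percent / 100) * MINUTES_PER_YEAR


def uptime_percent_after_outage(outage_minutes: float) -> float:
    """Yearly uptime percentage if the only downtime is one outage."""
    return (1 - outage_minutes / MINUTES_PER_YEAR) * 100


if __name__ == "__main__":
    # The track record quoted by Kelly, expressed as a yearly allowance
    print(f"{downtime_minutes_per_year(99.9967):.1f} minutes per year")  # 17.3
    # The worst-case 45-minute outage, expressed as yearly uptime
    print(f"{uptime_percent_after_outage(45):.4f} percent")  # 99.9914
```

Under these assumptions, a 45-minute outage is roughly two and a half times the yearly downtime that a 99.9967 percent record would allow.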
365 Main has performed the DDEC fix in both its San Francisco and El Segundo facilities, the only two facilities in its portfolio with Hitec generators containing DDECs.
The data center operator is also sharing the findings of its investigation with other Hitec customers. Meanwhile, Hitec has expanded its preventative maintenance procedures as a direct result of discoveries made during the investigation. Following the outage, 365 Main published an apology to its customers and provided daily updates taken directly from the investigation team’s meeting minutes, enabling customers and the public to track progress.
All of the affected 365 Main customers received refunds under their 365 Main service level agreements for any loss of electrical power to their servers during the outage. The company also launched an extensive customer outreach program in which it met with the CEOs of its customers in order to demonstrate its credibility as a reliable data center operator, says Kelly.
“To best deal with the outage we have been very forthcoming in acknowledging the seriousness of what happened, letting people know all the information we had, day by day,” says Kelly. “No question our reputation is at risk and we are doing everything we can to show that this particular problem has been taken care of.”











