Web hosting company ServInt experienced a significant networking problem on Saturday night that impacted customers in its Reston, Virginia, data center.
ServInt updated customers on Twitter and Facebook during the downtime because its normal customer communication channels were also knocked out by the outage.
ServInt CEO Reed Caldwell posted an apology and explanation to the ServInt Source blog on Sunday night, vowing to build greater redundancy in its ticketing and communication systems so customers don’t have to rely on Twitter or Facebook for status updates.
In the blog post, Caldwell gives a detailed explanation of the events leading up to the outage and a fairly candid play-by-play of the evening, calling the event one of the “strangest, most difficult to explain, and most difficult to solve networking problems we have ever seen.”
“On Saturday evening, our network was running smoothly, as it generally has for more than a decade. Suddenly our monitoring system started showing red/green/red/green/etc. The phrase ‘this is not a drill’ had to be used as senior engineers were plucked from their lives and rushed into the datacenter,” Caldwell said. “Our COO was on a plane, I was at dinner, but the engineering fix-it team that really needed to be there was there, immediately. What made this situation unique, and what made it impossible to fix in the normal few minutes, was the fact that the critical equipment that was in the process of failing seemed incapable of making up its mind whether it was healthy or not. Making matters more challenging: high levels of equipment redundancy (normally a very good thing) made it nearly impossible to determine where the problem lay.”
Caldwell continued: “In a typical router-failure situation, as soon as the router shows ‘red/down’ on our monitoring system, we post ‘we had a failed router interrupt traffic impact the network. This is being fixed and we’re routing around it — we’re sorry for the inconvenience.’ Those are facts and details, things people can get confidence from. However, with no reliable detail to pass on, our team was left to pass on rather vague updates for quite some time. It was frustrating and made us seem much worse about communication than we actually are.”
Caldwell said that Saturday evening’s events exposed some weaknesses at ServInt, namely in customer support and communication during a crisis.
“Having no support/communication failover systems, and forcing ServInt and its customers to rely on Twitter and Facebook to communicate, was totally unacceptable,” he said. “We will build greater redundancy into our ticketing and communication systems to make sure that never happens again.”
While Caldwell admits the communication was vague, few CEOs of hosting companies actually follow up with a clear description of an outage or issue and own up to failed communication. By offering an apology with specific actions, Caldwell gives ServInt customers a clearer idea of what to expect if – or when – another issue happens.
In the blog post, Caldwell also described a separate issue from the beginning of the week: a kernel-level exploit that could give hackers access to customer VPSs and the machines they are hosted on.
“Within 48 hours we performed emergency maintenance on nearly every single customer in our datacenter,” Caldwell said. “This meant forcing every single customer to accept at least a little downtime in the pursuit of vital security protections. Some customers did not like this, but if I had to do it all over again I would do it the same way. I am proud of the way ServInt rose to the challenge and protected our customer base from this dangerous exploit in such a swift manner.”