CloudFlare suffered an hour-long outage over the weekend after pushing out a change that caused a system-wide failure of its Juniper edge routers
This article has been updated to include a statement from Juniper Networks.
Web security and performance provider CloudFlare suffered an hour-long outage over the weekend after pushing out a change that caused a system-wide failure of its Juniper edge routers, according to several reports.
The post-mortem on Sunday’s outage is hosted on CloudFlare’s Posterous blog, which is inaccessible at the time of writing, but according to CNET, the issue started at 1:47 a.m. PDT on Sunday, when CloudFlare detected a DDoS attack on one of its customers.
The outage affected Juniper routers running the Flowspec protocol, which allows customers to broadcast router rules to a large number of routers efficiently, CNET says.
CloudFlare detected the attack when it identified attack packets between 99,971 and 99,985 bytes long, which exceeds CloudFlare’s 4,470-byte maximum packet size. A rule matching packets in that size range was nonetheless accepted by Flowspec and relayed to CloudFlare’s edge network.
“What should have happened is that no packet should have matched that rule because no packet was actually that large,” CloudFlare CEO Matthew Prince writes in the post-mortem. “What happened instead is that the routers encountered the rule and then proceeded to consume all their RAM until they crashed.”
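The mismatch Prince describes, a rule that targets packet lengths no real packet can have, is the kind of condition a pre-push sanity check could flag. The following is a minimal sketch under stated assumptions, not CloudFlare’s or Juniper’s actual tooling: `PacketLengthRule` and `can_match` are hypothetical names, and the only figures used are the 4,470-byte maximum and the 99,971 to 99,985 byte range reported above.

```python
from dataclasses import dataclass

# Assumption: the 4,470-byte maximum packet size cited in the article.
MAX_PACKET_BYTES = 4470


@dataclass(frozen=True)
class PacketLengthRule:
    """Hypothetical stand-in for a filter rule that matches on packet length."""
    min_bytes: int
    max_bytes: int


def can_match(rule: PacketLengthRule, max_packet_bytes: int = MAX_PACKET_BYTES) -> bool:
    """Return True if at least one possible packet length falls inside the rule's range."""
    if rule.min_bytes > rule.max_bytes:
        return False  # the range itself is empty
    # The rule's range must overlap the range of lengths a packet can have.
    return rule.min_bytes <= max_packet_bytes and rule.max_bytes >= 0


# The length range reported for the attack packets.
reported = PacketLengthRule(min_bytes=99_971, max_bytes=99_985)
print(can_match(reported))  # False: no packet on this network can be that large
```

A check like this only says that the rule cannot match anything on the network; whether such a rule should be blocked or merely flagged for review is a separate decision.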
The outage impacted all 23 of CloudFlare’s data centers and the approximately 175,000 customers around the world that rely on its DNS and web proxy service. CloudFlare also has more than 100 web hosting partners that resell its performance and security service to their own customers.
This is one of a handful of major outages CloudFlare has had since its launch, and it is a blow to the company, which just last week launched its next-generation web optimization protocol, Railgun, with more than 25 hosting partners.
Prince says CloudFlare will make it up to its paying customers, though at this point it isn’t clear what the reimbursement will look like. CloudFlare introduced very competitive SLAs with its high-end plans last year: a 100 percent SLA with its business plan and a 2,500 percent SLA with its enterprise plan.
UPDATE Monday, 3:27 pm ET: “Juniper Networks is aware of and investigating a reported network outage with one of our customers, Cloudflare. While we have not completed our investigation, we believe this incident was triggered by a product issue that Juniper identified last October, when a patch was also made available. Our customer support team is actively supporting Cloudflare in its efforts to resolve the issue and we are not aware of any other customers experiencing similar issues.”
Talk back: Were you impacted by the CloudFlare outage over the weekend? How did your customers that use CloudFlare react? Let us know in a comment.
5 comments
Another example of service engineering MIA. Various tests are part of the design process prior to a release being pushed to the production environment. I wouldn’t be pointing fingers at the Juniper SE. No, whoever is responsible for network engineering at CloudFlare needs to have their a!! handed to them. I’d put them in “Herding Cats” in the article referenced below:
http://www.networkperformanceinnovations.com/blog/where-are-you-along-the-network-service-design-continuum/
Absolutely agree. CF acted irresponsibly by introducing a mass config update to ALL servers (who does that, really?) and when this failed, they tried to shift the blame. Their first response was all about “not throwing the vendor under the bus” but a few hours later they did just that… Now it turns out that there was a patch which they didn’t care to use… Very unprofessional and irresponsible behavior. They should invest more in IT and less in marketing and hype.
So my boss showed this to me because we use Juniper like crazy and respect CloudFlare. My take? Cloudflare, as the first commenter said, screwed up their rollout BIG TIME. Seems like they are better at spinning outages than at preventing the ones they instigate themselves.
But if I were Juniper, I’d fire the Juniper Sales Engineer on the Cloudflare account. If you know your client’s organization and network like a good SE, then you should know ahead of time to warn the client against these things, and then notify your Sales Manager so Juniper HQ has wind of this ahead of time should it blow up. It’s basic CYA, for goodness’ sake.
Juniper routers are deployed and working correctly all over the world. Cloudflare has a dozen locations and they don’t all fail at the same time… It’s because they were asked to do something without proper testing and validation that Juniper supports such a configuration change, which obviously it doesn’t. Throwing vendors under the bus shows Cloudflare’s inexperience and that they’re not enterprise ready. Hopefully this taught them a good lesson: TEST first before mass-deploying code. And heaven forbid, if it’s hard to test without real traffic (which it is), then push the code out to 1 or 2 PoPs, not your entire network. Sheesh.
Whilst I agree with you there, every second CloudFlare waste “testing” rules is a second that the attack is affecting other customers on the affected PoPs. Whilst testing should happen, when you are someone like CloudFlare and you are getting hit with DDoS after DDoS, there is little time to do testing. In my opinion this whole issue could have been prevented before testing was even needed! CloudFlare engineers should know their stuff, and should have realized that there actually were no packets of that size and shouldn’t have even pushed the rule. Taking it one step further, with all the tech and software they use internally, and develop, they could easily develop a relatively simple Flowspec verifier to help prevent minor issues like this in the future!
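One suggestion raised in the comments above, pushing a change to one or two PoPs before the entire network, can be sketched in a few lines. This is only an illustration under stated assumptions: `staged_rollout`, `apply_rule`, and `is_healthy` are hypothetical names, and how a rule is actually applied to a router or how its health is measured is not described in the article.

```python
from typing import Callable, List


def staged_rollout(
    pops: List[str],
    apply_rule: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    canary_count: int = 2,
) -> List[str]:
    """Apply a change to a few canary PoPs, then to the rest only if the canaries stay healthy.

    Returns the PoPs the change was applied to, in order.
    """
    canaries, rest = pops[:canary_count], pops[canary_count:]
    applied: List[str] = []

    for pop in canaries:
        apply_rule(pop)
        applied.append(pop)

    # Stop here if any canary is unhealthy; the rest of the network is untouched.
    if not all(is_healthy(pop) for pop in canaries):
        return applied

    for pop in rest:
        apply_rule(pop)
        applied.append(pop)
    return applied


# Hypothetical usage: 23 PoPs, with a rule that makes every router it touches unhealthy.
pops = [f"pop-{i}" for i in range(1, 24)]
touched = staged_rollout(pops, apply_rule=lambda pop: None, is_healthy=lambda pop: False)
print(touched)  # ['pop-1', 'pop-2']
```

The trade-off the last commenter points out still applies: the health check takes time, during which the attack keeps hitting the PoPs that have not yet received the rule.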