CloudFlare spent the first hour and a half of the new year thinking it was one second further into 2017 than the rest of the world, resulting in a handful of DNS resolution failures which impacted a small number of machines.
According to CloudFlare, which offered an explanation and apology in a blog post on Sunday, some of its customers who use CNAME DNS were affected when its custom RRDNS software encountered a negative number, the result of code that did not account for the leap second.
At peak, only 0.2 percent of DNS queries and 1 percent of HTTP requests to CloudFlare were affected, and by 0645 UTC the fix had been applied across its global network. The most affected machines were patched in 90 minutes, CloudFlare says.
A blog post explaining the error and its fix provides some background on CloudFlare DNS, as well as a section on “falsehoods programmers believe about time.”
“Cloudflare customers use our DNS service to serve the authoritative answers for DNS queries for their domains. They need to tell us the IP address of their origin web servers so we can contact the servers to handle non-cached requests,” CloudFlare CTO John Graham-Cumming wrote. CNAME is one of the two ways origin server information is provided. “When a customer uses the CNAME option, Cloudflare has occasionally to do a lookup, using DNS, for the actual IP address of the origin server. It does this automatically using standard recursive DNS. It was this CNAME lookup code that contained the bug that caused the outage.”
The company applied a fix preventing negative numbers from being recorded, and restarted its RRDNS servers to guard against any recurrence.
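For readers curious what that class of fix looks like, here is a minimal Python sketch of the general technique: clamping an elapsed-time value to zero when the wall clock steps backwards. It is not CloudFlare's code. The function names `elapsed_seconds` and `pick_jitter` and the timestamps are hypothetical, chosen only to illustrate how a negative duration can reach code that assumes time always moves forward.

```python
import random


def elapsed_seconds(start: float, end: float) -> float:
    """Return end - start, clamped to zero if the clock stepped backwards."""
    return max(0.0, end - start)


def pick_jitter(rtt: float, rng: random.Random) -> float:
    """Hypothetical consumer that rejects a negative upper bound."""
    if rtt < 0:
        raise ValueError("rtt must be non-negative")
    return rng.uniform(0.0, rtt)


# Illustrative wall-clock readings (assumed values): the second reading is
# earlier than the first, as can happen when a clock repeats a second.
start = 1483228799.9
end = 1483228799.4

raw = end - start                    # negative: unsafe to pass downstream
safe = elapsed_seconds(start, end)   # clamped to 0.0

jitter = pick_jitter(safe, random.Random(0))
```

Passing `raw` to `pick_jitter` would raise `ValueError`, while the clamped value goes through. A complementary option is to measure intervals with a monotonic clock, such as Python's `time.monotonic()`, which cannot go backwards.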
CloudFlare is inspecting its code for any other possible leap second issues.
Calendar quirks have tripped up other providers too: Azure suffered a leap-day glitch on February 29, 2012, that lasted more than 12 hours.