Amazon CTO Werner Vogels mentioned it first. Now there are discussion threads on Slashdot, Engadget and Digg.
Google Research's disk failure trends report reminds me of Ronnie, a former customer whose 10-server population suffered FOUR hard drive failures over the course of three weeks. At least a couple of the failed drives were brand new. Other folks with his server config experienced no such disasters. Were Ronnie's machines located in a particularly hot section of the data center? Was his utilization much higher than anyone else's?
Believe it or not, after analyzing over 100,000 hard drives (parallel or serial ATA, 5400 to 7200 RPM, 80GB to 400GB) between Dec 2005 and Aug 2006, Google found that there's no significant correlation between either temperature or activity level and failure probability. Age isn't necessarily a good predictor, either. The average annual failure rate among 1-year-old drives was ~2%, rising to ~8% in years 2 and 3, but declining to ~6% in year 4 (see page 4 of this PDF).
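To get a feel for what those annual rates add up to, here is a rough back-of-the-envelope sketch. It assumes, purely for illustration, that each quoted rate is the chance that a drive which survived to the start of that year fails during it. Google's figures come from drive populations of different ages rather than one cohort followed over time, so treat the result as an approximation, not something the report states. The names `afr_by_year` and `cumulative_failure` are made up for this example.

```python
# Approximate annual failure rates quoted above, keyed by drive age in years.
# Assumption: each rate is the probability that a drive which was still
# alive at the start of that year fails before the year ends.
afr_by_year = {1: 0.02, 2: 0.08, 3: 0.08, 4: 0.06}


def cumulative_failure(rates):
    """Return {year: fraction of the original drives failed by end of that year}."""
    surviving = 1.0
    failed_by_year = {}
    for year in sorted(rates):
        surviving *= 1.0 - rates[year]  # survivors shrink by that year's rate
        failed_by_year[year] = 1.0 - surviving
    return failed_by_year


for year, failed in cumulative_failure(afr_by_year).items():
    print(f"by end of year {year}: {failed:.1%} of drives failed")
```

Under that assumption, a little over a fifth of the drives are gone by the end of year 4, even though no single year looks alarming on its own.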
Just as importantly, while certain SMART parameters (especially scan errors, reallocation count, offline reallocation, probational count) correlate strongly with higher failure rates, over 56% of the failed drives showed zero counts on all of these variables. In other words, SMART data alone can't reliably pinpoint impending failure for an individual drive.
On the other hand, drive models, manufacturers and vintages do make a huge difference. The report doesn't show breakdowns, but Amazon's Vogels says you pretty much get what you pay for. In addition to investing in high quality disks, you can also improve average longevity of your hard drive population with a longer burn-in period, during which bad disks will be weeded out.
Still, hard drive survival is very much a game of numbers, which isn't reassuring news for anyone who's running mission critical apps on standalone servers. Vogels says this is a good reason to store data on S3 and let Amazon worry about the problem; many users agree. If you're in the hosting business and would like to hold on to your customers, a virtualized storage platform is increasingly a must-have. "Fast hardware replacement" offers just aren't as compelling as freedom from hassles associated with hard drive failure.
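To put "a game of numbers" in concrete terms, here is a small sketch of how unlikely Ronnie's three weeks would be if drives failed independently. Everything in it beyond the ~8% annual rate is an assumption for illustration: one drive per server, failures that are independent of each other, and a constant failure rate through the year. The function names are hypothetical.

```python
from math import comb


def window_failure_prob(afr, weeks):
    """Chance that one drive fails within `weeks`, given an annual failure rate.

    Assumes a constant failure rate across the year.
    """
    return 1.0 - (1.0 - afr) ** (weeks / 52.0)


def prob_at_least(k, n, p):
    """Binomial tail: chance that at least k of n independent drives fail."""
    return sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))


# Assumed scenario: 10 drives (one per server), 8% annual failure rate,
# a three-week window, and at least four failures.
p_three_weeks = window_failure_prob(afr=0.08, weeks=3)
print(f"per-drive chance of failing in 3 weeks: {p_three_weeks:.3%}")
print(f"chance of 4+ failures among 10 drives: {prob_at_least(4, 10, p_three_weeks):.2e}")
```

Under these assumptions the odds come out to roughly one in ten million, which suggests Ronnie's failures were not independent bad luck. That fits the point about models and vintages: a bad batch of drives fails together.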