When I originally wrote, and then republished on Dan's Data, my Ground Zero column about hard drives wearing out, I was puzzled by something.
The famous Google study (PDF) of a large population of hard drives found, oddly, that the Self-Monitoring, Analysis, and Reporting Technology (not dreadfully helpfully abbreviated "S.M.A.R.T.") that's built into all modern hard drives was pretty much useless for its intended purpose. It just doesn't often tell you when a drive is on the way out and should be replaced.
Any drive that's been in service for a couple of years will have a couple of S.M.A.R.T. warning flags thanks to the basic hour counters built into the standard. Those warnings, by themselves, don't mean much at all. But despite those largely useless warnings that all older drives have, 36% of the drives that failed in the Google study had no warnings at all!
Technically, S.M.A.R.T. should work much better than this. The drive controller board knows when it has to repeatedly retry reads or writes, for instance; that's the most basic kind of ominous error. S.M.A.R.T. is just a standardised interface to allow drives to tell monitoring software how often stuff like that is happening.
And yet, very often, no such report happens.
S.M.A.R.T. monitoring isn't completely useless; a drive that actually does report any of the more serious S.M.A.R.T. problems should indeed be replaced. So you should still run some S.M.A.R.T. monitoring utility or other.
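For the curious, here's roughly what "some S.M.A.R.T. monitoring utility" boils down to. This is a minimal sketch, not a real tool: the sample text imitates the attribute table that smartmontools' `smartctl -A` prints, the function name is made up, and the short list of "serious" attributes is my own assumption for illustration.

```python
# Attribute names treated here as serious warning signs. This list is an
# assumption for illustration; pick your own based on your drives.
SERIOUS = {"Reallocated_Sector_Ct", "Current_Pending_Sector", "Offline_Uncorrectable"}


def serious_attributes(smartctl_output):
    """Return {attribute_name: raw_value} for serious attributes with a nonzero raw value."""
    found = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least ten columns;
        # the tenth column is the raw value.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in SERIOUS:
            raw = fields[9]
            if raw.isdigit() and int(raw) > 0:
                found[fields[1]] = int(raw)
    return found


# Hypothetical sample output, not taken from a real drive.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       12
  9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       22310
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
"""

print(serious_attributes(SAMPLE))
```

Note that the power-on-hours counter is deliberately ignored; that's the sort of flag every older drive accumulates.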
But, usually, a drive with serious S.M.A.R.T. errors is a drive that you're already carrying out to the shooting range. By the time the monitoring software reports an error, you've already lost data.
I didn't know why this was.
Now, however, I've got a clue, and I've added a piece to the Ground Zero column to mention it.
This Slashdot comment led me to this Usenet post from (someone who says he is) a former Seagate engineer. He alleges that the hard drive manufacturers' marketing departments just overruled the engineers and made them, in essence, secretly turn off S.M.A.R.T.'s early warning features, to make the drives look more reliable.
Until those drives failed without warning, of course.
But until that happened, they looked super-reliable!
I don't know whether this is actually true, but it sure does fit the evidence.
Great work, marketroids!
Regular readers may have noticed a certain animosity, on my part, towards the hard-working graduates of the world's many fine advertising and marketing schools.
Damn straight.
10 December 2007 at 11:20 pm
I've got a friend who worked for WD until rather recently. His stories of the drive creation process paint an image far from what the PR department would like, specifically of the mismanaged half-wits who kludged together the drive firmware. I should ask him if he knows anything about how SMART was handled next time I see him.
11 December 2007 at 1:50 am
Concerning the reliability of drives, I think the following quote from http://blogs.sun.com/relling/entry/using_mtbf_and_time_dependent illustrates rather effectively the difference between marketingspeak and what you can really expect from hardware:
"details. For example, data collected for the years 1996-1998 in the US showed that the annual death rate for children aged 5-14 was 20.8 per 100,000 resident population. This shows an average failure rate of 0.0208% per year. Thus, the MTBF for children aged 5-14 in the US is approximately 4,807 years."
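The arithmetic in that quote is easy to check. A minimal sketch, assuming a constant failure rate so that MTBF is simply the reciprocal of the annual rate (the function name is made up for illustration):

```python
def mtbf_years(failures_per_year, population):
    """MTBF in years under a constant-failure-rate assumption."""
    annual_rate = failures_per_year / population
    return 1.0 / annual_rate


# Figures from the quote: 20.8 deaths per 100,000 per year.
annual_rate = 20.8 / 100_000
print(f"annual failure rate: {annual_rate:.4%}")
print(f"MTBF: about {int(mtbf_years(20.8, 100_000))} years")
```

The number is arithmetically correct and tells you nothing about how long any individual will last, which is the point.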
11 December 2007 at 6:13 am
Oh yeah, the joys of "marketing specs"... here's another good one: Li-ion cells *can* be discharged all the way down, but then they tend to either a) lose lots of capacity or b) blow up at random. So a safe battery pack will cut the discharge at about 80% of the nominal capacity of the cells. So you have a nominal figure (good for, oh, about one charge/discharge cycle) and then an actual achievable figure which is 80% of the nominal. Guess which figure gets marked on these packs? Yep, sigh.
11 December 2007 at 4:50 pm
http://www.donotcall.gov.au
I encourage all Aussies to register their home phone number. Either that, or play "how long can the telemarketer be strung along before they hang up". A good line five minutes into the call is "whoa whoa the kids are fighting one moment please...wait..wait.....okay, that's the kids sorted, please start again from the beginning as I've lost my train of thought", repeated until the telemarketer hangs up. Sure, they are plebs paid by the sale, but impact them and you impact their evil overlords.
11 December 2007 at 9:36 pm
Uh, OK... but we're not actually talking about telemarketing, here.
(Also, the lobbyists made good and sure that the Australian Do Not Call list does not apply to politicians, charities, schools, religions, the Government itself, or, according to its own home page, "market researchers". We're on the list, and it's reduced the number of junk calls we receive a little, but it's a long way from being a complete solution. Not that we ever got that many junk calls, mind you; the telemarketing problem is less severe here in Australia than it is in the USA.)
12 December 2007 at 9:08 am
In my experience, server-class (SCSI, or obviously SCSI-based) hard drives will live around 5 years (in a server role), always-on desktop drives (or consumer grade drives in a server role) will last about 3 years, and drives that spend most of their lives spun down will last longer than is worthwhile to keep them running. For the average consumer, your drive will be painfully outdated well before it actually dies.
With that said, there is still the chance that a drive will die before its time -- it's just that it's not a very *good* chance, if you're dealing with small numbers. In a data center environment where you have racks upon racks of 2U servers, holding 6 drives each, that chance works out to actual, regular, drive failures.
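To put rough numbers on that, here is a small sketch. The 3% annual failure rate and the 20 servers per rack are assumptions picked purely for illustration, as is the idea that drives fail independently of one another.

```python
def p_any_failure(n_drives, annual_failure_rate):
    """Chance that at least one of n independent drives fails within a year."""
    return 1.0 - (1.0 - annual_failure_rate) ** n_drives


def expected_failures(n_drives, annual_failure_rate):
    """Average number of drive failures per year across the fleet."""
    return n_drives * annual_failure_rate


AFR = 0.03                 # assumed annual failure rate per drive
DRIVES_PER_SERVER = 6      # from the comment above
SERVERS_PER_RACK = 20      # assumed

for label, n in [("one desktop drive", 1),
                 ("one rack", DRIVES_PER_SERVER * SERVERS_PER_RACK)]:
    print(f"{label}: {n} drives, "
          f"P(at least one failure) = {p_any_failure(n, AFR):.1%}, "
          f"expected failures per year = {expected_failures(n, AFR):.1f}")
```

With one drive the odds are small; with a rack of them, a failure or several each year is close to certain under these assumptions.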
Consumer drives are fairly reliable for consumer applications, but I only buy drives with a 5-year warranty (Seagate gives you 5 years on all their drives). This doesn't necessarily make them any more reliable, but it shows that the manufacturer has enough pride in its products to stand behind them.
Anecdotally, I would say that drive manufacturers all hit a point where they had to increase profits by not replacing drives. Some chose to shorten their warranties, and some chose to make more reliable drives. This may not be the case, but it gives me the warm fuzzies.
1 January 2008 at 3:13 pm
Speaking as a former evil tele-researcher (yeah, I know; when they say it is soul crushing, it is. Being on the phone for 6 hours and having every single person either hate you or simply hang up gets kinda old kinda fast), what you NEED to say (and they won't care unless you say this) is: "Please remove my phone number from your database and never call me again. Thank you, goodbye."
If you say that, the poor person on the end of the line is required (by law, we were told) to remove you from the list.
I've been doing that for the last 2 years or so since I quit and we now get zero calls. Yay!