IT conspiracy theory of the week

When I originally wrote, and then republished on Dan's Data, my Ground Zero column about hard drives wearing out, I was puzzled by something.

The famous Google study (PDF) of a large population of hard drives found, oddly, that the Self-Monitoring, Analysis, and Reporting Technology (not dreadfully helpfully abbreviated "S.M.A.R.T.") that's built into all modern hard drives was pretty much useless for its intended purpose. It just doesn't often tell you when a drive is on the way out and should be replaced.

Any drive that's been in service for a couple of years will have a couple of S.M.A.R.T. warning flags thanks to the basic hour counters built into the standard. Those warnings, by themselves, don't mean much at all. But despite those largely useless warnings that all older drives have, 36% of the drives that failed in the Google study had no warnings at all!

Technically, S.M.A.R.T. should work much better than this. The drive controller board knows when it has to repeatedly retry reads or writes, for instance; that's the most basic kind of ominous error. S.M.A.R.T. is just a standardised interface to allow drives to tell monitoring software how often stuff like that is happening.

And yet, very often, no such report happens.

S.M.A.R.T. monitoring isn't completely useless; a drive that actually does report any of the more serious S.M.A.R.T. problems should indeed be replaced. So you should still run some S.M.A.R.T. monitoring utility or other.
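
If you'd rather roll your own check than trust a utility's traffic-light icon, here's a minimal sketch in Python. It assumes you've captured an attribute table in the layout that smartmontools' `smartctl -A` prints; the sample table below is made up for illustration. The choice of attributes 5, 197 and 198 as the "serious" ones is my assumption for this example and not part of any standard, and the function names are hypothetical.

```python
# Illustrative choice of attributes to treat as serious (an assumption,
# not something the S.M.A.R.T. standard defines).
SERIOUS_IDS = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
}

# Made-up sample in the column layout printed by `smartctl -A`.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       22345
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       3
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""


def parse_attributes(text):
    """Return {attribute_id: (name, raw_value)} for each attribute row."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header row starts with "ID#" and is skipped by the isdigit test.
        if len(fields) >= 10 and fields[0].isdigit():
            raw = fields[9]
            # Skip raw values that are not plain integers.
            if raw.isdigit():
                attrs[int(fields[0])] = (fields[1], int(raw))
    return attrs


def serious_warnings(attrs):
    """List (id, name, raw_value) for watched attributes with a nonzero raw count."""
    return [
        (attr_id, attrs[attr_id][0], attrs[attr_id][1])
        for attr_id in sorted(SERIOUS_IDS)
        if attr_id in attrs and attrs[attr_id][1] > 0
    ]


if __name__ == "__main__":
    for attr_id, name, raw in serious_warnings(parse_attributes(SAMPLE)):
        print(f"Attribute {attr_id} ({name}) has raw value {raw}")
```

Note that this skips the power-on-hours counter entirely, since that's the sort of flag every older drive picks up. And an empty list doesn't prove the drive is healthy; as the Google numbers show, plenty of drives die without reporting anything.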

But, usually, a drive with serious S.M.A.R.T. errors is a drive that you're already carrying out to the shooting range. By the time the monitoring software reports an error, you've already lost data.

I didn't know why this was.

Now, however, I've got a clue, and I've added a piece to the Ground Zero column to mention it.

This Slashdot comment led me to this Usenet post from (someone who says he is) a former Seagate engineer. He alleges that the hard drive manufacturers' marketing departments just overruled the engineers and made them, in essence, secretly turn off S.M.A.R.T.'s early warning features, to make the drives look more reliable.

Until those drives failed without warning, of course.

But until that happened, they looked super-reliable!

I don't know whether this is actually true, but it sure does fit the evidence.

Great work, marketroids!

Regular readers may have noticed a certain animosity, on my part, towards the hard-working graduates of the world's many fine advertising and marketing schools.

Damn straight.

  1. Jax184 Says:

    I've got a friend who worked for WD until rather recently. His stories of the drive creation process paint an image far from what the PR department would like. Specifically, the mismanaged half-wits who kludged together the drive firmware. I should ask him if he knows anything about how SMART was handled next time I see him.

  2. haakon Says:

    Concerning the reliability of drives, I think the following quote illustrates rather effectively the difference between marketingspeak and what you can really expect from hardware:

    "details. For example, data collected for the years 1996-1998 in the US showed that the annual death rate for children aged 5-14 was 20.8 per 100,000 resident population. This shows an average failure rate of 0.0208% per year. Thus, the MTBF for children aged 5-14 in the US is approximately 4,807 years."

  3. Zed2 Says:

    Oh yeah, the joys of "marketing specs"... here's another good one: LiIon cells *can* be discharged all the way down, but then they tend to either a) lose lots of capacity or b) blow up at random. So a safe battery pack will cut the discharge at about 80% of the nominal capacity of the cells. So you have a nominal figure (good for, oh, about one charge/discharge cycle) and then an actual achievable figure which is 80% of the nominal. Guess which figure gets marked on these packs? Yep, sigh.

  4. rocketsensor Says:

  5. Daniel Rutter Says:

  6. squash Says:

    In my experience, server-class (SCSI, or obviously SCSI-based) hard drives will live around 5 years (in a server role), always-on desktop drives (or consumer grade drives in a server role) will last about 3 years, and drives that spend most of their lives spun down will last longer than is worthwhile to keep them running. For the average consumer, your drive will be painfully outdated well before it actually dies.

    With that said, there is still the chance that a drive will die before its time -- it's just that it's not a very *good* chance, if you're dealing with small numbers. In a data center environment where you have racks upon racks of 2U servers, holding 6 drives each, that chance works out to actual, regular, drive failures.

    Consumer drives are fairly reliable for consumer applications, but I only buy drives with a 5-year warranty (Seagate gives you 5 years on all their drives). This doesn't necessarily make them any more reliable, but it shows a level of pride in their products that they are willing to stand behind them.

    Anecdotally, I would say that drive manufacturers all hit a point where they had to increase profits by not replacing drives. Some chose to shorten their warranties, and some chose to make more reliable drives. This may not be the case, but it gives me the warm fuzzies.

  7. chrysilis Says:

