From a reader:

Love to hear your take on this: Why RAID 5 stops working in 2009.

That article by Robin Harris is more than a year old, but was suddenly linked hither and yon in the last few days. Its thesis is that because RAID array capacities are approaching unrecoverable read error rates, if one disk in your RAID fails, you'll very probably get a read error from one of the other disks in the course of rebuilding the array, and lose data.

This basic claim is true, but there are three reasons why this problem is not as scary as it sounds.

1: Losing one bit shouldn't lose you more than one file. Consumer RAID controllers may have a fit over small errors, but losing one file because a drive failed is seldom actually a serious problem.

2: Putting your valuable data on a RAID array and not backing it up is a bad idea even if disk errors never happen. One day, you are guaranteed to confidently instruct your computer to delete something that you actually want, and RAID won't protect you from that, or from housefires, theft of the drives, and so on. You still need proper backups.

3: If you're going to build a story on statistics, it helps a lot if you get the statistics **right**.

Robin Harris says it is "almost certain" that 12 terabytes of disk drives with a one-in-12TB read error rate will have an error if you read the whole capacity.

This statement is wrong.

Actually, the probability of one or more errors, in this situation, is only 63.2%. When you know why this is, you discover that there's less to fear here than you'd think.

(Robin Harris is not the only sinner, here. This guy makes exactly the same mistake. This guy, on the other hand, says one error in ten to the fourteen reads gives you "a 56% chance" of an error in seven terabytes read; he's actually done the maths correctly.)

The mistake people make over and over again when figuring out stuff like this is saying that if you've got (say) a one in a million chance of event Y occurring every time you take action X, and you do X a million times, the probability that Y will have happened is 1.

It isn't.

(If it were, then if you did X a million and one times, the probability that Y will have occurred would now be slightly **more** than one. This is unacceptably weird, even by mathematicians' standards.)

What you do to figure out the **real** probabilities in this sort of situation is look at the probability that Y will **never** happen over your million trials.

(If it matters to you if Y happens more than once, then things get more complex. But usually the outcomes you're interested in are "Y does not happen at all" and "Y happens one or more times". That is the case here, and in many other "chance of failure" sorts of situations.)

To make this easier to understand, let's look at a version of the problem using numbers that you can figure out on a piece of paper, without having to do anything a million times.

Let's say that you're throwing an ordinary (fair!) six-sided die, and you don't want to get a one. The chance of getting a one is, of course, one in six, and let's say you're throwing the die six times.

For each throw, the probability of something *other* than one coming up is five in six. So the probability of something other than one coming up for all six throws is:

5/6 times 5/6 times 5/6 times 5/6 times 5/6 times 5/6.

This can more easily be written as five-sixths to the power of six, or (5/6)^6, and it's equal to (5^6)/(6^6), or 15625/46656. That's about 0.335, where 1 is certainty, and 0 is impossibility.

So six trials, in each of which an undesirable outcome has a one in six chance of happening, certainly do not make the undesirable outcome certain. You actually have about a one-third chance that the undesirable outcome will not happen at all.

It's easy to adjust this for different probabilities and different numbers of trials. If you intend to throw the die 15 times instead of six, you calculate (5/6)^15, which gives you about a 0.065 chance that you'll get away with no ones. And if you decide to toss a coin ten times, and want to know how likely it is that it'll never come up tails, then the calculation will be (1/2)^10, a miserable 0.00098.
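If you'd rather let a computer do the arithmetic, here's a minimal Python sketch of the same sums. The function name `chance_of_none` is made up for illustration; it just raises the per-trial chance of escaping to the power of the number of trials, using exact fractions so nothing gets rounded along the way.

```python
from fractions import Fraction

def chance_of_none(p_bad, trials):
    """Chance that an outcome with per-trial probability p_bad never happens."""
    return (1 - Fraction(p_bad)) ** trials

six_throws = chance_of_none(Fraction(1, 6), 6)
fifteen_throws = chance_of_none(Fraction(1, 6), 15)
ten_tosses = chance_of_none(Fraction(1, 2), 10)

print(six_throws, float(six_throws))   # exact fraction, then its decimal value
print(float(fifteen_throws))           # fifteen throws of the die, no ones
print(float(ten_tosses))               # ten coin tosses, no tails
```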

In the one-in-a-million, one-million-times version, you figure out (1 - 1/1000000)^1000000, which is about 0.368. So there's a 36.8% chance that the one-in-a-million event will never happen in one million trials, and a 63.2% chance that the event will happen one or more times.
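The one-in-a-million case can be checked the same way. This is only a sketch with ordinary floating-point numbers, which are plenty accurate at this scale; the comparison with 1/e is thrown in because that's the value this calculation homes in on as the numbers get bigger.

```python
import math

p = 1 / 1_000_000      # chance of the event on each trial
n = 1_000_000          # number of trials

p_never = (1 - p) ** n             # the event never happens
p_at_least_once = 1 - p_never      # the event happens one or more times

print(f"never: {p_never:.3f}, at least once: {p_at_least_once:.3f}")
print(f"1/e for comparison: {math.exp(-1):.3f}")
```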

OK, on to the disk-drive example.

Let's say that the chance of an unrecoverable read failure is indeed one in ten to the 14 - 1/100,000,000,000,000. I'll express this big number, and the other big numbers to come, in the conventional computer-y form of scientific notation that doesn't require little superscript numbers. One times ten to the power of 14, a one with 14 zeroes after it, is thus written "1E+14".

The chance of **no** error occurring on any given read, given this error probability, is 1 - 1/(1E+14), which is 0.99999999999999. Very close to one, but not quite there.

(Note that if you start figuring this stuff out for yourself in a spreadsheet or something, really long numbers may cause you to start hitting precision problems, where the computer runs out of digits to express a number like 0.99999999999999999999999999999999999999 correctly, rounds it off to one, and breaks your calculation. Fortunately, the mere fourteen-digit numbers we're working with here are well within normal computer precision.)

OK, now let's say we're reading the whole of a drive which just happens to have a capacity of exactly 1E+14 bits, at this error rate of one error in every 10^14 reads. So the chance of zero errors is:

(1 - 1/(1E+14))^1E+14

This equals about 0.368. Or, if you prefer, a 63.2% chance of one or more errors.
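Here's a hedged sketch of that calculation in Python. The helper name `chance_of_no_error` is hypothetical, and it goes via `math.log1p` rather than computing 1 - 1/(1E+14) directly, which is one way to stay clear of the rounding trouble mentioned above.

```python
import math

def chance_of_no_error(bit_error_rate, bits_read):
    """(1 - rate)^bits, computed through logarithms so tiny rates survive."""
    return math.exp(bits_read * math.log1p(-bit_error_rate))

no_error = chance_of_no_error(1e-14, 1e14)
print(f"no errors: {no_error:.3f}")
print(f"one or more errors: {1 - no_error:.3f}")
```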

Note that the basic statement about the probability of an error remains true - on average, a drive with an Unrecoverable Read Error Rate of one in ten to the fourteen will indeed have one such error per ten to the fourteen reads. But that doesn't **guarantee** such an error in any **particular** ten to the fourteen reads, any more than the fact that a coin comes up heads and tails equally often guarantees that you'll get one of each if you throw it twice.

Now, a RAID that's 63.2% likely to have an error if one of its drives fails is still not a good thing. But there's a big difference between 63.2% and "almost certain".

(Note also that we're talking about a **lot** of data, here. At fifty megabytes per second, ten to the fourteen bits will take about 2.8 **days** to read.)
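The read-time figure is simple division, sketched below. The 50 megabytes per second is the figure from the text; whether a "megabyte" means 10^6 or 2^20 bytes is an assumption the sketch varies, since it shifts the answer slightly. The function name `days_to_read` is made up for illustration.

```python
def days_to_read(bits, megabytes_per_second, bytes_per_megabyte):
    """Days needed to read `bits` at a steady transfer rate."""
    bytes_total = bits / 8
    seconds = bytes_total / (megabytes_per_second * bytes_per_megabyte)
    return seconds / 86400   # seconds in a day

for label, size in (("decimal megabytes", 10**6), ("binary megabytes", 2**20)):
    print(f"{label}: {days_to_read(1e14, 50, size):.1f} days")
```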

Getting the statistics right also shows how quickly the odds improve if the error rate can be reduced.

If drive manufacturers manage to reduce the error rate by a factor of ten, for instance, so now it's one in every ten to the **fifteen** reads instead of every 1E+14, the chance that you'll get no such errors in a given ten to the fourteen reads improves to about 90.5%.

If they reduce the error rate all the way to one in ten to the **sixteen**, then ten to the fourteen reads are 99.0% likely to all be fine.
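A sketch of the same sum for all three error rates follows. It computes each figure two ways, because at one in ten to the sixteen the quantity "1 minus the error rate" can no longer be stored accurately in an ordinary double-precision number. The naive version then comes out a little low (around 0.989 on typical hardware), while the logarithm-based version gives about 0.990. Both function names are hypothetical.

```python
import math

def survival(rate, bits_read):
    """Chance of zero errors, via log1p so that 1 - rate is never formed."""
    return math.exp(bits_read * math.log1p(-rate))

def naive_survival(rate, bits_read):
    """Same formula written directly; 1 - rate gets rounded before the power."""
    return (1 - rate) ** bits_read

bits_read = 1e14
for rate in (1e-14, 1e-15, 1e-16):
    print(f"rate {rate:g}: careful {survival(rate, bits_read):.4f}, "
          f"naive {naive_survival(rate, bits_read):.4f}")
```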

I'm not saying it's necessarily easy to make such an improvement in the read error rate, especially in the marketing-bulldust-soaked hard-drive industry.

But neither is the situation as dire as the "almost certain" article says.

All who commit such crimes against mathematical literacy are hereby sentenced to read John Allen Paulos' classic *Innumeracy*.

(This is not a very severe sentence, since the book is actually rather entertaining.)

23 October 2008 at 4:49 pm

RAID is nice...

...but off-site backups are necessary.

23 October 2008 at 5:16 pm

The probabilities are determined by Poisson statistics, which describe the probabilities of discrete random occurrences that occur at a known rate. An interesting point is that when you have the rate and the number of trials the same (e.g. one million trials, event probability one in a million) the probability of the event occurring once is the same as it not occurring at all: 36.8%. The remaining 26.4% are multiple occurrences.
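Those three figures can be checked with the standard Poisson formula. This is a minimal sketch; `poisson_pmf` is a hand-rolled helper written for illustration, not a library call.

```python
import math

def poisson_pmf(k, mean):
    """Probability of exactly k events when the expected count is `mean`."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

mean = 1.0   # e.g. a million trials at one in a million each
p_zero = poisson_pmf(0, mean)
p_one = poisson_pmf(1, mean)
p_multiple = 1 - p_zero - p_one

print(f"none: {p_zero:.3f}, exactly one: {p_one:.3f}, two or more: {p_multiple:.3f}")
```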

*sigh*

I miss working as a professional geek :(

23 October 2008 at 6:57 pm

> Fortunately, the mere fourteen-digit numbers we’re working with here are well within normal computer precision.

That statement deserves some clarification.

Using *single* precision floating point calculation is NOT enough, as it will internally save 1-1E-14 as 1.0000. (It only uses 32 bits in total, and once you've accounted for sign and exponent, there's only 23 bits left for the significand.)

The more common *double* precision, with its 64 bits (52 bits significand), on the other hand will store it as 0.9999999999999900079927783735911361873150, an error of 8E-18, which will give a result rate of 0.3682.

Using *extended* precision with 80 bits (64 bits significand) gives a final result of 0.36788.

I've been beating my compiler for half an hour, but I can't force it to use 128 bit floats....
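The single and double precision cases can be poked at from Python, whose floats are IEEE doubles. This is only a sketch: it squeezes the value through a 32-bit float with the `struct` module to mimic single precision, and it doesn't attempt the 80-bit extended case, which plain Python doesn't offer.

```python
import struct
from decimal import Decimal

value = 1 - 1e-14                      # held as a 64-bit double

# Round-trip through a 32-bit float to see what single precision keeps.
as_single = struct.unpack("f", struct.pack("f", value))[0]
print(as_single == 1.0)

# Decimal(float) shows the exact value the double actually holds.
stored = Decimal(value)
print(stored)

# The zero-error chance computed with that slightly-off double.
result = value ** 1e14
print(result)
```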

23 October 2008 at 7:32 pm

"And if you decide to toss a coin ten times, and want to know how likely it is that it’ll never come up tails, then the calculation will be (1/2)^10, a miserable 0.00098."

When I was younger, I figured a way to flip a 50c coin and *nearly always* get the side I declared. I think my record was up near 50. That's when I decided that I was a master coin-flipper and lost interest.

A decade later and out of practise, I was with a bunch of mates at a pub when I happened to mention my amazing ability (we were talking about randomness) and bet the next round that I could flip 20 in a row.

I didn't have a 50c piece, so borrowed one. As I reached 10 the cheers were loud. As I flipped 15 there was silence. Next flip I dropped it. It landed on the floor on the wrong side.

There was much jeering and rejoicing in the affirmation that random is random and that sometimes, beer is free.

The 50c went towards the tab. :)

23 October 2008 at 11:10 pm

So, the chances of flipping a coin 92 times and having it come up heads each time is approximately 0.0000000000000000000000000002 or 2x10^-26 percent... Of course, after a coin has flipped 91 times and come up heads, the chances of it coming up heads the 92nd time is still 0.5 or 50%... (Just had to put that in there.)

24 October 2008 at 12:00 am

As with Major Malfunction's comment, in the real world if you flip a coin 91 times and it comes up heads every time, you're actually likely to suspect there's something fishy going on, making the coin not entirely fair :-).

24 October 2008 at 12:15 am

You don't actually need to suffer through horrible precision issues: for small p and large n, (1-p)^n approaches e^(-pn), and this distribution approaches the exponential distribution beloved of engineers (see MTBF).
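A quick sketch of how good that approximation is, comparing (1-p)^n with e^(-pn) for the die, the one-in-a-million case and the disk-drive case. Both function names are made up for illustration, and the "exact" version is evaluated through `math.log1p` so it isn't itself a victim of rounding.

```python
import math

def exact_no_event(p, n):
    """(1 - p)^n evaluated through logarithms to dodge rounding of 1 - p."""
    return math.exp(n * math.log1p(-p))

def approx_no_event(p, n):
    """The small-p, large-n approximation e^(-p*n)."""
    return math.exp(-p * n)

for p, n in ((1 / 6, 6), (1e-6, 1e6), (1e-14, 1e14)):
    print(f"p={p:g}, n={n:g}: exact {exact_no_event(p, n):.6f}, "
          f"approx {approx_no_event(p, n):.6f}")
```

The approximation is visibly off for six throws of a die, and indistinguishable at this number of decimal places for the two big cases.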

Now if only I could convince my sysadmin that RAID!=backup.

24 October 2008 at 12:55 am

or if you're lazy you can just use an arbitrary precision calculator, tell it you want 100 digits of precision and press go...

≈0.367879441171440482198317912942188734802555237040082071548360806899173123091441309312472039260147817
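Python's standard `decimal` module is one such arbitrary precision calculator. A minimal sketch, assuming 100 significant digits is what's wanted:

```python
from decimal import Decimal, getcontext

getcontext().prec = 100                  # significant digits to carry

p = Decimal(1) / Decimal(10) ** 14       # exactly 1E-14, no binary rounding
no_error = (1 - p) ** (10 ** 14)         # chance of zero errors in 1E+14 reads

print(no_error)
```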

24 October 2008 at 3:37 am

One factor I'm not sure is being considered is that drive failure rates (which is what unrecoverable errors are, if the drive didn't fail the read could be retried successfully) aren't correlated with volume of data read, which makes 1-error-in-12TB statistics meaningless. Drive failure rates are also highly variable over the life of the drive, starting off at their maximum during the break-in period, then dropping to their normal level, and then steadily increasing as the drive exits its useful service life. I find it easy to believe that a RAID5 array built with new drives fresh out of the box could experience multiple drive failures during a short period of time, but after six months this becomes substantially less likely. This is also why you keep hot-spares, so the drive you are swapping in isn't still in its break-in period.

24 October 2008 at 3:53 am

I don't think drive failure and unrecoverable read errors are the same. I mean, yes, any unrecoverable error is a drive failure, but a drive that's failed to read something is not necessarily a dead drive.

As far as I know, unrecoverable read errors always happen *after* retry attempts. Sometimes the drive just can't figure out what that bit used to be. I think it's quite possible for the drive to then be able to successfully write and read data from that exact same location, though I presume it'll actually map that location out, probably replacing it with a sector from a section set aside for the purpose.

24 October 2008 at 4:58 am

Now hang on a moment...

The article claims this failure rate will lead to the DEATH of RAID, but wouldn't it be more likely to cause the LIFE of RAID? If a drive has a decent chance of corrupting 1 file every time you fill it up, wouldn't you want to put it into a RAID 5 even more than ever before?

24 October 2008 at 8:23 am

The article also said something about RAID 6 not helping much (RAID 6 has two lots of parity distributed across the disks, instead of just 1 for RAID 5), as it just delays the inevitable. But surely - even when the figures are made correct - this means that for RAID 6 to not help much, that it would require two read errors to occur on two disks in the same stripe. I think the article writer missed that point and simply merged all the numbers together.

24 October 2008 at 11:10 am

Well, rats. Peridot beat me to my point about e. When you mentioned that a 1/1,000,000 chance 1,000,000 times was 36.8%, I immediately recognized that magic number (e^-1).

It takes around 3 million tries to get the probability of at least one failure up to 95%.

24 October 2008 at 7:28 pm

"As with Major Malfunction’s comment, in the real world if you flip a coin 91 times and it comes up heads every time, you’re actually likely to suspect there’s something fishy going on, making the coin not entirely fair :-)."

Au contraire! The coin was fair!

It's the *technique* that was not. *That's* about as random as juggling chainsaws.

24 October 2008 at 11:30 pm

Popup: Just use a bignum library that stores numbers internally as rationals.

To me, the more interesting question is, "Are we really going to lose the whole array if there's one unrecoverable section?" I don't know all the detailed mechanics of how RAID5 works, but that seems to me like it would require a remarkably boneheaded design. Most ordinary single-disk filesystems don't have this property, certainly. Even FAT filesystems usually have at least two copies of the allocation tables, or so I am told, and in practice a filesystem error is usually in a system file, which can be easily replaced because it's a standard part of the operating system and thus extremely non-unique. I would like to think that in the event the RAID controller cannot recover a certain sector from the drives, it would just return an error when attempting to read that sector, the same as an ordinary disk controller would do if a single drive could not recover the sector, and let the software figure out what to do about it, which in most cases would probably just mean marking the sector as bad and possibly corrupting one file. Perhaps someone who is more versed in RAID than I am can discuss this point. (And before someone asks what if the whole RAID is just used to store One Big File, that would probably be a database, and in that case wouldn't you use replication instead of RAID?)

I do think that RAID5 can sometimes be a false economy, depending on the amount of storage capacity you're talking about. Suppose you're talking about a three-drive RAID5 setup with a typical size of drive (in late 1998, we'll say 100GB, or maybe 150). By using RAID5 instead of RAID1, you get 2/3 of the total capacity instead of 1/2, which sounds like a win. But in a lot of cases it's probably a loss, because the three drives probably cost more than the two drives of greater capacity that you would need to get the same amount of storage out of RAID1. The simplicity of RAID1 is worth something, as well.

It is, of course, still no substitute for offsite backups. Any data you would still want in the event of a complete hardware failure of the entire computer should be backed up offsite. Fire, theft, lightning, malicious bogus anonymous tips to overzealous law enforcement, ... there are any number of exciting ways you can lose all the computers in the building at one shot.

24 October 2008 at 11:43 pm

Did I really write "late 1998"? Make that "late 2008". And yeah, as of October 24th on Newegg, the cheapest 160GB drive runs about $41 (so, three of them would be $123 plus shipping), but for $58 you can get a 320GB drive (so two of them would be $116 plus shipping). So unless you actually need the performance characteristics of RAID, which with a consumer-grade RAID controller are probably nothing to write home about anyway, the RAID1 setup is actually cheaper.

That does change when you get up into the larger capacities and higher performances of products that aren't really aimed at home users, so I'm not saying RAID5 doesn't have any real uses. But for a typical home or small-office user, it's uneconomic.
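The comparison above, as a small sketch. The prices and drive sizes are the ones quoted in the comment; the function names are hypothetical, and the usable-capacity rules are the usual ones (RAID 5 gives up one drive's worth of space to parity, a two-drive RAID 1 mirror holds one drive's worth of data).

```python
def raid5_usable_gb(drive_gb, drives):
    """RAID 5 spends one drive's worth of space on parity."""
    return drive_gb * (drives - 1)

def raid1_usable_gb(drive_gb):
    """A two-drive RAID 1 mirror holds one drive's worth of data."""
    return drive_gb

raid5 = {"usable_gb": raid5_usable_gb(160, 3), "cost": 3 * 41}
raid1 = {"usable_gb": raid1_usable_gb(320), "cost": 2 * 58}

print("RAID 5:", raid5)
print("RAID 1:", raid1)
```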

25 October 2008 at 12:07 am

RE: Dan's comment about the coin not being fair -- you don't *really* have to worry until somebody calls "edge".

27 October 2008 at 4:40 am

`Losing one bit shouldn’t lose you more than one file.`

Not quite. Losing one bit in one RAID stripe basically means you've lost all that stripe (due to the way RAID checksums work, the whole stripe will be marked as invalid). A typical stripe size is 64 kB, which must then be multiplied by the number of drives (minus one, for RAID-5). That will tell you how much data you've lost. If the total amount of data is smaller than one file system allocation unit, then you probably only lost one file (unless that part of the filesystem was compressed, which can mean more than one file per allocation unit). If it was bigger, you may have lost more files.
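A rough sketch of that multiplication. The 64 kB figure is from the comment; the four-drive array and the 4 kB allocation unit are assumptions picked purely for illustration, and the function name is made up.

```python
def lost_kb_per_bad_stripe(chunk_kb, drives, parity_drives=1):
    """Data (not parity) in one stripe: chunk size times the data-bearing drives."""
    return chunk_kb * (drives - parity_drives)

ALLOCATION_UNIT_KB = 4                                    # assumed cluster size
lost_kb = lost_kb_per_bad_stripe(chunk_kb=64, drives=4)   # assumed 4-drive RAID 5
clusters_hit = lost_kb // ALLOCATION_UNIT_KB

print(f"{lost_kb} kB lost, spanning {clusters_hit} allocation units")
```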

Anyway, you are right that (unlike the ZDNet article claims) the risk is mainly to individual files (or small groups of files), not to the *whole* array.