Monday, 3 November 2008

A Rant about RAID, with a Bad Metaphor about Eggs, and No Happy Ending.

I went in to work this morning and my main workstation had died over the weekend. Bluescreen on boot, no safe mode, nothing. Windows Update gone bad? We'll probably never know, given I don't think it's coming back any time soon... but, as with previous overnight machine suicides, it looks like a problem with SATA RAID - specifically, two WD Velociraptors in a RAID-1 (mirror) array controlled by an Intel ICH10R chipset on an Asus P5Q motherboard.

You know your whole eggs & baskets thing, right? SATA RAID is like carefully dividing your eggs into two really good baskets, then tying them together with six feet of wet spaghetti and hanging them off a ceiling fan.

Long story short, I lost a day, and counting. I had to split the mirror into individual drives, switch the BIOS back to IDE, which gave me a bootable OS but - seriously - no text. No captions, no icon labels, no button text, nothing. Just these weird, ghostly empty buttons. Running a repair off the WinXP x64 CD got my labels back, but somehow left Windows on drive D. Another half-hour of registry hacks to get it back to drive C: where it belongs, and I had a creaking but functional system - VS2008 and Outlook are working, but most of my beloved little apps are complaining that someone's moved their cheese. Reinstalling is probably inevitable, along with the deep, deep joy that is reinstalling Adobe Creative Suite when your last remaining "activation" is bound to a PC that now refuses to deactivate it. Even Adobe's support team don't understand activation. Best they could come up with was "yes, that means there's no activations on that system." Err, no, Mr. Adobe, there are. It was very clear on that point. Wouldn't let me run Photoshop without it, you see. "Oh... then you'd better just reformat, and when you reinstall, you'll need to phone us for an activation override". Thanks, guys. I feel the love.

Sorry, I digress. This whole experience is all the more frustrating because RAID mirrors are supposed to be a Good Thing. If you believe the theory, RAID-1 will let you keep on working in the event of a single drive failure. Well... In the last 5 years or so, I haven't had a single workstation die because of a failed hard drive, but I've lost count of the number of times an Intel SATA RAID controller has suddenly thrown a hissy-fit under Windows XP and taken the system down with it. Every time it starts with a bit of instability, ends up a week or two later with bluescreens on boot and general wailing and gnashing of teeth, and every time, running drive diagnostics on the physical disks shows them to be absolutely fine.

This is across four different motherboards with Intel chipsets - two Abit, one Asus, and one in a Dell Precision workstation - covering both the ICH9R (P35) and ICH10R (P45) chipsets, and various matched pairs of WD Caviar, WD Raptor, WD Velociraptor and Seagate drives. The Dell was a stock system; the others are various home-built combinations, all thoroughly memtest86'ed and burned-in before being put into production doing anything important.

Am I doing something wrong here? I feel like I've invested enough of both my and my employer's time and money in "disaster-proofing" my working environment, and just ended up shooting myself in the foot. I'm beginning to think that having two identical workstations, with a completely non-RAID-related disk-mirroring strategy, is the only way to actually guarantee any sort of continuity - if something goes wrong, you just stick the spare disk in the spare PC and keep on coding. Or hey, just keep stuff backed up and whenever you lose a day or two to HD failure, tell yourself it's nothing compared to the 5-10 days you'd have lost if you'd done something sensible like using desktop RAID in the first place.

[Photo from bartmaguire via Flickr, used under Creative Commons license. Thanks Bart.]


jeremygray said...

Assuming for the moment that the array was still considering itself intact and valid, since you mentioned that the individual drives are okay and made no mention of the RAID controller complaining about integrity...

Somewhere between your OS and your apps they corrupted your data and the RAID did exactly what it is supposed to: write that data to both drives and make sure it got there safely.

Sounds to me like the thing that failed in this case (not entirely unlike other events in recent press) is your backup strategy.

JohnW said...

Dylan, I posted a reply of sorts on my blog. Consider using Windows software-based RAID if you're able to run server products or use VMs for development tools/work only.

AMcGuinn said...

If windows + raid doesn't work properly, it may not be raid that's at fault.

jeff said...

It sounds to me like your OS decided to hork itself, and the RAID did exactly what it is supposed to -- write down what the OS tells it. You gave no evidence of anything failing but yourself or Windows, and it is a safe bet that Windows is the culprit that corrupted things.

I have been using RAID on hundreds of boxes for years, and I have never seen a case where I could definitively state it was the fault of the RAID when something died -- failures like this are almost always 'input' related -- bad input is RAIDed out correctly.

Simply use a backup strategy and get back on your feet when your OS horks like that -- since RAID is not designed to recover from bad instructions.

Jay Bazuzi said...

Painful day for you. I've been there.

RAID = complexity, hence some new risks.

Windows Home Server = nightly backup that's good enough for me to rely on. Can even roll back a few days if I need to. Nice when I want to upgrade the main HD, too. Desktop RAID 0 maybe for performance.

dun3 said...

Hey Dylan,

been there. ;) One thing that I learned from the experience: raid is not a backup. Raid is for performance improvements and/or uptime improvements - but it will never help when the OS decides to write awful stuff to your disk (or if you accidentally delete important data). Since your Desktop PC does not have the strict requirement to be up 24/7, Raid should imho only be used for performance - and never for a false sense of security.

Since I learned that the hard way, I have been using TrueImage for daily backups to several locations - and switched to Raid 0 (striping).


Steve Byan said...

Does the intel raid controller have a battery-backed-up non-volatile write-cache? Probably not. Is the volatile write cache in the SATA disks disabled? Probably not. Is your system on a UPS? Probably not.

Suppose your system is writing to the disks. The RAID controller mirrors the writes to both disks. The disks cache the writes in their volatile write cache. The disk firmware will flush out the cached writes to the disk media over a 30 second interval.

Suppose some cached writes on one disk get written to the media, but the mirrored copy on the other disk hasn't yet been written to the media. Now suppose the power fails, or Windows crashes (which causes a hardware reset to be sent to the disk drives). You now have a mirrored pair of disks which do not contain the same data.

What data will you get back when you read from the RAID-1? Flip a coin - sometimes you'll get the old data, from the disk where the cached write didn't make it to the media. Other times you will get the new data.

Also note that if the disk cache is enabled, a power failure or system crash will discard whatever data is in the disk's cache and which hasn't yet been written to the media. This can lead to file system corruption, and I've seen this result in a blue-screen after running Windows Update.
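The two failure modes described above (a mirror whose halves disagree after a crash, and cached writes that simply vanish) can be illustrated with a toy model. This is only a sketch under stated assumptions: `CachedDisk` and `Mirror` are hypothetical names invented for the illustration, the "media" is a Python dict, and nothing here reflects how the Intel controller or any real drive firmware is actually implemented.

```python
import random


class CachedDisk:
    """Toy disk (hypothetical): a volatile write cache in front of persistent media."""

    def __init__(self):
        self.media = {}  # block number -> data that survives a reset
        self.cache = {}  # block number -> data that is lost on a reset

    def write(self, block, data):
        # Writes are acknowledged as soon as they land in the volatile cache.
        self.cache[block] = data

    def flush(self):
        # The firmware eventually copies cached writes to the media.
        self.media.update(self.cache)
        self.cache.clear()

    def reset(self):
        # Power loss or hardware reset: unflushed cache contents are discarded.
        self.cache.clear()

    def read(self, block):
        return self.cache.get(block, self.media.get(block))


class Mirror:
    """Toy RAID-1 (hypothetical): every write goes to both disks, reads pick one."""

    def __init__(self):
        self.disks = [CachedDisk(), CachedDisk()]

    def write(self, block, data):
        for disk in self.disks:
            disk.write(block, data)

    def read(self, block):
        # Assumption: the controller may serve a read from either half.
        return random.choice(self.disks).read(block)

    def crash(self):
        for disk in self.disks:
            disk.reset()

    def in_sync(self):
        return self.disks[0].media == self.disks[1].media


mirror = Mirror()
mirror.write(7, "old")
for disk in mirror.disks:
    disk.flush()             # both halves now hold "old" on the media

mirror.write(7, "new")       # mirrored into both caches
mirror.write(8, "metadata")  # also only in the caches so far
mirror.disks[0].cache.pop(8) # model disk 0 flushing block 7 only...
mirror.disks[0].flush()      # ...so "new" reaches the media on disk 0
mirror.crash()               # disk 1 loses "new"; both lose block 8

print(mirror.in_sync())                       # False
print([d.read(7) for d in mirror.disks])      # ['new', 'old']
print([d.read(8) for d in mirror.disks])      # [None, None]
```

After the simulated crash, `mirror.read(7)` returns "new" or "old" depending on which half happens to serve the read, which is the coin flip described above, and block 8 is gone from both halves even though the write was acknowledged.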

To sum up:

If your RAID controller doesn't have battery-backed-up NV-RAM, put your system on a UPS.

If your RAID controller doesn't disable the disk cache (or allow the use of WRITE FUA to force NTFS's metadata writes through the disk's cache and onto the disk media), then buy another RAID controller.
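WRITE FUA is issued by the storage stack, not by applications, so it cannot be shown directly here. The nearest thing an application can do is ask the operating system to commit a file before carrying on. The sketch below does that with Python's `os.fsync`; the helper name `durable_write` and the file name are hypothetical, and whether the request actually reaches the disk media still depends on the drive and controller honouring the flush, which is exactly the concern raised above.

```python
import os
import tempfile


def durable_write(path, data):
    """Write bytes to path and ask the OS to commit them before returning."""
    with open(path, "wb") as f:
        f.write(data)
        f.flush()             # drain Python's own buffer into the OS
        os.fsync(f.fileno())  # ask the OS to push the file to the device


# Hypothetical usage: a scratch file in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "journal.bin")
durable_write(path, b"metadata update")
```

This trades write speed for a smaller window in which acknowledged data lives only in volatile memory; it does not replace a UPS or a battery-backed controller cache.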

tegbains said...

We run all of our servers and important workstations using external RAIDs in a RAID 1 configuration. But the way these RAIDs work is that the BIOS and OS have no idea that there is a RAID involved. They just see the RAID drive box as a single drive.

The external RAID boxes have an LCD screen to display the drive & RAID status as well as a sound buzzer to indicate problems. We use them via eSATA, USB and Firewire 800. And they work quite well overall.

Here's a link to the Stardom 3610 units that we use (and sell).

As for drives dying, I have only recently sent back 10 hard drives from Western Digital alone, had 6 DOA's from WD, and 2 DOA's from Seagate. We run the manufacturer's drive test on all new and possibly failing drives. SMART helps, but you still need to know what is happening to them.

And lastly, backups are still critical even if you have a RAID 1 system

Steve Byan said...

tegbains said...
"We run all of our servers and important workstations using external RAIDs in a RAID 1 configuration. But the way these RAIDs work is that the BIOS and OS have no idea that there is a RAID involved. They just see the RAID drive box as a single drive."

The specs on the Stardom-3610 web-page you referenced are inadequate. Does it include a battery-backed-up NV-RAM? If not, then it is subject to the "RAID write-hole", and so cannot reliably mirror data without an external UPS.

Does the Stardom-3610 support WRITE FUA? Does the firmware in its RAID controller disable the disk drive cache? If not, then your data integrity is at risk with the Stardom-3610 in the event of a power-loss or system crash.

In the interest of performance, does the Stardom-3610 support SATA native command queuing (NCQ)?

StorageCraft said...

I think the main problem occurred when the operating system and the files running it got corrupted, and the RAID functioned the way it is programmed to: it copied the data to both drives, so the corruption was duplicated too.