2015-07-17

Damaged hard disks and data salvaging: stay well away from SpinRite

This is something I posted to MacIntouch in 2011; I'm reposting it here hoping it's easier to find and helps someone.

(Interestingly, I made a Spanish version of this post which has attracted a couple of True Believers. I'm looking forward to seeing what happens in English.)




From my experience with SpinRite (about 6 years ago): stay away from it, it's just snake oil. There are free, effective tools that work and don't need to resort to unbelievable claims to explain how they work.

My case: my hard disk had suddenly started acting up; while booting, the computer would freeze for about 15 minutes and then come back to life, until it had a new 15-minute seizure. The seizures were consistently that long, and always started with a "click" from the disk.

It looked like some kind of mechanical failure, the kind in which reading the data ASAP is important because things will only get worse, quickly. And yet I spent more than 24 hours waiting for SpinRite to do *anything*; it only showed absurd graphs, and it hadn't even finished the first hundred kilobytes of the roughly 120 GB disk. A shorter (but still hours-long) second try showed the same. Since it would never finish at that rate, I interrupted it and went on with a (free, of course) Knoppix LiveCD; I used ddrescue and dd_rescue (different and very useful) to salvage whatever I could.

dd_rescue reads disks and keeps trying to get the data out no matter what errors it finds. ddrescue does the same but takes into account the fact that when there are hardware errors, the failing zone can be truly unreadable, causing lots of (slow!) retries just when you should be as gentle as possible with the drive. ddrescue skips the detected bad parts to extract the readable zones first, and later goes back (automatically!) to the previously skipped zones. Genius, and free.
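
To make the "skip first, go back later" idea concrete, here is a minimal Python sketch. It is only an illustration of the strategy, not how either tool is actually implemented: `read_block`, `fake_read`, the skip size of 8 blocks and the simulated bad blocks are all made up for the example.

```python
def rescue(read_block, n_blocks, skip_after_error=8):
    """Two-pass copy: grab the easy blocks first, retry skipped ones later.

    read_block(i) is a hypothetical callable that returns the bytes of
    block i, or raises OSError when the block cannot be read.
    """
    recovered = {}
    postponed = []

    # Pass 1: read forward; on an error, jump ahead instead of retrying.
    i = 0
    while i < n_blocks:
        try:
            recovered[i] = read_block(i)
            i += 1
        except OSError:
            end = min(i + skip_after_error, n_blocks)
            postponed.extend(range(i, end))
            i = end

    # Pass 2: go back to the postponed blocks, one at a time.
    still_bad = []
    for i in postponed:
        try:
            recovered[i] = read_block(i)
        except OSError:
            still_bad.append(i)

    return recovered, still_bad


# Simulated 32-block disk where blocks 10 and 11 are unreadable.
BAD = {10, 11}

def fake_read(i):
    if i in BAD:
        raise OSError("simulated read error")
    return bytes([i % 256]) * 512

recovered, still_bad = rescue(fake_read, 32)
print(len(recovered), still_bad)  # prints: 30 [10, 11]
```

The point of the first pass is that the healthy 90-something percent of the disk is copied before the drive is stressed by the slow, failing zones.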

When there are hardware failures in the hard disk, a failed read can take seconds instead of milliseconds; about 5 seconds in my case, IIRC. If you take into account that the OS commonly tries to read a lot of sectors each time (read-ahead and caching), the timeouts were well into the minutes whenever one of the bad zones was touched. Finally I found that using hdparm, also in Linux, you can force the reads to be as small as 1 sector, so the timeouts and the whole process became manageable, and I could read about 95% of the disk in about 2 days before it stopped responding altogether.
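
A back-of-the-envelope calculation shows why the read size matters so much. The 5 seconds per failed read is my recollection from above; the request sizes are illustrative assumptions (256 sectors is a typical read-ahead window, not a value I measured), and the worst case assumes every sector in the request is bad.

```python
def worst_case_stall_seconds(sectors_per_request, seconds_per_failed_read=5.0):
    """Worst case: every sector in the request fails and times out in turn."""
    return sectors_per_request * seconds_per_failed_read

# Assumed request sizes, in sectors; only the 1-sector case comes from the post.
for sectors in (256, 8, 1):
    stall = worst_case_stall_seconds(sectors)
    print(f"{sectors:3d} sectors per read: up to {stall / 60:.1f} minutes stalled")
```

With 1-sector reads a bad spot costs seconds, not many minutes, which is what made the whole process bearable.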

After everything was finished, I tried looking for info on SpinRite, which had always sounded somewhat suspicious to me anyway, together with the whole Gibson Research thing. Indeed, what I found was not encouraging. Long story short, what SpinRite claims to do maybe made sense in the 1980s, when hard drives were considerably "dumber" than now (RLL and MFM encodings; maybe that also explains its reliance on DOS and BIOS).

If SpinRite helps now (it can), it's simply because it's forcing the hardware to do what it would have done anyway given the chance: retrying reads, writing 0's to force drive-side reassignment of bad blocks, etc. But of course, if you do that on a really failing disk, you are burning away time that would be better used to take that data to a new disk, instead of just rewriting it on the failing one.

Once you have the salvaged data in an image file or on a new disk, possibly with missing parts, you can mount it in OS X and use more traditional OS X tools to fix/read what you can from it. For example, I used DiskWarrior to get whole parts of the disk image readable again as properly structured folders. And you can always try Data Rescue II (or similar free options, like PhotoRec) to recover lost files...

I had long been wanting to write about the whole ordeal on my blog but never found the time, so I hope leaving this here will help someone. But note that this is my recollection of how it was about 6 years ago!

There was also a comment on the magnetic coating "flaking off"; I don't think so. The heads float above the disk surface at a distance far smaller than the diameter of a human hair, so any flaking would be a quick disaster.

On the reliability of SMART alerts in general: of course they can help, but Google and others published studies a couple of years ago on the reliability of the failure alerts, based on data from tens of thousands of drives from different manufacturers used in their data centers. The result was that only about 50% of the failures are predicted by SMART… but when SMART does predict a failure, it happens in less than 24 hours (which is what the alerts are designed for).

And for the users of SmartMonTools: remember that some SMART diagnostic values are not updated until there is an offline test (which can be initiated by SmartMonTools itself). However, even with those tests being used daily, my last drive failed without warning...
