Hard Disk Failure – The Best Kept Secret
The average hard disk lasts for three years. That was a disaster recovery statistic that I read, and I am still absorbing it. The secret is that no computer manufacturer, or disk supplier, will admit to disk problems; if they came clean, their market share would collapse. You never see a sticker on a new computer saying, ‘Warning – The disk may fail and you will lose all your data’.
Frankly, until my recent battles with my own server, I did not believe that disks were so unreliable. My message to you is this: learn from my mistakes, or else you are destined to repeat them. Don’t be like me and ignore valuable clues.
From an unexpected source, I had the tip-off that disk failure is a widespread occurrence, but I foolishly chose to ignore the evidence. In the last four years I have sold over 8,000 ebooks. Almost every week someone writes to me saying that their disk failed, they lost my ebook, and could I send them a replacement. At first I thought these were people attempting a scam to get a free ebook; invariably they could not remember their ClickBank or PayPal receipt number. However, when I checked their email addresses against my receipts, incredibly, every single request was from a genuine purchaser. These were not scam merchants, but people who had bought an ebook, so naturally, I sent them another copy.
Foolishly, I still dismissed the disk problem. My twisted mind thought along these lines: ‘These people have just lost their ebook, and to hype up my sympathy, they pretend their disk failed’. I am now ashamed of such thoughts; my only consolation is that I always sent a replacement ebook.
For at least 15 years I had lived a charmed life; none of my machines displayed any hint of a hard disk problem. My old mate ‘Barking’ Eddie summed up the situation with the inelegant but descriptive phrase, ‘Guy, these days manufacturers tune disks to nackerdness’. Eddie and I then reminisced about how we used to tune Mini Cooper S engines to achieve incredible revs; the engines had a short, yet powerful, life, then a piston broke. It now seems the same is happening with disks: huge capacity, fast spinning, but just a three-year life.
Disks rarely crash spectacularly with smoke and flying spindles; the errors are more subtle, namely bad blocks or bad sectors. Windows operating systems, together with NTFS, do their best to move data from bad blocks to healthy regions of the disk. The real problem occurs when the operating system’s own files sit on a bad block. The result surfaces when you reboot – the machine will not start.
Here are symptoms of an impending disk failure.
1) When you inspect the System log in the event viewer you see lots of:
2) When you run chkdsk c: you get:
3) ‘Barking’ Eddie says he can spot a bad disk by how frequently the disk light flashes. Just when I nearly believed him, Eddie went over the top by saying he could always smell a bad disk. He claimed that the smell of death for a disk was a cross between hyacinth and axle grease.
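If you would rather not rely on Eddie’s nose, one rough way to watch for symptom 2 is to save the text that chkdsk prints and look for its ‘KB in bad sectors’ summary line. The sketch below is only an illustration: it assumes an English-language Windows, and the function name bad_sector_kb is my own invention, not part of any Windows tool.

```python
import re

# Matches the summary line that chkdsk prints, such as "12 KB in bad sectors."
# Assumption: English-language chkdsk output; other locales word this line differently.
BAD_SECTOR_LINE = re.compile(r"^\s*(\d[\d,.]*)\s+KB in bad sectors", re.MULTILINE)


def bad_sector_kb(chkdsk_output):
    """Return the KB reported in bad sectors, or None if the summary line is absent.

    bad_sector_kb is a hypothetical helper name used for illustration only.
    """
    match = BAD_SECTOR_LINE.search(chkdsk_output)
    if match is None:
        return None
    # Drop thousands separators such as "1,024" before converting to an integer.
    digits = re.sub(r"\D", "", match.group(1))
    return int(digits)
```

Anything other than 0 is a clue worth taking seriously, and a figure that grows between runs is the kind of evidence I foolishly ignored.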
A brief version of how I recovered from disk failure.
1) Preamble: for a number of years, even for home use, I have only bought machines with two hard drives. So when Harddisk0 failed, I swapped the jumpers and thus reversed the Master / Slave disk pairing. As a result the machine could now boot from the healthy disk (Harddisk1).
2) To start the recovery procedure, I installed a parallel operating system. For example, XP can assist the repair of Windows Server 2003.
3) Then I ran chkdsk x: /f, where x is the affected partition (try them all!). The /f switch fixes bad block errors as best it can.
4) I rebooted the computer with the original operating system CD in the caddy. My mission was not to install a new operating system, but to REPAIR the existing operating system.
5) Confusion of the two repairs. The first R = Repair refers to the Recovery Console. While the Recovery Console is great for configuration problems, it’s of no help here. So if you are trying this routine, avoid the first R = Repair and press Enter. When you see the second option R = Repair, take it: press ‘R’.
6) As setup proceeds, it looks like a fresh installation; you have to have faith that it is actually repairing the original operating system. The first clue is at the very beginning of the startup sequence, where you select a boot.ini option similar to ‘C:\Windows’. The second clue in my case was that setup detected keyboard English – United Kingdom, whereas a new installation would display keyboard English – United States.
My next surprise was that setup asked for the Product Key. Fortunately I had recorded the 25 characters, and moreover, it was a legitimate copy of Windows Server 2003. Thereafter I left setup repairing for 33 minutes. When I returned and rebooted it was truly a magic moment; I now saw ‘Preparing Network Connections’ followed by the logon menu. Phew, I had recovered the system.
7) When I replaced the damaged disk with a brand new disk, I experienced a few finger and thumb problems due to my failing eyesight; my glasses just would not focus on the screw heads in that dark box. At one difficult stage ‘Barking’ Eddie passed me what we call a ‘Birmingham Screwdriver’, but I declined to take his hammer to the computer chassis. Instead I called for a torch – much more useful.
8) There were also logical problems. It was not as straightforward as booting into the alternate operating system and restoring data from backup. If the original drive was C:, then the new drive has to be C:. Easy to say, but tricky to achieve when the alternative operating system is already using C:. The sign that the drive letter was wrong was that the boot halted, displaying error 0xc00002E1.
I flirted with Directory Services Restore Mode, and I read TechNet articles about searching for and replacing drive letters in the registry. It worked, but then I thought, ‘This is probably flaky’. I decided to do the job properly, so I started over. I began by switching the drive letters, so that I could re-restore the backup to the original C: drive letter.
Now that I have successfully replaced the disk and restored the data, I am convinced that disk access is faster. Those bad blocks must have been slowing down performance.
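Step 3 above, running chkdsk with /f against every partition, is easy to get wrong by skipping a drive letter. As a minimal sketch, assuming Python is available on the parallel operating system, the helper below builds one chkdsk command per drive letter. The name chkdsk_commands is hypothetical, and the code only builds the argument lists; it does not run chkdsk.

```python
import string


def chkdsk_commands(drive_letters, fix=True):
    """Build one chkdsk argument list per drive letter, e.g. ['chkdsk', 'D:', '/f'].

    chkdsk_commands is a hypothetical helper for illustration; it only
    constructs the commands and never touches a disk.
    """
    commands = []
    for raw in drive_letters:
        # Accept "c", "C" or "C:" and normalise to a single upper-case letter.
        letter = raw.strip().rstrip(":").upper()
        if len(letter) != 1 or letter not in string.ascii_uppercase:
            raise ValueError(f"not a drive letter: {raw!r}")
        command = ["chkdsk", f"{letter}:"]
        if fix:
            # /f asks chkdsk to fix the errors it finds, as in step 3.
            command.append("/f")
        commands.append(command)
    return commands
```

On Windows, each list could then be handed to subprocess.run from an administrator prompt. Bear in mind that chkdsk /f needs exclusive access to a volume, so it may ask to schedule the check for the next restart.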
Several kind readers have written suggesting that I protect my server by mirroring the disks. I am mulling over this RAID 1 proposal; the downside is that you need to convert the basic disk to a dynamic disk. Mirroring is a great idea, except that you can never boot into another operating system once you switch to dynamic disk. As you know, I like a parallel operating system, not just for disk problems but to cure other problems that arise on this test server.