How to Knock Sense Into an HDD



A couple of days ago I saved two of my HDDs using a sort of magic.

The symptoms were almost identical: during boot the OS dumped a bunch of errors to the terminal, and the same errors kept appearing in the logs. They looked something like this:

Mar 31 07:31:31 rohan kernel: [    1.640757] ata5.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x0
Mar 31 07:31:31 rohan kernel: [    1.641114] ata5.00: irq_stat 0x40000008
Mar 31 07:31:31 rohan kernel: [    1.641317] ata5.00: failed command: READ FPDMA QUEUED
Mar 31 07:31:31 rohan kernel: [    1.641582] ata5.00: cmd 60/08:00:50:00:02/00:00:00:00:00/40 tag 0 ncq 4096 in
Mar 31 07:31:31 rohan kernel: [    1.641582]          res 41/40:00:52:00:02/00:00:00:00:00/40 Emask 0x409 (media error) <F>
Mar 31 07:31:31 rohan kernel: [    1.642365] ata5.00: status: { DRDY ERR }
Mar 31 07:31:31 rohan kernel: [    1.642570] ata5.00: error: { UNC }
Mar 31 07:31:31 rohan kernel: [    1.650046] ata5.00: configured for UDMA/133
Mar 31 07:31:31 rohan kernel: [    1.650057] sd 4:0:0:0: [sdb] Unhandled sense code
Mar 31 07:31:31 rohan kernel: [    1.650061] sd 4:0:0:0: [sdb]
Mar 31 07:31:31 rohan kernel: [    1.650064] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
Mar 31 07:31:31 rohan kernel: [    1.650067] sd 4:0:0:0: [sdb]
Mar 31 07:31:31 rohan kernel: [    1.650069] Sense Key : Medium Error [current] [descriptor]
Mar 31 07:31:31 rohan kernel: [    1.650075] Descriptor sense data with sense descriptors (in hex):
Mar 31 07:31:31 rohan kernel: [    1.650078]         72 03 11 04 00 00 00 0c 00 0a 80 00 00 00 00 00
Mar 31 07:31:31 rohan kernel: [    1.650094]         00 02 00 52
Mar 31 07:31:31 rohan kernel: [    1.650101] sd 4:0:0:0: [sdb]
Mar 31 07:31:31 rohan kernel: [    1.650104] Add. Sense: Unrecovered read error - auto reallocate failed
Mar 31 07:31:31 rohan kernel: [    1.650108] sd 4:0:0:0: [sdb] CDB:
Mar 31 07:31:31 rohan kernel: [    1.650110] Read(10): 28 00 00 02 00 50 00 00 08 00
Mar 31 07:31:31 rohan kernel: [    1.650123] end_request: I/O error, dev sdb, sector 131154
Mar 31 07:31:31 rohan kernel: [    1.650416] Buffer I/O error on device sdb2, logical block 1
Mar 31 07:31:31 rohan kernel: [    1.650711] ata5: EH complete
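
If you want to double-check which sector the kernel is complaining about, the address is right there in the "res" line. Here is a minimal sketch of the decoding, assuming the usual libata layout of that field (status/error:nsect:lbal:lbam:lbah, then the five high-order "hob" bytes, then the device byte). The helper name lba_from_res is made up for this example.

```shell
#!/bin/sh
# Decode the 48-bit LBA from the "res" (or "cmd") field of a libata error line.
# Assumed field layout:
#   status/error:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device
# The body runs in a subshell, so the IFS change does not leak to the caller.
lba_from_res() (
    set -f          # no filename globbing while splitting
    IFS='/:'        # split the field on both separators
    set -- $1
    # $4=lbal $5=lbam $6=lbah $9=hob_lbal ${10}=hob_lbam ${11}=hob_lbah
    # The LBA is the six bytes from most to least significant.
    printf '%d\n' "0x${11}${10}${9}${6}${5}${4}"
)

lba_from_res '41/40:00:52:00:02/00:00:00:00:00/40'
```

It prints 131154, the same sector that end_request reports a few lines later, and the same bytes (00 02 00 52) sit at the end of the sense data dump.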

At the same time, smartctl didn't show any severe errors. The only things that bothered me were these two attributes:

197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       3

A non-zero value in “Pending sectors” means the drive has a sector it could not read and has marked as a candidate for reallocation. It can't relocate the sector yet, because it has no valid copy of the data to move. The sector is either put back into service or remapped to a spare the next time it is written. A non-zero value in “CRC errors” means that data got corrupted somewhere on the link between the HDD and the motherboard. There are several possible causes, and the most common one is a bad or loose cable.
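
To keep an eye on these two counters without reading the whole table every time, something like the following should do. It is only a sketch: smart_raw is a hypothetical helper name, and it assumes the usual ten-column attribute table that smartctl -A prints, with the raw value in the last of those columns.

```shell
#!/bin/sh
# Print the raw value of one SMART attribute.
# Reads `smartctl -A` output on stdin; $1 is the attribute ID, e.g. 197.
# Assumed columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
smart_raw() {
    awk -v id="$1" '$1 == id { print $10 }'
}

# Usage (needs root; /dev/sdb is the drive from the log above):
#   smartctl -A /dev/sdb | smart_raw 197    # pending sectors
#   smartctl -A /dev/sdb | smart_raw 199    # CRC errors
```

If the pending count drops back to 0 after the sector has been rewritten, the drive has dealt with it. The CRC counter, as far as I know, never goes down, so what matters there is whether it keeps growing.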

I decided to check whether the HDD was really bad. The simplest way to do this is rather radical and destructive, but it works: overwrite the whole drive with some value, say zeros. On a truly dead drive the write would simply fail. My first “bad drive” had been gathering dust for over 2 years and I had resigned myself to losing all the data on it, so I ran

# dd if=/dev/zero of=/dev/sdb bs=16M

… and after 5 hours my 2TB HDD was as clean as new.

After this procedure the kernel was quiet and the logs were clean. A few days of work with the drive showed that it was OK, so I could do the same with my other drive, which is part of a RAID1 array. No sooner said than done: the next morning I had another clean HDD and clean logs. Recreating the partition table and rebuilding the mirror was no problem. It has now been running for almost a week, and I haven't noticed any signs of RAID degradation or drive instability. It's too soon to tell, though; let's see what happens in a month. Still, I should check the drives with whdd or something similar… Some time, maybe…
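
As for spotting RAID degradation, here is a small sketch of one way to check. It assumes Linux software RAID (md), where /proc/mdstat shows one character per member in the last bracket of the status line: U for a working member, _ for a missing or failed one. The function name md_degraded is invented for the example.

```shell
#!/bin/sh
# List md arrays that are running with a missing member.
# Reads /proc/mdstat-style text on stdin.
md_degraded() {
    awk '
        /^md/ { name = $1 }                # array line, e.g. "md0 : active raid1 ..."
        /\[[U_]*_[U_]*\]/ { print name }   # status line, e.g. "... [2/1] [U_]"
    '
}

# Usage:
#   md_degraded < /proc/mdstat
```

Empty output means every array has all of its members. An array that is still rebuilding also shows up, since its new member is marked _ until the sync finishes.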