Click-Click-Knock, says no hard drive, ever.

This is what death looks like

My desktop/workspace requires two running machines to function. When I tell people this, they usually find it weird, to say the least, and overly complicated. However, I can’t say I care. I’ve been through too much to care – lost hard drives, lost important files, lost whole servers and months’ worth of work. I’ve seen it all. That all-too-familiar ‘click-click-knock’ sound that’s never supposed to come from your hard drive is a sound I’m familiar with and prepared for.

At approximately 21:30 this evening, my desktop’s hard drive started giving out death rattles. In all fairness, I should have seen this coming: lately I’ve been seeing an increase in OOM/segfault errors and random systemd-coredump processes. Sure, it’s an older computer, but it never used to have these problems. At 21:35 the second series of knocks started. I’m not going to lie, I did kinda ignore the first set, but this second set confirmed it: this ship is sinking, fast.

The logs confirmed this:

Sep 27 20:37:29 whiteroom kernel: [1098349.923549] ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Sep 27 20:37:29 whiteroom kernel: [1098349.925854] ata3.00: failed command: READ DMA
Sep 27 20:37:29 whiteroom kernel: [1098349.928440] ata3.00: cmd c8/00:00:08:16:40/00:00:00:00:00/e0 tag 0 dma 131072 in
Sep 27 20:37:29 whiteroom kernel: [1098349.928440] res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 27 20:37:29 whiteroom kernel: [1098349.933417] ata3.00: status: { DRDY }
Sep 27 20:37:29 whiteroom kernel: [1098350.040197] ata3: soft resetting link
Sep 27 20:37:34 whiteroom kernel: [1098355.193652] ata3.00: qc timeout (cmd 0xec)
Sep 27 20:37:34 whiteroom kernel: [1098355.193660] ata3.00: failed to IDENTIFY (I/O error, err_mask=0x4)
Sep 27 20:37:34 whiteroom kernel: [1098355.193663] ata3.00: revalidation failed (errno=-5)
Sep 27 20:37:38 whiteroom kernel: [1098358.926858] ata3: soft resetting link
Sep 27 20:37:38 whiteroom kernel: [1098359.093749] ata3.00: configured for UDMA/133
Sep 27 20:37:38 whiteroom kernel: [1098359.093758] ata3.00: device reported invalid CHS sector 0
Sep 27 20:37:38 whiteroom kernel: [1098359.093777] sd 2:0:0:0: [sda]
Sep 27 20:37:38 whiteroom kernel: [1098359.093780] Result: hostbyte=0x00 driverbyte=0x08
Sep 27 20:37:38 whiteroom kernel: [1098359.093783] sd 2:0:0:0: [sda]
Sep 27 20:37:38 whiteroom kernel: [1098359.093786] Sense Key : 0xb [current] [descriptor]
Sep 27 20:37:38 whiteroom kernel: [1098359.093792] Descriptor sense data with sense descriptors (in hex):
Sep 27 20:37:38 whiteroom kernel: [1098359.093794] 72 0b 00 00 00 00 00 0c 00 0a 80 00 00 00 00 00
Sep 27 20:37:38 whiteroom kernel: [1098359.093810] 00 00 00 00
Sep 27 20:37:38 whiteroom kernel: [1098359.093818] sd 2:0:0:0: [sda]
Sep 27 20:37:38 whiteroom kernel: [1098359.093820] ASC=0x0 ASCQ=0x0
Sep 27 20:37:38 whiteroom kernel: [1098359.093824] sd 2:0:0:0: [sda] CDB:
Sep 27 20:37:38 whiteroom kernel: [1098359.093827] cdb[0]=0x28: 28 00 00 40 16 08 00 01 00 00
Sep 27 20:37:38 whiteroom kernel: [1098359.093840] end_request: I/O error, dev sda, sector 4199944

Another set of entries explains why just about every program has been crashing lately:

[root@whiteroom log]# egrep 'segfault|deleted' messages*
messages.1:Sep 27 20:37:49 whiteroom kernel: [1098370.497321] xfce4-panel[362]: segfault at 8 ip b69e4bb6 sp bfed3f00 error 6 in libc-2.19.so (deleted)[b6971000+1bb000]
messages.1:Sep 27 20:37:52 whiteroom kernel: [1098372.766378] xfce4-power-man[375]: segfault at 10 ip 00000010 sp bfca02ec error 4 in xfce4-power-manager[8048000+22000]
messages.1:Sep 27 20:37:52 whiteroom kernel: [1098373.060427] xfsettingsd[376]: segfault at 5 ip b669af42 sp bf8368c0 error 6 in libSM.so.6.0.1[b6699000+7000]
messages.2:Sep 18 00:55:37 whiteroom kernel: [249838.048508] ltbin[11881]: segfault at 0 ip 09a4a41f sp bfd23f00 error 4 in ltbin[8048000+4046000]
messages.2:Sep 20 15:46:54 whiteroom kernel: [476115.351940] ltbin[30538]: segfault at 0 ip 09a4a41f sp bffefa10 error 4 in ltbin[8048000+4046000]
messages.3:Sep 12 11:49:14 whiteroom kernel: [137730.639121] slim[293]: segfault at 40000a ip b72cbbb6 sp bff47c50 error 6 in libc-2.19.so[b7258000+1bb000]

The drive was dying. At this point there’s nothing to do except pray I have enough time to rescue/rsync as many files off of it as I can, right? Ha, no.

With a combination of regular backups and an intricate NAS setup, nothing of value was lost, at all.

There are three main points to this setup: security, ease of use, and availability. This is the setup that allows me to sleep at night:

  • Machine A: This is the NAS box, and it’s headless. Its only job is to store files and maintain the integrity of those files. It’s running the latest stable Debian and ZFS [raidz] (more information on this setup can be found here). This box shares its data with the other machines on the network, and it can suffer the loss of a drive before things start to get bad. More importantly, this box is backed up to a fourth, encrypted drive monthly. Between this box and the backup drive, all my data is safe. (A rough sketch of this layout follows the list.)

  • Machine B: This is the desktop. Now, I know what you’re asking: why have two machines when this can be done with one? Oh, do keep reading. This machine’s only job is to serve as the desktop. It has user accounts but no local home folder. Instead, home is pulled from the NAS box and mounted as /home. That’s the key to this setup: /home (where all my user data, movies and music reside) is not stored locally on the desktop. This dramatically decreases the chances of /home getting wiped out by bad programs, accidental commands and, more importantly, bad hard drives.
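
To make that concrete, here is a minimal sketch of how a layout like this could be wired together. Everything below is illustrative only: the pool name, device names, subnet and paths are placeholders, not my actual configuration.

# On the NAS (Machine A): build a raidz pool from three disks and give
# home directories their own filesystem
zpool create tank raidz /dev/sdb /dev/sdc /dev/sdd
zfs create -o mountpoint=/srv/home tank/home

# Share it with the rest of the network over NFS -- one line in /etc/exports:
#   /srv/home   192.168.1.0/24(rw,sync,no_subtree_check)
exportfs -ra

# Monthly, copy everything to a fourth, encrypted (LUKS) drive
cryptsetup luksOpen /dev/sde backup
mount /dev/mapper/backup /mnt/backup
rsync -aHAX --delete /srv/home/ /mnt/backup/home/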

As for the desktop, I could literally:

• physically beat this machine with a bat
• light it on fire and watch it burn
• delete the hard drive’s MBR
• physically open the computer, rip out the hard drive while it’s running, take the hard drive and toss it in a microwave oven
• run shred, dd, or any number of other file/disk-destroying commands on this machine’s hard drive

… and /home would not be affected by any of the above.

This is why I have two machines. When I heard the hard drive announce that it was on its last legs, I wasn’t distraught, because nothing was going to be lost.

To restore this machine, I simply have to get another hard drive (although I might boot from a USB drive temporarily), reinstall the OS, mount /home, and continue with life.
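
In practice, “mount /home” is a single line in the fresh install’s fstab. Again, this is just a sketch; the hostname, export path and mount options are placeholders:

# /etc/fstab on the desktop (Machine B): pull /home from the NAS over NFS
#   nas:/srv/home   /home   nfs   defaults   0 0
mount /home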

The moral of this story: there’s a method to my madness, and backups mean you can sleep better at night. So the next time your hard drive crashes, you can do what I did: hit a bar, not because you’re sad and distraught, but because they have amazing wings and your data is safely tucked away on four different drives.

More information on the NAS/ZFS setup can be found here.