ZFS Part 2: Disk Failure
Before I’m ready to trust ZFS, I need to be sure I can replace a disk when it dies. Starting from the setup described here, my first experiment was to remove the primary disk.
So: power down and remove the primary disk (ad4). Note that if you’re doing this on the ProLiant system I mentioned, you really should replace the drive mount (it’s needed for cooling). Luckily I have a spare system, so I just borrowed one.
Reboot. Comes up fine on the secondary disk without further intervention.
$ zpool status
  pool: scratch
 state: ONLINE
 scrub: none requested
config:

        NAME            STATE     READ WRITE CKSUM
        scratch         ONLINE       0     0     0
          gpt/scratch8  ONLINE       0     0     0

errors: No known data errors

  pool: system
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-2Q
 scrub: none requested
config:

        NAME            STATE     READ WRITE CKSUM
        system          DEGRADED     0     0     0
          mirror        DEGRADED     0     0     0
            gpt/system8 ONLINE       0     0     0
            gpt/system4 UNAVAIL      0     0     0  cannot open

errors: No known data errors
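As an aside, if you just want a quick health check, zpool status takes a -x flag that reports only pools with problems. A minimal sketch (the flag is standard, though the exact wording of the healthy message may vary between versions):

# Show only unhealthy pools; here this would print just the
# degraded system pool, omitting the healthy scratch pool
$ zpool status -x

# With everything healthy it prints a single line instead:
# all pools are healthy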
Note that the system pool is now degraded. How would we have known if we hadn’t checked? Well, it turns out we missed something in the previous setup.
We should have put

daily_status_zfs_enable="YES"
daily_status_gmirror_enable="YES"

in /etc/periodic.conf. Then in the daily mail we’d see:
Checking status of zfs pools:
  pool: system
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-2Q
 scrub: none requested
config:

        NAME            STATE     READ WRITE CKSUM
        system          DEGRADED     0     0     0
          mirror        DEGRADED     0     0     0
            gpt/system8 ONLINE       0     0     0
            gpt/system4 UNAVAIL      0     0     0  cannot open

errors: No known data errors

Checking status of gmirror(8) devices:
       Name    Status  Components
mirror/swap  DEGRADED  gpt/swap8
So remember, boys and girls, read your daily mails!
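Incidentally, there’s no need to wait a day to check that the new settings took: the daily scripts can be run by hand with periodic(8). A quick sketch (by default the output is mailed to root; the daily_output variable in periodic.conf(5) can redirect it):

# Run the daily periodic scripts now instead of waiting for cron;
# the report, including the ZFS and gmirror status checks, is
# mailed to root by default
$ periodic daily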
So far, so good. One disk failed, the system came back up without intervention, and it would have alerted us in the daily mails had we configured them correctly (which, of course, they now are). So what happens if we put the disk back in? Since we’ve modified the surviving disk in the meantime, we’d hope the two would get reconciled. Let’s see…
Power down and replace the missing disk, reboot.
Now we see
$ zpool status
  pool: scratch
 state: ONLINE
 scrub: none requested
config:

        NAME            STATE     READ WRITE CKSUM
        scratch         ONLINE       0     0     0
          gpt/scratch8  ONLINE       0     0     0

errors: No known data errors

  pool: system
 state: ONLINE
 scrub: resilver completed after 0h0m with 0 errors on Sat Mar 26 10:48:56 2011
config:

        NAME            STATE     READ WRITE CKSUM
        system          ONLINE       0     0     0
          mirror        ONLINE       0     0     0
            gpt/system8 ONLINE       0     0     0
            gpt/system4 ONLINE       0     0     0  345K resilvered

errors: No known data errors

$ gmirror status
       Name    Status  Components
mirror/swap  COMPLETE  gpt/swap4
                       gpt/swap8
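Note that both the resilver and the gmirror resync kicked off by themselves as soon as the devices reappeared. Had they not, the pool’s own action message tells us what to do; something like this should work (a sketch, assuming the same GPT labels as above):

# Bring the returned ZFS device back online, per the action message
$ zpool online system gpt/system4

# Belt and braces: scrub the pool once the resilver has finished
$ zpool scrub system
$ zpool status system

# gmirror normally resynchronises a stale component automatically;
# if it doesn't, it can be kicked manually
$ gmirror rebuild swap gpt/swap4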
And there we are, back to where we started. But suppose the disk had really failed, then what? See the next exciting installment!