ZFS Part 3: Replacing Dead Disks
As discussed in my previous article, if a disk fails then a ZFS system will just carry on as if nothing has happened. Of course, we’d like to restore the system to its former redundant glory, so here’s how…
Once more, we simulate a failure by removing the primary disk, but this time we replace it with a new, unformatted disk (I guess if the new disk were already bootable you'd need to fix that first).
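(As an aside, nothing needed doing here, but if the replacement disk had been recycled and still carried an old partition table, something like the following would wipe it clean first; the -F forces the destroy even if the table still contains partitions.)

# gpart destroy -F ad4
ad4 destroyed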
Let’s assume we’re several years down the line and no longer have any documentation at all. First off, find your disks by inspecting dmesg. As before we have ad4 and ad8. ad4 is the new disk.
# diskinfo -v ad4 ad8
ad4
        512             # sectorsize
        500107862016    # mediasize in bytes (466G)
        976773168       # mediasize in sectors
        0               # stripesize
        0               # stripeoffset
        969021          # Cylinders according to firmware.
        16              # Heads according to firmware.
        63              # Sectors according to firmware.
        S20BJ9AB212006  # Disk ident.

ad8
        512             # sectorsize
        500107862016    # mediasize in bytes (466G)
        976773168       # mediasize in sectors
        0               # stripesize
        0               # stripeoffset
        969021          # Cylinders according to firmware.
        16              # Heads according to firmware.
        63              # Sectors according to firmware.
        9VMYLC5V        # Disk ident.
This time they are conveniently exactly the same size, despite having different manufacturers (Samsung and Seagate respectively). We already know from the first article in this series that we can deal with disks that don't look the same, and in any case only 250GB is currently replicated. So, let's partition the new disk like the old one…
# gpart show ad8
=>        34  976773101  ad8  GPT  (466G)
          34        128    1  freebsd-boot  (64K)
         162    4194304    2  freebsd-swap  (2.0G)
     4194466  484202669    3  freebsd-zfs  (231G)
   488397135  488376000    4  freebsd-zfs  (233G)

# gpart show -l ad8
=>        34  976773101  ad8  GPT  (466G)
          34        128    1  (null)  (64K)
         162    4194304    2  swap8  (2.0G)
     4194466  484202669    3  system8  (231G)
   488397135  488376000    4  scratch8  (233G)

# gpart create -s gpt ad4
ad4 created
# gpart add -b 34 -s 128 -t freebsd-boot ad4
ad4p1 added
# gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ad4
bootcode written to ad4
# gpart add -s 4194304 -t freebsd-swap -l swap4 ad4
ad4p2 added
# gpart add -s 484202669 -t freebsd-zfs -l system4 ad4
ad4p3 added
# gpart add -t freebsd-zfs -l scratch4 ad4
ad4p4 added
# gpart show ad4
=>        34  976773101  ad4  GPT  (466G)
          34        128    1  freebsd-boot  (64K)
         162    4194304    2  freebsd-swap  (2.0G)
     4194466  484202669    3  freebsd-zfs  (231G)
   488397135  488376000    4  freebsd-zfs  (233G)
Now we’re ready to reattach the disk to the various filesystems.
First, the swap. Since we can't remove the dead disk from the gmirror setup, we tell gmirror to forget the missing component and then insert the new swap partition in its place.
# gmirror forget swap
# gmirror insert -h -p 1 swap /dev/gpt/swap4
# gmirror status
       Name    Status  Components
mirror/swap  DEGRADED  gpt/swap8
                       gpt/swap4 (29%)
and after a while
# gmirror status
       Name    Status  Components
mirror/swap  COMPLETE  gpt/swap8
                       gpt/swap4
Next, the main filesystem. In this case, since the new device has the same name as the old one, we can just write:
# zpool replace system /dev/gpt/system4
If you boot from pool 'system', you may need to update
boot code on newly attached disk '/dev/gpt/system4'.

Assuming you use GPT partitioning and 'da0' is your new boot disk
you may use the following command:

        gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da0
Once more, we've already done this step, so there's no need to do it again. Note that the replace command took a little while to return; don't be alarmed!
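(Incidentally, had the new partition come up under a different label to the one that died, I believe the two-argument form of replace would do the same job; the second label below is made up purely for illustration.)

# zpool replace system gpt/system4 gpt/newsystem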
# zpool status
  pool: scratch
 state: ONLINE
 scrub: none requested
config:

        NAME            STATE     READ WRITE CKSUM
        scratch         ONLINE       0     0     0
          gpt/scratch8  ONLINE       0     0     0

errors: No known data errors

  pool: system
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h0m, 9.77% done, 0h2m to go
config:

        NAME                   STATE     READ WRITE CKSUM
        system                 DEGRADED     0     0     0
          mirror               DEGRADED     0     0     0
            gpt/system8        ONLINE       0     0     0
            replacing          DEGRADED     0     0     0
              gpt/system4/old  UNAVAIL      0     0     0  cannot open
              gpt/system4      ONLINE       0     0     0  221M resilvered

errors: No known data errors
and after not very long
# zpool status
  pool: scratch
 state: ONLINE
 scrub: none requested
config:

        NAME            STATE     READ WRITE CKSUM
        scratch         ONLINE       0     0     0
          gpt/scratch8  ONLINE       0     0     0

errors: No known data errors

  pool: system
 state: ONLINE
 scrub: resilver completed after 0h1m with 0 errors on Sun Mar 27 13:04:02 2011
config:

        NAME             STATE     READ WRITE CKSUM
        system           ONLINE       0     0     0
          mirror         ONLINE       0     0     0
            gpt/system8  ONLINE       0     0     0
            gpt/system4  ONLINE       0     0     0  2.21G resilvered

errors: No known data errors
And we’re all good, back to where we were before. Reboot to check everything is fine.
Note, by the way, that all of this was done on a live system in multi-user mode. Apart from the occasional reboot there was no loss of service whatsoever.
Also, because the primary disk didn't really fail, if I wanted to I could put it in my other machine and end up with a working replicated system there without any need for setup.
There is one niggling question remaining: I started off with one 250 GB and one 500 GB disk. I now have two 500 GB disks, which means the non-redundant scratch filesystem I had before could now become redundant. Alternatively, the two scratch partitions could become part of the system pool, or a bigger non-redundant scratch filesystem.
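(For the record, I think the mirrored-scratch alternative would have been a one-liner, attaching the new scratch partition to the existing single-disk scratch pool; I haven't tested it, since I went a different way.)

# zpool attach scratch gpt/scratch8 gpt/scratch4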
In the end I decided to do the simplest thing, which is to make the scratch partitions part of the larger system pool. If I ever need to rearrange things, that is always possible, either with the help of an additional disk or, less safely, by taking one of the disks out of the pools and rearranging onto that (see a description of doing this kind of thing on FreeNAS).
So, to make them part of the existing pool, first destroy the scratch filesystem (if I'd already used it I'd have to copy its contents elsewhere before starting, but since I haven't, I can just blow it away). Since we mounted the pool directly, we destroy it with zpool:
# zpool destroy scratch
(and we can confirm it has gone with zpool list and zfs list). Just
for naming sanity, I rename the two scratch partitions:
# gpart modify -i 4 -l system8.p2 ad8
ad8p4 modified
# gpart modify -i 4 -l system4.p2 ad4
ad4p4 modified
and since those changes aren't reflected in /dev/gpt, reboot. Then, finally:
# zpool add system mirror /dev/gpt/system4.p2 /dev/gpt/system8.p2
and presto
# zpool list
NAME     SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
system   463G  2.21G   461G     0%  ONLINE  -
Instead of adding the former scratch partitions as a 2nd vdev in your system pool, could you not simply have used gpart to delete the old scratch partitions and enlarge the gpt/system{4,8} partitions? ZFS should just grow to fill the space available.
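Roughly something like this for each disk in turn (untested; you need a gpart recent enough to have the resize verb, ZFS has to be asked to expand via online -e or the autoexpand property, and gpart may balk at resizing an in-use partition without tweaking kern.geom.debugflags):

# gpart delete -i 4 ad8
# gpart resize -i 3 ad8
# zpool online -e system gpt/system8
  (and likewise for ad4)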
Similarly, if you need more space eventually but have no place to install more drives: just replace each disk in turn with a larger capacity one, creating the same size boot and swap partitions and making the rest of the drive one big zfs partition. Let resilvering complete after each swap, and hey presto: enlarged zpool. Without system downtime, unless you have to reboot to physically swap the drives.
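Sketching it with a made-up label (big4) for the zfs partition on the first new disk (untested, and the autoexpand property needs a new enough ZFS; otherwise zpool online -e after the fact does the same job):

# zpool set autoexpand=on system
# zpool replace system gpt/system4 gpt/big4
  (wait for the resilver to finish, then repeat for the other disk)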
Comment by Matthew Seaman — 27 Mar 2011 @ 19:43
Apparently, I could’ve done! I’m a complete ZFS n00b, so forgive the thrashing around. Good to know there’s more ways to do it…
Presumably I could implement your solution even now, though I’m struggling to figure out a clean way without using a temporary disk as a spare (I think it is possible: presumably I’d detach the second disk, grow its partition, export to it, then grow the first disk and attach it to the second? Doesn’t sound like a change that could be done live, though).
Comment by Ben — 28 Mar 2011 @ 14:07
Hmmm… Getting rid of vdevs is not so easy. There isn’t a ‘zpool subtract’ which is the converse of ‘zpool add.’ It can be done though.
Remove ad8 from your zpool, breaking the mirror. Repartition ad8. Create a new zpool using just ad8.
Copy data from ad4 by snapshotting the system zpool and using a zfs send … | zfs receive … pipeline. Unfortunately this doesn't copy across all the ZFS options (read-only, nosuid, noexec etc), so you'll have to fix those up too. Obviously, from this point you'll want to avoid making any significant changes to the data in your system zpool after creating the snapshot, as those changes will be lost in the next step. You can, however, make a new snapshot and update the copy with just the changes between the first and second snapshots — a bit like running rsync repeatedly.
Then the tricky bit — reboot so your system is running off the new zpool. This will involve setting various ZFSes to mount at the root directory (much as you did during install), updating /boot/loader.conf and copying /boot/zfs/zpool.cache over to the new zpool.
Once rebooted, destroy the zpool on ad4, repartition it and add it as a mirror to the new zpool on ad8.
Apart from the reboot in the middle, should be doable without downtime, although you’ll want to avoid any heavy system activity while it’s all in progress.
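In outline, with made-up names (newsys for the replacement pool, system8big/system4big for the repartitioned labels, and a snapshot called migrate), it might look roughly like this; check the filesystem properties and the bootfs setting afterwards, as noted above:

# zpool detach system gpt/system8
  (repartition ad8, giving the big zfs partition the label system8big)
# zpool create newsys gpt/system8big
# zfs snapshot -r system@migrate
# zfs send -R system@migrate | zfs receive -d -F newsys
  (fix up mountpoints, /boot/loader.conf and /boot/zfs/zpool.cache,
   then reboot onto newsys)
# zpool destroy system
  (repartition ad4 to match, with label system4big)
# zpool attach newsys gpt/system8big gpt/system4big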
Comment by Matthew Seaman — 28 Mar 2011 @ 20:37