ZFS Part 3: Replacing Dead Disks

As discussed in my previous article, if a disk fails then a ZFS system will just carry on as if nothing has happened. Of course, we’d like to restore the system to its former redundant glory, so here’s how…

Once more, we simulate a failure by removing the primary disk, but this time we replace it with a new unformatted disk (I guess if the new disk was already bootable you’d need to fix that first).
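If the new disk did carry an old partition table or boot code, wiping it first would be enough. A rough sketch, assuming ad4 is the new disk (this is destructive, so double-check the device name):

# gpart destroy -F ad4        # force-wipe any existing partition table

(Older versions of gpart may want each partition removed with gpart delete -i N before a plain gpart destroy will succeed.)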

Let’s assume we’re several years down the line and no longer have any documentation at all. First off, find your disks by inspecting dmesg. As before we have ad4 and ad8. ad4 is the new disk.
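If the device names really have been forgotten, something along these lines (the grep pattern is just an illustration) will pick the ATA disks out of the boot messages:

# dmesg | grep -E '^ad[0-9]+:'        # list the ATA disks the kernel found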

# diskinfo -v ad4 ad8
ad4
        512             # sectorsize
        500107862016    # mediasize in bytes (466G)
        976773168       # mediasize in sectors
        0               # stripesize
        0               # stripeoffset
        969021          # Cylinders according to firmware.
        16              # Heads according to firmware.
        63              # Sectors according to firmware.
        S20BJ9AB212006  # Disk ident.

ad8
        512             # sectorsize
        500107862016    # mediasize in bytes (466G)
        976773168       # mediasize in sectors
        0               # stripesize
        0               # stripeoffset
        969021          # Cylinders according to firmware.
        16              # Heads according to firmware.
        63              # Sectors according to firmware.
        9VMYLC5V        # Disk ident.

This time they are conveniently exactly the same size, despite having different manufacturers (Samsung and Seagate respectively). We already know from the first article in this series that we can deal with disks that don’t look the same, and in any case only 250GB is currently replicated. So, let’s partition the new disk as the old one…

# gpart show ad8
=>       34  976773101  ad8  GPT  (466G)
         34        128    1  freebsd-boot  (64K)
        162    4194304    2  freebsd-swap  (2.0G)
    4194466  484202669    3  freebsd-zfs  (231G)
  488397135  488376000    4  freebsd-zfs  (233G)

# gpart show -l ad8
=>       34  976773101  ad8  GPT  (466G)
         34        128    1  (null)  (64K)
        162    4194304    2  swap8  (2.0G)
    4194466  484202669    3  system8  (231G)
  488397135  488376000    4  scratch8  (233G)

# gpart create -s gpt ad4
ad4 created
# gpart add -b 34 -s 128 -t freebsd-boot ad4
ad4p1 added
# gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ad4
bootcode written to ad4
# gpart add -s 4194304 -t freebsd-swap -l swap4 ad4
ad4p2 added
# gpart add -s 484202669 -t freebsd-zfs -l system4 ad4
ad4p3 added
# gpart add -t freebsd-zfs -l scratch4 ad4
ad4p4 added
# gpart show ad4
=>       34  976773101  ad4  GPT  (466G)
         34        128    1  freebsd-boot  (64K)
        162    4194304    2  freebsd-swap  (2.0G)
    4194466  484202669    3  freebsd-zfs  (231G)
  488397135  488376000    4  freebsd-zfs  (233G)

Now we’re ready to reattach the disk to the various filesystems.

First, the swap. Since we can’t remove the dead disk from the gmirror setup, we tell gmirror to forget the missing component and then insert the new swap partition in its place.

# gmirror forget swap
# gmirror insert -h -p 1 swap /dev/gpt/swap4
# gmirror status
       Name    Status  Components
mirror/swap  DEGRADED  gpt/swap8
                       gpt/swap4 (29%)

and after a while

# gmirror status
       Name    Status  Components
mirror/swap  COMPLETE  gpt/swap8
                       gpt/swap4

Next the main filesystem. In this case, since the new device has the same name as the old one, we can just write

# zpool replace system /dev/gpt/system4
If you boot from pool 'system', you may need to update
boot code on newly attached disk '/dev/gpt/system4'.

Assuming you use GPT partitioning and 'da0' is your new boot disk
you may use the following command:

        gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da0

Once more, we’ve already done this step, so there’s no need to do it again. Note that this command took a little while; don’t be alarmed!
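For the record, had the replacement partition come up under a different label, both the old and the new device would be named explicitly; a hypothetical example (gpt/newsystem is made up):

# zpool replace system gpt/system4 gpt/newsystem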

# zpool status
  pool: scratch
 state: ONLINE
 scrub: none requested
config:

        NAME            STATE     READ WRITE CKSUM
        scratch         ONLINE       0     0     0
          gpt/scratch8  ONLINE       0     0     0

errors: No known data errors

  pool: system
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h0m, 9.77% done, 0h2m to go
config:

        NAME                   STATE     READ WRITE CKSUM
        system                 DEGRADED     0     0     0
          mirror               DEGRADED     0     0     0
            gpt/system8        ONLINE       0     0     0
            replacing          DEGRADED     0     0     0
              gpt/system4/old  UNAVAIL      0     0     0  cannot open
              gpt/system4      ONLINE       0     0     0  221M resilvered

errors: No known data errors

and after not very long

# zpool status
  pool: scratch
 state: ONLINE
 scrub: none requested
config:

        NAME            STATE     READ WRITE CKSUM
        scratch         ONLINE       0     0     0
          gpt/scratch8  ONLINE       0     0     0

errors: No known data errors

  pool: system
 state: ONLINE
 scrub: resilver completed after 0h1m with 0 errors on Sun Mar 27 13:04:02 2011
config:

        NAME             STATE     READ WRITE CKSUM
        system           ONLINE       0     0     0
          mirror         ONLINE       0     0     0
            gpt/system8  ONLINE       0     0     0
            gpt/system4  ONLINE       0     0     0  2.21G resilvered

errors: No known data errors

And we’re all good, back to where we were before. Reboot to check everything is fine.
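After the reboot, a quick sanity check might look something like this:

# gmirror status          # swap mirror should be COMPLETE
# zpool status -x         # should say all pools are healthy
# swapinfo                # swap should be active on the mirror device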

Note, by the way, that all of this was done on a live system in multi-user mode. Apart from the occasional reboot there was no loss of service whatsoever.

Also, because the primary disk didn’t really fail, if I wanted I could put it in my other machine and end up with a working replicated system there without any need for setup.

There is one niggling question remaining: I started off with one 250 GB and one 500 GB disk. I now have two 500 GB disks, which means the non-redundant scratch file system I had before could now become redundant. Or they could become part of the system pool. Or they could become a bigger non-redundant scratch filesystem.

In the end I decided to do the simplest thing, which is to make the scratch partitions part of the larger system pool. If I ever need to rearrange things, that is always possible, either with the help of an additional disk or even, with less safety, by taking one of the disks out of the pools and rearranging onto that (see a description of doing this kind of thing on FreeNAS).

So, to make them part of the existing pool, first destroy the scratch filesystem (if I’d already used it I’d have to copy it before I started, but since I haven’t I can just blow it away). Since we mounted the pool directly, we destroy it with zpool:

# zpool destroy scratch

(and we can confirm it has gone with zpool list and zfs list). Just for naming sanity, I rename the two scratch partitions:

# gpart modify -i 4 -l system8.p2 ad8
ad8p4 modified
# gpart modify -i 4 -l system4.p2 ad4
ad4p4 modified

and since those aren’t reflected in /dev/gpt, reboot. Then finally

# zpool add system mirror /dev/gpt/system4.p2 /dev/gpt/system8.p2

and presto

# zpool list
NAME     SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
system   463G  2.21G   461G     0%  ONLINE  -
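Since zpool add was given a mirror vdev, zpool status should now show two mirrors in the system pool: the original gpt/system8 and gpt/system4 pair plus the new gpt/system8.p2 and gpt/system4.p2 pair.

# zpool status system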

3 Comments »

  1. Instead of adding the former scratch partitions as a 2nd vdev in your system pool, could you not simply have used gpart to delete the old scratch partitions and enlarge the gpt/system{4,8} partitions? ZFS should just grow to fill the space available.

    Similarly if you need more space eventually but have no place to install more drives: just replace each disk in turn with a larger capacity one, creating the same size boot and swap and making the rest of the drive one big zfs partition. Let resilvering complete after each swap, and hey presto: enlarged zpool. Without system downtime, unless you have to reboot to physically swap the drives.

    Comment by Matthew Seaman — 27 Mar 2011 @ 19:43

  2. Apparently, I could’ve done! I’m a complete ZFS n00b, so forgive the thrashing around. Good to know there’s more ways to do it…

    Presumably I could implement your solution even now, though I’m struggling to figure out a clean way without using a temporary disk as a spare (I think it is possible: presumably I’d detach the second disk, grow its partition, export to it, then grow the first disk and attach it to the second? Doesn’t sound like a change that could be done live, though).

    Comment by Ben — 28 Mar 2011 @ 14:07

  3. Hmmm… Getting rid of vdevs is not so easy. There isn’t a ‘zpool subtract’ which is the converse of ‘zpool add.’ It can be done though.

    Remove ad8 from your zpool, breaking the mirror. Repartition ad8. Create a new zpool using just ad8.

    Copy data from ad4 by snapshotting the system zpool and using a zfs send … | zfs recv … pipeline. Unfortunately this doesn’t copy across all the ZFS options (read-only, nosuid, noexec etc), so you’ll have to fix those up too. Obviously, from this point you’ll want to avoid making any significant changes to the data in your system zpool after creating the snapshot, as those changes will be lost in the next step. You can however make a new snapshot and update the copy with just the changes between the first and second snapshots — a bit like running rsync repeatedly.

    Then the tricky bit — reboot so your system is running off the new zpool. This will involve setting various ZFSes to mount at the root directory (much as you did during install), updating /boot/loader.conf and copying /boot/zfs/zpool.cache over to the new zpool.

    Once rebooted, destroy the zpool on ad4, repartition it and add it as a mirror to the new zpool on ad8.

    Apart from the reboot in the middle, should be doable without downtime, although you’ll want to avoid any heavy system activity while it’s all in progress.

    Comment by Matthew Seaman — 28 Mar 2011 @ 20:37
