ZFS Part 2: Disk Failure

Before I’m ready to trust ZFS I need to make sure I can replace a disk when one dies. As a first experiment with the setup described here, I removed the primary disk.

So, power down and remove the primary disk (ad4). Note that if you’re doing this on the ProLiant system I mentioned, you really should replace the drive mount (it’s needed for cooling). Luckily I have a spare system, so I just borrowed one from that.

Reboot. Comes up fine on the secondary disk without further intervention.

$ zpool status
  pool: scratch
 state: ONLINE
 scrub: none requested
config:

        NAME            STATE     READ WRITE CKSUM
        scratch         ONLINE       0     0     0
          gpt/scratch8  ONLINE       0     0     0

errors: No known data errors

  pool: system
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
        the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-2Q
 scrub: none requested
config:

        NAME             STATE     READ WRITE CKSUM
        system           DEGRADED     0     0     0
          mirror         DEGRADED     0     0     0
            gpt/system8  ONLINE       0     0     0
            gpt/system4  UNAVAIL      0     0     0  cannot open

errors: No known data errors

Note that the system pool is now degraded. How would we have known if we hadn’t checked? Well, it turns out we missed something in the previous setup.

We should have put

daily_status_zfs_enable="YES"
daily_status_gmirror_enable="YES"

in /etc/periodic.conf. Then in the daily mail we’d see:

Checking status of zfs pools:
  pool: system
 state: DEGRADED
status: One or more devices could not be opened.  Sufficient replicas exist for
	the pool to continue functioning in a degraded state.
action: Attach the missing device and online it using 'zpool online'.
   see: http://www.sun.com/msg/ZFS-8000-2Q
 scrub: none requested
config:

	NAME             STATE     READ WRITE CKSUM
	system           DEGRADED     0     0     0
	  mirror         DEGRADED     0     0     0
	    gpt/system8  ONLINE       0     0     0
	    gpt/system4  UNAVAIL      0     0     0  cannot open

errors: No known data errors

Checking status of gmirror(8) devices:
       Name    Status  Components
mirror/swap  DEGRADED  gpt/swap8

So remember, boys and girls, read your daily mails!
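
If you don’t want to wait for the overnight run to find out whether those knobs took effect, the same checks can be poked by hand. A minimal sketch, assuming a stock FreeBSD layout (the exact periodic script names are an assumption and may vary between releases; zpool status -x and gmirror status are the reliable bits):

# Quick one-liners: only unhealthy pools / mirrors make noise
$ zpool status -x
$ gmirror status

# Or run the relevant periodic(8) status scripts directly and read their output
$ /etc/periodic/daily/404.status-zfs
$ /etc/periodic/daily/406.status-gmirror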

So far, so good. One disk failed, the system came back up without intervention, and it would have alerted us in the daily mails had we configured them correctly (which, of course, we now have). So what happens if we put the disk back in? Since we’ve modified the other disk in the meantime, we’d hope the changes would get reconciled. Let’s see…

Power down, replace the missing disk, and reboot.

Now we see

$ zpool status
  pool: scratch
 state: ONLINE
 scrub: none requested
config:

        NAME            STATE     READ WRITE CKSUM
        scratch         ONLINE       0     0     0
          gpt/scratch8  ONLINE       0     0     0

errors: No known data errors

  pool: system
 state: ONLINE
 scrub: resilver completed after 0h0m with 0 errors on Sat Mar 26 10:48:56 2011
config:

        NAME             STATE     READ WRITE CKSUM
        system           ONLINE       0     0     0
          mirror         ONLINE       0     0     0
            gpt/system8  ONLINE       0     0     0
            gpt/system4  ONLINE       0     0     0  345K resilvered

errors: No known data errors

$ gmirror status
       Name    Status  Components
mirror/swap  COMPLETE  gpt/swap4
                       gpt/swap8

and there we are, back to where we started. But suppose the disk had really failed, then what? See the next exciting installment!
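
In the meantime, for completeness: had the returning disk not been picked up automatically, the ‘action:’ line in the degraded status above already tells us the remedy, and a scrub afterwards is a cheap way to confirm both halves of the mirror really agree. A rough sketch, using the device names from this setup:

# Re-attach the device by hand if the resilver doesn't start on its own
$ zpool online system gpt/system4

# Then read back and verify every allocated block in the pool
$ zpool scrub system
$ zpool status system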
