Links

Ben Laurie blathering


Completely Redundant Disks with ZFS and FreeBSD

A while back, I bought a ReadyNAS device for my network, attracted by the idea of RAID I can grow over time and mirrored disks.

Today I just finished building the same thing “by hand”, using FreeBSD and ZFS. At a fraction of the cost. Here’s how.

First off, I bought this amazing bargain: an HP ProLiant MicroServer. These would be cheap even at list price, but with the current £100 cashback offer, they’re just stupidly cheap. And rather nice.

Since I want to cater for a realistic future, I am assuming by the time I need to replace a drive I will no longer be able to buy a matching device, so I started from day one with a different second drive (the primary is 250 GB, secondary is 500 GB – both Seagate, which was not the plan, but I’ll remedy that in the next episode). I also added an extra 1GB of RAM to the machine (this is important for ZFS which is apparently not happy with less than 2GB of system RAM).

I then followed, more or less, Pawel’s excellent instructions for creating a fully mirrored setup. However, I had to deviate from them somewhat, so here’s my version.

The broad overview of the process is as follows

  1. Install FreeBSD on the primary disk, using a standard sysinstall.
  2. Create and populate gmirror and ZFS partitions on the secondary disk.
  3. Boot from the primary disk, but mount the secondary.
  4. Create and populate gmirror and ZFS partitions on the primary disk.
  5. Use excess secondary disk as scratch.

In my case the two disks are ad4 (primary, 250 GB) and ad8 (secondary, 500 GB). Stuff I typed is in italic.

Since we need identical size partitions for the mirror, we need to simulate the first disk (since it happens to be smaller). Get the disk’s size

# diskinfo -v /dev/ad4
/dev/ad4
        512             # sectorsize
        250059350016    # mediasize in bytes (233G)
        488397168       # mediasize in sectors
        0               # stripesize
        0               # stripeoffset
        484521          # Cylinders according to firmware.
        16              # Heads according to firmware.
        63              # Sectors according to firmware.
        9VMQN8T5        # Disk ident.

Create a memory disk the same size. Note that the sector sizes must match!

# mdconfig -a -t swap -s 488397168
md0

Verify they are the same.

# diskinfo -v /dev/ad4 /dev/md0
/dev/ad4
        512             # sectorsize
        250059350016    # mediasize in bytes (233G)
        488397168       # mediasize in sectors
        0               # stripesize
        0               # stripeoffset
        484521          # Cylinders according to firmware.
        16              # Heads according to firmware.
        63              # Sectors according to firmware.
        9VMQN8T5        # Disk ident.

/dev/md0
        512             # sectorsize
        250059350016    # mediasize in bytes (233G)
        488397168       # mediasize in sectors
        0               # stripesize
        0               # stripeoffset

Now partition the memory disk as we will the first disk later on.

# gpart create -s gpt md0
md0 created
# gpart add -b 34 -s 128 -t freebsd-boot md0
md0p1 added
# gpart add -s 2g -t freebsd-swap -l swap1 md0
md0p2 added
# gpart add -t freebsd-zfs -l systemx md0
md0p3 added

and show the resulting sizes

# gpart show md0
=>       34  488397101  md0  GPT  (233G)
         34        128    1  freebsd-boot  (64K)
        162    4194304    2  freebsd-swap  (2.0G)
    4194466  484202669    3  freebsd-zfs  (231G)

Now blow away the memory disk, we don’t need it any more.

# mdconfig -d -u 0

Create the partitions on the second disk.

# gpart create -s gpt ad8
ad8 created
# gpart add -b 34 -s 128 -t freebsd-boot ad8
ad8p1 added
# gpart add -s 2g -t freebsd-swap -l swap1 ad8
ad8p2 added
# gpart add -s 484202669 -t freebsd-zfs -l system8 ad8
ad8p3 added

And eat the rest of the disk as a scratch area (this area will not be mirrored, and so should only be used for disposable stuff).

# gpart add -t freebsd-zfs -l scratch8 ad8
ad8p4 added

Check it matches the md0 simulation

# gpart show ad8
=>       34  976773101  ad8  GPT  (466G)
         34        128    1  freebsd-boot  (64K)
        162    4194304    2  freebsd-swap  (2.0G)
    4194466  484202669    3  freebsd-zfs  (231G)
  488397135  488376000    4  freebsd-zfs  (233G)

And don’t forget to set up the bootloader

# gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ad8
bootcode written to ad8

I realised as this point I had intended to label everything with an 8, to match the unit number, and had not done so for swap, so for completeness, here’s how you fix it

# gpart modify -i 2 -l swap8 ad8
ad8p2 modified
# gpart show -l ad8
=>       34  976773101  ad8  GPT  (466G)
         34        128    1  (null)  (64K)
        162    4194304    2  swap8  (2.0G)
    4194466  484202669    3  system8  (231G)
  488397135  488376000    4  scratch8  (233G)

Note that the label change is not reflected by the device names in /dev/gpt, which is needed for the next step, so at this point I rebooted.

Now set up the swap mirror.

# gmirror label -F -h -b round-robin swap /dev/gpt/swap8

Create the ZFS storage pool called system, consisting only of our system8 partition.

# zpool create -O mountpoint=/mnt -O atime=off -O setuid=off -O canmount=off system /dev/gpt/system8

And create a dataset – “mountpoint=legacy” stops ZFS from managing it.

# zfs create -o mountpoint=legacy -o setuid=on system/root

Mark it as the default bootable dataset.

# zpool set bootfs=system/root system

Mount it

# mount -t zfs system/root /mnt
# mount
/dev/ad4s1a on / (ufs, local)
devfs on /dev (devfs, local, multilabel)
system/root on /mnt (zfs, local, noatime)

And create the remaining mountpoints according to Pawel’s suggested layout…

# zfs create -o compress=lzjb system/tmp
# chmod 1777 /mnt/tmp
# zfs create -o canmount=off system/usr
# zfs create -o setuid=on system/usr/local
# zfs create -o compress=gzip system/usr/src
# zfs create -o compress=lzjb system/usr/obj
# zfs create -o compress=gzip system/usr/ports
# zfs create -o compress=off system/usr/ports/distfiles
# zfs create -o canmount=off system/var
# zfs create -o compress=gzip system/var/log
# zfs create -o compress=lzjb system/var/audit
# zfs create -o compress=lzjb system/var/tmp
# chmod 1777 /mnt/var/tmp
# zfs create -o canmount=off system/usr/home

And create one for each user:

# zfs create system/usr/home/ben

Now, at a slightly different point from Pawel, I edit the various config files. First /boot/loader.conf. Note that some of these are commented out: this is because, although they appear in Pawel’s version, they are already built into the kernel (this is because I use a GENERIC kernel and he uses a stripped-down one). Including them seems to cause problems (particularly geom_part_gpt, which causes a hang during boot if present).

geom_eli_load=YES
#geom_label_load=YES
geom_mirror_load=YES
#geom_part_gpt_load=YES
zfs_load=YES
vm.kmem_size=3G # This should be 150% of your RAM.

Enable ZFS

# echo zfs_enable=YES >> /etc/rc.conf

Change fstab for the new layout (note, you might want to edit these in – for example, my system had an entry for cd drives).

# cat > /etc/fstab
system/root / zfs rw,noatime 0 0
/dev/mirror/swap.eli none swap sw 0 0
^D

The .eli extension here is magic: geom_eli finds it at startup and automatically encrypts it.

Set the work directory for ports (so that it uses the faster compression scheme during builds).

# echo WRKDIRPREFIX=/usr/obj >> /etc/make.conf

These need to be done now because the next step is to copy the entire install to the new ZFS filesystem. Note that this particular command pastes completely incorrectly from Pawel’s blog post so be careful!

# tar -c --one-file-system -f - . | tar xpf - -C /mnt/

Tar can’t copy some types of file, so expect an error or two at this point:

tar: ./var/run/devd.pipe: tar format cannot archive socket
tar: ./var/run/log: tar format cannot archive socket
tar: ./var/run/logpriv: tar format cannot archive socket

Just for fun, take a look at the ZFS we’ve created so far…

# zfs list
NAME USED AVAIL REFER MOUNTPOINT
system 1.12G 225G 21K /mnt
system/root 495M 225G 495M legacy
system/tmp 30K 225G 30K /mnt/tmp
system/usr 652M 225G 21K /mnt/usr
system/usr/home 50K 225G 21K /mnt/usr/home
system/usr/home/ben 29K 225G 29K /mnt/usr/home/ben
system/usr/local 297M 225G 297M /mnt/usr/local
system/usr/obj 21K 225G 21K /mnt/usr/obj
system/usr/ports 190M 225G 159M /mnt/usr/ports
system/usr/ports/distfiles 30.8M 225G 30.8M /mnt/usr/ports/distfiles
system/usr/src 165M 225G 165M /mnt/usr/src
system/var 100K 225G 21K /mnt/var
system/var/audit 21K 225G 21K /mnt/var/audit
system/var/log 35K 225G 35K /mnt/var/log
system/var/tmp 23K 225G 23K /mnt/var/tmp

Unmount ZFS

# zfs umount -a

And the one we mounted by hand

# umount /mnt

And set the new ZFS-based system to be mounted on /

# zfs set mountpoint=/ system

And … reboot! (this is the moment of truth)

After the reboot, you should see

$ mount
system/root on / (zfs, local, noatime)
devfs on /dev (devfs, local, multilabel)
system/tmp on /tmp (zfs, local, noatime, nosuid)
system/usr/home/ben on /usr/home/ben (zfs, local, noatime, nosuid)
system/usr/local on /usr/local (zfs, local, noatime)
system/usr/obj on /usr/obj (zfs, local, noatime, nosuid)
system/usr/ports on /usr/ports (zfs, local, noatime, nosuid)
system/usr/ports/distfiles on /usr/ports/distfiles (zfs, local, noatime, nosuid)
system/usr/src on /usr/src (zfs, local, noatime, nosuid)
system/var/audit on /var/audit (zfs, local, noatime, nosuid)
system/var/log on /var/log (zfs, local, noatime, nosuid)
system/var/tmp on /var/tmp (zfs, local, noatime, nosuid)
$ zfs list
NAME USED AVAIL REFER MOUNTPOINT
system 1.79G 225G 21K /
system/root 763M 225G 763M legacy
system/tmp 43K 225G 43K /tmp
system/usr 1.04G 225G 21K /usr
system/usr/home 50.5K 225G 21K /usr/home
system/usr/home/ben 29.5K 225G 29.5K /usr/home/ben
system/usr/local 297M 225G 297M /usr/local
system/usr/obj 416M 225G 416M /usr/obj
system/usr/ports 190M 225G 159M /usr/ports
system/usr/ports/distfiles 30.8M 225G 30.8M /usr/ports/distfiles
system/usr/src 165M 225G 165M /usr/src
system/var 106K 225G 21K /var
system/var/audit 21K 225G 21K /var/audit
system/var/log 41.5K 225G 41.5K /var/log
system/var/tmp 23K 225G 23K /var/tmp
$ swapinfo
Device 1K-blocks Used Avail Capacity
/dev/mirror/swap.eli 2097148 0 2097148 0%

Note that system is not actually mounted (it has canmount=off) – it is used to allow all the other filesystems to inherit the / mountpoint. The one that is actually mounted on / is system/root, which is marked as legacy because it is mounted before zfs is up.

Now we’re up on the second disk, time to get the first disk back in the picture (we’re using it for boot but nothing else right now).

First blow away the MBR

# dd if=/dev/zero of=/dev/ad4 count=79
79+0 records in
79+0 records out
40448 bytes transferred in 0.008059 secs (5018970 bytes/sec)

and create the GPT partitions:

# gpart create -s GPT ad4
ad4 created
# gpart add -b 34 -s 128 -t freebsd-boot ad4
ad4p1 added
# gpart add -s 2g -t freebsd-swap -l swap4 ad4
ad4p2 added
# gpart add -t freebsd-zfs -l system4 ad4
ad4p3 added
# gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 ad4
bootcode written to ad4

No scratch partition on this one, there’s no room. Now the two disks should match

# gpart show
=>       34  976773101  ad8  GPT  (466G)
         34        128    1  freebsd-boot  (64K)
        162    4194304    2  freebsd-swap  (2.0G)
    4194466  484202669    3  freebsd-zfs  (231G)
  488397135  488376000    4  freebsd-zfs  (233G)

=>       34  488397101  ad4  GPT  (233G)
         34        128    1  freebsd-boot  (64K)
        162    4194304    2  freebsd-swap  (2.0G)
    4194466  484202669    3  freebsd-zfs  (231G)

apart from the scratch partition, of course.

Add the mirrored swap

# gmirror insert -h -p 1 swap /dev/gpt/swap4

And when rebuilding is finished, you should see

# gmirror status
       Name    Status  Components
mirror/swap  COMPLETE  gpt/swap8
                       gpt/swap4

Now add the second disk’s zfs partition

# zpool attach system /dev/gpt/system8 /dev/gpt/system4
If you boot from pool 'system', you may need to update
boot code on newly attached disk '/dev/gpt/system4'.

Assuming you use GPT partitioning and 'da0' is your new boot disk
you may use the following command:

        gpart bootcode -b /boot/pmbr -p /boot/gptzfsboot -i 1 da0

We already did this part, so no need to do anything. Wait for it to finish. Here it is partway through

# zpool status
  pool: system
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 0h0m, 39.20% done, 0h0m to go
config:

        NAME             STATE     READ WRITE CKSUM
        system           ONLINE       0     0     0
          mirror         ONLINE       0     0     0
            gpt/system8  ONLINE       0     0     0
            gpt/system4  ONLINE       0     0     0  718M resilvered

errors: No known data errors

and now done

# zpool status
  pool: system
 state: ONLINE
 scrub: resilver completed after 0h2m with 0 errors on Fri Mar 18 12:13:19 2011
config:

        NAME             STATE     READ WRITE CKSUM
        system           ONLINE       0     0     0
          mirror         ONLINE       0     0     0
            gpt/system8  ONLINE       0     0     0
            gpt/system4  ONLINE       0     0     0  1.79G resilvered

errors: No known data errors

And we’re done. Reboot one last time to check everything worked.

One final task not relevant to the mirroring is to mount the scratch disk area.

Create a mountpoint

# mkdir /scratch

And a pool

# zpool create -O mountpoint=/scratch -O atime=off -O setuid=off scratch /dev/gpt/scratch8

This filesystem has no redundancy, as previously mentioned. (edit: I am told that the mkdir and mountpoint are both redundant – zfs will create the directory as needed, and uses the pool name as the mount point by default)

In the next installment I will fail and replace one of the disks.

Edit:

daily_status_zfs_enable="YES"
daily_status_gmirror_enable="YES"

should be added to /etc/periodic.conf so checks are added to the daily mails.

5 Comments

  1. That seems like an awful lot of work to get ZFS working in Linux. Why not just use OpenSolaris (or OpenIndiana)? It “just works” for ZFS. Don’t get me wrong, I love Linux and my main server has been Linux for over a decade, but my NAS is OpenIndiana. I’ve got over 9TB of drive space and it runs beautifully.

    Comment by Jeremy Beker — 18 Mar 2011 @ 20:20

  2. Congratulations, but… damn.

    I think I’ll stick with my 3ware RAID and UFS for now.

    Comment by Peter da Silva — 19 Mar 2011 @ 21:20

  3. […] LinksBen Laurie blathering « Completely Redundant Disks with ZFS and FreeBSD […]

    Pingback by Links » ZFS Part 2: Disk Failure — 27 Mar 2011 @ 16:12

  4. Thank you for the hint on geom_part_gpt causing problems! I used a 8.2-RELEASE mfsBSD image (skipping the standard sysinstall) to set up root/boot on ZFS on a remotely managed Dell R200 according to Pawel’s guide. The boot hang had me stuck for a while, not finding any clues on what could have caused it.

    Comment by Daniel Néri — 27 Jun 2011 @ 12:29

  5. Hi Ben Laurie
    I have tried many of these howto’s but yours is the only one that actually works ” as is ” and you also cover failed disk replacement. In fact it is definitely the best howto I have ever read!
    Only one thing that caught me out for a minute was the tar command
    tar -c –one-file-system -f – . | tar xpf – -C /mnt/
    One must be in / directory fro this to get the desired results. So
    cd / before the tar command is required. I mention this for complete newbies.
    Still a magnificent howto.
    Many thanks Seb

    Comment by Seb — 19 Oct 2011 @ 18:26

RSS feed for comments on this post.

Sorry, the comment form is closed at this time.

Powered by WordPress