October 30, 2015
By: gotwf

ZOL on boot and root

I want native ZFS goodness on Linux, so I spent some time exploring. Discerning current from more dated best practices proved a bit challenging, so I found myself bumbling about a bit. I expect this situation to be rectified as uptake of ZFS On Linux continues, ZOL stabilizes, and comprehensive best practices are ferreted out. What follows are some personal notes to self and reflections documenting my adventures. The end objective is a minimally installed base Linux suitable for further build out as a personal workstation. But first we’ve got some choices to understand.

"The perfect is the enemy of the good."
— Voltaire

Grubbin'…​

Ideally we’d like to have /boot and / share the same zpool, at least from the management perspective of keeping our kernels and user lands in sync. Alas, ZFS on Linux development is still moving towards a 1.0 release. In the meantime, it’s a rapidly moving target, at least compared to GRUB. Yes, GRUB supports ZOL. Sort of…​ GRUB’s devs are necessarily more conservative about merging ZOL patches into something as important as a boot loader/manager. Hence, running a single ZFS pool for the / and /boot datasets can be problematic: on the one hand we want the latest ZOL for features, enhancements, and bug fixes; on the other we need to ensure maximum reliability from GRUB.

One solution is to use grub-git. ZOL developers are invested in keeping grub patched for ZFS so you’ll get at or near complete ZFS feature support. If such is an important target for you then this approach may be attractive. The downside is that it’s grub-git. Maybe you don’t like living dangerously.

Another solution, in contrast to the above, prioritizes reliability and is willing to forgo features. It is accomplished through carefully configured ZFS feature flags tuned to maximize GRUB release version compatibility and reliability. The downside to this approach is that you necessarily give up a lot of ZFS goodness by so limiting yourself.

For those who like having their cake and eating it too: A potentially best of both worlds solution is to break /boot and / out to separate pools. We now have the ability to create the boot pool using a limited feature set known to be well supported by GRUB. With this handled, we are now free to create a second pool for / and enjoy the full smorgasbord of ZOL features where it matters most. The downside is …​. that this is far from perfect. Hmmm…​ methinks this offers a compromise good enough for my needs. Yours may differ.
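In miniature, the two-pool idea looks something like this. These commands are purely illustrative — device names, pool names, and options are placeholders; the real commands used in this install appear later in the post:

```shell
# Illustrative sketch only — do NOT run as-is; devices and names are placeholders.
# A small pool for /boot, pinned to an old on-disk version GRUB handles well:
zpool create -o version=28 boot /dev/sda1
# ...and a separate, full-featured pool for / and everything else:
zpool create rpool /dev/sda3
```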

Whole Disks vs Partitions

You may have heard that ZFS likes whole disks. And you would be correct. Using whole disks is preferable if interoperability with other ZFS implementations that honor the "whole_disk" property is a concern. For example, on FreeBSD the whole_disk property is always set to true. This is ZOL on root and boot though, so we need to take a couple things into consideration and understand some compromises here.

ZFS does some performance tweaks when using whole disks. For example, on illumos-based systems, ZFS enables the write cache. On Linux, ZFS will set the I/O elevator to noop to avoid unnecessary CPU overhead. If using partitions, it will not attempt these optimizations and leaves things as is. So using partitions with ZOL means we’re going to take a bit of a performance hit. Or are we?

The consensus I got on #zfsonlinux when inquiring about this is that it is fine to enable "elevator=noop" on a partition-based setup like we’re using here. So feel free to tune your kernel boot parameters accordingly if you have performance concerns.
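On a GRUB-based distro, that tuning might look like the following in /etc/default/grub. Treat this as a sketch — the file location and the config regeneration command vary by distro:

```shell
# /etc/default/grub — append elevator=noop to the kernel command line.
GRUB_CMDLINE_LINUX_DEFAULT="elevator=noop"
# Then regenerate grub.cfg, e.g.:
#   grub-mkconfig -o /boot/grub/grub.cfg
```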

Geronimo…​

Time to go! Almost. ZFS makes things dead easy. But we need the ability to use it. Classic chicken and the egg thang! Not to fear, as fearedbliss maintains a nice Gentoo based system-rescue-cd-with-zfs. Recommended. Grab it from a torrent near you. Then follow the yellow brick road until you have a bootable iso.

Archers are going to want to embed archzfs into an archiso. Archer Jesus Alvarez does a nice job packaging this stuff up for Arch.

Then head on back here ready to begin enjoying the bliss that is ZFS.

Jump Into the Pool With Me Tonight!

Okay, I like using real hardware when testing stuff. Yeah, I do know about virtual machines…​ they have their use. I like bare metal. Get yourself booted. I shall presume that you are able to set the root passwd and get yourself ssh’d into toyland.

I will be using /dev/sda throughout this example. Make sure you’ve got the correct block device for your system else you may lose stuff you’d rather not…​

root@sysresccd /root % lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda      8:0    0 698.7G  0 disk
sdb      8:16   0 698.7G  0 disk
sdc      8:32   1   3.8G  0 disk
└─sdc1   8:33   1   511M  0 part
sr0     11:0    1  1024M  0 rom
loop0    7:0    0 380.5M  1 loop /livemnt/squashfs

I’m going to be using gpt based partitions. The command line commandos among you may want to use sgdisk, in which case I’m quite sure you’re well familiar with man. I will use gdisk here. The attentive reader should be able to follow along and end up with something like this:

root@sysresccd / % gdisk /dev/sda
GPT fdisk (gdisk) version 1.0.0

Partition table scan:
  MBR: protective
  BSD: not present
  APM: not present
  GPT: present

Found valid GPT with protective MBR; using GPT.

Command (? for help): n
Partition number (1-128, default 1):
First sector (34-1465149134, default = 2048) or {+-}size{KMGTP}:
Last sector (2048-1465149134, default = 1465149134) or {+-}size{KMGTP}: +4G
Current type is 'Linux filesystem'
Hex code or GUID (L to show codes, Enter = 8300): be00
Changed type of partition to 'Solaris boot'

Command (? for help): n
Partition number (2-128, default 2):
First sector (34-1465149134, default = 8390656) or {+-}size{KMGTP}: +128M
Last sector (8652800-1465149134, default = 1465149134) or {+-}size{KMGTP}: +4M
Current type is 'Linux filesystem'
Hex code or GUID (L to show codes, Enter = 8300): ef02
Changed type of partition to 'BIOS boot partition'

Command (? for help): n
Partition number (3-128, default 3):
First sector (34-1465149134, default = 8660992) or {+-}size{KMGTP}: +128M
Last sector (8923136-1465149134, default = 1465149134) or {+-}size{KMGTP}: -350M
Current type is 'Linux filesystem'
Hex code or GUID (L to show codes, Enter = 8300): bf00
Changed type of partition to 'Solaris root'

Command (? for help): p
Disk /dev/sda: 1465149168 sectors, 698.6 GiB
Logical sector size: 512 bytes
Disk identifier (GUID): 1397BE73-4653-4449-9E52-1A5FC53905DE
Partition table holds up to 128 entries
First usable sector is 34, last usable sector is 1465149134
Partitions will be aligned on 2048-sector boundaries
Total free space is 1243102 sectors (607.0 MiB)

Number  Start (sector)    End (sector)  Size       Code  Name
   1            2048         8390655   4.0 GiB     BE00  Solaris boot
   2         8652800         8660991   4.0 MiB     EF02  BIOS boot partition
   3         8923136      1464432334   694.0 GiB   BF00  Solaris root

Command (? for help): w

Final checks complete. About to write GPT data. THIS WILL OVERWRITE EXISTING
PARTITIONS!!

Do you want to proceed? (Y/N): y
OK; writing new GUID partition table (GPT) to /dev/sda.
The operation has completed successfully.

Cool. Let’s grab a copy of that table:

root@sysresccd / % sgdisk --backup=./sda-gpt-part.table /dev/sda

Transferring to somewhere that survives a reboot is left as an exercise for the reader ;)
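Should the table ever get clobbered, the backup can be replayed. A sketch — this is destructive, so triple-check the target device first:

```shell
# Restore the saved GPT onto the disk. Overwrites the existing table!
sgdisk --load-backup=./sda-gpt-part.table /dev/sda
```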

To recap:

root@sysresccd / % lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda      8:0    0 698.7G  0 disk
├─sda1   8:1    0     4G  0 part
├─sda2   8:2    0     4M  0 part
└─sda3   8:3    0   694G  0 part
sdb      8:16   0 698.7G  0 disk
sdc      8:32   1   3.8G  0 disk
└─sdc1   8:33   1   511M  0 part
sr0     11:0    1  1024M  0 rom
loop0    7:0    0 380.5M  1 loop /livemnt/squashfs

Time to make our boot zpool:

root@sysresccd / % zpool create -o version=28 -o ashift=9 -o cachefile= -m none \
-R /mnt/gentoo boot /dev/sda1

See man zpool for details on the version=28 flag, but basically we’ve just got a really dandy, nice shorthand here for designating some nicely compatible grubbables…​

Also please take note of the use of the ashift flag. If you know your drive is 4K, set ashift=12. We can check with a little help from smartmontools:

root@sysresccd / % smartctl -i /dev/sda | grep Sector
Sector Size:      512 bytes logical/physical

This drive uses both 512b logical and physical sectors. Hence ashift=9 is most appropriate. In the early days of 4K drives, drive manufacturers had to make their 4K drives lie in order to maintain compatibility with a widely deployed Windows XP base. In more modern times, XP is dead and drive manufacturers no longer need to play this game, so you’re more likely to get the truth. Still, confirm that your numbers for logical and physical are consistent.

Some are of the opinion that it is preferable to always use ashift=12 so as to maintain "forward compatibility" for the day when we’ve got our big boy pants on and upgrade to 4K drives. Well, although a potentially valid point, I’ve news for you: If and when that day comes, I’m going to be getting down and dirty at the hardware level anyways. I therefore advise: tune for what you’ve got happenin' now, baby! ;D
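If you want to script the choice, here is a trivial helper — hypothetical, not part of any ZFS tooling — mapping the physical sector size reported by smartctl to the matching ashift (ashift is just log2 of the sector size):

```shell
# sector_to_ashift: map a physical sector size in bytes to the matching
# ashift value (log2 of the sector size). Hypothetical helper, not ZFS tooling.
sector_to_ashift() {
  case "$1" in
    512)  echo 9  ;;
    4096) echo 12 ;;
    *)    echo "unexpected sector size: $1" >&2; return 1 ;;
  esac
}

sector_to_ashift 512    # -> 9
sector_to_ashift 4096   # -> 12
```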

But I digress! Let’s export that boot pool. Inquiring minds have questions? No worries. Shake it off. We’ll come back to it. All will become clear, young Padawan…​

root@sysresccd / % zpool export boot

And now create our main root pool and export it:

root@sysresccd / % zpool create -o ashift=9 -o cachefile= -m none -O compression=lz4 \
> -R /mnt/gentoo freebird /dev/sda3
root@sysresccd / % zpool export freebird

I prefer to use distinctive names for my root pools, usually associated in some way at the bare metal hardware level. Here I’ve used "freebird". You are free to use whatever. On some systems, rpool and tank are imported automatically. This may not be what you want. If I ever plug these drives into another system, I want to be in control of what gets imported and where. I also like having unique names from a management perspective. Meh.. so I have to type a few extra commands…​

We want to be using /dev/disk/by-id when we build up our system, so let’s reimport our pools as such now. I’ll run a few other commands that should be self-explanatory. If not, please rtfm ;D

root@sysresccd / % zpool import -d /dev/disk/by-id -R /mnt/gentoo -Na
root@sysresccd / % zpool list
NAME       SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
boot      3.97G   130K  3.97G         -      -     0%  1.00x  ONLINE  /mnt/gentoo
freebird   692G  95.5K   692G         -     0%     0%  1.00x  ONLINE  /mnt/gentoo
root@sysresccd / % zpool status
  pool: boot
 state: ONLINE
status: The pool is formatted using a legacy on-disk format.  The pool can
	still be used, but some features are unavailable.
action: Upgrade the pool using 'zpool upgrade'.  Once this is done, the
	pool will no longer be accessible on software that does not support
	feature flags.
  scan: none requested
config:

	NAME                                               STATE     READ WRITE CKSUM
	boot                                               ONLINE       0     0     0
	  ata-WDC_WD7500AYYS-01RCA0_WD-WCAPT0562110-part1  ONLINE       0     0     0

errors: No known data errors

  pool: freebird
 state: ONLINE
  scan: none requested
config:

	NAME                                               STATE     READ WRITE CKSUM
	freebird                                           ONLINE       0     0     0
	  ata-WDC_WD7500AYYS-01RCA0_WD-WCAPT0562110-part3  ONLINE       0     0     0

errors: No known data errors
root@sysresccd / % zfs list
NAME       USED  AVAIL  REFER  MOUNTPOINT
boot       104K  3.84G    29K  none
freebird  74.5K   670G    19K  none

I passed the -N flag there to tell zpool import not to mount the pools, so I could take a look at them and demonstrate a couple of other simple zfs commands. It also means that if I make a mistake or typo, it’s easy to just destroy the ensuing cluster without having to worry about mopping up any actual mount points created on the file system.

Time to make some ZFS datasets:

root@sysresccd / % zfs create freebird/ROOT
root@sysresccd / % zfs create -o mountpoint=/ freebird/ROOT/gentoo
root@sysresccd / % zfs create -o mountpoint=/boot boot/gentoo
root@sysresccd / % zfs list
NAME                   USED  AVAIL  REFER  MOUNTPOINT
boot                   152K  3.84G    29K  none
boot/gentoo             30K  3.84G    30K  /mnt/gentoo/boot
freebird               132K   670G    19K  none
freebird/ROOT           38K   670G    19K  none
freebird/ROOT/gentoo    19K   670G    19K  /mnt/gentoo

That should cover your basic bases. I like to manage via the use of container datasets. Moreover, I have a deep seated need to keep my home dirs extra cozy in the winter so:

root@sysresccd / % zfs create -o mountpoint=/home freebird/HOME
root@sysresccd / % zfs create -o mountpoint=/root freebird/HOME/root

The adventurous reader may deem it desirable, or even just too damn much fun, and feel irresistibly compelled to break out other datasets as you see fit for your needs, distro, etc. (cuz you is free, free, free baby!!) Follows is example of how I might prep a Gentoo workstation:

root@sysresccd / % zfs create freebird/GENTOO
root@sysresccd / % zfs create -o mountpoint=/var/portage freebird/GENTOO/portage
root@sysresccd / % zfs create -o mountpoint=/var/portage/distfiles freebird/GENTOO/distfiles
root@sysresccd / % zfs create -o mountpoint=/var/tmp/portage freebird/GENTOO/build-dir

So now we’ve got something like this:

root@sysresccd / % zfs list
NAME                        USED  AVAIL  REFER  MOUNTPOINT
boot                        152K  3.84G    29K  none
boot/gentoo                  30K  3.84G    30K  /mnt/gentoo/boot
freebird                    327K   670G    19K  none
freebird/GENTOO              76K   670G    19K  none
freebird/GENTOO/build-dir    19K   670G    19K  /mnt/gentoo/var/tmp/portage
freebird/GENTOO/distfiles    19K   670G    19K  /mnt/gentoo/var/portage/distfiles
freebird/GENTOO/portage      19K   670G    19K  /mnt/gentoo/var/portage
freebird/HOME                38K   670G    19K  /mnt/gentoo/home
freebird/HOME/root           19K   670G    19K  /mnt/gentoo/root
freebird/ROOT                38K   670G    19K  none
freebird/ROOT/gentoo         19K   670G    19K  /mnt/gentoo

After Party

Oooh, la, la!!! So much ZFS fun. I’ve just decided to break out a couple more:

root@sysresccd / % zfs create -o mountpoint=/var/log freebird/GENTOO/log
root@sysresccd / % zfs create -o mountpoint=/var/cache freebird/GENTOO/cache
root@sysresccd / % zfs list
NAME                        USED  AVAIL  REFER  MOUNTPOINT
boot                        152K  3.84G    29K  none
boot/gentoo                  30K  3.84G    30K  /mnt/gentoo/boot
freebird                    399K   670G    19K  none
freebird/GENTOO             114K   670G    19K  none
freebird/GENTOO/build-dir    19K   670G    19K  /mnt/gentoo/var/tmp/portage
freebird/GENTOO/cache        19K   670G    19K  /mnt/gentoo/var/cache
freebird/GENTOO/distfiles    19K   670G    19K  /mnt/gentoo/var/portage/distfiles
freebird/GENTOO/log          19K   670G    19K  /mnt/gentoo/var/log
freebird/GENTOO/portage      19K   670G    19K  /mnt/gentoo/var/portage
freebird/HOME                38K   670G    19K  /mnt/gentoo/home
freebird/HOME/root           19K   670G    19K  /mnt/gentoo/root
freebird/ROOT                39K   670G    19K  none
freebird/ROOT/gentoo         20K   670G    20K  /mnt/gentoo

Or not…​.

root@sysresccd / % zfs destroy freebird/GENTOO/cache
root@sysresccd / % zfs list
NAME                        USED  AVAIL  REFER  MOUNTPOINT
boot                        152K  3.84G    29K  none
boot/gentoo                  30K  3.84G    30K  /mnt/gentoo/boot
freebird                    399K   670G    19K  none
freebird/GENTOO             114K   670G    19K  none
freebird/GENTOO/build-dir    19K   670G    19K  /mnt/gentoo/var/tmp/portage
freebird/GENTOO/distfiles    19K   670G    19K  /mnt/gentoo/var/portage/distfiles
freebird/GENTOO/log          19K   670G    19K  /mnt/gentoo/var/log
freebird/GENTOO/portage      19K   670G    19K  /mnt/gentoo/var/portage
freebird/HOME                38K   670G    19K  /mnt/gentoo/home
freebird/HOME/root           19K   670G    19K  /mnt/gentoo/root
freebird/ROOT                39K   670G    19K  none
freebird/ROOT/gentoo         20K   670G    20K  /mnt/gentoo

It’s just that easy!

Okay, Moving Along Here…​

Remember that -N flag we threw up in zpool import’s face? Let’s get our datasets mounted up and ready to ride:

root@sysresccd / % zfs mount -a
root@sysresccd / % zfs mount
freebird/ROOT/gentoo            /mnt/gentoo
boot/gentoo                     /mnt/gentoo/boot
freebird/HOME                   /mnt/gentoo/home
freebird/HOME/root              /mnt/gentoo/root
freebird/GENTOO/portage         /mnt/gentoo/var/portage
freebird/GENTOO/distfiles       /mnt/gentoo/var/portage/distfiles
freebird/GENTOO/build-dir       /mnt/gentoo/var/tmp/portage
freebird/GENTOO/log             /mnt/gentoo/var/log

Sweet! Chroot into e.g. /mnt/gentoo and you’re ready to rock it the rest of the way off into the sunset as per your distro of choice’s installation instructions.
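The chroot dance itself is distro-specific; a rough sketch along the lines of the usual Gentoo handbook steps (verify the exact mounts and paths against your distro’s docs, and note /mnt/gentoo/etc only exists once your stage tarball is unpacked):

```shell
# Rough sketch of the usual pre-chroot setup — run as root from the live cd.
cp -L /etc/resolv.conf /mnt/gentoo/etc/   # working DNS inside the chroot
mount --rbind /dev  /mnt/gentoo/dev
mount --rbind /proc /mnt/gentoo/proc
mount --rbind /sys  /mnt/gentoo/sys
chroot /mnt/gentoo /bin/bash
```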

Mad Props!

In addition to the handy, dandy, awesome sauce rescue cd, I have also drawn inspiration from fearedbliss’s guide for installing Gentoo Linux On ZFS, particularly with regards to creating the boot and root pools. I also derive a lot of gentoo tuned zfs dataset management inspiration from ryao’s guide. Complemented, of course, by diligent study of the Gentoo Install Guide.

Tally Ho!!

Tags: zfsonlinux guides gentoo
gotwf