Tuesday, June 26, 2012

ZFS caveats and grumbles

This is a followup on how ZFS has been working in my test deployment.

First of all, don't even try ZFS deduplication on any real data. You need about 2GB of memory for every 100GB of deduplicated data, which means that on a 2 terabyte filesystem you'd need around 40GB of memory just for the dedup tables. If you don't have that much memory, ZFS will still work... at about the same speed as a 1985-vintage Commodore 1541 floppy drive or a 9600 baud modem. So reluctantly I have to say that ZFS's deduplication capabilities are pretty much a non-starter in most production environments.
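If you're wondering whether dedup would even pay off on your data, one way to check before committing -- a sketch, with "tank" standing in for whatever your pool is actually called -- is to have zdb simulate the dedup table and read the estimated ratio off the summary at the end of its output:

# Simulate deduplication without enabling it; the summary at the end
# reports the estimated dedup ratio and the number of DDT entries,
# each of which costs RAM once dedup is actually turned on.
zdb -S tank

# And if you already turned it on and regret it: new writes stop being
# deduplicated, though blocks already in the dedup table stay there.
zfs set dedup=off tank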

Compression, on the other hand, appears to work much better. With compression turned on, the system gets no slower as data is added; it just stays at the same level of slow.
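For the record, the knobs involved are just dataset properties; something like the following ("tank/backups" is a placeholder dataset, and on the ZFS version FreeBSD ships today "on" means the lzjb algorithm):

# Turn compression on for a dataset; only newly written data gets compressed.
zfs set compression=on tank/backups

# Report how much space compression is actually saving.
zfs get compressratio tank/backups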

Finally: ZFS is supposedly a "zero startup time" filesystem. The reality is that if a pool gets shut down uncleanly, ZFS does a potentially lengthy integrity check at startup, as in, it can take as long as an hour to run. That is still better than NTFS or Linux ext4 (both of which can take hours to check any reasonably-sized filesystem), but "quick" recovery from an unclean outage is relative -- don't count on the system being back up within minutes if someone managed to yank both power cords out of the back of your server while trying to decant some *other* system from the rack.
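Once the pool finally imports, the way I check that it really is healthy again is roughly this ("tank" again being a placeholder pool name):

# Shows the pool's state, any scrub or resilver still in progress,
# and any errors found so far.
zpool status -v tank

# Prints "all pools are healthy" once everything is back to normal.
zpool status -x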

Next up: FreeBSD has an iSCSI initiator, but it is not integrated into system boot in any way, and if you enable ZFS in rc.conf it comes up first thing, long before networking. So you cannot enable ZFS in rc.conf (or use it for a system filesystem) if your storage array is iSCSI-connected. My rc.local looks like this now:

bsdback# cat rc.local
#!/bin/sh
# Log in to each iSCSI session named in /etc/iscsi.conf
iscontrol -c /etc/iscsi.conf -n tzpool1
iscontrol -c /etc/iscsi.conf -n mzpool1
iscontrol -c /etc/iscsi.conf -n tzpool2
iscontrol -c /etc/iscsi.conf -n mzpool2
# Give the iSCSI disks a moment to attach before ZFS goes looking for them
sleep 5
# Now that the disks exist, bring up ZFS (import pools, mount filesystems)
/etc/rc.d/zfs onestart
# Restart mountd so it re-reads the exports now that the ZFS mountpoints exist
/etc/rc.d/mountd stop
/etc/rc.d/mountd start

The mountd stop / start is necessary because mountd started up long before, noticed that the ZFS mountpoints listed in /etc/exports didn't actually exist yet, and declined to export them. If you are exporting ZFS mountpoints via NFS on FreeBSD, this is the only way I've found to make it happen correctly -- even if you export them via the zfs sharenfs property instead, mountd starts up, looks at the ZFS exports file, and refuses to export them because ZFS isn't up yet.
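For what it's worth, the sharenfs route looks something like this -- the dataset name and export options are placeholders, not my actual config -- and it hits exactly the same ordering problem, because ZFS just writes the line into /etc/zfs/exports for mountd to pick up:

# Export a dataset via NFS, using FreeBSD exports(5) syntax for the options.
zfs set sharenfs="-maproot=root -network 192.168.1.0/24" tank/backups

# Confirm what will be handed to mountd.
zfs get sharenfs tank/backups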

And rc.shutdown.local looks like:

# Stop NFS services first so nothing is holding the ZFS mountpoints busy
/etc/rc.d/mountd stop
/etc/rc.d/nfsd stop
# Then unmount the ZFS filesystems while the iSCSI sessions are still up
/etc/rc.d/zfs onestop

This is the only way I can get the sequencing right.

Note that Red Hat Enterprise Linux has been able to properly sequence filesystem bringup so that iSCSI-backed filesystems get mounted after networking (and the iSCSI initiator) come up -- and has been able to since Red Hat Enterprise Linux 4, circa 2004-2005, via its SysV init equivalent of the rc.conf system. This is an area in which FreeBSD lags, and it deserves to be flamed for lagging. You should not need to write custom rc.local scripting to bring up standard parts of the FreeBSD operating system in the correct order; it should Just Work(tm) once the rc.d dependencies and ZFS flags are set properly. ZFS needs a two-stage bringup, pre-networking and post-networking, so that any pools not located during the pre-networking stage can be searched for during the post-networking stage, and the iSCSI initiator needs its own rc.d script that brings it up at the end of the networking stage.
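To make that last suggestion concrete, here's a rough, untested sketch of the kind of rc.d wrapper I have in mind for the initiator -- the script name, rcvar, session nicknames, and ordering hints are all my own inventions, not anything that ships with FreeBSD:

#!/bin/sh
# Hypothetical /usr/local/etc/rc.d/iscsi_sessions -- a sketch, not tested.
#
# PROVIDE: iscsi_sessions
# REQUIRE: NETWORKING
# BEFORE: mountd nfsd
# KEYWORD: shutdown

. /etc/rc.subr

name="iscsi_sessions"
rcvar="iscsi_sessions_enable"
start_cmd="iscsi_sessions_start"
stop_cmd=":"

iscsi_sessions_start()
{
        # Log in to each session defined in /etc/iscsi.conf, then give
        # the disks a moment to attach before anything tries to use them.
        for nickname in tzpool1 mzpool1 tzpool2 mzpool2; do
                iscontrol -c /etc/iscsi.conf -n ${nickname}
        done
        sleep 5
}

load_rc_config $name
run_rc_command "$1"

Even with something like this in place, the base zfs script would still need that second, post-networking pass before iSCSI-backed pools would come up without rc.local hackery.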

All in all, ZFS on FreeBSD is working for me in my test deployment, but it's not as seamless as I had hoped. When I look for IT technology I look for something that Just Works(tm). ZFS on FreeBSD would Just Work(tm) if I were using DASDs, but since I'm using network-attached storage, it's more of an erector set scenario than I'd like.
