Wednesday, June 27, 2012

End of the FreeBSD ZFS experiment

So my write performance with ZFS on FreeBSD was abysmal. The final straw was when I was copying from one pool to another and it was running at roughly the same speed as a floppy disk. I made numerous attempts to tune both ZFS and the iSCSI initiator, and nothing that I tried made any real long-term difference. Things would speed up after I tweaked stuff, then slowly settle back down to a tedious crawl.

Out of frustration I created a 64-bit CentOS 6.2 image with the exact same specs as the FreeBSD image and installed the native Linux port of ZFS. (This uses a stub that goes into the kernel to meet licensing requirements; ZFS then compiles against that stub code.) I shut down the FreeBSD virtual machine, installed the iSCSI initiator on the Linux machine, and scanned both of my iSCSI storage arrays to tell them about the new initiator. Then I went to the storage arrays, unassigned the volumes from the FreeBSD machine, and assigned them to the Linux machine instead. Finally I scanned and logged in to the targets from the Linux machine, and ran the following command at a root login:

zpool import -f -F -m -a   # -f: force, -F: recovery mode, -m: ignore missing log devices, -a: all pools

They imported cleanly and everything came up.

So the next thing I did was set my copy going. I am ZFS-mirroring between the two iSCSI storage arrays and have only a single gigabit Ethernet port from my ESXi box to the storage arrays, which caps combined read and write throughput at roughly 100 megabytes per second. ZFS drove the storage arrays at that rate handily.
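The arithmetic behind that ceiling, for what it's worth:

```shell
# Gigabit Ethernet is 1000 megabits/s on the wire; 8 bits per byte gives the
# raw byte rate, and TCP/IP plus iSCSI overhead shaves off another 10-20%.
echo "$((1000 / 8)) MB/s raw, so ~100-110 MB/s of real iSCSI throughput"
```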

So clearly the problem is not ZFS. And FreeBSD has been shown to have good ZFS performance with DASD (direct-attached storage devices). So the fundamental problem appears to be the FreeBSD iSCSI initiator. I don't care enough to diagnose why it performs so badly under ZFS even after I set all the tuning flags to turn up the queue depth and so forth, but the end result is that ZFS combined with iSCSI on FreeBSD is a no-go.

On Linux, BTW, it Just Worked once I built the zfs RPMs and installed them. I'm running at the full speed of my network. And remember, that's the ultimate role of computer technology -- to Just Work, leaving the hard stuff, like deciding what's going to go onto that server, to the humans.

My goal was to move bytes from point A to point B as fast as my ESXi system could do it, given that I'd need to set up Etherchannel trunking on my antique SMC switch to get data from point A to point B any faster than gigabit Ethernet will take me. (I don't know whether the antique will even do it; this is a production ESXi box, so I have to schedule an outage at some oddball hour before I can move the iSCSI network Ethernet wires to the supposedly Etherchannel-trunked ports and flip the ESXi vswitch load-balancing policy to IP hash to split the traffic between the two trunked ports.) So while it sucks that I have to manually build and install ZFS on Linux, the final result works far better than it really should, considering the very beta-quality appearance of that ZFS on Linux site and the rapid updates they're making to the software.
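For anyone following along, building those RPMs on CentOS went roughly like this. The version numbers and tarball names below are illustrative -- the project updates too fast for them to stay accurate:

```shell
# SPL is the GPL'd kernel shim; it must be built and installed before ZFS.
tar xzf spl-0.6.0-rc9.tar.gz
cd spl-0.6.0-rc9
./configure && make rpm        # produces the spl and spl-modules RPMs
rpm -Uvh *.x86_64.rpm
cd ..
tar xzf zfs-0.6.0-rc9.tar.gz
cd zfs-0.6.0-rc9
./configure && make rpm        # builds against the SPL just installed
rpm -Uvh *.x86_64.rpm
modprobe zfs                   # load the module and you're off
```

You need the kernel-devel package matching your running kernel for the module builds to succeed.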


Tuesday, June 26, 2012

ZFS caveats and grumbles

This is a followup of how ZFS has worked in my test deployment.

First of all, don't even try ZFS deduplication on any real data. You need about 2GB of memory for every 100GB of deduplicated data, which means that on a 2 terabyte filesystem you'd need around 40GB of memory just for the dedup tables. If you don't have that much memory, ZFS will still work... at about the same speed as a 1985-vintage Commodore 1541 floppy drive or a 9600 baud modem. So reluctantly I have to say that ZFS's deduplication capabilities are pretty much a non-starter in most production environments.
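Working out that rule of thumb explicitly (and note that zdb can simulate dedup on an existing pool, so you can size the memory before you commit):

```shell
# ~2 GB of RAM per 100 GB of deduped data; a 2 TB (2048 GB) pool needs:
echo "$((2048 * 2 / 100)) GB of RAM for the dedup tables"
# To measure your own pool's prospective dedup table instead of guessing,
# run (as root, against a real pool):  zdb -S poolname
```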

Compression, on the other hand, appears to work much better. When I turn on compression the system gets no slower as data gets added, it still remains the same level of slow.
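Turning it on and checking the payoff is a one-liner each (the pool and dataset names here are made up):

```shell
zfs set compression=on tank/backups    # lzjb by default; cheap on CPU
zfs get compressratio tank/backups     # reports the achieved compression ratio
```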

Finally: ZFS is supposedly a "zero startup time" filesystem. The reality is that if a pool gets shut down uncleanly, ZFS runs a potentially very lengthy integrity check at startup -- it can take as long as an hour. That's still better than NTFS or Linux ext4, both of which can take hours on any reasonably-sized filesystem, but "quick" recovery from an unclean system outage is relative. Don't rely on the system being back up within minutes if someone managed to yank both power cords out of the back of your server while trying to decant some *other* system from the rack.

Next up: FreeBSD has an iSCSI initiator, but it is not integrated into system boot in any way. Worse, if you enable ZFS in rc.conf it comes up first thing, long before networking, so you cannot enable ZFS in rc.conf (or use it for a system filesystem) if your storage array is iSCSI-connected. My rc.local looks like this now:

bsdback# cat rc.local
iscontrol -c /etc/iscsi.conf -n tzpool1
iscontrol -c /etc/iscsi.conf -n mzpool1
iscontrol -c /etc/iscsi.conf -n tzpool2
iscontrol -c /etc/iscsi.conf -n mzpool2
sleep 5
/etc/rc.d/zfs onestart
/etc/rc.d/mountd stop
/etc/rc.d/mountd start

The mountd stop / start is necessary because mountd started up long before, noticed that the ZFS mountpoints in /etc/exports didn't actually exist yet, and declined to export them. If you are exporting ZFS mountpoints via NFS on FreeBSD, this is the only way I've found to make it happen correctly -- even if you export them via zfs itself, mountd starts up, looks at the ZFS exports file, and refuses to export them if ZFS isn't up yet.
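For reference, each nickname handed to iscontrol above is a stanza in /etc/iscsi.conf -- see iscsi.conf(5). A sketch, with a made-up address and IQN:

```
tzpool1 {
    targetaddress = 192.168.10.11;
    targetname    = iqn.2004-04.com.example:storage.tzpool1;
}
```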

And rc.shutdown.local looks like:

/etc/rc.d/mountd stop
/etc/rc.d/nfsd stop
/etc/rc.d/zfs onestop

This is the only way I can get the sequencing right.

Note that Red Hat Enterprise Linux has been able to properly sequence filesystem bringup so that iSCSI filesystems get mounted after networking (and iSCSI) come up -- since Red Hat Enterprise Linux 4, circa 2005, in fact. RHEL has likewise been able to automatically bring up iSCSI targets after networking and before mounting network filesystems via its SysV init scripts for just as long. This is an area in which FreeBSD lags, and deserves to be flamed for lagging. You should not need custom rc.local scripting to bring up standard parts of the FreeBSD operating system in the correct order; it should Just Work(tm) once the rc.d dependencies and ZFS flags are set properly. ZFS needs a two-stage, pre-networking and post-networking bringup, so that any pools not located during the pre-networking stage can be searched for during the post-networking stage, and iSCSI needs its own rc.d script that brings it up at the end of the networking stage.
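On RHEL the sequencing is declarative rather than scripted: the iscsi init script logs in to targets after networking, and any filesystem tagged _netdev in /etc/fstab is mounted only after that (and unmounted before networking goes down). The device and mount point below are made up:

```
# /etc/fstab -- _netdev defers this mount until networking and iSCSI are up
/dev/sdb1    /backups    ext4    _netdev    0 0
```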

All in all, ZFS on FreeBSD is working for me in my test deployment in my IT infrastructure, but it's not as seamless as I had hoped. When I look for IT technology I look for something that Just Works(tm). ZFS on FreeBSD would Just Work(tm) if I were using DASDs, but since I'm using network-attached storage, it's more of an erector-set scenario than I'd like.

Thursday, June 14, 2012

ZFS -- the killer app for FreeBSD?

Okay, so here's my issue. I have two iSCSI appliances. Never mind why I have two iSCSI appliances; they were what was available, so that is what I'm using. I want my backups mirrored between them. Furthermore, I want my backups versioned, so I can access yesterday's backup or last month's backup without a problem. Furthermore, I want my backups to look like complete snapshots of the system that was backed up, as of that specific point in time. And because my data set is highly compressible in part and highly de-duplicable in part, I want it compressed as necessary and deduped as necessary.
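For the record, every one of those requirements maps onto a line or two of ZFS administration. A sketch, with made-up pool and device names:

```shell
# Mirror the two iSCSI-backed disks, then layer the features on top.
zpool create backups mirror da1 da2    # mirrored between the two arrays
zfs set compression=on backups         # compress the compressible parts
zfs set dedup=on backups               # dedupe the rest (RAM permitting)
zfs snapshot backups@2012-06-14        # a complete point-in-time version
```

Rolling snapshots then give you yesterday's or last month's backup by name.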

So, how can I do this with Linux? Well... there is a FUSE-based port of ZFS that will sort of, maybe, do it, in a way that regularly loses data and performs terribly. There is BTRFS, which is basically a Linuxy re-implementation of ZFS that does compression but not deduplication, and which is still very much a beta-quality thing at present -- they didn't even have a fsck program for it until this spring. And... that's it. In short, on Linux I can only do it slowly and buggily.

So at present I have a FreeBSD virtual machine in my infrastructure happily digesting backups and bumping the snapshot counter along. And ZFS is a first-class citizen in FreeBSD land, not a castaway in FUSE-land like on Linux. I'd love to use BTRFS for this. But BTRFS today is at about the same stage as ZFS on Solaris in 2005, when it was an experimental feature in OpenSolaris, or ZFS on FreeBSD in 2008, when the first buggy port was released. ZFS on FreeBSD is stable and rock solid today; BTRFS, realistically, isn't going to be for another three or four years at least.

So if you haven't investigated ZFS on FreeBSD to manage large data sets in a versioned, compressed, and deduplicated fashion, perhaps you should. It solves this problem *today*, not a half decade from now. And a bird in hand is worth a dozen in four years.