Friday, February 22, 2013

Filesystem performance - btrfs, zfs, xfs, ntfs

I was somewhat curious about filesystem performance for video streaming purposes. So I set up a test system. The test system was:
  • 12-disk Supermicro chassis with a SAS1/SATA-2 backplane, rescued from the scrap heap (it has motherboard sensor issues and a broken backplane connector)
  • Supermicro X8DTU-F motherboard with two 2.4GHz Xeon processors (Nehalem architecture) and 20GB of memory (the odd amount is due to its being populated from the contents of my junk bin)
  • Six 2TB 7200 RPM drives (all the SATA-2 drives I could scrounge up from my junk bin)
  • One LSI 9211-4i SAS2 HBA (purchased for this purpose)
  • One Intel 160GB SATA-2 SSD (retired from my laptop when I replaced it with a larger SSD)
There was also another 64GB SSD used as the boot drive for Fedora 18, the OS used for this test. Fedora 18 was chosen because it has a recent BTRFS with RAID10 support and because ZFS on Linux will compile easily on it. The following configurations were tested:
  1. BTRFS with 6 drives configured as RAID10
  2. ZFS with 6 drives set up as three 2-disk mirror pairs, striped. I experimented with the 160GB SSD as a log device, and without it.
  3. XFS on top of a Linux MD RAID10
  4. A Windows 7 virtual machine running in a KVM virtualization environment with the MD RAID10 exported to the virtual machine as a physical device.
I did not test ext4 because I know from experience that its performance on large filesystems with large files is horrific.
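
For reference, here is roughly how the first three configurations might be created. This is a sketch rather than my exact command history; the device names (/dev/sdb through /dev/sdg for the six data drives, /dev/sdh for the 160GB SSD) and the mount point are assumptions:

    # 1. BTRFS with the six drives in RAID10 (data and metadata)
    mkfs.btrfs -d raid10 -m raid10 /dev/sd[b-g]
    mount /dev/sdb /mnt/test

    # 2. ZFS as three striped mirror pairs, with the SSD as the log device
    zpool create testpool mirror sdb sdc mirror sdd sde mirror sdf sdg log sdh

    # 3. Linux MD RAID10 with XFS on top
    mdadm --create /dev/md10 --level=10 --raid-devices=6 /dev/sd[b-g]
    mkfs.xfs /dev/md10
    mount /dev/md10 /mnt/test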

Note that XFS on top of MD RAID10 is subject to data loss, unlike BTRFS and ZFS, which include integrity guarantees. Windows 7 in a virtual machine on top of MD RAID10 is subject to even more data loss, plus it has approximately a 5% I/O performance overhead in my experience (the host system chews up a huge amount of CPU, but CPU was not in short supply for this test). The purpose was not to propose them as serious solutions to this problem (though for security camera footage, losing a few seconds of data in the event of a power failure may be acceptable) but, rather, to compare BTRFS and ZFS performance with the "old" standards in the area.

The test itself was simple. I set up a small script that launched 8 dd processes with large block sizes running in parallel, streaming /dev/zero data to the filesystem, similar to what might occur if eight extremely high-definition cameras were simultaneously streaming data to the filesystem. Compression was off because that would have turned it into a CPU test, heh. At intervals I would killall -USR1 dd and harvest the resulting performance numbers. A sketch of the test harness is below.
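
This is a minimal reconstruction of that harness rather than the original script; the block size and the target directory (/mnt/test) are assumptions:

    #!/bin/bash
    # Launch 8 parallel dd writers streaming zeroes to the test filesystem,
    # roughly simulating eight high-definition camera streams.
    for i in $(seq 1 8); do
        dd if=/dev/zero of=/mnt/test/stream$i bs=1M &
    done

    # Later, from another shell, ask every dd to print its current throughput:
    #   killall -USR1 dd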

My experience with these hard drives as singletons is that they are fundamentally capable of streaming approximately 110MB/sec to the platters. Because ZFS and BTRFS are both copy-on-write (COW) filesystems, they should have been utilizing the full streaming capability of the drives if they properly optimized their disk writes and had zero overhead. In practice, of course, that doesn't happen.

First I tried BTRFS. After some time it settled down to approximately 298MB/sec throughput to the RAID10 test volume. This implies approximately 10% overhead (note that since RAID10 is striped mirrors, multiply by two to get total bandwidth, then divide by six to get per-drive bandwidth). The drive lights stayed almost constantly lit.
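
Spelled out with the numbers, that arithmetic is roughly:

    298 MB/sec observed x 2 (each block is also written to its mirror) = ~596 MB/sec of raw disk writes
    596 MB/sec / 6 drives = ~99 MB/sec per drive
    99 / 110 = ~90% of each drive's streaming capability, i.e. roughly 10% overhead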

Next I tried ZFS with the log on the SSD. I immediately noticed that my hard drive lights were "loping" -- they were brightly lit, then there were occasional burps or pauses. The final result was somewhat disappointing -- approximately 275MB/sec throughput, or roughly a 17% penalty compared to raw drive performance.

But wait. That's an older Intel SSD that had been used in a laptop computer for some time before I replaced it with a larger SSD. Perhaps the common ZFS wisdom to put the log onto an SSD is not really that good? So the next thing I tried was to destroy the testpool and re-create it from scratch without the log device, to see whether that made a performance difference. It made none. Again I was averaging 275MB/sec throughput. To say that I'm surprised is an understatement -- "common knowledge" says putting the ZFS log on an SSD is a huge improvement. That doesn't appear to be true, at least not for streaming workloads with relatively few but large transactions.
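
Concretely, the second ZFS run amounted to something like this (device names, as before, are assumptions):

    zpool destroy testpool
    # Same three striped mirror pairs, but no separate log device this time
    zpool create testpool mirror sdb sdc mirror sdd sde mirror sdf sdg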

In other words, don't use ZFS for performance. Its performance appears to have... issues... on Linux, likely due to the SPL (Solaris Porting Layer), though it is quite acceptable for many purposes (a 17% I/O performance penalty sucks, but let's face it, most people don't max out their I/O subsystems anyhow). It's the same licensing issue that has held up ZFS adoption elsewhere: Sun created their license for their own purposes, which are not the same as the Open Source community's purposes, and Oracle appears to agree with Sun.

On the other hand, ZFS is *really* handy for a backup appliance, due to the ability to snapshot backups and replicate them off-site, thereby providing all three functions of backups (versioning, replication, and offsite disaster recovery). For limited purposes ZFS's deduplication can be handy too: I use it on virtual machine backups, since virtual machine files get rsync'ed to the backup appliance every time but most of the blocks are the same as the last time they were rsync'ed -- ZFS merely links those blocks to the previously existing blocks rather than using up more of my disk. That's a purpose I've put it to, and I'm having excellent results with it. Note that the latest BTRFS code in the very latest Linux kernel has RAIDZ-like functionality, so that's no longer an advantage of ZFS over BTRFS.
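
To make that backup-appliance workflow concrete, here is a minimal sketch; the pool, dataset, host, and snapshot names are all invented for illustration:

    # Deduplicate the VM backup dataset so repeated rsync runs share identical blocks
    zfs set dedup=on backuppool/vmbackups

    # Snapshot after each rsync run (versioning)
    zfs snapshot backuppool/vmbackups@2013-02-22

    # Send the increment since the previous snapshot to the offsite box
    # (replication plus offsite disaster recovery)
    zfs send -i backuppool/vmbackups@2013-02-21 backuppool/vmbackups@2013-02-22 | \
        ssh offsite-host zfs receive backuppool/vmbackups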

Finally I went to the hoary old MD RAID10 driver. This caused some havoc because MD RAID10 insists upon scanning and possibly rebuilding every block on the RAID array before you get full-performance access to it. I changed the max rebuild speed to well above the physical capability of the hardware, and md10 reported that it was rebuilding at 315MB/sec. This means approximately a 4% performance overhead from MD RAID10 compared to the raw physical hard drive speed. The kernel thread md10_raid10 was using 66% of one core, and the kernel thread md10_resync was using 15% of another core. This tends to indicate that if I had a full 12-disk RAID10 array, I'd max out a core and get lower performance -- fairly disappointing. Intransa's code has similar bottlenecks (probably caused by the bio callbacks from the driver layer to the RAID layer), but I'd expected the native code to perform better than ten-year-old code that wasn't originally designed for modern kernels, wasn't originally designed to talk to local storage, and can do so only via a shim layer that I am, alas, intimately familiar with (Intransa's RAID layer was originally designed to talk to IP-connected RAID cabinets for scale-out storage). So it goes.
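
Raising the rebuild speed ceiling is a one-liner, something like this (the value is just an arbitrarily large number, in KB/sec):

    # Let the resync run as fast as the hardware allows
    echo 1000000 > /proc/sys/dev/raid/speed_limit_max

    # Watch the reported rebuild rate
    cat /proc/mdstat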

So now the RAID is rebuilding. I'll come back tomorrow with a new post after it's rebuilt and look at what XFS and NTFS on top of that RAID layer do. In the meantime, don't attach too much importance to these numbers. Remember, this is being done on scrap hardware, basically as an "I wonder" about how good (or how bad) the new filesystems are compared to the oldies for my particular purpose (streaming a lot of video data to the drives). YMMV and all that. And testing on modern hardware -- SATA-3 drives and backplanes, a Sandy Bridge/Ivy Bridge motherboard and processors, etc. -- would likely result in much faster numbers. Still, for my purposes, this random scrap assemblage gives me enough information. So it goes.

- Eric Lee Green

* Disclaimer - I am currently chief storage appliance engineer at Intransa, Inc., a maker of iSCSI storage appliances for video surveillance. This test was not, however, conducted using Intransa-owned equipment or on company time, and this post is my own opinion and my own opinion only.

4 comments:

  1. So... Where is the second part? ;P

  2. Look at the left margin "Blog Archive" index and you will see parts II and III.

  3. If your basis for testing ZFS was hardware RAID: did it support true JBOD (passthrough) mode?

    I'm thinking of point 5 at Nex7's Blog: ZFS: Read Me 1st.

  4. The 9211-4i is an HBA, not a RAID controller. It only does JBOD. The specifications are available on LSI's web site.
