Sunday, February 24, 2013

Part III: Enter KVM

The next test is envisioned to be NTFS. That will require writing a small Java program to do what I did from the shell on Unix. But before that, I wanted to quantify the performance loss caused by KVM I/O virtualization.

I installed Fedora 18 on a KVM virtual machine via virt-manager and pushed /dev/md10 (the 6-disk RAID10 array) into the virtual machine as a virtio device. I then did raw I/O to /dev/vdb (what it showed up as in the virtual machine), and found that I was getting roughly the same performance as native -- which, as you recall, was 311Mb/sec. I was getting 308Mb/sec, which is close enough that there is no real difference. The downside was that I was using 130% of a CPU core between the virtio driver and kflushd (using write-back mode rather than write-through mode) -- that is, one CPU core plus roughly a third of another to transfer the data from the VM to the LSI driver. For the purposes of this test, that is acceptable -- I have 8 cores in this machine, remember.
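
The equivalent command-line setup would look something like the sketch below (I actually used virt-manager; the domain name, block size, and count are assumptions), followed by the raw streaming write from inside the guest:

    # on the host (hypothetical domain name): hand the MD array to the guest
    # as a virtio disk with write-back caching
    virsh attach-disk fedora18vm /dev/md10 vdb --cache writeback --persistent
    # inside the guest: raw streaming write to the passed-through device
    dd if=/dev/zero of=/dev/vdb bs=1M count=100000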

The next question was whether XFS performance would show the same excellent results in the VM that it showed native. This proved to be somewhat disappointing. The final result was around 280Mb/sec -- barely faster than what I was getting from ZFS. My guess is that natively XFS aligns writes with RAID stripes for the sake of performance, but with the RAID array hidden behind the emulation layer provided by the virtualization system, it was not able to do so. That, combined with the fact that it only had half as much buffer cache to begin with (due to my splitting the RAM between the KVM virtual machine and the host OS -- i.e., 10Gb apiece), made it more difficult to effectively schedule I/O. I/O on the KVM side was "bursty" -- it would burst up to 1 gigabyte per second, then drop to nearly zero, as shown by 'dstat' -- and that in turn made I/O on the host side somewhat bursty as well. This also tends to support the assertion that it's the SEL (Solaris Emulation Layer) that's causing ZFS's relatively poor streaming performance compared to BTRFS, since the SEL effectively puts the filesystem behind an emulation layer too. It also supports the assertion that the Linux kernel developers have spent a *lot* of time optimizing the filesystem/block layer interface in recent kernels. And it raises the question of whether hardware RAID controllers -- which similarly hide the physical layout of the actual RAID system behind a firmware-provided abstraction layer -- would have a similar negative impact upon filesystem performance. If I manage to snag a hardware RAID controller for cheap I might investigate that hypothesis, but it's rather irrelevant at present.
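
If the stripe-alignment theory is correct, one workaround (which I have not tested) would be to hand the RAID geometry to mkfs.xfs explicitly inside the guest, since it cannot detect it through the virtio layer. A minimal sketch, where the 512k chunk size is an assumption and the stripe width of 3 follows from a 6-disk RAID10:

    # hypothetical: tell XFS the underlying stripe geometry it cannot see through virtio
    mkfs.xfs -d su=512k,sw=3 /dev/vdb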

What this did bring out was that it is unlikely that testing NTFS throughput via a Windows virtual machine is going to produce accurate data. Still, I can compare it to the Linux XFS solution, which should at least tell me whether its performance is within an order of magnitude for streaming loads. So that's the next step of this four-part series, delayed because I need to write some Java code to do what my script with 'dd' did.

-ELG

Update: My scrap heap assemblage of spare parts disintegrated -- the motherboard suddenly decided it was in 6-beep "help I can't see memory!" heaven and no amount of processor and/or memory swapping made it happy -- and thus the NTFS test never got done. Oh well.

Saturday, February 23, 2013

Part II: Enter XFS

So in the previous episode, I had benchmarked btrfs at 298Mb/sec total throughput on 8 simultaneous simulated video streams to disk, and set up a Linux RAID10 array on my six 2Tb 7200 RPM drives. The raw drives have a total streaming throughput of 110Mb/sec apiece. I left the RAID10 array rebuilding overnight and went to sleep.

So what is the raw throughput of the RAID10 array, and how much CPU does it chew up in the process? I tested that today. The total raw throughput of three RAID0 stripes on those drives should be 330Mb/sec. Through the MD layer with a single full-speed stream I got 311Mb/sec, or roughly 6% overhead from the Linux kernel and the RAID10 layer. The RAID10 layer was using approximately 16% of one core, accounted to the flush-9:10 kernel thread, which is quite reasonable for the amount of work being done.
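
A minimal sketch of that single-stream test -- block size, count, and the use of oflag=direct are assumptions:

    # single full-speed streaming write straight to the MD device
    # (oflag=direct keeps the page cache out of the measurement)
    dd if=/dev/zero of=/dev/md10 bs=1M count=100000 oflag=direct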

Next step was to put an XFS filesystem onto this RAID device. Note that I did not even consider putting EXT4 onto a 6-terabyte filesystem; EXT4 is not suitable for video streaming for a number of reasons I won't detail here. EXT4 is a fine general-purpose filesystem, far more reliable than it has any right to be considering its origin, but it has significant performance issues with very large files in a streaming application.

The first question is: does putting the XFS log on an SSD improve performance? So I created an XFS filesystem with the log device on the SSD and the filesystem proper on /dev/md10 (the RAID10 device) and did my streaming tests again. This time it settled down to 303Mb/sec, or roughly 8% overhead. Also, because XFS only logs metadata changes, I noted that virtually no I/O was going to the log device.
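
For the record, an external XFS log is specified both at mkfs time and at mount time; a minimal sketch, where the SSD partition name, log size, and mount point are assumptions:

    # hypothetical device names: SSD partition as the external log, RAID10 array as the data device
    mkfs.xfs -l logdev=/dev/sdh1,size=128m /dev/md10
    mount -o logdev=/dev/sdh1 /dev/md10 /mnt/xfs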

Note that XFS is aggregating writes into far bigger writes than my raw writes to the MD10 device, so you cannot say that XFS has only 2% overhead over direct I/O to the raw device. Its aggregation, and its alignment of blocks to RAID stripes, also reduce the MD10 layer's own overhead. Still, it is clear that XFS is the king of high-performance streaming I/O on Linux -- as has been true for the past decade.
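
Incidentally, you can see the stripe geometry that mkfs.xfs picked up from the MD device -- and which it uses for that alignment -- with xfs_info; the mount point here is an assumption:

    # show the stripe unit (sunit) and stripe width (swidth) XFS detected from the MD device
    xfs_info /mnt/xfs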

Of course XFS also has its drawbacks. XFS values speed over everything else, so in practice its aggressive write re-ordering can result in corrupted files in the event of a power failure, kernel panic, or watchdog-forced reboot. That is quite acceptable for video recording data, where you may corrupt the last few seconds of video written to disk, but the power outage itself will cost you far more footage than the corruption does. Add in the Linux MD layer and the MD write hole, where partial-stripe updates cannot be reconciled (as opposed to the COW updates of BTRFS or ZFS, where the old data is still available and is reverted to if the new stripe did not complete, leaving a file that is at least consistent, though missing the last update), and it is clear that XFS should be used for important data only on top of a hardware RAID subsystem with battery-backed cache, and should not be used for absolutely mission-critical data like, say, payroll, unless the features that make it perform so well on streaming loads are turned off. Appropriate tools for appropriate tasks and all that...

So in any event, it is clear that XFS, BTRFS, and ZFS are at present useful for entirely different subsets of problems, but for video streaming XFS still remains king. Next, I take a look at what Windows will do when talking NTFS to that MD10 device via libvirtd and KVM, and compare it to what Linux does when talking XFS to the same device through the same virtualization stack.

-Eric Lee Green

Friday, February 22, 2013

Filesystem performance - btrfs, zfs, xfs, ntfs

I was somewhat curious about filesystem performance for video streaming purposes. So I set up a test system. The test system was:
  • 12-disk Supermicro chassis with SAS1/SATA2 backplane, rescued from the scrap heap (it has motherboard sensor issues and a broken backplane connector)
  • SuperMicro X8DTU-F motherboard with two 2.4GHz Xeon processors (Nehalem architecture) and 20Gb of memory (the odd amount of memory is due to its being populated from the contents of my junk bin)
  • Six 2Tb 7200 RPM drives (all the SATA-2 drives that I could scrounge up from my junk bin)
  • One LSI 9211-4I SAS2 HBA controller (purchased for this purpose)
  • One Intel 160Gb SATA-2 SSD (pulled from my laptop when I replaced it with a larger SSD)
There was also another 64Gb SSD used as the boot drive for Fedora 18, the OS used for this test. Fedora 18 was chosen because it has a recent BTRFS with RAID10 support and because ZFS On Linux will compile easily on it. The following configurations were tested:
  1. BTRFS with 6 drives configured as RAID10
  2. ZFS with the 6 drives set up as three 2-disk mirror pairs, striped. I experimented both with and without the 160Gb SSD as a log device.
  3. XFS on top of a Linux MD RAID10
  4. A Windows 7 virtual machine running in a KVM virtualization environment with the MD RAID10 exported as a physical device to the virtual machine.
I did not test ext4 because I know from experience that its performance on large filesystems with large files is horrific.

Note that XFS on top of RAID10 is subject to data loss, unlike BTRFS and ZFS, which include integrity guarantees. Windows 7 in a virtual machine on top of MD RAID10 is subject to even more data loss, plus it carries approximately 5% I/O performance overhead in my experience (the host system chews up a huge amount of CPU, but CPU was not in short supply for this test). The purpose was not to propose them as serious solutions to this problem (though for security camera data, losing a few seconds of video in the event of a power failure may be acceptable) but, rather, to compare BTRFS and ZFS performance with the "old" standards in the area.

The test itself was simple. I set up a small script that launched 8 dd processes with large block sizes running in parallel, streaming /dev/zero data to the filesystem, similar to what might occur if eight extremely high definition cameras were simultaneously streaming data to the filesystem. Compression was off because that would have turned it into a CPU test, heh. At intervals I would 'killall -USR1 dd' and harvest the resulting performance numbers.
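
A minimal sketch of that harness -- the mount point, block size, and file size are assumptions; the 8 parallel streams and the USR1 harvesting are as described:

    #!/bin/sh
    # eight parallel simulated video streams writing zeroes to the filesystem under test
    for i in $(seq 1 8); do
        dd if=/dev/zero of=/mnt/test/stream$i bs=4M count=50000 &
    done
    # from another shell, at intervals, make every dd report its throughput so far:
    #   killall -USR1 dd
    wait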

My experience with these hard drives as singletons is that they are fundamentally capable of streaming approximately 110Mb/sec to the platters. Because ZFS and BTRFS are both COW filesystems, they should have been utilizing the full streaming capability of the drives if they properly optimized their disk writes and had zero overhead. In practice, of course, that doesn't happen.

First I tried BTRFS. After some time it settled down to approximately 298Mb/sec throughput to the RAID10 test volume. This implies approximately 10% overhead (note that since RAID10 is striped mirrors, multiply by two to get total bandwidth, then divide by six to get per-drive bandwidth). The drive lights stayed almost constantly lit.
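
The BTRFS RAID10 setup amounts to a one-liner; the device names and mount point here are assumptions:

    # hypothetical device names: both data and metadata as RAID10 across the six drives
    mkfs.btrfs -d raid10 -m raid10 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg
    btrfs device scan                 # make sure all members are registered
    mount /dev/sdb /mnt/test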

Next I tried ZFS with the log on the SSD. I immediately noticed that my hard drive lights were "loping" -- they were brightly lit, then there were occasional burps or pauses. The final result was somewhat disappointing -- approximately 275Mb/sec throughput, or roughly a 17% penalty compared to raw drive performance.
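
The pool in question looked something like this -- the device names are assumptions, but the three striped mirror pairs and the SSD log device are as described:

    # hypothetical device names: three striped mirror pairs with the SSD as a separate log device
    zpool create testpool mirror sdb sdc mirror sdd sde mirror sdf sdg log sdh
    zfs set compression=off testpool  # keep this an I/O test, not a CPU test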

But wait. That's an older Intel SSD that had been used in a laptop computer for some time before I replaced it with a larger SSD. Perhaps the common ZFS wisdom to put the log onto an SSD is not really that good? So the next thing I tried was to destroy the testpool and re-create it from scratch without the log device, and see whether that made a performance difference. The result: no performance difference. Again I was averaging 275Mb/sec throughput. To say that I'm surprised is an understatement -- "common knowledge" says putting the ZFS log on an SSD is a huge improvement. That doesn't appear to be true, at least not for streaming workloads with relatively few but large transactions; in hindsight it makes some sense, since the separate log device only comes into play for synchronous writes, and these streaming writes are almost entirely asynchronous.

In other words, don't use ZFS for performance. Its performance appears to have... issues... on Linux, likely due to the SEL (Solaris Emulation Layer), though it is quite acceptable for many purposes (a 17% I/O performance penalty sucks, but let's face it, most people don't max out their I/O subsystems anyhow). The SEL exists because of the same licensing issues that have held up ZFS adoption elsewhere: Sun created their license for their own purposes, which are not the same as the Open Source community's purposes, and Oracle appears to agree with Sun. On the other hand, ZFS is *really* handy for a backup appliance, due to the ability to snapshot backups and replicate them off-site, thereby providing all three of the functions of backups (versioning, replication, and offsite disaster recovery). For limited purposes ZFS's deduplication can be handy also: I use it on virtual machine backups, since the virtual machine files get rsync'ed to the backup appliance every time but most of the blocks are the same as the last time they were rsync'ed -- ZFS merely links those blocks to the previously-existing blocks rather than using up all my disk. That's a purpose I've put it to with excellent results. Note that the latest BTRFS code in the very latest Linux kernel has RAIDZ-like functionality, so that's no longer an advantage of ZFS over BTRFS.
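
For what it's worth, the snapshot-and-replicate workflow is only a couple of commands; the dataset and host names below are assumptions:

    # hypothetical dataset and host names: snapshot the backups, then send the delta off-site
    zfs snapshot backup/vms@2013-02-22
    zfs send -i backup/vms@2013-02-21 backup/vms@2013-02-22 | ssh offsite zfs recv backup/vms
    # deduplication is a per-dataset property
    zfs set dedup=on backup/vms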

Finally I went to the hoary old md RAID10 driver. This caused some havoc, because md RAID10 insists upon scanning and possibly rebuilding every block on the RAID array before you get full-performance access to it. I changed the max rebuild speed to well above the physical capability of the hardware, and MD10 reported that it was rebuilding at 315Mb/sec. This means approximately a 4% performance overhead from MD10 compared to the raw physical hard drive speed. The kernel thread md10_raid10 was using 66% of one core, and the kernel thread md10_resync was using 15% of another core. This tends to indicate that if I had a full 12-disk RAID10 array, I'd be maxing out a core and getting lower performance -- fairly disappointing. Intransa's code has similar bottlenecks (probably caused by the bio callbacks to the RAID layer from the driver layer), but I'd expected the native code to perform better than ten-year-old code that wasn't originally designed for modern kernels, was not originally designed to talk to local storage, and can do so only via a shim layer that I am, alas, intimately familiar with (Intransa's RAID layer was originally designed to talk to IP-connected RAID cabinets for scale-out storage). So it goes.
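
For reference, the array creation and the rebuild-speed override amount to something like this -- device names and the speed cap value are assumptions:

    # hypothetical device names: create the 6-disk RAID10 array
    mdadm --create /dev/md10 --level=10 --raid-devices=6 /dev/sd[b-g]
    # lift the resync speed cap (value in KB/s) well above what the drives can actually do
    echo 2000000 > /proc/sys/dev/raid/speed_limit_max
    cat /proc/mdstat    # watch the reported rebuild speed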

So now the RAID is rebuilding. I'll come back tomorrow with a new post after it's rebuilt and look at what XFS and NTFS on top of that RAID layer do. In the meantime, don't attach too much importance to these numbers. Remember, this is being done on scrap hardware, basically as an "I wonder" about how good (or how bad) the new filesystems are compared to the oldies for my particular purpose (streaming a lot of video data to the drives). YMMV and all that. And testing it on modern hardware -- SATA3 drives and backplanes, a Sandy Bridge or Ivy Bridge motherboard and processors, etc. -- would likely produce much faster numbers. Still, for my purposes, this random scrap assemblage gives me enough information. So it goes.

- Eric Lee Green

* Disclaimer - I am currently chief storage appliance engineer at Intransa, Inc., a maker of iSCSI storage appliances for video surveillance. This blog post was not, however, conducted using Intransa-owned equipment or on company time, and is my own opinion and my own opinion only.