Saturday, August 25, 2012

Linux block layer, BTRFS, and ZFS On Linux

Long time no blog. Lately I've been stuck way down in the 2.6.32 kernel block device midlayer, both initiating I/O to block devices via the submit_bio interface, and also setting up a midlayer driver.

What I'm finding is that things are a bit of a mess in the 2.6.32 kernel when it comes to device pulls and device removals. When I chug down into the scsi midlayer I see that it's supposed to be completing all bios with -EIO, but there's still situations where when I yank a drive out of the chassis, I don't get all of my endios back with errors because of races in the kernel between device removal and device teardown. The net result is I have no idea what actually made it to disk or not. Note that you will NOT see this racy behavior on a normal system where the completion (almost) always wins the race, I was generating thousands of I/O's per second to 48 disks with CPU usage pretty much maxed out as part of load testing to see how things worked at the limits.

Now, that's no problem for my particular application, which is somewhat RAID-like, or for the Linux MD layer, or for single standalone block device filesystems for that matter. What's on the disk is on the disk, and when the standalone filesystem is remounted it'll know its state at that point by looking at the log. For the RAID type stacking drivers, when it comes back the RAID layer will note that its RAID superblock is out of data and rebuild the disk via mirror or ECC recovery, a somewhat slow process but the end result is a disk in known state. So when I get the disk removal event I mark the device as failed, quit accepting I/O for it, and mark all the work pending for that device that hasn't already endio'ed as errored, and if an endio sneaks in afterwards and tries to mark that work item again as errored, no big deal (although I put a check in the endio so that it simply noops if the work already was completed by the disk removal event). This means I have to keep a work pool around, but that's a good idea anyhow since otherwise I'd be thrashing kalloc/kfree, and if the drive comes back I'll re-use that pool again.

So traditional RAID and standalone devices don't have a problem here. Where a problem exists is with filesystems like btrfs and zfs that do replication on a per-file-block level rather than on a per-block-device-block level. If they can't log whether a block made it or not because they never got an endio, they can get confused. btrfs appears to err on the side of caution (i.e., assumes it didn't get replicated, and replicate it elsewhere if possible) but when the missing volume comes back and has that additional replica on it, strange things may happen. ZFSonLinux is even worse, since its SEL (Solaris Emulation Layer) appears to assume that bios always complete, and deadlocks waiting for bio's to complete rather than properly handle disk remove events. (Note that I haven't really gone digging into SEL, I'm basing this solely on observed behavior).

The good news: The popularity of btrfs among kernel developers appears to be motivating the Linux kernel team to fix this situation in newer kernels. I was chugging through the block subsystem in 3.5 to see if there was something that could be backported to make 2.6.32 behave a bit better here, and noticed some significant new functionality to make the block subsystem more flexible and robust. I didn't backport any of it because it was easier to just modify my kernel module to behave well with the default 2.6.32 behavior (I'm talking *extensive* changes in the block layer in recent kernels), but it appears that the end result is that btrfs on the 3.5 kernel should be *significantly* more reliable than the backported version of btrfs that Red Hat has put into their 2.6.32 kernel on RHEL 6.3.

So that's my recommendation: If you want to run btrfs right now, go with the latest kernel, whether it's on RHEL 6.3 or Fedora 17 or whatever. And you know the reason for my recommendation now. Red Hat has *not* backported any of those block layer changes back to RHEL 6.3, so what you have with btrfs on their stock 2.6.32 kernel is the equivalent of having a knight in shining armor that's missing the breast-plate. Sort of renders the whole exercise useless, in the end.


No comments:

Post a Comment