Sunday, September 27, 2015

SSD: This changes everything

So someone commented on my last post where I predicted that providing block storage to VM's and object storage for apps was going to be the future of storage, and he pointed out some of the other ramifications of SSD. To wit: Because SSD removes a lot of the I/O restrictions that have held back applications in the past, we are now at the point where CPU in many cases is the restriction. This is especially true since Moore's Law has seemingly gone AWOL. The Westmere Xeon processors in my NAS box on the file cabinet beside my desk aren't much slower than the latest Ivy Bridge Xeon processors. The slight bump in CPU speed is far exceeded by the enormous bump in IOPS that comes with replacing rotational storage with SSD's.

I have seen that personally, watching a Grails application max out eight CPU cores while barely budging the I/O load on a database server running off of SSD's. What that implies is that the days of simply throwing CPU at inefficient frameworks like Grails are numbered. In the future, efficient algorithms and languages are going to come back into fashion to make use of all this fast storage that is taking over the world.

But that's not what excites me about SSD's. That's just a shuffling of priorities. What excites me about SSD's is that they free us from the tyranny of the elevator. The elevator is the requirement that we sweep the disk drive heads from bottom to top, then from top to bottom, in order to optimize reads. This in turn puts some severe restrictions on how we lay out block storage -- the data must be laid out contiguously so that filesystems layered on top of the block storage can properly schedule I/O out of their buffers to satisfy the elevator. This in turn means we're stuck with the RAID write hole unless we have battery-backed cache -- we can't do COW RAID stripe block replacement (that is, write altered blocks of a RAID stripe at some new location on the device, then alter a stripe map table to point at those new locations and add the old locations to a free list) because a filesystem on top of the block device would not be able to schedule the elevator properly. The performance of the block storage system would fall over. That's why traditional iSCSI/Fibre Channel vendors present contiguous LUNs to their clients.

As a result when we've tried to do COW in the past, we did it at the filesystem level so that the filesystem could properly schedule the elevator. Thus ZFS and BTRFS. They manage their own redundancy rather than using RAID at the block layer to handle their redundancy, and ideally want to directly manage the block devices. Unfortunately that really doesn't map well to a block storage back end that is based on LUNs, and furthermore, doesn't map well to virtual machine block devices represented as files on the LUN -- virtual machines all have their own elevators doing what they think are sequential ordered writes, but the COW filesystems are writing at random places, so read performance inside the virtual machines becomes garbage. Thus VMware's VMFS, which is an extent-based clustered filesystem that, again, due to the tyranny of the elevator, keeps the blocks of a virtual machine's virtual disk file located largely contiguously on the underlying block storage so that the individual virtual machines' elevators can schedule properly.

So VMFS talking to clustered block storage is one way of handling things, but then you run into limits on the number of servers that can talk to a single LUN. That in turn makes things difficult to manage: you end up with hundreds of LUN's for hundreds of physical compute servers and have to schedule the LUNs so they're only active on the compute servers that have virtual machines on that specific LUN, in order to avoid hitting those limits. What is needed is the ability to allocate block storage on the back end on a per-virtual-machine basis, and have the same capabilities on that back end that VMFS gives us on a single LUN -- the ability to do snapshots, the ability to do sparse LUN's, the ability to copy snapshots as new volumes, and so forth. And have it all managed by the cloud infrastructure software. This was difficult back in the days of rotational storage because we were slaves of the elevator, because we had to make sure that all this storage ended up contiguous. But now we don't -- the writes still have to be contiguous, due to the limitations of SSD, but reads don't. And it's the reads that forced the elevator -- scheduling contiguous streams of writes (from multiple virtual machines / multiple files on those virtual machines) has always been easy.
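
As a rough illustration of the primitives I'm talking about -- sparse volumes, snapshots, and writable copies on a per-virtual-machine basis -- here is what they look like with stock LVM thin provisioning. Treat this as a sketch; the volume group and volume names are made up:

    # create a thin pool inside an existing volume group "vg0"
    lvcreate -L 2T -T vg0/thinpool
    # allocate a sparse ("thin") 100GB volume for one virtual machine
    lvcreate -V 100G -T vg0/thinpool -n vm01-disk0
    # take a point-in-time snapshot of that volume
    lvcreate -s vg0/vm01-disk0 -n vm01-disk0-snap1
    # thin snapshots are skipped at activation by default; -K activates it anyway
    lvchange -ay -K vg0/vm01-disk0-snap1

The missing piece is exactly what this post is about: doing that on the storage back end, per virtual machine, under the control of the cloud infrastructure software, rather than by hand on a single host.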

I suspect this difficulty in managing VMFS on top of block storage LUNs for large numbers of ESXi compute servers is why Tintri decided to write their own extent-based filesystem and serve it as an NFS datastore to ESXi boxes, rather than as block storage LUN's. NFS doesn't have the limits on the number of computers that can connect. But I'm not convinced that, going forward, this is going to be the way to do things. vSphere is a mature product that has likely reached the limits of its penetration. New startups today are raised in the cloud, primarily on Amazon's cloud, and they want a degree of flexibility to spin virtual machines up and down that makes life difficult with a product that has license limits. They want to be able to spin up entire test constellations of servers to run multi-day tests on large data sets, then destroy them with a keystroke. They can do this with Amazon's cloud. They want to be able to do this on their local clouds too. The future is likely to be based on the KVM/QEMU hypervisor and virtualization layer, which can use NFS datastores but already has the ability to present an iSCSI LUN to a virtual machine as a block device. Add in some local SSD caching at the hypervisor level to speed up writes (as I explained last month), and you have both the flexibility of the cloud and the speed of SSD. You have the future -- a future that few storage vendors today seem to see, but one that the block storage vendors in particular are well equipped to capture if they're willing and able to pivot.
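
For what it's worth, the KVM/QEMU side of that is already there: QEMU can consume an iSCSI LUN directly via libiscsi, with no host-side initiator or filesystem in the middle. A sketch, with a made-up portal address and target name:

    # attach LUN 1 of an iSCSI target directly to a guest as a virtio disk
    qemu-system-x86_64 -enable-kvm -m 4096 \
      -drive file=iscsi://192.168.1.50/iqn.2015-09.com.example:vm01/1,if=virtio,cache=none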

Finally, there is a question as to whether storage and compute should be separate things altogether. Why not have compute in the same box as your storage? There are two problems with that, though: 1) you want to upgrade compute capability to faster processors on a regular basis without disrupting your data storage, and 2) the density of compute servers is much higher than the density of data servers, i.e., you can put four compute blades into the same 2U space as a 24-bay data server. And as pointed out above, compute power is now going to be the limiting factor for many applications, not IOPS. On top of that, you want the operational capability to add more compute servers as needed. When our team used up the full capacity of our compute servers, I just added another compute server -- I had plenty of storage. Because the demand for compute and memory just keeps going up as our team has more combinations of customer hardware and software to test, it's likely I'm going to continue to have to scale compute servers far more often than I have to scale storage servers.

So this has gone on much too long, but the last thing to cover is this: Will storage boxes go the way of the dodo bird, replaced by software-defined solutions like Ceph on top of large numbers of standard Linux storage servers serving individual disks as JBOD's? It's possible, I suppose -- but it seems unlikely due to the latency of having to locate disk blocks scattered across a network. I do believe that commodity hardware is going to win everything except the high-end big iron database business in the end, because the performance of commodity hardware has risen to the point where it's pointless to design your own hardware rather than purchase it off the shelf from a vendor like Supermicro. But there is still going to be a need for a storage stack tied to that hardware, because pure software-defined solutions are unable to do rudimentary things like, e.g., use SES to blink the LED of a disk bay whose SSD has failed. In the end, providing an iSCSI LUN directly to a virtual machine requires both a software support side that is clearly software-defined, and a hardware support side where the hardware is managed by the solution. This in turn implies that we'll continue to have storage vendors shipping storage boxes in the future -- albeit storage boxes that will incorporate increasingly large amounts of software that runs on infrastructure servers to define important functions like, e.g., spinning up a virtual machine that has a volume attached of a given size and IOPS guarantee.
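
(For the curious, that "blink the LED" bit is the sort of thing the ledmon package does on hardware it can see directly -- assuming an enclosure that actually speaks SES, and with a made-up device name:)

    # blink the locate LED on the bay holding the failed drive
    ledctl locate=/dev/sdk
    # turn it back off once the drive has been swapped
    ledctl locate_off=/dev/sdk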

-ELG

Saturday, August 1, 2015

The quest for an integrated storage stack

In prior posts I've mentioned the multitude of problems with the standard Linux storage stack. It's inflexible -- once you've set up a stack (usually LV->VG->PV->MD->BLOCK) and opened a filesystem on it, you cannot modify it to, e.g., add a replication layer to the stack. It lacks the ability to do geographic replication in any reasonable fashion. The RAID layer in particular lacks the ability to write to (and replay) a battery-backed RAM cache to deal with the RAID 5 write hole (which, despite its name, also applies to other RAID levels and results in silently corrupted data). Throw iSCSI into this equation to provide block devices to virtual machines and, potentially, to do replication to block devices on other physical machines, and things get even more complex.
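
To make the inflexibility concrete, here is roughly how such a stack gets assembled today (device names are hypothetical). Each layer is welded to the one below it at creation time, which is why you can't slide a replication or cache layer in later without tearing the whole thing down:

    # RAID layer on raw block devices
    mdadm --create /dev/md0 --level=10 --raid-devices=4 /dev/sd[b-e]
    # LVM physical volume, volume group, and logical volume on top of the RAID
    pvcreate /dev/md0
    vgcreate vg0 /dev/md0
    lvcreate -L 500G -n data vg0
    # filesystem on top of the logical volume
    mkfs.xfs /dev/vg0/data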

One method that has been proposed to deal with these issues is to simply not use a storage stack at all. Thus we have ZFS and BTRFS, which attempt to move the RAID layer and logical volume layers into the filesystem. This certainly solves the problem of corrupted data, but at a significant penalty in terms of performance, especially on magnetic media where the filesystem swiftly becomes fragmented. As a result running virtual machines using "block devices" that are actually files on a BTRFS filesystem results in extremely poor "disk" performance on the virtual machines. A file on a log-based subsystem is simply a poor substitute for an extent on a block device. Furthermore, use of these filesystems for databases has proven to be woefully slow compared to using a normal filesystem like XFS on top of a RAID-10 layer.

The other method that has been proposed is to abandon the Linux storage stack except as a provider of individual block devices and instead layer a distributed system like Ceph on top of it. My tests with Ceph have not been particularly promising. Performance of Ceph block devices at an individual virtual machine level was abysmal. There appear to be three reasons for this: 1) overly pessimistic assumptions about writes on the part of Ceph, 2) the inherent latencies involved in a distributed storage stack, and 3) the fact that Ceph reads/writes via XFS filesystems layered on top of block devices, rather than to extents on raw block devices. For the latter, in my experience you will see *at least* a 10% degradation in virtual machine block device performance if the block device is implemented as a file on top of XFS rather than directly to an LVM extent.
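
A test along those lines is easy to reproduce if you want to check my impressions against your own hardware: create an RBD image, map it (or hand it to a virtual machine), and run a random-write load against it, then run the same load against an LVM extent. A sketch, with made-up pool and image names:

    # create and map a 100GB RBD image (size is in megabytes)
    rbd create --size 102400 rbd/bench-test
    rbd map rbd/bench-test                # appears as /dev/rbd0
    # 4K random writes against the mapped device
    # (this writes to the raw device -- use a scratch image, not one with data on it)
    fio --name=randwrite --filename=/dev/rbd0 --ioengine=libaio --direct=1 \
        --rw=randwrite --bs=4k --iodepth=32 --runtime=60 --time_based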

In both cases, I wonder if we are throwing out the cart because the horse has asthma. I've worked as a software engineer for two of the pioneers of Linux-based storage -- Agami Systems, which did a NAS device with an integrated storage system, and Intransa Inc., which did scalable iSCSI storage systems with an integrated block storage subsystem. Both suffered the usual fate of pioneers -- i.e., face down dead with arrows in the back, though it took longer with Intransa than with Agami. Both wrote storage stacks for Linux which solved most of the problems of the current Linux storage stack, though each solved a different subset of those problems. There are still a significant number of businesses which do not need the expense and complexity of a full OpenStack data center in order to solve their problems, but which do need things like, e.g., logged geographic replication to replicate their data to an offsite location, something which Intransa solved ten years ago (but which, alas, died with Intransa), or real-time snapshots of virtual machine block devices at the host device level, or ...

In short: Despite the creation of distributed systems like Ceph and integrated storage-management filesystems like BTRFS, there is a significant need for an integrated storage stack for Linux -- one that allows flexibility in configuring both block devices and network filesystems, that allows for easy scalability and management, that has modern features such as logged geographic replication and battery-backed RAM cache support (or at least fast SSD log device support at the MD layer), and that allows dynamic insertion of components into the software stack, much as you could create a replication layer in the Intransa StorStac and have it sync and then replicate to a remote device without ever unmounting any filesystem or making the iSCSI target inaccessible. There are simply a large number of businesses which just don't need the expense and complexity of a full OpenStack data center, which indeed don't need more than a pair of iSCSI / NAS storage appliances (a pair in order to handle replication and snapshotting), and the current Linux storage stack lacks fundamental functionality that was implemented over a decade ago but never integrated into Linux itself. It may not be possible to bring all the concepts that Agami and Intransa created into Linux (though I'll point out that all of Intransa's patents are now owned by a patent entity that allows free use for Open Source software), but we should attempt to bring as many of them as possible into the standard Linux storage stack -- because the cloud is the cloud, but most smaller businesses have no need for the cloud; they just need reliable local storage for their local physical and virtual machines.

-ELG

Friday, April 26, 2013

On spinning rust and SSD's.

I got my Crucial M4 512GB SSD back for my laptop. It failed about three weeks ago: when I turned on my laptop, it simply wasn't there. Complete binary failure mode -- it worked, then it didn't work. So I took it out of the laptop, verified in an external USB enclosure that it didn't "spin up" there either, installed a 750GB WD Black 7200 rpm rust-spinner that was in my junk box for some project or another, and re-installed Windows and restored my backups. Annoying, but not fatal by any means. I've had to get used to the slow speed of spinning rust again versus the blazingly fast SSD, but at least I'm up and running. So this weekend I get to make another full backup, then swap out the rust for the SSD again.

At work I've had to replace several of the WD 2TB Enterprise drives in the new Linux-based infrastructure when smartd started whining about uncorrectable read errors. When StorStac got notification of that sort of thing it re-wrote the sector from the RAID checksums and that usually resolved it. The Linux 3.8 kernel's md RAID6 layer apparently doesn't do that, requiring me to kick the drive out of the md, slide in a replacement, fire off a rebuild, and then haul the drive over to my desktop and slide it in there and run a blank-out (write zeroes to the entire drive). Sometimes that resolves the issue, sometimes the drive really *is* toast, but at least it was an analog error (just one or two bad places on the drive), not a complete binary error (the entire drive just going blammo).
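
For reference, the dance goes something like this -- device names are hypothetical, so check /proc/mdstat and the smartd logs before yanking anything:

    # confirm what smartd is complaining about
    smartctl -A /dev/sdf | grep -iE 'pending|realloc|uncorrect'
    # kick the failing drive out of the array and add the replacement
    mdadm --manage /dev/md0 --fail /dev/sdf --remove /dev/sdf
    mdadm --manage /dev/md0 --add /dev/sdg
    # watch the rebuild
    cat /proc/mdstat
    # blank out the suspect drive on another box and see whether the errors persist
    dd if=/dev/zero of=/dev/sdf bs=1M oflag=direct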

SSD's are the future. The new COW filesystems such as ZFS and BTRFS really don't do too well on spinning rust, because by their very nature they fragment badly over time. That doesn't matter on SSD's; it does matter with rust-spinners, for obvious reasons. With ZFS you can still get decent performance on rust if you use a second-level SSD cache -- that's how I do my backup system here at home (an external USB3 hard drive plus an internal SSD in my server). BTRFS has no such mechanism at present, but to a certain extent it compensates by having a (manual) de-fragmentation process that can be run from time to time during "off" hours. Still, both filesystems clearly prefer SSD to rotational storage. It's just the nature of the beast. And those filesystems have sufficient advantages in terms of functionality and reliability (except in virtualized environments as virtual machine filesystems -- but more on that later) that if your application can afford SSD's, that alone may be the tipping point that makes you go to SSD-based storage rather than rotational storage.
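
Concretely, hanging an SSD off a ZFS pool as a second-level read cache, and running the BTRFS defragmenter, look roughly like this (pool name, device path, and mount point are made up for the example):

    # ZFS: add an SSD as an L2ARC read cache to the pool "backup"
    zpool add backup cache /dev/disk/by-id/ata-SomeSSD-part4
    # BTRFS: recursively defragment a filesystem during off hours
    btrfs filesystem defragment -r /mnt/backup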

Still, it's clear to me that, at this time, SSD is still an immature technology subject to catastrophic failure with no warning. Rotational storage usually gives you warning: you start getting SMART notifications about sectors that cannot be read, about sectors being relocated, and so forth. So when designing an architecture for reliability, it is unwise to have an SSD be a single point of failure, as is often done for ESXi servers that lack hardware RAID cards supported by ESXi. It might *seem* that SSD is more reliable than rotational storage, and on paper that may even be true. But because the nature of the failures is different, in *reality* rotational storage gives you a much better chance of detecting and recovering from a failing drive than SSD's do. That may or may not be important for your application -- in RAID it clearly isn't a big deal, since you'll be replacing the drive and rebuilding a new drive anyhow -- but for things like an ESXi boot drive it's something you should consider.

-ELG

Tuesday, September 25, 2012

BTRFS vs ZFSonLinux: How do they compare?

  • Integration with Linux
    • ZFS: Not integrated. Has its own configuration database (not /etc/fstab), has its own boot order for mounting filesystems (not definable by you), cannot be told to bring a filesystem up after iSCSI comes up or down before iSCSI goes down.
    • BTRFS: It's just another Linux filesystem as far as the system is concerned. You bring a pool up by mounting it (preferably by label) in /etc/fstab and can define the mount order so it comes up after iSCSI (see the command sketches after this list).
  • Snapshots
    • ZFS: Full snapshot creation and removal capabilities, well exploited by the FreeBSD port 'zfs-periodic'. Snapshots appear in a special "dot" directory rather than cluttering up the main filesystem. This script is relatively easy to port to Linux.
    • BTRFS: Snapshots are created as "clones" of subvolumes, and destroyed as if they were subvolumes. They can be created either read-write or read-only.
  • RAID: Both of these use filesystem-level RAID where filesystem objects are stored redundantly, either as entire clones (RAID1) or, in the case of ZFS, via RaidZ parity:
    • ZFS: Raid1 (mirroring) and RaidZ (similar to RAID5, except that it never does partial-stripe writes because it does variable stripe size -- the size of an object is the size of a stripe). Note that due to ZFS's COW implementation, an update to a RAID stripe cannot be corrupted by a power loss halfway through the write (see: RAID5 write hole) -- the old copy of the data (prior to the start of the write) is instead accessed when power comes back on.
    • BTRFS: Raid1 (mirroring). BTRFS currently has nothing like RaidZ. Note that putting a BTRFS filesystem on top of a software mdadm RAID5 will not give you the same reliability and performance as RaidZ, since you will still have the random write hit of partial-stripe writes and will still have the RAID5 write hole where, if a stripe update fails due to power loss halfway through the stripe write, the entire stripe is corrupted.
  • Portability
    • ZFS: A ZFS filesystem can be read / written on: Linux (via either ZFS/Fuse or ZFSonLinux), FreeBSD, OpenIndiana, and MacOS (via Zevo). Requires extra 3rd party software to be installed on Linux and MacOS, comes standard with FreeBSD and OpenIndiana.
    • BTRFS: Any recent Linux distribution (one with a 3.x vintage kernel) has BTRFS built in. Your BTRFS pools will be immediately available when you upgrade to a newer kernel or a newer Linux distribution, with no need to install any additional software. However, BTRFS doesn't run on any other OS.
  • Stability
    • On Linux, both BTRFS and ZFS are listed as "experimental". ZFSonLinux uses SPL (the Solaris Porting Layer) as a "shim" between ZFS proper and Linux. Unfortunately this is sort of like nailing jello to a tree: while the underlying Linux block layer API hasn't changed in years, locking inside that block layer API has been in constant turmoil ever since the 2.6.30 timeframe as the last vestiges of the Big Kernel Lock were ferreted out and sent to the great bit bucket in the sky. The end result is that code that *used* to work may or may not cause deadlocks or strange races that cause an oops with current Linux kernels -- *UNLESS* it was developed as part of that current Linux kernel, as BTRFS is, in which case the person who changes the locks is responsible for making sure that all other kernel modules that are part of the next kernel release change their locks to match.
    • Summary: On Linux, this is a tie. BTRFS is under rapid development. ZFS is attempting to nail jello to a tree from outside the Linux kernel. Use for production data of either system on Linux is not recommended. If you want a production server running a production-quality modern snapshotting filesystem, use ZFS on FreeBSD.
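
A few command-level sketches of the points above, in case they read as too abstract. Labels, pool names, and devices are all made up, so treat this as a sketch rather than a recipe:

    # BTRFS integration: mount the pool by label from /etc/fstab; the _netdev
    # option delays the mount until the network (and hence iSCSI) is up
    LABEL=pool0   /srv/pool0   btrfs   defaults,noatime,_netdev   0 0

    # Snapshots
    zfs snapshot tank/home@2012-09-25                # ZFS: snapshot a dataset
    zfs destroy tank/home@2012-09-25
    btrfs subvolume snapshot -r /srv/pool0/home /srv/pool0/home-2012-09-25   # BTRFS: read-only clone
    btrfs subvolume delete /srv/pool0/home-2012-09-25

    # Filesystem-level RAID
    zpool create tank raidz /dev/sdb /dev/sdc /dev/sdd /dev/sde   # ZFS RaidZ
    mkfs.btrfs -d raid1 -m raid1 /dev/sdf /dev/sdg                # BTRFS mirroring
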
Final summary:

If you must use Linux, and you must have a modern snapshotting filesystem, and you can live with a RAID1 limitation on data redundancy, I would strongly recommend going with BTRFS. The reason for this is that BTRFS is only going to get better on Linux, while ZFS is always going to be fighting the nail-jello-to-a-tree issue where Linux keeps changing underneath it and breaking things in weird ways. Unless ZFS is included as part of the Linux kernel -- and Oracle's lawyers will never allow changing the license to GPL in order to allow that -- there simply is no way ZFS will ever achieve stability except with specific kernel versions shipped with specific distributions. And even there I'm dubious.

If you need the stability of ZFS, I strongly recommend using FreeBSD and not using Linux. I have personal experience dealing with the issues that come with supporting an emulation layer on top of the Linux block layer, including dealing with some deadlocks and races caused by locking changes inside recent kernels that caused a six-week delay in the release of an important product, and I honestly cannot say that any current ZFSonLinux implementation will continue to work with the next kernel revision. I can reliably say that BTRFS will work with the next kernel revision. While production servers don't change kernel revisions often, only once every three or four years, if the next version of the server OS doesn't happen to be one that is well supported by ZFSonLinux's then-current SPL implementation, you have problems.

So: Linux -- BTRFS. If you need the functionality of ZFS -- FreeBSD. Enough said on that.

Wednesday, September 12, 2012

This is the droid you're looking for

I have a new toy now. I ditched my aging iPhone 4 upon completion of its contract (and ported its number into Google Voice), and now have a brand new Samsung Galaxy S3.

So far it's mostly all good. Battery life is bad, but we already knew that. I tried several different home screen programs but I'm sticking with TouchWiz for now because the updated one for the Galaxy S3 works as well as anything else I tried, even the backported Jellybean launcher. It has lousy reception inside company HQ but so did the iPhone 4, just an AT&T thing I guess (my Verizon iPad has great reception inside company HQ). I'm still looking for a clean solution for automatically syncing my photos into iPhoto, but iSyncr is doing a reasonably good job of getting them onto my Macbook so I'm not too displeased.

Thus far I've found substitutes for everything I did on my iPhone except one: There is no good offline GPS program like the Magellan program that I used on the iPhone. Supposedly TomTom is going to be remedying that soon. We'll see.

So anyhow, I have found one bug in the Galaxy S3's ICS Android version: It does not handle exFAT very well. I found this out the hard way when my 64GB microSD card quit working and reported, "Damaged SD Card". Indeed, checking the Internet, it appears that random exFAT corruption is an epidemic on the Galaxy S3. This afflicts any microSD over 32GB, since Microsoft officially says FAT32 won't go over 32GB. This is, of course, a lie -- FAT32 is quite capable of handling terabyte-sized filesystems -- but because Microsoft enforces this limit in all their filesystem tools, nobody knew it was a lie until they actually looked at FAT32 and realized hey, this will work with bigger filesystems! (Though there is still that nasty 4GB limit on file size to contend with).

So how did I resolve this problem? First, I put the flash into a Windows 7 laptop and ran chkdsk on it. This found and fixed some problems. But when I put it back into the Galaxy S3 it *still* said "Damaged SD Card" despite the fact that Windows 7 said it was clean. So I resolved to reformat as FAT32. I copied the data off, and then had to go find a tool that would actually format a 64GB microSD card as FAT32, since the Windows disk manager won't do so: EaseUS Partition Master.
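
(For the record, if you have a Linux box handy you don't need a third-party Windows tool at all -- dosfstools will happily make a FAT32 filesystem of that size. The device name below is hypothetical; double-check yours with lsblk before writing anything:)

    # format the card's first partition as FAT32, ignoring Microsoft's 32GB cutoff
    mkfs.vfat -F 32 -n SDCARD /dev/sdX1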

At that point it was just a matter of copying the data back on, which was very... very... slow since Windows operates SD cards in sync mode. As in, an hour slow. I know where the async flag lives in Windows and could have flipped it, but it was trash night so I did chores around the house instead. At the end of the process I inserted the microSD into the Sammy and... no more "Damaged SD Card".

Executive summary: If you buy a microSD card with greater than 32GB capacity, it is likely that it is formatted with Microsoft's proprietary exFAT filesystem and will not work well on Android unless you reformat it, even if it appears to work correctly at first. exFAT is not supported well because it is patented by Microsoft and thus does not have the magic of dozens of eyes of Open Source developers noticing and fixing bugs in it. So reformat it using the EaseUS tool above (NOT the internal Samsung formatter, it'll put the buggy exFAT filesystem back onto it) *before* you put stuff on it. Otherwise you'll be going through this whole time-consuming dance yourself sooner or later. Fun, it was not.

-ELG