Thursday, November 21, 2013

EBS, the killer app for Amazon EC2

So I gave Ceph a try. I set up a Ceph cluster in my lab with three storage servers connected via 10 gigabit Ethernet. The data store on each machine is capable of approximately 300 megabytes per second streaming throughput. So I created a Ceph block device and ran streaming I/O with large writes to it. As in, 1 megabyte writes. If writing to the raw data store, I get roughly 300 megabytes per second throughput. If writing through ceph, I get roughly 30 megabytes per second throughput.

"But wait!" the folks on the Ceph mailing list said. "We promise good *aggregate* throughput, not *individual* throughput." So I created a second Ceph block device and striped it with the first. And did my write again. And got... 30 megabytes per second. Apparently Ceph aggregated all I/O coming from my virtual machine and serialized it. And this is what its limit was.

This seems to be a hard limit with Ceph if you're not logging to a SSD. And not any old SSD. One optimized for small writes. There seems to be some fundamental architectural issues with Ceph that keep it from performing at anywhere near hardware speed. The decision to log everything, regardless of whether your application needs that level of reliability, appears to be one of those decisions. So Ceph simply doesn't solve a problem that's interesting to me. My users are not going to tolerate running at 1/10th the speed they're accustomed to running, and my management is not going to appreciate me telling them that we need to buy some very pricy and expensive enterprise SSD hardware when the current hardware with the stock Linux storage stack on it runs like a scalded cat without said SSD hardware. There's no "there" there. It just won't work for me.

So that's Ceph. Looks great on paper, but the reality for me is that it's 1/10th the speed without any real benefits for me over just creating iSCSI volumes on my appliances, pointing my virtual machines at them, and then using MD RAID on the virtual machine to do replication between the iSCSI servers. Yes that's annoying to manage, but at least it works, works *fast*, and my users are happy.

At which point let's talk about Amazon EC2. Frankly, EC2 sucks. First, it isn't very elastic. I have autoscaling alerts set up to spin up new virtual machines when load on my cloud reaches a certain point. Okay, fine. That's elastic, right? But: While an image is booting up and configuring, it's using 100% CPU. Which means that your alarm goes off *again* and spin up more instances, ad infinitum, until you hit the upper limit you configured for the autoscaling group, *unless* you put hysteresis in there to wait until the instance is up before you spin up another instance. So: It took five minutes to spin up a new instance. That's not acceptable. If the load has kept going up in the meantime, that means you might never catch up. I then created a new AMI that had Tomcat and other necessary software already pre-loaded. That brought the spin up time to three minutes -- one minute for Amazon to actually create the instance, one minute for the instance to boot and run Puppet to pull in the application .war file payload from the puppetmaster, and one minute for Tomcat to actually start up. Acceptable... barely. This is elastic... if elastic is eons in computer time. The net result is to encourage you to spin up new instances before you need them, spin up *multiple* instances at a time when spinning up instances, and then take a while before tearing them back down again once load goes down. Not the best way to handle things by any means, unless you're Amazon and making money by forcing people to keep excess instances hanging around.

Then let's not talk about the fact that CloudFormation is configured via JSON. A format deliberately designed without comments for data interchange between computers, *not* for configuring application stacks. Whoever specified JSON as the configuration language needs to be taken out behind Amazon HQ and beat with a clue stick until well and bloody. XML is painful enough as a configuration language. JSON is pure torture. Waterboarding is too good for that unknown programmer.

And then there's expense. Amazon EC2 is amazingly expensive compared to somebody like, say, Digital Ocean or Linode. My little 15 virtual machine cloud would cost roughly $300/month at Linode, less at Digital Ocean. You can figure four times that amount for EC2.

So why use EC2? Well, my Ceph experiment should clue you in there: it's all about EBS, the Elastic Block Store. See, I have data that I need to store up in the cloud. A *lot* of data. And if you create EBS-optimized virtual machines and striped MD RAID arrays across multiple EBS volumes, your I/O is wicked fast. With just two EBS volumes I can easily exceed 1,000 IOPS and 100 megabytes per second when doing pg_restore of database files. With more EBS volumes I could do even better.

Digital Ocean has nothing like EBS. Linode has nothing like EBS. Rackspace and HP have something like EBS (in public beta for HP right now, so not sure how much to trust it), but they don't have a good instance size match with what I need and if you go the next size up, their pricing is even more ridiculous than Amazon's. My guess is that as OpenStack matures and budget providers adopt it, you're going to see prices come down for cloud computing and you're going to see more people providing EBS-like functionality. But right now OpenStack is chugging away furiously trying to match Amazon's feature set, and is unstable enough that only providers like HP and Linode who have hundreds of network engineers to throw at it could possibly do it right. Each iteration gets better so hopefully the next iteration will be better. (Note from 10 months later: nope. Still not ready for mere mortals). Finally, there's Microsoft's Azure. I've heard good things about it, oddly enough. But I'm still not trusting it too much for Linux hosting, given that Microsoft only recently started giving grudging support to Linux. Maybe in six months or so I'll return and look at it again. Or maybe not. We'll see.

So Amazon's cloud it is. Alas. We look at the Amazon bill every month and ask ourselves, "surely there is a better alternative?" But the answer to that question has remained the same for each of the past six months that I've asked it, and it remains the same for one reason: EBS, the Elastic Block Store.


Saturday, August 24, 2013

The Linux storage stack: is it ready for prime time yet?

I've been playing with LIO quite a bit since rolling it into production for Viakoo's infrastructure (and at home for my personal experiments). It works quite a bit differently from the way that Intransa's Storstac worked. Storstac created a target for each volume being exported, while with LIO you have a single target that exports a LUN for each volume being exported. The underlying Linux kernel functionality is there to create a target per volume, but the configuration infrastructure is oriented around the LUN per volume paradigm.
Not a big deal, you might say. But it does make a difference when connecting with the Windows initiator. With the Windows initiator, the target per volume paradigm allows you to see what volume a particular LUN is connected to (assuming you give your targets descriptive names, which StorStac does). This in turn allows you to easily coordinate management of a specific target. For example to resize it you can offline it in Windows, stop exporting it on the storage service, rescan in Windows, expand the volume on your storage server, re-export it on your storage server, then online it in Windows and expand your filesystem to fill the newly opened up space. Still, this is not a big deal. LIO does perform quite well and does have the underlying capabilities that can serve the enterprise. So what's missing to keep the Linux storage stack from being prime time? Well, here's what I see:
  1. Ability to set up replication without taking down your filesystems / iSCSI exports. Intransa StorStac had replication built in, you simply set up a volume the same size on the remote target, told the source machine to replicate the volume to the remote target, and it started replicating. Right now replication is handled by DRBD in the Linux storage stack. DRBD works very well for its problem set -- local area high availability replication -- but to set up a replication after the fact on a LVM volume simply isn't possible. You have to create a drbd volume on top of a LVM volume, then copy your data into the new drbd volume. One way around this would be to automatically create a drbd volume on top of each LVM volume in your storage manager, but that adds overhead (and clutters your device table) and presents problems for udev at device assembly time. And still does not solve the problem of:
  2. Geographic replication: StorStac at one time had the ability to do logged replication across a WAN. That is, assuming that your average WAN bandwidth is high enough to handle the number of writes done during the course of a workday, a log volume will collect the writes and ship them across the WAN in the correct order to be applied at the remote end. If you must do a geographic failover due to, say, California falling into the sea, you lose at most whatever log entries have not yet been applied at the remote end. Most filesystems will handle that in a recoverable manner as long as the writes are being applied in the correct order (which they are). DRBD *sort of* has the ability to do geographic replication via an external program, "drbd-proxy", that functions in much the same way as StorStac replication (that is, it keeps a log of writes in a disk volume and replays them to the remote server), but it's not at all integrated into the solution and is excruciatingly difficult to set up (which is true of drbd in general).
  3. Note that LVM also has replication (of a sort) built in, via its mirror capability. You can create a replication storage pool on the remote server as a LVM volume, export it via LIO, import it via open-iscsi, create a physical volume on it, then create mirror volumes specifying this second physical volume as the place you want to put the mirror. LVM also does write logging so can handle the geographic situation. The problem comes with recovery, since what you have on the remote end is a logical volume that has a physical volume inside it that has one or more logical volumes inside it. The circumlocutions needed to actually mount and use those logical volumes inside that physical volume inside that logical volume are non-trivial, it may in fact be necessary to mount the logical volume as a loopback device then do pvscan/lvscan on the loopback device to get at those volumes. It is decidedly *not* as easy as with StorStac, where a target is a target, whether it's the target for a replication or for a client computer.
So clearly replication in the Linux storage stack is a mess, nowhere near the level of ease of use or functionality as the antiquated ten-year-old Intransa StorStac storage stack. The question is, how do we fix it? I'll think about that for a while, but meanwhile there's another issue: Linux doesn't know about SES. This is a Big Deal for big servers. SES is the SCSI Enclosure Services protocol that is implemented by most SAS fanout chips and allows control of, amongst other things, the blinky lights that can be used to identify a drive (okay, so mdmonitor told you that /dev/sdax died, where the heck is that physically located?!) . There are basically two variants extant nowadays, SAS and SAS2, that are very slightly different (alas, I had to modify StorStac to talk to the LSI SAS2X24 expander chip which very slightly changed a mode page that we depended upon to find the slot addresses). Linux itself has no notion that things like SAS disk enclosures even exist, much less any idea how to blink lights in them.

And finally, there is the RAID5/RAID6 write hole issue. Right now the only reliable way to have RAID5/RAID6 on Linux is with a hardware RAID controller that has a battery-backed stripe cache. Unfortunately once you do this, you can no longer monitor drives via smartd to catch failures before they happened (yes, I do this, and yes, it works -- I caught several drives in my infrastructure that were doing bad things before they actually failed and replaced them before I had to deal with a disaster recovery situation), you can no longer take advantage of your server's gigabytes of memory to keep a large stripe cache so that you don't have to keep thrashing the disks to load stripes in the case of random writes (if the stripe is already in cache, you just update the cache and write the dirty blocks back to the drives, rather than have to reload the entire stripe) and you can also no longer take advantage of the much faster RAID stripe computations allowed by modern server hardware (it's amazing how much faster you can do RAID stripe calculations with a 2.4Ghz Xeon than you can with an old embedded MIPS processor running at much slower speeds). In addition it is often very difficult to manage these hardware RAID controllers from within Linux. For these reasons (and other historical issues not of interest at the moment) StorStac always used software RAID. Historically, StorStac used battery-backed RAM logs for its software RAID to cache outstanding writes and recover from outages, but such battery-backed RAM log devices don't exist for modern commodity hardware such as the 12-disk Supermicro server that's sitting next to my desk. It doesn't matter anyhow, because even if it did exist, there's no provision in the current Linux RAID stack to use it.

So what's the meaning of all this? Well, the replication issue is... troubling. I will discuss that more in the future. On the other hand, things like Ceph are handling it at the filesystem level now, so perhaps block level replication via iSCSI or other block-level protocols isn't as important as it used to be. For the rest, it appears that the only thing lacking is a management framework and a utility to handle SES expander chips. The RAID[56] write hole is troublesome, but in reality data loss from that is quite rare, so I won't call it a showstopper. It appears that we can get 90% of what the Intransa StorStac storage stack used to do by using current Linux kernel functionality and a management framework on top of that, and the parts that are missing are parts that few people care about.

What does that mean for the future? Well, your guess is as good as mine. But to answer the question about the Linux storage stack: Yes, it IS ready for prime time -- with important caveats, and only if a decent management infrastructure is written to control it (because the current md/lvm tools are a complete and utter fail as anything other than tools to be used by higher-level management tools). The most important caveat being, of course, that no enterprise Linux distribution has been released yet with LIO (I am using Fedora 18 currently, which is most decidedly *not* what I want to use long-term for obvious reasons). Assuming that Red Hat 7 / Centos 7 will be based on Fedora 18, though, it appears that the Linux storage stack is the closest to being ready for prime time as it's ever been, and proprietary storage stacks are going to end up migrating to the current Linux functionality or else fall victim to being too expensive and fragile to compete.


Sunday, August 11, 2013

The killer app for virtualization

The killer application for virtualization is... running legacy operating systems.

This isn't a new thought on my part. When I was designing the Intransa StorStac 7.20 storage appliance platform I deliberately put virtualization drivers into it so that we could run Intransa StorStac as a virtual appliance on some future hardware platform not supported by the 2.6.32 kernel. And yes, that works (no joke, I tried it out of course, the only thing that didn't work was sensors but if Viakoo ever wants to deliver a virtualized IntransaBrand appliance I know how to fix the sensors). My thought was future-proofing -- I could tell from the layoffs and from the unsold equipment piled up everywhere that Intransa was not long for the world, so I decided to leave whoever bought the carcass a platform that had some legs on it. So it has drivers for the network chips in the X9 series SuperMicro motherboards (Sandy/Ivy Bridge) as well as the virtualization drivers. So there's now a pretty reasonable migration path to keep StorStac running into the next decade... first migrate it to Sandy/Ivy Bridge physical hardware, then once that's EOL'ed migrate it to running on top of a virtual platform on top of Haswell or its successors.

But what brought it to mind today was ZFS. I need some of the features of the LIO iSCSI stack and some of the newer features of libvirtd for some things I am doing, so have ended up needing to run a recent Fedora on my big home server (which is now up to 48 gigabytes of memory and 14 terabytes of storage). The problem is that two of those storage drives are offsite backups from work (ZFS replication, duh) and I need to use ZFS to apply the ZFS diffsets that I haul home from work. That was not a problem for Linux kernels up to 3.9, but now Fedora 18/19 have rolled out 3.10, and ZFSonLinux won't compile against the 3.10 kernel. I found that out the hard way when the new kernel came in, and DKMS spit up all over the floor because of ZFS.

The solution? Virtualization to the rescue! I rolled up a Centos 6.4 virtual machine, pushed all the ZFS drives into it, gave it a fair chunk of memory, and voila. One legacy platform that can sit there happily for the next few years doing its thing, while the Fedora underneath it changes with the seasons.

Of course that is nothing new. A lot of the infrastructure that I migrated from Intransa's equipment onto Viakoo's equipment was virtualized servers dating in some cases all the way back to physical servers that Intransa bought in 2003 when they got their huge infusion of VC money. Still, it's just a practical reminder of the killer app for virtualization -- the fact that it allows your OS and software to survive despite underlying drivers and architectures changing with the seasons. Now making your computer work faster can be done without changing anything at all about it -- just buy a couple of new virtualization servers with the very latest fastest hardware and then migrate your virtual machines to them. Quick, easy, and terrifies OS vendors (especially Microsoft) like crazy because now you no longer need to buy a new OS to run on the new hardware, you can just keep using your old reliable OS forever.


Thursday, July 25, 2013

No magic bullet, I guess

Amazon introduced a new service, Opsworks, back in the spring. This is supposed to make cloud stack creation, cloud formation, and instance management easier to handle than their older Puppet-based CloudFormation service. Being able to fail over gracefully from master to slave database, for example, would be a Very Good Thing, and they appear to have hooks that can allow that to happen (via running Chef code when a master fails). Similarly, if load gets too high on web servers they can automagically spawn new ones.

Great idea. Sort of. Except it seems to have two critical issues: 1) it doesn't appear to have any way to handle our staging->production cycle, where the transactions coming into production are replicated to staging during our testing cycle, then eventually staging is rolled to production via mechanisms I won't go into right now, and 2) it doesn't appear to actually work -- it claims that the default security groups that are needed for its version of Chef to work don't exist, and they never appeared later on either. Which isn't because I lack permission to create security groups, because I created one for an earlier prototype of my deployment. This appears to be a sporadic bug that multiple people have reported, where the default security groups aren't created for one reason or another.

Eh. Half baked and not ready for production. Oh well. Amazon *did* say it was Beta, after all. They're right.

Wednesday, July 24, 2013

ORM Not Considered Harmful

Recently I had the task of moving a program from one database engine to another. The program primarily used Hibernate. The actual job of moving it from one database to another was... locating the JDBC module for the other database, locating the dialect name for the other database, telling Hibernate about both, and, mostly, it just worked. Except. Except there were six places where the program issued genuine, real actual SQL. Those six queries had to be re-written because they used a feature of the original database engine that didn't exist on the other database engine. Still, six queries are a lot easier to re-write than hundreds of queries. I still consider Spring/Hibernate to be evil. But this demonstrates that an ORM with a query dialect engine does have significant advantages in making your program portable into different environments. Being able to prototype your program on your personal desktop with MySQL then deploy it against a real database like Oracle without changing anything about the program other than a couple of configuration settings... that is really cool. And useful, since it causes productivity to go sky high. Now to see if there's any decent ORM for Java... -ELG

Thursday, May 30, 2013

When tiny changes come back to bite

A Java network application that connected to a storage array's control port via an SSL socket mysteriously quit working when moved from Java 6 to Java 7. Not *all* the time, but just in certain configurations. The application was written based on Oracle's own example code which hasn't changed since the mid 'oughts, so everybody was baffled. My assignment: Figure out what was going on.

The first thing I did was create a test framework with Java 6, Java 7, and a Python re-implementation and validate that yes, the problem was something that Java 7 was doing. Java 6 worked. Python via the cPython engine worked. Java 7 didn't work, it logged in (the first transaction) then the next transaction (actually sending a command) failed. Python via the jYthon engine on top of Java 7 didn't work. I tried on both Oracle's official Java 7 JDK and Red Hat's OpenJDK that came with Fedora 18, neither worked. So it clearly had something to do with differences in the Java 7 SSL stack.

I then went to Oracle's site to look at the change notices on difference between Java 6 and Java 7. There was nothing that looked interesting. Setting and sse.enableSNIExtension=false were two things that looked like they might account for some difference. Maybe the old storage array was using an old protocol. So I ran the program under Java 6 with, found out what protocol it was using, then hardwired my SSL ciphers list to use that protocol as default. I tested again under Java 7, and it still didn't work.

Then I ran the program on the two different environments with on my main method on both Java 6 and Java 7. The Java 6 was the OpenJDK from the Centos 6.4 distribution. The Java 7 was the OPenJDK on Fedora 18. Two screens, one on each computer, later, and the output on both of them was identical -- up until the transmission of the second SSL packet. The right screen (the Fedora 18 screen) split it into *two* SSL packets.

But why? I downloaded the source code to OpenJDK 6 and OpenJDK 7 and extracted them, and set out to figure out what was going on here. The end result: A tiny bit of code in called needToSplitPayload() was called from the SSL engine, and if that code says yes, it splits the packet. So... is the cipher a CBC mode? Yes. is it the first app output record? No. Is Record.enableCBCProtection set? Wait, where is that set?! I head over to, and find that it's set from the property "jsse.enableCBCProtection" at JVM startup.

The end result: I call System.setProperty("jsse.enableCBCProtection", "false"); in my initializer (or set it on the command line) and everything works.

So what's the point of CBC protection? Well, the point is to deal with traffic analysis attacks against web servers. With snooping on the traffic you can figure out what transactions are happening and figure out important details of the session. The problem is that in my application, the receiving storage array apparently wants packets to stay, well, packets, not multiple packets. It's not a problem for the code I wrote, I look for begin and end markers and don't care how many packets it takes to collect a full set of data to shove into my XML parser, but I have no control over the legacy storage array end of things. It apparently does *not* look for begin and end markers, it just grabs a packet and expects it to be the whole tamale.

So all in all, CBC protection is a Good Thing -- except in my application, where a legacy storage array cannot handle it. Or in a wide variety of other applications where it slows down the application to the point of uselessness by breaking up nice big packets into teeny little packets that clog the network. But at least Oracle gave us a way to disable it for these applications. The only thing I'll whine about is this: Oracle apparently implemented this functionality *without* adding it to their list of changes. So I had to actually reverse-engineer the Java runtime to figure out that one obscure system property is apparently set to "false" by default in OpenJDK 6 and set to "true" by default in OpenJDK 7. Which is not The Way It Spozed To Be, but at least thanks to the power of Open Source it was not a fatal problem -- just an issue of reading through source code of the runtime until locating the part that broke up SSL packets, then tracing backwards through the decision tree up to the root. Chalk up one more reason to use Open Source -- if something mysterious is happening, you can pretty much always figure it out. It might take a while, but that's why we make the big bucks, right?


Wednesday, May 1, 2013


I'm going to speak heresy here: Object-Relational Mappers such as Hibernate are evil. I say that as someone who wrote an object-relational mapper back in 2000 -- the BRU Server server is written as a master class that maps objects to a record in a MySQL database, which is then inherited by child classes that implement the specific record types in the MySQL database. The master class takes care of adding the table to the database if it doesn't exist, as well as populating the Python objects on queries. The child classes take care of business logic and generating queries that don't map well to the ORM, but pass the query results to generators to produce proper object sets out of SQL data sets.

So why did this approach work so well for BRU Server, allowing us to concentrate on the business logic rather than the database logic and allowing its current owners to maintain the software for ten years now, while it fails so harshly for atrocities like Hibernate? One word: complexity. Hibernate attempts to handle all possible cases, and thus ends up producing terrible SQL queries while making things that should be easy difficult, but that's because it's a general purpose mapper. The BRU Server team -- all four of us -- understood that if we were going to create a complete Unix network backup solution within the six months allotted to us, complexity was the enemy. We understood the compromises needed between the object model and the relational model, and the fact that Python was capable of expressing sets of objects as easily as it was capable of expressing individual objects meant that the "object=record" paradigm was fairly easy to handle. We wrote as much ORM as we needed -- and no more. In some cases we had to go to raw relational database programming, but because our ORM was so simple we had no problems with that. There were no exceptions being thrown because of an ORM caching data that no longer existed in the database, and the actual objects for things like users and servers could do their thing without worrying about how to actually read and write the database.

In the meantime, I have not run into any Spring/Hibernate project that actually managed to produce usable code performing acceptably well in any reasonable time frame with a team of a reasonable size. I was at one company that decided to use the PHP-based code that three of us had written in four weeks' time as the prototype for the "real" software, which of course was going to be Java and Spring and Hibernate and Restful and all the right buzzwords, because real software isn't written in PHP of course (though our code solved the problem and didn't require advanced degrees to understand). Six months later and a cast of almost a dozen and no closer to release than at the beginning of the project, the entire project was canned and the project team fired (not me, I was on another project, but I had a friend on that project and she was not a happy camper). I don't know how much money was wasted on that project, but undoubtedly it hurried the demise of that company.

But maybe I'm just not well informed. It wouldn't be the first time, after all. So can anybody point me to a Spring/Hibernate project that is, say, around 80,000 lines of code written in under five months' time by a team of four people, that not only does database access but also does some hard-core hardware-level work slinging massive amounts of data around in a three-box client-server-agent architecture with multiple user interfaces (CLI and GUI/Web minimum)? That can handle writing hundreds of thousands of records per hour then doing complex queries on those records with MySQL without falling over? We did this with BRU Server, thanks to Python and choosing just enough ORM for what we needed (not to mention re-using around 120,000 lines of "C" code for the actual backup engine components), and no more (and no less). The ORM took me a whole five (5) days to write. Five. Days. That's it. Granted, half of that is because of Python and the introspection it allows as part of the very definition of the language. But. Five days. That's how much you save by using Spring/Hibernate over using a language such as Ruby or Python that has proper introspection and doing your own object-relational mapping. Five days. And I submit that the costs of Spring/Hibernate are far, far worse, especially for the 20% of projects that don't map well onto the Spring/Hibernate model, such as virtually everything that I do (since I'm all about system level operations).


Saturday, April 27, 2013

Configuring shared access for KVM/libvirt VM's

Libvirt has some nice migration features in the latest RHEL/Centos 6.4 to let you move virtual machines from one server to the other, assuming that you . But if you try it with VM's set to auto-start on server startup, you'll swiftly run into problems the next time you reboot your compute servers -- the same VM will try to start up on multiple compute servers.

The reality is that unlike ESXi, which by default locks the VMDK file so that only a single virtual machine can use it at a time, thus meaning that the same VM set to start up on multiple servers will only start on one (that wins the race), libvirtd by default does *not* include any sort of locking. You have to configure a lock manager to do so. In my case, I configured 'sanlock', which has integration with libvirtd. So on each KVM host configured to access shared VM datastore /shared/datastore :

  • yum install sanlock
  • yum install libvirt-lock-sanlock
Now set up sanlock to start at system boot, and start it up:
  • chkconfig wdmd on
  • chkconfig sanlock on
  • service wdmd start
  • service sanlock start
On the shared datastore, create a locking directory and give it username/ID sanlock:sanlock and permissions for anybody who is in group sanlock to write to it:
  • cd /shared/datastore
  • mkdir sanlock
  • chown sanlock:sanlock sanlock
  • chmod 775 sanlock
Finally, you have to update the libvirtd configuration to use the new locking directory. Edit /etc/libvirt/qemu_sanlock.conf with the following:
  • auto_disk_leases = 1
  • disk_lease_dir = /shared/datastore/sanlock
  • host_id = 1
  • user = "sanlock"
  • group = "sanlock"
Everything else in the file should be commented out or a blank line. Host ID must be different for each compute host, I started counting at 1 and counted up for each compute host. And edit /etc/libvirt/qemu.conf to set the lock manager:
  • lock_manager = "sanlock"
(the line is probably already there, just commented out. Un-comment it). At this point, stop all your VM's on this host (or migrate them to another host), and either reboot (to make sure all comes up properly) or just restart libvirtd with
  • service libvirtd restart
Once you've done this on all servers, try starting up a virtual machine you don't care about on two different servers at the same time. The second attempt should fail with a locking error., At the end of the process it's always wise to shut down all your virtual machines and re-start your entire compute infrastructure that's using the sanlock locking to make sure everything comes up correctly. So-called "bounce tests" are painful, but the only way to be *sure* things won't go AWOL at system boot. If you have more than three compute servers I instead *strongly* suggest that you go to an OpenStack cloud instead, because things become unmanageable swiftly using this mechanism. At present the easiest way to deploy OpenStack appears to be Ubuntu, which has pre-compiled binaries on both their LTS and current distribution releases for OpenStack Grizzly, the latest production release of OpenStack as of this writing. OpenStack takes care of VM startup and shutdown cluster-wide and simply won't start a VM on two different servers at the same time. But that's something for another post. -ELG

Friday, April 26, 2013

On spinning rust and SSD's.

I got my Crucial M4 512GB SSD back for my laptop. It failed about three weeks ago, when I turned on my laptop it simply wasn't there. Complete binary failure mode -- it worked, then it didn't work. So I took it out of the laptop, verified in an external USB enclosure that it didn't "spin up" there either, installed a 750Gb WD Black 7200 rpm rust-spinner that was in my junk box for some project or another, and re-installed Windows and restored my backups. Annoying, but not fatal by any means. I've had to get used to the slow speed of spinning rust again versus the blazingly fast SSD, but at least I'm up and running. So this weekend I get to make another full backup, then swap out the rust for the SSD again.

At work I've had to replace several of the WD 2TB Enterprise drives in the new Linux-based infrastructure when smartd started whining about uncorrectable read errors. When StorStac got notification of that sort of thing it re-wrote the sector from the RAID checksums and that usually resolved it. The Linux 3.8 kernel's md RAID6 layer apparently doesn't do that, requiring me to kick the drive out of the md, slide in a replacement, fire off a rebuild, and then haul the drive over to my desktop and slide it in there and run a blank-out (write zeroes to the entire drive). Sometimes that resolves the issue, sometimes the drive really *is* toast, but at least it was an analog error (just one or two bad places on the drive), not a complete binary error (the entire drive just going blammo).

SSD's are the future. The new COW filesystems such as ZFS and BTRFS really don't do too well on spinning rust, because by their very nature they fragment badly over time. That doesn't matter on SSD's, it does matter with rust-spinners, for obvious reasons. With ZFS you can still get decent performance on rust if you use a second-level SSD cache, that's how I do my backup system here at home (which is an external USB3 hard drive and an internal SSD in my server), BTRFS has no such mechanism at present but to a certain extent compensates by having a (manual) de-fragmentation process that can be run from time to time during "off" hours. Still, both filesystems clearly prefer SSD to rotational storage. It's just the nature of the beast. And those filesystems have sufficient advantages in terms of functionality and reliability (except in virtualized environments as virtual machine filesystems -- but more on that later) that if your application can afford SSD's, that alone may be the tipping point that makes you go to SSD-based storage rather than rotational storage.

Still, it's clear to me that, at this time, SSD is still an immature technology subject to catastrophic failure with no warning. Rotational storage usually gives you warning, you start getting SMART notifications about sectors that cannot be read, about sectors being relocated, and so forth. So when designing an architecture for reliability, it is unwise to have an SSD be a single point of failure, as is often done for ESXi servers that lack hardware RAID cards supported by ESXi. It might *seem* that SSD is more reliable than rotational storage. And on paper, that may even be true. But the reality is that because the nature of the failures is different, in *reality* rotational storage gives you a much better chance of detecting and recovering from a failing drive than SSD's do. That may, or may not be important for your application -- in RAID it clearly isn't a big deal, since you'll be replacing the drive and rebuilding a new drive anyhow -- but for things like an ESXi boot drive it's something you should consider.


Thursday, April 25, 2013


I must admit that I have a low opinion of journalists, tech journalists in particular. I've been interviewed several times over the years and only once has the result been accurate. In all the other cases, what I said was spun to fit the journalist's preconceived notion of what the story should be, and to bleep with the truth.

What I cannot understand is why, if a tech journalist cannot interview people in the know because they had to sign a NDA in order to obtain certain assets for a specified price, said journalist would go ahead and publish a story based entirely upon speculation and a single source that may or may not know the details of whatever legal agreements were signed. It's not professional, it's not ethical, and it's not right. But it's the way tech "journalism" is done here in the Silicon Valley. I guess making a living by being unprofessional and unethical doesn't bother some people. So it goes.


Monday, April 1, 2013


> Realm shutdown

Click on the picture for high resolution. Today we decommissioned the only 10gbit Intransa iSCSI storage realm in existence. There were only two ever built, and only one was ever sold. This one was built by Douglas Fong for use by Intransa IT and has 24 4-disk IP-connected disk shelves in six cabinets, for a total of 96 250gb IDE hard drives talking to two SMC/Dell switches via 48 1gbit connections. The SMC/Dell switches are then connected to the two clustered controller units via 10Gbit Ethernet, which then exports iSCSI to the two SMC/Dell switches above it via 10Gbit Ethernet. This whole concept was designed for scale-out storage, when you needed more storage you just added more of the blue boxes (or, later, the grey boxes to the left) and incidentally this also made the result faster.

Two things became clear as I was prepping the changeover from this 2/3rds rack of equipment to 4u worth of generic Linux storage. The first was that the Intransa box was infinitely easier to manage than my 24 disks worth of Linux-based storage, despite having four times as many spindles. This is because the Intransa software did policy-based storage allocation. You told it you wanted a new volume with 5-disk RAID5 or 4-disk RAID10 or whatever, and it went out and either found existing RAID groups and put your new volume there, or found enough disks to create a new RAID group and put your volume there. You didn't have to worry about how to lay out RAID groups or volumes on top of RAID groups and exporting to iSCSI, it all Just Happened.

The second thing that became apparent was that this beast was fast -- seriously fast. The orange cable at the top right is the 10Gbit Ethernet cable going to my new infrastructure that I used to migrate the volumes off of this pile of blue boxes. Surprisingly, the limit was my new Linux storage boxes, not the Intransa storage -- I was pulling data off at 200 megabytes/second, the max I could pull in via my two 1Gbit Ethernet connectors. It seems that if you have enough spindles, even 250gb IDE drives can generate a significant number of iops. It would have been interesting to see exactly how fast it was, but unfortunately I'm still working on getting the Intel 10Gbit cards working in the Linux storage servers (I am now going to use copper SFP+ cables, since it is clear that the Intel cards aren't going to work with the optical SFP+ modules that I have), so was restricted to two 1Gbit connections.

Sadly, the pile of dead drives on top of the pile of blue cabinets are one indication of why it's being retired. The 250Gb Maxtor drives in this thing were manufactured in 2004 and were starting to fail. My supply of spare parts was limited. In addition, this beast is horrifically complex -- even the person who built it had trouble getting it up and running the last time it was moved, and our new little startup certainly wouldn't be able to get it up and going by ourselves, so we settled for getting the intellectual property off of it onto our own generic Linux server equipment. Finally, it and the backup replica realm beside it took up a huge amount of space and power, the two Linux servers do in 8U what required an entire rack full of equipment to do with this seven-to-nine-year-old Intransa equipment. So it was time, albeit with a bit of sadness too. Intransa had some great ideas and solid gear. They could not, alas, make money with it.

I played taps on my Irish whistle as the realm shut down.

-- ELG

Saturday, March 30, 2013

Why no cloud?

So I promised I'd explain why I was setting up normal Linux-based storage and normal KVM/ESXi compute servers for our new small business's network rather than an OpenStack private cloud, so I'll do so.
  1. One risky technology per deployment. It's about risk management -- the ability to manage risks in a reasonable manner. If you have multiple risky technologies, the interactions between risks rise exponentially and cause risks to be unmanageable. Normal Linux-based storage is a mature technology with over a decade of active deployment in production environments with the exception of the LIO iSCSI target. I concluded that the LIO iSCSI target was a necessity in our environment because the TGTD target provided with stable Linux distributions has multiple serious deficiencies (see earlier postings) that render it nothing more than a toy, and our legacy infrastructure was based around iSCSI talking to that pile of ancient Intransa blue boxes. So I've reached my limit on new technologies. Meanwhile OpenStack is multiple immature technologies under active development. Add that to LIO and the existing VMware ESX/ESXi servers' need for block storage and I'd require multiple storage networks to mitigate the risks. Which brings up...
  2. Power and space budget. My power and space budget allows for one storage network with a total of 8U of space and 1000 watts of power consumption. I don't have power and space for two storage networks, one for OpenStack and one for ESX/ESXi.
  3. Performance. The majority of what my network provides to end users is file storage via NFS and CIFS. In an OpenStack deployment file servers run as virtual machines talking to back end storage via iSCSI. This scales very well in large installations, but I don't have the power and space budget for a large installation so that's irrelevant. Running the NAS storage stack directly on the storage boxes results in much better responsiveness and real-world performance than running the NAS storage stack on a virtual machine talking to the storage boxes via iSCSI, even if the theoretical performance should be the same. The biggest issue is that this limits the size and performance of any particular data store to one storage box, but the reality is that this isn't a particularly big limitation for our environment, since we have far more iops and storage on a single storage box than any single data store in our environment will use for quite some time. (My rule of thumb is that no ext4 data store will ever be over 1Tb and no xfs data store will ever be over 2Tb, due to various limitations of those filesystems in a NAS environment... any other filesystem runs into issue #1, one risky technology per deployment, and I already hit that with LIO)
  4. Deep understanding of the underlying technologies. The Linux storage stack has been mature for many years now, with the exception of LIO. I know its source code at the kernel level fairly well. If there is an issue, I know how to resolve it, even to the point of poking bytes into headers on disk drive blocks to make things work. Recovery from failure thus is low risk (see #1). OpenStack is a new and immature technology. If there is an issue, we could be down for days while I chase around in the source code trying to figure out what went wrong and how to fix it.
Note that this is *not* a slam on OpenStack as a technology, or saying that you should not use one of the OpenStack cloud providers such as RackSpace or HP. They have massive redundancies in their OpenStack deployment and people on staff who have the expertise to manage it, and do not have to deal with legacy infrastructure requirements such as our ESXi servers with their associated Windows payloads. Plus they are based around a totally different workload. Our in-house workload is primarily a NAS workload for workstations, and our compute workload is primarily a small number of virtualized test servers or build servers for our software in a variety of environments as well as a handful of infrastructure servers to e.g. handle DNS. What OpenStack mostly gives you is the ability to manage massive numbers of storage servers and massive numbers of compute servers and massive numbers of virtual machines on those compute servers, none of which is our local workload.

The workload that RackSpace etc. are supporting is mostly about Big Data and Big Compute in the cloud or about web server farms in the cloud. All of that has far larger space and power requirements than our little two-rack data center can ever provide, and the reality is that we simply use their infrastructure when we have those requirements rather than attempt to replicate their infrastructure in-house. It simply isn't reasonable for a small business to try to replicate RackSpace or Amazon AWS in-house. We don't have the space and power for the massive amount of infrastructure they use to achieve redundancy and reliability, we don't have the requirement for our local workload, and we don't have the in-house expertise. In the end, it's a case of using the appropriate technology for the appropriate task -- and for what I'm attempting to achieve for the local infrastructure of a small business, using NAS-based Linux storage was more appropriate than attempting to shoe-horn our workload into an infrastructure that would give us no more capability for our needs but would cost us in terms of power, space, performance, and maintainability.


Sunday, March 24, 2013

Making auto-proxy configuration work

Okay, so I finally got auto-proxy browser configuration to work with ClearOS. It required a couple of different things.

First, you'll need to install the web server plugin in the ClearOS marketplace. Yes, I know you don't want a web server running on your router. But there's not much choice, wpad.dat is served via http on port 80. Just don't add a firewall rule allowing connecting to it from outside your network (note that in ClearOS you have to explicitly allow external access to services) and you'll be fine.

Next, in your DNS configuration on your master DNS server (whether that's on the ClearOS server or elsewhere), set up pointing at your ClearOS server. If the ClearOS server is providing DNS that's pretty easy, just use the web interface.

Okay, now we're at the end of what the web interface can do for you. We'll need to do some things via ssh now. Make sure ssh access is turned on in your firewall rules (in the GUI), and ssh in as root/yourpassword.

In /var/www/html create a file 'wpad.dat' with these contents:

function FindProxyForURL(url,host)
return “PROXY″;
Replace the with the actual address of your ClearOS server, and if you're running the content filter, replace the 3128 with 8080.

Now access "http://wpad.your.domain" with your web browser. You should see that file appear. But that's not going to get your auto-proxy working on Chrome, at least, because Chrome wants a MIME type of application/x-ns-proxy-autoconfig rather than  text/plain . So let's go set up the MIME type. In the directory /etc/httpd/conf.d create a file 'wpad.conf' with these contents:
<Files "wpad.dat">
   ForceType application/x-ns-proxy-autoconfig

And one 'apachectl restart' later, there you are. Your browsers on the network will now auto-configure their proxy settings to go through the ClearOS proxy.


A tale of two routers

One of the things I want to set up at the office to preserve precious Internet bandwidth is a general proxy/firewall box. This would sit between our current infrastructure and the Internet and do firewall-y type things plus provide VPN connectivity. We have an ancient Cisco that is providing VPN connectivity but it requires a proprietary client that is hard to come by unless you have a Cisco support contract, which isn't available for this antique. Given that I have plenty of fairly recent vintage surplus server equipment hanging around plus a few quad-port Ethernet cards it seemed to me to be a no-brainer to slap the cards into a spare server and toss Linux on it and run with that as the router.

The problem, of course, is time and complexity. I know how to use iptables. I know how to set up things like squid and openvpn and dhcp and so forth. But I really don't want to. I really, REALLY don't want to. I have better things to do with my life. So I went out to find general-purpose router distributions that would do all that hard work for me. Given the equipment available to me (mostly Nehalems with 2.4ghz quad-core processors and 6gb of RAM, modest by the standards of modern compute engines, but way more system than our border router needs), I didn't need to settle for one of the tiny little distributions that are intended to fit on flash memory chips on tiny embedded systems. I could put a full-fledged Linux on there. After some research, I settled on two distributions that are based on a stable core distribution: ClearOS, which is based on Centos 6, and Zentyal, which is based on Ubuntu 12.04 LTS. I know how to manage both Centos and Ubuntu since I've used them in production for years, so if all else failed I could take the starting point that the GUI configurator gave me and fix it to work.

Before doing this at the office I had to of course do a proof of concept. And the perfect proof of concept was my home network, which has five PC's and five devices on it as well as the wireless access point and the managed switch both of which have their own addresses. So I grabbed a decommissioned box that had some pretty hefty stats (Core I7-950 with 12gb of RAM) but no drives (since I'd moved the drives to the big file server box), found a pair of 2.5" drives to fit in its front-loading slot to swap out ClearOS and Zentyal, and set to installing.

I'd been playing with Zentyal for a while at work, seeing if I could make sense of whether it would replace our ancient Windows 2003 domain controllers, and so I started with ClearOS. It slid onto my server just fine, locating my expensive 4 port Intel NIC card and the on-board Intel NIC. I configured it to provide separate networks for my wireless and wired networks (so I could monitor what was happening on the wireless network specifically), and route all traffic out the cable modem connection. All was well. I played with the proxy server settings. That worked pretty well, with the bizarre exception that I can't figure out how to make the automatic proxy settings work, I enabled the Apache server and created the correct wpad file and I see Chrome using it in the Apache server logs but Chrome isn't doing applying the settings for some reason. Okay, something to check out on Zentyal when I do it. I then configured OpenVPN and installed OpenVPN clients on my Android and iPad (I already know OpenVPN works on Windows, Linux, and Mac, duh). My first couple of attempts to connect from my Android didn't work and I was baffled. Finally I clicked on the firewall module and noticed no rule had been created to allow OpenVPN connections when I configured OpenVPN. Point, click, allow, and all works well. iPad worked fine too once I got the certificates on there, which required using iTunes (bleh!) but at that point the iPad OpenVPN software was up and going. And finally I got the mail relay up and going, which forwards all outbound smtp traffic to my mail server in the proto-cloud which then forwards it onwards. Again there were some interesting limitations -- I see no place to set the name and password to authenticate with the remote smartmail server, for example -- but that's easy enough to fix by hand.

Okay, so there's a couple of small glitches but things pretty much were going smoothly with ClearOS. The main issues I ran into with ClearOS were between my ears, i.e., I didn't RTFM and forgot to set up things that needed setting up or set them up incorrectly. So next I shut down the ClearOS system, slid out its drive, slid the new drive in, and installed Zentyal. That, on the other hand... that was pretty much a disaster. It crashed halfway through the setup wizard. It crashed after it updated. It crashed trying to set up the mail relay. The OpenVPN functionality worked but the user interface left a lot to be desired. I noticed that it'd set my domain to a domain and set it back to my own domain, and that pretty much was all she wrote -- it wiped out my VPN, it wiped out the user LDAP directory, and put the system into a completely unusable state.

Which is a shame, because I really wanted to like Zentyal. It is based on a newer and arguably better Linux distribution than ClearOS, and it has some really nice features. But I just can't deal with software that crashes when we're talking about a mission-critical server. It just isn't going to work. There's some places elsewhere in my infrastructure that Zentyal can live, but the border router? Nope. Not happening. So it goes.

Friday, March 22, 2013

Storage migrations

I spent much of today setting up a pair of Linux servers to migrate data off of a 2005-vintage Intransa storage array. The Intransa storage array still works fine, but clearly the end is in sight -- I have a limited supply of spares and three drives died within the past two months alone. So I set up a 10Gb fiber connection between the iSCSI switch for the Intransa array and the iSCSI switch for my new(old) commodity Linux servers (a pair of previous-generation Supermicro 12-disk servers with a 12-disk JBOD apiece), exported iSCSI volumes via lio, told Windows to mirror its various volumes to the new volumes, and let'er rip. Note that I did traditional RAID here because I don't have the cycles or the CPU's to implement an internal cloud for a small company, and that these storage servers are also providing regular file shares via NFS and Samba (CIFS). I deliberately kept things as simple as possible in order to make it more easily manageable. In the process some clear issues with the current Linux storage stack became apparently. Thumbnail summary: The Linux storage stack is to professional storage stacks such as the old Intransa stack (or modern-day HDC or HP stacks) as Soviet toilet paper was to Charmin. Soviet toilet paper could serve as sandpaper -- it was rough, annoying, and it did the job but you certainly didn't like it. Same deal with the Linux storage stack, with the additional caveat that there are some things that antique Intransa gear would do that are pretty much impossible with the current Linux storage stack.

Rather than go off onto a long rant, here's some things to think about:

  1. The Intransa unit integrated the process of creating a volume, assigning it to a volume group (RAID array) that fit your desired policy creating one if necessary (this Intransa installation has six storage cabinets each with 16 drives, so individually adding 96 drives to RAID arrays then managing which ones your volumes got placed upon would have been nightmarish) and then exporting the resulting volume as an iSCSI volume. All of this is a multi-step manual process on Linux.
  2. You can create replications to replicate a volume to another Intransa realm (either local or geographically remote) at any point in time after a volume has been created, without taking the volume offline. On Linux, you have to take the volume offline, unexport it from iSCSI and/or NFS, layer drbd on top of it, then tell everybody (iscsi, NFS, fstab) to access the volume at its drbd device name now rather than at the old LVM volume name. Hint: Taking volumes offline to do something this simple is *not* acceptable in a production environment!
  3. Scaling out storage by adding storage cabinets is non-trivial. I had to bring up my storage cabinets one at a time so I could RAID6 or RAID10 the cabinets (depending upon whether I was doing a scale or performance share) without spanning cabinets with my RAID groups, because spanning cabinets with SAS is a Bad Idea for a number of reasons. Policy-based storage management -- it's a great idea.
  4. Better hope that your Linux server doesn't kernel panic, because there's no battery-backed RAM cache to keep unwritten data logged. It still mystifies me that nobody has implemented this idea for the Linux software RAID layer. Well, except for Agami back in 2004, and Intransa back in 2004, neither of which are around anymore and where the hardware that implemented this idea is no longer available even if they were. And Agami did it at the disk controller level, actually, while Intransa did it by the simple expedient of entirely bypassing the Linux block layer. These first-generation disk cabinets have each 4-disk shelf IP-connected to a gigabit switch pair that then uplinks via a 10Gb link to the controllers, iSCSI requests flow in via 10Gb from the iSCSI switch pair to the controllers, are processed internally to turn them into volume and RAID requests which then get turned into disk shelf read/write requests that flow out the network on the other end of the stack, and nowhere in any of this does the Linux block layer come into play. That's why it was so easy to add the battery-backed cache -- no Linux block layer to get in the way.
The last of which brings to the forefront the role of the Linux block layer. The Linux block layer is this primitive thing that was created back in the IDE disk days and hasn't advanced much since. There have been attempts via write barriers and other mechanisms to make it work in a more reliable way that doesn't lose filesystems so often, and those efforts have worked to a certain extent, but the reality is that you have lvm and dm and drbd and the various raid layers and iscsi and then filesystems all layered on top like a cake and making sure that data that comes in at the top of the cake makes it to the disk drives at the bottom without a confectionery disaster inbetween... well, it's not simple. Just ask the BTRFS team. In private. Out of earshot of young children. Because some things are just too horrible for young ears to hear.

And I said I wasn't going to go off on a long rant. Oh well. So anyhow, next thing I'll do is talk about why I went with traditional RAID cabinets rather than creating a storage cloud, the latter of which would have taken care of some of my storage management issues by making it, well, cloudy. But that is a discussion for another day.


Thursday, March 14, 2013

The end of Google Reader

So Google Reader is going away, and people has a sad, me included, because it is by far the best RSS reader out there. On the other hand, as someone who has worked in the industry, I can pretty much tell you *why* it is going away (this is speculation on my part, but speculation that matches the actual known facts): It is going away because within months, Google is planning on making some internal infrastructure changes which will completely break Google Reader beyond any hope of repair.

The core problem is that Google Reader is old code. It originated back in Google's early days (well, 2005 is sorta early), when they didn't have any well-defined internal API's. So Google Reader depends on deep dark secrets of Google's actual infrastructure implementation, rather than using a well-defined internal API that will keep working when the infrastructure changes. The result has been that Google Reader has continually experienced outages for the past five years of its life as the infrastructure changes. It's costing Google money to keep hacking at it to keep it running, and they're not making money on it. And fixing that would require a re-write to a stable API in common with other products that wouldn't break anytime that the infrastructure changes -- something they're not going to do on something that's not making them any money.

I guess the takeaway from this is simple: Turning everything you do into an API is hard, but the alternative is that bit rot will eventually kill your code. This is, for example, why I created a complete virtualization infrastructure for the Intransa virtualized cluster product that hid all details of virtualization behind an API. The reality is that we only needed to touch ESXi in a handful of places -- but by hiding the details of that behind an API, I guaranteed that when we moved to a different virtualization system in the future (such as Xen or KVM) then all that would need changing would be the virtualization API, not any of the internal workings of Storstac. This is also one of the things that Linus has been fairly successful at doing over the past ten years or so with the Linux kernel. He may have broken the block subsystem by applying the big-kernel-lock removal patches, but at least he didn't break the API. Some additional changes have been added to the block API so that filesystems like BTRFS can work better, but the core API still remains the same as it's been for quite a few years now.

But clearly this wasn't done for Google products back in the early part of the '00s, and now it's just too difficult to maintain the less-used code. Google has to upgrade their infrastructure -- and how their infrastructure works -- in order to continue to scale. What that means is that products that weren't written to a consistent internal shared API are going to continue getting sloughed off, unless there is enough interest (and possible money to be made) to justify a re-write against a stable API. That's just how reality works. Oh well.


Monday, March 4, 2013

Patchwork and maintainability

Way back in the mists of time, I was there at the start of Agami. Agami made a cool NAS system with a filesystem that did things that nobody except NetApp was doing at the time and that, for that matter, no current shipping Linux filesystem will do -- and it was Linux. Specifically, it was Red Hat Enterprise Linux 3 hacked to run the 2.6.7 kernel (because 2.4 simply wouldn't do what Agami wanted to do). I remember that vividly because I was the person who hacked RHEL3 to make it work with the 2.6 kernel -- it required some specific changes to the init scripts run at system boot to look at things where 2.6 put them (in particular, 2.6 added sysfs and moved a bunch of stuff out of procfs) plus some changes to the Linux distribution itself (e.g. modutils).

The reason I mention that is because I was talking to the Director of Software Development of a network security appliance company and mentioned that I'd spent some time recently modifying Intransa's kernel block drivers to work around bugs in the Linux 2.6.32 kernel, specifically, to work around some races under heavy load that had been introduced into block device teardown that would either OOPS you or cause hung I/O. He asked me, puzzled, "why didn't you just fix the kernel bugs?"

Well, it was a fair question. The fact that the races I ran into are a result of the removal of the Big Kernel Lock and would require significant re-factoring of the kernel locks to make them go away, and furthermore that they're hard to reproduce and debug, is one issue. I looked at later kernel versions to see how that played out, and the changes to the block layer were deep and intrusive. It was much simpler to simply modify our software to cope with the misbehavior. But that's not what I answered him with. I answered him "because if we start hacking on the Linux kernel that opens us up to a world of hurt from a maintainability point of view when we want to move to a new kernel version, from a licensing point of view with the GPL... it just isn't the right thing to do."

I was explicitly thinking about Agami's 2.6.7 kernel there. Agami patched 2.6.7 to a fare-thee-well. I'm not sure how many patches in toto they were applying to their kernel, but it was at least in the hundreds. By contrast, the last two appliance companies I've worked for -- Resilience and Intransa -- patched the kernel only if it was completely and utterly unavoidable. When I ported the Resilience kernel patches from RHEL3 to RHEL5, there was less than a dozen patches, and they were all to fix driver issues with specific network drivers for obsolete hardware that, alas, we still needed to support. We'd submitted those patches upstream and it turned out that I ended up only needing to apply five patches total, the rest had already been applied upstream. The situation with Intransa's kernel is even simpler. There is a .config file, and one(1) patch that basically exports a kernel API that we need for one of our driver modules. I also apply an upstream LSI mpt2sas driver and an upstream networking driver needed for the specific hardware in our current and next-generation servers, but those are compiled separately as part of our software compile, not as part of the kernel compile (i.e., they're disabled in the kernel compile and isolated in a "3rdparty" directory where they're easy to remove once we transition to a kernel that has the required drivers in it). Everything else is self-contained in our own modules.

The result is that while it was a pain to change kernel versions from the kernel shipped with Intransa StorStac 7.12 to the kernel shipped with Intransa StorStac 7.20, it was a manageable pain -- it took me roughly two weeks of work to figure out what was happening with the block layer locking and around two weeks to create and debug work-arounds in our kernel drivers in the three places that were running into race issues, and the other issues were just a matter of some of the include files moving around. Meanwhile, Agami had painted themself into such a corner with their heavily hacked kernel (amongst other issues) that it proved exceedingly difficult for them to move off of 2.6.7 even though 2.6.7 had severe stability issues under heavy load -- I had to scale back many of the hardware tests that I wrote for the manufacturing line because they were making the kernel fall over under heavy load, and I was trying to test that the machine was assembled correctly and that the hardware was working correctly, I wasn't trying to test for bugs in the kernel.

These issues come into play mostly when there are quantum shifts in underlying hardware architectures requiring either a new kernel version or significant back-porting of new drivers and kernel features. We had to switch kernel versions between StorStac 7.2 and StorStac 7.11 because of the introduction of Nehalem-based server hardware. We had to switch kernel versions between StorStac 7.12 and StorStac 7.20 because of the introduction of Sandy Bridge based server hardware. In both cases backporting the architecture and driver support back to an earlier kernel would have been *much* more difficult than simply porting the StorStac kernel drivers to the new kernel. And it was all because of the decision to keep our kernel code as independent of the kernel as possible -- if there had been extensive Intransa modifications to the kernel, we could have never done it within the short amount of time that it was done (the 7.12 to 7.20 development cycle was four months -- *including* re-basing to a new Linux distribution).

So I suppose the takeaway from all this is:

  1. If you are applying a lot of patches to the kernel, stop and think of different ways of handling things. The Linux kernel will have bugs. Always. Think of workarounds that will work with those bugs, and continue working once those bugs are fixed. Bonus brownie points if the workarounds also improve performance by, e.g., adding bio pending and free lists that mostly mitigate against needing to kmalloc bio's once the system has been under load for a while.
  2. Keep your own stuff in your own .ko modules, don't go patching mainline kernel code to add your own functionality unless there's just no alternative (such as one NIC driver that did not implement the ability to set the MAC address in the NIC chip, which we needed for cluster failover of the Resilience appliance).
  3. If you need to really hack on a kernel subsystem, either a) create a new module by a different name, or b) consider different solutions.
  4. And above all: Consider maintainability from the beginning. Because if you don't, you, too, can go out of business within two years of delivering your first (and last) product...

Sunday, February 24, 2013

Part III: Enter KVM

See: The next test is envisioned to be NTFS. This will require writing a small Java program to do what I did from the shell on Unix. But before that, I wanted to quantify the performance loss caused by KVM I/O virtualization.

I installed Fedora 18 on a KVM virtual machine via virt-manager and pushed /dev/md10 (the 6-disk RAID10 array) into the virtual machine as a virtio device. I then did raw I/O to /dev/vdb (what it showed up as in the virtual machine), and found that I was getting roughly the same performance as native -- which, as you recall, was 311Mb/sec. I was getting 308Mb/sec, which is close enough to be no real difference. The downside was that I was using 130% of a CPU core between the virtio driver and kflushd (using write-back mode rather than write-through mode), i.e., using up one CPU core plus 1/3rd of another to transfer the data from the VM to the LSI driver. For the purposes of this test, that is acceptable -- I have 8 cores in this machine, remember.

The next question was whether XFS performance would show the same excellent results in the VM that it showed native. This proved to be somewhat disappointing. The final result was around 280mb/sec -- or barely faster than what I was getting from ZFS. My guess is that natively XFS tries to align writes with RAID stripes for the sake of performance, but with the RAID array hidden behind the emulation layer provided by the virtualization system, it was not able to do so. That, combined with the fact that it only had half as much buffer cache to begin with (due to my splitting the RAM between the KVM virtual machine and the host OS -- i.e., 10Gb apiece) made it more difficult to effectively schedule I/O. I/O on the KVM side was "bursty" -- it would burst up to 1 gigabyte per second, then down to 0 gigabyte per second, as shown by 'dstat'. This similarly caused I/O on the host side to be somewhat "bursty". Also, this tends to support the assertion that it's the SEL (Solaris Emulation Layer) that's causing ZFS's relatively poor streaming performance when compared to BTRFS, since the SEL effectively puts the filesystem behind an emulation layer too. It also supports the assertion that the Linux kernel writers have spent a *lot* of time working on optimizations of the filesystem/block layer interface in the recent Linux kernels. It also raises the question of whether hardware RAID controllers -- which similarly hide the physical description of the actual RAID system behind a firmware-provided abstraction layer -- would have a similar negative impact upon filesystem performance. If I manage to snag a hardware RAID controller for cheap I might investigate that hypothesis but it's rather irrelevant at present.

What this did bring out was that it is unlikely that testing NTFS throughput via a Windows virtual machine is going to produce accurate data. Still, I can compare it to the Linux XFS solution, which should at least tell me whether its performance is within an order of magnitude for streaming loads. So that's the next step of this four-part series, delayed because I need to write some Java code to do what my script with 'dd' did.


Update: My scrap heap assemblage of spare parts disintegrated -- the motherboard suddenly decided it was in 6-beep "help I can't see memory!" heaven and no amount of processor and/or memory swapping made it happy -- and thus the NTFS test never got done. Oh well.

Saturday, February 23, 2013

Part II: Enter XFS

So in the previous episode, I had benchmarked btrfs at 298Mb/sec total throughput on 8 simultaneous simulated video streams to disk, and set up a Linux RAID10 array on my six 2Tb 7200 rpm drives. The raw drives have a total streaming throughput of 110Mb apiece. I left the RAID10 array to rebuilding overnight, and went to sleep.

So what is the raw throughput of the RAID10 array and how much CPU does it chew up while doing so? I tested that today. The total raw throughput of three RAID0 stripes on those drives should be 330Mbytes/sec. Through the MD layer with a single full-speed stream I got 311Mb/sec, or roughly 6% overhead caused by the Linux kernel and the RAID10 layer. The RAID10 layer was using approximately 16% of one core accounted to flush-9:10, which is quite reasonable for the amount of work being done.

Next step was to put an XFS filesystem onto this RAID device. Note that I did not even consider putting an EXT4 onto a 6 terabyte filesystem, EXT4 is not suitable for video streaming for a number of reasons I won't detail here. EXT4 is a fine general purpose filesystem, far more reliable than it has any right to be ocnsidering its origin, but has significant performance issues with very large files in a streaming application.

The first question is, does putting the XFS log on a SSD improve performance? So I created an XFS filesystem with the log device on the SSD and the filesystem proper on /dev/md10 (the RAID10 device) and did my streaming tests again. This time it settled down to 303Mb/sec, or roughly 8% overhead. Also, because XFS only logs metadata changes, I noted that virtually no I/O was going to the log device.

Note that XFS is aggregating writes into far bigger writes than my raw writes to the MD10 device, so you cannot say that XFS has only 2% overhead over direct I/O to the raw devices. It reduces MD10 overhead due to its aggregation and its aligning of blocks to RAID stripes also. Still, it is clear that XFS is the king of high-performance streaming I/O on Linux -- as has been true for the past decade.

Of course XFS also has its drawbacks. XFS values speed over everything else, so XFS can, in practice, due to its aggressive write re-ordering, result in corrupted files in the event of a power failure or kernel panic or watchdog-forced reboot. XFS is quite acceptable for video recording data, where you may corrupt the last few seconds of video recorded to disk but you'll lose far more data due to the power outage. Add in the Linux MD layer and the MD write hole, where partial-stripe updates cannot be reconciled (as versus the COW updates of BTRFS or ZFS where the old data is still available and is reverted to if the new stripe did not complete, resulting in a file that at least is consistent, though missing the last update) and it is clear that XFS should be used for important data only on top of a hardware RAID subsystem with battery-backed cache, and should not be used for absolute mission-critical data like, say, payroll, unless the features that make it perform so well on streaming loads are turned off. Appropriate tools for appropriate tasks and all that...

So in any event, it is clear that XFS, BTRFS, and ZFS are at present useful for entirely different subsets of problems, but for video streaming XFS still remains king. Next, I take a look at what Windows will do when talking NTFS to that MD10 device via libvirtd and kvm... I will also compare to what Linux does when talking XFS to that MD10 device via libvirtd and kvm.

-Eric Lee Green

Friday, February 22, 2013

Filesystem performance - btrfs, zfs, xfs, ntfs

I was somewhat curious about filesystem performance for video streaming purposes. So I set up a test system. The test system was:
  • 12-disk Supermicro chassis with SAS1/SATA2 backplane, scrap (has motherboard issues with the sensors and a broken backplane connector and was rescued from the scrap heap)
  • SuperMicro X8DTU-F motherboard with two 2.4Ghz Xeon processors (Nehalem architecture) and 20Gb of memory (the odd amount of memory is due to being populated from the contents of my junk bin)
  • Six 2Tb 7200 RPM drives (all the SATA-2 drives that I could scrounge up from my junk bin)
  • One LSI 9211-4I SAS2 HBA controller (purchased for this purpose)
  • One Intel 160Gb SATA-2 SSD (replaced with a larger SSD in my laptop)
There was also another 64Gb SSD used as the boot drive for Fedora 18, the OS used for this test. Fedora 18 was chosen because it has a recent BTRFS with RAID10 support and because ZFS On Linux will compile easily on it. The following configurations were tested:
  1. BTRFS with 6 drives configured as RAID10
  2. ZFS with 6 drives set up as three 2 disk mirror pairs, striped. I experimented with the 160Gb SSD as logging device, and without it as logging device.
  3. XFS on top of a Linux MD RAID10
  4. A Windows 7 virtual machine running in KVM virtualization environment with the MD RAID10 exported as a physical device to the virtual machine.
I did not test ext4 because I know from experience that its performance on large filesystems with large files is horrific.

Note that XFS on top of RAID10 is subject to data loss, unlike BTRFS and ZFS which include integrity guarantees. Windows 7 in a virtual machine on top of MD RAID10 is subject to even more data loss plus has approximately 5% I/O performance overhead in my experience (the host system chews up a huge amount of CPU, but CPU was not in short supply for this test). The purpose was not to propose them as serious solutions to this problem (though for security camera data loss of a few seconds of data in the event of a power failure may be acceptable) but, rather, to compare BTRFS and ZFS performance with the "old" standards in the area.

The test itself was simple. I set up a small script that set up 8 dd processes with large block sizes running in parallel streaming /dev/zero data to the filesystem, similar to what might occur if eight extremly high definition cameras were simultaneously streaming data to the filesystem. Compression was off because that would have turned it into a CPU test, heh. At intervals I would killall -USR1 dd and harvest the resulting performance numbers.

My experience with these hard drives as singletons is that they are fundamentally capable of streaming approximately 110Mb/sec to the platters. Because ZFS and BTRFS are both COW filesystems, they should have been utilizing the full streaming capability of the drives if they properly optimized their disk writes and had zero overhead. In practice, of course, that doesn't happen.

First I tried BTRFS. After some time it settled down to approximately 298Mb/sec throughput to the RAID10 test volume. This infers approximately 10% overhead (note that since RAID10 is striped mirrors, multiply by two to get total bandwidth, then divide by six to get per-drive bandwidth). The drive lights stayed almost constantly lit.

Next I tried ZFS with the log on the SSD. I immediately noticed that my hard drive lights were "loping" -- they were brightly lit then there were occasional burps or pauses. The final result was somewhat disappointing -- approximately 275Mb/sec throughput, or roughly 17% penalty compared to raw drive performance.

But wait. That's an older Intel SSD that had been used in a laptop computer for some time before I replaced it with a larger SSD. Perhaps the common ZFS wisdom to put the log file onto an SSD is not really that good? So the next thing I tried was to destroy the testpool and re-create it from scratch without the log device, and see whether that made a performance difference. The result was no performance difference. Again I was averaging 275Mb/sec throughput. To say that I'm surprised is an understatement -- "common knowledge" says putting the ZFS log on a SSD is a huge improvement. Doesn't appear to be true, at least not for streaming workloads with relatively few but large transactions.

In other words, don't use ZFS for performance. Its performance appears to have... issues... on Linux, likely due to the SEL (Solaris Emulation Layer), though it is quite acceptable for many purposes (a 17% I/O performance penalty sucks, but let's face it, most people don't max out their I/O subsystems anyhow). It's the same licensing issues that have held up ZFS adoption elsewhere, Sun created their license for their own purposes, which are not the same as the Open Source community's purposes, and Oracle appears to agree with Sun. On the other hand ZFS is *really* handy for a backup appliance, due to the ability to snapshot backups and replicate them off-site, thereby providing all three of the functions of backups (versioning, replication, and offsite disaster recovery). For limited purposes ZFS's deduplication can be handy also (I use it on virtual machine backups since virtual machine files get rsync'ed to the backup appliance every time but most of the blocks are the same as the last time it was rsync'ed -- ZFS merely links those blocks to the previously-existing blocks rather than use up all my disk). That's a purpose that I've put it to, and am having excellent results with it. Note that the latest BTRFS code in the very latest Linux kernel has RAIDZ-like functionality so that's no longer an advantage of ZFS over BTRFS.

Finally I went to the hoary old md RAID10 driver. This caused some havoc because md RAID10 insists upon scanning and possibly rebuilding every block on the RAID array before you get full-performance access to it. I changed the max rebuild speed to well above the physical capability of the hardware and MD10 reported that it was rebuilding at 315Mb/sec. This means approximately a 4% performance overhead from MD10 compared to the raw physical hard drive speed. The kernel thread md10_raid10 was using 66% of one core, and the kernel thread md10_resync was using 15% of another core. This tends to indicate that if I had a full 12-disk RAID10 array, I'd be maxing out a core and get lower performance -- fairly disappointing. Intransa's code has similar bottlenecks (probably caused by the bio callbacks to the RAID layer from the driver layer), but I'd expected the native code to perform better than 10 year old code that wasn't even originally designed for modern kernels and was not originally designed to talk to local storage and can do so only via a shim layer that I am, alas, intimately familiar with (Intransa's RAID layer was originally designed to talk to IP-connected RAID cabinets for scale-out storage). So it goes.

So now the RAID is rebuilding. I'll come back tomorrow with a new post after it's rebuilt and look at what xfs and ntfs on top of that RAID layer do. In the meantime, don't attach too much importance to these numbers. Remember, this is being done on scrap hardware as basically a "I wonder" as to how good (or how bad) the new filesystems are compared to the oldies for my particular purpose (streaming a lot of video data to the drives). YMMV and all that. And testing it on modern hardware -- SATA3 drives and backplanes, Sandy/Ivy motherboard and processors, etc. -- likely would result in much faster numbers. Still, for my purposes, this random scrap assemblage gives me enough information. So it goes.

- Eric Lee Green

* Disclaimer - I am currently chief storage appliance engineer at Intransa, Inc., a maker of iSCSI storage appliances for video surveillance. This blog post was not, however, conducted using Intransa-owned equipment or on company time, and is my own opinion and my own opinion only.