Saturday, April 27, 2013

Configuring shared access for KVM/libvirt VM's

Libvirt has some nice migration features in the latest RHEL/Centos 6.4 to let you move virtual machines from one server to the other, assuming that you . But if you try it with VM's set to auto-start on server startup, you'll swiftly run into problems the next time you reboot your compute servers -- the same VM will try to start up on multiple compute servers.

The reality is that unlike ESXi, which by default locks the VMDK file so that only a single virtual machine can use it at a time, thus meaning that the same VM set to start up on multiple servers will only start on one (that wins the race), libvirtd by default does *not* include any sort of locking. You have to configure a lock manager to do so. In my case, I configured 'sanlock', which has integration with libvirtd. So on each KVM host configured to access shared VM datastore /shared/datastore :

  • yum install sanlock
  • yum install libvirt-lock-sanlock
Now set up sanlock to start at system boot, and start it up:
  • chkconfig wdmd on
  • chkconfig sanlock on
  • service wdmd start
  • service sanlock start
On the shared datastore, create a locking directory and give it username/ID sanlock:sanlock and permissions for anybody who is in group sanlock to write to it:
  • cd /shared/datastore
  • mkdir sanlock
  • chown sanlock:sanlock sanlock
  • chmod 775 sanlock
Finally, you have to update the libvirtd configuration to use the new locking directory. Edit /etc/libvirt/qemu_sanlock.conf with the following:
  • auto_disk_leases = 1
  • disk_lease_dir = /shared/datastore/sanlock
  • host_id = 1
  • user = "sanlock"
  • group = "sanlock"
Everything else in the file should be commented out or a blank line. Host ID must be different for each compute host, I started counting at 1 and counted up for each compute host. And edit /etc/libvirt/qemu.conf to set the lock manager:
  • lock_manager = "sanlock"
(the line is probably already there, just commented out. Un-comment it). At this point, stop all your VM's on this host (or migrate them to another host), and either reboot (to make sure all comes up properly) or just restart libvirtd with
  • service libvirtd restart
Once you've done this on all servers, try starting up a virtual machine you don't care about on two different servers at the same time. The second attempt should fail with a locking error., At the end of the process it's always wise to shut down all your virtual machines and re-start your entire compute infrastructure that's using the sanlock locking to make sure everything comes up correctly. So-called "bounce tests" are painful, but the only way to be *sure* things won't go AWOL at system boot. If you have more than three compute servers I instead *strongly* suggest that you go to an OpenStack cloud instead, because things become unmanageable swiftly using this mechanism. At present the easiest way to deploy OpenStack appears to be Ubuntu, which has pre-compiled binaries on both their LTS and current distribution releases for OpenStack Grizzly, the latest production release of OpenStack as of this writing. OpenStack takes care of VM startup and shutdown cluster-wide and simply won't start a VM on two different servers at the same time. But that's something for another post. -ELG

Friday, April 26, 2013

On spinning rust and SSD's.

I got my Crucial M4 512GB SSD back for my laptop. It failed about three weeks ago, when I turned on my laptop it simply wasn't there. Complete binary failure mode -- it worked, then it didn't work. So I took it out of the laptop, verified in an external USB enclosure that it didn't "spin up" there either, installed a 750Gb WD Black 7200 rpm rust-spinner that was in my junk box for some project or another, and re-installed Windows and restored my backups. Annoying, but not fatal by any means. I've had to get used to the slow speed of spinning rust again versus the blazingly fast SSD, but at least I'm up and running. So this weekend I get to make another full backup, then swap out the rust for the SSD again.

At work I've had to replace several of the WD 2TB Enterprise drives in the new Linux-based infrastructure when smartd started whining about uncorrectable read errors. When StorStac got notification of that sort of thing it re-wrote the sector from the RAID checksums and that usually resolved it. The Linux 3.8 kernel's md RAID6 layer apparently doesn't do that, requiring me to kick the drive out of the md, slide in a replacement, fire off a rebuild, and then haul the drive over to my desktop and slide it in there and run a blank-out (write zeroes to the entire drive). Sometimes that resolves the issue, sometimes the drive really *is* toast, but at least it was an analog error (just one or two bad places on the drive), not a complete binary error (the entire drive just going blammo).

SSD's are the future. The new COW filesystems such as ZFS and BTRFS really don't do too well on spinning rust, because by their very nature they fragment badly over time. That doesn't matter on SSD's, it does matter with rust-spinners, for obvious reasons. With ZFS you can still get decent performance on rust if you use a second-level SSD cache, that's how I do my backup system here at home (which is an external USB3 hard drive and an internal SSD in my server), BTRFS has no such mechanism at present but to a certain extent compensates by having a (manual) de-fragmentation process that can be run from time to time during "off" hours. Still, both filesystems clearly prefer SSD to rotational storage. It's just the nature of the beast. And those filesystems have sufficient advantages in terms of functionality and reliability (except in virtualized environments as virtual machine filesystems -- but more on that later) that if your application can afford SSD's, that alone may be the tipping point that makes you go to SSD-based storage rather than rotational storage.

Still, it's clear to me that, at this time, SSD is still an immature technology subject to catastrophic failure with no warning. Rotational storage usually gives you warning, you start getting SMART notifications about sectors that cannot be read, about sectors being relocated, and so forth. So when designing an architecture for reliability, it is unwise to have an SSD be a single point of failure, as is often done for ESXi servers that lack hardware RAID cards supported by ESXi. It might *seem* that SSD is more reliable than rotational storage. And on paper, that may even be true. But the reality is that because the nature of the failures is different, in *reality* rotational storage gives you a much better chance of detecting and recovering from a failing drive than SSD's do. That may, or may not be important for your application -- in RAID it clearly isn't a big deal, since you'll be replacing the drive and rebuilding a new drive anyhow -- but for things like an ESXi boot drive it's something you should consider.


Thursday, April 25, 2013


I must admit that I have a low opinion of journalists, tech journalists in particular. I've been interviewed several times over the years and only once has the result been accurate. In all the other cases, what I said was spun to fit the journalist's preconceived notion of what the story should be, and to bleep with the truth.

What I cannot understand is why, if a tech journalist cannot interview people in the know because they had to sign a NDA in order to obtain certain assets for a specified price, said journalist would go ahead and publish a story based entirely upon speculation and a single source that may or may not know the details of whatever legal agreements were signed. It's not professional, it's not ethical, and it's not right. But it's the way tech "journalism" is done here in the Silicon Valley. I guess making a living by being unprofessional and unethical doesn't bother some people. So it goes.


Monday, April 1, 2013


> Realm shutdown

Click on the picture for high resolution. Today we decommissioned the only 10gbit Intransa iSCSI storage realm in existence. There were only two ever built, and only one was ever sold. This one was built by Douglas Fong for use by Intransa IT and has 24 4-disk IP-connected disk shelves in six cabinets, for a total of 96 250gb IDE hard drives talking to two SMC/Dell switches via 48 1gbit connections. The SMC/Dell switches are then connected to the two clustered controller units via 10Gbit Ethernet, which then exports iSCSI to the two SMC/Dell switches above it via 10Gbit Ethernet. This whole concept was designed for scale-out storage, when you needed more storage you just added more of the blue boxes (or, later, the grey boxes to the left) and incidentally this also made the result faster.

Two things became clear as I was prepping the changeover from this 2/3rds rack of equipment to 4u worth of generic Linux storage. The first was that the Intransa box was infinitely easier to manage than my 24 disks worth of Linux-based storage, despite having four times as many spindles. This is because the Intransa software did policy-based storage allocation. You told it you wanted a new volume with 5-disk RAID5 or 4-disk RAID10 or whatever, and it went out and either found existing RAID groups and put your new volume there, or found enough disks to create a new RAID group and put your volume there. You didn't have to worry about how to lay out RAID groups or volumes on top of RAID groups and exporting to iSCSI, it all Just Happened.

The second thing that became apparent was that this beast was fast -- seriously fast. The orange cable at the top right is the 10Gbit Ethernet cable going to my new infrastructure that I used to migrate the volumes off of this pile of blue boxes. Surprisingly, the limit was my new Linux storage boxes, not the Intransa storage -- I was pulling data off at 200 megabytes/second, the max I could pull in via my two 1Gbit Ethernet connectors. It seems that if you have enough spindles, even 250gb IDE drives can generate a significant number of iops. It would have been interesting to see exactly how fast it was, but unfortunately I'm still working on getting the Intel 10Gbit cards working in the Linux storage servers (I am now going to use copper SFP+ cables, since it is clear that the Intel cards aren't going to work with the optical SFP+ modules that I have), so was restricted to two 1Gbit connections.

Sadly, the pile of dead drives on top of the pile of blue cabinets are one indication of why it's being retired. The 250Gb Maxtor drives in this thing were manufactured in 2004 and were starting to fail. My supply of spare parts was limited. In addition, this beast is horrifically complex -- even the person who built it had trouble getting it up and running the last time it was moved, and our new little startup certainly wouldn't be able to get it up and going by ourselves, so we settled for getting the intellectual property off of it onto our own generic Linux server equipment. Finally, it and the backup replica realm beside it took up a huge amount of space and power, the two Linux servers do in 8U what required an entire rack full of equipment to do with this seven-to-nine-year-old Intransa equipment. So it was time, albeit with a bit of sadness too. Intransa had some great ideas and solid gear. They could not, alas, make money with it.

I played taps on my Irish whistle as the realm shut down.

-- ELG