Saturday, March 17, 2012

Random notes on iSCSI storage

When you're using ESXi/vSphere as your virtualization host, iSCSI storage is genuinely useful. That's because VMware's VMFS3 filesystem is a clustered filesystem by default. What that means is that if your iSCSI target is capable of operating in cluster mode -- i.e., accepting initiators from multiple hosts connected at the same time -- iSCSI block storage can be used for ultra-quick failover of your VMware servers, among other things. And the performance is *significantly* better than NFS datastores, because with VMFS3 VMware can store vmdk files as physically contiguous extents, while VMware has no control over how an NFS server physically lays out vmdk files on disk. This matters because all modern operating systems use an "elevator" algorithm for their filesystem cache flushes that assumes the underlying block storage is physically contiguous from block 0 to block n. If the underlying storage is *not* physically contiguous, you end up with either the possibility of lost writes (if the NFS host is running in asynchronous mode) or with the NFS host's disks thrashing all over the place and performance sucking like a male prostitute at a Republican convention.

So anyhow, I just wanted to share a technique I used to rescue a failing machine. The machine involved was a Red Hat Enterprise Linux 4 box that I wanted to migrate to virtualization for the simple reason that one of its drives had failed. Only 30GB of the first drive was used for actual data; most of the system was empty.

So first things first: I created a blank virtual machine on the ESXi host and told VMware to create a drive big enough to hold all the data on the old RHEL4 machine. Then I attached that virtual machine's hard drive to a CentOS 6 virtual machine as a second virtual disk, and exported that virtual disk via tgtd / iSCSI to the RHEL4 machine, connecting to the target from the RHEL4 machine's iSCSI initiator. (It showed up as something like /dev/sdc -- I'd checked /proc/partitions before telling the initiator to scan, so I'd know what appeared.) On the RHEL4 machine I then dd'ed the first hundred blocks from its physical hard drive to the iSCSI drive, ran 'sfdisk -R /dev/sdc' to re-read the partition table on /dev/sdc, then copied the /boot partition (after unmounting it) byte for byte: 'dd if=/dev/sda1 of=/dev/sdc1'. Then I did

  • pvcreate /dev/sdc2
  • vgcreate rootgroup /dev/sdc2
  • lvcreate -n rootvol -L 16G rootgroup
  • lvcreate -n swapvol -L 2G rootgroup
  • lvcreate -n extravol -L 16G rootgroup
  • vgscan
  • lvscan -a
  • mkfs -t ext3 /dev/mapper/rootgroup-rootvol
  • mkswap /dev/mapper/rootgroup-swapvol
  • mkfs -t ext3 /dev/mapper/rootgroup-extravol
I then mounted my new volumes in their correct hierarchy (so that when I chrooted into them I'd see /boot etc. in their right places) and did the typical pipelined tar commands to do file-by-file copies of / and /extra to their new locations. While that was going on I edited /etc/fstab, chrooted into the new environment, mounted /proc and /sys, and ran mkinitrd so the new initrd would capture the new root volume. I do suggest having a rescue disk handy as an ISO image on an ESXi datastore so you can mount it in case of problems -- which I did need, but for reasons unrelated to any of this (it was related to the drive failure that caused me to do the migration in the first place).
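The pipelined tar copy is the old-school way to do a file-by-file copy that preserves permissions and ownership. A minimal sketch, with hypothetical stand-in paths (on the real machine the source was the old root and the destination the freshly mounted LVM volumes):

```shell
# Stand-in source and destination trees (illustrative paths only)
mkdir -p /tmp/oldroot/etc /tmp/newroot
echo "LABEL=/ / ext3 defaults 1 1" > /tmp/oldroot/etc/fstab

# The pipelined tar copy: pack on one side, unpack on the other,
# -p on extraction to preserve permissions
(cd /tmp/oldroot && tar cf - .) | (cd /tmp/newroot && tar xpf -)
```

The same pipeline works across machines if you stick an ssh in the middle of the pipe.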

So how did this data transfer perform? Well, basically at the full speed of the source hard drive, which was a 500GB IDE hard drive.
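Incidentally, the first-blocks trick from above (copy the MBR and partition table, then re-read it) can be rehearsed safely on throwaway image files instead of real disks like /dev/sda and /dev/sdc. A sketch, with the same block count as the rescue:

```shell
# Stand-ins for the physical disk and the iSCSI disk
dd if=/dev/urandom of=/tmp/src.img bs=512 count=200 2>/dev/null
dd if=/dev/zero    of=/tmp/dst.img bs=512 count=200 2>/dev/null

# Copy the first hundred 512-byte blocks: MBR, partition table, and slack.
# conv=notrunc keeps the rest of the destination intact.
dd if=/tmp/src.img of=/tmp/dst.img bs=512 count=100 conv=notrunc 2>/dev/null

# On a real block device you would now force a partition-table re-read:
#   sfdisk -R /dev/sdc    (older sfdisk; newer systems: blockdev --rereadpt)
```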

Anyhow, having used the Linux iSCSI target daemon, tgtd, here as well as extensively for other projects, let me just say that it sucks big-time compared to "real" targets. How does it suck? Let me count the ways:

  1. Online storage management simply doesn't exist with tgtd. You can't do *anything* to manage an iSCSI target that someone's already connected to -- you can't even stop tgtd!
  2. For that matter, storage management, period, doesn't exist with tgtd. For example, you can't increase the size of a target once you've created it by adding more backing store to an already-running iSCSI target; a target simply is what it was created as.
  3. tgtd gets into regular fights with the Linux kernel about who owns the block devices it's trying to export, which makes it basically useless for exporting block devices: if there's an md array or an LVM volume set on the device, the Linux kernel will claim it long before tgtd gets ownership. And since you have no control over what the initiator puts onto a block device, you're stuck -- you have to manually stop the target, deactivate the RAID array and/or volume group, then manually start the target again to get control of the physical device so you can export it.
  4. tgtd has the most obscure failure mode I've ever encountered: if it can't do something it will still happily export the volume, just as a 0-length volume. WTF?!
My conclusion: tgtd is a toy, useful only for experimenting and one-off applications. It doesn't have the storage management capabilities needed for a serious iSCSI target. Some of that storage management could be built around it, but the fact that you cannot modify a tgtd target while anybody is connected to it means you can't do things that the big players -- or even little guys like the Intransa appliance I'm using as the backing store for my ESXi host -- have been able to do for years. Even on the antique nine-year-old Intransa realm that's hosting some of our older data (which is migrating to a new one, but that takes time) I can expand the size of an iSCSI target in real time. I then tell my initiator to re-scan, it notices "hey, my target has gotten bigger!" and informs the kernel, and then I can use the OS's native utilities to expand the current filesystem into the additional space. None of that is possible with tgtd, for the simple reason that tgtd won't do real-time live storage management. Toy. Just sayin'.
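That last step -- the target grew, now grow the filesystem -- is easy to rehearse without any iSCSI hardware at all, because e2fsprogs will happily operate on a plain image file. A sketch, with hypothetical sizes (on a live initiator you'd run 'iscsiadm -m node --rescan' first, then resize the real device):

```shell
# Make an 8 MB ext3 image (-F because it's a regular file, not a block device)
truncate -s 8M /tmp/disk.img
mkfs.ext3 -F -q /tmp/disk.img

# Simulate the storage side expanding the iSCSI target
truncate -s 16M /tmp/disk.img

# resize2fs insists on a recent fsck, then grows the fs to fill the device
e2fsck -f -p /tmp/disk.img
resize2fs /tmp/disk.img
```

On a mounted ext3 filesystem the same grow is possible online with resize2fs (or ext2online on older RHEL).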


Friday, March 2, 2012

Best practices for virtualization

A series of notes...

  1. vSphere/ESXi: Expensive, with an inscrutable licensing scheme -- they have more SKUs than my employer, and it's almost impossible to tell what you need for your application. But it's the closest thing to It Just Works in virtualization. Call them the Apple of virtualization.
  2. Xen: Paravirtualization gives it an advantage in certain applications, such as virtualized Linux VMs in the cloud. Paravirtualization is generally faster than full virtualization, though most hypervisors now include paravirtualized device drivers to ease that pain. Xen doesn't Just Work; it's more of an erector set. Citrix's XenServer is the closest Xen gets to vSphere's Just Works; I need to download it and try it out.
  3. KVM: The future. Integrating the hypervisor and the OS allows much better performance -- that's why VMware wrote their own kernel with an integrated hypervisor. Current issues: management is the biggest difficulty. It's hard to set up clustered filesystems for swift failover or migration of virtual machines (ESXi's VMFS is a cluster filesystem -- point several ESXi systems at a VMFS filesystem on iSCSI or Fibre Channel block storage, and they'll all be able to access the virtual machines on it). Most KVM systems set up for failover/migration in production use NFS instead, but NFS performs quite poorly for the typical virtualization workload, for numerous reasons (may discuss later). The closest thing to VMFS performance for VM disks is using LVM volumes, or clustered LVM if you're on iSCSI block storage, but there are no management tools for KVM that let you set up LVM pools and manage them for virtual machine storage with snapshots and so forth. Virtual disk performance on normal Linux filesystems, via the qcow2 format, sucks whether you're talking ext4, xfs, or nfs. In short, the raw underlying bits and pieces are all there, but there's no management infrastructure to use them. Best practice, performance-wise, for a clustered setup: an NFS share for metadata (XML descriptions of VMs, etc.), and iSCSI or Fibre Channel block storage, possibly sliced and diced with clustered LVM, for the VM disks.
So what am I going to use today if I'm a busy IT guy who wants something that Just Works? VMware vSphere. Duh. If, on the other hand, I'm building a thousand-node cluster, (a) it's probably my full-time job, so I have time to spend futzing with things like clustered LVM, and (b) the cost of vSphere for a cluster that large would be astronomical, which would decidedly make paying my salary to implement Xen or KVM on said cluster more palatable than paying VMware.
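For what it's worth, libvirt does ship a "logical" storage pool type that covers the basic carving of an LVM volume group into per-VM disks -- though not the clustered-LVM or snapshot management bemoaned above. A sketch of the pool definition you'd feed to 'virsh pool-define' (the vmstore name and /dev/sdb1 device path are hypothetical):

```xml
<pool type='logical'>
  <name>vmstore</name>
  <source>
    <device path='/dev/sdb1'/>
    <name>vmstore</name>
    <format type='lvm2'/>
  </source>
  <target>
    <path>/dev/vmstore</path>
  </target>
</pool>
```

Once the pool is defined and started, 'virsh vol-create-as' can allocate logical volumes for VM disks from it.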


Random notes on automating Windows builds

  1. Install the version control system of your choice. In this case, BitKeeper, but any other CLI-drivable version control system will work.
  2. Check out your directory tree of the various products you are going to be building.
  3. Install the *real* Microsoft Visual Studio 2010.
  4. Create a solution (Microsoft-speak for a group of projects -- the closest thing to a "makefile", though it isn't one) for each of the individual components you are building as part of your overall product, and make sure each solution builds. The project file will be saved as Foo.vcxproj (for some project named Foo) in each project's root directory.
  5. Add the directories that 'devenv' and 'nmake' live in to your PATH in your system config: Control Panel->System->Advanced System Settings->Environment Variables, then edit the user variable Path. My Path looks like: C:\Users\\bitkeeper;C:\Program Files (x86)\Microsoft Visual Studio 10.0\Common7\IDE;C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\bin
  6. Create a standard Unix-style Makefile in the parent directory with a list of the subdirectories to recurse into, then in each subdirectory a Makefile that runs 'devenv /build Foo.vcxproj' to build and 'devenv /clean Foo.vcxproj' to clean.
  7. Test your make file with 'make', make sure that the proper .exe files are produced in each of your subdirectories.
  8. With Visual Studio closed, install WiX and Votive.
  9. Use Votive to build a WiX project file -- the .wxs source is XML, which the WiX tools then compile.
Once you've done this, you can edit the Makefile at the root of your project so that after it recurses into the subdirectories, it runs the WiX commands:
  1. candle product.wxs
  2. light product.wixobj
The output should be product.msi.
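Pulled together, the top-level Makefile might look something like the sketch below. This is nmake-flavored and the subdirectory and file names (libfoo, app, product.wxs) are hypothetical stand-ins; it assumes devenv, candle, and light are all on your PATH per the steps above, and recipe lines must start with hard tabs:

```make
# Top-level Makefile: recurse into each solution directory, then
# run the WiX compile/link to produce product.msi.
SUBDIRS = libfoo app

all:
	cd libfoo && $(MAKE) all
	cd app && $(MAKE) all
	candle product.wxs
	light product.wixobj

clean:
	cd libfoo && $(MAKE) clean
	cd app && $(MAKE) clean
```

Each per-subdirectory Makefile then just wraps 'devenv /build Foo.vcxproj' and 'devenv /clean Foo.vcxproj' as described in step 6.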

Install Jenkins to integrate with the source control system: it checks from time to time for new checkins, then fires off a build via nmake when checkins happen. Jenkins does run under Windows, with some caveats; see e.g. "Setting up Jenkins as a Windows service". The biggest issue may be getting email to go out on Windows; I'll have to investigate that further once I get to that point.