
Sunday, September 27, 2015

SSD: This changes everything

So someone commented on my last post, where I predicted that providing block storage to VMs and object storage for apps was going to be the future of storage, and pointed out some of the other ramifications of SSD. To wit: because SSD removes a lot of the I/O restrictions that have held back applications in the past, we are now at the point where CPU is, in many cases, the restriction. This is especially true since Moore's Law has seemingly gone AWOL. The Westmere Xeon processors in the NAS box on the file cabinet beside my desk aren't much slower than the latest Ivy Bridge Xeons. The slight bump in CPU speed is far exceeded by the enormous bump in IOPS that comes from replacing rotational storage with SSDs.

I have seen that myself, watching a Grails application max out eight CPU cores while barely moving the I/O needle on a database server running off SSDs. What that implies is that the days of simply throwing CPU at inefficient frameworks like Grails are numbered. In the future, efficient algorithms and languages are going to come back into fashion to make use of all this fast storage that is taking over the world.

But that's not what excites me about SSDs. That's just a shuffling of priorities. What excites me about SSDs is that they free us from the tyranny of the elevator. The elevator is the requirement that we sweep the disk drive heads from bottom to top, then from top to bottom, in order to optimize reads. This in turn puts severe restrictions on how we lay out block storage: the data must be stored contiguously, so that filesystems layered on top of the block storage can properly schedule I/O out of their buffers to satisfy the elevator. This in turn means we're stuck with the RAID write hole unless we have battery-backed cache. We can't do COW RAID stripe replacement (that is, write the altered blocks of a RAID stripe at some new location on the device, then update a stripe map table to point at the new locations and add the old locations to a free list), because a filesystem on top of the block device would no longer be able to schedule the elevator properly and the performance of the block storage system would fall over. That is why traditional iSCSI/Fibre Channel vendors present contiguous LUNs to their clients.
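To make that copy-on-write stripe replacement concrete, here is a minimal sketch in Python. Everything in it (the StripeMap name, the write_stripe callback) is purely illustrative, not any vendor's implementation; it just shows the map-table-plus-free-list bookkeeping described above.

```python
# Minimal sketch of copy-on-write stripe replacement on SSD-backed storage.
# All names here are illustrative; this is not any vendor's code.

class StripeMap:
    """Maps logical stripe numbers to physical stripe slots on the device."""

    def __init__(self, num_stripes, num_physical_slots):
        # Initially, logical stripe i lives at physical slot i.
        self.map = {i: i for i in range(num_stripes)}
        # Physical slots beyond the initial allocation form the free list.
        self.free_list = list(range(num_stripes, num_physical_slots))

    def rewrite_stripe(self, logical, write_stripe):
        """COW update: write the new stripe to a free slot, swap the map
        entry, and recycle the old slot.  write_stripe(slot) is the caller's
        function that writes the full stripe (data plus parity)."""
        if not self.free_list:
            raise RuntimeError("no free stripes; garbage collection needed")
        new_slot = self.free_list.pop()
        write_stripe(new_slot)            # full-stripe write: no read-modify-write,
                                          # so no RAID write hole
        old_slot = self.map[logical]
        self.map[logical] = new_slot      # metadata update (journaled in practice)
        self.free_list.append(old_slot)   # old stripe becomes free space

    def locate(self, logical):
        """Reads go wherever the map says -- fine on SSD, painful on spinning
        disks because physical layout no longer matches logical layout."""
        return self.map[logical]
```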

As a result, when we've tried to do COW in the past, we did it at the filesystem level so that the filesystem could schedule the elevator itself. Thus ZFS and BTRFS. They manage their own redundancy rather than relying on RAID at the block layer, and ideally want to manage the block devices directly. Unfortunately that doesn't map well to a block storage back end based on LUNs, and it maps even worse to virtual machine disks represented as files on a LUN: the virtual machines all have their own elevators doing what they think are sequential, ordered writes, but the COW filesystem scatters those writes to random places, so read performance inside the virtual machines becomes garbage. Thus VMware's VMFS, an extent-based clustered filesystem that, again because of the tyranny of the elevator, keeps the blocks of a virtual machine's virtual disk file largely contiguous on the underlying block storage so that the individual virtual machines' elevators can schedule properly.
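For illustration only, here is a toy first-fit extent allocator showing the general idea behind an extent-based layout: hand each virtual disk file one large contiguous run wherever possible, so the guest's elevator sees mostly-sequential physical reads. This is just the concept, not VMFS's actual on-disk format.

```python
# Toy first-fit extent allocator (conceptual illustration, not VMFS).

def allocate_extent(free_extents, blocks_needed):
    """free_extents: list of (start_block, length) runs of free space.
    Returns (start, length) of an allocated run, preferring one big run."""
    for i, (start, length) in enumerate(free_extents):
        if length >= blocks_needed:
            # Carve the request out of the front of this run.
            leftover = (start + blocks_needed, length - blocks_needed)
            if leftover[1] > 0:
                free_extents[i] = leftover
            else:
                del free_extents[i]
            return (start, blocks_needed)
    # No single run is big enough: the file fragments -- exactly what an
    # extent-based design tries to avoid.
    raise RuntimeError("no contiguous extent large enough")

# Example: a 1 GiB virtual disk (in 1 MiB blocks) carved from one big run.
free_space = [(0, 10240), (20480, 4096)]
print(allocate_extent(free_space, 1024))   # -> (0, 1024)
```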

So VMFS talking to clustered block storage is one way of handling things, but then you run into limits on the number of servers that can talk to a single LUN. That makes things difficult to manage, because you end up with hundreds of LUNs for hundreds of physical compute servers and have to schedule the LUNs so they're only active on the compute servers that host virtual machines on that specific LUN, in order to stay under the limit on the number of servers allowed to access it. What is needed is the ability to allocate block storage on the back end on a per-virtual-machine basis, with the same capabilities on that back end that VMFS gives us on a single LUN -- snapshots, sparse (thin-provisioned) LUNs, copying snapshots as new volumes, and so forth -- all managed by the cloud infrastructure software. This was difficult back in the days of rotational storage because we were slaves to the elevator and had to make sure all that storage ended up contiguous. But now we don't: writes still want to be contiguous, due to how SSDs handle erase blocks, but reads don't. And it was the reads that forced the elevator; scheduling contiguous streams of writes (from multiple virtual machines, and multiple files on those virtual machines) has always been easy.
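A rough sketch of the kind of per-VM volume API I have in mind follows; the class and method names are made up for illustration, and the "implementation" just tracks metadata so the snapshot/clone relationships are visible.

```python
# Hypothetical per-virtual-machine volume API: thin volumes, COW snapshots,
# and clones managed by the cloud layer.  Names are invented for illustration.

import uuid

class VolumeBackend:
    def __init__(self):
        self.volumes = {}   # volume_id -> {"size_gb": int, "parent": id or None}

    def create_volume(self, size_gb):
        """Thin-provisioned volume: space is consumed only as blocks are written."""
        vol_id = str(uuid.uuid4())
        self.volumes[vol_id] = {"size_gb": size_gb, "parent": None}
        return vol_id

    def snapshot(self, vol_id):
        """COW snapshot: shares all blocks with its parent until either is written."""
        snap_id = str(uuid.uuid4())
        self.volumes[snap_id] = {"size_gb": self.volumes[vol_id]["size_gb"],
                                 "parent": vol_id}
        return snap_id

    def clone(self, snap_id):
        """A writable volume created from a snapshot -- e.g. to spin up a test VM."""
        return self.snapshot(snap_id)
```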

I suspect this difficulty in managing VMFS on top of block storage LUNs for large numbers of ESXi compute servers is why Tintri decided to write their own extent-based filesystem and serve it to ESXi boxes as an NFS datastore rather than as block storage LUNs. NFS doesn't have the limits on the number of computers that can connect. But I'm not convinced that, going forward, this is going to be the way to do things. vSphere is a mature product that has likely reached the limits of its penetration. New startups today are raised in the cloud, primarily Amazon's cloud, and they want a degree of flexibility in spinning virtual machines up and down that makes life difficult with a product that has license limits. They want to be able to spin up entire test constellations of servers to run multi-day tests on large data sets, then destroy them with a keystroke. They can do this with Amazon's cloud. They want to be able to do it on their local clouds too. The future is likely to be based on the KVM/QEMU hypervisor and virtualization layer, which can use NFS datastores but can already present an iSCSI LUN to a virtual machine as a block device. Add some local SSD caching at the hypervisor level to speed up writes (as I explained last month), and you have both the flexibility of the cloud and the speed of SSD. You have the future -- a future that few storage vendors today seem to see, but one that the block storage vendors in particular are well equipped to capture if they're willing and able to pivot.
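As a concrete example of handing an iSCSI LUN straight to a KVM/QEMU guest, here is a small helper that builds the libvirt disk XML for a network disk. The element layout follows libvirt's documented format for iSCSI sources; the portal address and IQN below are placeholders, not real targets.

```python
# Sketch: attach an iSCSI LUN directly to a KVM guest via libvirt disk XML.
# Portal address and IQN are placeholders to be replaced with real values.

def iscsi_disk_xml(portal, iqn, lun=0, target_dev="vdb"):
    return f"""
    <disk type='network' device='disk'>
      <driver name='qemu' type='raw' cache='none'/>
      <source protocol='iscsi' name='{iqn}/{lun}'>
        <host name='{portal}' port='3260'/>
      </source>
      <target dev='{target_dev}' bus='virtio'/>
    </disk>"""

# Example (placeholder values); the output can be added to a domain definition
# with "virsh attach-device" or included in the domain XML.
print(iscsi_disk_xml("192.0.2.10", "iqn.2015-09.com.example:vol-0001"))
```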

Finally, there is a question as to whether storage and compute should be separate things at all. Why not have compute in the same box as your storage? There are problems with that, though: 1) you want to upgrade compute capability to faster processors on a regular basis without disrupting your data storage, and 2) the density of compute servers is much higher than the density of data servers -- you can put four compute blades into the same 2U space as a 24-bay data server. And as pointed out above, compute power, not IOPS, is now going to be the limiting factor for many applications. Finally, you want the operational ability to add more compute servers as needed. When our team used up the full capacity of our compute servers, I just added another compute server -- I had plenty of storage. Because the demand for compute and memory keeps going up as our team has more combinations of customer hardware and software to test, it's likely I'm going to keep scaling compute servers far more often than I have to scale storage servers.

So this has gone on much too long, but the last thing to cover is this: will storage boxes go the way of the dodo bird, replaced by software-defined solutions like Ceph running on top of large numbers of standard Linux storage servers serving individual disks as JBODs? It's possible, I suppose -- but it seems unlikely, due to the latency of having to locate disk blocks scattered across a network. I do believe that commodity hardware is going to win everything except the high-end big-iron database business, because the performance of commodity hardware has risen to the point where it's pointless to design your own hardware rather than buy it off the shelf from a vendor like Supermicro. But there is still going to be a need for a storage stack tied to that hardware, because pure software-defined solutions are unable to do rudimentary things like use SES to blink the LED of a disk bay whose SSD has failed. In the end, providing an iSCSI LUN directly to a virtual machine requires both a software side that is clearly software-defined and a hardware side where the hardware is managed by the solution. This in turn implies that we'll continue to have storage vendors shipping storage boxes in the future -- albeit storage boxes that incorporate increasingly large amounts of software running on infrastructure servers to handle important functions like spinning up a virtual machine with an attached volume of a given size and IOPS guarantee.
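For what that "blink the LED" hardware-side work looks like in practice, here is a hedged sketch using sg_ses from sg3_utils. The enclosure device path and slot number are placeholders, and sg_ses option names vary somewhat between versions, so treat this as an assumption to verify against your own sg_ses man page rather than a recipe.

```python
# Sketch: toggle a drive bay's locate/identify LED through SES, via sg_ses.
# Device path and slot are placeholders; verify option names for your version.

import subprocess

def set_locate_led(enclosure_dev, slot, on=True):
    """Turn the locate LED for a drive bay on or off via the SES enclosure."""
    action = "--set=locate" if on else "--clear=locate"
    subprocess.run(
        ["sg_ses", f"--dev-slot-num={slot}", action, enclosure_dev],
        check=True)

# Example: light up slot 5 of the enclosure behind /dev/sg3 (placeholder path).
# set_locate_led("/dev/sg3", 5)
```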

-ELG

Friday, March 2, 2012

Best practices for virtualization

A series of notes...

  1. vSphere/ESXi: Expensive. Inscrutable licensing scheme -- they have more SKUs than my employer, and it's almost impossible to tell what you need for your application. Closest thing to It Just Works in virtualization. Call them the Apple of virtualization.
  2. Xen: Paravirtualization gives it an advantage in certain applications, such as virtualized Linux VMs in the cloud. Paravirtualization is generally faster than full hardware virtualization, though most hypervisors now include paravirtualized device drivers to ease that pain. Xen doesn't Just Work; it's more of an erector set. Citrix's XenServer is the closest Xen gets to vSphere's 'Just Works' -- I need to download it and try it out.
  3. KVM: The future. Integrating the hypervisor and the OS allows much better performance; that's why VMware wrote their own kernel with an integrated hypervisor. Current issues: management is the biggest difficulty. It is hard to set up clustered filesystems for swift failover or migration of virtual machines (ESXi's VMFS is a cluster filesystem -- point several ESXi systems at a VMFS filesystem on iSCSI or Fibre Channel block storage and they'll all be able to access the virtual machines on it). Most KVM systems set up for failover/migration in production use NFS instead, but NFS performs quite poorly for the typical virtualization workload, for numerous reasons (which I may discuss later). The closest thing to VMFS performance for VM disks is using LVM volumes, or clustered LVM if using iSCSI block storage, but there are no management tools for KVM that let you set up LVM pools and manage them for virtual machine storage with snapshots and so forth (a minimal sketch of what such tooling might look like follows this list). Virtual disk performance on normal Linux filesystems, via the qcow2 format, sucks whether you're talking ext4, xfs, or nfs. In short, the raw underlying bits and pieces are all there, but there is no management infrastructure to use them. Best practice, performance-wise, for a clustered setup: an NFS share for metadata (XML description files of VMs, etc.), and iSCSI or Fibre Channel block storage, possibly sliced and diced with clustered LVM, for the VM disks.
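Here is the promised sketch of the missing management layer: carving LVM volumes for VM disks and snapshotting them before risky changes. The lvcreate invocations are standard LVM2, but the volume group and LV names are placeholders, and a real tool would also track which guest owns which volume.

```python
# Sketch of LVM-backed VM disk management for KVM.  VG/LV names are placeholders.

import subprocess

def create_vm_disk(vg, name, size_gb):
    """Create a logical volume to hand to KVM as a raw virtio disk."""
    subprocess.run(["lvcreate", "-L", f"{size_gb}G", "-n", name, vg], check=True)
    return f"/dev/{vg}/{name}"

def snapshot_vm_disk(vg, name, snap_name, size_gb):
    """Classic (non-thin) LVM snapshot; it needs its own COW space, hence size_gb."""
    subprocess.run(["lvcreate", "-s", "-L", f"{size_gb}G", "-n", snap_name,
                    f"/dev/{vg}/{name}"], check=True)
    return f"/dev/{vg}/{snap_name}"

# Example: a 40 GB disk for a guest, then a 5 GB snapshot before an upgrade.
# disk = create_vm_disk("vmvg", "guest01", 40)
# snap = snapshot_vm_disk("vmvg", "guest01", "guest01_preupgrade", 5)
```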
So what am I going to use today if I'm a busy IT guy who wants something that Just Works? VMware vSphere. Duh. If, on the other hand, I'm building a thousand-node cluster, a) it's probably my full-time job, so I have time to spend futzing with things like clustered LVM, and b) the cost of vSphere for a cluster that large would be astronomical, which would decidedly make paying my salary to implement Xen or KVM on said cluster more palatable than paying VMware.

-ELG

Thursday, March 24, 2011

Sound, Flash, and VMware Player

These are some notes on how to get sound and Flash working on a 64-bit Scientific Linux guest (rebranded Red Hat Enterprise Linux 6) running inside VMware Player on Windows 7 64-bit:

Flash: Note that Chrome for 64-bit Linux does *not* include the integrated Flash player that all other platforms have. You'll need to download the 64-bit Flash beta from Adobe; at the time of this writing, it's at Adobe Labs. Once you extract the tgz file, you'll be left with a file named "libflashplayer.so". Simply copy it to /usr/lib64/mozilla/plugins, restart Chrome or Firefox, and you're set.

Sound: VMware's sound device crashes with the default Ubuntu / Red Hat sound configuration. This is apparently because VMware doesn't bother emulating all pieces of the hardware it says it's emulating, and when ALSA touches the missing bits, VMware disables the sound device. That's easy enough to fix, though. From the GNOME menu, go to System->Preferences->Sound. In the Sound preferences, click on Hardware. Change the profile at the bottom from "Analog Stereo Duplex" to "Analog Stereo Output". Then right-click the little speaker icon on the bottom margin of the VMware Player window and select "Connect". After about a minute it will turn green and you'll be able to play sound again.
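If you'd rather make the same profile change from a terminal, PulseAudio's pactl can do it. The card index and profile name below are assumptions; check them with "pactl list cards" first, since they may differ on your system.

```python
# Same profile change via PulseAudio's pactl instead of the GNOME dialog.
# Card index "0" and the profile name are assumptions -- verify with
# "pactl list cards" before relying on them.

import subprocess

def force_output_only(card="0", profile="output:analog-stereo"):
    """Switch the card from the duplex profile to output-only, avoiding the
    half-emulated capture path that trips up VMware's sound device."""
    subprocess.run(["pactl", "set-card-profile", card, profile], check=True)

# force_output_only()
```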

Why this happens: the emulated Ensoniq device is, in real hardware, capable of stereo duplex operation (recording and playing back at the same time) -- I know because I owned one of the physical cards back in the day and used it for exactly that with a multitrack recorder program -- but VMware's emulation of the device is not, so you must disable that capability. Unless you were intending to record audio inside the Linux virtual machine (*not* recommended; VMware's timing is not good enough to get decent results), this has no practical effect: you can still play Flash videos from Chrome (and presumably Firefox) and hear the sound.

So if you're running Windows 7 on your physical hardware because you need the graphics performance for games and aren't willing to do the dual-card IOMMU hack with Xen that I demonstrated previously (or your hardware simply doesn't have IOMMU support, or you're not willing to use the latest bleeding-edge OpenSUSE as your platform), you can still have the far more secure web browsing environment of Linux available, and with VMware Player you can give Linux any additional drives beyond your boot drive to manage. Linux works far better as a server than Windows 7 does -- it can provide CIFS, AFP, and iSCSI, and do it all on a much better software RAID stack than Microsoft's, as well as using LVM to manage that space, for which Windows 7 has no equivalent.

And VMware's Unity system actually works pretty well with SL6/RHEL6 on Windows, although not as well as it works in VMware Fusion on MacOS (the issue being that the little Unity menu icon gets put into the screen menu bar on MacOS and has a native look and feel, while it shows up as a clunky, usually-invisible icon above the Start menu on Windows). That means you can mix Linux Chrome windows, Windows IE windows (ick! but there are a couple of applications I need for customer support that require IE plugins that don't exist for any other browser), Linux shell windows, and hoary old Outlook all on the same screen, and manage them using the normal Aero Peek icons at the bottom or, if you have a Logitech mouse with the Logitech drivers installed, by assigning one of the side buttons to Window Switcher (their clone of Apple's Expose'). And unlike earlier versions of Windows, Windows 7 is stable -- in fact, I've never managed to make it crash. It's just very annoying... but that's why Apple is still in business, after all. Because Apple makes computers that are not annoying. For a price. A big price, alas...

-ELG