Wednesday, May 1, 2013

ORM

I'm going to speak heresy here: Object-Relational Mappers such as Hibernate are evil. I say that as someone who wrote an object-relational mapper back in 2000 -- the BRU Server server is written as a master class that maps objects to a record in a MySQL database, which is then inherited by child classes that implement the specific record types in the MySQL database. The master class takes care of adding the table to the database if it doesn't exist, as well as populating the Python objects on queries. The child classes take care of business logic and generating queries that don't map well to the ORM, but pass the query results to generators to produce proper object sets out of SQL data sets.
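To make the shape of that design concrete, here is a minimal sketch of a "just enough" mapper. It is not the BRU Server code: it assumes Python 3 and the standard-library sqlite3 module in place of MySQL, and the names Record, Server, ensure_table, from_rows and busiest are all hypothetical. The base class creates the table if it is missing and turns result sets into objects; the child class holds the business-specific query and hands its rows back to the same generator.

```python
import sqlite3


class Record:
    """Base class: one subclass per table, one instance per row."""

    table = None   # table name, set by each child class
    fields = {}    # column name -> SQL type, set by each child class

    def __init__(self, **values):
        self.id = values.pop("id", None)
        for name in self.fields:
            setattr(self, name, values.get(name))

    @classmethod
    def ensure_table(cls, db):
        """Create the table if it does not exist yet."""
        cols = ", ".join(f"{n} {t}" for n, t in cls.fields.items())
        db.execute(
            f"CREATE TABLE IF NOT EXISTS {cls.table} "
            f"(id INTEGER PRIMARY KEY, {cols})"
        )

    def save(self, db):
        """Insert a new row, or update the existing row if we already have an id."""
        names = list(self.fields)
        values = [getattr(self, n) for n in names]
        if self.id is None:
            marks = ", ".join("?" for _ in names)
            cur = db.execute(
                f"INSERT INTO {self.table} ({', '.join(names)}) VALUES ({marks})",
                values,
            )
            self.id = cur.lastrowid
        else:
            sets = ", ".join(f"{n} = ?" for n in names)
            db.execute(
                f"UPDATE {self.table} SET {sets} WHERE id = ?",
                values + [self.id],
            )

    @classmethod
    def from_rows(cls, cursor):
        """Generator: turn any result set with matching column names into objects."""
        columns = [d[0] for d in cursor.description]
        for row in cursor:
            yield cls(**dict(zip(columns, row)))

    @classmethod
    def select(cls, db, where="1=1", params=()):
        """Simple mapped query: all columns, optional WHERE clause."""
        names = ", ".join(["id"] + list(cls.fields))
        sql = f"SELECT {names} FROM {cls.table} WHERE {where}"
        return cls.from_rows(db.execute(sql, params))


class Server(Record):
    table = "servers"
    fields = {"hostname": "TEXT", "jobs_run": "INTEGER"}

    @classmethod
    def busiest(cls, db, limit):
        # Hand-written SQL for a query the simple mapping does not cover;
        # the rows still come back as Server objects via from_rows().
        cur = db.execute(
            "SELECT id, hostname, jobs_run FROM servers "
            "ORDER BY jobs_run DESC LIMIT ?",
            (limit,),
        )
        return list(cls.from_rows(cur))


db = sqlite3.connect(":memory:")
Server.ensure_table(db)
for name, jobs in [("alpha", 3), ("beta", 12), ("gamma", 7)]:
    Server(hostname=name, jobs_run=jobs).save(db)

print([s.hostname for s in Server.busiest(db, 2)])  # ['beta', 'gamma']
```

The point of the sketch is its size: the whole mapping layer fits on one screen, so dropping to raw SQL in a child class never fights the mapper.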

So why did this approach work so well for BRU Server, allowing us to concentrate on the business logic rather than the database logic and allowing its current owners to maintain the software for ten years now, while it fails so harshly for atrocities like Hibernate? One word: complexity. Hibernate attempts to handle all possible cases, and thus ends up producing terrible SQL queries while making things that should be easy difficult, but that's because it's a general purpose mapper. The BRU Server team -- all four of us -- understood that if we were going to create a complete Unix network backup solution within the six months allotted to us, complexity was the enemy. We understood the compromises needed between the object model and the relational model, and the fact that Python was capable of expressing sets of objects as easily as it was capable of expressing individual objects meant that the "object=record" paradigm was fairly easy to handle. We wrote as much ORM as we needed -- and no more. In some cases we had to go to raw relational database programming, but because our ORM was so simple we had no problems with that. There were no exceptions being thrown because of an ORM caching data that no longer existed in the database, and the actual objects for things like users and servers could do their thing without worrying about how to actually read and write the database.

In the meantime, I have not run into any Spring/Hibernate project that actually managed to produce usable code performing acceptably well in any reasonable time frame with a team of a reasonable size. I was at one company that decided to use the PHP-based code that three of us had written in four weeks' time as the prototype for the "real" software, which of course was going to be Java and Spring and Hibernate and Restful and all the right buzzwords, because real software isn't written in PHP of course (though our code solved the problem and didn't require advanced degrees to understand). Six months later and a cast of almost a dozen and no closer to release than at the beginning of the project, the entire project was canned and the project team fired (not me, I was on another project, but I had a friend on that project and she was not a happy camper). I don't know how much money was wasted on that project, but undoubtedly it hurried the demise of that company.

But maybe I'm just not well informed. It wouldn't be the first time, after all. So can anybody point me to a Spring/Hibernate project that is, say, around 80,000 lines of code written in under five months' time by a team of four people, that not only does database access but also does some hard-core hardware-level work slinging massive amounts of data around in a three-box client-server-agent architecture with multiple user interfaces (CLI and GUI/Web minimum)? That can handle writing hundreds of thousands of records per hour then doing complex queries on those records with MySQL without falling over? We did this with BRU Server, thanks to Python and choosing just enough ORM for what we needed (not to mention re-using around 120,000 lines of "C" code for the actual backup engine components), and no more (and no less). The ORM took me a whole five (5) days to write. Five. Days. That's it. Granted, half of that is because of Python and the introspection it allows as part of the very definition of the language. But. Five days. That's how much you save by using Spring/Hibernate over using a language such as Ruby or Python that has proper introspection and doing your own object-relational mapping. Five days. And I submit that the costs of Spring/Hibernate are far, far worse, especially for the 20% of projects that don't map well onto the Spring/Hibernate model, such as virtually everything that I do (since I'm all about system level operations).

-ELG

Saturday, April 27, 2013

Configuring shared access for KVM/libvirt VM's

Libvirt has some nice migration features in the latest RHEL/Centos 6.4 to let you move virtual machines from one server to the other, assuming that your servers share a datastore. But if you try it with VM's set to auto-start on server startup, you'll swiftly run into problems the next time you reboot your compute servers -- the same VM will try to start up on multiple compute servers.

The reality is that ESXi by default locks the VMDK file so that only a single virtual machine can use it at a time, which means that the same VM set to start up on multiple servers will only start on one (whichever wins the race). Libvirtd, by contrast, does *not* include any sort of locking by default. You have to configure a lock manager to do so. In my case, I configured 'sanlock', which has integration with libvirtd. So on each KVM host configured to access the shared VM datastore /shared/datastore:

  • yum install sanlock
  • yum install libvirt-lock-sanlock
Now set up sanlock to start at system boot, and start it up:
  • chkconfig wdmd on
  • chkconfig sanlock on
  • service wdmd start
  • service sanlock start
On the shared datastore, create a locking directory and give it username/ID sanlock:sanlock and permissions for anybody who is in group sanlock to write to it:
  • cd /shared/datastore
  • mkdir sanlock
  • chown sanlock:sanlock sanlock
  • chmod 775 sanlock
Finally, you have to update the libvirtd configuration to use the new locking directory. Edit /etc/libvirt/qemu-sanlock.conf with the following:
  • auto_disk_leases = 1
  • disk_lease_dir = "/shared/datastore/sanlock"
  • host_id = 1
  • user = "sanlock"
  • group = "sanlock"
Everything else in the file should be commented out or a blank line. Host ID must be different for each compute host; I started counting at 1 and counted up for each compute host. And edit /etc/libvirt/qemu.conf to set the lock manager:
  • lock_manager = "sanlock"
(The line is probably already there, just commented out; un-comment it.) At this point, stop all your VM's on this host (or migrate them to another host), and either reboot (to make sure all comes up properly) or just restart libvirtd with
  • service libvirtd restart
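Since the only per-host difference in the sanlock settings is host_id, the file is easy to generate rather than hand-edit. The following is a minimal sketch under stated assumptions: the host names kvm-a, kvm-b and kvm-c are made up, the function names are hypothetical, and the script only prints the file contents, so copying each one to its host is left to you.

```python
def assign_host_ids(hosts):
    """Give each compute host a distinct sanlock host_id, counting up from 1."""
    if len(set(hosts)) != len(hosts):
        raise ValueError("duplicate host name in list")
    return {host: i for i, host in enumerate(sorted(hosts), start=1)}


def render_sanlock_conf(host_id, lease_dir="/shared/datastore/sanlock"):
    """Return the sanlock settings from the steps above for one host."""
    if host_id < 1:
        raise ValueError("host_id must be a positive integer")
    lines = [
        "auto_disk_leases = 1",
        # String values in libvirt config files are double-quoted.
        f'disk_lease_dir = "{lease_dir}"',
        f"host_id = {host_id}",
        'user = "sanlock"',
        'group = "sanlock"',
    ]
    return "\n".join(lines) + "\n"


# Hypothetical compute hosts; substitute your own names.
hosts = ["kvm-a", "kvm-b", "kvm-c"]
for host, host_id in assign_host_ids(hosts).items():
    print(f"# ----- {host} -----")
    print(render_sanlock_conf(host_id))
```

Sorting the host list before numbering keeps the assignment stable when the script is re-run, which matters because two hosts sharing a host_id defeats the locking.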
Once you've done this on all servers, try starting up a virtual machine you don't care about on two different servers at the same time. The second attempt should fail with a locking error.

At the end of the process it's always wise to shut down all your virtual machines and re-start your entire compute infrastructure that's using the sanlock locking to make sure everything comes up correctly. So-called "bounce tests" are painful, but they're the only way to be *sure* things won't go AWOL at system boot.

If you have more than three compute servers I *strongly* suggest that you go to an OpenStack cloud instead, because things become unmanageable swiftly using this mechanism. At present the easiest way to deploy OpenStack appears to be Ubuntu, which has pre-compiled binaries on both their LTS and current distribution releases for OpenStack Grizzly, the latest production release of OpenStack as of this writing. OpenStack takes care of VM startup and shutdown cluster-wide and simply won't start a VM on two different servers at the same time. But that's something for another post.

-ELG

Friday, April 26, 2013

On spinning rust and SSD's.

I got my Crucial M4 512GB SSD back for my laptop. It failed about three weeks ago: when I turned on my laptop, it simply wasn't there. Complete binary failure mode -- it worked, then it didn't work. So I took it out of the laptop, verified in an external USB enclosure that it didn't "spin up" there either, installed a 750GB WD Black 7200 rpm rust-spinner that was in my junk box for some project or another, and re-installed Windows and restored my backups. Annoying, but not fatal by any means. I've had to get used to the slow speed of spinning rust again versus the blazingly fast SSD, but at least I'm up and running. So this weekend I get to make another full backup, then swap out the rust for the SSD again.

At work I've had to replace several of the WD 2TB Enterprise drives in the new Linux-based infrastructure when smartd started whining about uncorrectable read errors. When StorStac got notification of that sort of thing it re-wrote the sector from the RAID checksums and that usually resolved it. The Linux 3.8 kernel's md RAID6 layer apparently doesn't do that, requiring me to kick the drive out of the md, slide in a replacement, fire off a rebuild, and then haul the drive over to my desktop and slide it in there and run a blank-out (write zeroes to the entire drive). Sometimes that resolves the issue, sometimes the drive really *is* toast, but at least it was an analog error (just one or two bad places on the drive), not a complete binary error (the entire drive just going blammo).

SSD's are the future. The new COW filesystems such as ZFS and BTRFS really don't do too well on spinning rust, because by their very nature they fragment badly over time. That doesn't matter on SSD's, but it does matter with rust-spinners, for obvious reasons. With ZFS you can still get decent performance on rust if you use a second-level SSD cache; that's how I do my backup system here at home (which is an external USB3 hard drive and an internal SSD in my server). BTRFS has no such mechanism at present, but to a certain extent compensates by having a (manual) de-fragmentation process that can be run from time to time during "off" hours. Still, both filesystems clearly prefer SSD to rotational storage. It's just the nature of the beast. And those filesystems have sufficient advantages in terms of functionality and reliability (except in virtualized environments as virtual machine filesystems -- but more on that later) that if your application can afford SSD's, that alone may be the tipping point that makes you go to SSD-based storage rather than rotational storage.

Still, it's clear to me that, at this time, SSD is still an immature technology subject to catastrophic failure with no warning. Rotational storage usually gives you warning, you start getting SMART notifications about sectors that cannot be read, about sectors being relocated, and so forth. So when designing an architecture for reliability, it is unwise to have an SSD be a single point of failure, as is often done for ESXi servers that lack hardware RAID cards supported by ESXi. It might *seem* that SSD is more reliable than rotational storage. And on paper, that may even be true. But the reality is that because the nature of the failures is different, in *reality* rotational storage gives you a much better chance of detecting and recovering from a failing drive than SSD's do. That may, or may not be important for your application -- in RAID it clearly isn't a big deal, since you'll be replacing the drive and rebuilding a new drive anyhow -- but for things like an ESXi boot drive it's something you should consider.
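The "warning" that rotational drives give shows up as non-zero raw counts on a few SMART attributes. Here is a minimal sketch of watching for that, assuming the usual column layout of the `smartctl -A` attribute table. The sample text is invented for illustration rather than captured from a real drive, and the function name and the choice of watched attributes are my own.

```python
# Attributes whose raw count going above zero usually means trouble is starting.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def failing_attributes(smartctl_output):
    """Return {attribute_name: raw_value} for watched attributes with raw > 0."""
    warnings = {}
    for line in smartctl_output.splitlines():
        parts = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the raw value is the tenth column.
        if len(parts) >= 10 and parts[0].isdigit() and parts[1] in WATCHED:
            raw = int(parts[9])
            if raw > 0:
                warnings[parts[1]] = raw
    return warnings


# Invented sample in the shape of `smartctl -A` output.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       8
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       7731
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
"""

print(failing_attributes(SAMPLE))
# {'Reallocated_Sector_Ct': 8, 'Current_Pending_Sector': 2}
```

A check like this is exactly what an SSD that fails in "binary" mode never gets the chance to trip, which is the argument for not making one a single point of failure.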

-ELG

Thursday, April 25, 2013

Irresponsible

I must admit that I have a low opinion of journalists, tech journalists in particular. I've been interviewed several times over the years and only once has the result been accurate. In all the other cases, what I said was spun to fit the journalist's preconceived notion of what the story should be, and to bleep with the truth.

What I cannot understand is why, if a tech journalist cannot interview people in the know because they had to sign a NDA in order to obtain certain assets for a specified price, said journalist would go ahead and publish a story based entirely upon speculation and a single source that may or may not know the details of whatever legal agreements were signed. It's not professional, it's not ethical, and it's not right. But it's the way tech "journalism" is done here in the Silicon Valley. I guess making a living by being unprofessional and unethical doesn't bother some people. So it goes.

-ELG

Monday, April 1, 2013

Taps

> Realm shutdown

Click on the picture for high resolution. Today we decommissioned the only 10gbit Intransa iSCSI storage realm in existence. There were only two ever built, and only one was ever sold. This one was built by Douglas Fong for use by Intransa IT and has 24 4-disk IP-connected disk shelves in six cabinets, for a total of 96 250gb IDE hard drives talking to two SMC/Dell switches via 48 1gbit connections. The SMC/Dell switches are then connected to the two clustered controller units via 10Gbit Ethernet, which then exports iSCSI to the two SMC/Dell switches above it via 10Gbit Ethernet. This whole concept was designed for scale-out storage, when you needed more storage you just added more of the blue boxes (or, later, the grey boxes to the left) and incidentally this also made the result faster.

Two things became clear as I was prepping the changeover from this 2/3rds rack of equipment to 4u worth of generic Linux storage. The first was that the Intransa box was infinitely easier to manage than my 24 disks worth of Linux-based storage, despite having four times as many spindles. This is because the Intransa software did policy-based storage allocation. You told it you wanted a new volume with 5-disk RAID5 or 4-disk RAID10 or whatever, and it went out and either found existing RAID groups and put your new volume there, or found enough disks to create a new RAID group and put your volume there. You didn't have to worry about how to lay out RAID groups or volumes on top of RAID groups and exporting to iSCSI, it all Just Happened.
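The allocation behaviour described above can be sketched in a few lines. This is an illustration of the idea only, not Intransa's implementation: the Realm and RaidGroup classes, the capacity formulas and the first-fit choice are all assumptions of mine.

```python
from dataclasses import dataclass, field


@dataclass
class RaidGroup:
    level: str
    disks: list
    free_gb: int
    volumes: dict = field(default_factory=dict)


class Realm:
    """Toy policy-based allocator: ask for a volume, get a RAID group chosen for you."""

    def __init__(self, free_disks, disk_gb):
        self.free_disks = list(free_disks)
        self.disk_gb = disk_gb
        self.groups = []

    def _usable_gb(self, level, ndisks):
        if level == "raid5":
            return (ndisks - 1) * self.disk_gb   # one disk's worth of parity
        if level == "raid10":
            return (ndisks // 2) * self.disk_gb  # half the disks are mirrors
        raise ValueError(f"unsupported RAID level: {level}")

    def create_volume(self, name, size_gb, level, ndisks):
        # 1. Reuse an existing group that matches the policy and has room.
        for group in self.groups:
            if (group.level == level and len(group.disks) == ndisks
                    and group.free_gb >= size_gb):
                break
        else:
            # 2. Otherwise build a new group out of unused disks.
            usable = self._usable_gb(level, ndisks)
            if usable < size_gb:
                raise RuntimeError("volume is larger than one group of this shape")
            if len(self.free_disks) < ndisks:
                raise RuntimeError("not enough free disks")
            disks = self.free_disks[:ndisks]
            self.free_disks = self.free_disks[ndisks:]
            group = RaidGroup(level, disks, usable)
            self.groups.append(group)
        group.free_gb -= size_gb
        group.volumes[name] = size_gb
        return group


realm = Realm([f"disk{i}" for i in range(96)], disk_gb=250)
realm.create_volume("home", 600, "raid5", 5)    # new 5-disk RAID5 group
realm.create_volume("build", 300, "raid5", 5)   # fits in the same group
realm.create_volume("db", 400, "raid10", 4)     # new 4-disk RAID10 group
print(len(realm.groups), len(realm.free_disks))  # 2 87
```

The caller only states the policy (RAID level and width) and the size; which disks and which group end up holding the volume is the allocator's problem.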

The second thing that became apparent was that this beast was fast -- seriously fast. The orange cable at the top right is the 10Gbit Ethernet cable going to my new infrastructure that I used to migrate the volumes off of this pile of blue boxes. Surprisingly, the limit was my new Linux storage boxes, not the Intransa storage -- I was pulling data off at 200 megabytes/second, the max I could pull in via my two 1Gbit Ethernet connectors. It seems that if you have enough spindles, even 250gb IDE drives can generate a significant number of iops. It would have been interesting to see exactly how fast it was, but unfortunately I'm still working on getting the Intel 10Gbit cards working in the Linux storage servers (I am now going to use copper SFP+ cables, since it is clear that the Intel cards aren't going to work with the optical SFP+ modules that I have), so was restricted to two 1Gbit connections.
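For a sense of how close that 200 megabytes/second is to the wire limit, here is the arithmetic as a tiny sketch. It assumes decimal units (1 Gbit = 1000 Mbit) and ignores Ethernet, TCP and iSCSI overhead, so the real ceiling is somewhat lower than the raw number.

```python
def link_ceiling_mb_per_s(links, gbit_per_link):
    """Raw line rate of N bonded links in megabytes per second (8 bits per byte)."""
    return links * gbit_per_link * 1000 / 8


ceiling = link_ceiling_mb_per_s(2, 1)   # two 1Gbit connections
observed = 200                          # megabytes/second from the migration
print(ceiling)                          # 250.0
print(f"{observed / ceiling:.0%}")      # 80%
```

At 80% of the raw line rate before protocol overhead, the two 1Gbit links were the bottleneck, not the old array.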

Sadly, the pile of dead drives on top of the pile of blue cabinets is one indication of why it's being retired. The 250Gb Maxtor drives in this thing were manufactured in 2004 and were starting to fail. My supply of spare parts was limited. In addition, this beast is horrifically complex -- even the person who built it had trouble getting it up and running the last time it was moved, and our new little startup certainly wouldn't be able to get it up and going by ourselves, so we settled for getting the intellectual property off of it onto our own generic Linux server equipment. Finally, it and the backup replica realm beside it took up a huge amount of space and power; the two Linux servers do in 8U what required an entire rack full of equipment to do with this seven-to-nine-year-old Intransa equipment. So it was time, albeit with a bit of sadness too. Intransa had some great ideas and solid gear. They could not, alas, make money with it.

I played taps on my Irish whistle as the realm shut down.

-- ELG

Saturday, March 30, 2013

Why no cloud?

So I promised I'd explain why I was setting up normal Linux-based storage and normal KVM/ESXi compute servers for our new small business's network rather than an OpenStack private cloud, so I'll do so.
  1. One risky technology per deployment. It's about risk management -- the ability to manage risks in a reasonable manner. If you have multiple risky technologies, the interactions between risks rise exponentially and cause risks to be unmanageable. Normal Linux-based storage is a mature technology with over a decade of active deployment in production environments with the exception of the LIO iSCSI target. I concluded that the LIO iSCSI target was a necessity in our environment because the TGTD target provided with stable Linux distributions has multiple serious deficiencies (see earlier postings) that render it nothing more than a toy, and our legacy infrastructure was based around iSCSI talking to that pile of ancient Intransa blue boxes. So I've reached my limit on new technologies. Meanwhile OpenStack is multiple immature technologies under active development. Add that to LIO and the existing VMware ESX/ESXi servers' need for block storage and I'd require multiple storage networks to mitigate the risks. Which brings up...
  2. Power and space budget. My power and space budget allows for one storage network with a total of 8U of space and 1000 watts of power consumption. I don't have power and space for two storage networks, one for OpenStack and one for ESX/ESXi.
  3. Performance. The majority of what my network provides to end users is file storage via NFS and CIFS. In an OpenStack deployment file servers run as virtual machines talking to back end storage via iSCSI. This scales very well in large installations, but I don't have the power and space budget for a large installation so that's irrelevant. Running the NAS storage stack directly on the storage boxes results in much better responsiveness and real-world performance than running the NAS storage stack on a virtual machine talking to the storage boxes via iSCSI, even if the theoretical performance should be the same. The biggest issue is that this limits the size and performance of any particular data store to one storage box, but the reality is that this isn't a particularly big limitation for our environment, since we have far more iops and storage on a single storage box than any single data store in our environment will use for quite some time. (My rule of thumb is that no ext4 data store will ever be over 1Tb and no xfs data store will ever be over 2Tb, due to various limitations of those filesystems in a NAS environment... any other filesystem runs into issue #1, one risky technology per deployment, and I already hit that with LIO)
  4. Deep understanding of the underlying technologies. The Linux storage stack has been mature for many years now, with the exception of LIO. I know its source code at the kernel level fairly well. If there is an issue, I know how to resolve it, even to the point of poking bytes into headers on disk drive blocks to make things work. Recovery from failure thus is low risk (see #1). OpenStack is a new and immature technology. If there is an issue, we could be down for days while I chase around in the source code trying to figure out what went wrong and how to fix it.
Note that this is *not* a slam on OpenStack as a technology, or saying that you should not use one of the OpenStack cloud providers such as RackSpace or HP. They have massive redundancies in their OpenStack deployment and people on staff who have the expertise to manage it, and do not have to deal with legacy infrastructure requirements such as our ESXi servers with their associated Windows payloads. Plus they are based around a totally different workload. Our in-house workload is primarily a NAS workload for workstations, and our compute workload is primarily a small number of virtualized test servers or build servers for our software in a variety of environments as well as a handful of infrastructure servers to e.g. handle DNS. What OpenStack mostly gives you is the ability to manage massive numbers of storage servers and massive numbers of compute servers and massive numbers of virtual machines on those compute servers, none of which is our local workload.

The workload that RackSpace etc. are supporting is mostly about Big Data and Big Compute in the cloud or about web server farms in the cloud. All of that has far larger space and power requirements than our little two-rack data center can ever provide, and the reality is that we simply use their infrastructure when we have those requirements rather than attempt to replicate their infrastructure in-house. It simply isn't reasonable for a small business to try to replicate RackSpace or Amazon AWS in-house. We don't have the space and power for the massive amount of infrastructure they use to achieve redundancy and reliability, we don't have the requirement for our local workload, and we don't have the in-house expertise. In the end, it's a case of using the appropriate technology for the appropriate task -- and for what I'm attempting to achieve for the local infrastructure of a small business, using NAS-based Linux storage was more appropriate than attempting to shoe-horn our workload into an infrastructure that would give us no more capability for our needs but would cost us in terms of power, space, performance, and maintainability.

-ELG

Sunday, March 24, 2013

Making auto-proxy configuration work

Okay, so I finally got auto-proxy browser configuration to work with ClearOS. It required a couple of different things.

First, you'll need to install the web server plugin in the ClearOS marketplace. Yes, I know you don't want a web server running on your router. But there's not much choice, wpad.dat is served via http on port 80. Just don't add a firewall rule allowing connecting to it from outside your network (note that in ClearOS you have to explicitly allow external access to services) and you'll be fine.

Next, in your DNS configuration on your master DNS server (whether that's on the ClearOS server or elsewhere), set up wpad.yourdomain.com pointing at your ClearOS server. If the ClearOS server is providing DNS that's pretty easy, just use the web interface.

Okay, now we're at the end of what the web interface can do for you. We'll need to do some things via ssh now. Make sure ssh access is turned on in your firewall rules (in the GUI), and ssh in as root/yourpassword.

In /var/www/html create a file 'wpad.dat' with these contents:

function FindProxyForURL(url, host)
{
    return "PROXY 192.168.0.1:3128";
}
Replace the 192.168.0.1 with the actual address of your ClearOS server, and if you're running the content filter, replace the 3128 with 8080.

Now access "http://wpad.yourdomain.com/wpad.dat" with your web browser. You should see that file appear. But that's not going to get your auto-proxy working on Chrome, at least, because Chrome wants a MIME type of application/x-ns-proxy-autoconfig rather than text/plain. So let's go set up the MIME type. In the directory /etc/httpd/conf.d create a file 'wpad.conf' with these contents:
<Files "wpad.dat">
   ForceType application/x-ns-proxy-autoconfig
</Files>

And one 'apachectl restart' later, there you are. Your browsers on the network will now auto-configure their proxy settings to go through the ClearOS proxy.
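If a browser still refuses to pick up the settings, it helps to check what the server is actually sending. Here is a minimal sketch using only the Python standard library; the function names are hypothetical, and the URL you pass in is whatever you set up in DNS.

```python
from urllib.request import urlopen

PAC_MIME = "application/x-ns-proxy-autoconfig"


def is_pac_content_type(header_value):
    """True if a Content-Type header names the PAC MIME type (parameters ignored)."""
    if not header_value:
        return False
    return header_value.split(";")[0].strip().lower() == PAC_MIME


def check_wpad(url):
    """Fetch the wpad file and report whether it is served with the PAC MIME type."""
    with urlopen(url, timeout=5) as response:
        return is_pac_content_type(response.headers.get("Content-Type"))
```

Calling check_wpad("http://wpad.yourdomain.com/wpad.dat") from a machine on your network should return False while Apache is still serving the file as text/plain, and True once the ForceType rule is in effect.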

-ELG


A tale of two routers

One of the things I want to set up at the office to preserve precious Internet bandwidth is a general proxy/firewall box. This would sit between our current infrastructure and the Internet and do firewall-y type things plus provide VPN connectivity. We have an ancient Cisco that is providing VPN connectivity but it requires a proprietary client that is hard to come by unless you have a Cisco support contract, which isn't available for this antique. Given that I have plenty of fairly recent vintage surplus server equipment hanging around plus a few quad-port Ethernet cards it seemed to me to be a no-brainer to slap the cards into a spare server and toss Linux on it and run with that as the router.

The problem, of course, is time and complexity. I know how to use iptables. I know how to set up things like squid and openvpn and dhcp and so forth. But I really don't want to. I really, REALLY don't want to. I have better things to do with my life. So I went out to find general-purpose router distributions that would do all that hard work for me. Given the equipment available to me (mostly Nehalems with 2.4ghz quad-core processors and 6gb of RAM, modest by the standards of modern compute engines, but way more system than our border router needs), I didn't need to settle for one of the tiny little distributions that are intended to fit on flash memory chips on tiny embedded systems. I could put a full-fledged Linux on there. After some research, I settled on two distributions that are based on a stable core distribution: ClearOS, which is based on Centos 6, and Zentyal, which is based on Ubuntu 12.04 LTS. I know how to manage both Centos and Ubuntu since I've used them in production for years, so if all else failed I could take the starting point that the GUI configurator gave me and fix it to work.

Before doing this at the office I had to of course do a proof of concept. And the perfect proof of concept was my home network, which has five PC's and five devices on it as well as the wireless access point and the managed switch both of which have their own addresses. So I grabbed a decommissioned box that had some pretty hefty stats (Core I7-950 with 12gb of RAM) but no drives (since I'd moved the drives to the big file server box), found a pair of 2.5" drives to fit in its front-loading slot to swap out ClearOS and Zentyal, and set to installing.

I'd been playing with Zentyal for a while at work, seeing if I could make sense of whether it would replace our ancient Windows 2003 domain controllers, and so I started with ClearOS. It slid onto my server just fine, locating my expensive 4 port Intel NIC card and the on-board Intel NIC. I configured it to provide separate networks for my wireless and wired networks (so I could monitor what was happening on the wireless network specifically), and route all traffic out the cable modem connection. All was well. I played with the proxy server settings. That worked pretty well, with the bizarre exception that I can't figure out how to make the automatic proxy settings work: I enabled the Apache server and created the correct wpad file and I see Chrome using it in the Apache server logs, but Chrome isn't applying the settings for some reason. Okay, something to check out on Zentyal when I do it. I then configured OpenVPN and installed OpenVPN clients on my Android and iPad (I already know OpenVPN works on Windows, Linux, and Mac, duh). My first couple of attempts to connect from my Android didn't work and I was baffled. Finally I clicked on the firewall module and noticed no rule had been created to allow OpenVPN connections when I configured OpenVPN. Point, click, allow, and all works well. iPad worked fine too once I got the certificates on there, which required using iTunes (bleh!) but at that point the iPad OpenVPN software was up and going. And finally I got the mail relay up and going, which forwards all outbound smtp traffic to my mail server in the proto-cloud which then forwards it onwards. Again there were some interesting limitations -- I see no place to set the name and password to authenticate with the remote smartmail server, for example -- but that's easy enough to fix by hand.

Okay, so there's a couple of small glitches but things pretty much were going smoothly with ClearOS. The main issues I ran into with ClearOS were between my ears, i.e., I didn't RTFM and forgot to set up things that needed setting up or set them up incorrectly. So next I shut down the ClearOS system, slid out its drive, slid the new drive in, and installed Zentyal. That, on the other hand... that was pretty much a disaster. It crashed halfway through the setup wizard. It crashed after it updated. It crashed trying to set up the mail relay. The OpenVPN functionality worked but the user interface left a lot to be desired. I noticed that it'd set my domain to a comcast.net domain and set it back to my own domain, and that pretty much was all she wrote -- it wiped out my VPN, it wiped out the user LDAP directory, and put the system into a completely unusable state.

Which is a shame, because I really wanted to like Zentyal. It is based on a newer and arguably better Linux distribution than ClearOS, and it has some really nice features. But I just can't deal with software that crashes when we're talking about a mission-critical server. It just isn't going to work. There's some places elsewhere in my infrastructure that Zentyal can live, but the border router? Nope. Not happening. So it goes.

Friday, March 22, 2013

Storage migrations

I spent much of today setting up a pair of Linux servers to migrate data off of a 2005-vintage Intransa storage array. The Intransa storage array still works fine, but clearly the end is in sight -- I have a limited supply of spares and three drives died within the past two months alone. So I set up a 10Gb fiber connection between the iSCSI switch for the Intransa array and the iSCSI switch for my new(old) commodity Linux servers (a pair of previous-generation Supermicro 12-disk servers with a 12-disk JBOD apiece), exported iSCSI volumes via lio, told Windows to mirror its various volumes to the new volumes, and let'er rip. Note that I did traditional RAID here because I don't have the cycles or the CPU's to implement an internal cloud for a small company, and that these storage servers are also providing regular file shares via NFS and Samba (CIFS). I deliberately kept things as simple as possible in order to make it more easily manageable. In the process some clear issues with the current Linux storage stack became apparent. Thumbnail summary: The Linux storage stack is to professional storage stacks such as the old Intransa stack (or modern-day HDC or HP stacks) as Soviet toilet paper was to Charmin. Soviet toilet paper could serve as sandpaper -- it was rough, annoying, and it did the job but you certainly didn't like it. Same deal with the Linux storage stack, with the additional caveat that there are some things that antique Intransa gear would do that are pretty much impossible with the current Linux storage stack.

Rather than go off onto a long rant, here's some things to think about:

  1. The Intransa unit integrated the process of creating a volume, assigning it to a volume group (RAID array) that fit your desired policy (creating one if necessary), and then exporting the resulting volume as an iSCSI volume. This Intransa installation has six storage cabinets each with 16 drives, so individually adding 96 drives to RAID arrays and then managing which ones your volumes got placed upon would have been nightmarish. All of this is a multi-step manual process on Linux.
  2. You can create replications to replicate a volume to another Intransa realm (either local or geographically remote) at any point in time after a volume has been created, without taking the volume offline. On Linux, you have to take the volume offline, unexport it from iSCSI and/or NFS, layer drbd on top of it, then tell everybody (iscsi, NFS, fstab) to access the volume at its drbd device name now rather than at the old LVM volume name. Hint: Taking volumes offline to do something this simple is *not* acceptable in a production environment!
  3. Scaling out storage by adding storage cabinets is non-trivial. I had to bring up my storage cabinets one at a time so I could RAID6 or RAID10 the cabinets (depending upon whether I was doing a scale or performance share) without spanning cabinets with my RAID groups, because spanning cabinets with SAS is a Bad Idea for a number of reasons. Policy-based storage management -- it's a great idea.
  4. Better hope that your Linux server doesn't kernel panic, because there's no battery-backed RAM cache to keep unwritten data logged. It still mystifies me that nobody has implemented this idea for the Linux software RAID layer. Well, except for Agami back in 2004, and Intransa back in 2004, neither of which is around anymore, and the hardware that implemented the idea is no longer available in any case. And Agami did it at the disk controller level, actually, while Intransa did it by the simple expedient of entirely bypassing the Linux block layer. These first-generation disk cabinets have each 4-disk shelf IP-connected to a gigabit switch pair that then uplinks via a 10Gb link to the controllers. iSCSI requests flow in via 10Gb from the iSCSI switch pair to the controllers and are processed internally to turn them into volume and RAID requests, which then get turned into disk shelf read/write requests that flow out the network on the other end of the stack. Nowhere in any of this does the Linux block layer come into play. That's why it was so easy to add the battery-backed cache -- no Linux block layer to get in the way.

That last item brings to the forefront the role of the Linux block layer. The Linux block layer is this primitive thing that was created back in the IDE disk days and hasn't advanced much since. There have been attempts via write barriers and other mechanisms to make it work in a more reliable way that doesn't lose filesystems so often, and those efforts have worked to a certain extent. But the reality is that you have lvm and dm and drbd and the various raid layers and iscsi and then filesystems all layered on top like a cake, and making sure that data that comes in at the top of the cake makes it to the disk drives at the bottom without a confectionery disaster in between... well, it's not simple. Just ask the BTRFS team. In private. Out of earshot of young children. Because some things are just too horrible for young ears to hear.
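
To make the policy-based placement from items 1 and 3 in the list above a little more concrete, here is a minimal Python sketch of the idea. It is not Intransa's code and it doesn't drive any real Linux tool; the names (`Pool`, `RaidGroup`, `place`), the 8-disk group width and the 1000 GB disk size are all assumptions made up for illustration. The only rules it encodes are the ones described above: pick a RAID level from the policy, reuse a group that has room, create one if necessary, and never let a group span cabinets.

```python
from dataclasses import dataclass, field


def usable_gb(level, disks, disk_gb):
    """Usable capacity: RAID6 loses two disks to parity, RAID10 loses half."""
    if level == "raid6":
        return (disks - 2) * disk_gb
    return (disks // 2) * disk_gb


@dataclass
class RaidGroup:
    cabinet: int   # every member disk lives in this one cabinet
    level: str     # "raid6" or "raid10"
    free_gb: int   # capacity not yet handed out to volumes


@dataclass
class Pool:
    spare_disks: dict                      # cabinet number -> unused disk count
    groups: list = field(default_factory=list)
    disks_per_group: int = 8               # assumed group width (hypothetical)
    disk_gb: int = 1000                    # assumed disk size (hypothetical)

    def place(self, size_gb, policy):
        """Return the RAID group a new volume lands on, creating one if needed."""
        level = {"capacity": "raid6", "performance": "raid10"}[policy]
        fits = [g for g in self.groups
                if g.level == level and g.free_gb >= size_gb]
        if fits:
            group = min(fits, key=lambda g: g.free_gb)   # tightest fit
        else:
            group = self._new_group(level, size_gb)
        group.free_gb -= size_gb
        return group

    def _new_group(self, level, size_gb):
        capacity = usable_gb(level, self.disks_per_group, self.disk_gb)
        if capacity < size_gb:
            raise ValueError("volume is larger than a single RAID group")
        # Take all disks from one cabinet so the group never spans cabinets.
        for cabinet in sorted(self.spare_disks):
            if self.spare_disks[cabinet] >= self.disks_per_group:
                self.spare_disks[cabinet] -= self.disks_per_group
                group = RaidGroup(cabinet, level, capacity)
                self.groups.append(group)
                return group
        raise ValueError("no cabinet has enough spare disks")
```

With something like `Pool(spare_disks={0: 16, 1: 16})`, a call such as `pool.place(2000, "capacity")` is the whole user-visible step; which disks get grouped, and in which cabinet, is the policy's problem rather than the administrator's.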

And I said I wasn't going to go off on a long rant. Oh well. So anyhow, next thing I'll do is talk about why I went with traditional RAID cabinets rather than creating a storage cloud, the latter of which would have taken care of some of my storage management issues by making it, well, cloudy. But that is a discussion for another day.

-ELG

Thursday, March 14, 2013

The end of Google Reader

So Google Reader is going away, and people has a sad, me included, because it is by far the best RSS reader out there. On the other hand, as someone who has worked in the industry, I can pretty much tell you *why* it is going away (this is speculation on my part, but speculation that matches the actual known facts): It is going away because within months, Google is planning on making some internal infrastructure changes which will completely break Google Reader beyond any hope of repair.

The core problem is that Google Reader is old code. It originated back in Google's early days (well, 2005 is sorta early), when they didn't have any well-defined internal APIs. So Google Reader depends on deep dark secrets of Google's actual infrastructure implementation, rather than using a well-defined internal API that will keep working when the infrastructure changes. The result has been that Google Reader has continually experienced outages for the past five years of its life as the infrastructure changes. It's costing Google money to keep hacking at it to keep it running, and they're not making money on it. And fixing that would require a re-write to a stable API in common with other products, one that wouldn't break every time the infrastructure changes -- something they're not going to do on something that's not making them any money.

I guess the takeaway from this is simple: Turning everything you do into an API is hard, but the alternative is that bit rot will eventually kill your code. This is, for example, why I created a complete virtualization infrastructure for the Intransa virtualized cluster product that hid all details of virtualization behind an API. The reality is that we only needed to touch ESXi in a handful of places -- but by hiding the details of that behind an API, I guaranteed that when we moved to a different virtualization system in the future (such as Xen or KVM) then all that would need changing would be the virtualization API, not any of the internal workings of Storstac. This is also one of the things that Linus has been fairly successful at doing over the past ten years or so with the Linux kernel. He may have broken the block subsystem by applying the big-kernel-lock removal patches, but at least he didn't break the API. Some additional changes have been added to the block API so that filesystems like BTRFS can work better, but the core API still remains the same as it's been for quite a few years now.
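
For the flavor of what "hiding it behind an API" means in practice, here is a small Python sketch. It is emphatically not the Storstac virtualization API, which I'm not reproducing here; `Hypervisor`, `InMemoryHypervisor` and `restart_vm` are hypothetical names, and the backend is an in-memory stand-in rather than real ESXi, Xen, or KVM calls. The point is only the shape: callers are written once against the interface, and swapping hypervisors means writing one new backend class.

```python
from abc import ABC, abstractmethod


class Hypervisor(ABC):
    """The only surface the rest of the product is allowed to touch."""

    @abstractmethod
    def start_vm(self, name: str) -> None: ...

    @abstractmethod
    def stop_vm(self, name: str) -> None: ...

    @abstractmethod
    def is_running(self, name: str) -> bool: ...


class InMemoryHypervisor(Hypervisor):
    """Stand-in backend; a real one would wrap ESXi, Xen, or KVM calls here."""

    def __init__(self):
        self._running = set()

    def start_vm(self, name):
        self._running.add(name)

    def stop_vm(self, name):
        self._running.discard(name)

    def is_running(self, name):
        return name in self._running


def restart_vm(hv: Hypervisor, name: str) -> bool:
    """Caller-side logic: written once, against the interface only."""
    if hv.is_running(name):
        hv.stop_vm(name)
    hv.start_vm(name)
    return hv.is_running(name)
```

Nothing in `restart_vm` knows or cares which backend it was handed, which is exactly the property that keeps bit rot confined to one class when the infrastructure underneath changes.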

But clearly this wasn't done for Google products back in the early part of the '00s, and now it's just too difficult to maintain the less-used code. Google has to upgrade their infrastructure -- and how their infrastructure works -- in order to continue to scale. What that means is that products that weren't written to a consistent internal shared API are going to continue getting sloughed off, unless there is enough interest (and possible money to be made) to justify a re-write against a stable API. That's just how reality works. Oh well.

-ELG