Confessions of a Linux Penguin: March 2013

Saturday, March 30, 2013

Why no cloud?

So I promised I'd explain why I was setting up normal Linux-based storage and normal KVM/ESXi compute servers for our new small business's network rather than an OpenStack private cloud, so I'll do so.

1. One risky technology per deployment. It's about risk management -- the ability to manage risks in a reasonable manner. If you have multiple risky technologies, the interactions between risks rise exponentially and cause risks to be unmanageable. Normal Linux-based storage, with the exception of the LIO iSCSI target, is a mature technology with over a decade of active deployment in production environments. I concluded that the LIO iSCSI target was a necessity in our environment because the TGTD target provided with stable Linux distributions has multiple serious deficiencies (see earlier postings) that render it nothing more than a toy, and our legacy infrastructure was based around iSCSI talking to that pile of ancient Intransa blue boxes. So I've reached my limit on new technologies. Meanwhile OpenStack is multiple immature technologies under active development. Add that to LIO and the existing VMware ESX/ESXi servers' need for block storage and I'd require multiple storage networks to mitigate the risks. Which brings up...
2. Power and space budget. My power and space budget allows for one storage network with a total of 8U of space and 1000 watts of power consumption. I don't have power and space for two storage networks, one for OpenStack and one for ESX/ESXi.
3. Performance. The majority of what my network provides to end users is file storage via NFS and CIFS. In an OpenStack deployment file servers run as virtual machines talking to back end storage via iSCSI. This scales very well in large installations, but I don't have the power and space budget for a large installation so that's irrelevant. Running the NAS storage stack directly on the storage boxes results in much better responsiveness and real-world performance than running the NAS storage stack on a virtual machine talking to the storage boxes via iSCSI, even if the theoretical performance should be the same. The biggest issue is that this limits the size and performance of any particular data store to one storage box, but the reality is that this isn't a particularly big limitation for our environment, since we have far more IOPS and storage on a single storage box than any single data store in our environment will use for quite some time. (My rule of thumb is that no ext4 data store will ever be over 1TB and no xfs data store will ever be over 2TB, due to various limitations of those filesystems in a NAS environment... any other filesystem runs into issue #1, one risky technology per deployment, and I already hit that with LIO.)
4. Deep understanding of the underlying technologies. The Linux storage stack has been mature for many years now, with the exception of LIO. I know its source code at the kernel level fairly well. If there is an issue, I know how to resolve it, even to the point of poking bytes into headers on disk drive blocks to make things work. Recovery from failure thus is low risk (see #1). OpenStack is a new and immature technology. If there is an issue, we could be down for days while I chase around in the source code trying to figure out what went wrong and how to fix it.

Note that this is *not* a slam on OpenStack as a technology, or saying that you should not use one of the OpenStack cloud providers such as RackSpace or HP. They have massive redundancies in their OpenStack deployment and people on staff who have the expertise to manage it, and do not have to deal with legacy infrastructure requirements such as our ESXi servers with their associated Windows payloads. Plus they are based around a totally different workload. Our in-house workload is primarily a NAS workload for workstations, and our compute workload is primarily a small number of virtualized test servers or build servers for our software in a variety of environments as well as a handful of infrastructure servers to e.g. handle DNS. What OpenStack mostly gives you is the ability to manage massive numbers of storage servers and massive numbers of compute servers and massive numbers of virtual machines on those compute servers, none of which is our local workload.

The workload that RackSpace etc. are supporting is mostly about Big Data and Big Compute in the cloud or about web server farms in the cloud. All of that has far larger space and power requirements than our little two-rack data center can ever provide, and the reality is that we simply use their infrastructure when we have those requirements rather than attempt to replicate their infrastructure in-house. It simply isn't reasonable for a small business to try to replicate RackSpace or Amazon AWS in-house. We don't have the space and power for the massive amount of infrastructure they use to achieve redundancy and reliability, we don't have the requirement for our local workload, and we don't have the in-house expertise. In the end, it's a case of using the appropriate technology for the appropriate task -- and for what I'm attempting to achieve for the local infrastructure of a small business, using NAS-based Linux storage was more appropriate than attempting to shoe-horn our workload into an infrastructure that would give us no more capability for our needs but would cost us in terms of power, space, performance, and maintainability.

-ELG

Sunday, March 24, 2013

Making auto-proxy configuration work

Okay, so I finally got auto-proxy browser configuration to work with ClearOS. It required a couple of different things.

First, you'll need to install the web server plugin in the ClearOS marketplace. Yes, I know you don't want a web server running on your router. But there's not much choice: wpad.dat is served via HTTP on port 80. Just don't add a firewall rule allowing connections to it from outside your network (note that in ClearOS you have to explicitly allow external access to services) and you'll be fine.

Next, in your DNS configuration on your master DNS server (whether that's on the ClearOS server or elsewhere), set up wpad.yourdomain.com pointing at your ClearOS server. If the ClearOS server is providing DNS that's pretty easy, just use the web interface.

Okay, now we're at the end of what the web interface can do for you. We'll need to do some things via ssh now. Make sure ssh access is turned on in your firewall rules (in the GUI), and ssh in as root/yourpassword.

In /var/www/html create a file 'wpad.dat' with these contents:

function FindProxyForURL(url, host)
{
    return "PROXY 192.168.0.1:3128";
}

Replace the 192.168.0.1 with the actual address of your ClearOS server, and if you're running the content filter, replace the 3128 with 8080.
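
If you'd rather not push traffic for machines on your own LAN through the proxy, the same function can return "DIRECT" for those hosts. Here's a rough sketch of that idea. The rule for what counts as "local" (a hostname with no dot in it, or the loopback address) is purely my assumption for illustration, and the proxy address is the same placeholder as above.

```javascript
// wpad.dat sketch: connect directly to local names, proxy everything else.
// The "local" rules and the proxy address are illustrative assumptions.
function FindProxyForURL(url, host) {
    var proxy = "PROXY 192.168.0.1:3128";  // placeholder; use your ClearOS address

    // Hostnames with no dot (e.g. "fileserver", "localhost") are assumed
    // to live on the local network, so skip the proxy for them.
    if (host.indexOf(".") === -1) {
        return "DIRECT";
    }

    // Never send loopback traffic to the proxy.
    if (host === "127.0.0.1") {
        return "DIRECT";
    }

    // Everything else goes through the proxy.
    return proxy;
}
```

With this version, a request for "http://fileserver/" goes direct while "http://www.example.com/" goes through the proxy.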

Now access "http://wpad.yourdomain.com/wpad.dat" with your web browser. You should see that file appear. But that's not going to get your auto-proxy working on Chrome, at least, because Chrome wants a MIME type of application/x-ns-proxy-autoconfig rather than text/plain. So let's go set up the MIME type. In the directory /etc/httpd/conf.d create a file 'wpad.conf' with these contents:

<Files "wpad.dat">
    ForceType application/x-ns-proxy-autoconfig
</Files>

And one 'apachectl restart' later, there you are. Your browsers on the network will now auto-configure their proxy settings to go through the ClearOS proxy.
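
If you want to confirm the MIME type without poking around in a browser, something like the following should do it. This is a sketch that assumes Node.js 18 or later (for the built-in fetch); the function names are mine, and the hostname in the usage comment is just the placeholder domain used above.

```javascript
// MIME type that browsers expect for a proxy auto-config file.
const PAC_MIME_TYPE = "application/x-ns-proxy-autoconfig";

// Returns true if a Content-Type header value names the PAC MIME type,
// ignoring case and any parameters such as "; charset=UTF-8".
function isPacContentType(headerValue) {
    if (!headerValue) {
        return false;
    }
    const mimeType = headerValue.split(";")[0].trim().toLowerCase();
    return mimeType === PAC_MIME_TYPE;
}

// Fetches the URL and reports the HTTP status, the Content-Type header,
// and whether that header matches the PAC MIME type.
// Relies on the global fetch() available in Node.js 18 and later.
async function checkWpad(url) {
    const response = await fetch(url);
    const contentType = response.headers.get("content-type");
    return {
        status: response.status,
        contentType: contentType,
        servedAsPac: isPacContentType(contentType)
    };
}

// Usage (placeholder hostname):
// checkWpad("http://wpad.yourdomain.com/wpad.dat").then(console.log);
```

If servedAsPac comes back false and contentType is text/plain, the ForceType block isn't being picked up.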

-ELG

A tale of two routers

One of the things I want to set up at the office to preserve precious Internet bandwidth is a general proxy/firewall box. This would sit between our current infrastructure and the Internet and do firewall-y type things plus provide VPN connectivity. We have an ancient Cisco that is providing VPN connectivity but it requires a proprietary client that is hard to come by unless you have a Cisco support contract, which isn't available for this antique. Given that I have plenty of fairly recent vintage surplus server equipment hanging around plus a few quad-port Ethernet cards it seemed to me to be a no-brainer to slap the cards into a spare server and toss Linux on it and run with that as the router.

The problem, of course, is time and complexity. I know how to use iptables. I know how to set up things like squid and openvpn and dhcp and so forth. But I really don't want to. I really, REALLY don't want to. I have better things to do with my life. So I went out to find general-purpose router distributions that would do all that hard work for me. Given the equipment available to me (mostly Nehalems with 2.4GHz quad-core processors and 6GB of RAM, modest by the standards of modern compute engines, but way more system than our border router needs), I didn't need to settle for one of the tiny little distributions that are intended to fit on flash memory chips on tiny embedded systems. I could put a full-fledged Linux on there. After some research, I settled on two distributions that are based on a stable core distribution: ClearOS, which is based on CentOS 6, and Zentyal, which is based on Ubuntu 12.04 LTS. I know how to manage both CentOS and Ubuntu since I've used them in production for years, so if all else failed I could take the starting point that the GUI configurator gave me and fix it to work.

Before doing this at the office I had to of course do a proof of concept. And the perfect proof of concept was my home network, which has five PCs and five devices on it as well as the wireless access point and the managed switch, both of which have their own addresses. So I grabbed a decommissioned box that had some pretty hefty stats (Core i7-950 with 12GB of RAM) but no drives (since I'd moved the drives to the big file server box), found a pair of 2.5" drives to fit in its front-loading slot to swap out ClearOS and Zentyal, and set to installing.

I'd been playing with Zentyal for a while at work, seeing if I could make sense of whether it would replace our ancient Windows 2003 domain controllers, and so I started with ClearOS. It slid onto my server just fine, locating my expensive 4-port Intel NIC and the on-board Intel NIC. I configured it to provide separate networks for my wireless and wired networks (so I could monitor what was happening on the wireless network specifically), and route all traffic out the cable modem connection. All was well. I played with the proxy server settings. That worked pretty well, with the bizarre exception that I can't figure out how to make the automatic proxy settings work. I enabled the Apache server and created the correct wpad file, and I see Chrome fetching it in the Apache server logs, but Chrome isn't applying the settings for some reason. Okay, something to check out on Zentyal when I do it. I then configured OpenVPN and installed OpenVPN clients on my Android and iPad (I already know OpenVPN works on Windows, Linux, and Mac, duh). My first couple of attempts to connect from my Android didn't work and I was baffled. Finally I clicked on the firewall module and noticed no rule had been created to allow OpenVPN connections when I configured OpenVPN. Point, click, allow, and all works well. iPad worked fine too once I got the certificates on there, which required using iTunes (bleh!) but at that point the iPad OpenVPN software was up and going. And finally I got the mail relay up and going, which forwards all outbound SMTP traffic to my mail server in the proto-cloud which then forwards it onwards. Again there were some interesting limitations -- I see no place to set the name and password to authenticate with the remote smarthost, for example -- but that's easy enough to fix by hand.

Okay, so there's a couple of small glitches but things pretty much were going smoothly with ClearOS. The main issues I ran into with ClearOS were between my ears, i.e., I didn't RTFM and forgot to set up things that needed setting up or set them up incorrectly. So next I shut down the ClearOS system, slid out its drive, slid the new drive in, and installed Zentyal. That, on the other hand... that was pretty much a disaster. It crashed halfway through the setup wizard. It crashed after it updated. It crashed trying to set up the mail relay. The OpenVPN functionality worked but the user interface left a lot to be desired. I noticed that it'd set my domain to a comcast.net domain and set it back to my own domain, and that pretty much was all she wrote -- it wiped out my VPN, it wiped out the user LDAP directory, and put the system into a completely unusable state.

Which is a shame, because I really wanted to like Zentyal. It is based on a newer and arguably better Linux distribution than ClearOS, and it has some really nice features. But I just can't deal with software that crashes when we're talking about a mission-critical server. It just isn't going to work. There's some places elsewhere in my infrastructure that Zentyal can live, but the border router? Nope. Not happening. So it goes.

Friday, March 22, 2013

Storage migrations

I spent much of today setting up a pair of Linux servers to migrate data off of a 2005-vintage Intransa storage array. The Intransa storage array still works fine, but clearly the end is in sight -- I have a limited supply of spares and three drives died within the past two months alone. So I set up a 10Gb fiber connection between the iSCSI switch for the Intransa array and the iSCSI switch for my new(old) commodity Linux servers (a pair of previous-generation Supermicro 12-disk servers with a 12-disk JBOD apiece), exported iSCSI volumes via LIO, told Windows to mirror its various volumes to the new volumes, and let 'er rip. Note that I did traditional RAID here because I don't have the cycles or the CPUs to implement an internal cloud for a small company, and that these storage servers are also providing regular file shares via NFS and Samba (CIFS). I deliberately kept things as simple as possible in order to make it more easily manageable. In the process some clear issues with the current Linux storage stack became apparent. Thumbnail summary: The Linux storage stack is to professional storage stacks such as the old Intransa stack (or modern-day HDC or HP stacks) as Soviet toilet paper was to Charmin. Soviet toilet paper could serve as sandpaper -- it was rough, annoying, and it did the job but you certainly didn't like it. Same deal with the Linux storage stack, with the additional caveat that there are some things that antique Intransa gear would do that are pretty much impossible with the current Linux storage stack.

Rather than go off onto a long rant, here's some things to think about:

The Intransa unit integrated the process of creating a volume, assigning it to a volume group (RAID array) that fit your desired policy creating one if necessary (this Intransa installation has six storage cabinets each with 16 drives, so individually adding 96 drives to RAID arrays then managing which ones your volumes got placed upon would have been nightmarish) and then exporting the resulting volume as an iSCSI volume. All of this is a multi-step manual process on Linux.
You can create replications to replicate a volume to another Intransa realm (either local or geographically remote) at any point in time after a volume has been created, without taking the volume offline. On Linux, you have to take the volume offline, unexport it from iSCSI and/or NFS, layer drbd on top of it, then tell everybody (iscsi, NFS, fstab) to access the volume at its drbd device name now rather than at the old LVM volume name. Hint: Taking volumes offline to do something this simple is *not* acceptable in a production environment!
Scaling out storage by adding storage cabinets is non-trivial. I had to bring up my storage cabinets one at a time so I could RAID6 or RAID10 the cabinets (depending upon whether I was doing a scale or performance share) without spanning cabinets with my RAID groups, because spanning cabinets with SAS is a Bad Idea for a number of reasons. Policy-based storage management -- it's a great idea.
Better hope that your Linux server doesn't kernel panic, because there's no battery-backed RAM cache to keep unwritten data logged. It still mystifies me that nobody has implemented this idea for the Linux software RAID layer. Well, except for Agami back in 2004, and Intransa back in 2004, neither of which are around anymore and where the hardware that implemented this idea is no longer available even if they were. And Agami did it at the disk controller level, actually, while Intransa did it by the simple expedient of entirely bypassing the Linux block layer. These first-generation disk cabinets have each 4-disk shelf IP-connected to a gigabit switch pair that then uplinks via a 10Gb link to the controllers, iSCSI requests flow in via 10Gb from the iSCSI switch pair to the controllers, are processed internally to turn them into volume and RAID requests which then get turned into disk shelf read/write requests that flow out the network on the other end of the stack, and nowhere in any of this does the Linux block layer come into play. That's why it was so easy to add the battery-backed cache -- no Linux block layer to get in the way.

The last of which brings to the forefront the role of the Linux block layer. The Linux block layer is this primitive thing that was created back in the IDE disk days and hasn't advanced much since. There have been attempts via write barriers and other mechanisms to make it work in a more reliable way that doesn't lose filesystems so often, and those efforts have worked to a certain extent, but the reality is that you have lvm and dm and drbd and the various raid layers and iscsi and then filesystems all layered on top like a cake, and making sure that data that comes in at the top of the cake makes it to the disk drives at the bottom without a confectionery disaster in between... well, it's not simple. Just ask the BTRFS team. In private. Out of earshot of young children. Because some things are just too horrible for young ears to hear.

And I said I wasn't going to go off on a long rant. Oh well. So anyhow, next thing I'll do is talk about why I went with traditional RAID cabinets rather than creating a storage cloud, the latter of which would have taken care of some of my storage management issues by making it, well, cloudy. But that is a discussion for another day.

-ELG

Thursday, March 14, 2013

The end of Google Reader

So Google Reader is going away, and people has a sad, me included, because it is by far the best RSS reader out there. On the other hand, as someone who has worked in the industry, I can pretty much tell you *why* it is going away (this is speculation on my part, but speculation that matches the actual known facts): It is going away because within months, Google is planning on making some internal infrastructure changes which will completely break Google Reader beyond any hope of repair.

The core problem is that Google Reader is old code. It originated back in Google's early days (well, 2005 is sorta early), when they didn't have any well-defined internal API's. So Google Reader depends on deep dark secrets of Google's actual infrastructure implementation, rather than using a well-defined internal API that will keep working when the infrastructure changes. The result has been that Google Reader has continually experienced outages for the past five years of its life as the infrastructure changes. It's costing Google money to keep hacking at it to keep it running, and they're not making money on it. And fixing that would require a re-write to a stable API in common with other products that wouldn't break anytime that the infrastructure changes -- something they're not going to do on something that's not making them any money.

I guess the takeaway from this is simple: Turning everything you do into an API is hard, but the alternative is that bit rot will eventually kill your code. This is, for example, why I created a complete virtualization infrastructure for the Intransa virtualized cluster product that hid all details of virtualization behind an API. The reality is that we only needed to touch ESXi in a handful of places -- but by hiding the details of that behind an API, I guaranteed that when we moved to a different virtualization system in the future (such as Xen or KVM) then all that would need changing would be the virtualization API, not any of the internal workings of StorStac. This is also one of the things that Linus has been fairly successful at doing over the past ten years or so with the Linux kernel. He may have broken the block subsystem by applying the big-kernel-lock removal patches, but at least he didn't break the API. Some additional changes have been added to the block API so that filesystems like BTRFS can work better, but the core API still remains the same as it's been for quite a few years now.
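
To make the "hide it behind an API" idea concrete, here's a toy sketch. To be clear, this is not the actual Intransa code or API: every class and method name below is made up, and the backends just return strings describing what a real implementation would ask the hypervisor to do.

```javascript
// Hypothetical names throughout; this only illustrates the layering.

// Each backend implements the same small set of operations.
class EsxiBackend {
    startVm(name) { return "esxi: power on " + name; }
    stopVm(name) { return "esxi: power off " + name; }
}

class KvmBackend {
    startVm(name) { return "kvm: start domain " + name; }
    stopVm(name) { return "kvm: shut down domain " + name; }
}

// The rest of the product only ever talks to this class, never to a
// hypervisor directly.
class VirtualizationApi {
    constructor(backend) {
        this.backend = backend;
    }
    start(name) { return this.backend.startVm(name); }
    stop(name) { return this.backend.stopVm(name); }
}

// Caller code is identical no matter which hypervisor sits underneath.
function restartVm(api, name) {
    return [api.stop(name), api.start(name)];
}
```

Swapping hypervisors then means writing one new backend class and changing the one line that constructs VirtualizationApi; restartVm and everything like it stays untouched.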

But clearly this wasn't done for Google products back in the early part of the '00s, and now it's just too difficult to maintain the less-used code. Google has to upgrade their infrastructure -- and how their infrastructure works -- in order to continue to scale. What that means is that products that weren't written to a consistent internal shared API are going to continue getting sloughed off, unless there is enough interest (and possible money to be made) to justify a re-write against a stable API. That's just how reality works. Oh well.

-ELG

Monday, March 4, 2013

Patchwork and maintainability

Way back in the mists of time, I was there at the start of Agami. Agami made a cool NAS system with a filesystem that did things that nobody except NetApp was doing at the time and that, for that matter, no current shipping Linux filesystem will do -- and it was Linux. Specifically, it was Red Hat Enterprise Linux 3 hacked to run the 2.6.7 kernel (because 2.4 simply wouldn't do what Agami wanted to do). I remember that vividly because I was the person who hacked RHEL3 to make it work with the 2.6 kernel -- it required some specific changes to the init scripts run at system boot to look at things where 2.6 put them (in particular, 2.6 added sysfs and moved a bunch of stuff out of procfs) plus some changes to the Linux distribution itself (e.g. modutils).

The reason I mention that is because I was talking to the Director of Software Development of a network security appliance company and mentioned that I'd spent some time recently modifying Intransa's kernel block drivers to work around bugs in the Linux 2.6.32 kernel, specifically, to work around some races under heavy load that had been introduced into block device teardown that would either OOPS you or cause hung I/O. He asked me, puzzled, "why didn't you just fix the kernel bugs?"

Well, it was a fair question. The fact that the races I ran into are a result of the removal of the Big Kernel Lock and would require significant re-factoring of the kernel locks to make them go away, and furthermore that they're hard to reproduce and debug, is one issue. I looked at later kernel versions to see how that played out, and the changes to the block layer were deep and intrusive. It was much simpler to simply modify our software to cope with the misbehavior. But that's not what I answered him with. I answered him "because if we start hacking on the Linux kernel that opens us up to a world of hurt from a maintainability point of view when we want to move to a new kernel version, from a licensing point of view with the GPL... it just isn't the right thing to do."

I was explicitly thinking about Agami's 2.6.7 kernel there. Agami patched 2.6.7 to a fare-thee-well. I'm not sure how many patches in toto they were applying to their kernel, but it was at least in the hundreds. By contrast, the last two appliance companies I've worked for -- Resilience and Intransa -- patched the kernel only if it was completely and utterly unavoidable. When I ported the Resilience kernel patches from RHEL3 to RHEL5, there were fewer than a dozen patches, and they were all to fix driver issues with specific network drivers for obsolete hardware that, alas, we still needed to support. We'd submitted those patches upstream and it turned out that I ended up only needing to apply five patches total; the rest had already been applied upstream. The situation with Intransa's kernel is even simpler. There is a .config file, and one (1) patch that basically exports a kernel API that we need for one of our driver modules. I also apply an upstream LSI mpt2sas driver and an upstream networking driver needed for the specific hardware in our current and next-generation servers, but those are compiled separately as part of our software compile, not as part of the kernel compile (i.e., they're disabled in the kernel compile and isolated in a "3rdparty" directory where they're easy to remove once we transition to a kernel that has the required drivers in it). Everything else is self-contained in our own modules.

The result is that while it was a pain to change kernel versions from the kernel shipped with Intransa StorStac 7.12 to the kernel shipped with Intransa StorStac 7.20, it was a manageable pain -- it took me roughly two weeks of work to figure out what was happening with the block layer locking and around two weeks to create and debug work-arounds in our kernel drivers in the three places that were running into race issues, and the other issues were just a matter of some of the include files moving around. Meanwhile, Agami had painted themselves into such a corner with their heavily hacked kernel (amongst other issues) that it proved exceedingly difficult for them to move off of 2.6.7 even though 2.6.7 had severe stability issues under heavy load -- I had to scale back many of the hardware tests that I wrote for the manufacturing line because they were making the kernel fall over under heavy load, and I was trying to test that the machine was assembled correctly and that the hardware was working correctly; I wasn't trying to test for bugs in the kernel.

These issues come into play mostly when there are quantum shifts in underlying hardware architectures requiring either a new kernel version or significant back-porting of new drivers and kernel features. We had to switch kernel versions between StorStac 7.2 and StorStac 7.11 because of the introduction of Nehalem-based server hardware. We had to switch kernel versions between StorStac 7.12 and StorStac 7.20 because of the introduction of Sandy Bridge based server hardware. In both cases backporting the architecture and driver support back to an earlier kernel would have been *much* more difficult than simply porting the StorStac kernel drivers to the new kernel. And it was all because of the decision to keep our kernel code as independent of the kernel as possible -- if there had been extensive Intransa modifications to the kernel, we could have never done it within the short amount of time that it was done (the 7.12 to 7.20 development cycle was four months -- *including* re-basing to a new Linux distribution).

So I suppose the takeaway from all this is:

If you are applying a lot of patches to the kernel, stop and think of different ways of handling things. The Linux kernel will have bugs. Always. Think of workarounds that will work with those bugs, and continue working once those bugs are fixed. Bonus brownie points if the workarounds also improve performance by, e.g., adding bio pending and free lists that mostly mitigate against needing to kmalloc bio's once the system has been under load for a while.
Keep your own stuff in your own .ko modules, don't go patching mainline kernel code to add your own functionality unless there's just no alternative (such as one NIC driver that did not implement the ability to set the MAC address in the NIC chip, which we needed for cluster failover of the Resilience appliance).
If you need to really hack on a kernel subsystem, either a) create a new module by a different name, or b) consider different solutions.
And above all: Consider maintainability from the beginning. Because if you don't, you, too, can go out of business within two years of delivering your first (and last) product...
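
The bio free-list trick mentioned in the first point is ordinary object pooling: hand back a previously used object when one is available and only allocate when the free list is empty. Here's a rough illustration, written in JavaScript rather than kernel C purely to keep it short. The ObjectPool name and the allocation counter are made up for the example, and a real driver would be managing struct bio under its own locking.

```javascript
// Minimal free-list pool: reuse released objects instead of allocating
// a fresh one on every request.
class ObjectPool {
    // create: function that builds a new object when the free list is empty
    // (the stand-in for an expensive allocation such as kmalloc).
    constructor(create) {
        this.create = create;
        this.freeList = [];
        this.allocations = 0;  // counts how often we really had to allocate
    }

    // Take an object from the free list, allocating only if it is empty.
    get() {
        if (this.freeList.length > 0) {
            return this.freeList.pop();
        }
        this.allocations += 1;
        return this.create();
    }

    // Return an object to the free list so a later get() can reuse it.
    put(obj) {
        this.freeList.push(obj);
    }
}
```

Once the pool has warmed up under load, most get() calls are satisfied from the free list and the allocation counter stops climbing.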

-ELG
