Confessions of a Linux Penguin
<p>
In which a long-time Linux geek talks technology
<p>
<b>Eating the seed corn: Red Hat Software, CentOS, and the end of an era</b> (2021-04-02)
<p>
Or: Why we are migrating to Ubuntu Linux.
<p>
So, why did Linux win the server wars? Why is virtually every cloud server running Linux now, rather than some variety of Windows Server? Why do probably the majority of corporate web servers, along with a large variety of other internal application servers, now run Linux?
<p>
Basically, Linux won the server wars because Linux won the poor college student market. Poor college students couldn't afford a copy of Windows Server to run on their own computer. They could afford a copy of Linux to run on their own computer. And twenty years later, they are the decision makers in IT departments who decide what technology to use.
<p>
So anyhow: After being one of those poor college students some years ago, I'm now one of those decision makers. For example, I spec'ed and got quotes for $16,000 worth of server equipment last week, as well as $1500 worth of software, and over the next few years will be doing the same for a six-figure amount of equipment and software. Granted, this isn't huge in IT terms, but multiply me by thousands, and it adds up quickly.
<p>
And one of the things I spec'ed was the embedded OS in our appliance. In 2010 I chose to use CentOS Linux for the embedded distribution in our appliance for a number of reasons:
<ol>
<li> Licensing costs. The licensing terms for Red Hat Enterprise Linux really are oriented around servers. Their per-server costs would make the appliance unprofitable. Furthermore, we're using an extremely stripped-down Linux distribution for our appliance, so there really is no value we would get from Red Hat support. We went with a community distribution (CentOS) rather than a commercial distribution (RHEL or SUSE) to reflect the value that we get from the support organization around the distribution.
<li> Stream of security patches. We don't have a team to watch CVEs and figure out whether they apply to our stripped-down Linux distribution. Instead, we watch the stream of RPMs coming out of the distribution and apply the ones applicable to our distribution on a regular basis.
<li> Length of support. Ten years is a long time. It means we don't have to re-base our software that runs on top of the embedded OS to a new Linux distribution very often. This means less work and more play. Playing with new technologies to add to our product is a lot more fun than the grunt work of porting our software to a new Linux distribution.
<li> Continuity with earlier versions of Red Hat Enterprise Linux. If you knew how to set up advanced networking on EL5, you know how to set it up on EL7 (with some caveats such as having to disable NetworkManager first). This made the job of re-basing our software considerably easier than with distributions that moved things around in major ways with each release.
<li> A commitment in writing and in all major press outlets by Red Hat Software, when they acquired the CentOS project in 2014, to continue the mission of the CentOS project unchanged. That commitment is what led us to transition from CentOS 6 to CentOS 7 when it arrived and stabilized.
</ol>
Fundamentally, Red Hat Software has broken all of the above with recent decisions, most notably the decision to discontinue CentOS 8 as of the end of 2021.
<ol>
<li> CentOS Stream is not a community distribution. It is a beta test stream for RHEL. As such, it is not an appropriate base for an embedded appliance distribution. So we will have to move to another community distribution, and Ubuntu is the community distribution that has the highest mindshare right now.
<li> One reason why CentOS Stream is not a community distribution is that CentOS Stream will not be a reliable source of security patches. Yes, I understand that Facebook has their own internal distribution based on CentOS Stream. We are not Facebook. We do not have infinite money like Facebook. We do not have a team to watch the list of RPMs and apply only the finalized, tested versions to our appliance distribution. Yet another reason why CentOS Stream is not appropriate for us.
<li> One thing that kept us on CentOS rather than on Ubuntu LTS was that CentOS releases were supported for twice as long as Ubuntu LTS releases. Since that's no longer true, there's no reason to stay with a RHEL-derived distribution.
<li> Red Hat Software broke significant continuity with earlier versions of RHEL with Enterprise Linux 8. For example, they removed every Java servlet container from the distribution other than their own JBoss application server. You use Tomcat? Tough. You'll have to install it from scratch and keep it updated yourself when security patches happen. And those who use the BTRFS filesystem are completely left out.
<li> And finally, Red Hat Software has proven that they will break their commitments. So.
</ol>
And Red Hat's explanation? "It was putting price pressure on Red Hat’s ability to capture some of the value that we create."
<p>
Uhm, no. Because the users of CentOS, by and large, are not going to buy Red Hat Enterprise Linux. Buying RHEL does not comport with their business model if they're a business, or with their budget if they're a poor college student. They are going to move to another community distribution, either another rebuild of RHEL such as Oracle Linux or Rocky Linux, or to a different distribution altogether such as Debian or Ubuntu.
<p>
So why are we choosing Ubuntu LTS rather than some other distribution?
<ol>
<li> Oracle Linux: I've evaluated Oracle Linux. They've made sufficient changes to make it difficult to transition from CentOS, plus have broken a few things along the way. And they don't (can't) fix the fact that Enterprise Linux 8 broke significant continuity with earlier versions of RHEL. They can't do that and maintain compatibility with RHEL 8.
<li> Then there's the general problem with all of the RHEL rebuilds: their mindshare and market share are declining drastically. CentOS is really bad as a desktop operating system, so the students we get in as interns, if they know Linux at all, usually know Ubuntu Linux or one of its derivatives. Combined with its derivatives, that makes Ubuntu the most popular desktop Linux right now. For desktop Linux use, <a href="https://docs.microsoft.com/en-us/windows/wsl/about">WSL</a> has made Ubuntu the most common Linux on the desktop in our office despite the fact that we work with CentOS or CentOS-derived distributions as our target.
<li> Debian Linux is quite stable, but if security fixes require effort to backport to the previous stable release, they usually aren't backported. This makes it hard for us to keep our product secure without transitioning to a newer distribution.
<li> Fedora -- it moves too fast. Decent desktop OS, but not a good fit for an embedded appliance.
<li> The ubiquity of Ubuntu. It is available as a supported option, and fully cloud-aware, on all major cloud platforms -- Google, Microsoft, and Amazon. To create a virtual machine with many of the other Linux distributions you must first locate a third-party image, then install that possibly untrusted third-party image. And often the third-party image is not cloud aware -- e.g. it doesn't have cloud-init installed or the native cloud tools installed, making it hard to bring up new virtual machines in an automated fashion. Ubuntu is a Tier 1 supported distribution everywhere that matters.
</ol>
Thus Ubuntu.
<p>
In general, the mindshare issue -- where Red Hat's mindshare amongst potential employees is declining rapidly while Ubuntu's mindshare is becoming near ubiquitous -- makes it almost a no-brainer for us. Instead of attempting to port our software stack from CentOS 7 to a RHEL 8 clone as CentOS 7 approaches end-of-life, we will instead port it to Ubuntu 20.04 LTS. We use a configuration management tool to actually create our appliance (we plop the deliverables onto the CM server and it configures a copy of the base OS with our changes), and the CM tool we use hides most distribution details from us, so the job of porting is mostly going to be one of adjusting package names and occasional file locations in the manifest used to create the appliance. Furthermore, we avoid the Tomcat issue: Ubuntu ships with Tomcat, while RHEL 8 does not. Granted, that's a fairly minor issue (we already have things in our manifest that manually install various software), but why make work for ourselves?
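<p>
To make the "adjusting package names" point concrete, here's a toy sketch of the kind of mapping involved. The package names below are illustrative examples, not our actual manifest, and the helper function is hypothetical:
<pre>
# Illustrative only: a few well-known package-name differences between
# EL7/CentOS 7 and Ubuntu 20.04. The real list lives in our CM manifest.
EL7_TO_UBUNTU = {
    "httpd": "apache2",                  # Apache web server
    "postgresql-server": "postgresql",
    "tomcat": "tomcat9",                 # back in the distribution on Ubuntu
    "openssh-server": "openssh-server",  # many names are unchanged
}

def translate_packages(el7_packages):
    """Map an EL7 package list onto Ubuntu names, flagging anything unknown."""
    ubuntu, unknown = [], []
    for pkg in el7_packages:
        if pkg in EL7_TO_UBUNTU:
            ubuntu.append(EL7_TO_UBUNTU[pkg])
        else:
            unknown.append(pkg)  # needs a human to find the Ubuntu equivalent
    return ubuntu, unknown

print(translate_packages(["httpd", "postgresql-server", "our-internal-rpm"]))
</pre>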
<p>
And our student interns? The ones that aren't already Ubuntu users will become Ubuntu users. And ten years from now, they will be specifying Ubuntu Linux rather than Red Hat Enterprise Linux for their corporation's deployments. Because that's how it works. If you seed your field, crops will rise. If you instead eat your seed corn, as Red Hat's current IBM pointy-headed management is doing ... well, you eat well for this quarter. But good luck reaping a crop at the end of the growing season.
<p>
- ELG
<p>
<b>Red Hat Software has phone company syndrome</b> (2019-10-24)
<p>
Specifically: They don't care, because they don't have to.
<p>
Google "jboss sucks" and you get about 946,000 results. Google "tomcat sucks" and you get about 260,000 results.
<p>
So which one did Red Hat Software omit from Red Hat Enterprise Linux 8, after it had been in the distribution for close to two decades?
<p>
Hint: It's the one that sucks less.
<p>
Yeah, Red Hat kept the one that sucks more (JBoss) and dumped the one that sucks less (Tomcat). Because they want to push their bloated, buggy application server that nobody really likes (JBoss), and if that requires breaking everybody's migration path from RHEL7 to RHEL8, well. They're Red Hat Software / IBM. They don't care. They don't have to.
<p>
Of course, we all know what happened the last time that IBM tried to shove something down people's throats, when they introduced the IBM PS/2 and OS/2. It turns out that people do *not* willingly abandon their old stuff just because the incumbent near-monopoly decides to push their own proprietary junk. Competitors soon outcompeted IBM, and now they no longer make personal computers. Because it turns out that customers will go to the competition if you do customer-hostile things.
<p>
Like Ubuntu.
<p>
Which is likely going to be what I standardize on going forward, because really, if Red Hat is going to break the world every time they do a new release, why bother?
<p>
- ELG
<p>
<b>"But how can I be waiting on a lock when I'm writing an entirely different table?"</b> (2019-05-15)
<p>
Anybody who has run long transactions on PostgreSQL has run into situations where an update on one table, let's call it sphere_config, is waiting on a transaction that is busy writing to another table, let's call it component_state, to complete.
<p>
But how, you ask? How can that be? Those are two different tables, there are no rows in common between the two writes, thus no lock to wait upon!
<p>
Well, here's the deal: the transaction that's now writing component_state wrote to sphere_config *earlier*, and it's still chugging away doing writes to component_state, so it hasn't committed yet.
<p>
But wait, you say, yes, that transaction wrote to sphere_config, but it wrote to records that are not in common with my transaction, so what gives? Well, what gives is this: <i>Uniqueness constraints.</i> Transaction A grabbed a share lock when it wrote to sphere_config, because there's a uniqueness constraint on that table. Until transaction A -- the one that's now writing to component_state -- finishes chugging away doing its things and COMMITs its transaction, its writes to sphere_config aren't visible to transaction B, your transaction that's trying to write to sphere_config. After all, it might ROLLBACK instead because of an error. Thus your transaction B can't update the row it's trying to update, because until transaction A commits or rolls back, Postgres can't tell whether transaction A wrote that same value and your update would violate the uniqueness constraint.
<p>
Now I hear you saying, "but transaction A is operating on a completely different subset of the table, there's no way that anything it's writing can violate uniqueness for transaction B." Well, Postgres doesn't know that, because of the way Postgres time-travels: old row versions stick around so that each transaction sees the table as of the moment it started. So when transaction B goes to write to that table, it tries to grab a share lock on it... but can't, because transaction A still has it. Until transaction A finishes, transaction B is stuck.
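<p>
Here's a minimal sketch of the scenario, in Python with psycopg2, that you can run against a scratch database (the connection string is a placeholder). Transaction A inserts into sphere_config and then keeps writing component_state without committing; transaction B tries to insert the <i>same</i> unique key into sphere_config and blocks until A finally commits:
<pre>
import threading, time
import psycopg2

DSN = "dbname=scratch user=postgres"   # hypothetical connection string

setup = psycopg2.connect(DSN)
with setup.cursor() as cur:
    cur.execute("DROP TABLE IF EXISTS sphere_config, component_state")
    cur.execute("CREATE TABLE sphere_config (name text UNIQUE, value text)")
    cur.execute("CREATE TABLE component_state (id serial, payload text)")
setup.commit()

def transaction_a():
    conn = psycopg2.connect(DSN)
    with conn.cursor() as cur:
        # A writes sphere_config, then keeps chugging away at component_state
        # without committing; its sphere_config row is invisible to everyone else.
        cur.execute("INSERT INTO sphere_config VALUES ('poll_interval', '30')")
        for i in range(10):
            cur.execute("INSERT INTO component_state (payload) VALUES (%s)", (str(i),))
            time.sleep(1)
    conn.commit()      # only now does B get unblocked

def transaction_b():
    time.sleep(2)      # let A get started first
    conn = psycopg2.connect(DSN)
    t0 = time.time()
    with conn.cursor() as cur:
        try:
            # Blocks here: Postgres cannot decide whether this violates the
            # unique constraint until A commits or rolls back.
            cur.execute("INSERT INTO sphere_config VALUES ('poll_interval', '60')")
            print("B: insert succeeded after %.1f seconds" % (time.time() - t0))
        except psycopg2.IntegrityError:
            print("B: duplicate key, but only after waiting %.1f seconds" % (time.time() - t0))
    conn.rollback()

a = threading.Thread(target=transaction_a)
b = threading.Thread(target=transaction_b)
a.start(); b.start(); a.join(); b.join()
</pre>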
<p>
So, how can you deal with this? Well:
<ol>
<li> If you know you're never going to violate the uniqueness constraint, remove it.
<li> If you need the uniqueness constraint, try to break up your big long transactions into smaller transactions. This will clear those locks faster.
<li> <i>Make sure your transactions get terminated</i> by either a COMMIT or a ROLLBACK. Unfortunately some programs in languages that do exceptions have very bad error handling and, when an exception occurs that terminates execution of the database transaction, don't clean up the database connection behind them by issuing a ROLLBACK to the database session. Note that most people are using a connection pool between themselves and Postgres, and that connection pool may not automatically drop a connection when the program throws an exception. Instead, the still-open connection (still unterminated by a ROLLBACK) may simply be put back into the pool for re-use, keeping that lock open on the table. So: Try your darndest to use scaffolding like Spring's service layer that will automatically roll back transactions when there is an uncaught exception that terminates execution of a transaction. And do a lot of praying.
<li> If you are absolutely, positively sure you will never have a long-running transaction, you can add an <i>idle_in_transaction_session_timeout (integer)</i> parameter to your postgresql.conf. Be very careful here, though: if your connection pool doesn't check for dropped connections, you can get *very* ugly user interactions with this! (A sketch covering this item and the previous one follows the list.)
</ol>
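Points 3 and 4 boil down to a small amount of defensive boilerplate. Here's a minimal sketch, again with psycopg2 and a placeholder connection string: the try/except/finally guarantees the session ends in a COMMIT or a ROLLBACK even when an exception unwinds the stack, and the SET shows the session-level form of the idle-in-transaction timeout (the same parameter can also go in postgresql.conf):
<pre>
import psycopg2

DSN = "dbname=scratch user=postgres"   # hypothetical connection string

conn = psycopg2.connect(DSN)
try:
    with conn.cursor() as cur:
        # Belt and braces: if this session ever sits idle inside a transaction
        # for more than five minutes, the server kills it and releases its locks.
        cur.execute("SET idle_in_transaction_session_timeout = '5min'")
        cur.execute("UPDATE sphere_config SET value = '60' WHERE name = 'poll_interval'")
    conn.commit()                  # normal path: terminate with COMMIT
except Exception:
    conn.rollback()                # error path: terminate with ROLLBACK, freeing locks
    raise
finally:
    conn.close()                   # never hand an open transaction back to a pool
</pre>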
That's pretty much it. Uniqueness checking in a time-traveling database like Postgres is pretty hard to do. In MySQL, if you try to write a duplicate into the index, it will collide with the record already there and your transaction will fail at that point. Postgres, by contrast, doesn't ever actually update records in place; it just adds a new record with a higher transaction number and lets the record with a lower transaction number eventually get garbage-collected once all transactions referring to it are finished executing. This allows multiple transactions to have multiple views of the record as of the time that each transaction started, thus ensuring chronological integrity (as well as adding a potential for deadlock, but that's another story), but it definitely makes uniqueness harder to accomplish.
<p>
So now you know how your transaction can be waiting on a lock when you're writing an entirely different table -- uniqueness checking. How... unique. And now that you know, you know how to fix it.
<p>
-ELG
<b>Implementing "df" as a PowerShell script</b> (2019-02-15)
<p>
Note that I first define 'DiskFree', which creates an object for each drive so that I can use it in PowerShell scripts that need disk usage info in an easily digested fashion, then use the list of objects in 'df' to display it in a formatted manner. This is all in my startup profile for PowerShell so I have it handy when I'm at the command line.
This is an example of writing Perl-esque PowerShell on the latest Windows 10. People who were weaned on PowerShell back in the Windows XP days are likely going to be rather appalled because "it doesn't look like PowerShell!". Well, Perl people don't like how my Perl looks either -- "it doesn't look like Perl!" -- and I dunno, people who cling to inscrutable syntax just because are, well, strange. I'm more into clarity and simplicity of reading when it comes to my code.
<pre style='color:#000000;background:#ffffff;'>
<span style='color:#800000; font-weight:bold; '>function</span> <span style='color:#005fd2; '>DiskFree</span><span style='color:#808030; '>(</span><span style='color:#797997; '>$dir</span> <span style='color:#44aadd; '>=</span> <span style='color:#800000; '>"</span><span style='color:#800000; '>"</span><span style='color:#808030; '>)</span> <span style='color:#808030; '>{</span>
<span style='color:#797997; '>$wmiq</span> <span style='color:#44aadd; '>=</span> <span style='color:#800000; '>'</span><span style='color:#0000e6; '>SELECT * FROM Win32_LogicalDisk WHERE Size != Null AND DriveType >= 2</span><span style='color:#800000; '>'</span>
<span style='color:#797997; '>$disks</span> <span style='color:#44aadd; '>=</span> <span style='color:#666616; '>Get-WmiObject</span> <span style='color:#074726; '>-Query</span> <span style='color:#797997; '>$wmiq</span>
<span style='color:#bb7977; '>[System.Collections.ArrayList]</span><span style='color:#797997; '>$newary</span> <span style='color:#44aadd; '>=</span> <span style='color:#808030; '>@(</span><span style='color:#808030; '>)</span>
<span style='color:#800000; font-weight:bold; '>foreach</span> <span style='color:#808030; '>(</span><span style='color:#797997; '>$disk</span> <span style='color:#800000; font-weight:bold; '>in</span> <span style='color:#797997; '>$disks</span><span style='color:#808030; '>)</span> <span style='color:#808030; '>{</span>
<span style='color:#696969; '># Write-Host "Processing " $disk.deviceId</span>
<span style='color:#797997; '>$used</span> <span style='color:#44aadd; '>=</span> <span style='color:#808030; '>(</span> <span style='color:#797997; '>$disk</span><span style='color:#44aadd; '>.</span><span style='color:#005fd2; '>Size</span> <span style='color:#44aadd; '>-</span> <span style='color:#797997; '>$disk</span><span style='color:#44aadd; '>.</span><span style='color:#005fd2; '>FreeSpace</span> <span style='color:#808030; '>)</span>
<span style='color:#797997; '>$percent</span> <span style='color:#44aadd; '>=</span> <span style='color:#bb7977; '>[math]</span><span style='color:#44aadd; '>::</span><span style='color:#005fd2; '>Round</span><span style='color:#808030; '>(</span><span style='color:#797997; '>$used</span> <span style='color:#44aadd; '>/</span> <span style='color:#797997; '>$disk</span><span style='color:#44aadd; '>.</span><span style='color:#005fd2; '>size</span> <span style='color:#44aadd; '>*</span> <span style='color:#008c00; '>100</span><span style='color:#44aadd; '>,</span><span style='color:#008c00; '>1</span><span style='color:#808030; '>)</span>
<span style='color:#797997; '>$freePercent</span> <span style='color:#44aadd; '>=</span> <span style='color:#bb7977; '>[math]</span><span style='color:#44aadd; '>::</span><span style='color:#005fd2; '>Round</span><span style='color:#808030; '>(</span><span style='color:#797997; '>$disk</span><span style='color:#44aadd; '>.</span><span style='color:#005fd2; '>FreeSpace</span> <span style='color:#44aadd; '>/</span> <span style='color:#797997; '>$disk</span><span style='color:#44aadd; '>.</span><span style='color:#005fd2; '>size</span> <span style='color:#44aadd; '>*</span> <span style='color:#008c00; '>100</span><span style='color:#44aadd; '>,</span><span style='color:#008c00; '>1</span><span style='color:#808030; '>)</span>
<span style='color:#800000; font-weight:bold; '>if</span> <span style='color:#808030; '>(</span> <span style='color:#808030; '>(</span><span style='color:#797997; '>$dir</span> <span style='color:#44aadd; '>-ne</span> <span style='color:#800000; '>"</span><span style='color:#800000; '>"</span> <span style='color:#44aadd; '>-and</span> <span style='color:#797997; '>$dir</span> <span style='color:#44aadd; '>-eq</span> <span style='color:#797997; '>$disk</span><span style='color:#44aadd; '>.</span><span style='color:#005fd2; '>DeviceID</span><span style='color:#808030; '>)</span> <span style='color:#44aadd; '>-or</span> <span style='color:#808030; '>(</span><span style='color:#797997; '>$dir</span> <span style='color:#44aadd; '>-eq</span> <span style='color:#800000; '>"</span><span style='color:#800000; '>"</span><span style='color:#808030; '>)</span> <span style='color:#808030; '>)</span> <span style='color:#808030; '>{</span>
<span style='color:#797997; '>$dfObject</span> <span style='color:#44aadd; '>=</span> <span style='color:#666616; '>New-Object</span> <span style='color:#074726; '>-TypeName</span> <span style='color:#0000e6; '>psobject</span>
<span style='color:#797997; '>$dfObject</span> <span style='color:#bb7977; font-weight:bold; '>|</span> <span style='color:#666616; '>Add-Member</span> <span style='color:#074726; '>-MemberType</span> <span style='color:#0000e6; '>NoteProperty</span> <span style='color:#074726; '>-Name</span> <span style='color:#0000e6; '>Device</span> <span style='color:#074726; '>-Value</span> <span style='color:#797997; '>$disk</span><span style='color:#44aadd; '>.</span><span style='color:#005fd2; '>deviceId</span>
<span style='color:#797997; '>$dfObject</span> <span style='color:#bb7977; font-weight:bold; '>|</span> <span style='color:#666616; '>Add-Member</span> <span style='color:#074726; '>-MemberType</span> <span style='color:#0000e6; '>NoteProperty</span> <span style='color:#074726; '>-Name</span> <span style='color:#0000e6; '>Used</span> <span style='color:#074726; '>-Value</span> <span style='color:#797997; '>$used</span>
<span style='color:#797997; '>$dfObject</span> <span style='color:#bb7977; font-weight:bold; '>|</span> <span style='color:#666616; '>Add-Member</span> <span style='color:#074726; '>-MemberType</span> <span style='color:#0000e6; '>NoteProperty</span> <span style='color:#074726; '>-Name</span> <span style='color:#0000e6; '>UsedPercent</span> <span style='color:#074726; '>-Value</span> <span style='color:#797997; '>$percent</span>
<span style='color:#797997; '>$dfObject</span> <span style='color:#bb7977; font-weight:bold; '>|</span> <span style='color:#666616; '>Add-Member</span> <span style='color:#074726; '>-MemberType</span> <span style='color:#0000e6; '>NoteProperty</span> <span style='color:#074726; '>-Name</span> <span style='color:#0000e6; '>Free</span> <span style='color:#074726; '>-Value</span> <span style='color:#797997; '>$disk</span><span style='color:#44aadd; '>.</span><span style='color:#005fd2; '>FreeSpace</span>
<span style='color:#797997; '>$dfObject</span> <span style='color:#bb7977; font-weight:bold; '>|</span> <span style='color:#666616; '>Add-Member</span> <span style='color:#074726; '>-MemberTYpe</span> <span style='color:#0000e6; '>NoteProperty</span> <span style='color:#074726; '>-Name</span> <span style='color:#0000e6; '>FreePercent</span> <span style='color:#074726; '>-Value</span> <span style='color:#797997; '>$freePercent</span>
<span style='color:#797997; '>$dfObject</span> <span style='color:#bb7977; font-weight:bold; '>|</span> <span style='color:#666616; '>Add-Member</span> <span style='color:#074726; '>-MemberTYpe</span> <span style='color:#0000e6; '>NoteProperty</span> <span style='color:#074726; '>-Name</span> <span style='color:#0000e6; '>Total</span> <span style='color:#074726; '>-Value</span> <span style='color:#797997; '>$disk</span><span style='color:#44aadd; '>.</span><span style='color:#005fd2; '>size</span>
<span style='color:#797997; '>$res</span> <span style='color:#44aadd; '>=</span> <span style='color:#797997; '>$newary</span><span style='color:#44aadd; '>.</span><span style='color:#005fd2; '>Add</span><span style='color:#808030; '>(</span><span style='color:#797997; '>$dfObject</span><span style='color:#808030; '>)</span>
<span style='color:#808030; '>}</span>
<span style='color:#696969; '># Write-Host "Processed" $disk.DeviceId</span>
<span style='color:#808030; '>}</span>
<span style='color:#797997; '>$newary</span>
<span style='color:#808030; '>}</span>
<span style='color:#800000; font-weight:bold; '>function</span> <span style='color:#005fd2; '>df</span><span style='color:#808030; '>(</span><span style='color:#797997; '>$dir</span><span style='color:#44aadd; '>=</span><span style='color:#800000; '>"</span><span style='color:#800000; '>"</span><span style='color:#808030; '>)</span> <span style='color:#808030; '>{</span>
<span style='color:#005fd2; '>DiskFree</span><span style='color:#808030; '>(</span><span style='color:#797997; '>$dir</span><span style='color:#808030; '>)</span> <span style='color:#bb7977; font-weight:bold; '>|</span> <span style='color:#666616; '>Format-Table</span> <span style='color:#074726; '>-Property</span> <span style='color:#0000e6; '>Device</span><span style='color:#44aadd; '>,</span><span style='color:#0000e6; '>Used</span><span style='color:#44aadd; '>,</span><span style='color:#0000e6; '>UsedPercent</span><span style='color:#44aadd; '>,</span><span style='color:#0000e6; '>Free</span><span style='color:#44aadd; '>,</span><span style='color:#0000e6; '>FreePercent</span><span style='color:#44aadd; '>,</span><span style='color:#0000e6; '>Total</span> <span style='color:#074726; '>-AutoSize</span>
<span style='color:#808030; '>}</span>
</pre>
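<p>
Called with no argument, <i>df</i> lists every local drive; pass a device ID (for example, <i>df "C:"</i>) to restrict the output to a single drive, or call <i>DiskFree</i> directly when you want the objects rather than the formatted table.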
<b>Amazon Aurora Postgres: First thoughts</b> (2017-10-26)
<p>
Well, I have to say that this was a bit frustrating. I never actually got my database installed into <a href="https://aws.amazon.com/blogs/aws/amazon-aurora-update-postgresql-compatibility/">Aurora Postgres</a> because of some serious limits in Amazon's implementation. Once I found those limits, it became clear that they restricted my operational flexibility to the point where, for my workload, Aurora simply doesn't work.
<p>
The biggest limits are based on the fact that Aurora Postgres doesn't use a filesystem. Rather, Amazon has created a block-based back end for Postgres that allows clustered access to the data store. The data store itself, like EBS, is replicated for performance and redundancy. This has some interesting side effects. Postgres was built around the assumption that the filesystem cache was the primary block cache. You allocate a fairly limited amount of memory to the internal Postgres shared memory pool and leave the rest to be used by the filesystem block cache. Aurora Postgres, on the other hand, must assign that memory to the internal Postgres shared memory pool in order to serve as cache since there is no filesystem and thus no filesystem block cache. Unlike the filesystem block cache pool, Postgres jobs cannot take memory away from the internal shared memory pool in order to accomplish whatever task they are doing. The end result is that internal jobs that require a lot of memory can die with out of memory errors since there's not enough memory outside the Postgres shared memory pool to allocate for that job.
<p>
The other big limitation is that Aurora Postgres has limited space for handling large sorts or indexing operations. Regular Postgres uses a directory, pgsql_tmp, in a tablespace to store temporary heap results for sorts and indexes too big to fit in work_mem (which defaults to only a few megabytes). This temporary space can be as big as your filesystem allows. If, for example, I have 500gb free in my tablespace, I have no trouble sorting an entire 150gb table into an arbitrary order and then exporting it to an external consumer.
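<p>
You can watch this spill-to-disk behavior on stock Postgres with a sketch like the following (the connection string, table, and column names are placeholders): when a sort no longer fits in work_mem, EXPLAIN ANALYZE reports an external merge on disk, and that disk space comes out of pgsql_tmp.
<pre>
import psycopg2

DSN = "dbname=scratch user=postgres"   # hypothetical connection string

conn = psycopg2.connect(DSN)
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("SET work_mem = '4MB'")          # deliberately small
    cur.execute("EXPLAIN (ANALYZE, BUFFERS) "
                "SELECT * FROM big_table ORDER BY some_column")
    for (line,) in cur.fetchall():
        print(line)
    # Look for a line like:  Sort Method: external merge  Disk: 1234567kB
    # On stock Postgres that disk space is pgsql_tmp in the tablespace; on
    # Aurora it has to come out of the instance's limited local storage.
</pre>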
<p>
But remember, Aurora Postgres doesn't have a filesystem for its tablespace; it has a block store. Instead, Aurora Postgres instances that are doing large sorts or indexing large tables use local storage, which is currently 2x the size of memory. That is, if an Aurora database instance has 72gb of memory, you only have 144gb of temporary space. Good luck sorting that 150gb table.
<p>
What this means for me is that Aurora Postgres has some interesting scalability limits when dealing with very large data sets. I'm currently managing about 2 billion rows in Postgres. Needless to say, this requires a lot of very large indexes in order to segment this data space into usable, consumable subsets. Creating these indexes is a slow and tedious process on Aurora Postgres because you have to build them one at a time -- you can't build them in parallel -- due to the lack of temporary space to use for the sort heaps. And if I'm querying and sorting significant subsets of this database, again Aurora Postgres has some serious limits due to the inability to expand pgsql_tmp.
<p>
My guess is that people who are dealing with much fewer rows, but who are querying those rows with much greater frequency, will have a more successful experience with Aurora Postgres. But then they'll run into the IOPS costs. Basically, to get the same IOPS that would cost me $1200/month on EBS, I'd end up paying around $4,000/month on Aurora.
<p>
So: What's the point of Aurora? Aurora does have a couple of positives. You can create additional read replicas virtually instantly, since they're just pointed at the same shared block storage. Failover simply happens, and happens almost instantly. And from a management point of view, Aurora makes the database administrator's job far simpler since you no longer have to closely monitor your tablespaces and expand your block storage as needed (and reallocate tables and indexes across multiple tablespaces using pg_repack) in order to handle growing your dataset. Still, in its current state of development, given its limitations and high costs compared to running your own cluster, I really cannot recommend Aurora Postgres.
<p>
-ELG
<p>
<b>Yes, Virginia, there is a Cloud</b> (2017-05-20)
<p>
So a pundit, attempting to be clever, said "there is no 'Cloud,' it's just a computer sitting in a rack somewhere that you can't see."
<p>
Except that's not true. There <i>is</i> a cloud, and it has nothing to do with that computer sitting in a rack somewhere that you can't see. Rather, it has to do with <i>manageability and services</i> that allow you to ignore the reality of that computer sitting in a rack somewhere and treat infrastructure as a service rather than as a physical piece of hardware.
<p>
Look: There's been dedicated and shared hosting for literally decades now where you could rent time on somebody else's computer that was somewhere out on the Internet. But nobody who had any sense used those for production environments once they got more than a few dozen users, because it made far more sense to host your own hardware at a data center where you could go hands-on in order to manage it. You could make sure your hardware met your durability and performance requirements, you could reconfigure your hardware as needed to add additional capability, and so forth.
<p>
Thing is, all of that is a royal pain in the butt to deal with. Been there, done that, got the four racks of gear in the back room of our shop to prove it. What AWS and other cloud services give us is <i>usable</i> infrastructure as a service, reconfigurable via an easy-to-use web console to meet whatever performance requirements we have. I have constellations of computers on two sides of the continent now, provisioned with whatever combination of CPU and disk space that I need to fulfill my workloads, all done via point and click from my desk in Mountain View, California. I didn't have to go out and spec hardware and purchase it. I didn't have to rack hardware. When I need to burst hardware to process some additional data, I don't need to go out and buy more hardware, then decommission it until the next time I need it, at which point it's just sitting around doing nothing. When I need infrastructure, I provision it. When I don't need it, or I want to upgrade to more performant infrastructure, I de-provision it and provision new infrastructure as needed. And all of this is happening in data centers that are put together with far more redundancy than anything I could afford to put together myself.
<p>
That's what cloud means to me. Yes, it's computers I can't see sitting in racks somewhere, but that's not the part that makes it cloud. It's the infrastructure as a <i>service</i> that makes it cloud. For that matter, it's Internet-connected services, period, that makes it cloud. If it's a service sitting out on the Internet somewhere out of sight of me where I don't have to manually configure hardware and can easily scale as needed, it's cloud. Claiming "it's just computers, dude!" overlooks the point entirely.
<p>
-ELG
<b>"So do that with your smartphone, nerd boy!"</b> (2016-09-30)
<p>
That was the challenge from someone who'd <a href="https://hardware.slashdot.org/story/16/09/28/2321245/commodore-c64-survives-over-25-years-balancing-drive-shafts-in-auto-repair-shop">read the story about an auto shop in Poland still using a Commodore 64 to balance drive shafts.</a>
<p>
The Commodore 64 had a port on the back where you got direct digital signals from a parallel I/O chip (the 6526 CIA). So it was used in a number of embedded applications back in the day, in situations where customers wouldn't actually see that critical tasks were being done by a $150 home microcomputer with a whole 64K of RAM and a 1MHz processor. When I was in school, I got a couple of contracts to do embedded stuff using the Commodore 64. The one I found most interesting was the temperature characterization of a directional drilling probe.
<p>
Directional drilling probes are used to know where the drill bit is when you're doing horizontal drilling of oil or gas wells. We calibrated the probe by mounting it in a testbed that allowed moving it into various positions, and monitoring it with a Commodore 64 bit-banging a two-wire interface. This testbed was in a magnetically calibrated chamber that could be heated or cooled upon demand. The probe itself had seven sensors -- three gravitic sensors (x, y, z), and three magnetic sensors (that aligned with the Earth's magnetic field to point towards magnetic north), in three different orientations (x, y, z), as well as a temperature sensor. These sensors went into A/D converters on the drilling probe itself, and were read out via a two-wire protocol (there were four wires total that went to the probe -- +/- power, and the CLK/DATA lines -- because running wires down a drill string is a PITA and they wanted to run as few wires as possible). The problem is that everything was heat sensitive -- "true north" ( or "true up and down" ) returned a different result from the A/D converters depending upon the temperature. And the further you go down into the Earth, the hotter it gets. You didn't want your directional drill heading off onto someone else's plot of ground just because it got hot, that could be a legal mess!
<p>
So basically, what we did was bake the probe, then watch the signals as it cooled off. A test run consisted of taking the probe up to its maximum operating temperature, pressing the ENTER key on the Commodore 64, and then turning off the oven and letting it cool down. As it cooled down, the Commodore bit-banged the values in from the probe, created a table in memory, and graphed the results on the console. This was done in each of the six orientations of the probe. At the end of the test run, the table was printed out onto a piece of paper to be entered into the calibration software that went with the probe (software that did *not* run on the Commodore 64 -- it ran on a standard PC under MS-DOS, and yes, I wrote that software too, based on equations I was given by their chief scientist).
<p>
So do this with a smartphone? Okay, challenge accepted! Some of the things being done with APRS and Android on ham radio would work here; that's another instance where you're interfacing a smartphone with an analog system. I would use a $25 Arduino board (<a href="https://www.arduino.cc">https://www.arduino.cc</a>) to bit-bang the signals. I would use an $8 Bluetooth adapter for the Arduino that presented itself as a Bluetooth UART. Then I would use the Bluetooth serial profile on the Android phone to actually retrieve the stream of data from the Arduino, process it, display it as pretty graphs on the phone's display and, since this is now the 21st century, send it to a server on the Internet where it's stuck in a database under the particular directional drilling probe's serial number.
<p>
Of course, it'd be just as easy to have the Arduino do that part too, if you choose an Arduino that has a WiFi adapter, and use the phone only to prompt the Arduino to start a test run and to display the pretty graphs being generated on the Internet server. It'd be even easier to use a laptop with built-in Bluetooth. But hey, you challenged me to do it with my phone, so there. :P
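<p>
For the laptop-with-built-in-Bluetooth variant, the data-collection side really is only a few lines. Here's a sketch that assumes the Arduino's Bluetooth UART shows up as a serial port and streams one comma-separated sample per line; the serial device name, upload URL, and probe serial number are all hypothetical placeholders, and it uses the pyserial and requests libraries:
<pre>
import serial            # pyserial: pip install pyserial
import requests

PORT = "/dev/rfcomm0"    # Bluetooth serial device (something like COM5 on Windows)
UPLOAD_URL = "https://example.com/api/proberuns"   # hypothetical server endpoint
PROBE_SERIAL = "DD-1234"                           # hypothetical probe serial number

link = serial.Serial(PORT, 9600, timeout=5)
samples = []
while True:
    line = link.readline().decode("ascii", errors="ignore").strip()
    if not line:                 # timeout with no data: assume the cool-down is over
        break
    # expected format from the Arduino: temp,gx,gy,gz,mx,my,mz
    fields = line.split(",")
    if len(fields) == 7:
        samples.append([float(f) for f in fields])

# push the whole cool-down run to the server, keyed by probe serial number
requests.post(UPLOAD_URL, json={"probe": PROBE_SERIAL, "samples": samples})
print("uploaded", len(samples), "samples")
</pre>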
<p>
-ELG
<p>
<b>Security fail: FedEx.com</b> (2016-08-10)
<p>
One of the first rules of security on the Internet for avoiding phishing attacks is that you never, ever, enter your user credentials into a web site unless you know that you're talking to that web site, and that web site alone -- not some spoofed web site. The way we do this in the modern era is with SSL encryption. https provides not only encryption of the content being transmitted between you and the web site, it also provides authentication: only one server (well, one tightly controlled constellation of servers for bigger web sites) has the private key whose public key is being served to you (and which can be validated against the global public key infrastructure), and that's the server that you think you're talking to. If you type, say, <a href="https://www.google.com">https://www.google.com</a>, you will be on the Google.com homepage. Up in your browser URL bar will be a green lock. Click on that lock and, with some clicking around (it depends on the browser), you will be able to view the certificate and verify that you are, in fact, connected to the one and only Google.com home page owned by the one and only Google. Then, and only then, is it okay to hit the 'Login' link at the top right of the page and log in.
<p>
So, this is how any of us who are concerned about security operate. We don't put in a user name and password unless we see that green lock, click on it, and it says we're talking to who we think we're talking to. This is because of a hacker technique called <i>DNS poisoning</i>, where hackers manage to convince your local name servers that their server, not the real Google.com web server, is where you should go to get to https://www.google.com. They then intercept the user name and password that you enter. Well, they can convince your DNS to give the wrong address, but they don't have Google's private key, so they can't impersonate Google. So you won't get that green lock. But they hope you won't notice. You should.
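<p>
You can do the equivalent of the green-lock check programmatically, too. Here's a sketch using only the Python standard library: opening a TLS connection with a default SSL context fails outright if the certificate doesn't validate against the public key infrastructure or doesn't match the hostname, and otherwise you can print who the certificate says you're talking to.
<pre>
import socket, ssl

HOST = "www.google.com"

ctx = ssl.create_default_context()          # verifies the chain and the hostname
with socket.create_connection((HOST, 443), timeout=10) as sock:
    with ctx.wrap_socket(sock, server_hostname=HOST) as tls:
        cert = tls.getpeercert()
        # subject and issuer are tuples of relative distinguished names
        print("subject:", dict(x[0] for x in cert["subject"]))
        print("issuer: ", dict(x[0] for x in cert["issuer"]))
        print("expires:", cert["notAfter"])
# A DNS-poisoned impostor without the real private key can't get past
# wrap_socket(): it raises ssl.SSLCertVerificationError instead.
</pre>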
<p>
This attack is called <i>phishing</i>, and is used to filch your user name and password, which are then used in an automated fashion on other web sites. Usually after your DNS is poisoned, you get an email telling you to go to http://some.web.site.com because your password is expiring and your account will be deleted. So I got what looked like a phishing email from FedEx.com saying that my account was going to be deleted because I hadn't logged in for over a year, unless I logged in within the next two weeks. This actually would be normal for FedEx -- the only reason to ever log in to their web site is to set up your notification preferences that tell you that a package is on the way, has been delivered, and so forth. So with the possibility that it might actually be a valid email, I manually typed <a href="https://www.fedex.com">https://www.fedex.com</a> into my browser's URL bar (*never* click on a URL in email! Never!), hit the ENTER key... and immediately got kicked out to a non-encrypted site.
<p>
At which point my reaction was, "WTF? Have hackers hacked the FedEx web site and are they grabbing user credentials?" But it appears that's not the case. Resolving the IP address with host and whois shows that it goes into Akamai's site acceleration service. Instead, it looks like pure rank incompetence. FedEx is deliberately putting their customers' user names and passwords at risk because... why? Well, because they're too stupid to know how to implement SSL in an Akamai-distributed architecture, apparently. <a href="https://www.akamai.com/us/en/resources/ssl-security.jsp">Despite the fact that Akamai has explicitly supported SSL for years</a>.
<p>
So anyhow, I use LastPass, so my password was random gibberish in the first place, and after examining the source code of the web page to see if there were obvious problems, I logged in. At that point the web site <i>did</i> put me into a proper SSL-encrypted web page. But the point... the point... is that I should never have had to enter my user name and password into a plain-text unencrypted web page in the first place. There's no -- zero -- way to authenticate that you are actually talking to the site you think you're talking to if it's not https. The only reason I felt even halfway safe talking to this web site was that LastPass had generated me a random gibberish password for it a year ago, so if someone *did* steal my FedEx credentials, at least they could only be used to hack my FedEx account, which would be no big deal (I don't have credit card information or billing information associated with the account; it's strictly an informational account). But still. Bad FedEx. Bad, bad FedEx. Bad DevOps team, no cookie, go to your room!
<p>
The takeaway from this:
<p>
<ol>
<li> Check those green locks. They're important. They tell you a) whether you're talking to the web site you think you're talking to (if the lock is there, you know you are; if it's not, you don't know), and b) that any user names and passwords you enter will go to the web site via an encrypted connection.
<li> <i>Any</i> web site that your company provides should only ask for user names and passwords on an SSL-encrypted https page. If someone tries to go there with a plain http URL, it should immediately be redirected to the SSL site. (A quick way to check this is sketched after this list.)
<li> If you don't follow that last rule, you <i>will</i> be publicly shamed, if not by me, by <i>someone</i>. And a public shaming is <i>never</i> good for your brand.
<li> Furthermore, if you don't follow that last rule, there's many potential customers who will simply refuse to use your web site. This perhaps is not a big deal for FedEx, since most of their customers have no choice but to use their web site to schedule package pickups, but if you're providing a web service to the general public? Dude. You are leaving money on the table if you do something stupid like ask for a username and password on an unencrypted page.
</ol>
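As a quick sanity check of rule 2, here's a sketch using the Python requests library (the hostname is just an example): fetch the plain-http URL without following redirects, and complain unless the very first response is a redirect to an https URL.
<pre>
import requests

HOST = "www.example.com"    # substitute the site you are responsible for

resp = requests.get("http://" + HOST + "/", allow_redirects=False, timeout=10)
location = resp.headers.get("Location", "")

if resp.status_code in (301, 302, 307, 308) and location.startswith("https://"):
    print("OK: plain http is immediately redirected to", location)
else:
    print("FAIL: status", resp.status_code, "redirect target", repr(location))
    print("Users can be asked for credentials over an unauthenticated connection.")
</pre>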
This isn't rocket science, people. This is just basic Web Services 101. Get it right, or choose a different profession. I understand that McDonalds is hiring. Just sayin'.
<p>
-ELG
<b>SSD: This changes everything</b> (2015-09-27)
<p>
So someone commented <a href="http://confessionsofalinuxpenguin.blogspot.com/2015/08/where-does-future-of-enterprise-storage.html">on my last post</a>, where I predicted that providing block storage to VMs and object storage for apps was going to be the future of storage, and he pointed out some of the other ramifications of SSD. To wit: because SSD removes a lot of the I/O restrictions that have held back applications in the past, we are now at the point where the CPU is, in many cases, the restriction. This is especially true since Moore's Law has seemingly gone AWOL: the Westmere Xeon processors in the NAS box on the file cabinet beside my desk aren't much slower than the latest Ivy Bridge Xeon processors, and the slight bump in CPU speed is far exceeded by the enormous bump in IOPS that comes with replacing rotational storage with SSDs.
<p>
I have seen that myself, watching a Grails application max out eight CPU cores while not budging the iometer on a database server running off of SSDs. What that implies is that the days of simply throwing CPU at inefficient frameworks like Grails are numbered. In the future, efficient algorithms and languages are going to come back into fashion to use all this fast storage that is taking over the world.
<p>
But that's not what excites me about SSDs; that's just a shuffling of priorities. What excites me about SSDs is that they free us from the tyranny of the elevator. The elevator is the requirement that we sweep the disk drive heads from the bottom of the disk to the top, then from the top back to the bottom, in order to optimize reads. This in turn puts some severe restrictions on how we lay out block storage -- the data must be laid out contiguously so that filesystems layered on top of the block storage can properly schedule I/O out of their buffers to satisfy the elevator. This in turn means we're stuck with the RAID write hole unless we have battery-backed cache -- we can't do COW RAID stripe block replacement (that is, write the altered blocks of a RAID stripe at some new location on the device, then alter a stripe map table to point at those new locations and add the old locations to a free list) because a filesystem on top of the block device would not be able to schedule the elevator properly. The performance of the block storage system would fall over. That is why traditional iSCSI/Fibre Channel vendors present contiguous LUNs to their clients.
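<p>
The copy-on-write stripe replacement described in that parenthetical is easy to sketch in the abstract. This is a toy illustration of the bookkeeping only, not any vendor's actual implementation: an "overwrite" writes the new stripe somewhere free and then atomically flips one pointer, so a crash can never leave a half-updated stripe visible.
<pre>
# Toy illustration of COW stripe replacement: logical stripes map to physical
# locations; an "overwrite" writes elsewhere and then updates the map.
class CowStripeStore:
    def __init__(self, physical_stripes):
        self.free = list(range(physical_stripes))   # free list of physical stripes
        self.stripe_map = {}                        # logical stripe number to physical location
        self.media = {}                             # stand-in for the physical device

    def write_stripe(self, logical, data):
        new_loc = self.free.pop()           # never overwrite the old copy in place
        self.media[new_loc] = data          # full stripe (data plus parity) written here
        old_loc = self.stripe_map.get(logical)
        self.stripe_map[logical] = new_loc  # the commit point is this single map update
        if old_loc is not None:
            self.free.append(old_loc)       # old copy survives until the map is updated,
                                            # so a torn write never corrupts the stripe

    def read_stripe(self, logical):
        return self.media[self.stripe_map[logical]]

store = CowStripeStore(physical_stripes=8)
store.write_stripe(0, "version 1")
store.write_stripe(0, "version 2")          # old blocks freed only after the remap
print(store.read_stripe(0))                 # prints: version 2
</pre>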
<p>
As a result, when we've tried to do COW in the past, we did it at the filesystem level so that the filesystem could properly schedule the elevator. Thus ZFS and BTRFS. They manage their own redundancy rather than relying on RAID at the block layer, and ideally want to manage the block devices directly. Unfortunately that really doesn't map well to a block storage back end that is based on LUNs, and furthermore, it doesn't map well to virtual machine block devices represented as files on the LUN -- virtual machines all have their own elevators doing what they think are sequential, ordered writes, but the COW filesystems are writing at random places, so read performance inside the virtual machines becomes garbage. Thus VMware's VMFS, which is an extent-based clustered filesystem that, again due to the tyranny of the elevator, keeps the blocks of a virtual machine's virtual disk file located largely contiguously on the underlying block storage so that the individual virtual machines' elevators can schedule properly.
<p>
So VMFS talking to clustered block storage is one way of handling things, but then you run into limits on the number of servers that can talk to a single LUN, which in turn makes things difficult to manage: you end up with hundreds of LUNs for hundreds of physical compute servers and have to schedule the LUNs so they're only active on the compute servers that have virtual machines on that specific LUN (in order to avoid hitting the limits on the number of servers allowed to access a single LUN). What is needed is the ability to allocate block storage on the back end on a per-virtual-machine basis, and to have the same capabilities on that back end that VMFS gives us on a single LUN -- the ability to do snapshots, the ability to do sparse LUNs, the ability to copy snapshots as new volumes, and so forth. And have it all managed by the cloud infrastructure software. This was difficult back in the days of rotational storage because we were slaves of the elevator: we had to make sure that all this storage ended up contiguous. But now we don't -- the <i>writes</i> have to be contiguous, due to the limitations of SSD, but <i>reads</i> don't. And it's the reads that forced the elevator -- scheduling contiguous streams of writes (from multiple virtual machines and multiple files on those virtual machines) has <i>always</i> been easy.
<p>
I suspect this difficulty in managing VMFS on top of block storage LUNs for large numbers of ESXi compute servers is why Tintri decided to write their own extent-based filesystem and serve it as an NFS datastore to ESXi boxes, rather than as block storage LUNs. NFS doesn't have the limits on the number of computers that can connect. But I'm not convinced that, going forward, this is going to be the way to do things. vSphere is a mature product that has likely reached the limits of its penetration. New startups today are raised in the cloud, primarily on Amazon's cloud, and they want a degree of flexibility in spinning virtual machines up and down that makes life difficult with a product that has license limits. They want to be able to spin up entire test constellations of servers to run multi-day tests on large data sets, then destroy them with a keystroke. They can do this with Amazon's cloud. They want to be able to do this on their local clouds too. The future is likely to be based on the KVM/QEMU hypervisor and virtualization layer, which can use NFS data stores but already has the ability to present an iSCSI LUN to a virtual machine as a block device. Add in some local SSD caching at the hypervisor level to speed up writes (as I explained last month), and you have both the flexibility of the cloud and the speed of SSD. You have the future -- a future that few storage vendors today seem to see, but one that the block storage vendors in particular are well equipped to capture if they're willing and able to pivot.
<p>
Finally, there is a question as to whether storage and compute should be separate things at all. Why not have compute in the same box as your storage? There are two problems with that, though: a) you want to upgrade compute capability to faster processors on a regular basis without disrupting your data storage, and b) the density of compute servers is much higher than the density of data servers, i.e., you can put four compute blades into the same 2U space as a 24-bay data server. And as pointed out above, compute power is now going to be the limiting factor for many applications, not IOPS. Finally, you want the operational capability to add more compute servers as needed. When our team used up the full capacity of our compute servers, I just added another compute server -- I had plenty of storage. Because the demand for compute and memory just keeps going up as our team has more combinations of customer hardware and software to test, it's likely I'm going to continue to have to scale compute servers far more often than I have to scale storage servers.
<p>
So this has gone on much too long, but the last thing to cover is this: will storage boxes go the way of the dodo bird, replaced by software-defined solutions like Ceph on top of large numbers of standard Linux storage servers serving individual disks as JBODs? It's possible, I suppose -- but it seems unlikely due to the latency of having to locate disk blocks scattered across a network. I do believe that commodity hardware is going to win everything except the high-end big-iron database business in the end, because the performance of commodity hardware has risen to the point where it's pointless to design your own hardware rather than purchase it off the shelf from a vendor like Supermicro. But there is still going to be a need for a storage stack tied to that hardware, because pure software-defined solutions are unable to do rudimentary things like, e.g., use SES to blink the LED of a disk bay whose SSD has failed. In the end, providing an iSCSI LUN directly to a virtual machine requires both a software support side that is clearly software-defined and a hardware support side where the hardware is managed by the solution. This in turn implies that we'll continue to have storage vendors shipping storage boxes in the future -- albeit storage boxes that will incorporate increasingly large amounts of software that runs on infrastructure servers to provide important functions like, e.g., spinning up a virtual machine that has a volume attached of a given size and IOPS guarantee.
<p>
-ELG
<b>Where does the future of enterprise storage lie?</b> (2015-08-25)
<p>
I've talked about how traditional block and NAS storage isn't going away for small businesses. So what about enterprise storage? In the past few years, we've seen the death of multiple vendors of scale-out block storage. Two that were of interest to me were Coraid and Intransa, both of which allowed chaining together large numbers of Ethernet-connected nodes to scale out storage across a very large array (the biggest cluster we built at Intransa had 16 nodes and a total of 1.5 petabytes of storage, but the theoretical limits of the technology were significantly higher). The reality is that they had been on life support for years, because the 1990s and 2000s were the decades of NAS, not of block storage. Oh, EMC was still heaving lots of big-iron block storage over the wall to power big databases, but most applications of storage other than those big corporate data marts were NAS applications, whether it was Windows and Linux NAS servers at the low end or NetApp NAS servers at the high end.
<p>
NAS was pretty much a necessity back in the era of desktops and individual servers. You could mount people's home directories on a CIFS or NFS share (depending on their OS). People could share their files with each other by simply copying them to a shared directory. You saw block storage devices being exported to these desktops via iSCSI sometimes, but usually block storage devices were attached to physical servers in the back room on dedicated storage networks that were much faster than floor networks. The floor networks were fast enough to carry CIFS, but CIFS at its core is just putting and getting objects, not blocks, and can operate much more asynchronously than a block device and thus wasn't killed by latency the way iSCSI is.
<p>
But there are problems, too. For one thing, every single device has to be part of a single login realm or domain of some sort, because that's how you secure connections to the NAS. Furthermore, people have to be put into groups, and access set on portions of the overall NAS cloud based on what groups a person belongs to. That was difficult enough in the days when you just had to worry about Linux servers and Windows desktops. But now you have all these <i>devices</i>.
<p>
Which brings up the second issue with NAS -- it simply doesn't fit into a device-oriented world. Devices typically operate in a cloud world. They know how to push and pull objects via http, but they don't speak CIFS or NFS, and never will. What we are seeing is that increasingly we are operating in a world that isn't file based, it's object based. When you go into Google Docs to edit a spreadsheet, you aren't reading and writing a file. You're reading and writing an object. When you are running an internal business application, you are no longer loading a physical program and reading and writing files. You're going to a URL for a web app that most likely is talking to a back end database of some kind to load and store objects.
<p>
Now, finally, add in what has happened in the server room. You'll still see the big physical iron for things like database servers. But by and large the remainder of the server room has gone away, replaced by a private cloud, or pushed into a public cloud like Amazon's cloud. Now when people want to put up a server to run some service they don't call IT and work up a budget and wait months for an actual server to be procured etc., they work at the speed of the cloud -- they spin up a virtual machine, they attach block storage to it for the base image and for any database they need beyond object storage, and they implement whatever app they need to implement.
<p>
What this means is that block storage and object storage integrated with cloud management systems like OpenStack are the future of enterprise storage, a future that alas did not arrive soon enough for the vendors of scale-out block storage that survived the previous decade, who ended up without enough capital to enter this brave new world. NAS won't go away entirely, but it will increasingly be a departmental thing feeding desktops on the floor, not something that anything in the server room uses. And that is, in fact, what you see happening in the marketplace today. You see traditional Big Iron vendors like HDS increasingly pushing object storage, and the new solid-state storage vendors such as Pure Storage and Solidfire are predominantly block storage vendors selling into cloud environments.
<p>
So what does the future hold? For one thing, lower latencies via hypervisor integration. Exporting a volume via iSCSI then mounting it via the hypervisor has all of the usual latency issues of iSCSI. Even with 10 gigabit networking now hitting affordability and 25 to 100 gigabit Ethernet in the future, latency is a killer if you're expecting a full round trip. What if writes were cached on a local SSD array, in order, and applied in order? For 99% of the applications out there this provides all the write consistency that you need. The cache will have to be turned off prior to migrating the virtual machine to a different box, of course -- thus the need for hypervisor integration -- but other than a catastrophic failure (where the virtual machine will go lights out also and thus not have inconsistent data when it is restarted on another node) you will, at worst, have some minor data loss -- much better than inconsistent data.
<p>
So: Block storage with hypervisor and cloud management integration, and object storage. The question then becomes: Is there a place for the traditional dedicated storage device (or cluster of devices) in this brave new world? Maybe I'll talk about that next, because it's an interesting question, with issues of data density, storage usage, power consumption, and then what about that new buzzword, "software defined storage"? Is storage really going to be a commodity in the future where everybody's machine room has a bunch of generic server boxes loaded with someone's software? And what impact, exactly, is solid state storage having? Interesting things to think about there...
<p>
-ELG
Eric Lee Greenhttp://www.blogger.com/profile/12350104299041375832noreply@blogger.com2tag:blogger.com,1999:blog-2890503559617005376.post-13486260319368476972015-08-01T22:18:00.000-07:002015-08-02T10:44:53.445-07:00The quest for an integrated storage stackIn prior posts I've mentioned the multitude of problems with the standard Linux storage stack. It's inflexible -- once you've set up a stack (usually LV->VG->PV->MD->BLOCK) and opened a filesystem on it, you cannot modify it to, e.g., add a replication layer to the stack. It lacks the ability to do geographic replication in any reasonable fashion. The RAID layer in particular lacks the ability to write to (and replay) a battery-backed RAM cache to deal with the <a href="http://boink.superatomic.com/2009/04/25/the-raid-5-write-hole/">RAID 5 write hole</a> (which, despite its name, also applies to other RAID levels and results in silently corrupted data). Throw iSCSI into this equation to provide block devices to virtual machines and, potentially, to do replication to block devices on other physical machines, and things get even more complex.
<p>
One method that has been proposed to deal with these issues is to simply not use a storage stack at all. Thus we have ZFS and BTRFS, which attempt to move the RAID layer and logical volume layers into the filesystem. This certainly solves the problem of corrupted data, but at a significant penalty in terms of performance, especially on magnetic media where the filesystem swiftly becomes fragmented. As a result running virtual machines using "block devices" that are actually files on a BTRFS filesystem results in extremely poor "disk" performance on the virtual machines. A file on a log-based subsystem is simply a poor substitute for an extent on a block device. Furthermore, use of these filesystems for databases has proven to be woefully slow compared to using a normal filesystem like XFS on top of a RAID-10 layer.
<p>
The other method that has been proposed is to abandon the Linux storage stack except as a provider of individual block devices and instead layer a distributed system like Ceph on top of it. My tests with Ceph have not been particularly promising. Performance of Ceph block devices at the individual virtual machine level was abysmal. There appear to be three reasons for this: 1) overly pessimistic assumptions about writes on the part of Ceph, 2) the inherent latencies involved in a distributed storage stack, and 3) the fact that Ceph reads/writes via XFS filesystems layered on top of block devices, rather than to extents on raw block devices. For the latter, in my experience you will see *at least* a 10% degradation in virtual machine block device performance if the block device is implemented as a file on top of XFS rather than directly as an LVM extent.
<p>
In both cases, I wonder if we are throwing out the cart because the horse has asthma. I've worked as a software engineer for two of the pioneers of Linux-based storage -- Agami Systems, which did a NAS device with an integrated storage system, and Intransa Inc., which did scalable iSCSI storage systems with an integrated block storage subsystem. Both suffered the usual fate of pioneers -- i.e., face down dead with arrows in the back, though it took longer with Intransa than with Agami. Both wrote storage stacks for Linux which solved most of the problems of the current Linux storage stack, though each solved a different subset of those problems. There are still a significant number of businesses which do not need the expense and complexity of a full OpenStack data center in order to solve their problems, but which do need things like, e.g., logged geographic replication to replicate their data to an offsite location, something which Intransa solved ten years ago (but which, alas, died with Intransa), or real-time snapshots of virtual machine block devices at the host device level, or ...
<p>
In short: Despite the creation of distributed systems like Ceph and integrated storage management filesystems like BTRFS, there is a significant need for an integrated storage stack for Linux -- one that allows flexibility in configuring both block devices and network filesystems, that allows for easy scalability and management, that has modern features such as logged geographic replication and battery-backed RAM cache support (or at least fast SSD log device support at the MD layer), and that allows dynamic insertion of components into the software stack, much as you could create a replication layer in the Intransa StorStac and have it sync and then replicate to a remote device without ever unmounting any filesystem or making the iSCSI target inaccessible. There are simply a large number of businesses which just don't need the expense and complexity of a full OpenStack data center, which indeed don't need more than a pair of iSCSI / NAS storage appliances (a pair in order to handle replication and snapshotting), and the current Linux storage stack lacks fundamental functionality that was implemented over a decade ago but never integrated into Linux itself. It may not be possible to bring all the concepts that Agami and Intransa created into Linux (though I'll point out that all of Intransa's patents are now owned by a patent entity that allows free use for Open Source software), but we should attempt to bring as many of them as possible into the standard Linux storage stack -- because the cloud is the cloud, but most smaller businesses have no need for the cloud; they just need reliable local storage for their local physical and virtual machines.
<p>
-ELGEric Lee Greenhttp://www.blogger.com/profile/12350104299041375832noreply@blogger.com0tag:blogger.com,1999:blog-2890503559617005376.post-64894062685440588662015-04-04T10:53:00.000-07:002015-04-04T10:53:50.931-07:00Network monitoring is not for amateursSo, another blogger recently posted that network monitoring was easy. All you needed to do was deploy off the shelf network monitoring tools X, Y, and Z, and your network would be monitored and you would be able to sleep well at night knowing that your network was working well.
<p>
At which point I have to stop and say... what... the.... bleep?
<p>
I've been doing this a *long* time. In this or prior jobs (where I was a staff engineer at a firewall company in particular) I've deployed Nagios, MRTG, Zenoss, OpenNMS, Big Brother, and a variety of proprietary network monitoring tools such as PRTG and combined firewall/IDS tools such as Checkpoint, as well as specialty tools such as smartd, arpwatch and snort. And what I have to say is that monitoring networks is *hard*. Pretty much any of the big network monitoring tools (other than Nagios, whose configuration is utterly non-automated) will take you 80% of the way to effective monitoring of your network with a few mouse-clicks. But to go to that extra 20% needed to make sure that you're immediately notified if anything your client relies on is not providing services, you end up having to install plugins, write sensors, and combine multiple tools together. It's a lengthy task that can require days of research to solve just 1% of that last 20%.
<p>
Here's an example. I was having issues where one of my iSCSI servers was dropping off the network for thirty seconds or so. I'd learn about it because my users would start howling that their Linux virtual machines were no longer functioning properly. I investigated, and found that when Linux cannot write to a volume for a certain amount of time, it then marks the volume read-only. Which, if the volume is the root volume on a Linux virtual machine, means things go to bleep in a handbasket.
<p>
And Nagios reported no problems during all this time. All the normal Nagios sensors for system load, disk space free, memory free, and so forth showed that the virtual machine was operating completely normally. The virtual machine was pingable, it was responding to SNMP queries, it looked completely healthy to any network monitoring tool in the universe. So I ended up writing a Nagios sensor that 'touched' a file in /, and if it couldn't because the file system had been marked read-only, reported an error. I then deployed it to my virtual machines and added it to my list of sensors to monitor, and when my iSCSI server dropped offline and the filesystems went read-only, I'd get a Nagios alert. (Said iSCSI server, BTW, is not an Intransa iSCSI server, it's a generic SuperMicro box running Linux with the LIO iSCSI stack, but that's irrelevant to the conversation). After some time of getting alerted when things went AWOL I managed to zero in on exactly the set of log file entries on the iSCSI server that indicated what was going on at the time things went to bleep, and discovered that I had a failing disk in the iSCSI system that was not triggering the smartd monitoring tool. Unfortunately figuring out which of those 24 disks was causing the SAS bus reset loop was non-trivial given that smartctl was showing no problems when I looked at them, but once I tracked it down and replaced the disk, the problem was solved.
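<p>
The sensor itself is trivial. Here is a minimal sketch of that kind of check (not the exact script I wrote), in Groovy purely for illustration, with a made-up probe file name; Nagios doesn't care what language a plugin is written in, it only looks at the printed status line and the exit code (0 for OK, 2 for CRITICAL).
<p>
<code><pre>
#!/usr/bin/env groovy
// Minimal sketch of a read-only-root-filesystem check. The probe file name is
// hypothetical; the exit codes follow the Nagios plugin convention.
File probe = new File('/.nagios_rw_probe')
try {
    probe.text = new Date().toString()   // fails with an IOException when / is read-only
    probe.delete()
    println 'ROOTFS OK - / is writable'
    System.exit(0)
} catch (IOException e) {
    println "ROOTFS CRITICAL - cannot write to /: ${e.message}"
    System.exit(2)
}
</pre></code>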
<p>
So let's recap. I had a serious problem with my storage network. smartd, the host-based disk monitoring tool on the iSCSI box, couldn't detect it; all the disks looked fine when you looked at them with the tool it uses, smartctl. Nagios and all its prepackaged sensors couldn't detect it until I wrote a custom sensor. And this other blogger says that monitoring complex networks including clients, storage boxes, video cameras, and so forth is *easy*? Just a matter of installing packages X, Y, and Z and sitting back and relaxing?
<p>
Crack. That's my only explanation for anybody making such a statement. Either that, or utter inexperience at monitoring complex heterogeneous networks with multiple points of failure, of which only 80% or so are covered by off-the-shelf network monitoring tools, with the remainder requiring custom sensors to detect domain-specific points of failure. Either way, anybody who listens to that blogger and believes that monitoring a network effectively is <i>easy</i> is a fool. And that's all I will say on that subject.
<p>
-ELG
Eric Lee Greenhttp://www.blogger.com/profile/12350104299041375832noreply@blogger.com0tag:blogger.com,1999:blog-2890503559617005376.post-59923024180551843622015-01-30T13:50:00.001-08:002015-01-30T13:51:00.784-08:00User friendlyHere's a tip:
<p>
If your Open Source project requires significant work to install, work that would be easily scriptable by a competent software engineer, I'm not going to use it. The reason I'm not going to use it is because I'm going to assume you're either an incompetent software engineer, or the project is in unfinished beta state not useful for a production environment. Because a competent software engineer would have written those scripts if the product was in a finished production-ready state. Meaning you're either incompetent, or your product is a toy of use for personal entertainment only.
<p>
Is this a fair assessment on my part? Probably not. But this is what 20 years of experience has taught me. Ignore that at your peril.
<p>
-ELG
Eric Lee Greenhttp://www.blogger.com/profile/12350104299041375832noreply@blogger.com0tag:blogger.com,1999:blog-2890503559617005376.post-73138217107412240822014-12-15T22:14:00.000-08:002014-12-15T22:16:31.598-08:00The problem with standardsAs part of a <a href="http://stilldrinking.org/programming-sucks">long rant about programming vs engineering</a>, Peter Welch makes the statement, "standards are unicorns." Uhm, no. Just no. Unicorns are mythical. They don't exist. Standards do exist. They're largely useless, but they exist. I have linear feet of shelf space lined with the bloody things to attest to that.
<p>
So what's a better analogy for standards? I have a good one: Hurricane stew.
<p>
After a hurricane, the power is out. On every block in hurricane country there's a guy who has a giant vat used to boil crabs and crawfish and shrimp, that sits on a large burner powered by a large propane tank. You know, That Guy. So That Guy pulls out his vat and sets it up on the driveway in front of his carport and tosses in some water and the contents of his refrigerator before they can spoil and go bad. And his neighbors come by and toss in the contents of *their* refrigerators before they can spoil and go bad. And as people finish cleaning out their refrigerators they bring more and more random scraps and throw them in, until what you have is a big glop of soupy strangeness containing a little of every possible food item that could ever exist. And then people subsist on this hurricane stew, this random glop of indistinguishable odds and ends, for the next week or two as it continues to slowly bubble on the driveway of That Guy and continues to occasionally get new glop thrown in as people throw the contents of their freezers into it too.
<p>
That's standards -- just a big glop of indistinguishable mess created by everybody under the sun throwing their own scraps and odds and ends into it, in the end being of use to absolutely no one except the people who threw scraps into it, who end up subsisting off of it for the rest of their careers because nobody else has the slightest freakin' idea what's in that opaque bubbling mess with the oddly-colored mist wafting off of it. And every single implementation of this "standard" is different and does things in a different way in the thousands of edge cases that have been thrown into the standard as possible way to do things because that was what was in someone's refrigerator err code base at the time the standard was created and so they tossed it in before it could spoil, driving anybody who has to write software that's "standards-compliant" slowly insane to the point of stabbing a two-foot-tall printout of the standard with a knife repeatedly, over and over, while screaming "Die! Die! Die!" at the top of their lungs.
<p>
That's standards. That's what they're good for.
<p>
- ELG
Eric Lee Greenhttp://www.blogger.com/profile/12350104299041375832noreply@blogger.com1tag:blogger.com,1999:blog-2890503559617005376.post-25215557181969377922014-08-21T01:35:00.001-07:002014-08-23T22:06:38.511-07:00Performance tips for Grails / Hibernate batch processingSo I'm working on an application that does batch processing of records sent by client systems, and Groovy/Grails is the language/framework that the application is written with. This is a story of how it failed -- and how it was fixed.
<p>
<b>Failure #1: Record sets sent via HTTP take too long to process, causing HTTP timeouts before a response can be returned to the client.</b>
Solution: Plop the record sets into a batch queue instead, and process them via a batch queue runner running as a Quartz job.
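<p>
As a rough illustration, and assuming the Grails Quartz plugin, the queue runner can be a job class along these lines; the job name, polling interval, and batchQueueService are all hypothetical:
<p>
<code><pre>
// A minimal sketch of a Quartz-driven queue runner under the Grails Quartz
// plugin. Names are made up; the real thing also needs the locking described
// under Failure #2 below.
class BatchQueueRunnerJob {
    static triggers = {
        simple name: 'batchQueueRunner', repeatInterval: 60000L  // poll once a minute
    }

    def batchQueueService   // hypothetical service that drains the queue

    def execute() {
        batchQueueService.processPendingRecordSets()
    }
}
</pre></code>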
<p>
<b>Failure #2: Hibernate/Grails optimistic locking is, well, <i>overly optimistic</i></b>. As in, if I have multiple EC2 instances processing batch queues, I have to hope and pray that two different instances don't attempt to process the same set of records at the same time, else they'll both fail and rollback at some point in time and my batch queue will never get emptied. Meanwhile, Hibernate fine-grained locking is <i>too</i> fine-grained, and ends up causing deadlocks.
<p>
Solution: Create a locking system (via your database or via memcached or whatever, doesn't matter as long as it serializes access) and divide your database records into logical non-overlapping sets. Then lock those logical sets at a higher level prior to processing a batch that touches that particular set. For example, if you're batch processing store records at Walmart central office, a logical set might be an individual store and all its individual inventory items.
<p>
Note that this requires *very* careful schema layout to ensure that things that can be changed by the end user interface do not get overwritten by the batch processor, unless you *want* them to get overwritten by the batch processor. But it's doable.
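<p>
Here's a minimal sketch of the database flavor of such a lock, assuming a hypothetical SetLock domain class whose setName column is backed by a unique index; memcached or any other store that serializes access works just as well. Acquiring the lock is an insert that either commits or collides; releasing it is a delete.
<p>
<code><pre>
// Hypothetical coarse-grained lock table; not the exact code from the application.
class SetLock {
    String setName                 // e.g. the store identifier
    Date   lockedAt = new Date()
    static constraints = { setName unique: true }
}

boolean tryLock(String setName) {
    try {
        SetLock.withNewTransaction {
            new SetLock(setName: setName).save(flush: true, failOnError: true)
        }
        return true
    } catch (Exception ignored) {  // constraint violation: someone else holds the lock
        return false
    }
}

void unlock(String setName) {
    SetLock.withNewTransaction {
        SetLock.findBySetName(setName)?.delete(flush: true)
    }
}
</pre></code>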
<p>
<b>Failure #3: The Hibernate session consumes all of memory, crashing the application.</b>
<p>
Solution: We're doing batch processing, so each record set runs for a significant amount of time (30 seconds or more) with tens of thousands of operations. This means we can let each record set have its own session. For each record set processed by the application, create a new session. Flush that session then destroy it at the end of each record set. For example:
<p>
<code><pre>
while (batch = getNextBatch()) { // returns non-Hibernate objects, typically parsed from JSON or EDI
Store.withNewSession { session ->
... process batch here ...
session.flush()
}
}
</pre></code>
<p>
<b>Failure #4: Multi-threaded performance slammed into a brick wall at the Hibernate query cache.</b>
<p>
Solution: In general, the Hibernate caches are a performance hindrance when batch processing. The number of records that you process over the course of running all of your queues is far larger than the amount of memory you have, so any cached database records from the beginning of the queue run are long gone by the time the queue gets re-filled and you start over at the beginning of the queue again. Furthermore, the query cache is single-threaded, so if you're running on a modern multi-threaded processor and using multiple threads to consume its resources, you might as well be running on an 80386: performance is going to top out at less than two threads' worth of performance. So disable the caches in the 'hibernate' block in your config/DataSource.groovy file and instead manually cache any items that you need to cache within batches or across batches:
<p>
<code><pre>
hibernate {
cache.use_second_level_cache = false
cache.use_query_cache = false
.... other options here ....
}
</pre></code>
<b>Failure #5: Lots of small queries kill performance.</b>
<p>
For example, a store might send its nightly inventory records. The nightly inventory records update the quantities for each inventory item, which in turn create ordering alerts when inventory has fallen below a certain level. You know ahead of time that a) the number of inventory records is limited (figure 40,000 different items per store), and b) 75% of the items are going to be modified. So: doing things the inefficient way, you'd do:<br>
<code>inventory_batch.each { rec=Inventory.findByStoreAndItemNum(store,it.itemnum) ; rec.quantity=it.quantity; rec.save() }</code><br>
But that results in 40,000 queries to the database, each of which has an enormous amount of Hibernate overhead associated with it.
<p>
Solution: Cache the entire set of items beforehand (using a HashMap and a cache class to wrap it), and fetch them from the cache instead. For example, assuming you've created an 'InvCache' class that caches inventory items:
<p>
<code><pre>
rec_set = Inventory.findAllByStore(store)
inv_cache = new InvCache(rec_set)
inventory_batch.each {
rec = inv_cache.findByItemNum(it) // looks it up in a hashmap, and if not there, adds it to the database.
rec.quantity = it.quantity
rec.save() // in a real application, you'd check result of save and print validation errors.
// in a real application would check quantity against limits and issue an inventory alert if inventory too low.
}
</pre></code>
<p>
Note that rec.save() does not immediately update the record, it merely marks the record as dirty and the next time Hibernate flushes, it will then issue a SQL query to do the update. You still end up issuing 35,000 update statements but that's still better than issuing 40,000 select + 35,000 update statements, and they're all issued in a single batch rather than via multiple Hibernate calls preparing statements and etc.
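<p>
For completeness, here is a minimal sketch of what the hypothetical 'InvCache' wrapper might look like; the property names (itemNum on the Inventory domain class, itemnum on the parsed batch records) are assumptions for illustration.
<p>
<code><pre>
// Wrap the pre-fetched records in a HashMap keyed by item number, and create
// (and save) a new record when the batch refers to an item we've never seen.
class InvCache {
    private Map byItemNum = [:]
    private def store

    InvCache(List recs) {
        recs.each { byItemNum[it.itemNum] = it }
        store = recs ? recs[0].store : null
    }

    def findByItemNum(batchRec) {
        def rec = byItemNum[batchRec.itemnum]
        if (rec == null) {
            rec = new Inventory(store: store, itemNum: batchRec.itemnum)
            rec.save()                          // new item: add it to the database
            byItemNum[batchRec.itemnum] = rec
        }
        return rec
    }
}
</pre></code>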
<p>
<b>Failure #6: Flushes in big Hibernate sessions kill performance.</b>
<p>
Some stores have a big inventory. It can take several seconds to flush the Hibernate session due to Hibernate's extremely inefficient algorithm for determining what needs to be flushed (it tries to trace the entire relationship structure multiple layers deep, so the cost grows exponentially rather than linearly). The Hibernate session gets flushed before virtually every query that you make to the database by default, meaning that if you have to do 500 queries against the database in the course of processing to handle things not easily cached as above, you will have 500 flushes. 500 flushes times 5 seconds per flush is 41 minutes' worth of flushing. EEP!
<p>
Solution #1: Don't use Hibernate's built in flushing and transaction ordering system. Do your own, because most of what you're doing is either batch appends of log records (where you're never going to query it back out again in the process of doing the batch thus don't care when it actually gets flushed), or updates of records where again you really don't care about when it's flushed. So: switch the flush mode to 'manual' and flush only when necessary to maintain relational ordering, and otherwise flush only at the end of logical batches. For example, if the store manager has added a new InventoryItem, and this new InventoryItem is referenced by a new InventoryAlert to note that this item needs to be ordered, the order will be to create the new InventoryItem, use item.save(flush:true) to flush the session, <i>add it to the inventory cache</i> if it's going to be used for other things, then create the new InventoryAlert. There is no need to use flush:true on the InventoryAlert because you don't care when it actually gets flushed, you care only that the InventoryItem gets saved before the InventoryAlert that references it. Hibernate is <i>supposed</i> to handle the dependency order here, if you properly set up your Grails objects... but sometimes it doesn't, as I've previously noted.
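<p>
As a rough sketch of that ordering (property names and the invCache helper are hypothetical), the only save that forces a flush is the one that later records depend on:
<p>
<code><pre>
// Flush the InventoryItem immediately; let the InventoryAlert flush whenever.
def item = new InventoryItem(itemNum: 'SKU-1234', description: 'new item')
item.save(flush: true)              // force this INSERT out now
invCache.add(item)                  // hypothetical cache, if anything else will look it up
new InventoryAlert(item: item, reason: 'below reorder level').save()  // flushed later
</pre></code>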
<p>
Note that setting the flush.mode in the hibernate{} block in DataSource.groovy will <i>not</i> set the flush mode to 'manual' in the session we created earlier. It will get set to 'auto' or 'commit' by Grails depending on whether you're in an @Transactional service when you create the new session; Grails ignores the Hibernate value. You'll need to explicitly set the flush mode when you create the new session:
<p>
<code><pre>
import org.hibernate.FlushMode
...
Inventory.withNewSession { session ->
session.setFlushMode(FlushMode.MANUAL)
.... do processing here ....
}
</pre></code>
<p>
Solution #2: In many cases, we are creating new records in batches. For example, cash register logs. So: Create a bunch of new records, flush them to disk, <i>then discard them from the session</i> in order to keep the session size down. For example:
<p>
<code><pre>
registers.each { register ->
LogEntry.withSession { session ->
entry_list=[]
register.logs.each { logentry ->
entry = new LogEntry(logentry) // creates it from hash
... do any other processing / initialization for entry here ...
entry.save() // would validate/check return val in real app
entry_list.add(entry)
}
session.flush() // flush the 5,000 register logs for this register to disk.
entry_list.each { entry ->
entry.discard() // get rid of the 5,000 register logs for this register.
}
}
}
</pre></code>
<p>
<b>Conclusion</b>
<p>
Hibernate has a deserved reputation as an inefficient ORM that is not well suited for high performance operations. This is primarily because its standard settings are appropriate for only a small subset of the possible problem space, and are utterly inappropriate for batch processing. Its session management is incapable of handling sessions with large numbers of objects in a timely manner, and its caches actually make many applications slower rather than faster. However, by applying the above to the application in question, I successfully reduced the processing time for the largest target batch sent to our system from over 60 minutes to 3 minutes, which is roughly five times faster than it's required to be in order to meet our performance requirements. Yes, a factor of 20 times improvement. You <i>can</i> make Hibernate perform. The batch processor could have been made even faster by dropping down to doing raw SQL in Java, but it would have taken a factor of 20 times longer to write too.
<p>
In the end it's all about tradeoffs. Hibernate sucks, but in this case, given the deadlines and time pressures and the fact that it was the back end of a large code base already written with Groovy/Grails/Hibernate, it was the best of a batch of poor solutions. The ideal is sometimes the enemy of the good enough. If we hit a problem set large enough that we cannot handle it with the technology we're using, then we'll drop down to lower level / faster technologies such as using raw Java EE and raw SQL (probably via something like MyBatis to intermediate for sanity's sake). In the end, however, in most applications there are other problems worth solving once performance is "good enough". So don't let Hibernate's poor performance scare you off if it's the solution to getting a product out the door in a timely manner. That is, after all, the goal -- and for most applications, Hibernate <i>can</i> be made fast enough.
<p>
-ELG
Eric Lee Greenhttp://www.blogger.com/profile/12350104299041375832noreply@blogger.com0tag:blogger.com,1999:blog-2890503559617005376.post-16418184144172755342014-05-02T17:13:00.001-07:002014-05-02T17:13:57.260-07:00Behold the XKCD Passphrase Generator<blockquote>
Behold the <a href="http://xkcd.com/936/">XKCD Passphrase Generator</a>. Copy and paste it into a file on your own Linux machine, and run (assuming you've installed the 'words' package, which is almost always the case). It'll pick five random words and concatenate them together. Should also run on other machines with Python installed, but you may need to find a words file somewhere and edit accordingly.
If this were a real program I'd add parameters yada yada, but since it's just a toy...
<pre>
#!/usr/bin/python
# XKCD passphrase generator.
# See XKCD 936 http://xkcd.com/936/
# You'll have to provide your own 'words' file. One word per line.
# Unix based systems usually have /usr/share/dict/words but you'll need
# to get that from somewhere else for Windows or etc.
# After editing wordsfile, numwords, separator:
# Execute as: python genpf.py
import os
wordsfile="/usr/share/dict/words"
numwords = 5
separator = ".%*#!|"
def gen_index(len):
    i=(ord(os.urandom(1)) << 16) + (ord(os.urandom(1)) << 8) + (ord(os.urandom(1)))
    return i%len
f=open(wordsfile)
pf=""
words = f.readlines()
i=gen_index(len(separator))
c=separator[i]
while (numwords > 0):
    i=gen_index(len(words))
    s=words[i].strip()
    if (pf==""):
        pf=s
    else:
        pf=pf+c+s
    numwords = numwords - 1
print pf
</pre>
</blockquote>Eric Lee Greenhttp://www.blogger.com/profile/12350104299041375832noreply@blogger.com0tag:blogger.com,1999:blog-2890503559617005376.post-28829188361882295322014-02-20T12:07:00.001-08:002014-02-20T12:07:54.416-08:00Hibernate gotchas: ordering of operationsGrails / GORM was throwing a Hibernate error from time to time:
<p>
org.hibernate.StaleStateException: Batch update returned unexpected row count from update [0]; actual row count: 0; expected: 1
<p>
What was confusing was that there was no update anywhere in the code in question, which was a queue runner. The answer to what was causing this was interesting, and says a lot about Hibernate and its (in)capabilities.
<p>
The error in question was being thrown by the transaction flush in a queue runner. The queue is in a Postgres database. Each site gets locked in the Postgres database, then its queue is run, with each queued-up item deleted from the Postgres database after it is processed, and then the site gets unlocked.
<p>
The first problem arose with the lock/unlock code. There was a race condition, clearly, when two EC2 instances tried to lock the same site at the same time. The way it was originally implemented, with Hibernate, was that the first instance would create its lock record, then flush the transaction, then re-query to see whether there were other lockers with a lower ID holding a lock on the object. If so, it'd release the lock by deleting its lock record. Meanwhile the other instance finished processing that queue and released all locks on that site. So the first instance would go to delete its lock, find that it'd already been deleted, and throw that exception.
<p>
Once that was resolved, the queue runner itself started throwing the exception occasionally when the transaction was flushed after the unlock. What was discovered by turning on Hibernate debug was that Hibernate was re-ordering operations so that the unlock got applied to the database *before* the deletes got applied to the database. So the site would get unlocked, another queue runner would then re-lock the site to itself and start processing the same records that previously got processed, then go to delete the records, and find that the records had already been deleted out from under it. Bam.
<p>
The solution, in this case, was to rewrite the code to use the Groovy SQL API rather than use GORM/Hibernate.
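<p>
A minimal sketch of what that looks like with groovy.sql.Sql, assuming an injected DataSource and hypothetical queue_entry and site_lock tables; the statements execute in the order written, inside one transaction, so the deletes always reach the database before the unlock:
<p>
<code><pre>
import groovy.sql.Sql

// Not the production code; just the shape of the Groovy SQL replacement.
def finishQueueRun(javax.sql.DataSource dataSource, List processedIds,
                   long siteId, String instanceId) {
    def sql = new Sql(dataSource)
    sql.withTransaction {
        processedIds.each { id ->
            sql.executeUpdate("DELETE FROM queue_entry WHERE id = ?", [id])
        }
        sql.executeUpdate("DELETE FROM site_lock WHERE site_id = ? AND owner = ?",
                          [siteId, instanceId])
    }
    sql.close()
}
</pre></code>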
<p>
What this does emphasize is that you cannot rely on operations being executed by Hibernate in the same order in which you specified them. For the most part this isn't a big deal, because everything you're operating on is going to be applied in a reasonable order so that the dependencies get satisfied. E.g. if you create a site then a host inside a site, the site record will get created in the database before the host record gets created in the database. But if ordering matters... time to use a different tool. Hibernate isn't it.
<p>
-ELGEric Lee Greenhttp://www.blogger.com/profile/12350104299041375832noreply@blogger.com0tag:blogger.com,1999:blog-2890503559617005376.post-14937219812097580322013-11-21T00:06:00.001-08:002014-08-23T22:12:47.362-07:00EBS, the killer app for Amazon EC2So I gave Ceph a try. I set up a Ceph cluster in my lab with three storage servers connected via 10 gigabit Ethernet. The data store on each machine is capable of approximately 300 megabytes per second streaming throughput. So I created a Ceph block device and ran streaming I/O with large writes to it. As in, 1 megabyte writes. If writing to the raw data store, I get roughly 300 megabytes per second throughput. If writing through ceph, I get roughly 30 megabytes per second throughput.
<p>
"But wait!" the folks on the Ceph mailing list said. "We promise good *aggregate* throughput, not *individual* throughput." So I created a second Ceph block device and striped it with the first. And did my write again. And got... 30 megabytes per second. Apparently Ceph aggregated all I/O coming from my virtual machine and serialized it. And this is what its limit was.
<p>
This seems to be a hard limit with Ceph if you're not logging to an SSD. And not any old SSD. One optimized for small writes. There seem to be some fundamental architectural issues with Ceph that keep it from performing at anywhere near hardware speed. The decision to log everything, regardless of whether your application needs that level of reliability, appears to be one of those decisions. So Ceph simply doesn't solve a problem that's interesting to me. My users are not going to tolerate running at 1/10th the speed they're accustomed to, and my management is not going to appreciate me telling them that we need to buy some very pricey enterprise SSD hardware when the current hardware with the stock Linux storage stack on it runs like a scalded cat without said SSD hardware. There's no "there" there. It just won't work for me.
<p>
So that's Ceph. Looks great on paper, but the reality for me is that it's 1/10th the speed without any real benefits for me over just creating iSCSI volumes on my appliances, pointing my virtual machines at them, and then using MD RAID on the virtual machine to do replication between the iSCSI servers. Yes that's annoying to manage, but at least it works, works *fast*, and my users are happy.
<p>
At which point let's talk about Amazon EC2. Frankly, EC2 sucks. First, it isn't very elastic. I have autoscaling alerts set up to spin up new virtual machines when load on my cloud reaches a certain point. Okay, fine. That's elastic, right? But: While an image is booting up and configuring, it's using 100% CPU. Which means that your alarm goes off *again* and spins up more instances, ad infinitum, until you hit the upper limit you configured for the autoscaling group, *unless* you put hysteresis in there to wait until the instance is up before you spin up another instance. So: It took five minutes to spin up a new instance. That's not acceptable. If the load has kept going up in the meantime, that means you might never catch up. I then created a new AMI that had Tomcat and other necessary software already pre-loaded. That brought the spin-up time to three minutes -- one minute for Amazon to actually create the instance, one minute for the instance to boot and run Puppet to pull in the application .war file payload from the puppetmaster, and one minute for Tomcat to actually start up. Acceptable... barely. This is elastic... if elastic is eons in computer time. The net result is to encourage you to spin up new instances before you need them, spin up *multiple* instances at a time when spinning up instances, and then take a while before tearing them back down again once load goes down. Not the best way to handle things by any means, unless you're Amazon and making money by forcing people to keep excess instances hanging around.
<p>
Then let's not talk about the fact that CloudFormation is configured via JSON. A format deliberately designed without comments for data interchange between computers, *not* for configuring application stacks. Whoever specified JSON as the configuration language needs to be taken out behind Amazon HQ and beat with a clue stick until well and bloody. XML is painful enough as a configuration language. JSON is pure torture. Waterboarding is too good for that unknown programmer.
<p>
And then there's expense. Amazon EC2 is amazingly expensive compared to somebody like, say, Digital Ocean or Linode. My little 15 virtual machine cloud would cost roughly $300/month at Linode, less at Digital Ocean. You can figure four times that amount for EC2.
<p>
So why use EC2? Well, my Ceph experiment should clue you in there: it's all about EBS, the Elastic Block Store. See, I have data that I need to store up in the cloud. A *lot* of data. And if you create EBS-optimized virtual machines and striped MD RAID arrays across multiple EBS volumes, your I/O is wicked fast. With just two EBS volumes I can easily exceed 1,000 IOPS and 100 megabytes per second when doing pg_restore of database files. With more EBS volumes I could do even better.
<p>
Digital Ocean has nothing like EBS. Linode has nothing like EBS. Rackspace and HP have something like EBS (in public beta for HP right now, so not sure how much to trust it), but they don't have a good instance size match with what I need and if you go the next size up, their pricing is even more ridiculous than Amazon's. My guess is that as OpenStack matures and budget providers adopt it, you're going to see prices come down for cloud computing and you're going to see more people providing EBS-like functionality. But right now OpenStack is chugging away furiously trying to match Amazon's feature set, and is unstable enough that only providers like HP and Linode who have hundreds of network engineers to throw at it could possibly do it right. Each iteration gets better so hopefully the next iteration will be better. (Note from 10 months later: nope. Still not ready for mere mortals). Finally, there's Microsoft's Azure. I've heard good things about it, oddly enough. But I'm still not trusting it too much for Linux hosting, given that Microsoft only recently started giving grudging support to Linux. Maybe in six months or so I'll return and look at it again. Or maybe not. We'll see.
<p>
So Amazon's cloud it is. Alas. We look at the Amazon bill every month and ask ourselves, "surely there is a better alternative?" But the answer to that question has remained the same for each of the past six months that I've asked it, and it remains the same for one reason: EBS, the Elastic Block Store.
<p>
-ELGEric Lee Greenhttp://www.blogger.com/profile/12350104299041375832noreply@blogger.com0tag:blogger.com,1999:blog-2890503559617005376.post-17528783803465749632013-08-24T22:05:00.002-07:002013-08-24T22:05:56.474-07:00The Linux storage stack: is it ready for prime time yet?I've been playing with LIO quite a bit since rolling it into production for Viakoo's infrastructure (and at home for my personal experiments). It works quite a bit differently from the way that Intransa's Storstac worked. Storstac created a target for each volume being exported, while with LIO you have a single target that exports a LUN for each volume being exported. The underlying Linux kernel functionality is there to create a target per volume, but the configuration infrastructure is oriented around the LUN per volume paradigm.
<br />
Not a big deal, you might say. But it does make a difference when connecting with the Windows initiator. With the Windows initiator, the target per volume paradigm allows you to see what volume a particular LUN is connected to (assuming you give your targets descriptive names, which StorStac does). This in turn allows you to easily coordinate management of a specific target. For example, to resize a volume you can offline it in Windows, stop exporting it on the storage server, rescan in Windows, expand the volume on your storage server, re-export it on your storage server, then online it in Windows and expand your filesystem to fill the newly opened up space.
Still, this is not a big deal. LIO does perform quite well and does have the underlying capabilities that can serve the enterprise.
So what's missing to keep the Linux storage stack from being prime time? Well, here's what I see:
<ol>
<li> Ability to set up replication without taking down your filesystems / iSCSI exports. Intransa StorStac had replication built in, you simply set up a volume the same size on the remote target, told the source machine to replicate the volume to the remote target, and it started replicating. Right now replication is handled by DRBD in the Linux storage stack. DRBD works very well for its problem set -- local area high availability replication -- but to set up a replication after the fact on a LVM volume simply isn't possible. You have to create a drbd volume on top of a LVM volume, then copy your data into the new drbd volume. One way around this would be to automatically create a drbd volume on top of each LVM volume in your storage manager, but that adds overhead (and clutters your device table) and presents problems for udev at device assembly time. And still does not solve the problem of:
<li> Geographic replication: StorStac at one time had the ability to do logged replication across a WAN. That is, assuming that your average WAN bandwidth is high enough to handle the number of writes done during the course of a workday, a log volume will collect the writes and ship them across the WAN in the correct order to be applied at the remote end. If you must do a geographic failover due to, say, California falling into the sea, you lose at most whatever log entries have not yet been applied at the remote end. Most filesystems will handle that in a recoverable manner as long as the writes are being applied in the correct order (which they are). DRBD *sort of* has the ability to do geographic replication via an external program, "drbd-proxy", that functions in much the same way as StorStac replication (that is, it keeps a log of writes in a disk volume and replays them to the remote server), but it's not at all integrated into the solution and is excruciatingly difficult to set up (which is true of drbd in general).
<li> Note that LVM also has replication (of a sort) built in, via its mirror capability. You can create a replication storage pool on the remote server as a LVM volume, export it via LIO, import it via open-iscsi, create a physical volume on it, then create mirror volumes specifying this second physical volume as the place you want to put the mirror. LVM also does write logging so can handle the geographic situation. The problem comes with recovery, since what you have on the remote end is a logical volume that has a physical volume inside it that has one or more logical volumes inside it. The circumlocutions needed to actually mount and use those logical volumes inside that physical volume inside that logical volume are non-trivial, it may in fact be necessary to mount the logical volume as a loopback device then do pvscan/lvscan on the loopback device to get at those volumes. It is decidedly *not* as easy as with StorStac, where a target is a target, whether it's the target for a replication or for a client computer.
</ol>
So clearly replication in the Linux storage stack is a mess, nowhere near the level of ease of use or functionality as the antiquated ten-year-old Intransa StorStac storage stack. The question is, how do we fix it? I'll think about that for a while, but meanwhile there's another issue: Linux doesn't know about SES. This is a Big Deal for big servers. SES is the SCSI Enclosure Services protocol that is implemented by most SAS fanout chips and allows control of, amongst other things, the blinky lights that can be used to identify a drive (okay, so mdmonitor told you that /dev/sdax died, where the heck is that physically located?!) . There are basically two variants extant nowadays, SAS and SAS2, that are very slightly different (alas, I had to modify StorStac to talk to the LSI SAS2X24 expander chip which very slightly changed a mode page that we depended upon to find the slot addresses). Linux itself has no notion that things like SAS disk enclosures even exist, much less any idea how to blink lights in them.
<p>And finally, there is the <a href="http://neil.brown.name/blog/20110614101708">RAID5/RAID6 write hole issue</a>. Right now the only reliable way to have RAID5/RAID6 on Linux is with a hardware RAID controller that has a battery-backed stripe cache. Unfortunately once you do this, you can no longer monitor drives via smartd to catch failures before they happened (yes, I do this, and yes, it works -- I caught several drives in my infrastructure that were doing bad things before they actually failed and replaced them before I had to deal with a disaster recovery situation), you can no longer take advantage of your server's gigabytes of memory to keep a large stripe cache so that you don't have to keep thrashing the disks to load stripes in the case of random writes (if the stripe is already in cache, you just update the cache and write the dirty blocks back to the drives, rather than have to reload the entire stripe) and you can also no longer take advantage of the much faster RAID stripe computations allowed by modern server hardware (it's amazing how much faster you can do RAID stripe calculations with a 2.4Ghz Xeon than you can with an old embedded MIPS processor running at much slower speeds). In addition it is often very difficult to manage these hardware RAID controllers from within Linux. For these reasons (and other historical issues not of interest at the moment) StorStac always used software RAID. Historically, StorStac used battery-backed RAM logs for its software RAID to cache outstanding writes and recover from outages, but such battery-backed RAM log devices don't exist for modern commodity hardware such as the 12-disk Supermicro server that's sitting next to my desk. It doesn't matter anyhow, because even if it did exist, there's no provision in the current Linux RAID stack to use it.
<p>
So what's the meaning of all this? Well, the replication issue is... troubling. I will discuss that more in the future. On the other hand, things like <a href="http://ceph.com/">Ceph</a> are handling it at the filesystem level now, so perhaps block level replication via iSCSI or other block-level protocols isn't as important as it used to be. For the rest, it appears that the only thing lacking is a management framework and a utility to handle SES expander chips. The RAID[56] write hole is troublesome, but in reality data loss from that is quite rare, so I won't call it a showstopper. It appears that we can get 90% of what the Intransa StorStac storage stack used to do by using current Linux kernel functionality and a management framework on top of that, and the parts that are missing are parts that few people care about.
<p>
What does that mean for the future? Well, your guess is as good as mine. But to answer the question about the Linux storage stack: Yes, it IS ready for prime time -- with important caveats, and only if a decent management infrastructure is written to control it (because the current md/lvm tools are a complete and utter fail as anything other than tools to be used by higher-level management tools). The most important caveat being, of course, that no enterprise Linux distribution has been released yet with LIO (I am using Fedora 18 currently, which is most decidedly *not* what I want to use long-term for obvious reasons). Assuming that Red Hat 7 / Centos 7 will be based on Fedora 18, though, it appears that the Linux storage stack is the closest to being ready for prime time as it's ever been, and proprietary storage stacks are going to end up migrating to the current Linux functionality or else fall victim to being too expensive and fragile to compete.
<p>
-ELG
Eric Lee Greenhttp://www.blogger.com/profile/12350104299041375832noreply@blogger.com0tag:blogger.com,1999:blog-2890503559617005376.post-53988013520693320892013-08-11T21:12:00.003-07:002013-08-24T20:42:08.437-07:00The killer app for virtualizationThe killer application for virtualization is... <I>running legacy operating systems</I>.
<p>
This isn't a new thought on my part. When I was designing the Intransa StorStac 7.20 storage appliance platform I deliberately put virtualization drivers into it so that we could run Intransa StorStac as a virtual appliance on some future hardware platform not supported by the 2.6.32 kernel. And yes, that works (no joke, I tried it out of course, the only thing that didn't work was sensors but if Viakoo ever wants to deliver a virtualized IntransaBrand appliance I know how to fix the sensors). My thought was future-proofing -- I could tell from the layoffs and from the unsold equipment piled up everywhere that Intransa was not long for the world, so I decided to leave whoever bought the carcass a platform that had some legs on it. So it has drivers for the network chips in the X9 series SuperMicro motherboards (Sandy/Ivy Bridge) as well as the virtualization drivers. So there's now a pretty reasonable migration path to keep StorStac running into the next decade... first migrate it to Sandy/Ivy Bridge physical hardware, then once that's EOL'ed migrate it to running on top of a virtual platform on top of Haswell or its successors.
<p>
But what brought it to mind today was ZFS. I need some of the features of the LIO iSCSI stack and some of the newer features of libvirtd for some things I am doing, so have ended up needing to run a recent Fedora on my big home server (which is now up to 48 gigabytes of memory and 14 terabytes of storage). The problem is that two of those storage drives are offsite backups from work (ZFS replication, duh) and I need to use ZFS to apply the ZFS diffsets that I haul home from work. That was not a problem for Linux kernels up to 3.9, but now Fedora 18/19 have rolled out 3.10, and ZFSonLinux won't compile against the 3.10 kernel. I found that out the hard way when the new kernel came in, and DKMS spit up all over the floor because of ZFS.
<p>
The solution? Virtualization to the rescue! I rolled up a Centos 6.4 virtual machine, pushed all the ZFS drives into it, gave it a fair chunk of memory, and voila. One legacy platform that can sit there happily for the next few years doing its thing, while the Fedora underneath it changes with the seasons.
<p>
Of course that is nothing new. A lot of the infrastructure that I migrated from Intransa's equipment onto Viakoo's equipment was virtualized servers dating in some cases all the way back to physical servers that Intransa bought in 2003 when they got their huge infusion of VC money. Still, it's just a practical reminder of the killer app for virtualization -- the fact that it allows your OS and software to survive despite underlying drivers and architectures changing with the seasons. Now making your computer work faster can be done without changing anything at all about it -- just buy a couple of new virtualization servers with the very latest fastest hardware and then migrate your virtual machines to them. Quick, easy, and terrifies OS vendors (especially Microsoft) like crazy because now you no longer need to buy a new OS to run on the new hardware, you can just keep using your old reliable OS forever.
<p>
-ELG
Eric Lee Greenhttp://www.blogger.com/profile/12350104299041375832noreply@blogger.com2tag:blogger.com,1999:blog-2890503559617005376.post-24296127309612702742013-07-25T17:37:00.001-07:002014-02-20T12:17:40.426-08:00No magic bullet, I guessAmazon introduced a new service, Opsworks, back in the spring. This is supposed to make cloud stack creation, cloud formation, and instance management easier to handle than their older Puppet-based CloudFormation service. Being able to fail over gracefully from master to slave database, for example, would be a Very Good Thing, and they appear to have hooks that can allow that to happen (via running Chef code when a master fails). Similarly, if load gets too high on web servers they can automagically spawn new ones.
<p>
Great idea. Sort of. Except it seems to have two critical issues: 1) it doesn't appear to have any way to handle our staging->production cycle, where the transactions coming into production are replicated to staging during our testing cycle, then eventually staging is rolled to production via mechanisms I won't go into right now, and 2) it doesn't appear to actually work -- it claims that the default security groups that are needed for its version of Chef to work don't exist, and they never appeared later on either. Which isn't because I lack permission to create security groups, because I created one for an earlier prototype of my deployment. This appears to be a sporadic bug that multiple people have reported, where the default security groups aren't created for one reason or another.
<p>
Eh. Half baked and not ready for production. Oh well. Amazon *did* say it was Beta, after all. They're right.
Eric Lee Greenhttp://www.blogger.com/profile/12350104299041375832noreply@blogger.com0tag:blogger.com,1999:blog-2890503559617005376.post-87212334701699887592013-07-24T21:44:00.000-07:002013-07-24T21:44:58.855-07:00ORM Not Considered HarmfulRecently I had the task of moving a program from one database engine to another. The program primarily used Hibernate. The actual job of moving it from one database to another was... locating the JDBC module for the other database, locating the dialect name for the other database, telling Hibernate about both, and, mostly, it just worked. Except. Except there were six places where the program issued genuine, real actual SQL. Those six queries had to be re-written because they used a feature of the original database engine that didn't exist on the other database engine. Still, six queries are a lot easier to re-write than hundreds of queries.
I still consider Spring/Hibernate to be evil. But this demonstrates that an ORM with a query dialect engine does have significant advantages in making your program portable into different environments. Being able to prototype your program on your personal desktop with MySQL then deploy it against a real database like Oracle without changing anything about the program other than a couple of configuration settings... that is really cool. And useful, since it causes productivity to go sky high.
Now to see if there's any decent ORM for Java...
-ELG
Eric Lee Greenhttp://www.blogger.com/profile/12350104299041375832noreply@blogger.com0tag:blogger.com,1999:blog-2890503559617005376.post-62480084803337610262013-05-30T18:00:00.000-07:002013-05-30T18:00:28.926-07:00When tiny changes come back to biteA Java network application that connected to a storage array's control port via an SSL socket mysteriously quit working when moved from Java 6 to Java 7. Not *all* the time, but just in certain configurations. The application was written based on Oracle's own example code which hasn't changed since the mid 'oughts, so everybody was baffled. My assignment: Figure out what was going on.
<p>
The first thing I did was create a test framework with Java 6, Java 7, and a Python re-implementation and validate that yes, the problem was something that Java 7 was doing. Java 6 worked. Python via the CPython engine worked. Java 7 didn't work: it logged in (the first transaction), then the next transaction (actually sending a command) failed. Python via the Jython engine on top of Java 7 didn't work. I tried on both Oracle's official Java 7 JDK and Red Hat's OpenJDK that came with Fedora 18; neither worked. So it clearly had something to do with differences in the Java 7 SSL stack.
<p>
I then went to Oracle's site to look at the change notices on differences between Java 6 and Java 7. There was nothing that looked interesting. Setting sun.security.ssl.allowUnsafeRenegotiation=true and jsse.enableSNIExtension=false were two things that looked like they might account for some difference. Maybe the old storage array was using an old protocol. So I ran the program under Java 6 with -Djavax.net.debug=all, found out what protocol it was using, then hardwired my SSL ciphers list to use that protocol as default. I tested again under Java 7, and it still didn't work.
<p>
Then I ran the program in the two different environments with -Djavax.net.debug=all on my main method, on both Java 6 and Java 7. The Java 6 was the OpenJDK from the Centos 6.4 distribution. The Java 7 was the OpenJDK on Fedora 18. Two screens, one on each computer, later, and the output on both of them was identical -- up until the transmission of the second SSL packet. The right screen (the Fedora 18 screen) split it into *two* SSL packets.
<p>
But why? I downloaded the source code to OpenJDK 6 and OpenJDK 7 and extracted them, and set out to figure out what was going on here. The end result: a tiny method in SSLSocketImpl.java named needToSplitPayload() is called from the SSL engine, and if it says yes, the packet gets split. So... is the cipher a CBC mode? Yes. Is it the first app output record? No. Is Record.enableCBCProtection set? Wait, where is that set?! I head over to Record.java, and find that it's set from the property "jsse.enableCBCProtection" at JVM startup.
<p>
The end result: I call <i>System.setProperty("jsse.enableCBCProtection", "false");</i> in my initializer (or set it on the command line) and everything works.
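<p>
For reference, the workaround amounts to nothing more than this (LegacyArrayClient is a hypothetical class name; the only requirement is that the property gets set before the SSL stack is first used):
<pre>
// Disable CBC record-splitting before any SSL sockets are created.
// Equivalent to passing -Djsse.enableCBCProtection=false on the command line.
public class LegacyArrayClient {   // hypothetical class name
    static {
        System.setProperty("jsse.enableCBCProtection", "false");
    }
}
</pre>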
<p>
So what's the point of CBC protection? Well, the point is to deal with traffic-analysis attacks against web servers: by snooping on the traffic, an attacker can figure out what transactions are happening and deduce important details of the session. The problem is that in my application, the receiving storage array apparently wants packets to stay, well, <i>packets</i>, not multiple packets. It's not a problem for the code I wrote (I look for begin and end markers and don't care how many packets it takes to collect a full set of data to shove into my XML parser), but I have no control over the legacy storage array end of things. It apparently does *not* look for begin and end markers; it just grabs a packet and expects it to be the whole tamale.
<p>
So all in all, CBC protection is a Good Thing -- except in my application, where a legacy storage array cannot handle it. Or in a wide variety of other applications where it slows down the application to the point of uselessness by breaking up nice big packets into teeny little packets that clog the network. But at least Oracle gave us a way to disable it for these applications. The only thing I'll whine about is this: Oracle apparently implemented this functionality *without* adding it to their list of changes. So I had to actually reverse-engineer the Java runtime to figure out that one obscure system property is apparently set to "false" by default in OpenJDK 6 and set to "true" by default in OpenJDK 7. Which is not The Way It Spozed To Be, but at least thanks to the power of Open Source it was not a fatal problem -- just an issue of reading through source code of the runtime until locating the part that broke up SSL packets, then tracing backwards through the decision tree up to the root. Chalk up one more reason to use Open Source -- if something mysterious is happening, you can pretty much always figure it out. It might take a while, but that's why we make the big bucks, right?
<p>
-ELG
Eric Lee Greenhttp://www.blogger.com/profile/12350104299041375832noreply@blogger.com1tag:blogger.com,1999:blog-2890503559617005376.post-91101023539427356662013-05-01T21:57:00.000-07:002013-05-01T21:58:30.204-07:00ORMI'm going to speak heresy here: Object-Relational Mappers such as Hibernate are evil. I say that as someone who wrote an object-relational mapper back in 2000 -- the BRU Server server is written as a master class that maps objects to a record in a MySQL database, which is then inherited by child classes that implement the specific record types in the MySQL database. The master class takes care of adding the table to the database if it doesn't exist, as well as populating the Python objects on queries. The child classes take care of business logic and generating queries that don't map well to the ORM, but pass the query results to generators to produce proper object sets out of SQL data sets.
<p>
So why did this approach work so well for BRU Server, allowing us to concentrate on the business logic rather than the database logic and allowing its current owners to maintain the software for ten years now, while it fails so harshly for atrocities like Hibernate? One word: complexity. Hibernate attempts to handle all possible cases, and thus ends up producing terrible SQL queries while making things that should be easy difficult, but that's because it's a general purpose mapper. The BRU Server team -- all four of us -- understood that if we were going to create a complete Unix network backup solution within the six months allotted to us, complexity was the enemy. We understood the compromises needed between the object model and the relational model, and the fact that Python was capable of expressing sets of objects as easily as it was capable of expressing individual objects meant that the "object=record" paradigm was fairly easy to handle. We wrote as much ORM as we needed -- and no more. In some cases we had to go to raw relational database programming, but because our ORM was so simple we had no problems with that. There were no exceptions being thrown because of an ORM caching data that no longer existed in the database, and the actual objects for things like users and servers could do their thing without worrying about how to actually read and write the database.
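<p>
For the curious, the shape of a "just enough" object=record mapper is roughly the following. This is a from-scratch sketch, not the BRU Server code: it uses sqlite3 so it runs anywhere, an explicit field list instead of the introspection described above, and all-TEXT columns for brevity:
<pre>
# A "just enough" object=record mapper in the spirit described above.
# From-scratch sketch, not the BRU Server code.
import sqlite3

class Record(object):
    table = None   # subclasses set the table name
    fields = ()    # subclasses list their column names (all TEXT for brevity)

    def __init__(self, **values):
        for name in self.fields:
            setattr(self, name, values.get(name))

    @classmethod
    def create_table(cls, db):
        cols = ", ".join("%s TEXT" % f for f in cls.fields)
        db.execute("CREATE TABLE IF NOT EXISTS %s (%s)" % (cls.table, cols))

    def save(self, db):
        placeholders = ", ".join("?" for _ in self.fields)
        db.execute("INSERT INTO %s (%s) VALUES (%s)"
                   % (self.table, ", ".join(self.fields), placeholders),
                   [getattr(self, f) for f in self.fields])

    @classmethod
    def find(cls, db, **criteria):
        where = " AND ".join("%s = ?" % k for k in criteria) or "1=1"
        rows = db.execute("SELECT %s FROM %s WHERE %s"
                          % (", ".join(cls.fields), cls.table, where),
                          list(criteria.values()))
        return [cls(**dict(zip(cls.fields, row))) for row in rows]

class User(Record):        # business logic lives in subclasses like this one
    table = "users"
    fields = ("name", "role")

db = sqlite3.connect(":memory:")
User.create_table(db)
User(name="eric", role="admin").save(db)
print([u.name for u in User.find(db, role="admin")])
</pre>
The division of labor is the point: the base class owns table creation and the row-to-object plumbing, while the subclasses own business logic and any hand-written SQL that doesn't fit the mapping.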
<p>
In the meantime, I have not run into any Spring/Hibernate project that actually managed to produce usable code performing acceptably well in any reasonable time frame with a team of a reasonable size. I was at one company that decided to use the PHP-based code that three of us had written in four weeks' time as the prototype for the "real" software, which of course was going to be Java and Spring and Hibernate and RESTful and all the right buzzwords, because real software isn't written in PHP, of course (though our code solved the problem and didn't require advanced degrees to understand). Six months and a cast of almost a dozen later, no closer to release than at the beginning, the entire project was canned and the project team fired (not me; I was on another project, but I had a friend on that one and she was not a happy camper). I don't know how much money was wasted on that project, but undoubtedly it hurried the demise of that company.
<p>
But maybe I'm just not well informed. It wouldn't be the first time, after all. So can anybody point me to a Spring/Hibernate project that is, say, around 80,000 lines of code written in under five months' time by a team of four people, that not only does database access but also does some hard-core hardware-level work slinging massive amounts of data around in a three-box client-server-agent architecture with multiple user interfaces (CLI and GUI/Web minimum)? That can handle writing hundreds of thousands of records per hour then doing complex queries on those records with MySQL without falling over? We did this with BRU Server, thanks to Python and choosing just enough ORM for what we needed (not to mention re-using around 120,000 lines of "C" code for the actual backup engine components), and no more (and no less). The ORM took me a whole five (5) days to write. Five. Days. That's it. Granted, half of that is because of Python and the introspection it allows as part of the very definition of the language. But. Five days. That's how much you save by using Spring/Hibernate over using a language such as Ruby or Python that has proper introspection and doing your own object-relational mapping. Five days. And I submit that the costs of Spring/Hibernate are far, far worse, especially for the 20% of projects that don't map well onto the Spring/Hibernate model, such as virtually everything that I do (since I'm all about system level operations).
<p>
-ELG
Eric Lee Greenhttp://www.blogger.com/profile/12350104299041375832noreply@blogger.com0tag:blogger.com,1999:blog-2890503559617005376.post-85478551496765395752013-04-27T13:28:00.000-07:002013-04-27T13:28:00.327-07:00Configuring shared access for KVM/libvirt VM'sLibvirt has some nice migration features in the latest RHEL/CentOS 6.4 that let you move virtual machines from one server to another, assuming that your virtual machine disk images live on shared storage that all the servers can reach. But if you try it with VM's set to auto-start on server startup, you'll swiftly run into problems the next time you reboot your compute servers -- the same VM will try to start up on multiple compute servers.
<p>
The reality is that, unlike ESXi, libvirtd by default does *not* include any sort of locking. ESXi by default locks the VMDK file so that only a single virtual machine can use it at a time, meaning that a VM set to start up on multiple servers will only start on the one that wins the race. With libvirt you have to configure a lock manager to get the same behavior. In my case I configured 'sanlock', which has integration with libvirtd. So, on each KVM host configured to access the shared VM datastore /shared/datastore, install the packages:
<ul>
<li> yum install sanlock
<li> yum install libvirt-lock-sanlock
</ul>
Now set up sanlock to start at system boot, and start it up:
<ul>
<li> chkconfig wdmd on
<li> chkconfig sanlock on
<li> service wdmd start
<li> service sanlock start
</ul>
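It's worth confirming that both daemons actually came up before going further:
<ul>
<li> service wdmd status
<li> service sanlock status
</ul>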
On the shared datastore, create a locking directory, give it owner and group sanlock:sanlock, and set permissions so that anybody in group sanlock can write to it:
<ul>
<li> cd /shared/datastore
<li> mkdir sanlock
<li> chown sanlock:sanlock sanlock
<li> chmod 775 sanlock
</ul>
Finally, you have to update the libvirtd configuration to use the new locking directory. Edit /etc/libvirt/qemu-sanlock.conf so that it contains the following:
<ul>
<li> auto_disk_leases = 1
<li> disk_lease_dir = /shared/datastore/sanlock
<li> host_id = 1
<li> user = "sanlock"
<li> group = "sanlock"
</ul>
Everything else in the file should be commented out or a blank line. <i>The host ID must be different for each compute host</i>; I started counting at 1 and counted up for each additional host.
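For example, on the second compute host the file would be identical except for the host ID line:
<ul>
<li> host_id = 2
</ul>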
And edit /etc/libvirt/qemu.conf to set the lock manager:
<ul>
<li> lock_manager = "sanlock"
</ul>
(The line is probably already there, just commented out; un-comment it.)
At this point, stop all your VM's on this host (or migrate them to another host), and either reboot (to make sure all comes up properly) or just restart libvirtd with
<ul>
<li> service libvirtd restart
</ul>
Once you've done this on all servers, try starting up a virtual machine you don't care about on two different servers at the same time. The second attempt should fail with a locking error.
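Concretely, with a throwaway guest (call it testvm) defined on both hosts:
<ul>
<li> host1# virsh start testvm     (starts normally)
<li> host2# virsh start testvm     (should fail, because host1 already holds the sanlock lease on the disk)
</ul>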
At the end of the process it's always wise to shut down all your virtual machines and restart your entire compute infrastructure that's using the sanlock locking, to make sure everything comes up correctly. So-called "bounce tests" are painful, but they're the only way to be *sure* things won't go AWOL at system boot.
If you have more than three compute servers, I *strongly* suggest that you go to an OpenStack cloud instead, because this mechanism swiftly becomes unmanageable. At present the easiest way to deploy OpenStack appears to be Ubuntu, which has pre-compiled binaries on both its LTS and current distribution releases for OpenStack Grizzly, the latest production release of OpenStack as of this writing. OpenStack takes care of VM startup and shutdown cluster-wide and simply won't start a VM on two different servers at the same time. But that's something for another post.
-ELG
Eric Lee Greenhttp://www.blogger.com/profile/12350104299041375832noreply@blogger.com0