Wednesday, August 10, 2016

Security fail: FedEx.com

One of the first rules of security on the Internet for avoiding phishing attacks is that you never, ever, enter your user credentials into a web site unless you know that you're talking to that web site, and that web site alone -- not some spoofed web site. The way we do this in the modern era is with SSL encryption. HTTPS provides not only encryption of the content being transmitted between you and the web site, it also provides authentication. Only one server (well, one tightly controlled constellation of servers for bigger web sites) has the private key corresponding to the certificate being served to you (a certificate that can be validated against the global public key infrastructure). And that's the server that you think you're talking to. If you type, say, https://www.google.com, you will be on the Google.com homepage. Up in your browser URL bar will be a green lock. Click on that lock, and with some clicking around (it depends on the browser) you will be able to view the certificate and verify that you are, in fact, connected to the one and only Google.com home page owned by the one and only Google. Then, and only then, is it okay to hit the 'Login' link up at the top right of the page and log in.

So, this is how any of us who are concerned about security operate. We don't put in a user name and password unless we see that green lock, click on it, and it says we're talking to who we think we're talking to. This is because of a hacker technique called DNS poisoning, where attackers convince your local name servers that their server, not the real Google.com web server, is where you should go when you ask for https://www.google.com. They then intercept the user name and password that you enter. They can convince your DNS to hand out the wrong address, but they don't have Google's private key, so they can't impersonate Google -- so you won't get that green lock. But they hope you won't notice. You should.
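
For the curious, the check your browser performs behind that green lock can be sketched in a few lines of Python -- a toy illustration only (the host name is just an example), not a replacement for your browser's PKI machinery:

#!/usr/bin/env python3
# Toy sketch of what the green lock means: make a TLS connection, let the
# default trust store validate the certificate chain, and confirm that the
# certificate actually names the host we asked for.
import socket
import ssl

def check_site(host, port=443):
    ctx = ssl.create_default_context()   # verifies the chain *and* the hostname
    with socket.create_connection((host, port)) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
            print("Connected to", host, "-- certificate subject:", cert["subject"])

check_site("www.google.com")   # a spoofed or misdirected host fails certificate verification here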

This attack is called phishing, and it is used to filch your user name and password, which are then used in an automated fashion on other web sites. Usually after your DNS is poisoned, you get an email telling you to go to http://some.web.site.com because your password is expiring and your account will be deleted. So I got what looked like a phishing email from FedEx.com saying that my account was going to be deleted because I hadn't logged in for over a year, unless I logged in within the next two weeks. This actually would be normal for FedEx -- the only reason to ever log in to their web site is to set up your notification preferences that tell you that a package is on the way, has been delivered, and so forth. So, with the possibility that it might actually be a valid email, I manually typed https://www.fedex.com into my browser's URL bar (*never* click on a URL in email! Never!), hit the ENTER key... and immediately got kicked out to an unencrypted site.

At which point my reaction was, "WTF? Have hackers hacked the FedEx web site and are grabbing user credentials?" But it appears that's not the case. Resolving the address with host and whois shows that it points into Akamai's site acceleration service. Instead, it looks like pure rank incompetence. FedEx is deliberately putting their customers' user names and passwords at risk because... why? Well, apparently because they're too stupid to know how to implement SSL in an Akamai-distributed architecture -- despite the fact that Akamai has explicitly supported SSL for years.

So anyhow, I use LastPass, so my password was random gibberish in the first place; after examining the source code of the web page to see if there were obvious problems, I logged in. At that point the web site did put me into a proper SSL-encrypted web page. But the point... the point... is that I should never have had to enter my user name and password into a plain-text unencrypted web page in the first place. There's no -- zero -- way to authenticate that you are actually talking to the site you thought you were talking to if that site isn't served over https. DNS poisoning attacks are ridiculously easy and could have sent me *anywhere*. The only reason I felt even halfway safe talking to this web site was that LastPass had generated a random gibberish password for it a year ago, so if someone *did* steal my FedEx credentials, at least they could only be used to hack my FedEx account, which would be no big deal (I don't have credit card information or billing information associated with the account, it's strictly an informational account). But still. Bad FedEx. Bad, bad FedEx. Bad DevOps team, no cookie, go to your room!

The takeaway from this:

  1. Check those green locks. They're important. The lock tells you a) whether you're talking to the web site you think you're talking to (if it's there, you know you are; if it's not, you don't know), and b) that any user names and passwords you enter will go to the web site via an encrypted connection.
  2. Any web site that your company provides should only ask for user names and passwords on an SSL-encrypted https page. If someone tries to go there with a plain http: URL, they should immediately be forwarded to the SSL site (see the sketch after this list).
  3. If you don't follow that last rule, you will be publicly shamed, if not by me, by someone. And a public shaming is never good for your brand.
  4. Furthermore, if you don't follow that last rule, there are many potential customers who will simply refuse to use your web site. This perhaps is not a big deal for FedEx, since most of their customers have no choice but to use their web site to schedule package pickups, but if you're providing a web service to the general public? Dude. You are leaving money on the table if you do something stupid like ask for a username and password on an unencrypted page.
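
To illustrate rule 2: the redirect is a one-line rule in most web servers and load balancers, but as a minimal sketch (toy code with a placeholder host name -- not FedEx's actual setup, obviously), a plain-HTTP listener whose only job is to bounce every request to the HTTPS site looks like this:

#!/usr/bin/env python3
# Minimal sketch: an HTTP listener that 301-redirects everything to the HTTPS site.
# In production this is normally a one-liner in your web server or load balancer
# configuration; the host name below is just a placeholder.
from http.server import BaseHTTPRequestHandler, HTTPServer

HTTPS_HOST = "www.example.com"   # placeholder -- substitute your real site

class RedirectToHTTPS(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(301)
        self.send_header("Location", "https://%s%s" % (HTTPS_HOST, self.path))
        self.end_headers()

    do_POST = do_GET   # never accept credentials over plain HTTP either

if __name__ == "__main__":
    HTTPServer(("", 80), RedirectToHTTPS).serve_forever()
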
This isn't rocket science, people. This is just basic Web Services 101. Get it right, or choose a different profession. I understand that McDonald's is hiring. Just sayin'.

-ELG

Sunday, September 27, 2015

SSD: This changes everything

So someone commented on my last post, where I predicted that providing block storage to VMs and object storage for apps was going to be the future of storage, and he pointed out some of the other ramifications of SSD. To wit: because SSD removes a lot of the I/O restrictions that have held back applications in the past, we are now at the point where CPU, in many cases, is the restriction. This is especially true since Moore's Law has seemingly gone AWOL. The Westmere Xeon processors in my NAS box on the file cabinet beside my desk aren't much slower than the latest Ivy Bridge Xeon processors. The slight bump in CPU speed is far exceeded by the enormous bump in IOPS that comes with replacing rotational storage with SSDs.

I have seen this myself, watching a Grails application max out eight CPU cores while barely budging the I/O meter on a database server running off of SSDs. What that implies is that the days of simply throwing CPU at inefficient frameworks like Grails are numbered. In the future, efficient algorithms and languages are going to come back into fashion to make use of all this fast storage that is taking over the world.

But that's not what excites me about SSDs. That's just a shuffling of priorities. What excites me about SSDs is that they free us from the tyranny of the elevator. The elevator is the requirement that we sweep the disk drive heads from bottom to top, then from top to bottom, in order to optimize reads. This in turn puts some severe restrictions on how we lay out block storage -- the data must be stored contiguously so that filesystems layered on top of the block storage can properly schedule I/O out of their buffers to satisfy the elevator. This in turn means we're stuck with the RAID write hole unless we have battery-backed cache -- we can't do COW RAID stripe block replacement (that is, write the altered blocks of a RAID stripe at some new location on the device, then alter a stripe map table to point at those new locations and add the old locations to a free list) because a filesystem on top of the block device would not be able to schedule the elevator properly. The performance of the block storage system would fall over. That is why traditional iSCSI/Fibre Channel vendors present contiguous LUNs to their clients.
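
For the record, the copy-on-write stripe replacement I'm describing is nothing exotic. A toy sketch of the idea (illustrative only, with made-up names -- a real implementation also has to track parity, generations, and crash consistency):

# Toy sketch of COW stripe replacement: modified data goes to a freshly
# allocated physical stripe, the map is repointed, and the old stripe
# goes back on the free list. 'device' is a hypothetical object with a
# write_at(physical_stripe, data) method.
class StripeMap:
    def __init__(self, total_stripes):
        self.map = {}                              # logical stripe -> physical stripe
        self.free = list(range(total_stripes))     # physical stripes available for allocation

    def write_stripe(self, device, logical, data):
        new_phys = self.free.pop(0)                # allocate a new physical location
        device.write_at(new_phys, data)            # write the modified stripe there first
        old_phys = self.map.get(logical)
        self.map[logical] = new_phys               # then repoint the map
        if old_phys is not None:
            self.free.append(old_phys)             # the old location becomes reusable

The catch, as noted above, is that a filesystem sitting on top of this sees its once-contiguous blocks scattered all over the device -- fatal for a rotating-disk elevator, a non-issue for SSD.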

As a result, when we've tried to do COW in the past, we did it at the filesystem level so that the filesystem could properly schedule the elevator. Thus ZFS and BTRFS. They manage their own redundancy rather than relying on RAID at the block layer, and ideally want to manage the block devices directly. Unfortunately that really doesn't map well to a block storage back end that is based on LUNs, and furthermore, doesn't map well to virtual machine block devices represented as files on the LUN -- virtual machines all have their own elevators doing what they think are sequential ordered writes, but the COW filesystems are writing at random places, so read performance inside the virtual machines becomes garbage. Thus VMware's VMFS, an extent-based clustered filesystem that, again, due to the tyranny of the elevator, keeps the blocks of a virtual machine's virtual disk file located largely contiguously on the underlying block storage so that the individual virtual machines' elevators can schedule properly.

So VMFS talking to clustered block storage is one way of handling things, but then you run into limits on the number of servers that can talk to a single LUN. That in turn makes things difficult to manage, because you end up with hundreds of LUNs for hundreds of physical compute servers and have to schedule the LUNs so they're only active on the compute servers that have virtual machines on that specific LUN (in order to avoid hitting the limit on the number of servers allowed to access a single LUN). What is needed is the ability to allocate block storage on the back end on a per-virtual-machine basis, and to have the same capabilities on that back end that VMFS gives us on a single LUN -- the ability to do snapshots, the ability to do sparse LUNs, the ability to copy snapshots as new volumes, and so forth. And have it all managed by the cloud infrastructure software. This was difficult back in the days of rotational storage, when we were slaves of the elevator and had to make sure that all this storage ended up contiguous. But now we don't -- the writes have to be contiguous, due to the limitations of SSD, but reads don't. And it's the reads that forced the elevator -- scheduling contiguous streams of writes (from multiple virtual machines / multiple files on those virtual machines) has always been easy.

I suspect this difficulty in managing VMFS on top of block storage LUNs for large numbers of ESXi compute servers is why Tintri decided to write their own extent-based filesystem and serve it as an NFS datastore to ESXi boxes, rather than as block storage LUNs. NFS doesn't have the limits on the number of computers that can connect. But I'm not convinced that, going forward, this is going to be the way to do things. vSphere is a mature product that has likely reached the limits of its penetration. New startups today are raised in the cloud, primarily on Amazon's cloud, and they want a degree of flexibility in spinning virtual machines up and down that makes life difficult with a product that has license limits. They want to be able to spin up entire test constellations of servers to run multi-day tests on large data sets, then destroy them with a keystroke. They can do this with Amazon's cloud. They want to be able to do this on their local clouds too. The future is likely to be based on the KVM/QEMU hypervisor and virtualization layer, which can use NFS data stores but which already has the ability to present an iSCSI LUN to a virtual machine as a block device. Add in some local SSD caching at the hypervisor level to speed up writes (as I explained last month), and you have both the flexibility of the cloud and the speed of SSD. You have the future -- a future that few storage vendors today seem to see, but one that the block storage vendors in particular are well equipped to capture if they're willing and able to pivot.

Finally, there is a question as to whether storage and compute should be separate things altogether. Why not have compute in the same box as your storage? There are two problems with that, though: 1) you want to upgrade compute capability to faster processors on a regular basis without disrupting your data storage, and 2) the density of compute servers is much higher than the density of data servers, i.e., you can put four compute blades into the same 2U space as a 24-bay data server. And as pointed out above, compute power, not IOPS, is now going to be the limiting factor for many applications. Finally, you want the operational capability to add more compute servers as needed. When our team used up the full capacity of our compute servers, I just added another compute server -- I had plenty of storage. Because the demand for compute and memory just keeps going up as our team has more combinations of customer hardware and software to test, it's likely I'm going to continue to have to scale compute servers far more often than I have to scale storage servers.

So this has gone on much too long, but the last thing to cover is this: will storage boxes go the way of the dodo bird, replaced by software-defined solutions like Ceph running on top of large numbers of standard Linux storage servers serving individual disks as JBODs? It's possible, I suppose -- but it seems unlikely, due to the latency of having to locate disk blocks scattered across a network. I do believe that commodity hardware is going to win everything except the high-end big iron database business in the end, because the performance of commodity hardware has risen to the point where it's pointless to design your own hardware rather than purchase it off the shelf from a vendor like Supermicro. But there is still going to be a need for a storage stack tied to that hardware, because pure software-defined solutions are unable to do rudimentary things like, e.g., use SES to blink the LED of a disk bay whose SSD has failed. In the end, providing an iSCSI LUN directly to a virtual machine requires both a software support side that is clearly software defined, and a hardware support side where the hardware is managed by the solution. This in turn implies that we'll continue to have storage vendors shipping storage boxes in the future -- albeit storage boxes that will incorporate increasingly large amounts of software that runs on infrastructure servers to define important functions like, e.g., spinning up a virtual machine that has a volume attached with a given size and IOPS guarantee.

-ELG

Tuesday, August 25, 2015

Where does the future of enterprise storage lie?

I've talked about how traditional block and NAS storage isn't going away for small businesses. So what about enterprise storage? In the past few years, we've seen the death of multiple vendors of scale-out block storage, two of which -- Coraid and Intransa -- were of interest to me. Both allowed chaining together large numbers of Ethernet-connected nodes to scale out storage across a very large array (the biggest cluster we built at Intransa had 16 nodes and a total of 1.5 petabytes of storage, but the theoretical limits of the technology were significantly higher). The reality is that they had been on life support for years, because the 1990's and 2000's were the decades of NAS, not of block storage. Oh, EMC was still heaving lots of big iron block storage over the wall to power big databases, but most applications of storage other than those big corporate data marts were NAS applications, whether it was Windows and Linux NAS servers at the low end or NetApp NAS servers at the high end.

NAS was pretty much a necessity back in the era of desktops and individual servers. You could mount people's home directories on a CIFS or NFS share (depending on their OS). People could share their files with each other by simply copying them to a shared directory. You saw block storage devices being exported to these desktops via iSCSI sometimes, but usually block storage devices were attached to physical servers in the back room on dedicated storage networks that were much faster than floor networks. The floor networks were fast enough to carry CIFS, but CIFS at its core is just putting and getting objects, not blocks, and can operate much more asynchronously than a block device and thus wasn't killed by latency the way iSCSI is.

But there are problems too. For one thing, every single device has to be part of a single login realm or domain of some sort, because that's how you secure connections to the NAS. Furthermore, people have to be put into groups, and access set on portions of the overall NAS cloud based on what groups a person belongs to. That was difficult enough in the days when you just had to worry about Linux servers and Windows desktops. But now you have all these devices on the network that were never designed to join a domain in the first place.

Which brings up the second issue with NAS -- it simply doesn't fit into a device-oriented world. Devices typically operate in a cloud world. They know how to push and pull objects via http, but they don't speak CIFS or NFS, and never will. Increasingly, we are operating in a world that isn't file based, it's object based. When you go into Google Docs to edit a spreadsheet, you aren't reading and writing a file. You're reading and writing an object. When you are running an internal business application, you are no longer loading a physical program and reading and writing files. You're going to a URL for a web app that most likely is talking to a back end database of some kind to load and store objects.

Now, finally, add in what has happened in the server room. You'll still see the big physical iron for things like database servers. But by and large the remainder of the server room has gone away, replaced by a private cloud, or pushed into a public cloud like Amazon's cloud. Now when people want to put up a server to run some service they don't call IT and work up a budget and wait months for an actual server to be procured etc., they work at the speed of the cloud -- they spin up a virtual machine, they attach block storage to it for the base image and for any database they need beyond object storage, and they implement whatever app they need to implement.

What this means is that block storage and object storage integrated with cloud management systems like OpenStack are the future of enterprise storage -- a future that, alas, did not arrive soon enough for the vendors of scale-out block storage that survived the previous decade, who ended up without enough capital to enter this brave new world. NAS won't go away entirely, but it will increasingly be a departmental thing feeding desktops on the floor, not something that anything in the server room uses. And that is, in fact, what you see happening in the marketplace today. You see traditional big iron vendors like HDS increasingly pushing object storage, and the new solid-state storage vendors such as Pure Storage and SolidFire are predominantly block storage vendors selling into cloud environments.

So what does the future hold? For one thing, lower latencies via hypervisor integration. Exporting a LUN via iSCSI and then mounting it from the hypervisor has all of the usual latency issues of iSCSI. Even with 10 gigabit networking now hitting affordability and 25 to 100 gigabit Ethernet on the horizon, latency is a killer if you're expecting a full round trip. What if writes were cached on a local SSD array, in order, and applied in order? For 99% of the applications out there this provides all the write consistency that you need. The cache will have to be turned off prior to migrating the virtual machine to a different box, of course -- thus the need for hypervisor integration -- but short of a catastrophic failure (where the virtual machine goes lights out too, and thus doesn't have inconsistent data when it is restarted on another node) you will, at worst, have some minor data loss -- much better than inconsistent data.
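
As a sketch of the idea (toy code with invented names; a real implementation has to worry about crash recovery, barriers, and invalidating the cache before migration), the essential property is simply that writes are journaled locally in arrival order and replayed to the back end in that same order:

# Toy sketch of an ordered local write cache: acknowledge a write once it is
# on the local (fast, SSD-backed) journal, and have a background flusher apply
# the journal to the remote block device strictly in arrival order.
# 'backend' is a hypothetical object with a write_at(offset, data) method.
import collections
import threading

class OrderedWriteCache:
    def __init__(self, backend):
        self.backend = backend
        self.journal = collections.deque()          # pending writes, oldest first
        self.lock = threading.Lock()

    def write(self, offset, data):
        with self.lock:
            self.journal.append((offset, data))     # "acknowledged" once journaled locally

    def flush(self):
        while True:
            with self.lock:
                if not self.journal:
                    return
                offset, data = self.journal.popleft()   # strictly in arrival order
            self.backend.write_at(offset, data)         # apply to the remote iSCSI device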

So: Block storage with hypervisor and cloud management integration, and object storage. The question then becomes: Is there a place for the traditional dedicated storage device (or cluster of devices) in this brave new world? Maybe I'll talk about that next, because it's an interesting question, with issues of data density, storage usage, power consumption, and then what about that new buzzword, "software defined storage"? Is storage really going to be a commodity in the future where everybody's machine room has a bunch of generic server boxes loaded with someone's software? And what impact, exactly, is solid state storage having? Interesting things to think about there...

-ELG

Saturday, August 1, 2015

The quest for an integrated storage stack

In prior posts I've mentioned the multitude of problems with the standard Linux storage stack. It's inflexible -- once you've set up a stack (usually LV->VG->PV->MD->BLOCK) and opened a filesystem on it, you cannot modify it to, e.g., add a replication layer to the stack. It lacks the ability to do geographic replication in any reasonable fashion. The RAID layer in particular lacks the ability to write to (and replay) a battery-backed RAM cache to deal with the RAID 5 write hole (which, despite its name, also applies to other RAID levels and results in silently corrupted data). Throw iSCSI into this equation to provide block devices to virtual machines and, potentially, to do replication to block devices on other physical machines, and things get even more complex.

One method that has been proposed to deal with these issues is to simply not use a storage stack at all. Thus we have ZFS and BTRFS, which attempt to move the RAID layer and logical volume layers into the filesystem. This certainly solves the problem of corrupted data, but at a significant penalty in terms of performance, especially on magnetic media where the filesystem swiftly becomes fragmented. As a result running virtual machines using "block devices" that are actually files on a BTRFS filesystem results in extremely poor "disk" performance on the virtual machines. A file on a log-based subsystem is simply a poor substitute for an extent on a block device. Furthermore, use of these filesystems for databases has proven to be woefully slow compared to using a normal filesystem like XFS on top of a RAID-10 layer.

The other method that has been proposed is to abandon the Linux storage stack except as a provider of individual block devices, and instead layer a distributed system like Ceph on top of it. My tests with Ceph have not been particularly promising. Performance of Ceph block devices at the individual virtual machine level was abysmal. There appear to be three reasons for this: 1) overly pessimistic assumptions about writes on the part of Ceph, 2) the inherent latencies involved in a distributed storage stack, and 3) the fact that Ceph reads/writes via XFS filesystems layered on top of block devices, rather than to extents on raw block devices. For the latter, in my experience you will see *at least* a 10% degradation in virtual machine block device performance if the block device is implemented as a file on top of XFS rather than directly on an LVM extent.

In both cases, I wonder if we are throwing out the cart because the horse has asthma. I've worked as a software engineer for two of the pioneers of Linux-based storage -- Agami Systems, which did a NAS device with an integrated storage system, and Intransa Inc., which did scalable iSCSI storage systems with an integrated block storage subsystem. Both suffered the usual fate of pioneers -- i.e., face down dead with arrows in the back, though it took longer with Intransa than with Agami. Both wrote storage stacks for Linux which solved most of the problems of the current Linux storage stack, though each solved a different subset of those problems. There are still a significant number of businesses which do not need the expense and complexity of a full OpenStack data center in order to solve their problems, but which do need things like, e.g., logged geographic replication to replicate their data to an offsite location, something which Intransa solved ten years ago (but which, alas, died with Intransa), or real-time snapshots of virtual machine block devices at the host device level, or ...

In short: despite the creation of distributed systems like Ceph and integrated storage management filesystems like BTRFS, there is a significant need for an integrated storage stack for Linux -- one that allows flexibility in configuring both block devices and network filesystems, that allows for easy scalability and management, that has modern features such as logged geographic replication and battery-backed RAM cache support (or at least fast SSD log device support at the MD layer), and that allows dynamic insertion of components into the software stack, much as you could create a replication layer in the Intransa StorStac and have it sync and then replicate to a remote device without ever unmounting any filesystem or making the iSCSI target inaccessible. There is simply a large number of businesses which just don't need the expense and complexity of a full OpenStack data center, which indeed don't need more than a pair of iSCSI / NAS storage appliances (a pair in order to handle replication and snapshotting), and the current Linux storage stack lacks fundamental functionality that was implemented over a decade ago but never integrated into Linux itself. It may not be possible to bring all the concepts that Agami and Intransa created into Linux (though I'll point out that all of Intransa's patents are now owned by a patent entity that allows free use for Open Source software), but we should attempt to bring as many of them as possible into the standard Linux storage stack -- because the cloud is the cloud, but most smaller businesses have no need for the cloud, they just need reliable local storage for their local physical and virtual machines.

-ELG

Saturday, April 4, 2015

Network monitoring is not for amateurs

So, another blogger recently posted that network monitoring was easy. All you needed to do was deploy off-the-shelf network monitoring tools X, Y, and Z, and your network would be monitored, and you would be able to sleep well at night knowing that your network was working well.

At which point I have to stop and say... what... the.... bleep?

I've been doing this a *long* time. In this or prior jobs (where I was a staff engineer at a firewall company, in particular) I've deployed Nagios, MRTG, Zenoss, OpenNMS, Big Brother, and a variety of proprietary network monitoring tools such as PRTG and combined firewall/IDS tools such as Checkpoint, as well as specialty tools such as smartd, arpwatch, and snort. And what I have to say is that monitoring networks is *hard*. Pretty much any of the big network monitoring tools (other than Nagios, whose configuration is utterly non-automated) will take you 80% of the way to effective monitoring of your network with a few mouse clicks. But to get the extra 20% needed to make sure that you're immediately notified if anything your clients rely on is not providing services, you end up having to install plugins, write sensors, and combine multiple tools. It's a lengthy task that can require days of research to solve just 1% of that last 20%.

Here's an example. I was having issues where one of my iSCSI servers was dropping off the network for thirty seconds or so. I'd learn about it because my users would start howling that their Linux virtual machines were no longer functioning properly. I investigated, and found that when Linux cannot write to a volume for a certain amount of time, it then marks the volume read-only. Which, if the volume is the root volume on a Linux virtual machine, means things go to bleep in a handbasket.

And Nagios reported no problems during all this time. All the normal Nagios sensors for system load, disk space free, memory free, and so forth showed that the virtual machine was operating completely normally. The virtual machine was pingable, it was responding to SNMP queries, it looked completely healthy to any network monitoring tool in the universe. So I ended up writing a Nagios sensor that 'touched' a file in /, and if it couldn't because the filesystem had been marked read-only, reported an error. I then deployed it to my virtual machines and added it to my list of sensors to monitor, and when my iSCSI server dropped offline and the filesystems went read-only, I'd get a Nagios alert. (Said iSCSI server, BTW, is not an Intransa iSCSI server, it's a generic SuperMicro box running Linux with the LIO iSCSI stack, but that's irrelevant to the conversation.) After some time of getting alerted when things went AWOL I managed to zero in on exactly the set of log file entries on the iSCSI server that indicated what was going on at the time things went to bleep, and discovered that I had a failing disk in the iSCSI system that was not triggering smartd. Unfortunately, figuring out which of those 24 disks was causing the SAS bus reset loop was non-trivial, given that smartctl was showing no problems when I looked at them, but once I tracked it down and replaced the disk, the problem was solved.
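
The sensor itself is nearly trivial -- something along these lines (a reconstruction for illustration, not the original script; Nagios only cares about the exit code and the first line of output):

#!/usr/bin/env python
# check_rootfs_rw: Nagios-style check that the root filesystem is still writable.
# Exit 0 = OK, exit 2 = CRITICAL, following the Nagios plugin convention.
import os
import sys

TEST_FILE = "/.nagios_rw_check"    # hypothetical path; any file on the filesystem under test will do

try:
    with open(TEST_FILE, "w") as f:
        f.write("ok\n")
    os.remove(TEST_FILE)
    print("OK: root filesystem is writable")
    sys.exit(0)
except (IOError, OSError) as e:
    print("CRITICAL: cannot write to /: %s" % e)
    sys.exit(2)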

So let's recap. I had a serious problem with my storage network. smartd, the host-based disk monitoring tool on the iSCSI box, couldn't detect it; all the disks looked fine when you looked at them with the tool it uses, smartctl. Nagios and all its prepackaged sensors couldn't detect it until I wrote a custom sensor. And this other blogger says that monitoring complex networks including clients, storage boxes, video cameras, and so forth is *easy*? Just a matter of installing packages X, Y, and Z and sitting back and relaxing?

Crack. That's my only explanation for anybody making such a statement. Either that, or utter inexperience at monitoring complex heterogeneous networks with multiple points of failure, of which only 80% or so are covered by off-the-shelf network monitoring tools, with the remainder requiring custom sensors to detect domain-specific points of failure. Either way, anybody who listens to that blogger and believes that monitoring a network effectively is easy is a fool. And that's all I will say on that subject.

-ELG

Friday, January 30, 2015

User friendly

Here's a tip:

If your Open Source project requires significant work to install, work that would be easily scriptable by a competent software engineer, I'm not going to use it. The reason I'm not going to use it is because I'm going to assume you're either an incompetent software engineer, or the project is in unfinished beta state not useful for a production environment. Because a competent software engineer would have written those scripts if the product was in a finished production-ready state. Meaning you're either incompetent, or your product is a toy of use for personal entertainment only.

Is this a fair assessment on my part? Probably not. But this is what 20 years of experience has taught me. Ignore that at your peril.

-ELG

Monday, December 15, 2014

The problem with standards

As part of a long rant about programming vs engineering, Peter Welch makes the statement, "standards are unicorns." Uhm, no. Just no. Unicorns are mythical. They don't exist. Standards do exist. They're largely useless, but they exist. I have linear feet of shelf space lined with the bloody things to attest to that.

So what's a better analogy for standards? I have a good one: Hurricane stew.

After a hurricane, the power is out. On every block in hurricane country there's a guy who has a giant vat used to boil crabs and crawfish and shrimp, which sits on a large burner powered by a big propane tank. You know, That Guy. So That Guy pulls out his vat and sets it up on the driveway in front of his carport and tosses in some water and the contents of his refrigerator before they can spoil. And his neighbors come by and toss in the contents of *their* refrigerators before they can spoil. And as people finish cleaning out their refrigerators they bring more and more random scraps and throw them in, until what you have is a big glop of soupy strangeness containing a little of every possible food item that could ever exist. And then people subsist on this hurricane stew, this random glop of indistinguishable odds and ends, for the next week or two as it continues to slowly bubble on the driveway of That Guy and continues to occasionally get new glop thrown in as people throw the contents of their freezers into it too.

That's standards -- just a big glop of indistinguishable mess created by everybody under the sun throwing their own scraps and odds and ends into it, in the end being of use to absolutely no one except the people who threw scraps into it, who end up subsisting off of it for the rest of their careers because nobody else has the slightest freakin' idea what's in that opaque bubbling mess with the oddly-colored mist wafting off of it. And every single implementation of this "standard" is different and does things in a different way in the thousands of edge cases that have been thrown into the standard as possible ways to do things, because that was what was in someone's refrigerator, er, code base at the time the standard was created and so they tossed it in before it could spoil -- driving anybody who has to write software that's "standards-compliant" slowly insane, to the point of stabbing a two-foot-tall printout of the standard with a knife repeatedly, over and over, while screaming "Die! Die! Die!" at the top of their lungs.

That's standards. That's what they're good for.

- ELG

Thursday, August 21, 2014

Performance tips for Grails / Hibernate batch processing

So I'm working on an application that does batch processing of records sent by client systems, and Groovy/Grails is the language/framework that the application is written with. This is a story of how it failed -- and how it was fixed.

Failure #1: Record sets sent via HTTP take too long to process, causing HTTP timeouts before a response can be returned to the client. Solution: Plop the record sets into a batch queue instead, and process them via a batch queue runner running as a Quartz job.

Failure #2: Hibernate/Grails optimistic locking is, well, overly optimistic. As in, if I have multiple EC2 instances processing batch queues, I have to hope and pray that two different instances don't attempt to process the same set of records at the same time, or else they'll both fail and roll back at some point and my batch queue will never get emptied. Meanwhile, Hibernate's fine-grained locking is too fine-grained, and ends up causing deadlocks.

Solution: Create a locking system (via your database or via memcached or whatever, doesn't matter as long as it serializes access) and divide your database records into logical non-overlapping sets. Then lock those logical sets at a higher level prior to processing a batch that touches that particular set. For example, if you're batch processing store records at Walmart central office, a logical set might be an individual store and all its individual inventory items.

Note that this requires *very* careful schema layout to ensure that things that can be changed by the end user interface do not get overwritten by the batch processor, unless you *want* them to get overwritten by the batch processor. But it's doable.

Failure #3: The Hibernate session consumes all of memory, crashing the application.

Solution: We're doing batch processing, so each record set runs for a significant amount of time (30 seconds or more) with tens of thousands of operations. This means we can let each record set have its own session. For each record set processed by the application, create a new session. Flush that session then destroy it at the end of each record set. For example:

def batch
while ((batch = getNextBatch()) != null) { // getNextBatch() returns non-Hibernate objects, typically parsed from JSON or EDI
    Store.withNewSession { session ->
        ... process batch here ...
        session.flush()
    }
}

Failure #4: Multi-threaded performance slammed into a brick wall at the Hibernate query cache.

Solution: In general, the Hibernate caches are a performance hindrance when batch processing. The number of records that you process over the course of running all of your queues is far larger than the amount of memory you have, so any cached database records from the beginning of the queue run are long gone by the time the queue gets re-filled and you start over at the beginning again. Furthermore, the query cache is single-threaded, so if you're running on a modern multi-core processor and using multiple threads to consume its resources, you might as well be running on an 80386 -- performance is going to top out at less than two threads' worth. So disable the caches in the 'hibernate' block in your config/DataSource.groovy file and instead manually cache any items that you need to cache within batches or across batches:

hibernate {
    cache.use_second_level_cache = false
    cache.use_query_cache = false
      .... other options here ....
}

Failure #5: Lots of small queries kill performance.

For example, a store might send its nightly inventory records. The nightly inventory records update the quantities for each inventory item, which in turn create ordering alerts when inventory has fallen below a certain level. You know ahead of time that a) the number of inventory records is limited (figure 40,000 different items per store), and b) 75% of the items are going to be modified. So: doing things the inefficient way, you'd do:
inventory_batch.each { rec=Inventory.findByStoreAndItemNum(store,it.itemnum) ; rec.quantity=it.quantity; rec.save() }
But that results in 40,000 queries to the database, each of which has an enormous amount of Hibernate overhead associated with it.

Solution: Cache the entire set of items beforehand (using a HashMap and a cache class to wrap it), and fetch them from the cache instead. For example, assuming you've created an 'InvCache' class that caches inventory items:

  rec_set = Inventory.findAllByStore(store)
  inv_cache = new InvCache(rec_set)
  inventory_batch.each {
      rec = inv_cache.findByItemNum(it.itemnum) // looks it up in a hashmap, and if not there, adds it to the database.
      rec.quantity = it.quantity
      rec.save()  // in a real application, you'd check the result of save and print validation errors.
      // in a real application you'd also check quantity against limits and issue an inventory alert if inventory is too low.
  }

Note that rec.save() does not immediately update the record; it merely marks the record as dirty, and the next time Hibernate flushes, it will issue a SQL query to do the update. You still end up issuing 30,000 update statements, but that's still better than issuing 40,000 selects plus 30,000 updates, and they're all issued in a single batch rather than via multiple Hibernate calls preparing statements, etc.

Failure #6: Flushes in big Hibernate sessions kill performance.

Some stores have a big inventory. It can take several seconds to flush the Hibernate session due to Hibernate's extremely inefficient algorithm for determining what needs to be flushed (it tries to trace the entire relationship structure multiple layers deep, so the cost curve is exponential, not linear). By default the Hibernate session gets flushed before virtually every query that you make to the database, meaning that if you have to do 500 queries against the database in the course of processing to handle things not easily cached as above, you will have 500 flushes. 500 flushes times 5 seconds per flush is over 41 minutes worth of flushing. EEP!

Solution #1: Don't use Hibernate's built-in flushing and transaction ordering system. Do your own, because most of what you're doing is either batch appends of log records (where you're never going to query them back out again in the process of doing the batch, thus don't care when they actually get flushed) or updates of records where, again, you really don't care when they're flushed. So: switch the flush mode to 'manual' and flush only when necessary to maintain relational ordering, and otherwise flush only at the end of logical batches. For example, if the store manager has added a new InventoryItem, and this new InventoryItem is referenced by a new InventoryAlert to note that this item needs to be ordered, the order will be: create the new InventoryItem, use item.save(flush:true) to flush the session, add it to the inventory cache if it's going to be used for other things, then create the new InventoryAlert. There is no need to use flush:true on the InventoryAlert because you don't care when it actually gets flushed, you care only that the InventoryItem gets saved before the InventoryAlert that references it. Hibernate is supposed to handle the dependency order here, if you properly set up your Grails objects... but sometimes it doesn't, as I've previously noted.

Note that setting the flush.mode in the hibernate{} block in DataSource.groovy will not set the flush mode to 'manual' in the session we created earlier. It will get set to 'auto' or 'commit' by Grails depending on whether you're in an @Transactional service when you create the new session; Grails ignores the Hibernate value. You'll need to explicitly set the flush mode when you create the new session:

import org.hibernate.FlushMode
...
Inventory.withNewSession { session ->
    session.setFlushMode(FlushMode.MANUAL)
    .... do processing here ....
}

Solution #2: In many cases, we are creating new records in batches. For example, cash register logs. So: Create a bunch of new records, flush them to disk, then discard them from the session in order to keep the session size down. For example:

registers.each { register ->
   LogEntry.withSession { session ->
     entry_list=[]
     register.logs.each { logentry -> 
         entry = new LogEntry(logentry)  // creates it from hash
         ... do any other processing / initialization for entry here ...
         entry.save() // would validate/check return val in real app
         entry_list.add(entry)
     }
     session.flush() // flush the 5,000 register logs for this register to disk.
     entry_list.each { entry -> 
        entry.discard() // get rid of the 5,000 register logs for this register.
     }
   }
}

Conclusion

Hibernate has a deserved reputation as an inefficient ORM that is not well suited for high performance operations. This is primarily because its standard settings are appropriate for only a small subset of the possible problem space, and are utterly inappropriate for batch processing. Its session management is incapable of handling sessions with large numbers of objects in a timely manner, and its caches actually make many applications slower rather than faster. However, by applying the above to the application in question, I successfully reduced the processing time for the target largest batch sent to our system from being over 60 minutes to 3 minutes, which is roughly five times faster than it's required to be in order to meet our performance requirements. Yes, a factor of 20 times improvement. You can make Hibernate perform. The batch processor could have been made even faster by dropping down to doing raw SQL in Java, but it would have taken a factor of 20 times longer to write too.

In the end it's all about tradeoffs. Hibernate sucks, but in this case, given the deadlines and time pressures and the fact that this was the back end of a large code base already written with Groovy/Grails/Hibernate, it was the best of a batch of poor solutions. The ideal is sometimes the enemy of the good enough. If we hit a problem set large enough that we cannot handle it with the technology we're using, then we'll drop down to lower-level / faster technologies such as raw Java EE and raw SQL (probably via something like MyBatis to intermediate, for sanity's sake). In the end, however, in most applications there are other problems worth solving once performance is "good enough". So don't let Hibernate's poor performance scare you off if it's the solution to getting a product out the door in a timely manner. That is, after all, the goal -- and for most applications, Hibernate can be made fast enough.

-ELG

Friday, May 2, 2014

Behold the XKCD Passphrase Generator

Behold the XKCD Passphrase Generator. Copy and paste it into a file on your own Linux machine and run it (assuming you've installed the 'words' package, which is almost always the case). It'll pick five random words and concatenate them together. It should also run on other machines with Python installed, but you may need to find a words file somewhere and edit accordingly. If this were a real program I'd add parameters, yada yada, but since it's just a toy...
#!/usr/bin/python
# XKCD passphrase generator.
# See XKCD 936 http://xkcd.com/936/
# You'll have to provide your own 'words' file. One word per line.
# Unix based systems usually have /usr/share/dict/words but you'll need
# to get that from somewhere else for Windows or etc.
# After editing wordsfile, numwords, separator:
# Execute as: python genpf.py  

import os

wordsfile="/usr/share/dict/words"
numwords = 5
separator = ".%*#!|"

def gen_index(n):
    i=(ord(os.urandom(1)) << 16) + (ord(os.urandom(1)) << 8) + (ord(os.urandom(1)))
    return i%n

f=open(wordsfile)

pf=""
words = f.readlines()

i=gen_index(len(separator))
c=separator[i]
while (numwords > 0):
    i=gen_index(len(words))
    s=words[i].strip()
    if (pf==""):
        pf=s
    else:
        pf=pf+c+s
        pass
    numwords = numwords - 1

print(pf)

Thursday, February 20, 2014

Hibernate gotchas: ordering of operations

Grails / GORM was throwing a Hibernate error from time to time:

org.hibernate.StaleStateException: Batch update returned unexpected row count from update [0]; actual row count: 0; expected: 1

What was confusing was that there was no update anywhere in the code in question, which was a queue runner. The answer to what was causing this was interesting, and says a lot about Hibernate and its (in)capabilities.

The error in question was being thrown by the transaction flush in a queue runner. The queue lives in a Postgres database: each site gets locked, its queue is run (with each queued-up item deleted after it is processed), and then the site gets unlocked.

The first problem arose with the lock/unlock code. There was a race condition, clearly, when two EC2 instances tried to lock the same site at the same time. The way it was originally implemented with Hibernate, the first instance would create its lock record, then flush the transaction, then re-query to see whether there were other lockers with a lower ID holding a lock on the object. If so, it would release the lock by deleting its lock record. Meanwhile the other instance finished processing that queue and released all locks on that site. So the first instance would go to delete its lock, find that it had already been deleted, and throw that exception.

Once that was resolved, the queue runner itself started throwing the exception occasionally when the transaction was flushed after the unlock. What was discovered by turning on Hibernate debug logging was that Hibernate was re-ordering operations so that the unlock got applied to the database *before* the deletes. So the site would get unlocked, another queue runner would then re-lock the site to itself and start processing the same records that had previously been processed, then go to delete the records, and find that the records had already been deleted out from under it. Bam.

The solution, in this case, was to rewrite the code to use the Groovy SQL API rather than use GORM/Hibernate.

What this does emphasize is that you cannot rely on operations being executed by Hibernate in the same order in which you specified them. For the most part this isn't a big deal, because everything you're operating on is going to be applied in a reasonable order so that the dependencies get satisfied. E.g., if you create a site and then a host inside that site, the site record will get created in the database before the host record. But if ordering matters... it's time to use a different tool. Hibernate isn't it.

-ELG