Monday, December 15, 2014

The problem with standards

As part of a long rant about programming vs engineering, Peter Welch makes the statement, "standards are unicorns." Uhm, no. Just no. Unicorns are mythical. They don't exist. Standards do exist. They're largely useless, but they exist. I have linear feet of shelf space lined with the bloody things to attest to that.

So what's a better analogy for standards? I have a good one: Hurricane stew.

After a hurricane, the power is out. On every block in hurricane country there's a guy with a giant vat used to boil crabs and crawfish and shrimp, which sits on a big burner fed by a large propane tank. You know, That Guy. So That Guy pulls out his vat, sets it up on the driveway in front of his carport, and tosses in some water and the contents of his refrigerator before they can spoil. And his neighbors come by and toss in the contents of *their* refrigerators before they can spoil. And as people finish cleaning out their refrigerators they bring more and more random scraps and throw them in, until what you have is a big glop of soupy strangeness containing a little of every possible food item that could ever exist. And then people subsist on this hurricane stew, this random glop of indistinguishable odds and ends, for the next week or two as it continues to slowly bubble on That Guy's driveway and occasionally gets new glop thrown in as people empty their freezers into it too.

That's standards -- just a big glop of indistinguishable mess created by everybody under the sun throwing their own scraps and odds and ends into it, in the end being of use to absolutely no one except the people who threw scraps into it, who end up subsisting off of it for the rest of their careers because nobody else has the slightest freakin' idea what's in that opaque bubbling mess with the oddly-colored mist wafting off of it. And every single implementation of this "standard" is different, handling the thousands of edge cases that got thrown into the standard as possible ways of doing things in its own different way -- because that was what was in someone's refrigerator, err, code base at the time the standard was created, so they tossed it in before it could spoil -- driving anybody who has to write software that's "standards-compliant" slowly insane, to the point of stabbing a two-foot-tall printout of the standard with a knife repeatedly, over and over, while screaming "Die! Die! Die!" at the top of their lungs.

That's standards. That's what they're good for.

- ELG

Thursday, August 21, 2014

Performance tips for Grails / Hibernate batch processing

So I'm working on an application that does batch processing of records sent by client systems, written in Groovy/Grails. This is a story of how it failed -- and how it was fixed.

Failure #1: Record sets sent via HTTP take too long to process, causing HTTP timeouts before a response can be returned to the client. Solution: Plop the record sets into a batch queue instead, and process them via a batch queue runner running as a Quartz job.
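
For illustration, here's a minimal sketch of what the queue-runner side can look like with the Grails Quartz plugin (the job and service names here are hypothetical):

class BatchQueueRunnerJob {
    // Grails Quartz plugin convention: poll the queue once a minute.
    static triggers = {
        simple repeatInterval: 60000l
    }

    def batchQueueService  // hypothetical service that drains the queue

    def execute() {
        batchQueueService.processPendingBatches()
    }
}

The HTTP endpoint then just writes the incoming record set to the queue and returns immediately, well inside the timeout; the job picks it up on the next tick.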

Failure #2: Hibernate/Grails optimistic locking is, well, overly optimistic. As in, if I have multiple EC2 instances processing batch queues, I have to hope and pray that two different instances don't attempt to process the same set of records at the same time, else they'll both fail and roll back at some point and my batch queue will never get emptied. Meanwhile, Hibernate's fine-grained locking is too fine-grained, and ends up causing deadlocks.

Solution: Create a locking system (via your database or via memcached or whatever, doesn't matter as long as it serializes access) and divide your database records into logical non-overlapping sets. Then lock those logical sets at a higher level prior to processing a batch that touches that particular set. For example, if you're batch processing store records at Walmart central office, a logical set might be an individual store and all its individual inventory items.
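
One simple way to get that serialization from the database itself is to lean on a unique constraint -- a sketch, with a hypothetical SetLock domain class:

// Hypothetical GORM domain class; 'name' is unique in the schema.
class SetLock {
    String name    // e.g. "store-1234"
    String owner   // EC2 instance id or similar
    static constraints = {
        name unique: true
    }
}

// Exactly one instance can win the INSERT race on the unique column.
boolean tryLock(String setName, String instanceId) {
    try {
        new SetLock(name: setName, owner: instanceId).save(flush: true, failOnError: true)
        return true
    } catch (Exception ignored) {
        return false // constraint violation: someone else holds the lock
    }
}

void unlock(String setName, String instanceId) {
    SetLock.findByNameAndOwner(setName, instanceId)?.delete(flush: true)
}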

Note that this requires *very* careful schema layout to ensure that things that can be changed by the end user interface do not get overwritten by the batch processor, unless you *want* them to get overwritten by the batch processor. But it's doable.

Failure #3: The Hibernate session consumes all of memory, crashing the application.

Solution: We're doing batch processing, so each record set runs for a significant amount of time (30 seconds or more) with tens of thousands of operations. This means we can let each record set have its own session. For each record set processed by the application, create a new session. Flush that session then destroy it at the end of each record set. For example:

while ((batch = getNextBatch()) != null) { // assumes getNextBatch() returns null when the queue is empty; batches are non-Hibernate objects, typically parsed from JSON or EDI
    Store.withNewSession { session ->
        // ... process batch here ...
        session.flush()
    }
}

Failure #4: Multi-threaded performance slammed into a brick wall at the Hibernate query cache.

Solution: In general, the Hibernate caches are a performance hindrance when batch processing. The number of records that you process over the course of running all of your queues is far larger than the amount of memory you have, so any cached database records from the beginning of the queue run are long gone by the time the queue gets re-filled and you start over at the beginning again. Furthermore, access to the query cache is serialized, so if you're running on a modern multi-core processor and using multiple threads to consume its resources, you might as well be running on an 80386: performance is going to top out at less than two threads' worth. So disable the caches in the 'hibernate' block in your config/DataSource.groovy file and instead manually cache any items that you need to cache within batches or across batches:

hibernate {
    cache.use_second_level_cache = false
    cache.use_query_cache = false
      .... other options here ....
}

Failure #5: Lots of small queries kill performance.

For example, a store might send its nightly inventory records. The nightly inventory records update the quantities for each inventory item, which in turn create ordering alerts when inventory has fallen below a certain level. You know ahead of time that a) the number of inventory records is limited (figure 40,000 different items per store), and b) 75% of the items are going to be modified. So: doing things the inefficient way, you'd do:

inventory_batch.each {
    rec = Inventory.findByStoreAndItemNum(store, it.itemnum)
    rec.quantity = it.quantity
    rec.save()
}

But that results in 40,000 queries to the database, each of which has an enormous amount of Hibernate overhead associated with it.

Solution: Cache the entire set of items beforehand (using a HashMap wrapped in a small cache class), and fetch them from the cache instead. For example, assuming you've created an 'InvCache' class that caches inventory items:

rec_set = Inventory.findAllByStore(store)
inv_cache = new InvCache(rec_set)
inventory_batch.each {
    rec = inv_cache.findByItemNum(it) // looks it up in a hashmap; if not there, adds it to the database
    rec.quantity = it.quantity
    rec.save() // in a real application, you'd check the result of save() and report validation errors
    // in a real application, you'd also check quantity against limits and issue an inventory alert if it's too low
}

Note that rec.save() does not immediately update the record; it merely marks the record as dirty, and the next time Hibernate flushes, it will issue a SQL query to do the update. You still end up issuing 35,000 update statements, but that's still better than issuing 40,000 selects plus 35,000 updates, and they're all issued in a single batch rather than via thousands of individual Hibernate calls each preparing its own statement.
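
For reference, a minimal sketch of what the hypothetical InvCache wrapper might look like (property names are assumptions; only the GORM calls are standard):

class InvCache {
    def store
    Map items = [:]

    // assumes a non-empty initial record set; a real implementation
    // would pass the store in explicitly.
    InvCache(recSet) {
        store = recSet[0].store
        recSet.each { items[it.itemNum] = it }
    }

    // takes the incoming (non-Hibernate) batch record and returns the
    // cached Inventory row, creating it if we've never seen this item.
    def findByItemNum(batchRec) {
        def item = items[batchRec.itemnum]
        if (item == null) {
            item = new Inventory(store: store, itemNum: batchRec.itemnum).save()
            items[batchRec.itemnum] = item
        }
        return item
    }
}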

Failure #6: Flushes in big Hibernate sessions kill performance.

Some stores have a big inventory. It can take several seconds to flush the Hibernate session due to Hibernate's extremely inefficient algorithm for determining what needs to be flushed (it traces the entire relationship graph multiple layers deep, so the cost grows exponentially, not linearly). By default the Hibernate session gets flushed before virtually every query that you make to the database, meaning that if you have to do 500 queries in the course of processing to handle things not easily cached as above, you will have 500 flushes. 500 flushes times 5 seconds per flush is over 41 minutes worth of flushing. EEP!

Solution #1: Don't use Hibernate's built-in flushing and transaction ordering system. Do your own, because most of what you're doing is either batch appends of log records (which you're never going to query back out again during the batch, so you don't care when they actually get flushed) or updates of records where, again, you don't care when they're flushed. So: switch the flush mode to 'manual', flush only when necessary to maintain relational ordering, and otherwise flush only at the end of logical batches. For example, if the store manager has added a new InventoryItem, and this new InventoryItem is referenced by a new InventoryAlert noting that this item needs to be ordered, the order of operations is: create the new InventoryItem, use item.save(flush:true) to flush the session, add the item to the inventory cache if it's going to be used for other things, then create the new InventoryAlert. There is no need to use flush:true on the InventoryAlert, because you don't care when it actually gets flushed; you care only that the InventoryItem gets saved before the InventoryAlert that references it. Hibernate is supposed to handle the dependency ordering here if you properly set up your Grails objects... but sometimes it doesn't, as I've previously noted.
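
In code, that ordering looks something like this (domain classes as above; the cache method is hypothetical):

def item = new InventoryItem(store: store, itemNum: itemNum)
item.save(flush: true)  // force the INSERT now; the alert will reference this row
inv_cache.add(item)     // hypothetical cache method, if the item is needed again

def alert = new InventoryAlert(item: item, reason: 'reorder')
alert.save()            // no flush:true -- we only care that it lands after the item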

Note that setting flush.mode in the hibernate{} block in DataSource.groovy will not set the flush mode to 'manual' in the session we created earlier. It will get set to 'auto' or 'commit' by Grails, depending on whether you're in an @Transactional service when you create the new session; Grails ignores the Hibernate value. You'll need to explicitly set the flush mode when you create the new session:

import org.hibernate.FlushMode
...
Inventory.withNewSession { session ->
    session.setFlushMode(FlushMode.MANUAL)
    // ... do processing here ...
}

Solution #2: In many cases, we are creating new records in batches. For example, cash register logs. So: Create a bunch of new records, flush them to disk, then discard them from the session in order to keep the session size down. For example:

registers.each { register ->
    LogEntry.withSession { session ->
        entry_list = []
        register.logs.each { logentry ->
            entry = new LogEntry(logentry) // creates it from a hash of properties
            // ... do any other processing / initialization for entry here ...
            entry.save() // would validate / check the return value in a real app
            entry_list.add(entry)
        }
        session.flush() // flush the 5,000 register logs for this register to disk
        entry_list.each { entry ->
            entry.discard() // evict the 5,000 register logs for this register from the session
        }
    }
}

Conclusion

Hibernate has a deserved reputation as an inefficient ORM that is not well suited to high-performance operations. This is primarily because its default settings are appropriate for only a small subset of the possible problem space, and are utterly inappropriate for batch processing. Its session management cannot handle sessions with large numbers of objects in a timely manner, and its caches actually make many applications slower rather than faster. However, by applying the above to the application in question, I reduced the processing time for the largest batch sent to our system from over 60 minutes to 3 minutes -- roughly five times faster than it needs to be to meet our performance requirements. Yes, a 20x improvement. You can make Hibernate perform. The batch processor could have been made even faster by dropping down to raw SQL in Java, but it would have taken 20 times longer to write, too.

In the end it's all about tradeoffs. Hibernate sucks, but in this case, given the deadlines and time pressures and the fact that it was the back end of a large code base already written with Groovy/Grails/Hibernate, it was the best of a batch of poor solutions. The ideal is sometimes the enemy of the good enough. If we hit a problem set large enough that we cannot handle it with the technology we're using, then we'll drop down to lower-level, faster technologies such as raw Java EE and raw SQL (probably via something like MyBatis to intermediate, for sanity's sake). In the end, however, in most applications there are other problems worth solving once performance is "good enough". So don't let Hibernate's poor performance scare you off if it's the solution to getting a product out the door in a timely manner. That is, after all, the goal -- and for most applications, Hibernate can be made fast enough.

-ELG

Friday, May 2, 2014

Behold the XKCD Passphrase Generator

Behold the XKCD Passphrase Generator. Copy and paste it into a file on your own Linux machine and run it (assuming you've installed the 'words' package, which is almost always the case). It'll pick five random words and concatenate them. It should also run on other machines with Python installed, but you may need to find a words file somewhere and edit accordingly. If this were a real program I'd add parameters yada yada, but since it's just a toy...
#!/usr/bin/python
# XKCD passphrase generator.
# See XKCD 936 http://xkcd.com/936/
# You'll have to provide your own 'words' file. One word per line.
# Unix based systems usually have /usr/share/dict/words but you'll need
# to get that from somewhere else for Windows or etc.
# After editing wordsfile, numwords, separator:
# Execute as: python genpf.py  

import os

wordsfile="/usr/share/dict/words"
numwords = 5
separator = ".%*#!|"

def gen_index(length):
    # combine three bytes from the OS's CSPRNG into a 24-bit integer.
    # (i % length has a slight modulo bias; acceptable for a toy.)
    i = (ord(os.urandom(1)) << 16) + (ord(os.urandom(1)) << 8) + ord(os.urandom(1))
    return i % length

f = open(wordsfile)

pf = ""
words = f.readlines()

i = gen_index(len(separator))
c = separator[i]
while numwords > 0:
    i = gen_index(len(words))
    s = words[i].strip()
    if pf == "":
        pf = s
    else:
        pf = pf + c + s
    numwords = numwords - 1

print pf

Thursday, February 20, 2014

Hibernate gotchas: ordering of operations

Grails / GORM was throwing a Hibernate error from time to time:

org.hibernate.StaleStateException: Batch update returned unexpected row count from update [0]; actual row count: 0; expected: 1

What was confusing was that there was no update anywhere in the code in question, which was a queue runner. The answer to what was causing this was interesting, and says a lot about Hibernate and its (in)capabilities.

The error in question was being thrown by the transaction flush in a queue runner. The queue lives in a Postgres database. Each site gets locked in the Postgres database, then its queue is run, with each queued-up item deleted after it is processed; then the site gets unlocked.

The first problem arose with the lock/unlock code. There was clearly a race condition when two EC2 instances tried to lock the same site at the same time. As originally implemented with Hibernate, the first instance would create its lock record, flush the transaction, then re-query to see whether there were other lockers with a lower ID holding a lock on the object; if so, it would release the lock by deleting its lock record. Meanwhile the other instance finished processing that queue and released all locks on that site. So the first instance would go to delete its lock, find that it had already been deleted, and throw that exception.

Once that was resolved, the queue runner itself started throwing the exception occasionally when the transaction was flushed after the unlock. What was discovered by turning on Hibernate debug was that Hibernate was re-ordering operations so that the unlock got applied to the database *before* the deletes got applied to the database. So the site would get unlocked, another queue runner would then re-lock the site to itself and start processing the same records that previously got processed, then go to delete the records, and find that the records had already been deleted out from under it. Bam.

The solution, in this case, was to rewrite the code to use the Groovy SQL API rather than use GORM/Hibernate.
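
With groovy.sql.Sql, statements hit the database immediately and in exactly the order written. A sketch (table and column names below are made up for illustration):

import groovy.sql.Sql

def sql = new Sql(dataSource)

// the deletes execute immediately, in order...
processedIds.each { id ->
    sql.execute('DELETE FROM queue_entry WHERE id = ?', [id])
}
// ...so the unlock cannot be reordered ahead of them.
sql.execute('DELETE FROM site_lock WHERE site_id = ?', [siteId])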

What this does emphasize is that you cannot rely on operations being executed by Hibernate in the same order in which you specified them. For the most part this isn't a big deal, because everything you're operating on is going to be applied in a reasonable order so that the dependencies get satisfied. E.g., if you create a site and then a host inside that site, the site record will get created in the database before the host record. But if ordering matters... time to use a different tool. Hibernate isn't it.

-ELG