
Wednesday, May 15, 2019

"But how can I be waiting on a lock when I'm writing an entirely different table?"

Anybody who has run long transactions on PostgreSQL has run into situations where an update on one table, let's call it sphere_config, is waiting on a transaction that's writing an entirely different table, let's call it component_state, to finish.

But how, you ask? How can that be? Those are two different tables, there are no rows in common between the two writes, thus no lock to wait upon!

Well, here's the deal: the transaction that's writing component_state *earlier* wrote to sphere_config, and it's still chugging away doing writes to component_state, so its transaction is still open.

But wait, you say, yes, that transaction wrote to sphere_config, but it wrote to records that are not in common with my transaction, so what gives? Well, what gives is this: uniqueness constraints. Transaction A grabbed a share lock when it wrote to sphere_config, because there's a uniqueness constraint on that table. Until transaction A, the one that's now writing to component_state, finishes chugging away doing its things and COMMITs, its writes to sphere_config aren't visible to transaction B, your transaction that's trying to write to sphere_config. After all, transaction A might ROLLBACK instead because of an error. Thus transaction B can't update the row it's trying to update, because it doesn't yet know whether it would be violating a uniqueness constraint: it doesn't know whether transaction A wrote that same value, because transaction A hasn't committed or rolled back yet.

Now I hear you saying, "but transaction A is operating on a completely different subset of the table, there's no way that anything it's writing can violate uniqueness for transaction B." Well, Postgres doesn't know that because of the way Postgres time travels. So when transaction B goes to write to that table, it tries to grab a share lock on it... but can't, because transaction A still has it. Until transaction A finishes, transaction B is stuck.
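
If you want to watch this happening, Postgres can tell you who is blocking whom. Here's a quick diagnostic along those lines; nothing in it is schema-specific, and it uses only the stock pg_stat_activity view and the pg_blocking_pids() function, so it assumes Postgres 9.6 or later:

  SELECT pid,
         pg_blocking_pids(pid) AS blocked_by,
         wait_event_type,
         state,
         query
    FROM pg_stat_activity
   WHERE cardinality(pg_blocking_pids(pid)) > 0;

Run that while transaction B is stuck and you'll see its backend waiting on transaction A's pid, even though the two are "writing different tables."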

So, how can you deal with this? Well:

  1. If you know you're never going to violate the uniqueness constraint, remove it.
  2. If you need the uniqueness constraint, try to break up your big long transactions into smaller transactions. This will clear those locks faster.
  3. Make sure your transactions get terminated by either a COMMIT or a ROLLBACK. Unfortunately, some programs in languages that have exceptions do very bad error handling and, when an exception terminates execution of the database transaction, don't clean up behind themselves by issuing a ROLLBACK to the database session. Note that most people are using a connection pool between themselves and Postgres, and that connection pool may not automatically drop a connection when the program throws an exception. Instead, the still-open connection (still unterminated by a ROLLBACK) may simply be put back into the pool for re-use, keeping that lock open on the table. So: Try your darndest to use scaffolding like Spring's Service scaffolding that will automatically roll back transactions when an uncaught exception terminates execution of a transaction. And do a lot of praying.
  4. If you are absolutely, positively sure you will never have a transaction that legitimately sits idle for long periods, you can set the idle_in_transaction_session_timeout parameter in your postgresql.conf (see the snippet below). Be very careful here, though: if your connection pool doesn't check for dropped connections, you can get *very* ugly user interactions with this!
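
For example, in postgresql.conf (the five-minute value is purely illustrative; pick something safely longer than any legitimate pause in your workload):

  # terminate any session that sits idle inside an open transaction
  # for more than 5 minutes (value is in milliseconds)
  idle_in_transaction_session_timeout = 300000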
That's pretty much it. Uniqueness checking in a time-traveling database like Postgres is pretty hard to do. In MySQL, if you try to write a duplicate into the index, it collides with the record already there and your transaction fails at that point. Postgres, on the other hand, never actually updates records in place: it adds a new record with a higher transaction number and lets the record with the lower transaction number eventually get garbage-collected once all transactions referring to it have finished executing. This allows multiple transactions to each have their own view of the record as of the time the transaction started, thus ensuring chronological integrity (as well as adding a potential for deadlock, but that's another story), but it definitely makes uniqueness harder to accomplish.

So now you know how your transaction can be waiting on a lock when you're writing an entirely different table -- uniqueness checking. How... unique. And now that you know, now you know how to fix it.

-ELG

Thursday, October 26, 2017

Amazon Aurora Postgres: First thoughts

Well, I have to say that this was a bit frustrating. I never actually got my database installed into Aurora Postgres because of some serious limits in Amazon's implementation, limits that constrain operational flexibility to the point where, for my workload, it simply doesn't work.

The biggest limits are based on the fact that Aurora Postgres doesn't use a filesystem. Rather, Amazon has created a block-based back end for Postgres that allows clustered access to the data store. The data store itself, like EBS, is replicated for performance and redundancy. This has some interesting side effects. Postgres was built around the assumption that the filesystem cache was the primary block cache. You allocate a fairly limited amount of memory to the internal Postgres shared memory pool and leave the rest to be used by the filesystem block cache. Aurora Postgres, on the other hand, must assign that memory to the internal Postgres shared memory pool in order to serve as cache since there is no filesystem and thus no filesystem block cache. Unlike the filesystem block cache pool, Postgres jobs cannot take memory away from the internal shared memory pool in order to accomplish whatever task they are doing. The end result is that internal jobs that require a lot of memory can die with out of memory errors since there's not enough memory outside the Postgres shared memory pool to allocate for that job.

The other big limitation is that Aurora Postgres has limited space for handling large sorts or indexing operations. Regular Postgres uses a directory, pgsql_tmp, in a tablespace to store temporary heap results for sorts and indexes too big to fit in work_mem (which defaults to a paltry 4MB in stock Postgres). This can be as big as your filesystem allows. If, for example, I have 500GB free in my tablespace, I have no trouble sorting an entire 150GB table into an arbitrary order and then exporting it to an external consumer.

But remember, Aurora Postgres doesn't have a filesystem for its tablespace. It has a block store. Instead, Aurora Postgres instances that are doing large sorts or indexing large files use local storage, which is currently 2x the size of memory. That is, if an Aurora database instance has 72GB of memory, you only have 144GB of temporary space. Good luck sorting that 150GB table.

What this means for me is that Aurora Postgres has some interesting scalability limits when dealing with very large data sets. I'm currently managing about 2 billion rows in Postgres. Needless to say, this requires a lot of very large indexes in order to segment this data space into usable, consumable subsets. Creating these indexes is a slow and tedious process on Aurora Postgres, because you have to build them one at a time rather than in parallel, due to the lack of temporary space to use for the sort heaps. And if I'm querying and sorting significant subsets of this database, again Aurora Postgres has some serious limits due to the inability to expand pgsql_tmp.

My guess is that people who are dealing with far fewer rows, but who are querying those rows with much greater frequency, will have a more successful experience with Aurora Postgres. But then they'll run into the IOPS costs. Basically, to get the same IOPS that would cost me $1200/month on EBS, I'd end up paying around $4,000/month on Aurora.

So: What's the point of Aurora? Aurora does have a couple of positives. You can create additional read replicas virtually instantly, since they're just pointed at the same shared block storage. Failover simply happens, and happens almost instantly. And from a management point of view, Aurora makes the database administrator's job far simpler since you no longer have to closely monitor your tablespaces and expand your block storage as needed (and reallocate tables and indexes across multiple tablespaces using pg_repack) in order to handle growing your dataset. Still, in its current state of development, given its limitations and high costs compared to running your own cluster, I really cannot recommend Aurora Postgres.

-ELG

Thursday, August 21, 2014

Performance tips for Grails / Hibernate batch processing

So I'm working on an application that does batch processing of records sent by client systems, and Groovy/Grails is the language/framework that the application is written in. This is a story of how it failed -- and how it was fixed.

Failure #1: Record sets sent via HTTP take too long to process, causing HTTP timeouts before a response can be returned to the client. Solution: Plop the record sets into a batch queue instead, and process them via a batch queue runner running as a Quartz job.
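
For what it's worth, the queue runner's skeleton looks roughly like this, using the Grails Quartz plugin. The class and service names here are illustrative, not the actual application code:

class BatchQueueJob {
    // Quartz plugin trigger: poll the queue every 30 seconds, and don't
    // start a new run of this job while the previous one is still going
    static triggers = {
        simple repeatInterval: 30000L
    }
    def concurrent = false

    def batchQueueService  // hypothetical service holding the real logic

    def execute() {
        batchQueueService.runPendingBatches()
    }
}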

Failure #2: Hibernate/Grails optimistic locking is, well, overly optimistic. As in, if I have multiple EC2 instances processing batch queues, I have to hope and pray that two different instances don't attempt to process the same set of records at the same time, else they'll both fail and roll back at some point and my batch queue will never get emptied. Meanwhile, Hibernate fine-grained locking is too fine-grained, and ends up causing deadlocks.

Solution: Create a locking system (via your database or via memcached or whatever, doesn't matter as long as it serializes access) and divide your database records into logical non-overlapping sets. Then lock those logical sets at a higher level prior to processing a batch that touches that particular set. For example, if you're batch processing store records at Walmart central office, a logical set might be an individual store and all its individual inventory items.
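
One simple way to implement the lock at the database level is a lock table with a uniqueness constraint, so only one instance can hold a given logical set at a time. A minimal sketch, with made-up names (and note the exact exception type thrown on a constraint violation can vary by driver and dialect):

class SetLock {
    String setName        // e.g. "store-1234"
    String ownerInstance  // which EC2 instance holds the lock
    Date   dateCreated

    static constraints = {
        setName unique: true  // the database enforces one holder per set
    }
}

boolean tryLock(String setName, String instanceId) {
    try {
        // the unique constraint makes this atomic: a second instance's
        // insert fails instead of silently creating a duplicate lock
        new SetLock(setName: setName, ownerInstance: instanceId)
                .save(flush: true, failOnError: true)
        return true
    } catch (org.springframework.dao.DataIntegrityViolationException e) {
        return false  // someone else holds this set; skip it for now
    }
}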

Note that this requires *very* careful schema layout to ensure that things that can be changed by the end user interface do not get overwritten by the batch processor, unless you *want* them to get overwritten by the batch processor. But it's doable.

Failure #3: The Hibernate session consumes all of memory, crashing the application.

Solution: We're doing batch processing, so each record set runs for a significant amount of time (30 seconds or more) with tens of thousands of operations. This means we can let each record set have its own session. For each record set processed by the application, create a new session. Flush that session then destroy it at the end of each record set. For example:

while ((batch = getNextBatch()) != null) { // returns non-Hibernate objects, typically parsed from JSON or EDI
    Store.withNewSession { session ->
        ... process batch here ...
        session.flush()  // push this record set's work to the database
    }                    // the session goes away when the closure exits
}

Failure #4: Multi-threaded performance slammed into a brick wall at the Hibernate query cache.

Solution: In general, the Hibernate caches are a performance hindrance when batch processing. The number of records that you process over the course of running all of your queues is far larger than the amount of memory you have, so any cached database records from the beginning of the queue run are long gone by the time the queue gets re-filled and you start over at the beginning again. Furthermore, the query cache is single-threaded, so if you're running on a modern multi-threaded processor and using multiple threads to consume its resources, you might as well be running on an 80386: performance is going to top out at less than two threads' worth. So disable the caches in the 'hibernate' block in your grails-app/conf/DataSource.groovy file and instead manually cache any items that you need to cache within batches or across batches:

hibernate {
    cache.use_second_level_cache = false
    cache.use_query_cache = false
      .... other options here ....
}
Failure #5: Lots of small queries kill performance.

For example, a store might send its nightly inventory records. The nightly inventory records update the quantities for each inventory item, which in turn create ordering alerts when inventory has fallen below a certain level. You know ahead of time that a) the number of inventory records is limited (figure 40,000 different items per store), and b) 75% of the items are going to be modified. So: doing things the inefficient way, you'd do:
inventory_batch.each { rec = Inventory.findByStoreAndItemNum(store, it.itemnum); rec.quantity = it.quantity; rec.save() }
But that results in 40,000 queries to the database, each of which has an enormous amount of Hibernate overhead associated with it.

Solution: Cache the entire set of items beforehand (using a HashMap and a cache class to wrap it), and fetch them from the cache instead. For example, assuming you've created an 'InvCache' class that caches inventory items:

  rec_set = Inventory.findAllByStore(store)
  inv_cache = new InvCache(rec_set)
  inventory_batch.each {
      rec = inv_cache.findByItemNum(it.itemnum) // looks it up in the hashmap; if not there, adds a new record to the database
      rec.quantity = it.quantity
      rec.save()  // in a real application, you'd check the result of save() and report validation errors
      // in a real application you'd also check quantity against limits and issue an inventory alert if inventory is too low
  }

Note that rec.save() does not immediately update the record; it merely marks the record as dirty, and the next time Hibernate flushes, it will issue a SQL query to do the update. You still end up issuing 35,000 update statements, but that's still better than issuing 40,000 selects plus 35,000 updates, and they're all issued in a single batch rather than via multiple Hibernate calls preparing statements and so forth.
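For completeness, here's a minimal sketch of what such an InvCache class might look like. This is illustrative, not the production class (which would also handle store assignment and any other required fields on newly created records):

class InvCache {
    private Map byItemNum = [:]

    InvCache(List recSet) {
        // index the pre-fetched inventory records by item number
        recSet.each { byItemNum[it.itemnum] = it }
    }

    def findByItemNum(itemnum) {
        def rec = byItemNum[itemnum]
        if (rec == null) {
            // not seen before: create it and remember it for next time
            rec = new Inventory(itemnum: itemnum)
            byItemNum[itemnum] = rec
        }
        return rec
    }
}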

Failure #6: Flushes in big Hibernate sessions kill performance.

Some stores have a big inventory. It can take several seconds to flush the Hibernate session, due to Hibernate's extremely inefficient algorithm for determining what needs to be flushed (it tries to trace the entire relationship structure multiple layers deep, so the cost grows exponentially with session size rather than linearly). By default, the Hibernate session gets flushed before virtually every query you make to the database, meaning that if you have to do 500 queries against the database in the course of processing, to handle things not easily cached as above, you will have 500 flushes. 500 flushes times 5 seconds per flush is 41 minutes worth of flushing. EEP!

Solution #1: Don't use Hibernate's built-in flushing and transaction ordering system. Do your own. Most of what you're doing is either batch appends of log records (which you're never going to query back out again in the course of the batch, so you don't care when they actually get flushed) or updates of records where, again, you really don't care when the flush happens. So: switch the flush mode to 'manual', flush only when necessary to maintain relational ordering, and otherwise flush only at the end of logical batches. For example, if the store manager has added a new InventoryItem, and this new InventoryItem is referenced by a new InventoryAlert noting that this item needs to be ordered, the order of operations is: create the new InventoryItem, use item.save(flush:true) to flush the session, add it to the inventory cache if it's going to be used for other things, then create the new InventoryAlert (as sketched below). There is no need to use flush:true on the InventoryAlert, because you don't care when it actually gets flushed; you care only that the InventoryItem gets saved before the InventoryAlert that references it. Hibernate is supposed to handle the dependency ordering here if you properly set up your Grails objects... but sometimes it doesn't, as I've previously noted.
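
In code, that ordering looks something like this (the field names are illustrative):

// a new item showed up in the batch: force it to disk *now*, so the
// alert that references it can never be written first
def item = new InventoryItem(itemnum: rec.itemnum, description: rec.desc)
item.save(flush: true)   // forced flush: the item must hit the database first
inv_cache.add(item)      // hypothetical cache from Failure #5 above

// the alert can flush whenever the session next flushes; only the
// relative ordering mattered, and that's already guaranteed
new InventoryAlert(item: item, reason: 'below reorder threshold').save()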

Note that setting the flush.mode in the hibernate{} block in DataSource.groovy will not set the flush mode to 'manual' in the session we created earlier. Grails will set it to 'auto' or 'commit' depending on whether you're in an @Transactional service when you create the new session; Grails ignores the Hibernate value. You'll need to explicitly set the flush mode when you create the new session:

import org.hibernate.FlushMode
...
Inventory.withNewSession { session ->
    session.setFlushMode(FlushMode.MANUAL)
    .... do processing here ....
}

Solution #2: In many cases, we are creating new records in batches. For example, cash register logs. So: Create a bunch of new records, flush them to disk, then discard them from the session in order to keep the session size down. For example:

registers.each { register ->
   LogEntry.withSession { session ->
     entry_list=[]
     register.logs.each { logentry -> 
         entry = new LogEntry(logentry)  // creates it from hash
         ... do any other processing / initialization for entry here ...
         entry.save() // would validate/check return val in real app
         entry_list.add(entry)
     }
     session.flush() // flush the 5,000 register logs for this register to disk.
     entry_list.each { entry -> 
        entry.discard() // get rid of the 5,000 register logs for this register.
     }
   }
}

Conclusion

Hibernate has a deserved reputation as an inefficient ORM that is not well suited to high performance operations. This is primarily because its standard settings are appropriate for only a small subset of the possible problem space, and are utterly inappropriate for batch processing. Its session management is incapable of handling sessions with large numbers of objects in a timely manner, and its caches actually make many applications slower rather than faster. However, by applying the above to the application in question, I reduced the processing time for the largest batch sent to our system from over 60 minutes to 3 minutes -- a factor of 20 improvement, and roughly five times faster than it's required to be in order to meet our performance requirements. You can make Hibernate perform. The batch processor could have been made even faster by dropping down to raw SQL in Java, but it would have taken a factor of 20 longer to write, too.

In the end it's all about tradeoffs. Hibernate sucks, but in this case, given the deadlines and time pressures and the fact that it was the back end of a large code base already written with Groovy/Grails/Hibernate, it was the best of a batch of poor solutions. The ideal is sometimes the enemy of the good enough. If we hit a problem set large enough that we cannot handle it with the technology we're using, then we'll drop down to lower-level / faster technologies such as raw Java EE and raw SQL (probably via something like MyBatis to intermediate, for sanity's sake). In the end, however, in most applications there are other problems worth solving once performance is "good enough". So don't let Hibernate's poor performance scare you off if it's the solution to getting a product out the door in a timely manner. That is, after all, the goal -- and for most applications, Hibernate can be made fast enough.

-ELG

Thursday, February 20, 2014

Hibernate gotchas: ordering of operations

Grails / GORM was throwing a Hibernate error from time to time:

org.hibernate.StaleStateException: Batch update returned unexpected row count from update [0]; actual row count: 0; expected: 1

What was confusing was that there was no update anywhere in the code in question, which was a queue runner. The answer to what was causing this was interesting, and says a lot about Hibernate and its (in)capabilities.

The error in question was being thrown by the transaction flush in a queue runner. The queue is in a Postgres database. Each site gets locked in the Postgres database, then its queue is run, with each queued-up item deleted after it is processed, and then the site gets unlocked.

The first problem arose with the lock/unlock code. There was a race condition, clearly, when two EC2 instances tried to lock the same site at the same time. The way it was originally implemented with Hibernate, the first instance would create its lock record, then flush the transaction, then re-query to see whether there were other lockers with a lower ID holding a lock on the object. If so, it would release the lock by deleting its lock record. Meanwhile the other instance finished processing that queue and released all locks on that site. So the first instance would go to delete its lock, find that it had already been deleted, and throw that exception.

Once that was resolved, the queue runner itself started throwing the exception occasionally when the transaction was flushed after the unlock. What was discovered by turning on Hibernate debug was that Hibernate was re-ordering operations so that the unlock got applied to the database *before* the deletes got applied to the database. So the site would get unlocked, another queue runner would then re-lock the site to itself and start processing the same records that previously got processed, then go to delete the records, and find that the records had already been deleted out from under it. Bam.

The solution, in this case, was to rewrite the code to use the Groovy SQL API rather than use GORM/Hibernate.
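
With groovy.sql.Sql, statements hit the database in exactly the order you write them, so the deletes are guaranteed to land before the unlock. A rough sketch (the table and column names are made up; dataSource is the injected Grails DataSource):

import groovy.sql.Sql

def sql = new Sql(dataSource)
// delete the processed queue entries first...
processedIds.each { id ->
    sql.execute("DELETE FROM queue_entry WHERE id = ?", [id])
}
// ...and only then release the site lock; plain SQL is never re-ordered
sql.execute("DELETE FROM site_lock WHERE site_id = ?", [siteId])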

What this does emphasize is that you cannot rely on operations being executed by Hibernate in the same order in which you specified them. For the most part this isn't a big deal, because everything you're operating on is going to be applied in a reasonable order so that the dependencies get satisfied. E.g. if you create a site then a host inside a site, the site record will get created in the database before the host record gets created in the database. But if ordering matters... time to use a different tool. Hibernate isn't it.

-ELG

Wednesday, May 1, 2013

ORM

I'm going to speak heresy here: Object-Relational Mappers such as Hibernate are evil. I say that as someone who wrote an object-relational mapper back in 2000 -- the BRU Server server is written as a master class that maps objects to a record in a MySQL database, which is then inherited by child classes that implement the specific record types in the MySQL database. The master class takes care of adding the table to the database if it doesn't exist, as well as populating the Python objects on queries. The child classes take care of business logic and generating queries that don't map well to the ORM, but pass the query results to generators to produce proper object sets out of SQL data sets.

So why did this approach work so well for BRU Server, allowing us to concentrate on the business logic rather than the database logic and allowing its current owners to maintain the software for ten years now, while it fails so harshly for atrocities like Hibernate? One word: complexity. Hibernate attempts to handle all possible cases, and thus ends up producing terrible SQL queries while making things that should be easy difficult, but that's because it's a general purpose mapper. The BRU Server team -- all four of us -- understood that if we were going to create a complete Unix network backup solution within the six months allotted to us, complexity was the enemy. We understood the compromises needed between the object model and the relational model, and the fact that Python was capable of expressing sets of objects as easily as it was capable of expressing individual objects meant that the "object=record" paradigm was fairly easy to handle. We wrote as much ORM as we needed -- and no more. In some cases we had to go to raw relational database programming, but because our ORM was so simple we had no problems with that. There were no exceptions being thrown because of an ORM caching data that no longer existed in the database, and the actual objects for things like users and servers could do their thing without worrying about how to actually read and write the database.

In the meantime, I have not run into any Spring/Hibernate project that actually managed to produce usable code performing acceptably well in any reasonable time frame with a team of a reasonable size. I was at one company that decided to use the PHP-based code that three of us had written in four weeks' time as the prototype for the "real" software, which of course was going to be Java and Spring and Hibernate and RESTful and all the right buzzwords, because real software isn't written in PHP, of course (though our code solved the problem and didn't require advanced degrees to understand). Six months later, with a cast of almost a dozen and no closer to release than at the beginning, the entire project was canned and the project team fired (not me, I was on another project, but I had a friend on that project and she was not a happy camper). I don't know how much money was wasted on that project, but undoubtedly it hurried the demise of that company.

But maybe I'm just not well informed. It wouldn't be the first time, after all. So can anybody point me to a Spring/Hibernate project that is, say, around 80,000 lines of code written in under five months' time by a team of four people, that not only does database access but also does some hard-core hardware-level work slinging massive amounts of data around in a three-box client-server-agent architecture with multiple user interfaces (CLI and GUI/Web minimum)? That can handle writing hundreds of thousands of records per hour then doing complex queries on those records with MySQL without falling over? We did this with BRU Server, thanks to Python and choosing just enough ORM for what we needed (not to mention re-using around 120,000 lines of "C" code for the actual backup engine components), and no more (and no less). The ORM took me a whole five (5) days to write. Five. Days. That's it. Granted, half of that is because of Python and the introspection it allows as part of the very definition of the language. But. Five days. That's how much you save by using Spring/Hibernate over using a language such as Ruby or Python that has proper introspection and doing your own object-relational mapping. Five days. And I submit that the costs of Spring/Hibernate are far, far worse, especially for the 20% of projects that don't map well onto the Spring/Hibernate model, such as virtually everything that I do (since I'm all about system level operations).

-ELG