Hibernate Cache Is Fundamentally Broken

As the title says, I dare say Hibernate’s second-level (L2) cache is fundamentally broken, at least when it is backed by a clustered cache, which the official docs present as supported.

There are two strong reasons for clustering, that is, spreading your load over multiple servers: scaling out and availability.

What does availability mean? Among other things, zero downtime on updates. You take one server down, update it, start it up and then continue with the next. With a reasonable load balancer, proxy, middleware or what have you, the clients won’t notice a thing and will be seamlessly redirected to whichever server is up at the moment.

Now, what if during an update you change the definition of an entity? Add or remove a field? You have old servers with the old definition and updated servers with the new definition, all sharing the same clustered cache. Java has serialVersionUID for exactly this situation. If you serialize an object and then deserialize it on another node where the class has a different serialVersionUID, it fails with an exception instead of handing you a half-populated object.
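To make the serialVersionUID behaviour concrete, here is a minimal sketch. A single JVM cannot load two versions of the same class, so the "other node" is simulated by flipping one bit of the serialVersionUID inside the serialized bytes. The class names and the withDifferentVersion helper are made up for this illustration; only the java.io calls are real.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InvalidClassException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class SerialVersionDemo {

    // Hypothetical class standing in for anything put into a replicated cache.
    static class Point implements Serializable {
        private static final long serialVersionUID = 1L;
        int x = 3;
    }

    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(o);
        }
        return bytes.toByteArray();
    }

    static Object deserialize(byte[] data) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return in.readObject();
        }
    }

    // Simulates a writer with another class version by changing the UID in the stream.
    // Layout of the first object: 4-byte stream header, TC_OBJECT, TC_CLASSDESC,
    // 2-byte name length, class name, then the 8-byte serialVersionUID.
    static byte[] withDifferentVersion(byte[] data, Class<?> type) {
        byte[] copy = data.clone();
        int lastUidByte = 8 + type.getName().length() + 7;
        copy[lastUidByte] ^= 1; // flip the lowest bit of the UID
        return copy;
    }

    public static void main(String[] args) throws Exception {
        byte[] data = serialize(new Point());
        System.out.println("Same version: " + deserialize(data).getClass().getSimpleName());
        try {
            deserialize(withDifferentVersion(data, Point.class));
        } catch (InvalidClassException e) {
            System.out.println("Different version rejected: " + e.getMessage());
        }
    }
}
```

The point is that plain Java serialization fails loudly on a version mismatch, which is the behaviour one would hope for from a clustered cache as well.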

Since Hibernate openly advertises clustered caching, one would expect it to work just fine with this case. Unfortunately, this is not so.

What Hibernate puts in the cache is a plain old array holding the values of the individual fields. That array is submitted to the clustered cache. Then a node running a different version loads the array from the cache and tries to populate an entity with a different definition from it, simply copying field by field by numeric index.

When that happens, you’re screwed.

Suppose you have this entity definition:

class User {
  int id;
  String password;
  String email;
  String login;
}

… and you’re updating to this:

class User {
  int id;
  String password;
  Timestamp passwordExpires;
  String email;
  String login;
}

The best thing that can happen is an outage. For instance, the 3rd field used to be a String (email). You added a Timestamp field before it, so in the new definition the 3rd field is a Timestamp. Hibernate on the new nodes fails on load() with a ClassCastException from String to Timestamp, because the cache still holds entries written with the old definition.

The much worse case is data corruption. Suppose you need a User for the following transaction:

User user = session.load(User.class, 4);

Let’s say the email was null when an old node cached this user. load() does not yield a ClassCastException, because null is a perfectly valid Timestamp. But the cached entry only has 4 fields, while the new definition has 5. The old login value lands in email, and login itself is not restored and remains null. When you update(), you’re doomed. In this made-up example it would hopefully fail on a DB not-null constraint. In real life, though, it can silently save corrupted data to the database and guarantee hours of very interesting debugging and restoring from backups, if not physical damage caused by your application’s misbehavior.
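Both outcomes can be simulated without Hibernate at all. The sketch below is not Hibernate code: hydrate() is a made-up stand-in for "copy the cached array into the entity by index", and the sample values are invented. It only assumes the field order shown in the two User definitions above.

```java
import java.sql.Timestamp;

class StaleCacheEntryDemo {

    // The updated entity definition from the example.
    static class User {
        int id;
        String password;
        Timestamp passwordExpires;
        String email;
        String login;
    }

    // Hypothetical stand-in for restoring an entity from a cached array by position.
    static User hydrate(Object[] state) {
        User user = new User();
        user.id = (Integer) state[0];
        user.password = (String) state[1];
        user.passwordExpires = (Timestamp) state[2]; // old layout kept email here
        user.email = (String) state[3];              // old layout kept login here
        // An entry written by an old node has only four slots: login is never restored.
        user.login = state.length > 4 ? (String) state[4] : null;
        return user;
    }

    public static void main(String[] args) {
        // Entry cached by an old node: id, password, email, login.
        Object[] withEmail = {4, "secret", "joe@example.com", "joe"};
        try {
            hydrate(withEmail);
        } catch (ClassCastException e) {
            System.out.println("Outage: " + e);
        }

        // Same layout, but the email happened to be null.
        Object[] withoutEmail = {4, "secret", null, "joe"};
        User user = hydrate(withoutEmail);
        System.out.println("email=" + user.email + ", login=" + user.login);
        // prints "email=joe, login=null"
    }
}
```

The first entry blows up immediately, which is the lucky case. The second one produces a perfectly valid-looking User whose email holds the old login and whose login is gone.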

There’s this old piece of music called “Careful with that Axe, Eugene”. If you don’t know it, don’t bother googling. Don’t even get me started about YouTube; it needs a proper sound setup and dynamics. So, here’s how it goes. An apparently boring, monotonous bass softly playing “bing, bang, bing, bang” (or D, D, D, D as Wikipedia says). Nothing happens for a few minutes, except for equally soft, ambient-ish keyboard tones. And so on for one minute, another, then another. Then, out of the blue, an air-shattering scream. The first time in my life I heard it was at an Australians’ concert, and it literally made me jump with a shot of adrenaline and panic.

That’s the experience with clustered cache and Hibernate. It’s very robust and stable. Boring. Unnoticeable. Until one day it makes you scream hard and tear your hair out.

Handle with care. Beware. Be prepared.

A little disclaimer: I don’t know if this exact example reproduces the problem. It’s merely a made-up illustration. The fields might be ordered by name, and Hibernate may refuse to restore from an array that has fewer fields than the current definition. But I have witnessed both issues in real life, and they caused much pain and cost time and money.

4 thoughts on “Hibernate Cache Is Fundamentally Broken”

  1. Good thing, but the time you’ve spent diagnosing this problem and writing this, could have been spent writing a patch. That’s how Open Source is supposed to work.

  2. Marcos: You’re right. Indeed, I dived so deep that I think I could create a patch. I did not mostly because I see a few possible resolutions:

    1. “Won’t fix” – Hibernate team can decide it’s not a bug, or is not important enough, and you have to take full responsibility for this kind of cache “poisoning”.

    2. Treat entities with different schema as misses, and fetch from DB:
    2a. Use serialVersionUID
    2b. Use a new class-level attribute (@MappingVersion, <mapping-version>)
    2c. Use both: 2b is preferred, but if not present use 2a.

    While I think I am capable of implementing it myself, I don’t think such an important and disputable change in core can be done by a lone developer who’s never contributed and may not see the whole big picture.

  3. Have you heard any update on this? We’re going to be using Hibernate L2 cache and absolutely want to be able to do zero-downtime rolling bounces when we deploy new versions. Seems like a pretty important issue …

  4. Darrell, I haven’t heard a word. The bug report is at https://hibernate.onjira.com/browse/HHH-6600.

    The issue may depend on what L2 implementation you use. In our case (RMI-replicated Ehcache 1.x) the only way seems to be to launch the updated servers in a new cache cluster (e.g. on a different port). It takes effort and the cache needs to be populated from scratch, but with some manual work it is possible to create a reliable zero-downtime solution for updates.
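Option 2b from the comments above (treat entries written under a different mapping version as cache misses) could look roughly like the sketch below. Nothing here is Hibernate or Ehcache API: VersionedCache, Entry and the integer mapping version are hypothetical names, and a ConcurrentHashMap stands in for the clustered cache.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class VersionedCacheDemo {

    // Cached field values plus the mapping version of the node that wrote them.
    static final class Entry {
        final int mappingVersion;
        final Object[] state;

        Entry(int mappingVersion, Object[] state) {
            this.mappingVersion = mappingVersion;
            this.state = state;
        }
    }

    // Hypothetical wrapper around the shared cache; one instance per node.
    static final class VersionedCache {
        private final Map<Object, Entry> shared;
        private final int localVersion;

        VersionedCache(Map<Object, Entry> shared, int localVersion) {
            this.shared = shared;
            this.localVersion = localVersion;
        }

        void put(Object key, Object[] state) {
            shared.put(key, new Entry(localVersion, state));
        }

        // Returns null (a miss) when the entry is absent or was written
        // under a different mapping version, so the caller falls back to the DB.
        Object[] get(Object key) {
            Entry entry = shared.get(key);
            if (entry == null || entry.mappingVersion != localVersion) {
                return null;
            }
            return entry.state;
        }
    }

    public static void main(String[] args) {
        Map<Object, Entry> cluster = new ConcurrentHashMap<>();
        VersionedCache oldNode = new VersionedCache(cluster, 1);
        VersionedCache newNode = new VersionedCache(cluster, 2);

        oldNode.put(4, new Object[] {4, "secret", null, "joe"});
        System.out.println("old node hit: " + (oldNode.get(4) != null));  // true
        System.out.println("new node miss: " + (newNode.get(4) == null)); // true
    }
}
```

One trade-off of this shape is that old and new nodes keep overwriting each other's entries for the same key during a rolling update. Putting the version into the cache key or region name instead would avoid that, which is close in spirit to running the updated servers in a separate cache cluster.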
