The tale of three databases
In the olden days Summa used a Postgresql database as a backend for its caching/presentation engine, also called the Storage. Postgresql served us well – good performance and good reliability. The one drawback was that it required additional setup if one wanted to deploy Summa on a new machine.
In our fierce hunt for easy deployments of the Summa search engine we decided that we wanted to try out using an embedded Java database. This way we don’t depend on external processes or setup. Since Apache Derby was recently included in the Sun’s Java 6 we decided to go down that route. And for a long while it seemed that Derby was the hammer of all nails.
However, when we started doing scalability tests Derby started to play tricks with us. You see, Summa can do incremental updates of its internal indexes and for this to work we need to be able to determine efficiently if any of the indexed data has been changed and needs an update in the index. With a couple million records in the database Derby seemed to get extremely slow. I did some honest attempts at tweaking our data models and talking to the Derby community about enhancing the performance, but at the end of the day we couldn’t make it meet out needs.
After a few days with out sufficient progress (I was able to make small performance imrpovements, but not enough) it was decided to try and re-enable the olde Postgresql backend. Behold; everything was snappy and we had acceptable performance. The only problem was that now we had not solved our initial goal which was easy setup…
Enter H2. After Googling around a bit I stumbled upon the H2 embedded Java database. After some initial testing its performance seemed very promising – faster than Postgresql even though Postgresql is written in C and H2 in Java. This once again proves that Java is not slow (a prejudice that some people will probably retain to the day they die).
The Devilish Details
Despite H2 rocking our socks off, all was not well in paradise. Like Derby (particularly Derby I might add) H2 did not perform well on SQL JOINs. In our case we are talking LEFT JOINs on huge result sets. To fight this I added a mode to our abstract database layer that would avoid the JOINs and instead perform direct lookups of the stuff we JOINed in. This gave us a good performance boost, but still not good enough.
Next thing we had to learn the hard way was that when Thomas Mueller (the über cool H2 main hacker) says that H2 does not perform well on large result sets you better believe him🙂 So again I added a new mode to the abstract database layer to fetch the result set as a sequence of SQL queries using the LIMIT and OFFSET keywords. In essence it is really just client side paging of the result set. With this in place I am proud to say that we have H2 delivering stellar performance.
It is likely that the same client side paging would make the Derby backend acceptably fast as well, but unfortunately Derby does not support manual paging very well (ok, you can doit, but it is not really nice).
Instead of the single Derby backend for our storage component that we had initially anticipated we now have three backends. The H2 and Postgresql ones perform very well while the Derby one could use some love (and is workable for smaller colelctions).
I also learned a lot more about JDBC and SQL than I had ever hoped to ever know🙂