Archive for the ‘Database’ Category

Efficient sorting and iteration on large databases

June 15, 2009

Before you read on, heed my words that this post might be a wee bit technical… If not extremely technical – caveat emptor

The Problem

In our continuous quest for a blazingly fast Summa, we ran into a performance problem extracting and sorting huge result sets from our caching database. Concretely we store ~9M rows in a H2 database, all records are annotated with a modification time (henceforth mtime) and we use this timestamp to determine if we need to update the index. When updating the index we read records from the database, sorted by this mtime column.

This means that for the initial indexing we create a sorted result set of 9M records. The first observation is that we should definitely have an index on the mtime column. Even with that, many databases will take some time for such queries and it might lead to big memory allocations or temporary tables being set up. We don’t want any of that. We want lightweight transactions and speed!

Take One, LIMIT and OFFSET

The naive approach (atleast, the first thing that I tried!) is to use the LIMIT and OFFSET Sql statements to create small result sets, of size 1000, and then do client side paging, something alá:

  SELECT * FROM records ORDER BY mtime OFFSET $last_offset LIMIT 1000

Here we increment last_offset by 1000 each time we request a new page. However this solution will perform extremely bad. The database server will need to discard the first last_offset records before it can return the next 1000 records to you, when we are talking millions of records this can be quite an overhead. The database can not apply any smart tricks to make this fast because it has no a priori way to find out where the record with offset last_offset into the result set begins.

Take Two, Salted Timestamps

So what can we do? The thing that databases are fast at is looking stuff up in indexes. We need to make it use some indexes to calculate the pages… 

The idea is to use the index on the mtime column to calculate the offset, then when we request a new page we use the mtime of the last record in the last result set. This may just work out because we sort everything by mtime. Maybe like:

  SELECT * FROM records WHERE mtime>$last_mtime ORDER BY mtime LIMIT 1000

Alas, this contains a subtle bug. Since we might insert more than one record per millisecond the mtime of a record might now be unique. This means that we might skip some records in between pages or include some records in multiple pages.

If we somehow force the mtimes to be unique the above query would actually work. One solution is to always ensure that there is at minimum 1ms between each insertion – this is way too slow for us, so we deviced what we have dubbed Salted Timestamps.

Instead of using 32 bit INTEGERs for mtime we use 64 bit integers (a BIGINT on most Sql servers). We move the actual timestamp to the most significant 44 bits and then store a salt in the least significant 20 bits. The salt is basically just a counter that is reset each millisecond, meaning that we can add 1048576 records per millisecond before we run out of salts. With this construct way we get a “timestamp” that still sorts correctly and we can even create a UNIQUE index on the mtime column.

Conclusion

We have adopted the approach with salted timestamps as described above for Summa and so far it has proven to perform quite well (avg. ~2000 records/s). An added bonus is that we only put very light load on the db, because the transactions are small and fast. You can find an implementation of this scheme in the DatabaseStorage* class and the timestamp handling in the UniqueTimestampGenerator* class in the Storage module in the Summa sources.

*) Most sorry that these links require a login (which is freely available, but anyway) – we are working on a solution with anonymous access. More on that later.

Adventures in the land of databases

March 4, 2009

The tale of three databases

In the olden days Summa used a Postgresql database as a backend for its caching/presentation engine, also called the Storage. Postgresql served us well – good performance and good reliability. The one drawback was that it required additional setup if one wanted to deploy Summa on a new machine.

In our fierce hunt for easy deployments of the Summa search engine we decided that we wanted to try out using an embedded Java database. This way we don’t depend on external processes or setup. Since Apache Derby was recently included in the Sun’s Java 6 we decided to go down that route. And for a long while it seemed that Derby was the hammer of all nails.

However, when we started doing scalability tests Derby started to play tricks with us. You see, Summa can do incremental updates of its internal indexes and for this to work we need to be able to determine efficiently if any of the indexed data has been changed and needs an update in the index. With a couple million records in the database Derby seemed to get extremely slow. I did some honest attempts at tweaking our data models and talking to the Derby community about enhancing the performance, but at the end of the day we couldn’t make it meet out needs.

After a few days with out sufficient progress (I was able to make small performance imrpovements, but not enough) it was decided to try and re-enable the olde Postgresql backend. Behold; everything was snappy and we had acceptable performance. The only problem was that now we had not solved our initial goal which was easy setup…

Enter H2. After Googling around a bit I stumbled upon the H2 embedded Java database. After some initial testing its performance seemed very promising – faster than Postgresql even though Postgresql is written in C and H2 in Java. This once again proves that Java is not slow (a prejudice that some people will probably retain to the day they die).

The Devilish Details

Despite H2 rocking our socks off, all was not well in paradise. Like Derby (particularly Derby I might add) H2 did not perform well on SQL JOINs. In our case we are talking LEFT JOINs on huge result sets. To fight this I added a mode to our abstract database layer that would avoid the JOINs and instead perform direct lookups of the stuff we JOINed in. This gave us a good performance boost, but still not good enough.

Next thing we had to learn the hard way was that when Thomas Mueller (the über cool H2 main hacker) says that H2 does not perform well on large result sets you better believe him :-) So again I added a new mode to the abstract database layer to fetch the result set as a sequence of SQL queries using the LIMIT and OFFSET keywords. In essence it is really just client side paging of the result set. With this in place I am proud to say that we have H2 delivering stellar performance.

It is likely that the same client side paging would make the Derby backend acceptably fast as well, but unfortunately Derby does not support manual paging very well (ok, you can doit, but it is not really nice).

Conclusions

Instead of the single Derby backend for our storage component that we had initially anticipated we now have three backends. The H2 and Postgresql ones perform very well while the Derby one could use some love (and is workable for smaller colelctions).

I also learned a lot more about JDBC and SQL than I had ever hoped to ever know :-)