String sorting in Lucene the fast way

This is a follow-up to the previous blog post on String sorting in Lucene without the memory penalty, but it can be read independently. Anyone interested in the technical details can follow the JIRA issue LUCENE-2369.

Currently, sorting in Lucene is done by loading all sort Strings into RAM. When a locale-based sorted search is performed, the order of the hits is determined by comparing the Strings for the found documents with Java’s built-in Collator.
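
As a minimal sketch of what this looks like from the API side (assuming the Lucene 3.x-era API of the time, where SortField accepted a Locale; the "author" field and the Danish locale are just examples):

```java
import java.io.IOException;
import java.util.Locale;

import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;

public class LocaleSortExample {
    // Top-20 hits ordered by the "author" field with a Danish Collator.
    // On the first sorted search, Lucene loads every "author" term into
    // RAM and compares the Strings with java.text.Collator.
    static TopDocs top20(IndexSearcher searcher, Query query) throws IOException {
        Sort sort = new Sort(new SortField("author", new Locale("da", "DK")));
        return searcher.search(query, null, 20, sort);
    }
}
```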

The advantage of this approach is that startup is fast. The primary disadvantage is that it takes a lot of RAM: sorting on “author” in an index with 10 million titles easily eats a gigabyte or more. The memory load can be traded for longer startup time by using a structure called exposed, which was the subject of the previous article. However, there is a very substantial secondary bonus to preprocessing: sorting speed.

Java’s built-in Collator knows how to sort Strings for a bunch of different Locales and, to my admittedly limited knowledge on the subject, it does an okay job. However, it does take its time about the whole business: getting the top-20 hits in sorted order takes a second or two when the query matches 5 million documents.
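
The slow part is that the locale rules are applied on every single comparison. java.text itself exposes the same trade-off: a CollationKey pays the analysis cost once per String, after which comparisons are cheap. A small illustration (plain Java, nothing Lucene-specific):

```java
import java.text.CollationKey;
import java.text.Collator;
import java.util.Locale;

public class CollatorCost {
    public static void main(String[] args) {
        Collator collator = Collator.getInstance(new Locale("da", "DK"));

        // Direct comparison: the locale rules run on every call, for both
        // arguments. This happens millions of times per sorted search.
        int slow = collator.compare("Andersen", "Møller");

        // Precomputed keys: the expensive analysis happens once per String;
        // compareTo afterwards is a cheap byte-level comparison.
        CollationKey a = collator.getCollationKey("Andersen");
        CollationKey b = collator.getCollationKey("Møller");
        int fast = a.compareTo(b);

        System.out.println(slow + " " + fast); // both negative: Andersen sorts first
    }
}
```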

Part of the exposed approach is to do the sorting up front. After that, the order of the documents is stored explicitly and sorting speed increases by an order of magnitude. We ran some experiments on our own index, which is 40GB / 7.5 million documents. It took 26 seconds to open the index the standard way and a whopping 7 minutes with pre-sorting, but the gains were easy to measure when we requested the top-20 documents in sorted order:

Fully warmed searches, approximate mean:
6.5M hits: standard 2500 ms, exposed 240 ms
4.1M hits: standard 1600 ms, exposed 190 ms
2.1M hits: standard 900 ms, exposed 90 ms
1.2M hits: standard 500 ms, exposed 45 ms
0.5M hits: standard 220 ms, exposed 40 ms
0.1M hits: standard 80 ms, exposed 6 ms
1.7K hits: standard 3 ms, exposed <1 ms

Couple this with the fact that the standard sorter required 1800MB to handle the index while the exposed implementation took 350MB, and it starts to look really attractive in a lot of scenarios.
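
The mechanics behind that speedup are easy to sketch: pre-sorting boils every String down to an ordinal, so ordering hits at search time is pure int comparison. This is just the principle, not the actual LUCENE-2369 code, and the data is made up:

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

public class OrdinalSortSketch {
    public static void main(String[] args) {
        Collator collator = Collator.getInstance(new Locale("da", "DK"));
        String[] terms = {"Møller", "Andersen", "Hansen"}; // term for doc 0, 1, 2

        // Expensive one-time step (the "7 minutes" part): establish the
        // collated order of all terms.
        Integer[] docsInOrder = {0, 1, 2};
        Arrays.sort(docsInOrder, (x, y) -> collator.compare(terms[x], terms[y]));

        // ordinal[doc] = rank of the doc's term in collated order.
        int[] ordinal = new int[terms.length];
        for (int rank = 0; rank < docsInOrder.length; rank++) {
            ordinal[docsInOrder[rank]] = rank;
        }

        // Search time: ordering hits never touches a String or a Collator.
        System.out.println(ordinal[1] < ordinal[0]); // true: doc 1 ("Andersen") before doc 0
    }
}
```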

For a third approach, which seems like a good compromise, take a look at Unicode Collation in Solr.

About Toke Eskildsen

IT developer at statsbiblioteket.dk with a penchant for hacking Lucene/Solr.