String sorting in Lucene without the memory penalty

Caveat lector: This is highly experimental. Don’t raise your arms too high before my findings have been verified by people more competent than me.

Secondary warning: Geek level: high.

The problem

Collator (read: Locale) based String sorting in Lucene is normally handled by keeping the Strings from a given field in memory. We know that this is quite costly: roughly 50*#strings + 2*#chars bytes, which for 1 million Strings with an average length of 25 characters works out to 50MB + 50MB = 100MB. One known workaround is preprocessing with a parallel persistent structure, but this introduces a more complicated workflow and slows indexing.
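
For reference, here is a minimal sketch of what the conventional approach amounts to with the Lucene 3.0 API (the field name "title" and the docID variables are invented for the example):

    import java.text.Collator;
    import java.util.Locale;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.FieldCache;

    // The whole field is materialized in RAM: one String per document.
    String[] byDoc = FieldCache.DEFAULT.getStrings(reader, "title");
    Collator collator = Collator.getInstance(new Locale("da"));
    // Ordering two hits means comparing the in-memory Strings.
    int order = collator.compare(byDoc[docIDA], byDoc[docIDB]);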

A solution and a new problem

When we look at the SegmentReader from the Lucene 3.0 trunk (20100318), we see that the terms can be accessed by ordinal with just a little tweaking of the code. All that is needed to make a segment-based, memory-efficient sort is to fill an array sortedTerms with the ordinals and do a Collections.sort with a comparator that looks up the Term for each ordinal. When we have the sorted terms, we run through them to determine the docIDs and create an array sortedIDs where sortedIDs[docID] == sortedTermIndex.
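
In sketch form it could look like the code below. termByOrdinal and docIDsByOrdinal are hypothetical stand-ins for the tweaked SegmentReader accessors (in the real code these are disk lookups through a cache, not arrays), and I use Arrays.sort over an Integer[] where the text says Collections.sort; the principle is the same:

    import java.text.Collator;
    import java.util.Arrays;
    import java.util.Comparator;
    import java.util.Locale;

    // Hypothetical stand-ins for the ordinal-based access that the tweaked
    // SegmentReader provides (in reality disk lookups through a small cache).
    final String[] termByOrdinal = loadAllTerms();    // ordinal -> term text
    final int[][] docIDsByOrdinal = loadAllDocIDs();  // ordinal -> docIDs with that term

    // Step 1: sort the ordinals in Collator order.
    final Collator collator = Collator.getInstance(new Locale("da"));
    Integer[] sortedTerms = new Integer[termByOrdinal.length];
    for (int i = 0; i < sortedTerms.length; i++) {
        sortedTerms[i] = i;
    }
    Arrays.sort(sortedTerms, new Comparator<Integer>() {
        public int compare(Integer ordA, Integer ordB) {
            // Every comparison resolves two ordinals to Strings.
            return collator.compare(termByOrdinal[ordA], termByOrdinal[ordB]);
        }
    });

    // Step 2: record, for each docID, the position of its term in sorted order.
    int[] sortedIDs = new int[maxDoc];
    for (int pos = 0; pos < sortedTerms.length; pos++) {
        for (int docID : docIDsByOrdinal[sortedTerms[pos]]) {
            sortedIDs[docID] = pos;
        }
    }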

When a sorted search is started, the order of two docIDs can be determined by comparing sortedIDs[docIDA] with sortedIDs[docIDB]. The actual String is delivered with SegmentReader.getTerm(sortedTerms[sortedIDs[docID]]).
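
At search time the structures are used like this (same hypothetical stand-ins as above):

    // Ordering two hits is two array lookups and an int comparison;
    // no String is touched during the comparison itself.
    int order = sortedIDs[docIDA] - sortedIDs[docIDB];

    // Only when a hit is to be displayed is the String resolved.
    String displayValue = termByOrdinal[sortedTerms[sortedIDs[docID]]];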

Using the PackedInts structure from LUCENE-1990, we arrive at a memory overhead for the structures of (#terms*log2(#terms) + #documents*log2(#terms)) / 8 bytes.
For an index of 1 million documents, each with a unique 25-character Term, this is 5MB. For 10 million it is 55MB, where holding the Strings fully in memory would take about 1GB. Temporarily we also need something like 2 * #terms * 4 bytes for sorting, plus a few MB for caching.
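
A quick sketch to check the arithmetic; PackedInts stores a whole number of bits per entry, so log2 is rounded up:

    // Rough size of the two PackedInts structures: one entry of
    // ceil(log2(#terms)) bits per term plus one per document.
    static long packedEstimateBytes(long termCount, long docCount) {
        int bitsPerEntry = 64 - Long.numberOfLeadingZeros(termCount - 1); // ceil(log2)
        return (termCount + docCount) * bitsPerEntry / 8;
    }

    // packedEstimateBytes(1000000, 1000000)   ->  5000000 bytes =  5MB (20 bits/entry)
    // packedEstimateBytes(10000000, 10000000) -> 60000000 bytes = 60MB (24 bits/entry),
    // in the same ballpark as the ~55MB above.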

Nice theory, but… I tried implementing this and it failed miserably. Even with a 20,000-term cache, it took 20 minutes to open an index with 10 million unique terms. That was on a machine with an i7 processor and a solid state drive. The operation was a success, but the patient died.

The problem, of course, is that random access to the Terms is not very fast, due to the half-streaming approach that Lucene uses. This is not a critique, as it saves a lot of space. The term sorter was Java’s merge sort, and although it has high spatial locality, 20% of the lookups were cache misses. More than enough to confine usage of the method to cases where the index is rarely re-opened.

A solution to the new problem

So, how do we go about replacing random-access sorting with sequential-access sorting? It turns out that this is really simple. Okay, simple as balancing an egg on its pointy end, I guess.

  1. Sort the array of ordinals in chunks that fit the cache size exactly. The Java merge sort is beautiful here, as it never (okay, this is just a guess – I need to verify) requests the term at position x before the term at position x-1 has been requested. This means perfect sequential access and very fast sorting.
    We now have #terms / cacheSize chunks, each sorted in Collator order.
  2. Merge all the chunks at once, using a heap to determine which chunk to take the next term from (see the sketch after this list).
    In theory this means a lot of random access. However, the terms in a Lucene index are already sorted in Unicode order, so in reality there is very little switching between chunks and a lot of sequential access.
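
Here is a minimal sketch of the two steps, again with termByOrdinal as a hypothetical stand-in for the ordinal-to-String lookup:

    import java.text.Collator;
    import java.util.Arrays;
    import java.util.Comparator;
    import java.util.PriorityQueue;

    int[] sortOrdinals(final String[] termByOrdinal, final Collator collator, int cacheSize) {
        final int termCount = termByOrdinal.length;

        // Step 1: sort cache-sized chunks. Within a chunk, merge sort requests
        // the backing terms in near-sequential order, so the reads stay fast.
        final Integer[] ordinals = new Integer[termCount];
        for (int i = 0; i < termCount; i++) {
            ordinals[i] = i;
        }
        Comparator<Integer> byTerm = new Comparator<Integer>() {
            public int compare(Integer a, Integer b) {
                return collator.compare(termByOrdinal[a], termByOrdinal[b]);
            }
        };
        for (int start = 0; start < termCount; start += cacheSize) {
            Arrays.sort(ordinals, start, Math.min(start + cacheSize, termCount), byTerm);
        }

        // Step 2: merge all chunks at once. The heap holds one {cursor, end}
        // pair per chunk and is keyed on the term at the chunk's cursor.
        PriorityQueue<int[]> heap = new PriorityQueue<int[]>(
            Math.max(1, (termCount + cacheSize - 1) / cacheSize),
            new Comparator<int[]>() {
                public int compare(int[] a, int[] b) {
                    return collator.compare(
                        termByOrdinal[ordinals[a[0]]], termByOrdinal[ordinals[b[0]]]);
                }
            });
        for (int start = 0; start < termCount; start += cacheSize) {
            heap.add(new int[]{start, Math.min(start + cacheSize, termCount)});
        }
        int[] merged = new int[termCount];
        for (int i = 0; i < termCount; i++) {
            int[] chunk = heap.poll();        // chunk with the smallest head term
            merged[i] = ordinals[chunk[0]++];
            if (chunk[0] < chunk[1]) {
                heap.add(chunk);              // rarely sinks deep, as the input
            }                                 // is already in Unicode order
        }
        return merged;
    }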

Does it work?

Initializing the sort structure for 10 million unique terms takes about 6 minutes in the current three-day-hack implementation, using the laptop’s conventional 7200 RPM harddisk.

This is still a non-trivial amount of time, but here’s the real eye-opener: loading all the Strings into RAM and calling Arrays.sort(myStrings, myCollator) takes 10 minutes.

Storage-based sorting faster than in-memory? How is that possible? First of all, retrieving the Strings from storage takes comparatively little time (think a few seconds) as long as we do it sequentially. Second, when we merge the chunks using the heap, we observe that values pushed onto the heap very rarely go deeper than the first or second level: the Strings are already in Unicode order, which is fairly well in sync with locale-based sorting (for Danish at least, but I’ll bet a cake that it holds for most locales). The heap-sorting step is in reality very close to O(n). By exploiting the existing order, we gain 40% in speed over the generic sort (YMMV).

Where’s the catch?

As the terms are not in-memory, a high number of requests for the actual Strings takes its toll on IO. For a simple search, where we just want to return the top 20 hits, there are no problems at all. For a faceting system delivering thousands of scattered terms from different fields, this hurts on a conventional harddisk. But hey, we’re all using SSDs from now on, right?

A secondary catch is that the processing time is paid up front. However, if we don’t pay it at startup, the users have to pay the price when they search. Keeping in mind that Lucene also uses a heap for collecting hits (and not the 10-minute merge sort mentioned above), we’re talking about ~600 ms for a sorted search with 1 million hits on my machine; using the low-memory approach, it takes ~60 ms. Slow sorted searches or long startup time? You decide.

Don’t tell, show!

I’ve created a Lucene fork at http://github.com/tokee/lucene/ with the tag lucene_3_0_exposed. Check it out, run ant and then run java -cp /home/te/projects/lucene/build/lucene-core-3.1-dev.jar org.apache.lucene.index.ExposedPOC. Or you could just download the jar and skip the compiling.

The POC stands for Proof Of Concept. It is guaranteed to be buggy and only supports optimized indexes. The code is ugly, with obvious optimizations left unimplemented. [Insert more warnings at will]. It should be able to demonstrate the point, though. Try it on an index with millions of terms to sort on to really see the action.

The sky is the limit

The implications of the sorted-ordinal-array approach are not limited to sorting. Doing a little dance and hitting the collector right there means that we can use it for faceting and index lookup too. But that’s a story for another blog post.

Moving towards the real world

It’ll take a lot of work to expose this way of sorting all the way up to the top-level API. Before I venture further on this project, I want to be doubly sure that it is feasible and that I haven’t missed something obvious. Thus this project is on the back-burner until the greater minds at lucene-dev have had their say.

Thank you

Thanks to Mads Villadsen and other co-workers for listening to my ramblings on the subject and giving valuable input. Thanks to the State and University Library for allowing me to make contributions to Open Source projects in work-time.

Toke Eskildsen, IT-developer at the State and University Library, Denmark


6 Responses to String sorting in Lucene without the memory penalty

  1. Solr 1.4 does an amazingly good job of returning facet information quickly even for million+ facet values. (Without needing a solid state disk big enough for your entire index!) I wonder how much of the Solr wheel you’re reinventing by doing your own thing on top of lucene instead of using Solr?

  2. eskildsen says:

    Solr 1.4 does this by keeping the Strings in RAM. We did this too some years ago, before deciding that we were better off saving (let me check…) 2.2GB+ of RAM by offloading it to SSD. Funnily enough, I gave a talk about the switch at code4lib 2009, around the time Solr 1.4 was released.

    Now, with my ego out of the way, I concede that you do have a very valid point in general. While Summa and Solr are not quite the same (long story), some of the things we’re doing would be usable for Solr and a lot of Solr’s very nice and polished things would be usable for us. Some years ago, Solr did not meet our needs, but it is getting close now. It is something we’re actively discussing, so I cannot tell you more at this point.

    To get back to the issue at hand, what I am talking about is not at all re-inventing Solr technology (modulo that I missed something big). Solr does fast sorted searches by using RAM to hold collation keys. Robert Muir kindly pointed me to UnicodeCollation that explains it quite well. My proposal is more low-level and has wider implications, if it can be made to work.

  3. I’m not entirely sure that’s true. I don’t know the details, but I can tell you I get pretty fast facet display on 3 million documents with 2 million unique facet values, and I’ve only given Solr 500 megs of RAM.

    I know faceting over many values was improved significantly in 1.4; are you sure you’re familiar with 1.4?

    But I don’t know the inner details of Solr, it’s more or less a black box to me. I just know it works for me, with reasonable performance, not that much RAM, and several million unique facet values. If I’m missing something and you’ve found a different way to do things that works better for you, great!

    (I wouldn’t be surprised if the keys themselves are in RAM (or possibly an even shorter hash representation); even a million keys doesn’t take up that much RAM, just the keys, does it? I may be misunderstanding the problem you are trying to solve.)

  4. eskildsen says:

    I stand corrected. Solr 1.4 introduced the UnInverted class, which offloads terms for faceting to storage. This is an explicit offload, which involves the creation of a new file (if I read the JavaDoc correctly) – it’s the same principle we use BTW.

    Now, if the Exposed-idea can be made to work, the step of creating an extra persistent structure can be skipped. We would like that and I guess the Solr guys would too. The beauty of getting it into Lucene is that everyone would benefit. I hope that answers your valid concern about re-inventing the wheel.

    Thanks for the correction and sorry about the confusion.

  5. Pingback: String sorting in Lucene the fast way « Software Development at Statsbiblioteket

  6. Pingback: Sorting, faceting and index lookup with Lucene « Software Development at Statsbiblioteket
